{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,23]],"date-time":"2025-09-23T02:15:37Z","timestamp":1758593737840,"version":"3.44.0"},"reference-count":44,"publisher":"Oxford University Press (OUP)","issue":"9","license":[{"start":{"date-parts":[[2025,7,20]],"date-time":"2025-07-20T00:00:00Z","timestamp":1752969600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/pages\/standard-publication-reuse-rights"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"publisher","award":["2022YFB3103900"],"award-info":[{"award-number":["2022YFB3103900"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62202466"],"award-info":[{"award-number":["62202466"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004739","name":"Youth Innovation Promotion Association CAS","doi-asserted-by":"publisher","award":["2022159"],"award-info":[{"award-number":["2022159"]}],"id":[{"id":"10.13039\/501100004739","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Key Laboratory of Network Assessment Technology"},{"DOI":"10.13039\/501100002367","name":"Chinese Academy of Sciences","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100002367","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Beijing Key Laboratory of Network Security and Protection Technology"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,9,21]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Open Source Software (OSS) is an essential part of modern software development, with platforms such as PyPI for Python, NPM for JavaScript, and RubyGems for Ruby facilitating code 
sharing and reuse. However, these repositories also pose significant security risks due to potential software supply chain attacks, where payloads are injected into components, propagating threats to downstream users and critical infrastructure. Existing automatic malicious component detection tools, particularly for PyPI, struggle to distinguish between subtle differences in malicious and benign behaviors, leading to high false positive rates. To address these issues, we systematically compare and explore these subtle differences, offering a more refined and accurate detection method, Open-Source Component Code Slices BERT (OCS-BERT). OCS-BERT leverages taint-based program slicing to isolate sensitive behavior segments and fine-tunes a pre-trained model to capture subtle semantic differences across programming languages. This system excels in detecting malicious Python components and exhibits encouraging cross-language transferability to JavaScript's NPM and Ruby's RubyGems. Additionally, OCS-BERT successfully detected 107 malicious components from a total of 25,759 newly uploaded PyPI components, taking two weeks to complete the process. 
This achievement demonstrates the effectiveness of our method, which serves as a potent enhancement to the current repertoire of software supply chain detection methodologies.<\/jats:p>","DOI":"10.1093\/comjnl\/bxaf029","type":"journal-article","created":{"date-parts":[[2025,3,23]],"date-time":"2025-03-23T02:48:00Z","timestamp":1742698080000},"page":"1163-1180","source":"Crossref","is-referenced-by-count":0,"title":["Advanced code slicing with pre-trained model fine-tuned for open-source component malware detection"],"prefix":"10.1093","volume":"68","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-7335-3366","authenticated-orcid":false,"given":"Yongshan","family":"Wang","sequence":"first","affiliation":[{"name":"Institute of Information Engineering , Chinese Academy of Sciences, Beijing, 100085 ,","place":["China"]},{"name":"School of Cyber Security, University of Chinese Academy of Sciences , Beijing, 100049 ,","place":["China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-5394-4584","authenticated-orcid":false,"given":"Siyuan","family":"Pang","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering , Chinese Academy of Sciences, Beijing, 100085 ,","place":["China"]},{"name":"School of Cyber Security, University of Chinese Academy of Sciences , Beijing, 100049 ,","place":["China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-3192-9919","authenticated-orcid":false,"given":"Zijing","family":"Fan","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering , Chinese Academy of Sciences, Beijing, 100085 ,","place":["China"]},{"name":"School of Cyber Security, University of Chinese Academy of Sciences , Beijing, 100049 
,","place":["China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-7596-3018","authenticated-orcid":false,"given":"Shang","family":"Shang","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering , Chinese Academy of Sciences, Beijing, 100085 ,","place":["China"]},{"name":"School of Cyber Security, University of Chinese Academy of Sciences , Beijing, 100049 ,","place":["China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0017-3302","authenticated-orcid":false,"given":"Yepeng","family":"Yao","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering , Chinese Academy of Sciences, Beijing, 100085 ,","place":["China"]},{"name":"School of Cyber Security, University of Chinese Academy of Sciences , Beijing, 100049 ,","place":["China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0843-4482","authenticated-orcid":false,"given":"Zhengwei","family":"Jiang","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering , Chinese Academy of Sciences, Beijing, 100085 ,","place":["China"]},{"name":"School of Cyber Security, University of Chinese Academy of Sciences , Beijing, 100049 ,","place":["China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-9851-5548","authenticated-orcid":false,"given":"Baoxu","family":"Liu","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering , Chinese Academy of Sciences, Beijing, 100085 ,","place":["China"]},{"name":"School of Cyber Security, University of Chinese Academy of Sciences , Beijing, 100049 
,","place":["China"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2025,7,20]]},"reference":[{"key":"2025092201571259800_ref1","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1007\/978-3-030-52683-2_2","article-title":"Backstabber's knife collection: a review of open source software supply chain attacks","volume-title":"Detection Of Intrusions And Malware, And Vulnerability Assessment: 17th International Conference, DIMVA 2020, Lisbon, Portugal, June 24\u201326, 2020, Proceedings 17","author":"Ohm","year":"2020"},{"volume-title":"Sonatype 9th Annual State of the Software Supply Chain","year":"2023","author":"Sonatype","key":"2025092201571259800_ref2"},{"key":"2025092201571259800_ref3","first-page":"499","article-title":"Bad snakes: understanding and improving python package index malware scanning","volume-title":"2023 IEEE\/ACM 45th International Conference On Software Engineering (ICSE), Melbourne Victoria Australia, May 14\u201320","author":"Vu","year":"2023"},{"key":"2025092201571259800_ref4","doi-asserted-by":"crossref","first-page":"1578","DOI":"10.1109\/SP46215.2023.10179332","article-title":"Investigating package related security threats in software registries","volume-title":"2023 IEEE Symposium On Security And Privacy (SP), San Francisco, CA, May 22\u201324","author":"Gu","year":"2023"},{"key":"2025092201571259800_ref5","doi-asserted-by":"crossref","DOI":"10.14722\/ndss.2021.23055","article-title":"Towards measuring supply chain attacks on package managers for interpreted languages","volume-title":"28th Annual Network And Distributed System Security Symposium, NDSS 2021, Virtually, February 21\u201325, 2021","author":"Duan","year":"2021"},{"key":"2025092201571259800_ref6","first-page":"166","article-title":"An empirical study of malicious code In PyPI ecosystem","volume-title":"2023 38th IEEE\/ACM International Conference On Automated Software Engineering (ASE). 
Echternach Luxembourg, November 11\u201315","author":"Guo","year":"2023"},{"key":"2025092201571259800_ref7","first-page":"728","article-title":"MalwareBench: malware samples are not enough","volume-title":"2024 IEEE\/ACM 21st International Conference On Mining Software Repositories (MSR), Lisbon, Portugal, May 15\u201316","author":"Zahan","year":"2024"},{"key":"2025092201571259800_ref8","first-page":"307","article-title":"A needle is an outlier in a haystack: hunting malicious PyPI packages with code clustering","volume-title":"2023 38th IEEE\/ACM International Conference On Automated Software Engineering (ASE), Echternach Luxembourg, November 11\u201315","author":"Liang","year":"2023"},{"article-title":"The Hitchhiker's guide to python: best practices for development","year":"2023","author":"Reitz","key":"2025092201571259800_ref9"},{"article-title":"How to package your python code\u2014Python packaging tutorial","year":"2012","author":"Torborg","key":"2025092201571259800_ref10"},{"key":"2025092201571259800_ref11","first-page":"1993","article-title":"MalWuKong: towards fast, accurate, and multilingual detection of malicious code poisoning in OSS supply chains","volume-title":"2023 38th IEEE\/ACM International Conference On Automated Software Engineering (ASE), Echternach Luxembourg, November 11\u201315","author":"Li","year":"2023"},{"key":"2025092201571259800_ref12","first-page":"3439","article-title":"Beyond typosquatting: an in-depth look at package confusion","volume-title":"32nd USENIX security symposium (USENIX security 23), Anaheim CA USA, August 9\u201311","author":"Neupane","year":"2023"},{"key":"2025092201571259800_ref13","first-page":"1681","article-title":"Practical automated detection of malicious npm packages","volume-title":"Proceedings Of The 44th International Conference On Software Engineering, Pittsburgh Pennsylvania, May 21\u201329","author":"Sejfia","year":"2022"},{"article-title":"A survey of neural code intelligence: paradigms, advances and 
beyond","year":"2024","author":"Sun","key":"2025092201571259800_ref14"},{"key":"2025092201571259800_ref15","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2020.findings-emnlp.139","article-title":"Codebert: a pre-trained model for programming and natural languages","author":"Feng","year":"2020"},{"article-title":"Graphcodebert: pre-training code representations with data flow","year":"2020","author":"Guo","key":"2025092201571259800_ref16"},{"article-title":"Codexglue: a machine learning benchmark dataset for code understanding and generation","year":"2021","author":"Shuai","key":"2025092201571259800_ref17"},{"article-title":"Wizardcoder: empowering code large language models with evol-instruct","year":"2023","author":"Luo","key":"2025092201571259800_ref18"},{"article-title":"Code llama: open foundation models for code","year":"2023","author":"Roziere","key":"2025092201571259800_ref19"},{"key":"2025092201571259800_ref20","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2022.acl-long.499","article-title":"Unixcoder: unified cross-modal pre-training for code representation","author":"Guo","year":"2022"},{"key":"2025092201571259800_ref21","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2021.emnlp-main.685","article-title":"Codet5: identifier-aware unified pre-trained encoder-decoder models for code understanding and generation","author":"Wang","year":"2021"},{"key":"2025092201571259800_ref22","first-page":"481","article-title":"Transformer-based language models for software vulnerability detection","volume-title":"Proceedings Of The 38th Annual Computer Security Applications Conference, Austin TX USA, December 5\u20139","author":"Thapa","year":"2022"},{"key":"2025092201571259800_ref23","doi-asserted-by":"publisher","first-page":"147","DOI":"10.1145\/3545258.3545260","article-title":"EL-CodeBert: better exploiting CodeBert to support source code-related classification tasks","volume-title":"Internetware 2022: 13th Asia-Pacific Symposium On Internetware, Hohhot 
China, June 11\u201312","author":"Liu","year":"2022"},{"key":"2025092201571259800_ref24","doi-asserted-by":"publisher","first-page":"103802","DOI":"10.1016\/j.cose.2024.103802","article-title":"Python source code vulnerability detection with named entity recognition","volume":"140","author":"Ehrenberg","year":"2024","journal-title":"Comput Secur"},{"key":"2025092201571259800_ref25","doi-asserted-by":"publisher","DOI":"10.1145\/3705304","article-title":"Killing two birds with one stone: malicious package detection in NPM and PyPI using a single model of malicious behavior sequence","author":"Zhang","year":"2024","journal-title":"ACM Trans Softw Eng Methodol"},{"article-title":"Voyage-large-2-instruct: instruction-tuned and rank 1 on MTEB","year":"2024","author":"Voyage-AI","key":"2025092201571259800_ref26"},{"article-title":"MTEB: massive text embedding benchmark","year":"2022","author":"Muennighoff","key":"2025092201571259800_ref27"},{"article-title":"Learning from labeled and unlabeled data with label propagation","year":"2002","author":"Zhu","key":"2025092201571259800_ref28"},{"key":"2025092201571259800_ref29","doi-asserted-by":"publisher","first-page":"036106","DOI":"10.1103\/PhysRevE.76.036106","article-title":"Near linear time algorithm to detect community structures in large-scale networks. 
Physical review E\u2014Statistical, nonlinear, and soft matter","volume":"76","author":"Raghavan","year":"2007","journal-title":"Phys Ther"},{"key":"2025092201571259800_ref30","doi-asserted-by":"crossref","first-page":"590","DOI":"10.1109\/SP.2014.44","article-title":"Modeling and discovering vulnerabilities with code property graphs","volume-title":"2014 IEEE Symposium On Security And Privacy, May 18\u201321, Berkeley CA USA","author":"Yamaguchi","year":"2014"},{"key":"2025092201571259800_ref31","article-title":"OpenAI ChatGPT-4"},{"key":"2025092201571259800_ref32","article-title":"VirusTotal VirusTotal\u2014Analyze files and URLs to detect viruses, worms, trojans and other malicious content including suspicious websites and suspicious files"},{"article-title":"Lora: low-rank adaptation of large language models","volume-title":"2022 International Conference on Learning Representations.","author":"Hu","key":"2025092201571259800_ref33"},{"journal-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23)","article-title":"Lima: less is more for alignment","author":"Zhou","key":"2025092201571259800_ref34"},{"key":"2025092201571259800_ref35","article-title":"Open-source dataset of malicious software packages"},{"key":"2025092201571259800_ref36","article-title":"Hugging face PyPI raw dataset by VIKP"},{"key":"2025092201571259800_ref37","article-title":"PyPI malware check: documentation on malware checks"},{"key":"2025092201571259800_ref38","article-title":"DataDog. GuardDog: a CLI tool to identify malicious PyPI and npm packages"},{"key":"2025092201571259800_ref39","article-title":"TUNA mirror TUNA mirror of PyPI"},{"article-title":"Malicious PyPI packages targeting highly specific MacOS machines","year":"2024","author":"Obregoso","key":"2025092201571259800_ref40"},{"key":"2025092201571259800_ref41","article-title":"JFrog. Malicious PyPI packages"},{"key":"2025092201571259800_ref42","article-title":"Snyk. 
Snyk vulnerability database"},{"article-title":"Three new malicious PyPI packages deploy CoinMiner on Linux devices","year":"2023","author":"Xiong","key":"2025092201571259800_ref43"},{"article-title":"Over 100 malicious pkgs target popular ML PyPi libraries","year":"2024","author":"Abai","key":"2025092201571259800_ref44"}],"container-title":["The Computer Journal"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/comjnl\/article-pdf\/68\/9\/1163\/63806948\/bxaf029.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/comjnl\/article-pdf\/68\/9\/1163\/63806948\/bxaf029.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,22]],"date-time":"2025-09-22T05:57:26Z","timestamp":1758520646000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/comjnl\/article\/68\/9\/1163\/8209415"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,20]]},"references-count":44,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2025,7,20]]},"published-print":{"date-parts":[[2025,9,21]]}},"URL":"https:\/\/doi.org\/10.1093\/comjnl\/bxaf029","relation":{},"ISSN":["0010-4620","1460-2067"],"issn-type":[{"type":"print","value":"0010-4620"},{"type":"electronic","value":"1460-2067"}],"subject":[],"published-other":{"date-parts":[[2025,9]]},"published":{"date-parts":[[2025,7,20]]}}}