{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,10]],"date-time":"2025-09-10T22:44:03Z","timestamp":1757544243981,"version":"3.37.3"},"reference-count":53,"publisher":"Springer Science and Business Media LLC","issue":"7","license":[{"start":{"date-parts":[[2022,9,20]],"date-time":"2022-09-20T00:00:00Z","timestamp":1663632000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,9,20]],"date-time":"2022-09-20T00:00:00Z","timestamp":1663632000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001866","name":"Fonds National de la Recherche Luxembourg","doi-asserted-by":"crossref","award":["INTER\/ANR\/18\/12632675\/SATOCROSS"],"award-info":[{"award-number":["INTER\/ANR\/18\/12632675\/SATOCROSS"]}],"id":[{"id":"10.13039\/501100001866","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Empir Software Eng"],"published-print":{"date-parts":[[2022,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Vulnerability prediction refers to the problem of identifying system components that are most likely to be vulnerable. Typically, this problem is tackled by training binary classifiers on historical data. Unfortunately, recent research has shown that such approaches underperform due to the following two reasons: a) the imbalanced nature of the problem, and b) the inherently noisy historical data, i.e., most vulnerabilities are discovered much later than they are introduced. This misleads classifiers as they learn to recognize actual vulnerable components as non-vulnerable. 
To tackle these issues, we propose <jats:italic>TROVON<\/jats:italic>, a technique that learns from known vulnerable components rather than from vulnerable and non-vulnerable components, as typically performed. We perform this by contrasting the known vulnerable, and their respective fixed components. This way, <jats:italic>TROVON<\/jats:italic> manages to learn from the things we know, i.e., vulnerabilities, hence reducing the effects of noisy and unbalanced data. We evaluate <jats:italic>TROVON<\/jats:italic> by comparing it with existing techniques on three security-critical open source systems, i.e., Linux Kernel, OpenSSL, and Wireshark, with historical vulnerabilities that have been reported in the National Vulnerability Database (NVD). Our evaluation demonstrates that the prediction capability of <jats:italic>TROVON<\/jats:italic> significantly outperforms existing vulnerability prediction techniques such as <jats:italic>Software Metrics<\/jats:italic>, <jats:italic>Imports<\/jats:italic>, <jats:italic>Function Calls<\/jats:italic>, <jats:italic>Text Mining<\/jats:italic>, <jats:italic>Devign<\/jats:italic>, <jats:italic>LSTM<\/jats:italic>, and <jats:italic>LSTM-RF<\/jats:italic> with an improvement of 40.84% in <jats:italic>Matthews Correlation Coefficient<\/jats:italic> (MCC) score under <jats:italic>Clean Training Data Settings<\/jats:italic>, and an improvement of 35.52% under <jats:italic>Realistic Training Data Settings<\/jats:italic>.<\/jats:p>","DOI":"10.1007\/s10664-022-10197-4","type":"journal-article","created":{"date-parts":[[2022,9,20]],"date-time":"2022-09-20T09:04:03Z","timestamp":1663664643000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":20,"title":["Learning from what we know: How to perform vulnerability prediction using noisy historical 
data"],"prefix":"10.1007","volume":"27","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2507-8846","authenticated-orcid":false,"given":"Aayush","family":"Garg","sequence":"first","affiliation":[]},{"given":"Renzo","family":"Degiovanni","sequence":"additional","affiliation":[]},{"given":"Matthieu","family":"Jimenez","sequence":"additional","affiliation":[]},{"given":"Maxime","family":"Cordy","sequence":"additional","affiliation":[]},{"given":"Mike","family":"Papadakis","sequence":"additional","affiliation":[]},{"given":"Yves","family":"Le Traon","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,9,20]]},"reference":[{"key":"10197_CR1","unstructured":"Abadi M, et al. (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org"},{"key":"10197_CR2","unstructured":"Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate"},{"key":"10197_CR3","doi-asserted-by":"crossref","unstructured":"Britz D, Goldie A, Luong T, Le Q (2017) Massive exploration of neural machine translation architectures. arXiv e-prints","DOI":"10.18653\/v1\/D17-1151"},{"key":"10197_CR4","unstructured":"Brownlee J (2021) When to use mlp, cnn, and rnn neural networks. https:\/\/machinelearningmastery.com\/when-to-use-mlp-cnn-and-rnn-neural-networks. Accessed 1 May 2018"},{"key":"10197_CR5","doi-asserted-by":"crossref","unstructured":"Brownlee J (2022) Encoder-decoder recurrent neural network models for neural machine translation. https:\/\/machinelearningmastery.com\/encoder-decoder-recurrent-neural-network-models-neural-machine-translation\/. 
Accessed 1 Feb 2018","DOI":"10.1155\/2022\/9714800"},{"issue":"3","key":"10197_CR6","doi-asserted-by":"publisher","first-page":"294","DOI":"10.1016\/j.sysarc.2010.06.003","volume":"57","author":"I Chowdhury","year":"2011","unstructured":"Chowdhury I, Zulkernine M (2011) Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities. J Syst Archit 57(3):294\u2013313","journal-title":"J Syst Archit"},{"key":"10197_CR7","doi-asserted-by":"crossref","unstructured":"Collard ML, Maletic JI (2016) srcml 1.0: explore, analyze, and manipulate source code. In: 2016 IEEE International conference on software maintenance and evolution (ICSME), pp 649\u2013649","DOI":"10.1109\/ICSME.2016.36"},{"key":"10197_CR8","unstructured":"Dam HK, Tran T, Pham T T M, Ng SW, Grundy J, Ghose A (2018) Automatic feature learning for predicting vulnerable software components. IEEE Trans Softw Eng 1\u20131"},{"issue":"4\u20135","key":"10197_CR9","doi-asserted-by":"publisher","first-page":"531","DOI":"10.1007\/s10664-011-9173-9","volume":"17","author":"M D\u2019Ambros","year":"2012","unstructured":"D\u2019Ambros M, Lanza M, Robbes R (2012) Evaluating defect prediction approaches: a benchmark and an extensive comparison. Empir Softw Eng 17(4\u20135):531\u2013577","journal-title":"Empir Softw Eng"},{"key":"10197_CR10","unstructured":"Definition of vulnerability (2021) https:\/\/cve.mitre.org\/about\/terminology.html. Accessed 1 May 2021"},{"key":"10197_CR11","unstructured":"Falleri J-R, Morandat F, Blanc X, Martinez M, Monperrus M (2018) Fine-grained and accurate source code differencing. In: Proceedings of the international conference on automated software engineering. Update for oadoi on Nov 02 2018, V\u00e4steras, pp 313\u2013324"},{"key":"10197_CR12","doi-asserted-by":"crossref","unstructured":"Garg A, Ojdanic M, Degiovanni R, Chekam TT, Papadakis M, Le Traon Y (2022) Cerebro: static subsuming mutant selection. 
IEEE Trans Softw Eng 1\u20131","DOI":"10.1109\/TSE.2022.3140510"},{"key":"10197_CR13","doi-asserted-by":"crossref","unstructured":"Gu X, Zhang H, Zhang D, Kim S (2016) Deep api learning. In: Proceedings of the 2016 24th ACM SIGSOFT international symposium on foundations of software engineering, FSE 2016. Association for Computing Machinery, New York, pp 631\u2013642","DOI":"10.1145\/2950290.2950334"},{"key":"10197_CR14","doi-asserted-by":"crossref","unstructured":"Gupta R, Pal S, Kanade A, Shevade S (2017) Deepfix: fixing common c language errors by deep learning. In: Proceedings of the thirty-first AAAI conference on artificial intelligence, AAAI\u201917. AAAI Press, pp 1345\u20131351","DOI":"10.1609\/aaai.v31i1.10742"},{"issue":"6","key":"10197_CR15","doi-asserted-by":"publisher","first-page":"1276","DOI":"10.1109\/TSE.2011.103","volume":"38","author":"T Hall","year":"2012","unstructured":"Hall T, Beecham S, Bowes D, Gray D, Counsell S (2012) A systematic literature review on fault prediction performance in software engineering. IEEE Trans Softw Eng 38(6):1276\u20131304","journal-title":"IEEE Trans Softw Eng"},{"issue":"8","key":"10197_CR16","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735\u20131780","journal-title":"Neural Comput"},{"key":"10197_CR17","unstructured":"Huo X, Li M, Zhou Z-H (2016) Learning unified features from natural and programming languages for locating buggy source code. In: Proceedings of the twenty-fifth international joint conference on artificial intelligence, IJCAI\u201916. AAAI Press, pp 1606\u20131612"},{"key":"10197_CR18","doi-asserted-by":"crossref","unstructured":"Jimenez M, Papadakis M, Le Traon Y (2016) An empirical analysis of vulnerabilities in openssl and the linux kernel. In: 2016 23rd Asia-pacific software engineering conference (APSEC). 
IEEE, pp 105\u2013112","DOI":"10.1109\/APSEC.2016.025"},{"key":"10197_CR19","doi-asserted-by":"crossref","unstructured":"Jimenez M, Papadakis M, Le Traon Y (2018) Enabling the continous analysis of security vulnerabilities with vuldata7. In: Proceedings of the 18th IEEE international working conference on source code analysis and manipulation SCAM 2018, Madrid, Spain, September 23\u201324, 2018","DOI":"10.1109\/SCAM.2018.00014"},{"key":"10197_CR20","doi-asserted-by":"crossref","unstructured":"Jimenez M, Rwemalika R, Papadakis M, Sarro F, Le Traon Y, Harman M (2019) The importance of accounting for real-world labelling when predicting software vulnerabilities. In: Proceedings of the 2019 27th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, ESEC\/FSE 2019. Association for Computing Machinery, New York, pp 695\u2013705","DOI":"10.1145\/3338906.3338941"},{"key":"10197_CR21","unstructured":"Kononenko I (1995) On biases in estimating multi-valued attributes. In: Proceedings of the 14th international joint conference on artificial intelligence, vol 2, IJCAI\u201995. Morgan Kaufmann Publishers Inc, San Francisco, pp 1034\u20131040"},{"key":"10197_CR22","doi-asserted-by":"crossref","unstructured":"Li Z, Zou D, Xu S, Ou X, Jin H, Wang S, Deng Z, Zhong Y (2018) Vuldeepecker: a deep learning-based system for vulnerability detection. In: 25th Annual network and distributed system security symposium, NDSS 2018, San Diego, California, USA, February 18\u201321, 2018","DOI":"10.14722\/ndss.2018.23158"},{"key":"10197_CR23","unstructured":"Linux in 2020 (2020) 27.8 million lines of code in the kernel. https:\/\/www.linux.com\/news\/linux-in-2020-27-8-million-lines-of-code-in-the-kernel-1-3-million-in-systemd\/. Accessed 1 May 2021"},{"key":"10197_CR24","unstructured":"Linux kernal (2021) https:\/\/www.kernel.org. 
Accessed 1 May 2021"},{"issue":"2","key":"10197_CR25","doi-asserted-by":"publisher","first-page":"442","DOI":"10.1016\/0005-2795(75)90109-9","volume":"405","author":"BW Matthews","year":"1975","unstructured":"Matthews B W (1975) Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)\u2014Protein Structure 405(2):442\u2013451","journal-title":"Biochimica et Biophysica Acta (BBA)\u2014Protein Structure"},{"key":"10197_CR26","doi-asserted-by":"crossref","unstructured":"Morrison P, Herzig K, Murphy B, Williams L (2015) Challenges with applying vulnerability prediction models. In: Proceedings of the 2015 symposium and bootcamp on the science of security, HotSoS \u201915. Association for Computing Machinery, New York","DOI":"10.1145\/2746194.2746198"},{"key":"#cr-split#-10197_CR27.1","doi-asserted-by":"crossref","unstructured":"Moshtari S, Sami A (2016) Evaluating and comparing complexity, coupling and a new proposed set of coupling metrics in cross-project vulnerability prediction. In: Ossowski S","DOI":"10.1145\/2851613.2851777"},{"key":"#cr-split#-10197_CR27.2","unstructured":"(ed) Proceedings of the 31st annual ACM symposium on applied computing, Pisa, Italy, April 4-8, 2016. ACM, pp 1415-1421"},{"key":"10197_CR28","unstructured":"National vulnerability database (2021) https:\/\/nvd.nist.gov. Accessed 1 May 2021"},{"key":"10197_CR29","doi-asserted-by":"crossref","unstructured":"Neuhaus S, Zimmermann T, Holler C, Zeller A (2007) Predicting vulnerable software components. In: Proceedings of the 14th ACM conference on computer and communications security, CCS \u201907. Association for Computing Machinery, New York, pp 529\u2013540","DOI":"10.1145\/1315245.1315311"},{"key":"10197_CR30","unstructured":"Openssl (2021) https:\/\/www.openssl.org. 
Accessed 1 May 2021"},{"issue":"5","key":"10197_CR31","doi-asserted-by":"publisher","first-page":"81","DOI":"10.1109\/MSP.2004.84","volume":"2","author":"B Potter","year":"2004","unstructured":"Potter B, McGraw G (2004) Software security testing. IEEE Security Privacy 2(5):81\u201385","journal-title":"IEEE Security Privacy"},{"issue":"10","key":"10197_CR32","doi-asserted-by":"publisher","first-page":"993","DOI":"10.1109\/TSE.2014.2340398","volume":"40","author":"R Scandariato","year":"2014","unstructured":"Scandariato R, Walden J, Hovsepyan A, Joosen W (2014) Predicting vulnerable software components via text mining. IEEE Trans Softw Eng 40(10):993\u20131006","journal-title":"IEEE Trans Softw Eng"},{"issue":"6","key":"10197_CR33","doi-asserted-by":"publisher","first-page":"603","DOI":"10.1109\/TSE.2014.2322358","volume":"40","author":"M Shepperd","year":"2014","unstructured":"Shepperd M, Bowes D, Hall T (2014) Researcher bias: the use of machine learning in software defect prediction. IEEE Trans Softw Eng 40(6):603\u2013616","journal-title":"IEEE Trans Softw Eng"},{"key":"10197_CR34","doi-asserted-by":"publisher","first-page":"235","DOI":"10.2478\/jaiscr-2019-0006","volume":"9","author":"A Shewalkar","year":"2019","unstructured":"Shewalkar A, Nyavanandi D, Ludwig S (2019) Performance evaluation of deep neural networks applied to speech recognition Rnn, lstm and gru. J Artif Intell Soft Comput Res 9:235\u2013245","journal-title":"J Artif Intell Soft Comput Res"},{"key":"10197_CR35","doi-asserted-by":"crossref","unstructured":"Shin Y, Williams L (2008) An empirical model to predict security vulnerabilities using code complexity metrics. In: Proceedings of the second ACM-IEEE international symposium on empirical software engineering and measurement, ESEM \u201908. 
Association for Computing Machinery, New York, pp 315\u2013317","DOI":"10.1145\/1414004.1414065"},{"issue":"1","key":"10197_CR36","doi-asserted-by":"publisher","first-page":"25","DOI":"10.1007\/s10664-011-9190-8","volume":"18","author":"Y Shin","year":"2013","unstructured":"Shin Y, Williams L (2013) Can traditional fault prediction models be used for vulnerability prediction? Empir Softw Eng 18(1):25\u201359","journal-title":"Empir Softw Eng"},{"issue":"6","key":"10197_CR37","doi-asserted-by":"publisher","first-page":"772","DOI":"10.1109\/TSE.2010.81","volume":"37","author":"Y Shin","year":"2011","unstructured":"Shin Y, Meneely A, Williams L, Osborne JA (2011) Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities. IEEE Trans Softw Eng 37(6):772\u2013787","journal-title":"IEEE Trans Softw Eng"},{"key":"10197_CR38","unstructured":"Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks"},{"key":"10197_CR39","doi-asserted-by":"crossref","unstructured":"Tang Y, Zhao F, Yang Y, Lu H, Zhou Y, Xu B (2015) Predicting vulnerable components via text mining or software metrics? an effort-aware perspective. In: QRS. IEEE, pp 27\u201336","DOI":"10.1109\/QRS.2015.15"},{"key":"10197_CR40","unstructured":"The heartbleed bug (2021) https:\/\/heartbleed.com\/. Accessed 1 May 2021"},{"key":"10197_CR41","doi-asserted-by":"crossref","unstructured":"Theisen C, Williams LA (2020) Better together: comparing vulnerability prediction models. Inf Softw Technol 119","DOI":"10.1016\/j.infsof.2019.106204"},{"key":"10197_CR42","doi-asserted-by":"crossref","unstructured":"Tufano M, Watson C, Bavota G, Di Penta M, White M, Poshyvanyk D (2019a) Learning how to mutate source code from bug-fixes. 
In: 2019 IEEE International conference on software maintenance and evolution (ICSME)","DOI":"10.1109\/ICSME.2019.00046"},{"issue":"4","key":"10197_CR43","doi-asserted-by":"publisher","first-page":"19:1","DOI":"10.1145\/3340544","volume":"28","author":"M Tufano","year":"2019","unstructured":"Tufano M, Watson C, Bavota G, Di Penta M, White M, Poshyvanyk D (2019b) An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Trans Softw Eng Methodol 28(4):19:1\u201319:29","journal-title":"ACM Trans Softw Eng Methodol"},{"issue":"2","key":"10197_CR44","first-page":"101","volume":"25","author":"A Vargha","year":"2000","unstructured":"Vargha A, Delaney HD (2000) A critique and improvement of the \u201ccl\u201d common language effect size statistics of Mcgraw and Wong. J Educ Behav Stat 25 (2):101\u2013132","journal-title":"J Educ Behav Stat"},{"key":"10197_CR45","unstructured":"Vulnerabilities (2021) https:\/\/owasp.org\/www-community\/vulnerabilities\/. Accessed 1 May 2021"},{"key":"10197_CR46","doi-asserted-by":"crossref","unstructured":"Wang S, Liu T, Tan L (2016) Automatically learning semantic features for defect prediction. In: Proceedings of the 38th international conference on software engineering, ICSE \u201916. Association for Computing Machinery, New York, pp 297\u2013308","DOI":"10.1145\/2884781.2884804"},{"key":"10197_CR47","doi-asserted-by":"crossref","unstructured":"White M, Tufano M, Vendome C, Poshyvanyk D (2016) Deep learning code fragments for code clone detection. In: 2016 31st IEEE\/ACM international conference on automated software engineering (ASE), pp 87\u201398","DOI":"10.1145\/2970276.2970326"},{"issue":"6","key":"10197_CR48","doi-asserted-by":"publisher","first-page":"80","DOI":"10.2307\/3001968","volume":"1","author":"F Wilcoxon","year":"1945","unstructured":"Wilcoxon F (1945) Individual comparisons by ranking methods. 
Biometrics Bull 1(6):80\u201383","journal-title":"Biometrics Bull"},{"key":"10197_CR49","unstructured":"Wireshark (2021) https:\/\/www.wireshark.org. Accessed 1 May 2021"},{"key":"10197_CR50","doi-asserted-by":"crossref","unstructured":"Yang X, Lo D, Xia X, Zhang Y, Sun J (2015) Deep learning for just-in-time defect prediction. In: 2015 IEEE International conference on software quality, reliability and security, pp 17\u201326","DOI":"10.1109\/QRS.2015.14"},{"key":"10197_CR51","unstructured":"Zhou Y, Liu S, Siow J, Du X, Liu Y (2019) Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks"},{"key":"10197_CR52","doi-asserted-by":"crossref","unstructured":"Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ESEC\/FSE \u201909. 
Association for Computing Machinery, New York, pp 91\u2013100","DOI":"10.1145\/1595696.1595713"}],"container-title":["Empirical Software Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10664-022-10197-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10664-022-10197-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10664-022-10197-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,11,21]],"date-time":"2022-11-21T02:13:07Z","timestamp":1668996787000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10664-022-10197-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,9,20]]},"references-count":53,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2022,12]]}},"alternative-id":["10197"],"URL":"https:\/\/doi.org\/10.1007\/s10664-022-10197-4","relation":{},"ISSN":["1382-3256","1573-7616"],"issn-type":[{"type":"print","value":"1382-3256"},{"type":"electronic","value":"1573-7616"}],"subject":[],"published":{"date-parts":[[2022,9,20]]},"assertion":[{"value":"2 July 2022","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 September 2022","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of Interest"}}],"article-number":"169"}}