{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,22]],"date-time":"2026-01-22T01:02:44Z","timestamp":1769043764686,"version":"3.49.0"},"reference-count":58,"publisher":"Association for Computing Machinery (ACM)","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Softw. Eng. Methodol."],"abstract":"<jats:p>Regularly testing deep learning-powered systems on newly collected data is critical to ensure their reliability, robustness, and efficacy in real-world applications. This process is demanding due to the significant time and human effort required for labeling new data. While test selection methods alleviate manual labor by labeling and evaluating only a subset of data while meeting testing criteria, we observe that such methods with reported promising results are simply evaluated, e.g., testing on original test data. The question arises: are they always reliable? In this paper, we explore when and to what extent test selection methods fail. First, we identify potential pitfalls of 11 selection methods based on their construction. Second, we conduct a study to empirically confirm the existence of these pitfalls. Furthermore, we demonstrate how pitfalls can break the reliability of these methods. Concretely, methods for fault detection suffer from data that are: 1) correctly classified but uncertain, or 2) misclassified but confident. Remarkably, the test relative coverage achieved by such methods drops by up to 86.85%. Besides, methods for performance estimation are sensitive to the choice of intermediate-layer output. 
The effectiveness of such methods can be even worse than random selection when using an inappropriate layer.<\/jats:p>","DOI":"10.1145\/3715693","type":"journal-article","created":{"date-parts":[[2025,1,29]],"date-time":"2025-01-29T15:45:49Z","timestamp":1738165549000},"update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Assessing the Robustness of Test Selection Methods for Deep Neural Networks"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8251-1669","authenticated-orcid":false,"given":"Qiang","family":"Hu","sequence":"first","affiliation":[{"name":"Tianjin University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5535-2420","authenticated-orcid":false,"given":"Yuejun","family":"Guo","sequence":"additional","affiliation":[{"name":"Luxembourg Institute of Science and Technology, Luxembourg"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1288-6502","authenticated-orcid":false,"given":"Xiaofei","family":"Xie","sequence":"additional","affiliation":[{"name":"Singapore Management University, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8312-1358","authenticated-orcid":false,"given":"Maxime","family":"Cordy","sequence":"additional","affiliation":[{"name":"University of Luxembourg, Luxembourg"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0044-466X","authenticated-orcid":false,"given":"Wei","family":"Ma","sequence":"additional","affiliation":[{"name":"Singapore Management University, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1852-2547","authenticated-orcid":false,"given":"Mike","family":"Papadakis","sequence":"additional","affiliation":[{"name":"University of Luxembourg, 
Luxembourg"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8621-2420","authenticated-orcid":false,"given":"Lei","family":"Ma","sequence":"additional","affiliation":[{"name":"The University of Tokyo & University of Alberta, Japan & Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1045-4861","authenticated-orcid":false,"given":"Yves","family":"Le Traon","sequence":"additional","affiliation":[{"name":"University of Luxembourg, Luxembourg"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,1,29]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2020.113816"},{"key":"e_1_2_1_2_1","volume-title":"On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705","author":"Carlini Nicholas","year":"2019","unstructured":"Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. 2019. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705 (2019)."},{"key":"e_1_2_1_3_1","doi-asserted-by":"crossref","unstructured":"Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp). Ieee 39\u201357.","DOI":"10.1109\/SP.2017.49"},{"key":"e_1_2_1_4_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3394112","article-title":"Practical accuracy estimation for efficient deep neural network testing","volume":"29","author":"Chen Junjie","year":"2020","unstructured":"Junjie Chen, Zhuo Wu, Zan Wang, Hanmo You, Lingming Zhang, and Ming Yan. 2020. Practical accuracy estimation for efficient deep neural network testing. 
ACM Transactions on Software Engineering and Methodology (TOSEM) 29, 4 (2020), 1\u201335.","journal-title":"ACM Transactions on Software Engineering and Methodology (TOSEM)"},{"key":"e_1_2_1_5_1","volume-title":"International Conference on Machine Learning. PMLR, 1617\u20131629","author":"Chen Mayee","year":"2021","unstructured":"Mayee Chen, Karan Goel, Nimit S Sohoni, Fait Poms, Kayvon Fatahalian, and Christopher R\u00e9. 2021. Mandoline: Model evaluation under distribution shift. In International Conference on Machine Learning. PMLR, 1617\u20131629."},{"key":"e_1_2_1_6_1","volume-title":"Test input prioritization for Machine Learning Classifiers","author":"Dang Xueqi","year":"2024","unstructured":"Xueqi Dang, Yinghua Li, Mike Papadakis, Jacques Klein, Tegawend\u00e9 F Bissyand\u00e9, and Yves Le Traon. 2024. Test input prioritization for Machine Learning Classifiers. IEEE Transactions on Software Engineering (2024)."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICECCS51672.2020.00016"},{"key":"e_1_2_1_8_1","volume-title":"A systematic review of robustness in deep learning for computer vision: Mind the gap? arXiv preprint arXiv:2112.00639","author":"Drenkow Nathan","year":"2021","unstructured":"Nathan Drenkow, Numair Sani, Ilya Shpitser, and Mathias Unberath. 2021. A systematic review of robustness in deep learning for computer vision: Mind the gap? arXiv preprint arXiv:2112.00639 (2021)."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3395363.3397357"},{"key":"e_1_2_1_10_1","volume-title":"IEEE\/ACM 44th International Conference on Software Engineering (ICSE). 73\u201385","author":"Gao Xinyu","year":"2022","unstructured":"Xinyu Gao, Yang Feng, Yining Yin, Zixi Liu, Zhenyu Chen, and Baowen Xu. 2022. Adaptive test selection for deep neural networks. In IEEE\/ACM 44th International Conference on Software Engineering (ICSE). 73\u201385. 
https:\/\/doi.org\/10.1145\/3510003.3510232"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3377811.3380415"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2018.00058"},{"key":"e_1_2_1_13_1","volume-title":"Generative adversarial nets. Advances in neural information processing systems 27","author":"Goodfellow Ian","year":"2014","unstructured":"Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in neural information processing systems 27 (2014)."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICST60714.2024.00016"},{"key":"e_1_2_1_15_1","volume-title":"Proceedings of the IEEE\/ACM 46th International Conference on Software Engineering. 1\u201312","author":"Guerriero Antonio","year":"2024","unstructured":"Antonio Guerriero, Roberto Pietrantuono, and Stefano Russo. 2024. DeepSample: DNN sampling-based testing for operational accuracy assessment. In Proceedings of the IEEE\/ACM 46th International Conference on Software Engineering. 1\u201312."},{"key":"e_1_2_1_16_1","volume-title":"International conference on machine learning. PMLR, 1321\u20131330","author":"Guo Chuan","year":"2017","unstructured":"Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. 2017. On calibration of modern neural networks. In International conference on machine learning. PMLR, 1321\u20131330."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_18_1","volume-title":"International Conference on Learning Representations","author":"Hendrycks Dan","year":"2019","unstructured":"Dan Hendrycks and Thomas Dietterich. 2019. Benchmarking neural network robustness to common corruptions and perturbations. 
International Conference on Learning Representations (2019)."},{"key":"e_1_2_1_19_1","volume-title":"5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. https:\/\/openreview.net\/forum?id=Hkg4TI9xl","author":"Hendrycks Dan","year":"2017","unstructured":"Dan Hendrycks and Kevin Gimpel. 2017. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. https:\/\/openreview.net\/forum?id=Hkg4TI9xl"},{"key":"e_1_2_1_20_1","volume-title":"International Joint Conference on Neural Networks.","author":"Houben Sebastian","year":"2013","unstructured":"Sebastian Houben, Johannes Stallkamp, Jan Salmen, Marc Schlipsing, and Christian Igel. 2013. Detection of traffic signs in real-world images: the German traffic sign detection benchmark. In International Joint Conference on Neural Networks."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2015.58"},{"key":"e_1_2_1_22_1","volume-title":"An empirical study on data distribution-aware test selection for deep learning enhancement. ACM Transactions on Software Engineering and Methodology","author":"Hu Qiang","year":"2022","unstructured":"Qiang Hu, Yuejun Guo, Maxime Cordy, Xiaofei Xie, Lei Ma, Mike Papadakis, and Yves Le Traon. 2022. An empirical study on data distribution-aware test selection for deep learning enhancement. ACM Transactions on Software Engineering and Methodology (2022)."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASE51524.2021.9678672"},{"key":"e_1_2_1_24_1","unstructured":"Qiang Hu Yuejun Guo Xiaofei Xie Maxime Cordy Lei Ma Mike Papadakis and Yves Le Traon. 2024. Test Optimization in DNN Testing: A Survey. ACM Trans. Softw. Eng. Methodol. 
(2024)."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASE.2019.00126"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.243"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE.2019.00108"},{"key":"e_1_2_1_29_1","unstructured":"Yann LeCun et al. 2015. LeNet-5 convolutional neural networks. URL: http:\/\/yann. lecun. com\/exdb\/lenet 20 5 (2015) 14."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/3327757.3327819"},{"key":"e_1_2_1_32_1","volume-title":"Wortman Vaughan (Eds.)","volume":"34","author":"YU LI","year":"2021","unstructured":"YU LI, Min LI, Qiuxia LAI, Yannan Liu, and Qiang Xu. 2021. TestRank: bringing order into unlabeled test instances for deep learning tasks. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 20874\u201320886."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3338906.3338930"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3238147.3238202"},{"key":"e_1_2_1_35_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3417330","article-title":"Test selection for deep learning systems","volume":"30","author":"Ma Wei","year":"2021","unstructured":"Wei Ma, Mike Papadakis, Anestis Tsakmalis, Maxime Cordy, and Yves Le Traon. 2021. Test selection for deep learning systems. 
ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 2 (2021), 1\u201322.","journal-title":"ACM Transactions on Software Engineering and Methodology (TOSEM)"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.21105\/joss.00205"},{"key":"e_1_2_1_37_1","volume-title":"NIPS Workshop on Deep Learning and Unsupervised Feature Learning","author":"Netzer Yuval","year":"2011","unstructured":"Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011."},{"key":"e_1_2_1_38_1","volume-title":"Can you trust your model's uncertainty? evaluating predictive uncertainty under dataset shift. Advances in neural information processing systems 32","author":"Ovadia Yaniv","year":"2019","unstructured":"Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. 2019. Can you trust your model's uncertainty? evaluating predictive uncertainty under dataset shift. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_2_1_39_1","first-page":"320","article-title":"The power of Student's t-test","volume":"60","author":"Owen Donald B","year":"1965","unstructured":"Donald B Owen. 1965. The power of Student's t-test. J. Amer. Statist. Assoc. 60, 309 (1965), 320\u2013333.","journal-title":"J. Amer. Statist. Assoc."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132747.3132785"},{"key":"e_1_2_1_41_1","volume-title":"Energy-based Automated Model Evaluation. arXiv preprint arXiv:2401.12689","author":"Peng Ru","year":"2024","unstructured":"Ru Peng, Heming Zou, Haobo Wang, Yawen Zeng, Zenan Huang, and Junbo Zhao. 2024. Energy-based Automated Model Evaluation. 
arXiv preprint arXiv:2401.12689 (2024)."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASE51524.2021.9678764"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3324884.3416621"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1"},{"key":"e_1_2_1_45_1","volume-title":"International Conference on Learning Representations","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (2014)."},{"key":"e_1_2_1_46_1","volume-title":"Fast and effective robustness certification. Advances in neural information processing systems 31","author":"Singh Gagandeep","year":"2018","unstructured":"Gagandeep Singh, Timon Gehr, Matthew Mirman, Markus P\u00fcschel, and Martin Vechev. 2018. Fast and effective robustness certification. Advances in neural information processing systems 31 (2018)."},{"key":"e_1_2_1_47_1","doi-asserted-by":"crossref","unstructured":"Charles Spearman. 1961. The proof and measurement of association between two things. (1961).","DOI":"10.1037\/11491-005"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2023.3330982"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE43902.2021.00046"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE43902.2021.00046"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3533767.3534375"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3331184.3331388"},{"key":"e_1_2_1_53_1","volume-title":"Less: Selecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333","author":"Xia Mengzhou","year":"2024","unstructured":"Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. 2024. Less: Selecting influential data for targeted instruction tuning. 
arXiv preprint arXiv:2402.04333 (2024)."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3293882.3330579"},{"key":"e_1_2_1_55_1","doi-asserted-by":"crossref","unstructured":"Xiaofei Xie Lei Ma Haijun Wang Yuekang Li Yang Liu and Xiaohong Li. 2019b. DiffChaser: detecting disagreements for deep neural networks. In IJCAI. 5772\u20135778.","DOI":"10.24963\/ijcai.2019\/800"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3368089.3409671"},{"key":"e_1_2_1_57_1","volume-title":"Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models. arXiv preprint arXiv:2403.07384","author":"Yang Yu","year":"2024","unstructured":"Yu Yang, Siddhartha Mishra, Jeffrey N Chiang, and Baharan Mirzasoleiman. 2024. SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models. arXiv preprint arXiv:2403.07384 (2024)."},{"key":"e_1_2_1_58_1","volume-title":"2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 408\u2013419","author":"Yang Zhou","year":"2022","unstructured":"Zhou Yang, Jieke Shi, Muhammad Hilmi Asyrofi, and David Lo. 2022. Revisiting neuron coverage metrics and quality of deep neural networks. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). 
IEEE, 408\u2013419."},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3377811.3380368"}],"container-title":["ACM Transactions on Software Engineering and Methodology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3715693","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,29]],"date-time":"2025-01-29T15:46:04Z","timestamp":1738165564000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3715693"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,29]]},"references-count":58,"alternative-id":["10.1145\/3715693"],"URL":"https:\/\/doi.org\/10.1145\/3715693","relation":{},"ISSN":["1049-331X","1557-7392"],"issn-type":[{"value":"1049-331X","type":"print"},{"value":"1557-7392","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,1,29]]},"assertion":[{"value":"2024-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-01-18","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-01-29","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}],"article-number":"3715693"}}