{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,26]],"date-time":"2025-09-26T04:59:33Z","timestamp":1758862773528,"version":"3.41.0"},"reference-count":59,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,12,21]],"date-time":"2023-12-21T00:00:00Z","timestamp":1703116800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Softw. Eng. Methodol."],"published-print":{"date-parts":[[2024,2,29]]},"abstract":"<jats:p>\n            Code search is a common yet important activity of software developers. An efficient code search model can largely facilitate the development process and improve the programming quality. Given the superb performance of learning the contextual representations, deep learning models, especially pre-trained language models, have been widely explored for the code search task. However, studies mainly focus on proposing new architectures for ever-better performance on designed test sets but ignore the performance on unseen test data where only natural language queries are available. The same problem in other domains, e.g., CV and NLP, is usually solved by test input selection that uses a subset of the unseen set to reduce the labeling effort. However, approaches from other domains are not directly applicable and still require labeling effort. In this article, we propose the\n            <jats:bold>k<\/jats:bold>\n            NN-b\n            <jats:bold>a<\/jats:bold>\n            sed\n            <jats:bold>p<\/jats:bold>\n            erformance t\n            <jats:bold>e<\/jats:bold>\n            sting (\n            <jats:bold>KAPE<\/jats:bold>\n            ) to efficiently solve the problem without manually matching code snippets to test queries. 
The main idea is to use semantically similar training data to perform the evaluation. Extensive experiments on six programming language datasets, three state-of-the-art pre-trained models, and seven baseline methods demonstrate that KAPE can effectively assess the model performance (e.g., CodeBERT achieves MRR 0.5795 on JavaScript) with a slight difference (e.g., 0.0261).\n          <\/jats:p>","DOI":"10.1145\/3624735","type":"journal-article","created":{"date-parts":[[2023,9,16]],"date-time":"2023-09-16T10:21:44Z","timestamp":1694859704000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["KAPE:\n            <i>k<\/i>\n            NN-based Performance Testing for Deep Code Search"],"prefix":"10.1145","volume":"33","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5535-2420","authenticated-orcid":false,"given":"Yuejun","family":"Guo","sequence":"first","affiliation":[{"name":"Luxembourg Institute of Science and Technology, Luxembourg"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8251-1669","authenticated-orcid":false,"given":"Qiang","family":"Hu","sequence":"additional","affiliation":[{"name":"SnT, University of Luxembourg, Luxembourg"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1288-6502","authenticated-orcid":false,"given":"Xiaofei","family":"Xie","sequence":"additional","affiliation":[{"name":"Singapore Management University, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8312-1358","authenticated-orcid":false,"given":"Maxime","family":"Cordy","sequence":"additional","affiliation":[{"name":"SnT, University of Luxembourg, Luxembourg"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1852-2547","authenticated-orcid":false,"given":"Mike","family":"Papadakis","sequence":"additional","affiliation":[{"name":"SnT, University of Luxembourg, Luxembourg"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1045-4861","authenticated-orcid":false,"given":"Yves","family":"Le 
Traon","sequence":"additional","affiliation":[{"name":"SnT, University of Luxembourg, Luxembourg"}]}],"member":"320","published-online":{"date-parts":[[2023,12,21]]},"reference":[{"key":"e_1_3_3_2_2","doi-asserted-by":"crossref","unstructured":"David Adedayo Adeniyi Zhaoqiang Wei and Yang Yongquan. 2016. Automated web usage data mining and recommendation system using K-Nearest Neighbor (KNN) classification method. Appl. Comput. Inform. 12 1 (2016) 90\u2013108.","DOI":"10.1016\/j.aci.2014.10.001"},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2019.2944914"},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.1080\/00031305.1992.10475879"},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/3338906.3340458"},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394112"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.1810.04805"},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.infsof.2021.106542"},{"key":"e_1_3_3_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3395363.3397357"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2002.08155"},{"key":"e_1_3_3_11_2","unstructured":"GitHub. 2008. GitHub: A Platform and Cloud-based Service for Software Development and Version Control. Retrieved from https:\/\/github.com\/"},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10710-017-9314-z"},{"key":"e_1_3_3_13_2","volume-title":"Softmax Units for Multinoulli Output Distributions. Deep Learning","author":"Goodfellow Ian","year":"2016","unstructured":"Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Softmax Units for Multinoulli Output Distributions. Deep Learning. MIT Press."},{"key":"e_1_3_3_14_2","unstructured":"Google. 2007. AI Platform Data Labeling Service Pricing. 
Retrieved from https:\/\/cloud.google.com\/ai-platform\/data-labeling\/pricing"},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3180155.3180167"},{"key":"e_1_3_3_16_2","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2009.08366"},{"key":"e_1_3_3_17_2","unstructured":"Yuejun Guo. 2022. Project Site of KAPE. Retrieved from https:\/\/sites.google.com\/view\/kape4dcs\/"},{"key":"e_1_3_3_18_2","unstructured":"Yuejun Guo Qiang Hu Maxime Cordy Mike Papadakis and Yves Le Traon. 2021. Robust active learning: Sample-efficient training of robust deep learning models. CoRR abs\/2112.02542 (2021)."},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3511598"},{"key":"e_1_3_3_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASE51524.2021.9678672"},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE48619.2023.00152"},{"key":"e_1_3_3_22_2","unstructured":"Hamel Husain Ho-Hsiang Wu Tiferet Gazit Miltiadis Allamanis and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019)."},{"key":"e_1_3_3_23_2","unstructured":"Been Kim Rajiv Khanna and Oluwasanmi O. Koyejo. 2016. Examples are not enough learn to criticize! Criticism for interpretability. Adv. Neural Inf. Process. Syst. 29 (2016)."},{"key":"e_1_3_3_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3180155.3180187"},{"key":"e_1_3_3_25_2","series-title":"Proceedings of the 38th International Conference on Machine Learning","first-page":"5637","volume":"139","author":"Koh Pang Wei","year":"2021","unstructured":"Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton Earnshaw, Imran Haque, Sara M. Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. 2021. 
WILDS: A benchmark of in-the-wild distribution shifts. In Proceedings of the 38th International Conference on Machine Learning(Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 5637\u20135664. Retrieved from https:\/\/proceedings.mlr.press\/v139\/koh21a.html"},{"key":"e_1_3_3_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3395363.3397346"},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3338906.3338930"},{"key":"e_1_3_3_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3480027"},{"key":"e_1_3_3_29_2","unstructured":"Shangqing Liu Xiaofei Xie Lei Ma Jing Kai Siow and Yang Liu. 2021. GraphSearchNet: Enhancing GNNs via capturing global dependency for semantic code search. CoRR abs\/2111.02671 (2021)."},{"key":"e_1_3_3_30_2","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.1907.11692"},{"key":"e_1_3_3_31_2","unstructured":"Google LLC. 1998. Google. Retrieved from https:\/\/www.google.com\/"},{"key":"e_1_3_3_32_2","unstructured":"Shuai Lu Daya Guo Shuo Ren Junjie Huang Alexey Svyatkovskiy Ambrosio Blanco Colin B. Clement Dawn Drain Daxin Jiang Duyu Tang Ge Li Lidong Zhou Linjun Shou Long Zhou Michele Tufano Ming Gong Ming Zhou Nan Duan Neel Sundaresan Shao Kun Deng Shengyu Fu and Shujie Liu. 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. CoRR abs\/2102.04664 (2021)."},{"key":"e_1_3_3_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASE.2015.42"},{"key":"e_1_3_3_34_2","first-page":"120","volume-title":"DeepGauge: Multi-granularity Testing Criteria for Deep Learning Systems","author":"Ma Lei","year":"2018","unstructured":"Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang. 2018. DeepGauge: Multi-granularity Testing Criteria for Deep Learning Systems. Association for Computing Machinery, New York, NY, 120\u2013131. 
Retrieved from https:\/\/doi.org\/10.1145\/3238147.3238202"},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2011.84"},{"key":"e_1_3_3_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/2522920.2522930"},{"key":"e_1_3_3_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3132747.3132785"},{"key":"e_1_3_3_38_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11431-020-1647-3"},{"key":"e_1_3_3_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/2884781.2884808"},{"key":"e_1_3_3_40_2","doi-asserted-by":"crossref","unstructured":"Peter J. Rousseeuw and Mia Hubert. 2011. Robust statistics for outlier detection. Wiley Interdiscip. Rev.: Data Min. Knowl. Discov. 1 1 (2011) 73\u201379.","DOI":"10.1002\/widm.2"},{"key":"e_1_3_3_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/2786805.2786855"},{"key":"e_1_3_3_42_2","doi-asserted-by":"crossref","first-page":"309","DOI":"10.1007\/3-540-44816-0_31","volume-title":"Advances in Intelligent Data Analysis","author":"Scheffer Tobias","year":"2001","unstructured":"Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001. Active hidden Markov models for information extraction. In Advances in Intelligent Data Analysis. Springer Berlin, 309\u2013318."},{"key":"e_1_3_3_43_2","unstructured":"SciPy. 2023. SciPy: Open-source Python Library. Retrieved from https:\/\/scipy.org\/"},{"key":"e_1_3_3_44_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Sener Ozan","year":"2018","unstructured":"Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_3_45_2","volume-title":"Active Learning Literature Survey","author":"Settles Burr","year":"2010","unstructured":"Burr Settles. 2010. Active Learning Literature Survey. Technical Report 1648. 
University of Wisconsin, Madison."},{"key":"e_1_3_3_46_2","doi-asserted-by":"crossref","unstructured":"Kanish Shah Henil Patel Devanshi Sanghvi and Manan Shah. 2020. A comparative analysis of logistic regression random forest and KNN models for the text classification. Augm. Hum. Res. 5 1 (2020) 1\u201316.","DOI":"10.1007\/s41133-020-00032-0"},{"key":"e_1_3_3_47_2","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2110.09610"},{"key":"e_1_3_3_48_2","doi-asserted-by":"crossref","first-page":"410","DOI":"10.1145\/3324884.3416621","volume-title":"Proceedings of the IEEE\/ACM International Conference on Automated Software Engineering","author":"Shen Weijun","year":"2020","unstructured":"Weijun Shen, Yanhui Li, Lin Chen, Yuanlei Han, Yuming Zhou, and Baowen Xu. 2020. Multiple-boundary clustering and prioritization to promote neural network retraining. In Proceedings of the IEEE\/ACM International Conference on Automated Software Engineering. Association for Computing Machinery, New York, United States, 410\u2013422."},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3387904.3389269"},{"key":"e_1_3_3_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/MSR.2013.6624044"},{"key":"e_1_3_3_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/aitb48515.2019.8947433"},{"key":"e_1_3_3_52_2","unstructured":"StackOverflow. 2008. StackOverflow. Retrieved from https:\/\/stackoverflow.com\/"},{"key":"e_1_3_3_53_2","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2202.06649"},{"key":"e_1_3_3_54_2","doi-asserted-by":"crossref","unstructured":"G. J. G. Upton. 1987. An introduction to mathematical statistics and its applications by R. J. Larsen and M. L. Marx. Pp 630.\u00a3 17\u00b7 95. 1987. ISBN 13-487166-9 (Prentice-Hall). Math. Gaz. 
71 458 (1987) 330\u2013330.","DOI":"10.2307\/3617085"},{"key":"e_1_3_3_55_2","first-page":"146","volume-title":"DeepHunter: A Coverage-guided Fuzz Testing Framework for Deep Neural Networks","author":"Xie Xiaofei","year":"2019","unstructured":"Xiaofei Xie, Lei Ma, Felix Juefei-Xu, Minhui Xue, Hongxu Chen, Yang Liu, Jianjun Zhao, Bo Li, Jianxiong Yin, and Simon See. 2019. DeepHunter: A Coverage-guided Fuzz Testing Framework for Deep Neural Networks. Association for Computing Machinery, New York, NY, 146\u2013157. DOI:https:\/\/doi.org\/10.1145\/3293882.3330579"},{"key":"e_1_3_3_56_2","unstructured":"R. Baeza Yates and B. Ribeiro Neto. 2011. Modern Information Retrieval: The Concepts and Technology behind Search. Addison-Wesley Professional."},{"key":"e_1_3_3_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/3546066"},{"key":"e_1_3_3_58_2","unstructured":"Jie M. Zhang Mark Harman Lei Ma and Yang Liu. 2019. Machine learning testing: Survey landscapes and horizons. CoRR abs\/1906.10742 (2019)."},{"key":"e_1_3_3_59_2","doi-asserted-by":"crossref","unstructured":"Shichao Zhang Xuelong Li Ming Zong Xiaofeng Zhu and Debo Cheng. 2017. Learning k for knn classification. ACM Trans. Intell. Syst. Technol. 
8 3 (2017) 1\u201319.","DOI":"10.1145\/2990508"},{"key":"e_1_3_3_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/TETCI.2021.3100641"}],"container-title":["ACM Transactions on Software Engineering and Methodology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3624735","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3624735","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:35:44Z","timestamp":1750178144000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3624735"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,21]]},"references-count":59,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,2,29]]}},"alternative-id":["10.1145\/3624735"],"URL":"https:\/\/doi.org\/10.1145\/3624735","relation":{},"ISSN":["1049-331X","1557-7392"],"issn-type":[{"type":"print","value":"1049-331X"},{"type":"electronic","value":"1557-7392"}],"subject":[],"published":{"date-parts":[[2023,12,21]]},"assertion":[{"value":"2022-06-13","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-08-22","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-12-21","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}