{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,3]],"date-time":"2026-07-03T16:33:29Z","timestamp":1783096409276,"version":"3.54.6"},"reference-count":51,"publisher":"Association for Computing Machinery (ACM)","issue":"OOPSLA2","license":[{"start":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T00:00:00Z","timestamp":1759968000000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"NSF","doi-asserted-by":"publisher","award":["CCF-2145774,CCF-2217696"],"award-info":[{"award-number":["CCF-2145774,CCF-2217696"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Program. Lang."],"published-print":{"date-parts":[[2025,10,9]]},"abstract":"<jats:p>Regression testing is an essential part of software development, but it suffers from the presence of flaky tests \u2013 tests that pass and fail non-deterministically when run on the same code. These unpredictable failures waste developers\u2019 time and often hide real bugs. Prior work showed that fine-tuned large language models (LLMs) can classify flaky tests into different categories with very high accuracy. However, we find that prior approaches over-estimated the accuracy of the models due to incorrect experimental design and unrealistic datasets \u2013 making the flaky test classification problem seem simpler than it is.<\/jats:p>\n          <jats:p>In this paper, we first show how prior flaky test classifiers over-estimate the prediction accuracy due to 1) flawed experiment design and 2) mis-representation of the real distribution of flaky (and non-flaky) tests in their datasets. After we fix the experimental design and construct a more realistic dataset (which we name FlakeBench), the prior state-of-the-art model shows a steep drop in F1-score, from 81.82% down to 56.62%. Motivated by these observations, we develop a new training strategy to fine-tune a flaky test classifier, FlakyLens, that improves the classification F1-score to 65.79% (9.17pp higher than the state-of-the-art). We also compare FlakyLens against recent pre-trained LLMs, such as CodeLlama and DeepSeekCoder, on the same classification task. Our results show that FlakyLens consistently outperforms these models, highlighting that general-purpose LLMs still fall short on this specialized task.<\/jats:p>\n          <jats:p>Using our improved flaky test classifier, we identify the important tokens in the test code that influence the models in making correct or incorrect predictions. By leveraging attribution scores computed per code token in each test, we investigate the tokens that have higher impact on the flaky test classifier\u2019s decision-making per flaky test category. To assess the influence of these important tokens, we introduce adversarial perturbation using these important tokens into the tests and observe whether the model\u2019s predictions change. Our findings show that, when introducing perturbations using the most important tokens, the classification accuracy can change by as much as -18.37pp. These results highlight that these models still struggle to generalize beyond their training data and rely on identifying category-specific tokens (instead of understanding their semantic context), calling for further research into more robust training methodologies.<\/jats:p>","DOI":"10.1145\/3763098","type":"journal-article","created":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T08:51:31Z","timestamp":1759999891000},"page":"1345-1371","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Understanding and Improving Flaky Test Classification"],"prefix":"10.1145","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0599-8215","authenticated-orcid":false,"given":"Shanto","family":"Rahman","sequence":"first","affiliation":[{"name":"The University of Texas at Austin, Austin, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8346-4019","authenticated-orcid":false,"given":"Saikat","family":"Dutta","sequence":"additional","affiliation":[{"name":"Cornell University, Ithaca, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8239-3124","authenticated-orcid":false,"given":"August","family":"Shi","sequence":"additional","affiliation":[{"name":"The University of Texas at Austin, Austin, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,10,9]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2016. Flaky tests at Google and how we mitigate them. https:\/\/testing.googleblog.com\/2016\/05\/flaky-tests-at-google-and-how-we.html"},{"key":"e_1_2_1_2_1","unstructured":"2021. IDoFT. http:\/\/mir.cs.illinois.edu\/flakytests"},{"key":"e_1_2_1_3_1","unstructured":"2021. netty. https:\/\/github.com\/netty\/netty"},{"key":"e_1_2_1_4_1","unstructured":"2023. Code Pretraining Models. https:\/\/github.com\/microsoft\/CodeBERT"},{"key":"e_1_2_1_5_1","unstructured":"2024. huggingface. https:\/\/huggingface.co\/"},{"key":"e_1_2_1_6_1","unstructured":"2024. sklearn. https:\/\/scikit-learn.org\/1.5\/modules\/generated\/sklearn.metrics.classification_report.html"},{"key":"e_1_2_1_7_1","unstructured":"2025. Captum. https:\/\/captum.ai\/docs\/extension\/integrated_gradients"},{"key":"e_1_2_1_8_1","unstructured":"2025. Flakify. https:\/\/github.com\/uOttawa-Nanda-Lab\/Flakify\/commit\/3b726f44d5ce5dffc9b190327bc0b95d557c5c6a"},{"key":"e_1_2_1_9_1","unstructured":"2025. FlakyCat. https:\/\/github.com\/Amal-AK\/FLAKYCAT"},{"key":"e_1_2_1_10_1","unstructured":"2025. FlakyLens. https:\/\/github.com\/UT-SE-Research\/FlakyLens"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","unstructured":"2025. Understanding and Improving Flaky Test Classification Artifact. https:\/\/doi.org\/10.5281\/zenodo.15761937 10.5281\/zenodo.15761937","DOI":"10.5281\/zenodo.15761937"},{"key":"e_1_2_1_12_1","unstructured":"2025. Understanding and Improving Flaky Test Classification Website. https:\/\/sites.google.com\/view\/robust-model"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/AST58925.2023.00018"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE43902.2021.00140"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2022.3208864"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3460319.3464844"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3395363.3397366"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3468264.3468615"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3338906.3338945"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2022.3201209"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3591227"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICST49551.2021.00026"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2771783.2771793"},{"key":"e_1_2_1_24_1","volume-title":"Kolla Bhanu Prakash, and GR Kanagachidambaresan","author":"Imambi Sagar","year":"2021","unstructured":"Sagar Imambi, Kolla Bhanu Prakash, and GR Kanagachidambaresan. 2021. PyTorch. Programming with TensorFlow: Solution for Edge Computing Applications, 87\u2013104."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i12.26739"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2009.07896"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3377813.3381370"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3293882.3330570"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICST.2019.00038"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSRE5003.2020.00045"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3540250.3558956"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.324"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295230"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2635868.2635920"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE-SEIP.2019.00018"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE-SEIP.2017.16"},{"key":"e_1_2_1_37_1","volume-title":"International Conference on Software Testing, Verification, and Validation.","author":"Micco John","year":"2017","unstructured":"John Micco. 2017. The state of continuous integration testing@ google. In International Conference on Software Testing, Verification, and Validation."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.5555\/3454287.3455008"},{"key":"e_1_2_1_39_1","doi-asserted-by":"crossref","unstructured":"Gustavo Pinto Breno Miranda Supun Dissanayake Marcelo d\u2019Amorim Christoph Treude and Antonia Bertolino. 2020. What is the Vocabulary of Flaky Tests? In Mining Software Repositories. 492\u2013502.","DOI":"10.1145\/3379597.3387482"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICST60714.2024.00018"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICST60714.2024.00032"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939778"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","unstructured":"Soumya Sanyal and Xiang Ren. 2021. Discretized integrated gradients for explaining language models. arXiv preprint arXiv:2108.13654 https:\/\/doi.org\/10.48550\/arXiv.2108.13654 10.48550\/arXiv.2108.13654","DOI":"10.48550\/arXiv.2108.13654"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICST.2016.40"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSME46990.2020.00037"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.5555\/3305890.3306024"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3510454.3516846"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3510003.3510146"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/2610384.2610404"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2202.00089"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE-SEIP.2017.13"}],"container-title":["Proceedings of the ACM on Programming Languages"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3763098","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3763098","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:40:38Z","timestamp":1760031638000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3763098"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,9]]},"references-count":51,"journal-issue":{"issue":"OOPSLA2","published-print":{"date-parts":[[2025,10,9]]}},"alternative-id":["10.1145\/3763098"],"URL":"https:\/\/doi.org\/10.1145\/3763098","relation":{},"ISSN":["2475-1421"],"issn-type":[{"value":"2475-1421","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,9]]},"assertion":[{"value":"2025-03-26","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-08-12","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}