{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,7]],"date-time":"2026-03-07T18:37:36Z","timestamp":1772908656073,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":90,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,6,23]]},"DOI":"10.1145\/3696630.3728531","type":"proceedings-article","created":{"date-parts":[[2025,7,28]],"date-time":"2025-07-28T19:08:09Z","timestamp":1753729689000},"page":"51-63","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-1988-6219","authenticated-orcid":false,"given":"Zhihan","family":"Jiang","sequence":"first","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong SAR, Hong Kong"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-6962-5292","authenticated-orcid":false,"given":"Junjie","family":"Huang","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong SAR, Hong Kong"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6195-9088","authenticated-orcid":false,"given":"Guangba","family":"Yu","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong SAR, Hong Kong"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5158-6716","authenticated-orcid":false,"given":"Zhuangbin","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Software Engineering, Sun Yat-sen University, Zhuhai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-8370-644X","authenticated-orcid":false,"given":"Yichen","family":"Li","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong SAR, Hong 
Kong"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6626-4437","authenticated-orcid":false,"given":"Renyi","family":"Zhong","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong SAR, Hong Kong"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-5556-4004","authenticated-orcid":false,"given":"Cong","family":"Feng","sequence":"additional","affiliation":[{"name":"Computing and Networking Innovation Lab,Huawei Cloud Computing Technology Co., Ltd, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9733-4346","authenticated-orcid":false,"given":"Yongqiang","family":"Yang","sequence":"additional","affiliation":[{"name":"Huawei Technologies, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6307-7310","authenticated-orcid":false,"given":"Zengyin","family":"Yang","sequence":"additional","affiliation":[{"name":"Huawei Cloud Computing Technology Co., Ltd, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3666-5798","authenticated-orcid":false,"given":"Michael","family":"Lyu","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong SAR, Hong Kong"}]}],"member":"320","published-online":{"date-parts":[[2025,7,28]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2022. The Technology Behind BLOOM Training. https:\/\/huggingface.co\/blog\/bloom-megatron-deepspeed"},{"key":"e_1_3_2_1_2_1","unstructured":"2024. AI Core failure. https:\/\/www.hiascend.com\/document\/detail\/zh\/canncommercial\/80RC3\/developmentguide\/maintenref\/troubleshooting\/troubleshooting_0004.html"},{"key":"e_1_3_2_1_3_1","unstructured":"2024. Amazon SageMaker. https:\/\/aws.amazon.com\/sagemaker\/"},{"key":"e_1_3_2_1_4_1","unstructured":"2024. CANN Toolkit. https:\/\/www.hiascend.com\/en\/software\/cann"},{"key":"e_1_3_2_1_5_1","unstructured":"2024. Creating a child process in the fork method causes the application process to get stuck. 
https:\/\/www.hiascend.com\/document\/detail\/zh\/canncommercial\/80RC3\/developmentguide\/maintenref\/troubleshooting\/troubleshooting_0075.html"},{"key":"e_1_3_2_1_6_1","unstructured":"2024. CUDA Toolkit. https:\/\/developer.nvidia.com\/cuda-toolkit"},{"key":"e_1_3_2_1_7_1","unstructured":"2024. DeepSpeed. https:\/\/www.deepspeed.ai\/"},{"key":"e_1_3_2_1_8_1","unstructured":"2024. ECC failure. https:\/\/support.huaweicloud.com\/trouble-ecs\/ecs_trouble_1626.html"},{"key":"e_1_3_2_1_9_1","unstructured":"2024. Failed to load checkpoint after write failure to S3 backend. https:\/\/github.com\/pulumi\/pulumi\/issues\/2801"},{"key":"e_1_3_2_1_10_1","unstructured":"2024. Google Vertex AI. https:\/\/console.cloud.google.com\/vertex-ai?hl=en&inv=1&invt=Abkx0g&project=fine-effect-362306"},{"key":"e_1_3_2_1_11_1","unstructured":"2024. Grok-1. https:\/\/github.com\/xai-org\/grok-1"},{"key":"e_1_3_2_1_12_1","unstructured":"2024. Invalid Ranktable Configuration. https:\/\/www.hiascend.com\/document\/detail\/zh\/canncommercial\/80RC3\/developmentguide\/maintenref\/troubleshooting\/atlaserrorcode_15_0244.html"},{"key":"e_1_3_2_1_13_1","unstructured":"2024. MindSpore LLM Platform. https:\/\/xihe.mindspore.cn\/en"},{"key":"e_1_3_2_1_14_1","unstructured":"2024. NPU network port Link failure. https:\/\/www.hiascend.com\/document\/caselibrary\/detail\/topic_0000001792986414"},{"key":"e_1_3_2_1_15_1","unstructured":"2024. Ongoing research training transformer models at scale. https:\/\/github.com\/NVIDIA\/Megatron-LM"},{"key":"e_1_3_2_1_16_1","unstructured":"2024. PyTorch. https:\/\/pytorch.org"},{"key":"e_1_3_2_1_17_1","unstructured":"2024. Stream mode cannot be set in current driver version. https:\/\/www.hiascend.com\/document\/caselibrary\/detail\/case_9542"},{"key":"e_1_3_2_1_18_1","unstructured":"2024. Transformers. https:\/\/huggingface.co\/docs\/transformers\/en\/index"},{"key":"e_1_3_2_1_19_1","unstructured":"2024. The wait execution of the Notify register times out. 
https:\/\/www.hiascend.com\/document\/caselibrary\/detail\/case_9526"},{"key":"e_1_3_2_1_20_1","volume-title":"2019 IEEE\/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 140\u2013151","author":"Amar Anunay","year":"2019","unstructured":"Anunay Amar and Peter C Rigby. 2019. Mining historical test logs to predict bugs and localize faults in the test logs. In 2019 IEEE\/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 140\u2013151."},{"key":"e_1_3_2_1_21_1","volume-title":"2022 IEEE International Conference on Services Computing (SCC). IEEE, 321\u2013326","author":"Bogatinovski Jasmin","year":"2022","unstructured":"Jasmin Bogatinovski, Gjorgji Madjarov, Sasho Nedelkoski, Jorge Cardoso, and Odej Kao. 2022. Leveraging log instructions in log-based anomaly detection. In 2022 IEEE International Conference on Services Computing (SCC). IEEE, 321\u2013326."},{"key":"e_1_3_2_1_22_1","volume-title":"Anomaly detection: A survey. ACM computing surveys (CSUR) 41, 3","author":"Chandola Varun","year":"2009","unstructured":"Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly detection: A survey. ACM computing surveys (CSUR) 41, 3 (2009), 1\u201358."},{"key":"e_1_3_2_1_23_1","volume-title":"NeuralLog: Natural language inference with joint neural and logical reasoning. arXiv preprint arXiv:2105.14167","author":"Chen Zeming","year":"2021","unstructured":"Zeming Chen, Qiyue Gao, and Lawrence S Moss. 2021. NeuralLog: Natural language inference with joint neural and logical reasoning. arXiv preprint arXiv:2105.14167 (2021)."},{"key":"e_1_3_2_1_24_1","volume-title":"Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https:\/\/vicuna.lmsys.org (accessed","author":"Chiang Wei-Lin","year":"2023","unstructured":"Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. 
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https:\/\/vicuna.lmsys.org (accessed 14 April 2023) 2, 3 (2023), 6."},{"key":"e_1_3_2_1_25_1","volume-title":"Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological bulletin 70, 4","author":"Cohen Jacob","year":"1968","unstructured":"Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological bulletin 70, 4 (1968), 213."},{"key":"e_1_3_2_1_26_1","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)."},{"key":"e_1_3_2_1_27_1","volume-title":"19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)","author":"Eisenman Assaf","year":"2022","unstructured":"Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha Smelyanskiy, and Murali Annavaram. 2022. {Check-N-Run}: A checkpointing system for training deep learning recommendation models. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). 929\u2013943."},{"key":"e_1_3_2_1_28_1","volume-title":"2023 38th IEEE\/ACM International Conference on Automated Software Engineering (ASE). IEEE, 761\u2013773","author":"Gao Shuzheng","year":"2023","unstructured":"Shuzheng Gao, Xin-Cheng Wen, Cuiyun Gao, Wenxuan Wang, Hongyu Zhang, and Michael R Lyu. 2023. What makes good in-context demonstrations for code intelligence tasks with llms?. In 2023 38th IEEE\/ACM International Conference on Automated Software Engineering (ASE). IEEE, 761\u2013773."},{"key":"e_1_3_2_1_29_1","volume-title":"An Empirical Study on Quality Issues of Deep Learning Platform. 
In 2023 IEEE\/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 455\u2013466","author":"Gao Yanjie","year":"2023","unstructured":"Yanjie Gao, Xiaoxiang Shi, Haoxiang Lin, Hongyu Zhang, Hao Wu, Rui Li, and Mao Yang. 2023. An Empirical Study on Quality Issues of Deep Learning Platform. In 2023 IEEE\/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 455\u2013466."},{"key":"e_1_3_2_1_30_1","unstructured":"Daya Guo Qihao Zhu Dejian Yang Zhenda Xie Kai Dong Wentao Zhang Guanting Chen Xiao Bi Yu Wu YK Li et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming-The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024)."},{"key":"e_1_3_2_1_31_1","volume-title":"Proceedings of the Nineteenth European Conference on Computer Systems. 1110\u20131125","author":"Gupta Tanmaey","year":"2024","unstructured":"Tanmaey Gupta, Sanjeev Krishnan, Rituraj Kumar, Abhishek Vijeev, Bhargav Gulavani, Nipun Kwatra, Ramachandran Ramjee, and Muthian Sivathanu. 2024. Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures. In Proceedings of the Nineteenth European Conference on Computer Systems. 1110\u20131125."},{"key":"e_1_3_2_1_32_1","volume-title":"Proceedings of the 44th International Conference on Software Engineering. 1669\u20131680","author":"He Jingzhu","year":"2022","unstructured":"Jingzhu He, Yuhang Lin, Xiaohui Gu, Chin-Chia Michael Yeh, and Zhongfang Zhuang. 2022. Perfsig: extracting performance bug signatures via multi-modality causal analysis. In Proceedings of the 44th International Conference on Software Engineering. 1669\u20131680."},{"key":"e_1_3_2_1_33_1","volume-title":"2017 IEEE international conference on web services (ICWS). IEEE, 33\u201340","author":"He Pinjia","year":"2017","unstructured":"Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R Lyu. 2017. 
Drain: An online log parsing approach with fixed depth tree. In 2017 IEEE international conference on web services (ICWS). IEEE, 33\u201340."},{"key":"e_1_3_2_1_34_1","volume-title":"A survey on automated log analysis for reliability engineering. ACM computing surveys (CSUR) 54, 6","author":"He Shilin","year":"2021","unstructured":"Shilin He, Pinjia He, Zhuangbin Chen, Tianyi Yang, Yuxin Su, and Michael R Lyu. 2021. A survey on automated log analysis for reliability engineering. ACM computing surveys (CSUR) 54, 6 (2021), 1\u201337."},{"key":"e_1_3_2_1_35_1","volume-title":"Proceedings of the 50th Annual International Symposium on Computer Architecture. 1\u201316","author":"He Yi","year":"2023","unstructured":"Yi He, Mike Hutton, Steven Chan, Robert De Gruijl, Rama Govindaraju, Nishant Patil, and Yanjing Li. 2023. Understanding and mitigating hardware failures in deep learning training systems. In Proceedings of the 50th Annual International Symposium on Computer Architecture. 1\u201316."},{"key":"e_1_3_2_1_36_1","volume-title":"Proceedings of the 32nd ACM international conference on information and knowledge management. 720\u2013730","author":"He Zhankui","year":"2023","unstructured":"Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, and Julian McAuley. 2023. Large language models as zero-shot conversational recommenders. In Proceedings of the 32nd ACM international conference on information and knowledge management. 720\u2013730."},{"key":"e_1_3_2_1_37_1","volume-title":"21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)","author":"Hu Qinghao","year":"2024","unstructured":"Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, et al. 2024. Characterization of large language model development in the datacenter. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). 
709\u2013729."},{"key":"e_1_3_2_1_38_1","volume-title":"LUNAR: Unsupervised LLM-based Log Parsing. arXiv preprint arXiv:2406.07174","author":"Huang Junjie","year":"2024","unstructured":"Junjie Huang, Zhihan Jiang, Zhuangbin Chen, and Michael R Lyu. 2024. LUNAR: Unsupervised LLM-based Log Parsing. arXiv preprint arXiv:2406.07174 (2024)."},{"key":"e_1_3_2_1_39_1","volume-title":"2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 511\u2013522","author":"Huang Junjie","year":"2024","unstructured":"Junjie Huang, Zhihan Jiang, Jinyang Liu, Yintong Huo, Jiazhen Gu, Zhuangbin Chen, Cong Feng, Hui Dong, Zengyin Yang, and Michael R Lyu. 2024. Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis. In 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 511\u2013522."},{"key":"e_1_3_2_1_40_1","volume-title":"Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice. 392\u2013404","author":"Huang Junjie","year":"2024","unstructured":"Junjie Huang, Jinyang Liu, Zhuangbin Chen, Zhihan Jiang, Yichen Li, Jiazhen Gu, Cong Feng, Zengyin Yang, Yongqiang Yang, and Michael R Lyu. 2024. Faultprofit: Hierarchical fault profiling of incident tickets in large-scale cloud systems. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice. 392\u2013404."},{"key":"e_1_3_2_1_41_1","volume-title":"2019 USENIX Annual Technical Conference (USENIX ATC 19)","author":"Jeon Myeongjae","year":"2019","unstructured":"Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2019. Analysis of {Large-Scale}{Multi-Tenant}{GPU} clusters for {DNN} training workloads. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). 
947\u2013960."},{"key":"e_1_3_2_1_42_1","volume-title":"21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)","author":"Jiang Ziheng","year":"2024","unstructured":"Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al. 2024. {MegaScale}: Scaling Large Language Model Training to More Than 10,000 {GPUs}. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). 745\u2013760."},{"key":"e_1_3_2_1_43_1","volume-title":"Proceedings of the ACM on Software Engineering 1, FSE","author":"Jiang Zhihan","year":"2024","unstructured":"Zhihan Jiang, Jinyang Liu, Zhuangbin Chen, Yichen Li, Junjie Huang, Yintong Huo, Pinjia He, Jiazhen Gu, and Michael R Lyu. 2024. Lilac: Log parsing using llms with adaptive parsing cache. Proceedings of the ACM on Software Engineering 1, FSE (2024), 137\u2013160."},{"key":"e_1_3_2_1_44_1","volume-title":"Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis.","author":"Jiang Zhihan","year":"2024","unstructured":"Zhihan Jiang, Jinyang Liu, Junjie Huang, Yichen Li, Yintong Huo, Jiazhen Gu, Zhuangbin Chen, Jieming Zhu, and Michael R Lyu. 2024. A Large-Scale Evaluation for Log Parsing Techniques: How Far Are We?. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis."},{"key":"e_1_3_2_1_45_1","volume-title":"Scaling laws for neural language models. arXiv preprint arXiv:2001.08361","author":"Kaplan Jared","year":"2020","unstructured":"Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)."},{"key":"e_1_3_2_1_46_1","volume-title":"Revisiting Reliability in Large-Scale Machine Learning Research Clusters. 
arXiv preprint arXiv:2410.21680","author":"Kokolis Apostolos","year":"2024","unstructured":"Apostolos Kokolis, Michael Kuchnik, John Hoffman, Adithya Kumar, Parth Malani, Faye Ma, Zachary DeVito, Shubho Sengupta, Kalyan Saladi, and Carole-Jean Wu. 2024. Revisiting Reliability in Large-Scale Machine Learning Research Clusters. arXiv preprint arXiv:2410.21680 (2024)."},{"key":"e_1_3_2_1_47_1","volume-title":"Proceedings of the 44th international conference on software engineering. 1356\u20131367","author":"Le Van-Hoang","year":"2022","unstructured":"Van-Hoang Le and Hongyu Zhang. 2022. Log-based anomaly detection with deep learning: How far are we?. In Proceedings of the 44th international conference on software engineering. 1356\u20131367."},{"key":"e_1_3_2_1_48_1","volume-title":"2013 35th International Conference on Software Engineering (ICSE). IEEE, 963\u2013972","author":"Li Sihan","year":"2013","unstructured":"Sihan Li, Hucheng Zhou, Haoxiang Lin, Tian Xiao, Haibo Lin, Wei Lin, and Tao Xie. 2013. A characteristic study on failures of production distributed data-parallel programs. In 2013 35th International Conference on Software Engineering (ICSE). IEEE, 963\u2013972."},{"key":"e_1_3_2_1_49_1","volume-title":"Proceedings of the 31st International Symposium on Software Reliability Engineering. 92\u2013103","author":"Li Xiaoyun","year":"2020","unstructured":"Xiaoyun Li, Pengfei Chen, Linxiao Jing, Zilong He, and Guangba Yu. 2020. Swiss-Log: Robust and Unified Deep Learning Based Log Anomaly Detection for Diverse Faults. In Proceedings of the 31st International Symposium on Software Reliability Engineering. 92\u2013103."},{"key":"e_1_3_2_1_50_1","volume-title":"Exploring the effectiveness of llms in automated logging generation: An empirical study. arXiv preprint arXiv:2307.05950","author":"Li Yichen","year":"2023","unstructured":"Yichen Li, Yintong Huo, Zhihan Jiang, Renyi Zhong, Pinjia He, Yuxin Su, and Michael R Lyu. 2023. 
Exploring the effectiveness of llms in automated logging generation: An empirical study. arXiv preprint arXiv:2307.05950 (2023)."},{"key":"e_1_3_2_1_51_1","volume-title":"Go Static: Contextualized Logging Statement Generation. arXiv preprint arXiv:2402.12958","author":"Li Yichen","year":"2024","unstructured":"Yichen Li, Yintong Huo, Renyi Zhong, Zhihan Jiang, Jinyang Liu, Junjie Huang, Jiazhen Gu, Pinjia He, and Michael R Lyu. 2024. Go Static: Contextualized Logging Statement Generation. arXiv preprint arXiv:2402.12958 (2024)."},{"key":"e_1_3_2_1_52_1","volume-title":"COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge. In 2025 IEEE\/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 770\u2013770","author":"Li Yichen","year":"2025","unstructured":"Yichen Li, Yulun Wu, Jinyang Liu, Zhihan Jiang, Zhuangbin Chen, Guangba Yu, and Michael Lyu. 2025. COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge. In 2025 IEEE\/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 770\u2013770."},{"key":"e_1_3_2_1_53_1","volume-title":"2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 789\u2013801","author":"Liao Heng","year":"2021","unstructured":"Heng Liao, Jiajin Tu, Jing Xia, Hu Liu, Xiping Zhou, Honghui Yuan, and Yuxing Hu. 2021. Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 789\u2013801."},{"key":"e_1_3_2_1_54_1","first-page":"1","article-title":"Fast dimensional analysis for root cause investigation in a large-scale service environment","volume":"4","author":"Lin Fred","year":"2020","unstructured":"Fred Lin, Keyur Muzumdar, Nikolay Pavlovich Laptev, Mihai-Valentin Curelea, Seunghak Lee, and Sriram Sankar. 2020. 
Fast dimensional analysis for root cause investigation in a large-scale service environment. Proceedings of the ACM on Measurement and Analysis of Computing Systems 4, 2 (2020), 1\u201323.","journal-title":"Proceedings of the ACM on Measurement and Analysis of Computing Systems"},{"key":"e_1_3_2_1_55_1","volume-title":"Proceedings of the 38th International Conference on Software Engineering Companion. 102\u2013111","author":"Lin Qingwei","year":"2016","unstructured":"Qingwei Lin, Hongyu Zhang, Jian-Guang Lou, Yu Zhang, and Xuewei Chen. 2016. Log clustering based problem identification for online service systems. In Proceedings of the 38th International Conference on Software Engineering Companion. 102\u2013111."},{"key":"e_1_3_2_1_56_1","volume-title":"Proceedings of the 38th International Conference on Software Engineering Companion. 102\u2013111","author":"Lin Qingwei","year":"2016","unstructured":"Qingwei Lin, Hongyu Zhang, Jian-Guang Lou, Yu Zhang, and Xuewei Chen. 2016. Log clustering based problem identification for online service systems. In Proceedings of the 38th International Conference on Software Engineering Companion. 102\u2013111."},{"key":"e_1_3_2_1_57_1","volume-title":"Kai Ming Ting, and Zhi-Hua Zhou","author":"Liu Fei Tony","year":"2008","unstructured":"Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008. Isolation forest. In 2008 eighth ieee international conference on data mining. IEEE, 413\u2013422."},{"key":"e_1_3_2_1_58_1","volume-title":"Proceedings of the Workshop on Hot Topics in Operating Systems. 155\u2013162","author":"Liu Haopeng","year":"2019","unstructured":"Haopeng Liu, Shan Lu, Madan Musuvathi, and Suman Nath. 2019. What bugs cause production cloud incidents?. In Proceedings of the Workshop on Hot Topics in Operating Systems. 155\u2013162."},{"key":"e_1_3_2_1_59_1","volume-title":"Scalable and adaptive log-based anomaly detection with expert in the loop. 
arXiv preprint arXiv:2306.05032","author":"Liu Jinyang","year":"2023","unstructured":"Jinyang Liu, Junjie Huang, Yintong Huo, Zhihan Jiang, Jiazhen Gu, Zhuangbin Chen, Cong Feng, Minzhi Yan, and Michael R Lyu. 2023. Scalable and adaptive log-based anomaly detection with expert in the loop. arXiv preprint arXiv:2306.05032 (2023)."},{"key":"e_1_3_2_1_60_1","volume-title":"2023 38th IEEE\/ACM International Conference on Automated Software Engineering (ASE). IEEE, 268\u2013280","author":"Liu Jinyang","year":"2023","unstructured":"Jinyang Liu, Zhihan Jiang, Jiazhen Gu, Junjie Huang, Zhuangbin Chen, Cong Feng, Zengyin Yang, Yongqiang Yang, and Michael R Lyu. 2023. Prism: Revealing hidden functional clusters from massive instances in cloud systems. In 2023 38th IEEE\/ACM International Conference on Automated Software Engineering (ASE). IEEE, 268\u2013280."},{"key":"e_1_3_2_1_61_1","volume-title":"2010 USENIX Annual Technical Conference (USENIX ATC 10)","author":"Lou Jian-Guang","year":"2010","unstructured":"Jian-Guang Lou, Qiang Fu, Shenqi Yang, Ye Xu, and Jiang Li. 2010. Mining invariants from console logs for system problem detection. In 2010 USENIX Annual Technical Conference (USENIX ATC 10)."},{"key":"e_1_3_2_1_62_1","first-page":"4739","article-title":"Loganomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs","volume":"19","author":"Meng Weibin","year":"2019","unstructured":"Weibin Meng, Ying Liu, Yichen Zhu, Shenglin Zhang, Dan Pei, Yuqing Liu, Yihao Chen, Ruizhi Zhang, Shimin Tao, Pei Sun, et al. 2019. Loganomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs.. In IJCAI, Vol. 19. 4739\u20134745.","journal-title":"IJCAI"},{"key":"e_1_3_2_1_63_1","volume-title":"Fine-Grained DNN Checkpointing. In 19th USENIX Conference on File and Storage Technologies (FAST 21)","author":"Mohan Jayashree","year":"2021","unstructured":"Jayashree Mohan, Amar Phanishayee, and Vijay Chidambaram. 2021. 
CheckFreq: Frequent, Fine-Grained DNN Checkpointing. In 19th USENIX Conference on File and Storage Technologies (FAST 21). 203\u2013216."},{"key":"e_1_3_2_1_64_1","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-540-74048-3","volume-title":"Dynamic time warping. Information retrieval for music and motion","author":"M\u00fcller Meinard","year":"2007","unstructured":"Meinard M\u00fcller. 2007. Dynamic time warping. Information retrieval for music and motion (2007), 69\u201384."},{"key":"e_1_3_2_1_65_1","volume-title":"Swee Liang Wong, and Yidi Yuan","author":"Pan Jonathan","year":"2023","unstructured":"Jonathan Pan, Swee Liang Wong, and Yidi Yuan. 2023. RAGLog: Log Anomaly Detection using Retrieval Augmented Generation. arXiv preprint arXiv:2311.05261 (2023)."},{"key":"e_1_3_2_1_66_1","doi-asserted-by":"crossref","first-page":"88","DOI":"10.1080\/00031305.1994.10476030","article-title":"The three sigma rule","volume":"48","author":"Pukelsheim Friedrich","year":"1994","unstructured":"Friedrich Pukelsheim. 1994. The three sigma rule. The American Statistician 48, 2 (1994), 88\u201391.","journal-title":"The American Statistician"},{"key":"e_1_3_2_1_67_1","volume-title":"Proceedings of the 14th ACM\/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 1\u201312","author":"Rosenberg Carl Martin","year":"2020","unstructured":"Carl Martin Rosenberg and Leon Moonen. 2020. Spectrum-based log diagnosis. In Proceedings of the 14th ACM\/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 1\u201312."},{"key":"e_1_3_2_1_68_1","volume-title":"Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2135\u20132135","author":"Seide Frank","year":"2016","unstructured":"Frank Seide and Amit Agarwal. 2016. CNTK: Microsoft's open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 
2135\u20132135."},{"key":"e_1_3_2_1_69_1","volume-title":"Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)."},{"key":"e_1_3_2_1_70_1","volume-title":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). IEEE, 331\u2013342","author":"Tiwari Devesh","year":"2015","unstructured":"Devesh Tiwari, Saurabh Gupta, James Rogers, Don Maxwell, Paolo Rech, Sudharshan Vazhkudai, Daniel Oliveira, Dave Londo, Nathan DeBardeleben, Philippe Navaux, et al. 2015. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). IEEE, 331\u2013342."},{"key":"e_1_3_2_1_71_1","volume-title":"Document-level machine translation with large language models. arXiv preprint arXiv:2304.02210","author":"Wang Longyue","year":"2023","unstructured":"Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, and Zhaopeng Tu. 2023. Document-level machine translation with large language models. arXiv preprint arXiv:2304.02210 (2023)."},{"key":"e_1_3_2_1_72_1","volume-title":"Proceedings of the 44th International Conference on Software Engineering. 363\u2013375","author":"Wang Tao","year":"2022","unstructured":"Tao Wang, Qingxin Xu, Xiaoning Chang, Wensheng Dou, Jiaxin Zhu, Jinhui Xie, Yuetang Deng, Jianbo Yang, Jiaheng Yang, Jun Wei, et al. 2022. Characterizing and detecting bugs in WeChat mini-programs. In Proceedings of the 44th International Conference on Software Engineering. 
363\u2013375."},{"key":"e_1_3_2_1_73_1","volume-title":"Proceedings of the 29th Symposium on Operating Systems Principles. 364\u2013381","author":"Wang Zhuang","year":"2023","unstructured":"Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, TS Eugene Ng, and Yida Wang. 2023. Gemini: Fast failure recovery in distributed training with in-memory checkpoints. In Proceedings of the 29th Symposium on Operating Systems Principles. 364\u2013381."},{"key":"e_1_3_2_1_74_1","volume-title":"RedPajama: an Open Dataset for Training Large Language Models. NeurIPS Datasets and Benchmarks Track","author":"Weber Maurice","year":"2024","unstructured":"Maurice Weber, Daniel Y. Fu, Quentin Anthony, Yonatan Oren, Shane Adams, Anton Alexandrov, Xiaozhong Lyu, Huu Nguyen, Xiaozhe Yao, Virginia Adams, Ben Athiwaratkun, Rahul Chalamala, Kezhen Chen, Max Ryabinin, Tri Dao, Percy Liang, Christopher R\u00e9, Irina Rish, and Ce Zhang. 2024. RedPajama: an Open Dataset for Training Large Language Models. NeurIPS Datasets and Benchmarks Track (2024)."},{"key":"e_1_3_2_1_75_1","volume-title":"Transom: An efficient fault-tolerant system for training llms. arXiv preprint arXiv:2310.10046","author":"Wu Baodong","year":"2023","unstructured":"Baodong Wu, Lei Xia, Qingping Li, Kangyu Li, Xu Chen, Yongqiang Guo, Tieyao Xiang, Yuheng Chen, and Shigang Li. 2023. Transom: An efficient fault-tolerant system for training llms. arXiv preprint arXiv:2310.10046 (2023)."},{"key":"e_1_3_2_1_76_1","volume-title":"Proceedings of SOSP","volume":"9","author":"Xu Wei","year":"2009","unstructured":"Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael Jordan. 2009. Largescale system problem detection by mining console logs. In Proceedings of SOSP, Vol. 9. Citeseer, 1\u201317."},{"key":"e_1_3_2_1_77_1","volume-title":"Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (SOSP). 
117\u2013132","author":"Xu Wei","year":"2009","unstructured":"Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael I Jordan. 2009. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (SOSP). 117\u2013132."},{"key":"e_1_3_2_1_78_1","volume-title":"Harnessing the power of llms in practice: A survey on chatgpt and beyond. ACM Transactions on Knowledge Discovery from Data 18, 6","author":"Yang Jingfeng","year":"2024","unstructured":"Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Shaochen Zhong, Bing Yin, and Xia Hu. 2024. Harnessing the power of llms in practice: A survey on chatgpt and beyond. ACM Transactions on Knowledge Discovery from Data 18, 6 (2024), 1\u201332."},{"key":"e_1_3_2_1_79_1","volume-title":"A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems. arXiv preprint arXiv:2402.18013","author":"Yi Zihao","year":"2024","unstructured":"Zihao Yi, Jiarui Ouyang, Yuwen Liu, Tianhao Liao, Zhe Xu, and Ying Shen. 2024. A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems. arXiv preprint arXiv:2402.18013 (2024)."},{"key":"e_1_3_2_1_80_1","volume-title":"Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 553\u2013565","author":"Yu Guangba","year":"2023","unstructured":"Guangba Yu, Pengfei Chen, Yufeng Li, Hongyang Chen, Xiaoyun Li, and Zibin Zheng. 2023. Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 553\u2013565."},{"key":"e_1_3_2_1_81_1","unstructured":"Guangba Yu Gou Tan Haojia Huang Zhenyu Zhang Pengfei Chen Roberto Natella and Zibin Zheng. 2024. A Survey on Failure Analysis and Fault Injection in AI Systems. 
arXiv:2407.00125 [cs.SE] https:\/\/arxiv.org\/abs\/2407.00125"},{"key":"e_1_3_2_1_82_1","volume-title":"10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12)","author":"Yuan Ding","year":"2012","unstructured":"Ding Yuan, Soyeon Park, Peng Huang, Yang Liu, Michael M Lee, Xiaoming Tang, Yuanyuan Zhou, and Stefan Savage. 2012. Be conservative: Enhancing failure diagnosis with proactive logging. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). 293\u2013306."},{"key":"e_1_3_2_1_83_1","doi-asserted-by":"crossref","first-page":"106234","DOI":"10.1016\/j.infsof.2019.106234","article-title":"How are distributed bugs diagnosed and fixed through system logs","volume":"119","author":"Yuan Wei","year":"2020","unstructured":"Wei Yuan, Shan Lu, Hailong Sun, and Xudong Liu. 2020. How are distributed bugs diagnosed and fixed through system logs? Information and Software Technology 119 (2020), 106234.","journal-title":"Information and Software Technology"},{"key":"e_1_3_2_1_84_1","volume-title":"Proceedings of the ACM\/IEEE 42nd international conference on software engineering. 1159\u20131170","author":"Zhang Ru","year":"2020","unstructured":"Ru Zhang, Wencong Xiao, Hongyu Zhang, Yu Liu, Haoxiang Lin, and Mao Yang. 2020. An empirical study on program failures of deep learning jobs. In Proceedings of the ACM\/IEEE 42nd international conference on software engineering. 1159\u20131170."},{"key":"e_1_3_2_1_85_1","volume-title":"Xi Victoria Lin, et al","author":"Zhang Susan","year":"2023","unstructured":"Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2023. Opt: Open pre-trained transformer language models, 2022. 
URL https:\/\/arxiv.org\/abs\/2205.01068 3 (2023), 19\u20130."},{"key":"e_1_3_2_1_86_1","volume-title":"Proceedings of the 2019 27th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. 807\u2013817","author":"Zhang Xu","year":"2019","unstructured":"Xu Zhang, Yong Xu, Qingwei Lin, Bo Qiao, Hongyu Zhang, Yingnong Dang, Chunyu Xie, Xinsheng Yang, Qian Cheng, Ze Li, et al. 2019. Robust log-based anomaly detection on unstable log data. In Proceedings of the 2019 27th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. 807\u2013817."},{"key":"e_1_3_2_1_87_1","volume-title":"Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1253\u20131263","author":"Zhang Xu","year":"2021","unstructured":"Xu Zhang, Yong Xu, Si Qin, Shilin He, Bo Qiao, Ze Li, Hongyu Zhang, Xukun Li, Yingnong Dang, Qingwei Lin, et al. 2021. Onion: identifying incident-indicating logs for cloud systems. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1253\u20131263."},{"key":"e_1_3_2_1_88_1","volume-title":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. 447\u2013449","author":"Zhong Yuchen","year":"2023","unstructured":"Yuchen Zhong, Guangming Sheng, Juncheng Liu, Jinhui Yuan, and Chuan Wu. 2023. Swift: Expedited Failure Recovery for Large-Scale DNN Training. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. 
447\u2013449."},{"key":"e_1_3_2_1_89_1","volume-title":"2015 IEEE\/ACM 37th IEEE International Conference on Software Engineering","volume":"2","author":"Zhou Hucheng","year":"2015","unstructured":"Hucheng Zhou, Jian-Guang Lou, Hongyu Zhang, Haibo Lin, Haoxiang Lin, and Tingting Qin. 2015. An empirical study on quality issues of production big data platform. In 2015 IEEE\/ACM 37th IEEE International Conference on Software Engineering, Vol. 2. IEEE, 17\u201326."},{"key":"e_1_3_2_1_90_1","volume-title":"HeMiRCA: Fine-Grained Root Cause Analysis for Microservices with Heterogeneous Data Sources. ACM Transactions on Software Engineering and Methodology","author":"Zhu Zhouruixing","year":"2024","unstructured":"Zhouruixing Zhu, Cheryl Lee, Xiaoying Tang, and Pinjia He. 2024. HeMiRCA: Fine-Grained Root Cause Analysis for Microservices with Heterogeneous Data Sources. ACM Transactions on Software Engineering and Methodology (2024)."}],"event":{"name":"FSE Companion '25: 33rd ACM International Conference on the Foundations of Software Engineering","location":"Clarion Hotel Trondheim Trondheim Norway","acronym":"FSE Companion '25","sponsor":["SIGSOFT ACM Special Interest Group on Software Engineering"]},"container-title":["Proceedings of the 33rd ACM International Conference on the Foundations of Software 
Engineering"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3696630.3728531","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,28]],"date-time":"2025-07-28T19:14:53Z","timestamp":1753730093000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3696630.3728531"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,23]]},"references-count":90,"alternative-id":["10.1145\/3696630.3728531","10.1145\/3696630"],"URL":"https:\/\/doi.org\/10.1145\/3696630.3728531","relation":{},"subject":[],"published":{"date-parts":[[2025,6,23]]},"assertion":[{"value":"2025-07-28","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}