{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,25]],"date-time":"2026-04-25T07:35:14Z","timestamp":1777102514120,"version":"3.51.4"},"reference-count":110,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Comput. Syst."],"published-print":{"date-parts":[[2026,5,31]]},"abstract":"<jats:p>Reliability in cloud AI infrastructure is crucial for cloud service providers, prompting the widespread use of hardware redundancies. However, these redundancies can inadvertently lead to hidden degradation, known as \u201cgray failure\u201d, for AI workloads, significantly affecting end-to-end performance and concealing performance issues, which complicates root cause analysis for failures and regressions.<\/jats:p>\n                  <jats:p>We introduce SuperBench, a proactive validation system for AI infrastructure that mitigates hidden degradation caused by hardware redundancies and enhances overall reliability. SuperBench features a comprehensive benchmark suite, capable of evaluating individual hardware components and representing most real AI workloads. It comprises a Validator that learns benchmark criteria to pinpoint defective components clearly. Additionally, SuperBench incorporates a Selector to balance validation time and issue-related penalties, enabling optimal timing for validation execution with a tailored subset of benchmarks. Through testbed evaluation and simulation, we demonstrate that SuperBench can increase the mean time between incidents by up to 22.61\u00d7. 
SuperBench has been successfully deployed in Azure production, validating hundreds of thousands of GPUs every year.<\/jats:p>","DOI":"10.1145\/3767334","type":"journal-article","created":{"date-parts":[[2025,9,13]],"date-time":"2025-09-13T07:28:41Z","timestamp":1757748521000},"page":"1-38","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["SuperBench: A Proactive Validation System for Improving Reliability of Cloud AI Infrastructure"],"prefix":"10.1145","volume":"44","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9056-1386","authenticated-orcid":false,"given":"Yifan","family":"Xiong","sequence":"first","affiliation":[{"name":"Microsoft Research","place":["Vancouver, Canada"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-5032-6318","authenticated-orcid":false,"given":"Yuting","family":"Jiang","sequence":"additional","affiliation":[{"name":"Microsoft Research","place":["Beijing, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-0491-7082","authenticated-orcid":false,"given":"Ziyue","family":"Yang","sequence":"additional","affiliation":[{"name":"Microsoft Research","place":["Beijing, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-7770-573X","authenticated-orcid":false,"given":"Lei","family":"Qu","sequence":"additional","affiliation":[{"name":"Microsoft Research","place":["Vancouver, Canada"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-9955-4385","authenticated-orcid":false,"given":"Guoshuai","family":"Zhao","sequence":"additional","affiliation":[{"name":"Microsoft Corporation","place":["Redmond, United 
States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-6414-2862","authenticated-orcid":false,"given":"Shuguang","family":"Liu","sequence":"additional","affiliation":[{"name":"Microsoft Corporation","place":["Redmond, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7651-2059","authenticated-orcid":false,"given":"Dong","family":"Zhong","sequence":"additional","affiliation":[{"name":"Microsoft Corporation","place":["Redmond, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-1238-7446","authenticated-orcid":false,"given":"Boris","family":"Pinzur","sequence":"additional","affiliation":[{"name":"Microsoft Corporation","place":["Redmond, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5223-8446","authenticated-orcid":false,"given":"Jie","family":"Zhang","sequence":"additional","affiliation":[{"name":"Microsoft Corporation","place":["Redmond, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-9052-2279","authenticated-orcid":false,"given":"Yang","family":"Wang","sequence":"additional","affiliation":[{"name":"Microsoft Corporation","place":["Redmond, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9549-7918","authenticated-orcid":false,"given":"Jithin","family":"Jose","sequence":"additional","affiliation":[{"name":"Microsoft Corporation","place":["Redmond, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-2678-5384","authenticated-orcid":false,"given":"Hossein","family":"Pourreza","sequence":"additional","affiliation":[{"name":"Microsoft Corporation","place":["Redmond, United 
States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-5344-3593","authenticated-orcid":false,"given":"Jeff","family":"Baxter","sequence":"additional","affiliation":[{"name":"Microsoft Corporation","place":["Redmond, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1608-6040","authenticated-orcid":false,"given":"Kushal","family":"Datta","sequence":"additional","affiliation":[{"name":"Microsoft Corporation","place":["Redmond, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3281-5186","authenticated-orcid":false,"given":"Prabhat","family":"Ram","sequence":"additional","affiliation":[{"name":"Microsoft Corporation","place":["Redmond, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-9780-3658","authenticated-orcid":false,"given":"Luke","family":"Melton","sequence":"additional","affiliation":[{"name":"Microsoft Corporation","place":["Redmond, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-2669-3847","authenticated-orcid":false,"given":"Joe","family":"Chau","sequence":"additional","affiliation":[{"name":"Microsoft Corporation","place":["Redmond, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4014-4757","authenticated-orcid":false,"given":"Peng","family":"Cheng","sequence":"additional","affiliation":[{"name":"Microsoft Research","place":["Redmond, USA"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4175-0097","authenticated-orcid":false,"given":"Yongqiang","family":"Xiong","sequence":"additional","affiliation":[{"name":"Microsoft Research","place":["Beijing, 
China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7258-3116","authenticated-orcid":false,"given":"Lidong","family":"Zhou","sequence":"additional","affiliation":[{"name":"Microsoft Research","place":["Beijing, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,4,24]]},"reference":[{"key":"e_1_3_3_2_2","volume-title":"AMD Instinct MI200 Adopted for Large-Scale AI Training in Microsoft Azure","year":"2022","unstructured":"AMD. 2022. AMD Instinct MI200 Adopted for Large-Scale AI Training in Microsoft Azure. Retrieved Jan 31, 2025 from https:\/\/ir.amd.com\/news-events\/press-releases\/detail\/1072\/amd-instinct-mi200-adopted-for-large-scale-ai-training"},{"key":"e_1_3_3_3_2","volume-title":"Introducing AMD CDNA 2 Architecture","year":"2022","unstructured":"AMD. 2022. Introducing AMD CDNA 2 Architecture. Retrieved April 29, 2022 from https:\/\/www.amd.com\/system\/files\/documents\/amd-cdna2-white-paper.pdf"},{"key":"e_1_3_3_4_2","volume-title":"Next Generation BLAS Implementation for ROCm Platform","year":"2022","unstructured":"AMD. 2022. Next Generation BLAS Implementation for ROCm Platform. Retrieved April 29, 2022 from https:\/\/github.com\/ROCmSoftwarePlatform\/rocBLAS"},{"key":"e_1_3_3_5_2","volume-title":"RCCL Tests","year":"2022","unstructured":"AMD. 2022. RCCL Tests. Retrieved April 29, 2022 from https:\/\/github.com\/ROCmSoftwarePlatform\/rccl-tests"},{"key":"e_1_3_3_6_2","volume-title":"AMD Instinct Accelerator Claims","year":"2023","unstructured":"AMD. 2023. AMD Instinct Accelerator Claims. Retrieved Jan 15, 2024 from https:\/\/www.amd.com\/en\/claims\/instinct"},{"key":"e_1_3_3_7_2","volume-title":"Democratizing AI with PyTorch Foundation and ROCm Support for PyTorch","year":"2023","unstructured":"AMD. 2023. Democratizing AI with PyTorch Foundation and ROCm Support for PyTorch. 
Retrieved Jan 15, 2024 from https:\/\/pytorch.org\/blog\/democratizing-ai-with-pytorch\/"},{"key":"e_1_3_3_8_2","volume-title":"ROCm Releases","year":"2023","unstructured":"AMD. 2023. ROCm Releases. Retrieved April 7, 2023 from https:\/\/github.com\/RadeonOpenCompute\/ROCm\/releases"},{"key":"e_1_3_3_9_2","volume-title":"Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators","year":"2024","unstructured":"AMD. 2024. Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators. Retrieved June 25, 2025 from https:\/\/github.com\/ROCm\/composable_kernel"},{"key":"e_1_3_3_10_2","volume-title":"NVIDIA Hopper Architecture In-Depth","author":"Andersch Michael","year":"2022","unstructured":"Michael Andersch, Greg Palmer, Ronny Krashinsky, Nick Stam, Vishal Mehta, Gonzalo Brito, and Sridhar Ramaswamy. 2022. NVIDIA Hopper Architecture In-Depth. Retrieved April 29, 2022 from https:\/\/developer.nvidia.com\/blog\/nvidia-hopper-architecture-in-depth\/"},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/HOTOS.2001.990058"},{"key":"e_1_3_3_12_2","volume-title":"InfiniBand Architecture Specification Volume 1 Release 1.4","author":"Association InfiniBand Trade","year":"2020","unstructured":"InfiniBand Trade Association. 2020. InfiniBand Architecture Specification Volume 1 Release 1.4. Retrieved June 25, 2025 from https:\/\/www.infinibandta.org\/ibta-enhances-data-center-performance-and-management-with-new-infiniband-architecture-specification-releases\/"},{"key":"e_1_3_3_13_2","volume-title":"Get Started with EFA and NCCL for ML Workloads on Amazon EC2","author":"AWS Amazon","year":"2025","unstructured":"Amazon AWS. 2025. Get Started with EFA and NCCL for ML Workloads on Amazon EC2. 
Retrieved Jan 31, 2025 from https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/efa-start-nccl.html"},{"key":"e_1_3_3_14_2","volume-title":"Flexible I\/O Tester","author":"Axboe Jens","year":"2022","unstructured":"Jens Axboe. 2022. Flexible I\/O Tester. Retrieved April 29, 2022 from https:\/\/github.com\/axboe\/fio"},{"key":"e_1_3_3_15_2","volume-title":"Topology Files in Azure HPC\/AI VM Images","year":"2024","unstructured":"Azure. 2024. Topology Files in Azure HPC\/AI VM Images. Retrieved Jan 31, 2025 from https:\/\/github.com\/Azure\/azhpc-images\/tree\/ubuntu-hpc-20241023\/topology"},{"key":"e_1_3_3_16_2","volume-title":"Azure Announces General Availability of Scale-out NVIDIA A100 GPU Clusters: The Fastest Public Cloud Supercomputer","author":"Azure Microsoft","year":"2021","unstructured":"Microsoft Azure. 2021. Azure Announces General Availability of Scale-out NVIDIA A100 GPU Clusters: The Fastest Public Cloud Supercomputer. Retrieved January 31, 2025 from https:\/\/azure.microsoft.com\/en-us\/blog\/azure-announces-general-availability-of-scaleup-scaleout-nvidia-a100-gpu-instances-claims-title-of-fastest-public-cloud-super\/"},{"key":"e_1_3_3_17_2","volume-title":"HPC Images in Azure Marketplace","author":"Azure Microsoft","year":"2022","unstructured":"Microsoft Azure. 2022. HPC Images in Azure Marketplace. Retrieved April 29, 2022 from https:\/\/github.com\/Azure\/azhpc-images"},{"key":"e_1_3_3_18_2","volume-title":"Linux Virtual Machines Pricing","author":"Azure Microsoft","year":"2023","unstructured":"Microsoft Azure. 2023. Linux Virtual Machines Pricing. Retrieved April 7, 2023 from https:\/\/azure.microsoft.com\/en-us\/pricing\/details\/virtual-machines\/linux\/#pricing"},{"key":"e_1_3_3_19_2","volume-title":"NDasrA100_v4 Sizes Series","author":"Azure Microsoft","year":"2024","unstructured":"Microsoft Azure. 2024. NDasrA100_v4 Sizes Series. 
Retrieved Jan 31, 2025 from https:\/\/learn.microsoft.com\/en-us\/azure\/virtual-machines\/sizes\/gpu-accelerated\/ndasra100v4-series?tabs=sizebasic"},{"key":"e_1_3_3_20_2","volume-title":"ND_H100_v5 Sizes Series","author":"Azure Microsoft","year":"2024","unstructured":"Microsoft Azure. 2024. ND_H100_v5 Sizes Series. Retrieved Jan 31, 2025 from https:\/\/learn.microsoft.com\/en-us\/azure\/virtual-machines\/sizes\/gpu-accelerated\/ndh100v5-series?tabs=sizebasic"},{"key":"e_1_3_3_21_2","volume-title":"Reliability Engineering","author":"Birolini Alessandro","year":"2007","unstructured":"Alessandro Birolini. 2007. Reliability Engineering. Springer."},{"key":"e_1_3_3_22_2","unstructured":"Gil Bloch Diego Crupnicoff Michael Kagan Ido Bukspan Itamar Rabenstein Alon Webman and Amiad Marelli. 2013. High-performance adaptive routing. US Patent 8 576 715."},{"key":"e_1_3_3_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939699"},{"key":"e_1_3_3_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/342009.335388"},{"key":"e_1_3_3_25_2","first-page":"25","volume-title":"Proceedings of the 34th International Conference on Neural Information Processing Systems","author":"Brown Tom B.","year":"2020","unstructured":"Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada). 
Curran Associates Inc., Red Hook, NY, USA, 25 pages."},{"key":"e_1_3_3_26_2","volume-title":"Scale Generative AI with New Azure AI Infrastructure Advancements and Availability","author":"Chappell Nidhi","year":"2023","unstructured":"Nidhi Chappell and Eric Boyd. 2023. Scale Generative AI with New Azure AI Infrastructure Advancements and Availability. Retrieved January 15, 2025 from https:\/\/azure.microsoft.com\/en-us\/blog\/scale-generative-ai-with-new-azure-ai-infrastructure-advancements-and-availability\/"},{"key":"e_1_3_3_27_2","unstructured":"Mark Chen Jerry Tworek Heewoo Jun Qiming Yuan Henrique Ponde De Oliveira Pinto Jared Kaplan Harri Edwards Yuri Burda Nicholas Joseph Greg Brockman et\u00a0al. 2021. Evaluating Large Language Models Trained on Code. arxiv:2107.03374. Retrieved from https:\/\/arxiv.org\/abs\/2107.03374"},{"key":"e_1_3_3_28_2","doi-asserted-by":"publisher","DOI":"10.1002\/j.1538-7305.1953.tb01433.x"},{"key":"e_1_3_3_29_2","volume-title":"Cloud TPU Pricing","author":"Cloud Google","year":"2023","unstructured":"Google Cloud. 2023. Cloud TPU Pricing. Retrieved April 7, 2023 from https:\/\/cloud.google.com\/tpu\/pricing\/#pricing-components"},{"issue":"101","key":"e_1_3_3_30_2","first-page":"102","article-title":"Dawnbench: An end-to-end deep learning benchmark and competition","volume":"100","author":"Coleman Cody","year":"2017","unstructured":"Cody Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Chris R\u00e9, and Matei Zaharia. 2017. Dawnbench: An end-to-end deep learning benchmark and competition. Training 100, 101 (2017), 102.","journal-title":"Training"},{"key":"e_1_3_3_31_2","doi-asserted-by":"publisher","DOI":"10.1111\/j.2517-6161.1972.tb00899.x"},{"key":"e_1_3_3_32_2","doi-asserted-by":"publisher","DOI":"10.1007\/1-84628-168-7"},{"key":"e_1_3_3_33_2","unstructured":"Kaivalya M. Dixit. 1993. 
Overview of the SPEC Benchmarks."},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/2523616.2523627"},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.5555\/73950.73977"},{"key":"e_1_3_3_36_2","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et\u00a0al. 2024. The Llama 3 Herd of Models. arxiv:2407.21783. Retrieved from https:\/\/arxiv.org\/abs\/2407.21783"},{"key":"e_1_3_3_37_2","unstructured":"Freddy Gabbay Ido Bukshpan Alon Webman Miriam Menes George Elias and Noam Katz Abramovich. 2017. Packet switch with reduced latency. US Patent 9 641 465."},{"key":"e_1_3_3_38_2","first-page":"14","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"Gruver Nate","year":"2023","unstructured":"Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew Gordon Wilson. 2023. Large language models are zero-shot time series forecasters. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA). Curran Associates Inc., Red Hook, NY, USA, 14 pages."},{"key":"e_1_3_3_39_2","unstructured":"Yile Gu Yifan Xiong Jonathan Mace Yuting Jiang Yigong Hu Baris Kasikci and Peng Cheng. 2025. Argos: Agentic Time-Series Anomaly Detection with Autonomous Rule Generation via Large Language Models. arxiv:2501.14170. 
Retrieved from https:\/\/arxiv.org\/abs\/2501.14170"},{"key":"e_1_3_3_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/2670979.2670986"},{"key":"e_1_3_3_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3242086"},{"key":"e_1_3_3_42_2","doi-asserted-by":"publisher","DOI":"10.1002\/j.1538-7305.1950.tb00463.x"},{"key":"e_1_3_3_43_2","doi-asserted-by":"publisher","DOI":"10.2307\/2346830"},{"key":"e_1_3_3_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_3_46_2","volume-title":"Elastic Horovod","year":"2023","unstructured":"Horovod. 2023. Elastic Horovod. Retrieved April 7, 2023 from https:\/\/horovod.readthedocs.io\/en\/v0.27.0\/elastic.html"},{"key":"e_1_3_3_47_2","volume-title":"Modifying the Training Script with State Synchronization","year":"2023","unstructured":"Horovod. 2023. Modifying the Training Script with State Synchronization. Retrieved April 7, 2023 from https:\/\/horovod.readthedocs.io\/en\/stable\/elastic_include.html#modifying-the-training-script-with-state-synchronization"},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.243"},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3102980.3103005"},{"key":"e_1_3_3_50_2","volume-title":"Intel Memory Latency Checker","year":"2013","unstructured":"Intel. 2013. Intel Memory Latency Checker. Retrieved April 29, 2022 from https:\/\/www.intel.com\/content\/www\/us\/en\/developer\/articles\/tool\/intelr-memory-latency-checker.html"},{"key":"e_1_3_3_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/HCS55958.2022.9895480"},{"key":"e_1_3_3_52_2","volume-title":"MVAPICH2 at Azure: Enabling High Performance on Cloud","author":"Jose Jithin","year":"2022","unstructured":"Jithin Jose. 2022. MVAPICH2 at Azure: Enabling High Performance on Cloud. 
Retrieved April 7, 2023 from https:\/\/hibd.cse.ohio-state.edu\/static\/media\/talks\/slide\/Jithin-sc22-osu-bof.pdf"},{"key":"e_1_3_3_53_2","volume-title":"Google\u2019s Cloud TPU v4 Provides exaFLOPS-scale ML with Industry-leading Efficiency","author":"Jouppi Norm","year":"2023","unstructured":"Norm Jouppi and David Patterson. 2023. Google\u2019s Cloud TPU v4 Provides exaFLOPS-scale ML with Industry-leading Efficiency. Retrieved April 7, 2023 from https:\/\/cloud.google.com\/blog\/topics\/systems\/tpu-v4-enables-performance-energy-and-co2e-efficiency-gains"},{"key":"e_1_3_3_54_2","doi-asserted-by":"publisher","DOI":"10.1145\/3190508.3190546"},{"key":"e_1_3_3_55_2","first-page":"191","article-title":"On a problem in combinations","volume":"2","author":"Kirkman Thomas P.","year":"1847","unstructured":"Thomas P. Kirkman. 1847. On a problem in combinations. Cambridge and Dublin Mathematical Journal 2 (1847), 191\u2013204.","journal-title":"Cambridge and Dublin Mathematical Journal"},{"key":"e_1_3_3_56_2","doi-asserted-by":"publisher","DOI":"10.1007\/b97377"},{"key":"e_1_3_3_57_2","volume-title":"NVIDIA Ampere Architecture In-Depth","author":"Krashinsky Ronny","year":"2020","unstructured":"Ronny Krashinsky, Olivier Giroux, Stephen Jones, Nick Stam, and Sridhar Ramaswamy. 2020. NVIDIA Ampere Architecture In-Depth. Retrieved April 29, 2022 from https:\/\/developer.nvidia.com\/blog\/nvidia-ampere-architecture-in-depth\/"},{"issue":"129","key":"e_1_3_3_58_2","first-page":"1","article-title":"Time-to-event prediction with neural networks and cox regression","volume":"20","author":"Kvamme H\u00e5vard","year":"2019","unstructured":"H\u00e5vard Kvamme, \u00d8rnulf Borgan, and Ida Scheel. 2019. Time-to-event prediction with neural networks and cox regression. Journal of Machine Learning Research 20, 129 (2019), 1\u201330. 
Retrieved from http:\/\/jmlr.org\/papers\/v20\/18-424.html","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_3_59_2","volume-title":"Microsoft Announces New Supercomputer, Lays Out Vision for Future AI Work","author":"Langston Jennifer","year":"2020","unstructured":"Jennifer Langston. 2020. Microsoft Announces New Supercomputer, Lays Out Vision for Future AI Work. Retrieved April 29, 2022 from https:\/\/blogs.microsoft.com\/ai\/openai-azure-supercomputer\/"},{"key":"e_1_3_3_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1985.6312192"},{"key":"e_1_3_3_61_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581784.3607054"},{"key":"e_1_3_3_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3385187"},{"key":"e_1_3_3_63_2","doi-asserted-by":"publisher","DOI":"10.1145\/3236024.3236060"},{"key":"e_1_3_3_64_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2004.1302913"},{"key":"e_1_3_3_65_2","doi-asserted-by":"publisher","DOI":"10.5555\/3388242.3388284"},{"key":"e_1_3_3_66_2","doi-asserted-by":"publisher","DOI":"10.5555\/3291168.3291198"},{"key":"e_1_3_3_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2020.2974843"},{"key":"e_1_3_3_68_2","volume-title":"Reinventing Search with a New AI-powered Microsoft Bing and Edge, your Copilot for the Web","author":"Mehdi Yusuf","year":"2023","unstructured":"Yusuf Mehdi. 2023. Reinventing Search with a New AI-powered Microsoft Bing and Edge, your Copilot for the Web. Retrieved April 7, 2023 from https:\/\/blogs.microsoft.com\/blog\/2023\/02\/07\/reinventing-search-with-a-new-ai-powered-microsoft-bing-and-edge-your-copilot-for-the-web\/"},{"key":"e_1_3_3_69_2","volume-title":"Understanding Up\/Down InfiniBand Routing Algorithm","year":"2022","unstructured":"Mellanox. 2022. Understanding Up\/Down InfiniBand Routing Algorithm. 
Retrieved April 29, 2022 from https:\/\/mymellanox.force.com\/mellanoxcommunity\/s\/article\/understanding-up-down-infiniband-routing-algorithm"},{"key":"e_1_3_3_70_2","volume-title":"Add Micro-benchmark for Composable Kernel Gemm","year":"2024","unstructured":"Microsoft. 2024. Add Micro-benchmark for Composable Kernel Gemm. Retrieved June 25, 2025 from https:\/\/github.com\/microsoft\/superbenchmark\/commit\/0e86db8"},{"key":"e_1_3_3_71_2","volume-title":"NVIDIA A100 GPU Memory Error Management - Row Remapping","year":"2020","unstructured":"NVIDIA. 2020. NVIDIA A100 GPU Memory Error Management - Row Remapping. Retrieved Dec 6, 2022 from https:\/\/docs.nvidia.com\/deploy\/a100-gpu-mem-error-mgmt\/index.html#row-mapping"},{"key":"e_1_3_3_72_2","volume-title":"The NVIDIA Container Image for PyTorch","year":"2020","unstructured":"NVIDIA. 2020. The NVIDIA Container Image for PyTorch. Retrieved April 29, 2022 from https:\/\/docs.nvidia.com\/deeplearning\/frameworks\/pytorch-release-notes\/rel_20-12.html#rel_20-12"},{"key":"e_1_3_3_73_2","volume-title":"NVIDIA\u2019s New Ampere Data Center GPU in Full Production","year":"2020","unstructured":"NVIDIA. 2020. NVIDIA\u2019s New Ampere Data Center GPU in Full Production. Retrieved January 31, 2025 from https:\/\/nvidianews.nvidia.com\/news\/nvidias-new-ampere-data-center-gpu-in-full-production"},{"key":"e_1_3_3_74_2","volume-title":"CUDA Templates for Linear Algebra Subroutines","year":"2022","unstructured":"NVIDIA. 2022. CUDA Templates for Linear Algebra Subroutines. Retrieved April 29, 2022 from https:\/\/github.com\/NVIDIA\/cutlass"},{"key":"e_1_3_3_75_2","volume-title":"NCCL Environment Variables","year":"2022","unstructured":"NVIDIA. 2022. NCCL Environment Variables. Retrieved Jan 31, 2025 from https:\/\/docs.nvidia.com\/deeplearning\/nccl\/archives\/nccl_21210\/user-guide\/docs\/env.html"},{"key":"e_1_3_3_76_2","volume-title":"NCCL Tests","year":"2022","unstructured":"NVIDIA. 2022. NCCL Tests. 
Retrieved April 29, 2022 from https:\/\/github.com\/NVIDIA\/nccl-tests"},{"key":"e_1_3_3_77_2","volume-title":"NVIDIA Hopper in Full Production","year":"2022","unstructured":"NVIDIA. 2022. NVIDIA Hopper in Full Production. Retrieved January 15, 2025 from https:\/\/nvidianews.nvidia.com\/news\/nvidia-hopper-in-full-production"},{"key":"e_1_3_3_78_2","volume-title":"NVIDIA Unified Fabric Manager (UFM)","year":"2022","unstructured":"NVIDIA. 2022. NVIDIA Unified Fabric Manager (UFM). Retrieved April 29, 2022 from https:\/\/www.nvidia.com\/en-us\/networking\/infiniband\/ufm\/"},{"key":"e_1_3_3_79_2","volume-title":"CUDA Toolkit Archive","year":"2023","unstructured":"NVIDIA. 2023. CUDA Toolkit Archive. Retrieved April 7, 2023 from https:\/\/developer.nvidia.com\/cuda-toolkit-archive"},{"key":"e_1_3_3_80_2","volume-title":"NVIDIA InfiniBand Adaptive Routing Technology","year":"2023","unstructured":"NVIDIA. 2023. NVIDIA InfiniBand Adaptive Routing Technology. Retrieved Jan 31, 2025 from https:\/\/resources.nvidia.com\/en-us-cloud-native-supercomputing-dpus-campaign\/infiniband-white-paper-adaptive-routing"},{"key":"e_1_3_3_81_2","volume-title":"GB200 NVL72","year":"2024","unstructured":"NVIDIA. 2024. GB200 NVL72. Retrieved January 31, 2025 from https:\/\/www.nvidia.com\/en-us\/data-center\/gb200-nvl72\/"},{"key":"e_1_3_3_82_2","volume-title":"Introducing ChatGPT","year":"2022","unstructured":"OpenAI. 2022. Introducing ChatGPT. Retrieved April 7, 2023 from https:\/\/openai.com\/blog\/chatgpt"},{"key":"e_1_3_3_83_2","unstructured":"OpenAI. 2023. GPT-4 Technical Report. arxiv:2303.08774. Retrieved from https:\/\/arxiv.org\/abs\/2303.08774"},{"key":"e_1_3_3_84_2","volume-title":"An Important Next Step on Our AI Journey","author":"Pichai Sundar","year":"2023","unstructured":"Sundar Pichai. 2023. An Important Next Step on Our AI Journey. 
Retrieved April 7, 2023 from https:\/\/blog.google\/technology\/ai\/bard-google-ai-search-updates\/"},{"key":"e_1_3_3_85_2","volume-title":"\u201cUncorrectable NVLink Error\u201d Making the Tests Fail","year":"2021","unstructured":"PyTorch. 2021. \u201cUncorrectable NVLink Error\u201d Making the Tests Fail. Retrieved January 31, 2025 from https:\/\/github.com\/pytorch\/pytorch\/issues\/58155"},{"key":"e_1_3_3_86_2","volume-title":"Rendezvous","year":"2023","unstructured":"PyTorch. 2023. Rendezvous. Retrieved April 7, 2023 from https:\/\/pytorch.org\/docs\/2.0\/elastic\/rendezvous.html"},{"key":"e_1_3_3_87_2","volume-title":"Torch Distributed Elastic","year":"2023","unstructured":"PyTorch. 2023. Torch Distributed Elastic. Retrieved April 7, 2023 from https:\/\/pytorch.org\/docs\/2.0\/elastic\/quickstart.html"},{"key":"e_1_3_3_88_2","volume-title":"Torchrun to Support NUMA Affiinity","year":"2023","unstructured":"PyTorch. 2023. Torchrun to Support NUMA Affiinity. Retrieved Jan 31, 2025 from https:\/\/github.com\/pytorch\/pytorch\/issues\/115305"},{"key":"e_1_3_3_89_2","first-page":"631","volume-title":"Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference","author":"Qiao Aurick","year":"2018","unstructured":"Aurick Qiao, Abutalib Aghayev, Weiren Yu, Haoyang Chen, Qirong Ho, Garth A. Gibson, and Eric P. Xing. 2018. Litz: Elastic framework for high-performance distributed machine learning. In Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference (Boston, MA, USA). USENIX Association, USA, 631\u2013643. Retrieved from https:\/\/www.usenix.org\/conference\/atc18\/presentation\/qiao"},{"key":"e_1_3_3_90_2","series-title":"Proceedings of Machine Learning Research","first-page":"5220","volume-title":"Proceedings of the 36th International Conference on Machine Learning","volume":"97","author":"Qiao Aurick","year":"2019","unstructured":"Aurick Qiao, Bryon Aragam, Bingjing Zhang, and Eric Xing. 2019. 
Fault tolerance in iterative-convergent machine learning. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97). PMLR, 5220\u20135230. Retrieved from https:\/\/proceedings.mlr.press\/v97\/qiao19a.html"},{"key":"e_1_3_3_91_2","volume-title":"InfiniBand Verbs Performance Tests","author":"RDMA Linux","year":"2022","unstructured":"Linux RDMA. 2022. InfiniBand Verbs Performance Tests. Retrieved April 29, 2022 from https:\/\/github.com\/linux-rdma\/perftest"},{"key":"e_1_3_3_92_2","volume-title":"Benchmarking Deep Learning Operations on Different Hardware","author":"Research Baidu","year":"2022","unstructured":"Baidu Research. 2022. Benchmarking Deep Learning Operations on Different Hardware. Retrieved April 29, 2022 from https:\/\/github.com\/baidu-research\/DeepBench"},{"key":"e_1_3_3_93_2","doi-asserted-by":"publisher","DOI":"10.1162\/089976601750264965"},{"key":"e_1_3_3_94_2","doi-asserted-by":"publisher","DOI":"10.25080\/Majora-92bf1922-011"},{"key":"e_1_3_3_95_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2023.3328832"},{"key":"e_1_3_3_96_2","volume-title":"Proceedings of the 3rd International Conference on Learning Representations.","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations. Computational and Biological Learning Society."},{"key":"e_1_3_3_97_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41404.2022.00070"},{"key":"e_1_3_3_98_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2022.3163122"},{"key":"e_1_3_3_99_2","volume-title":"Multi-GPU CUDA Stress Test","author":"Timonen Ville","year":"2017","unstructured":"Ville Timonen. 2017. Multi-GPU CUDA Stress Test. 
Retrieved Jan 31, 2025 from https:\/\/github.com\/wilicc\/gpu-burn"},{"key":"e_1_3_3_100_2","volume-title":"TOP10 System - Nov 2023","year":"2023","unstructured":"TOP500. 2023. TOP10 System - Nov 2023. Retrieved Jan 14, 2024 from https:\/\/www.top500.org\/lists\/top500\/2023\/11\/"},{"key":"e_1_3_3_101_2","volume-title":"NUMA Balancing","author":"Riel Rik van","year":"2009","unstructured":"Rik van Riel and Shen Feng. 2009. NUMA Balancing. Retrieved Jan 31, 2025 from https:\/\/docs.kernel.org\/admin-guide\/sysctl\/kernel.html#numa-balancing"},{"key":"e_1_3_3_102_2","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_3_3_103_2","doi-asserted-by":"publisher","DOI":"10.1145\/2523616.2523633"},{"key":"e_1_3_3_104_2","doi-asserted-by":"publisher","DOI":"10.1145\/2741948.2741964"},{"key":"e_1_3_3_105_2","volume-title":"Microsoft Built a Supercomputer to Power OpenAI\u2019s ChatGPT","author":"Waters Rob","year":"2023","unstructured":"Rob Waters. 2023. Microsoft Built a Supercomputer to Power OpenAI\u2019s ChatGPT. Retrieved April 7, 2023 from https:\/\/www.cybercareers.blog\/2023\/03\/microsoft-built-a-supercomputer-to-power-openais-chatgpt\/"},{"key":"e_1_3_3_106_2","volume-title":"CPU Performance Scaling","author":"Wysocki Rafael J.","year":"2017","unstructured":"Rafael J. Wysocki. 2017. CPU Performance Scaling. Retrieved Jan 31, 2025 from https:\/\/docs.kernel.org\/admin-guide\/pm\/cpufreq.html"},{"key":"e_1_3_3_107_2","first-page":"835","volume-title":"Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference","author":"Xiong Yifan","year":"2024","unstructured":"Yifan Xiong, Yuting Jiang, Ziyue Yang, Lei Qu, Guoshuai Zhao, Shuguang Liu, Dong Zhong, Boris Pinzur, Jie Zhang, Yang Wang, Jithin Jose, Hossein Pourreza, Jeff Baxter, Kushal Datta, Prabhat Ram, Luke Melton, Joe Chau, Peng Cheng, Yongqiang Xiong, and Lidong Zhou. 2024. SuperBench: Improving cloud AI infrastructure reliability with proactive validation. 
In Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference (Santa Clara, CA, USA). USENIX Association, USA, 835\u2013850. Retrieved from https:\/\/www.usenix.org\/conference\/atc24\/presentation\/xiong"},{"key":"e_1_3_3_108_2","doi-asserted-by":"publisher","DOI":"10.5555\/3277355.3277402"},{"key":"e_1_3_3_109_2","doi-asserted-by":"publisher","DOI":"10.1145\/1254882.1254922"},{"key":"e_1_3_3_110_2","volume-title":"OPT-175B Logbook","author":"Zhang Susan","year":"2022","unstructured":"Susan Zhang. 2022. OPT-175B Logbook. Retrieved May 19, 2022 from https:\/\/github.com\/facebookresearch\/metaseq\/blob\/main\/projects\/OPT\/chronicles\/OPT175B_Logbook.pdf"},{"key":"e_1_3_3_111_2","unstructured":"Susan Zhang Stephen Roller Naman Goyal Mikel Artetxe Moya Chen Shuohui Chen Christopher Dewan Mona Diab Xian Li Xi Victoria Lin Todor Mihaylov Myle Ott Sam Shleifer Kurt Shuster Daniel Simig Punit Singh Koura Anjali Sridhar Tianlu Wang and Luke Zettlemoyer. 2022. OPT: Open Pre-trained Transformer Language Models. arxiv:2205.01068. 
Retrieved from https:\/\/arxiv.org\/abs\/2205.01068"}],"container-title":["ACM Transactions on Computer Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3767334","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,25]],"date-time":"2026-04-25T06:39:23Z","timestamp":1777099163000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3767334"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,4,24]]},"references-count":110,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,5,31]]}},"alternative-id":["10.1145\/3767334"],"URL":"https:\/\/doi.org\/10.1145\/3767334","relation":{},"ISSN":["0734-2071","1557-7333"],"issn-type":[{"value":"0734-2071","type":"print"},{"value":"1557-7333","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,4,24]]},"assertion":[{"value":"2025-02-22","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-08-15","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-04-24","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}