{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T23:08:39Z","timestamp":1768345719065,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":53,"publisher":"ACM","funder":[{"DOI":"10.13039\/100000001","name":"NSF (National Science Foundation)","doi-asserted-by":"publisher","award":["2241818"],"award-info":[{"award-number":["2241818"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"name":"NGI Enrichers"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,11,19]]},"DOI":"10.1145\/3772052.3772225","type":"proceedings-article","created":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T16:19:00Z","timestamp":1768321140000},"page":"196-208","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["THORN-ML: Transparent Hardware Offloaded Resilient Networks for RDMA based Distributed ML Workloads"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7355-2426","authenticated-orcid":false,"given":"Maziyar","family":"Nazari","sequence":"first","affiliation":[{"name":"University of Colorado Boulder, Boulder, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-3340-5316","authenticated-orcid":false,"given":"Daniel","family":"Noland","sequence":"additional","affiliation":[{"name":"Unaffiliated, Longmont, CO, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7317-1834","authenticated-orcid":false,"given":"Giulio","family":"Sidoretti","sequence":"additional","affiliation":[{"name":"University of Colorado Boulder, Boulder, CO, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5499-8871","authenticated-orcid":false,"given":"Erika","family":"Hunhoff","sequence":"additional","affiliation":[{"name":"University of Colorado Boulder, Boulder, CO, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9779-1838","authenticated-orcid":false,"given":"Tamara Silbergleit","family":"Lehman","sequence":"additional","affiliation":[{"name":"University of Colorado Boulder, Boulder, CO, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2556-9394","authenticated-orcid":false,"given":"Eric","family":"Keller","sequence":"additional","affiliation":[{"name":"University of Colorado Boulder, Boulder, CO, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,1,13]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2024. Dell Technical White Paper (H20003)-Generative AI in the Enterprise - Model Training (Network Architecture section). https:\/\/infohub.delltechnologies.com\/pt-br\/l\/technical-white-paper-generative-ai-in-the-enterprise-model-training\/network-architecture-155\/."},{"key":"e_1_3_2_1_2_1","unstructured":"2025. Accelerate. https:\/\/huggingface.co\/docs\/accelerate\/en\/index"},{"key":"e_1_3_2_1_3_1","unstructured":"2025. Bidirectional Forwarding Detection Commands on the Cisco IOS XR Software. https:\/\/www.cisco.com\/c\/en\/us\/td\/docs\/routers\/xr12000\/software\/xr12k_r4-1\/interfaces\/command\/reference\/interfaces_cr41xr12k_chapter2.html"},{"key":"e_1_3_2_1_4_1","unstructured":"2025. cifar10 | TensorFlow Datasets. https:\/\/www.tensorflow.org\/datasets\/catalog\/cifar10"},{"key":"e_1_3_2_1_5_1","unstructured":"2025. codeparrot\/codeparrot - Hugging Face. https:\/\/huggingface.co\/codeparrot\/codeparrot"},{"key":"e_1_3_2_1_6_1","unstructured":"2025. 
codeparrot\/codeparrot-clean - Datasets at Hugging Face. https:\/\/huggingface.co\/datasets\/codeparrot\/codeparrot-clean"},{"key":"e_1_3_2_1_7_1","unstructured":"2025. CoreWeave. https:\/\/www.coreweave.com\/."},{"key":"e_1_3_2_1_8_1","unstructured":"2025. Ethernet switch device driver model (switchdev) - The Linux Kernel documentation. https:\/\/docs.kernel.org\/networking\/switchdev.html"},{"key":"e_1_3_2_1_9_1","unstructured":"2025. FRRouting. https:\/\/frrouting.org\/"},{"key":"e_1_3_2_1_10_1","unstructured":"2025. GitHub CoPilot - Your AI pair programmer. https:\/\/github.com\/features\/copilot"},{"key":"e_1_3_2_1_11_1","unstructured":"2025. Gloo. https:\/\/github.com\/facebookincubator\/gloo."},{"key":"e_1_3_2_1_12_1","unstructured":"2025. GoBGP. https:\/\/osrg.github.io\/gobgp\/"},{"key":"e_1_3_2_1_13_1","unstructured":"2025. Intel oneAPI Collective Communications Library. https:\/\/www.intel.com\/content\/www\/us\/en\/developer\/tools\/oneapi\/oneccl.html."},{"key":"e_1_3_2_1_14_1","unstructured":"2025. MSTFLINT Package - Firmware Burning and Diagnostics Tools. https:\/\/github.com\/Mellanox\/mstflint"},{"key":"e_1_3_2_1_15_1","unstructured":"2025. netlink(7) \u2014 Linux manual page. https:\/\/man7.org\/linux\/man-pages\/man7\/netlink.7.html"},{"key":"e_1_3_2_1_16_1","unstructured":"2025. Network Function Representors - The Linux Kernel documentation. https:\/\/docs.kernel.org\/networking\/representors.html"},{"key":"e_1_3_2_1_17_1","unstructured":"2025. Network Lessons: Bidirectional Forwarding Detection (BFD). https:\/\/networklessons.com\/cisco\/ccie-routing-switching\/bidirectional-forwarding-detection-bfd"},{"key":"e_1_3_2_1_18_1","unstructured":"2025. NVIDIA GPUDirect. https:\/\/developer.nvidia.com\/gpudirect."},{"key":"e_1_3_2_1_19_1","unstructured":"2025. Ray Train. https:\/\/docs.ray.io\/en\/latest\/train\/train.html"},{"key":"e_1_3_2_1_20_1","unstructured":"2025. RCCL documentation. 
https:\/\/rocm.docs.amd.com\/projects\/rccl\/en\/latest\/."},{"key":"e_1_3_2_1_21_1","unstructured":"2025. rust-netlink. https:\/\/github.com\/rust-netlink"},{"key":"e_1_3_2_1_22_1","unstructured":"2025. Saving and loading a general checkpoint in PyTorch. https:\/\/pytorch.org\/tutorials\/recipes\/recipes\/saving_and_loading_a_general_checkpoint.html"},{"key":"e_1_3_2_1_23_1","first-page":"38","volume":"2","year":"2025","unstructured":"2025. Saving and Loading Checkpoints - Ray 2.38.0. \u201chttps:\/\/docs.ray.io\/en\/latest\/train\/user-guides\/checkpoints.html\u201d","journal-title":"Saving and Loading Checkpoints - Ray"},{"key":"e_1_3_2_1_24_1","unstructured":"2025. Single Root IO Virtualization (SR-IOV) - NVIDIA Docs. https:\/\/docs.nvidia.com\/doca\/sdk\/single+root+io+virtualization+(sr-iov)\/index.html"},{"key":"e_1_3_2_1_25_1","unstructured":"2025. TensorFlow v2.16.1 API: MonitoredTrainingSession. https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/compat\/v1\/train\/MonitoredTrainingSession."},{"key":"e_1_3_2_1_26_1","unstructured":"2025. tf.keras.applications.ResNet101 | TensorFlow v2.16.1. https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/keras\/applications\/ResNet101"},{"key":"e_1_3_2_1_27_1","unstructured":"2025. The BIRD Internet Routing Daemon. https:\/\/bird.network.cz\/"},{"key":"e_1_3_2_1_28_1","unstructured":"2025. Training Checkpoints | TensorFlow core. https:\/\/www.tensorflow.org\/guide\/checkpoint"},{"key":"e_1_3_2_1_29_1","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)."},{"key":"e_1_3_2_1_30_1","volume-title":"EVPN in the Data Center","author":"Dutt Dinesh G","unstructured":"Dinesh G Dutt. 2018. EVPN in the Data Center. O'Reilly Media, Inc."},{"key":"e_1_3_2_1_31_1","volume-title":"Azure Accelerated Networking: SmartNICs in the Public Cloud. 
In USENIX Symposium on Networked Systems Design and Implementation (NSDI).","author":"Daniel","unstructured":"Daniel Firestone et al. 2018. Azure Accelerated Networking: SmartNICs in the Public Cloud. In USENIX Symposium on Networked Systems Design and Implementation (NSDI)."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3651890.3672233"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/1592568.1592576"},{"key":"e_1_3_2_1_34_1","first-page":"1","article-title":"Distributed Snapshots: Determining Global States of a Distributed System","volume":"3","author":"Lamport Leslie","year":"1985","unstructured":"Leslie Lamport and K. Mani Chandy. 1985. Distributed Snapshots: Determining Global States of a Distributed System. ACM Transactions on Computer Systems 3, 1 (Feb. 1985).","journal-title":"ACM Transactions on Computer Systems"},{"key":"e_1_3_2_1_35_1","unstructured":"Kevin Lee Adi Gangidi and Mathew Oldham. 2024. Building Meta's GenAI Infrastructure. https:\/\/engineering.fb.com\/2024\/03\/12\/data-center-engineering\/building-metas-genai-infrastructure\/."},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3663408.3663411"},{"key":"e_1_3_2_1_37_1","volume-title":"15th USENIX symposium on networked systems design and implementation (NSDI 18)","author":"Lu Yuanwei","year":"2018","unstructured":"Yuanwei Lu, Guo Chen, Bojie Li, Kun Tan, Yongqiang Xiong, Peng Cheng, Jiansong Zhang, Enhong Chen, and Thomas Moscibroda. 2018. {Multi-Path} transport for {RDMA} in datacenters. In 15th USENIX symposium on networked systems design and implementation (NSDI 18). 357\u2013371."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"crossref","unstructured":"Mallik Mahalingam Dinesh G. Dutt Kenneth Duda Puneet Agarwal Lawrence Kreeger T. Sridhar Mike Bursell and Chris Wright. 2014. RFC 7348: Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks. 
https:\/\/datatracker.ietf.org\/doc\/html\/rfc7348","DOI":"10.17487\/rfc7348"},{"key":"e_1_3_2_1_39_1","unstructured":"NVIDIA. 2025. NCCL. https:\/\/github.com\/NVIDIA\/nccl"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544216.3544265"},{"key":"e_1_3_2_1_41_1","unstructured":"Pytorch. 2025. Pytorch Distributed Data Parallelism. https:\/\/pytorch.org\/tutorials\/intermediate\/ddp_tutorial.html"},{"key":"e_1_3_2_1_42_1","unstructured":"Robi Rahman David Owen and Josh You. 2024. Tracking Large-Scale AI Models. https:\/\/epochai.org\/blog\/tracking-large-scale-ai-models."},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"crossref","unstructured":"Renato J. Recio Bernard Metzler Paul R. Culley Jeff Hilland and Dave Garcia. 2007. A Remote Direct Memory Access Protocol Specification. Technical Report RFC 5040. Internet Engineering Task Force.","DOI":"10.17487\/rfc5040"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"crossref","unstructured":"Hemal Shah Felix Marti Wael Noureddine Asgeir Eiriksson and Robert Sharp. 2014. Remote Direct Memory Access RDMA Protocol Extensions. Technical Report RFC 7306. ISSN: 2070\u20131721. Internet Engineering Task Force.","DOI":"10.17487\/rfc7306"},{"key":"e_1_3_2_1_45_1","volume-title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRR abs\/1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRR abs\/1909.08053 (2019). arXiv:1909.08053 http:\/\/arxiv.org\/abs\/1909.08053"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3603269.3604849"},{"key":"e_1_3_2_1_47_1","unstructured":"Tensorflow. 2025. Tensorflow Multiworker Mirrored Strategy. 
https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/distribute\/MultiWorkerMirroredStrategy"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3482898.3483363"},{"key":"e_1_3_2_1_49_1","volume-title":"Vigraham and Benjamin Leonhardi","author":"Saranyan","year":"2024","unstructured":"Saranyan A. Vigraham and Benjamin Leonhardi. 2024. Maintaining large-scale AI capacity at Meta. https:\/\/engineering.fb.com\/2024\/06\/12\/production-engineering\/maintaining-large-scale-ai-capacity-meta\/."},{"key":"e_1_3_2_1_50_1","unstructured":"Pablo Villalobos and Anson Ho. 2022. Trends in Training Dataset Sizes. https:\/\/epochai.org\/blog\/trends-in-training-dataset-sizes."},{"key":"e_1_3_2_1_51_1","volume-title":"Ultima: Robust and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud. arXiv:2310.06993 [cs.DC] https:\/\/arxiv.org\/abs\/2310.06993","author":"Warraich Ertza","year":"2023","unstructured":"Ertza Warraich, Omer Shabtai, Khalid Manaa, Shay Vargaftik, Yonatan Piasetzky, Matty Kadosh, Lalith Suresh, and Muhammad Shahbaz. 2023. Ultima: Robust and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud. arXiv:2310.06993 [cs.DC] https:\/\/arxiv.org\/abs\/2310.06993"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3452296.3472897"},{"key":"e_1_3_2_1_53_1","volume-title":"USENIX Symposium on Networked Systems Design and Implementation (NSDI). 761\u2013774","author":"Zu Yazhou","year":"2024","unstructured":"Yazhou Zu, Alireza Ghaffarkhah, Hoang-Vu Dang, Brian Towles, Steven Hand, Safeen Huda, Adekunle Bello, Alexander Kolbasov, Arash Rezaei, Dayou Du, et al. 2024. Resiliency at Scale: Managing {Google's}{TPUv4} Machine Learning Supercomputer. In USENIX Symposium on Networked Systems Design and Implementation (NSDI). 
761\u2013774."}],"event":{"name":"SoCC '25: ACM Symposium on Cloud Computing","location":"Online USA","acronym":"SoCC '25","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems","SIGMOD ACM Special Interest Group on Management of Data"]},"container-title":["Proceedings of the 2025 ACM Symposium on Cloud Computing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3772052.3772225","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T16:21:42Z","timestamp":1768321302000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3772052.3772225"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,19]]},"references-count":53,"alternative-id":["10.1145\/3772052.3772225","10.1145\/3772052"],"URL":"https:\/\/doi.org\/10.1145\/3772052.3772225","relation":{},"subject":[],"published":{"date-parts":[[2025,11,19]]},"assertion":[{"value":"2026-01-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}