{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T01:42:53Z","timestamp":1773193373555,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":85,"publisher":"ACM","funder":[{"DOI":"10.13039\/100017223","name":"National Energy Research Scientific Computing Center","doi-asserted-by":"publisher","award":["NERSC DDR-ERCAP0032726"],"award-info":[{"award-number":["NERSC DDR-ERCAP0032726"]}],"id":[{"id":"10.13039\/100017223","id-type":"DOI","asserted-by":"publisher"}]},{"name":"ACE, one of the seven centers sponsored by the Semiconductor Research Corporation (SRC) and DARPA under the Joint University Microelectronics Program 2.0"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,11,17]]},"DOI":"10.1145\/3772356.3772414","type":"proceedings-article","created":{"date-parts":[[2025,11,17]],"date-time":"2025-11-17T12:02:48Z","timestamp":1763380968000},"page":"149-159","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Photonic Rails in ML Datacenters"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-0232-9821","authenticated-orcid":false,"given":"Eric","family":"Ding","sequence":"first","affiliation":[{"name":"Cornell University, Ithaca, New York, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-7137-498X","authenticated-orcid":false,"given":"Chuhan","family":"Ouyang","sequence":"additional","affiliation":[{"name":"Cornell University, Ithaca, New York, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8118-3026","authenticated-orcid":false,"given":"Rachee","family":"Singh","sequence":"additional","affiliation":[{"name":"Cornell University, Ithaca, New York, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,11,17]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)","author":"Agarwal Saksham","year":"2024","unstructured":"Saksham Agarwal, Qizhe Cai, Rachit Agarwal, David Shmoys, and Amin Vahdat. 2024. Harmony: A congestion-free datacenter architecture. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). 329\u2013343."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3651890.3672248"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3387514.3406221"},{"key":"e_1_3_2_1_4_1","volume-title":"Proceedings of the ACM SIGCOMM 2025 Conference. 234\u2013247","author":"Benyahya Kaoutar","year":"2025","unstructured":"Kaoutar Benyahya, Ariel Gomez Diaz, Junyi Liu, Vassily Lyutsarev, Marianna Pantouvaki, Kai Shi, Shawn Yohanes Siew, Hitesh Ballani, Thomas Burridge, Daniel Cletheroe, et al. 2025. Mosaic: Breaking the Optics versus Copper Trade-off with a Wide-and-Slow Architecture and MicroLEDs. In Proceedings of the ACM SIGCOMM 2025 Conference. 234\u2013247."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2620728.2620744"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.34"},{"key":"e_1_3_2_1_7_1","unstructured":"Broadcom Inc. 2025. BCM78909 51.2-Tb\/s Multilayer Co-Packaged Optics Switch. Online; accessed July 5 2025. https:\/\/www.broadcom.com\/products\/fiber-optic-modules-components\/co-packaged-optics\/switches\/bcm78909 A high-radix high-bandwidth CPO switch supporting up to 64x800GbE or 128x400GbE.."},{"key":"e_1_3_2_1_8_1","unstructured":"Broadcom Inc. 2025. Co-Packaged Optics (CPO). https:\/\/www.broadcom.com\/info\/optics\/cpo. Accessed: 2025-07-03."},{"key":"e_1_3_2_1_9_1","volume-title":"Optical Systems Division","author":"Broadcom Inc.","year":"2021","unstructured":"Broadcom Inc. Optical Systems Division. 2021. SiPh Chiplets In Package (SCIP). Technical Report. Broadcom Inc., Irvine, CA, USA. https:\/\/docs.broadcom.com\/doc\/siph-chiplets-in-package-scip OSD CPO SCIP_20211106 V5."},{"key":"e_1_3_2_1_10_1","unstructured":"CALIENT Technologies Inc. 2022. Calient's Optical Circuit Switch (S-Series) Datasheet. https:\/\/www.calient.net\/wp-content\/uploads\/2022\/06\/Datasheet_Calients-Optical-Circuit-Switches.pdf. Accessed: 2025-07-03."},{"key":"e_1_3_2_1_11_1","volume-title":"14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17)","author":"Chen Li","year":"2017","unstructured":"Li Chen, Kai Chen, Zhonghua Zhu, Minlan Yu, George Porter, Chunming Qiao, and Shan Zhong. 2017. Enabling {Wide-Spread} Communications on Optical Fabric with {MegaSwitch}. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 577\u2013593."},{"key":"e_1_3_2_1_12_1","volume-title":"Proceedings of the 52nd Annual International Symposium on Computer Architecture. 1703\u20131716","author":"Chu Weiwei","year":"2025","unstructured":"Weiwei Chu, Xinfeng Xie, Jiecao Yu, Jie Wang, Amar Phanishayee, Chunqiang Tang, Yuchen Hao, Jianyu Huang, Mustafa Ozdal, Jun Wang, et al. 2025. Scaling Llama 3 Training with Efficient Parallelism Strategies. In Proceedings of the 52nd Annual International Symposium on Computer Architecture. 1703\u20131716."},{"key":"e_1_3_2_1_13_1","volume-title":"Optical Circuit Switch (OCS). https:\/\/www.coherent.com\/networking\/optical-circuit-switch. Accessed: 2025-07-10","author":"Coherent Corp. 2025.","year":"2024","unstructured":"Coherent Corp. 2025. Optical Circuit Switch (OCS). https:\/\/www.coherent.com\/networking\/optical-circuit-switch. Accessed: 2025-07-10; Based on press release published March 25,2024; Coherent's liquid-crystal-based OCS architecture supports up to 300x300 ports and is optimized for AI\/ML data center fabrics."},{"key":"e_1_3_2_1_14_1","volume-title":"Products","author":"EpiPhotonics Corp. 2025.","unstructured":"EpiPhotonics Corp. 2025. Products. http:\/\/epiphotonics.com\/products.html. Accessed: 2025-07-03."},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1851182.1851223"},{"key":"e_1_3_2_1_16_1","first-page":"1","article-title":"Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity","volume":"23","author":"Fedus William","year":"2022","unstructured":"William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1\u201339.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_1_17_1","unstructured":"FS.COM. n.d.. Cisco Compatible 400GBASE-XDR4 QSFP-DD PAM4 1310nm 2km Module. https:\/\/www.fs.com\/products\/110530.html?attribute=94270&id=4477813. Accessed: 2025-07-02."},{"key":"e_1_3_2_1_18_1","unstructured":"FS.COM. n.d.. N9510-64D 64-Port Ethernet L3 Data Center Switch (Broadcom Tomahawk-4 64x400GbE). https:\/\/www.fs.com\/products\/149853.html. Accessed: 2025-07-02."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3694715.3695960"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3651890.3672233"},{"key":"e_1_3_2_1_21_1","volume-title":"Proceedings of the 23rd ACM Workshop on Hot Topics in Networks. 195\u2013204","author":"Gherghescu Alexandru M","year":"2024","unstructured":"Alexandru M Gherghescu, Vlad-Andrei B\u0103doiu, Alexandra Agache, Mihai-Valentin Dumitru, Iuliu Vasilescu, Radu Mantu, and Costin Raiciu. 2024. I've Got 99 Problems But FLOPS Ain't One. In Proceedings of the 23rd ACM Workshop on Hot Topics in Networks. 195\u2013204."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2934872.2934911"},{"key":"e_1_3_2_1_23_1","unstructured":"Aaron Grattafiori Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Alex Vaughan et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)."},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1592568.1592576"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1592568.1592577"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2619239.2626328"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3422604.3425945"},{"key":"e_1_3_2_1_28_1","unstructured":"Hewlett Packard Enterprise. 2021. HPE Cray EX Supercomputer Overview. https:\/\/www.hpe.com\/psnow\/doc\/a50002546enw. Accessed: 2025-07-09."},{"key":"e_1_3_2_1_29_1","volume-title":"2010 USENIX Annual Technical Conference (USENIX ATC 10)","author":"Hunt Patrick","year":"2010","unstructured":"Patrick Hunt, Mahadev Konar, Flavio P Junqueira, and Benjamin Reed. 2010. {ZooKeeper}: Wait-free coordination for internet-scale systems. In 2010 USENIX Annual Technical Conference (USENIX ATC 10)."},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613152"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589350"},{"key":"e_1_3_2_1_32_1","volume-title":"SiP-ML: High-Bandwidth Optical Network Interconnects for Machine Learning Training. In Proceedings of the 2021 ACM SIGCOMM 2021 Conference.","author":"Khani Mehrdad","unstructured":"Mehrdad Khani, Manya Ghobadi, Mohammad Alizadeh, Ziyi Zhu, Madeleine Glick, Keren Bergman, Amin Vahdat, Benjamin Klenk, and Eiman Ebrahimi. [n. d.]. SiP-ML: High-Bandwidth Optical Network Interconnects for Machine Learning Training. In Proceedings of the 2021 ACM SIGCOMM 2021 Conference."},{"key":"e_1_3_2_1_33_1","first-page":"341","article-title":"Reducing activation recomputation in large transformer models","volume":"5","author":"Korthikanti Vijay Anand","year":"2023","unstructured":"Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2023. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems 5 (2023), 341\u2013353.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3696348.3696856"},{"key":"e_1_3_2_1_35_1","volume-title":"LUMION: Fast Fault Recovery for ML Jobs Using Programmable Optical Fabrics. arXiv:2505.23105 [cs.LG] https:\/\/arxiv.org\/abs\/2505.23105","author":"Kumar Abhishek Vijaya","year":"2025","unstructured":"Abhishek Vijaya Kumar, Eric Ding, Arjun Devraj, Darius Bunandar, and Rachee Singh. 2025. LUMION: Fast Fault Recovery for ML Jobs Using Programmable Optical Fabrics. arXiv:2505.23105 [cs.LG] https:\/\/arxiv.org\/abs\/2505.23105"},{"key":"e_1_3_2_1_36_1","unstructured":"ChonLam Lao Minlan Yu Aditya Akella Jiamin Cao Yu Guan Pengcheng Zhang Zhilong Zheng Yichi Xu Ennan Zhai Dennis Cai et al. 2024. TrainMover: Efficient ML Training Live Migration with No Memory Overhead. arXiv e-prints (2024) arXiv-2412."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3651890.3672222"},{"key":"e_1_3_2_1_38_1","unstructured":"Wanchao Liang Tianyu Liu Less Wright Will Constable Andrew Gu Chien-Chin Huang Iris Zhang Wei Feng Howard Huang Junjie Wang et al. 2024. TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training. arXiv preprint arXiv:2410.06511 (2024)."},{"key":"e_1_3_2_1_39_1","volume-title":"Kin Fai Tse, et al","author":"Liao Xudong","year":"2025","unstructured":"Xudong Liao, Yijun Sun, Han Tian, Xinchen Wan, Yilun Jin, Zilong Wang, Zhenghang Ren, Xinyang Huang, Wenxue Li, Kin Fai Tse, et al. 2025. mFabric: An Efficient and Scalable Fabric for Mixture-of-Experts Training. arXiv preprint arXiv:2501.03905 (2025)."},{"key":"e_1_3_2_1_40_1","unstructured":"Lightmatter Inc. 2025. Passage Technology. https:\/\/lightmatter.co\/products\/passage\/. Accessed: 2025-07-03."},{"key":"e_1_3_2_1_41_1","unstructured":"Aixin Liu Bei Feng Bing Xue Bingxuan Wang Bochao Wu Chengda Lu Chenggang Zhao Chengqi Deng Chenyu Zhang Chong Ruan et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)."},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3603269.3604836"},{"key":"e_1_3_2_1_43_1","volume-title":"Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889","author":"Liu Hao","year":"2023","unstructured":"Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889 (2023)."},{"key":"e_1_3_2_1_44_1","unstructured":"Lumentum Holdings Inc. 2025. Lumentum Optical Circuit Switch to Improve Next-Generation AI Data Center Scalability. https:\/\/www.lumentum.com\/en\/media-room\/news-releases\/lumentum-optical-circuit-switch-improve-next-generation-ai-data-center. Accessed June 20 2025."},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3098822.3098838"},{"key":"e_1_3_2_1_46_1","unstructured":"Paulius Micikevicius Sharan Narang Jonah Alben Gregory Diamos Erich Elsen David Garcia Boris Ginsburg Michael Houston Oleksii Kuchaiev Ganesh Venkatesh et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740 (2017)."},{"key":"e_1_3_2_1_47_1","unstructured":"National Energy Research Scientific Computing Center (NERSC). 2025. Perlmutter Architecture \u2014 NERSC Documentation. https:\/\/docs.nersc.gov\/systems\/perlmutter\/architecture\/. Accessed: 2025-07-04."},{"key":"e_1_3_2_1_48_1","volume-title":"https:\/\/catalog.ngc.nvidia.com\/orgs\/nvidia\/teams\/dgxc-benchmarking\/resources\/llama31-405b-dgxc-benchmarking-a. Version 24.11.1, modified","author":"Benchmarking Recipe NVIDIA.","year":"2025","unstructured":"NVIDIA. 2025. Llama-3.1-405B DGXC Benchmarking Recipe. https:\/\/catalog.ngc.nvidia.com\/orgs\/nvidia\/teams\/dgxc-benchmarking\/resources\/llama31-405b-dgxc-benchmarking-a. Version 24.11.1, modified January 29, 2025."},{"key":"e_1_3_2_1_49_1","volume-title":"NVIDIA Collective Communication Library (NCCL): Creating a Communicator. NVIDIA. https:\/\/docs.nvidia.com\/deeplearning\/nccl\/user-guide\/docs\/usage\/communicators.html Accessed","author":"NVIDIA Corporation","year":"2025","unstructured":"NVIDIA Corporation. 2020. NVIDIA Collective Communication Library (NCCL): Creating a Communicator. NVIDIA. https:\/\/docs.nvidia.com\/deeplearning\/nccl\/user-guide\/docs\/usage\/communicators.html Accessed July 6, 2025."},{"key":"e_1_3_2_1_50_1","unstructured":"NVIDIA Corporation. 2022. Doubling all-to-all Performance with NCCL 2.12: Introducing PXN (PCI X NVLink). NVIDIA Developer Blog. https:\/\/developer.nvidia.com\/blog\/doubling-all2all-performance-with-nvidia-collective-communication-library-2-12\/ Describes PXN which enables GPU-to-NIC communication via NVLink to optimize rail-aligned collective performance."},{"key":"e_1_3_2_1_51_1","unstructured":"NVIDIA Corporation. 2024. ConnectX-7 400G Adapters Datasheet. https:\/\/resources.nvidia.com\/en-us-accelerated-networking-resource-library\/connectx-7-datasheet. Accessed: 2025-07-02."},{"key":"e_1_3_2_1_52_1","volume-title":"Co-Packaged Silicon Photonics Networking Switches. Online","author":"NVIDIA Corporation","year":"2025","unstructured":"NVIDIA Corporation. 2025. Co-Packaged Silicon Photonics Networking Switches. Online; accessed July 5,2025. https:\/\/www.nvidia.com\/en-us\/networking\/products\/silicon-photonics\/ Describes NVIDIA's co-packaged optics (CPO) switches with integrated silicon photonics."},{"key":"e_1_3_2_1_53_1","volume-title":"Co-Packaged Optics Networking Switches to Scale AI Factories to Millions of GPUs","author":"NVIDIA Corporation","unstructured":"NVIDIA Corporation. 2025. NVIDIA Announces Spectrum-X Photonics, Co-Packaged Optics Networking Switches to Scale AI Factories to Millions of GPUs. Press Release. NVIDIA Corporation, Santa Clara, CA, USA. https:\/\/nvidianews.nvidia.com\/news\/nvidia-spectrum-x-co-packaged-optics-networking-switches-ai-factories Unveiled at GTC 2025."},{"key":"e_1_3_2_1_54_1","volume-title":"NVIDIA Collective Communications Library (NCCL). NVIDIA Developer. https:\/\/developer.nvidia.com\/nccl Version 2.x","author":"NVIDIA Corporation","unstructured":"NVIDIA Corporation. 2025. NVIDIA Collective Communications Library (NCCL). NVIDIA Developer. https:\/\/developer.nvidia.com\/nccl Version 2.x; MPI-compatible multi-GPU \/ multi-node collective communication library."},{"key":"e_1_3_2_1_55_1","volume-title":"NVIDIA DGX H200 Datasheet. Datasheet","author":"NVIDIA Corporation","unstructured":"NVIDIA Corporation. 2025. NVIDIA DGX H200 Datasheet. Datasheet. NVIDIA Corporation, Santa Clara, CA. https:\/\/resources.nvidia.com\/en-us-dgx-systems\/dgx-h200-datasheet Includes specifications of the DGX H200 system, featuring 8x H200 GPUs, dual Xeon Platinum 8480C CPUs, 2 TB system memory, 30 TB NVMe SSD, and full NVIDIA AI Enterprise software stack."},{"key":"e_1_3_2_1_56_1","unstructured":"NVIDIA Corporation. 2025. NVIDIA DGX SuperPOD. NVIDIA. https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-superpod\/ Full-stack data center platform scaling to tens of thousands of GPUs; includes compute networking storage and software."},{"key":"e_1_3_2_1_57_1","unstructured":"NVIDIA Corporation. 2025. NVIDIA HGX Platform. NVIDIA. https:\/\/www.nvidia.com\/en-us\/data-center\/hgx\/ Reference architecture combining GPUs NVLink\/NVSwitch networking and AI\/HPC software stack."},{"key":"e_1_3_2_1_58_1","volume-title":"Rail Optimized Topology Validation. NVIDIA Networking","author":"NVIDIA Corporation","unstructured":"NVIDIA Corporation. 2025. Rail Optimized Topology Validation. NVIDIA Networking, Santa Clara, CA. https:\/\/docs.nvidia.com\/networking\/display\/ibdiagnetusermanualv221\/Rail+Optimized+Topology+Validation Part of the ibdiagnet InfiniBand Fabric Diagnostic Tool User Manual; describes cabling validation and compute-fabric alignment in DGX SuperPOD rail-optimized fabrics."},{"key":"e_1_3_2_1_59_1","volume-title":"International conference on machine learning. Pmlr, 1310\u20131318","author":"Pascanu Razvan","year":"2013","unstructured":"Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International conference on machine learning. Pmlr, 1310\u20131318."},{"key":"e_1_3_2_1_60_1","unstructured":"Polatis (a HUBER+SUHNER company). n.d.. Series 7000 \u2014 384x384-port Software-Defined Optical Circuit Switch. https:\/\/www.polatis.com\/series-7000-384x384-port-software-controlled-optical-circuit-switch-sdn-enabled.asp. Accessed: 2025-07-01."},{"key":"e_1_3_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544216.3544265"},{"key":"e_1_3_2_1_62_1","unstructured":"PyTorch Team. 2025. Automatic Mixed Precision package (torch.amp). https:\/\/pytorch.org\/docs\/stable\/amp.html. Accessed: 2025-07-09."},{"key":"e_1_3_2_1_63_1","volume-title":"Zero bubble pipeline parallelism. arXiv preprint arXiv:2401.10241","author":"Qi Penghui","year":"2023","unstructured":"Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2023. Zero bubble pipeline parallelism. arXiv preprint arXiv:2401.10241 (2023)."},{"key":"e_1_3_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406703"},{"key":"e_1_3_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2009.09.001"},{"key":"e_1_3_2_1_66_1","first-page":"A1","article-title":"Optical switching will innovate intra data center networks [Invited Tutorial]","volume":"16","year":"2023","unstructured":"Ken-ichi Sato. 2023. Optical switching will innovate intra data center networks [Invited Tutorial]. Journal of Optical Communications and Networking 16, 1 (2023), A1\u2013A23.","journal-title":"Journal of Optical Communications and Networking"},{"key":"e_1_3_2_1_67_1","volume-title":"20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)","author":"Shah Aashaka","year":"2023","unstructured":"Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi, and Rachee Singh. 2023. TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 593\u2013612. https:\/\/www.usenix.org\/conference\/nsdi23\/presentation\/shah"},{"key":"e_1_3_2_1_68_1","volume-title":"Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)."},{"key":"e_1_3_2_1_69_1","volume-title":"Shoal: A Network Architecture for Disaggregated Racks. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19)","author":"Shrivastav Vishal","year":"2019","unstructured":"Vishal Shrivastav, Asaf Valadarsky, Hitesh Ballani, Paolo Costa, Ki Suh Lee, Han Wang, Rachit Agarwal, and Hakim Weatherspoon. 2019. Shoal: A Network Architecture for Disaggregated Racks. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). USENIX Association, Boston, MA, 255\u2013270. https:\/\/www.usenix.org\/conference\/nsdi19\/presentation\/shrivastav"},{"key":"e_1_3_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1145\/2829988.2787508"},{"key":"e_1_3_2_1_71_1","volume-title":"11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14)","author":"Singla Ankit","year":"2014","unstructured":"Ankit Singla, P Brighten Godfrey, and Alexandra Kolla. 2014. High throughput data center topology design. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14). 29\u201341."},{"key":"e_1_3_2_1_72_1","volume-title":"Jellyfish: Networking Data Centers Randomly. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12)","author":"Singla Ankit","unstructured":"Ankit Singla, Chi-Yao Hong, Lucian Popa, and P. Brighten Godfrey. 2012. Jellyfish: Networking Data Centers Randomly. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). USENIX Association, San Jose, CA, 225\u2013238. https:\/\/www.usenix.org\/conference\/nsdi12\/technical-sessions\/presentation\/singla"},{"key":"e_1_3_2_1_73_1","doi-asserted-by":"crossref","first-page":"388","DOI":"10.3390\/cryst9080388","article-title":"Fast ferroelectric liquid crystal based optical switch: simulation and experiments","volume":"9","author":"Sreenilayam Sithara P","year":"2019","unstructured":"Sithara P Sreenilayam, Dermot Brabazon, and Yuri P Panarin. 2019. Fast ferroelectric liquid crystal based optical switch: simulation and experiments. Crystals 9, 8 (2019), 388.","journal-title":"Crystals"},{"key":"e_1_3_2_1_74_1","unstructured":"Nouamane Tazi Ferdinand Mom Haojun Zhao Phuc Nguyen Mohamed Mekkouri Leandro Werra and Thomas Wolf. 2025. The Ultra-Scale Playbook: Training LLMs on GPU Clusters. https:\/\/huggingface.co\/spaces\/nanotron\/ultrascale-playbook. Accessed: 2025-05-16."},{"key":"e_1_3_2_1_75_1","unstructured":"Telescent Inc. n.d.. Products | Telescent. https:\/\/www.telescent.com\/products. Accessed: 2025-07-01."},{"key":"e_1_3_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-39924-7_38"},{"key":"e_1_3_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/1851182.1851222"},{"key":"e_1_3_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1109\/HOTI63208.2024.00013"},{"key":"e_1_3_2_1_79_1","volume-title":"20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)","author":"Wang Weiyang","year":"2023","unstructured":"Weiyang Wang, Moein Khazraee, Zhizhen Zhong, Manya Ghobadi, Zhihao Jia, Dheevatsa Mudigere, Ying Zhang, and Anthony Kewitsch. 2023. {TopoOpt}: Co-optimizing network topology and parallelization strategy for distributed training jobs. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 739\u2013767."},{"key":"e_1_3_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1364\/OFC.2023.W1G.1"},{"key":"e_1_3_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1364\/JOCN.497372"},{"key":"e_1_3_2_1_82_1","doi-asserted-by":"publisher","DOI":"10.1145\/2783258.2783323"},{"key":"e_1_3_2_1_83_1","unstructured":"Sharada Yeluri. 2023. Optimizing Power Consumption in High-End Routers."},{"key":"e_1_3_2_1_84_1","doi-asserted-by":"crossref","unstructured":"Yanli Zhao Andrew Gu Rohan Varma Liang Luo Chien-Chin Huang Min Xu Less Wright Hamid Shojanazeri Myle Ott Sam Shleifer et al. 2023. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277 (2023).","DOI":"10.14778\/3611540.3611569"},{"key":"e_1_3_2_1_85_1","volume-title":"21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)","author":"Zu Yazhou","year":"2024","unstructured":"Yazhou Zu, Alireza Ghaffarkhah, Hoang-Vu Dang, Brian Towles, Steven Hand, Safeen Huda, Adekunle Bello, Alexander Kolbasov, Arash Rezaei, Dayou Du, Steve Lacy, Hang Wang, Aaron Wisner, Chris Lewis, and Henri Bahini. 2024. Resiliency at Scale: Managing Google's TPUv4 Machine Learning Supercomputer. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). 761\u2013774."}],"event":{"name":"HotNets '25: 24th ACM Workshop on Hot Topics in Networks","location":"UMD Campus College Park MD USA","acronym":"HotNets '25","sponsor":["SIGCOMM ACM Special Interest Group on Data Communication"]},"container-title":["Proceedings of the 24th ACM Workshop on Hot Topics in Networks"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3772356.3772414","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,17]],"date-time":"2025-11-17T12:04:39Z","timestamp":1763381079000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3772356.3772414"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,17]]},"references-count":85,"alternative-id":["10.1145\/3772356.3772414","10.1145\/3772356"],"URL":"https:\/\/doi.org\/10.1145\/3772356.3772414","relation":{},"subject":[],"published":{"date-parts":[[2025,11,17]]},"assertion":[{"value":"2025-11-17","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}