{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,3]],"date-time":"2025-12-03T20:18:18Z","timestamp":1764793098384,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":44,"publisher":"ACM","license":[{"start":{"date-parts":[[2024,11,18]],"date-time":"2024-11-18T00:00:00Z","timestamp":1731888000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-sa\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,11,18]]},"DOI":"10.1145\/3696348.3696880","type":"proceedings-article","created":{"date-parts":[[2024,11,11]],"date-time":"2024-11-11T00:20:52Z","timestamp":1731284452000},"page":"177-185","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["When ML Training Cuts Through Congestion: Just-in-Time Gradient Compression via Packet Trimming"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4131-4113","authenticated-orcid":false,"given":"Xiaoqi","family":"Chen","sequence":"first","affiliation":[{"name":"Purdue University and VMware Research Group"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0982-7894","authenticated-orcid":false,"given":"Shay","family":"Vargaftik","sequence":"additional","affiliation":[{"name":"VMware Research"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0196-9190","authenticated-orcid":false,"given":"Ran Ben","family":"Basat","sequence":"additional","affiliation":[{"name":"UCL"}]}],"member":"320","published-online":{"date-parts":[[2024,11,18]]},"reference":[{"unstructured":"[n. d.]. Huggingface - Fully Sharded Data Parallel. https:\/\/huggingface.co\/docs\/transformers\/en\/fsdp. Accessed: 2024-10-21.","key":"e_1_3_2_1_1_1"},{"unstructured":"[n. d.]. Pytorch Reproducibility. https:\/\/pytorch.org\/docs\/stable\/notes\/randomness.html. Accessed: 2024-10-21.","key":"e_1_3_2_1_2_1"},{"key":"e_1_3_2_1_3_1","volume-title":"Implementing packet trimming support in hardware. arXiv preprint arXiv:2207.04967","author":"Adrian Popa","year":"2022","unstructured":"Popa Adrian, Dumitrescu Dragos, Handley Mark, Nikolaidis Georgios, Lee Jeongkeun, and Raiciu Costin. 2022. Implementing packet trimming support in hardware. arXiv preprint arXiv:2207.04967 (2022)."},{"key":"e_1_3_2_1_4_1","volume-title":"Harmony: A Congestion-free Datacenter Architecture. In NSDI. 329--343.","author":"Agarwal Saksham","year":"2024","unstructured":"Saksham Agarwal, Qizhe Cai, Rachit Agarwal, David Shmoys, and Amin Vahdat. 2024. Harmony: A Congestion-free Datacenter Architecture. In NSDI. 329--343."},{"key":"e_1_3_2_1_5_1","first-page":"652","article-title":"On the utility of gradient compression in distributed training systems","volume":"4","author":"Agarwal Saurabh","year":"2022","unstructured":"Saurabh Agarwal, Hongyi Wang, Shivaram Venkataraman, and Dimitris Papailiopoulos. 2022. On the utility of gradient compression in distributed training systems. In MLSys, Vol. 4. 652--672.","journal-title":"MLSys"},{"doi-asserted-by":"crossref","unstructured":"Youhui Bai Cheng Li Quan Zhou Jun Yi Ping Gong Feng Yan Ruichuan Chen and Yinlong Xu. 2021. Gradient compression supercharged high-performance data parallel dnn training. In SOSP. 359--375.","key":"e_1_3_2_1_6_1","DOI":"10.1145\/3477132.3483553"},{"unstructured":"Ran Ben Basat Yaniv Ben-Itzhak Michael Mitzenmacher and Shay Vargaftik. 2024. 
Optimal and Approximate Adaptive Stochastic Quantization. In NeurIPS.","key":"e_1_3_2_1_7_1"},{"key":"e_1_3_2_1_8_1","volume-title":"48th International Colloquium on Automata, Languages, and Programming (ICALP","author":"Basat Ran Ben","year":"2021","unstructured":"Ran Ben Basat, Michael Mitzenmacher, and Shay Vargaftik. 2021. How to Send a Real Number Using a Single Bit (And Some Shared Randomness). In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)."},{"unstructured":"Ran Ben Basat Amit Portnoy Gil Einziger Yaniv Ben-Itzhak and Michael Mitzenmacher. 2024. Accelerating Federated Learning with Quick Distributed Mean Estimation. In ICML.","key":"e_1_3_2_1_9_1"},{"key":"e_1_3_2_1_10_1","volume-title":"Expanding the Reach of Federated Learning by Reducing Client Resource Requirements. arXiv preprint arXiv:1812.07210","author":"Caldas Sebastian","year":"2018","unstructured":"Sebastian Caldas, Jakub Kone\u010dn\u00fd, H Brendan McMahan, and Ameet Talwalkar. 2018. Expanding the Reach of Federated Learning by Reducing Client Resource Requirements. arXiv preprint arXiv:1812.07210 (2018)."},{"unstructured":"Tri Dao. 2024. Fast Hadamard Transform in CUDA with a PyTorch interface. https:\/\/pypi.org\/project\/fast-hadamard-transform\/.","key":"e_1_3_2_1_11_1"},{"doi-asserted-by":"crossref","unstructured":"Jiawei Fei Chen-Yu Ho Atal N Sahu Marco Canini and Amedeo Sapio. 2021. Efficient sparse collective communication and its application to accelerate distributed deep learning. In SIGCOMM. 676--691.","key":"e_1_3_2_1_12_1","DOI":"10.1145\/3452296.3472904"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_13_1","DOI":"10.1007\/s11227-015-1483-z"},{"key":"e_1_3_2_1_14_1","volume-title":"GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. In ICLR.","author":"Frantar Elias","year":"2023","unstructured":"Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. In ICLR."},{"unstructured":"Adi Gangidi. 2023. Scaling RoCE Networks for AI Training. https:\/\/atscaleconference.com\/videos\/scaling-roce-networks-for-ai-training\/. https:\/\/atscaleconference.com\/videos\/scaling-roce-networks-for-ai-training\/ In Networking @ Scale 2023 Conference.","key":"e_1_3_2_1_15_1"},{"unstructured":"Wenchen Han Shay Vargaftik Michael Mitzenmacher Brad Karp and Ran Ben Basat. 2024. Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression. In HotNets.","key":"e_1_3_2_1_16_1"},{"doi-asserted-by":"crossref","unstructured":"Mark Handley Costin Raiciu Alexandru Agache Andrei Voinescu Andrew W. Moore Gianni Antichi and Marcin W\u00f3jcik. 2017. Re-Architecting Datacenter Networks and Stacks for Low Latency and High Performance. In SIGCOMM. 29--42.","key":"e_1_3_2_1_17_1","DOI":"10.1145\/3098822.3098825"},{"key":"e_1_3_2_1_18_1","first-page":"418","article-title":"Tictac: Accelerating distributed deep learning with communication scheduling","volume":"1","author":"Hashemi Sayed Hadi","year":"2019","unstructured":"Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and Roy Campbell. 2019. Tictac: Accelerating distributed deep learning with communication scheduling. In MLSys, Vol. 1. 418--430.","journal-title":"MLSys"},{"unstructured":"Ziheng Jiang et al. 2024. MegaScale: Scaling Large Language Model Training to More Than 10 000 GPUs. In NSDI. 
745--760.","key":"e_1_3_2_1_19_1"},{"key":"e_1_3_2_1_20_1","volume-title":"Ananda Theertha Suresh, and Dave Bacon","author":"Konecny Jakub","year":"2016","unstructured":"Jakub Konecny, H Brendan McMahan, Felix X Yu, Peter Richt\u00e1rik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 8 (2016)."},{"key":"e_1_3_2_1_21_1","volume-title":"ATP: In-network aggregation for multi-tenant learning. In NSDI. 741--761.","author":"Lao ChonLam","year":"2021","unstructured":"ChonLam Lao, Yanfang Le, Kshiteej Mahajan, Yixi Chen, Wenfei Wu, Aditya Akella, and Michael Swift. 2021. ATP: In-network aggregation for multi-tenant learning. In NSDI. 741--761."},{"key":"e_1_3_2_1_22_1","volume-title":"Accelerating Distributed Deep Learning using Lossless Homomorphic Compression. arXiv preprint arXiv:2402.07529","author":"Li Haoyu","year":"2024","unstructured":"Haoyu Li, Yuchen Xu, Jiayi Chen, Rohit Dwivedula, Wenfei Wu, Keqiang He, Aditya Akella, and Daehyeok Kim. 2024. Accelerating Distributed Deep Learning using Lossless Homomorphic Compression. arXiv preprint arXiv:2402.07529 (2024)."},{"key":"e_1_3_2_1_23_1","volume-title":"Shay Vargaftik, ChonLam Lao, Kevin Xu, Michael Mitzenmacher, and Minlan Yu.","author":"Li Minghao","year":"2024","unstructured":"Minghao Li, Ran Ben Basat, Shay Vargaftik, ChonLam Lao, Kevin Xu, Michael Mitzenmacher, and Minlan Yu. 2024. THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression. In NSDI. 1191--1211."},{"unstructured":"Yujun Lin Song Han Huizi Mao Yu Wang and William J Dally. 2018. Deep gradient compression: Reducing the communication bandwidth for distributed training. In ICLR.","key":"e_1_3_2_1_24_1"},{"key":"e_1_3_2_1_25_1","first-page":"297","article-title":"An efficient statistical-based gradient compression technique for distributed training systems","volume":"3","author":"Abdelmoniem Ahmed M","year":"2021","unstructured":"Ahmed M Abdelmoniem, Ahmed Elzanaty, Mohamed-Slim Alouini, and Marco Canini. 2021. An efficient statistical-based gradient compression technique for distributed training systems. In MLSys, Vol. 3. 297--322.","journal-title":"MLSys"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_26_1","DOI":"10.1145\/2378956.2378964"},{"unstructured":"Timothy Prickett Morgan. 2023. Inside the Infrastructure that Microsoft Builds to Run AI. https:\/\/www.nextplatform.com\/2023\/03\/21\/inside-the-infrastructure-that-microsoft-builds-to-run-ai\/. Accessed: 2024-06-25.","key":"e_1_3_2_1_27_1"},{"unstructured":"Vladimir Olteanu Haggai Eran Dragos Dumitrescu Adrian Popa Cristi Baciu Mark Silberstein Georgios Nikolaidis Mark Handley and Costin Raiciu. 2022. An edge-queued datagram service for all datacenter traffic. In NSDI. 761--777.","key":"e_1_3_2_1_28_1"},{"key":"e_1_3_2_1_29_1","volume-title":"Microsoft Build 2023 Conference.","author":"Russinovich Mark","year":"2023","unstructured":"Mark Russinovich. 2023. Inside Microsoft AI Innovation with Mark Russinovich. https:\/\/build.microsoft.com\/en-US\/sessions\/984ca69a-ffca-4729-bf72-72ea0cd8a5db. In Microsoft Build 2023 Conference."},{"key":"e_1_3_2_1_30_1","volume-title":"Uncertainty principle for communication compression in distributed and federated learning and the search for an optimal compressor. Information and Inference: A Journal of the IMA","author":"Safaryan Mher","year":"2020","unstructured":"Mher Safaryan, Egor Shulgin, and Peter Richt\u00e1rik. 2020. 
Uncertainty principle for communication compression in distributed and federated learning and the search for an optimal compressor. Information and Inference: A Journal of the IMA (2020)."},{"unstructured":"Amedeo Sapio Marco Canini Chen-Yu Ho Jacob Nelson Panos Kalnis Changhoon Kim Arvind Krishnamurthy Masoud Moshref Dan Ports and Peter Richt\u00e1rik. 2021. Scaling distributed machine learning with in-network aggregation. In NSDI. 785--808.","key":"e_1_3_2_1_31_1"},{"unstructured":"Hamid Shojanazeri Yanli Zhao and Shen Li. [n. d.]. Getting Started with Fully Sharded Data Parallel (FSDP). https:\/\/pytorch.org\/tutorials\/intermediate\/FSDP_tutorial.html. Accessed: 2024-10-21.","key":"e_1_3_2_1_32_1"},{"unstructured":"Ananda Theertha Suresh X Yu Felix Sanjiv Kumar and H Brendan McMahan. 2017. Distributed mean estimation with limited communication. In ICML. 3329--3337.","key":"e_1_3_2_1_33_1"},{"unstructured":"P.K Tseng. 2023. TrendForce Says with Cloud Companies Initiating AI Arms Race GPU Demand from ChatGPT Could Reach 30 000 Chips as It Readies for Commercialization. https:\/\/www.trendforce.com\/presscenter\/news\/20230301-11584.html. Accessed: 2024-06-25.","key":"e_1_3_2_1_34_1"},{"unstructured":"Ultra Ethernet Consortium. 2024. UEC Progresses Towards v1.0 Set of Specifications. https:\/\/ultraethernet.org\/uec-progresses-towards-v1-0-set-of-specifications\/. Accessed: 2024-10-21.","key":"e_1_3_2_1_35_1"},{"unstructured":"Ultra Ethernet Consortium. 2024. Ultra Ethernet Specification Update. https:\/\/ultraethernet.org\/ultra-ethernet-specification-update\/. Accessed: 2024-10-21.","key":"e_1_3_2_1_36_1"},{"key":"e_1_3_2_1_37_1","volume-title":"Amit Portnoy, Gal Mendelson, Yaniv Ben Itzhak, and Michael Mitzenmacher.","author":"Vargaftik Shay","year":"2022","unstructured":"Shay Vargaftik, Ran Ben Basat, Amit Portnoy, Gal Mendelson, Yaniv Ben Itzhak, and Michael Mitzenmacher. 2022. Eden: Communication-efficient and robust distributed mean estimation for federated learning. In ICML. 21984--22014."},{"key":"e_1_3_2_1_38_1","first-page":"362","article-title":"Drive: One-bit distributed mean estimation","volume":"34","author":"Vargaftik Shay","year":"2021","unstructured":"Shay Vargaftik, Ran Ben Basat, Amit Portnoy, Gal Mendelson, Yaniv Ben-Itzhak, and Michael Mitzenmacher. 2021. Drive: One-bit distributed mean estimation. Advances in Neural Information Processing Systems 34 (2021), 362--377.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_39_1","volume-title":"Sai Praneeth Karimireddy, and Martin Jaggi","author":"Vogels Thijs","year":"2019","unstructured":"Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. 2019. PowerSGD: Practical low-rank gradient compression for distributed optimization. In NeurIPS, Vol. 32."},{"unstructured":"Hao Wang Han Tian Jingrong Chen Xinchen Wan Jiacheng Xia Gaoxiong Zeng Wei Bai Junchen Jiang Yong Wang and Kai Chen. 2024. Towards Domain-Specific Network Transport for Distributed DNN Training. In NSDI. 1421--1443.","key":"e_1_3_2_1_40_1"},{"doi-asserted-by":"crossref","unstructured":"Yawen Wang Kapil Arya Marios Kogias Manohar Vanga Aditya Bhandari Neeraja J. Yadwadkar Siddhartha Sen Sameh Elnikety Christos Kozyrakis and Ricardo Bianchini. 2021. SmartHarvest: Harvesting Idle CPUs Safely and Efficiently in the Cloud. In EuroSys.","key":"e_1_3_2_1_41_1","DOI":"10.1145\/3447786.3456225"},{"doi-asserted-by":"crossref","unstructured":"Zhuang Wang Haibin Lin Yibo Zhu and TS Eugene Ng. 2023. 
HiSpeed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies. In EuroSys. 867--882.","key":"e_1_3_2_1_42_1","DOI":"10.1145\/3552326.3567505"},{"key":"e_1_3_2_1_43_1","first-page":"1508","article-title":"TernGrad: ternary gradients to reduce communication in distributed deep learning","volume":"31","author":"Wen Wei","year":"2017","unstructured":"Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2017. TernGrad: ternary gradients to reduce communication in distributed deep learning. In NeurIPS, Vol. 31. 1508--1518.","journal-title":"NeurIPS"},{"doi-asserted-by":"crossref","unstructured":"Xiyu Yu Tongliang Liu Xinchao Wang and Dacheng Tao. 2017. On compressing deep models by low rank and sparse decomposition. In CVPR. 7370--7379.","key":"e_1_3_2_1_44_1","DOI":"10.1109\/CVPR.2017.15"}],"event":{"sponsor":["SIGCOMM ACM Special Interest Group on Data Communication"],"acronym":"HotNets '24","name":"HotNets '24: The 23rd ACM Workshop on Hot Topics in Networks","location":"Irvine CA USA"},"container-title":["Proceedings of the 23rd ACM Workshop on Hot Topics in Networks"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3696348.3696880","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3696348.3696880","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T16:10:21Z","timestamp":1755879021000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3696348.3696880"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,18]]},"references-count":44,"alternative-id":["10.1145\/3696348.3696880","10.1145\/3696348"],"URL":"https:\/\/doi.org\/10.1145\/3696348.3696880","relation":{},"subject":[],"published":{"date-parts":[[2024,11,18]]},"assertion":[{"value":"2024-11-18","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
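
The record above is a standard Crossref REST API work message, so it can be fetched and parsed programmatically. Below is a minimal Python sketch, not part of the record itself: it assumes network access, the third-party requests package, and the public api.crossref.org works endpoint, which returns exactly this message format.

# Fetch the Crossref work record shown above and print a few key fields.
# Assumptions: `pip install requests` and the public endpoint
# https://api.crossref.org/works/{doi} (not stated in the record itself).
import requests

DOI = "10.1145/3696348.3696880"  # "DOI" field from the record above

resp = requests.get(f"https://api.crossref.org/works/{DOI}", timeout=30)
resp.raise_for_status()
work = resp.json()["message"]  # corresponds to the "message" object above

print("Title:", work["title"][0])      # "title" is a one-element list
for a in work["author"]:               # each author has given/family names
    print("Author:", a["given"], a["family"])
print("Venue:", work["container-title"][0])
print("Pages:", work["page"])
print("Reference count:", work["references-count"])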