{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T20:59:10Z","timestamp":1757624350484,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":69,"publisher":"ACM","funder":[{"name":"Shenzhen Science and Technology Program","award":["JCYJ20220530161006015"],"award-info":[{"award-number":["JCYJ20220530161006015"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,7,20]]},"DOI":"10.1145\/3731545.3731581","type":"proceedings-article","created":{"date-parts":[[2025,9,9]],"date-time":"2025-09-09T12:46:16Z","timestamp":1757421976000},"page":"1-14","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["SAFusion: Efficient Tensor Fusion with Sparsification Ahead for High-Performance Distributed DNN Training"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1616-8054","authenticated-orcid":false,"given":"Zhangqiang","family":"Ming","sequence":"first","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, Hubei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1265-7141","authenticated-orcid":false,"given":"Yuchong","family":"Hu","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, Hubei, China"},{"name":"Shenzhen Research Institute of Huazhong University of Science and Technology, Shenzhen, Guangdong, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-7321-9264","authenticated-orcid":false,"given":"Xinjue","family":"Zheng","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, Hubei, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-6391-7798","authenticated-orcid":false,"given":"Wenxiang","family":"Zhou","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, Hubei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4674-6006","authenticated-orcid":false,"given":"Dan","family":"Feng","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, Hubei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,9,9]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2024. BytePS. https:\/\/github.com\/bytedance\/byteps."},{"key":"e_1_3_2_1_2_1","unstructured":"2024. DeepSpeed. https:\/\/github.com\/microsoft\/DeepSpeed."},{"key":"e_1_3_2_1_3_1","unstructured":"2024. Horovod. https:\/\/github.com\/horovod\/horovod."},{"key":"e_1_3_2_1_4_1","unstructured":"2024. Horovod Analyze Performance. https:\/\/horovod.readthedocs.io\/en\/stable\/tensor-fusion_include.html."},{"key":"e_1_3_2_1_5_1","unstructured":"2024. Horovod Automated Performance Tuning. https:\/\/horovod.readthedocs.io\/en\/stable\/autotune_include.html."},{"key":"e_1_3_2_1_6_1","unstructured":"2024. NVIDIA NCCL. https:\/\/developer.nvidia.com\/nccl."},{"key":"e_1_3_2_1_7_1","volume-title":"Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021","author":"Aji Alham Fikri","year":"2017","unstructured":"Alham Fikri Aji and Kenneth Heafield. 2017. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021 (2017)."},{"key":"e_1_3_2_1_8_1","unstructured":"Dan Alistarh Torsten Hoefler Mikael Johansson Nikola Konstantinov Sarit Khirirat and C\u00e9dric Renggli. 2018. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems. 
5977\u20135987."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018769"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3477132.3483553"},{"key":"e_1_3_2_1_11_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems. 1877\u20131901."},{"key":"e_1_3_2_1_12_1","unstructured":"Chia-Yu Chen Jiamin Ni Songtao Lu Xiaodong Cui Pin-Yu Chen Xiao Sun Naigang Wang Swagath Venkataramani Vijayalakshmi Srinivasan Wei Zhang et al. 2020. ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training. In Advances in Neural Information Processing Systems. 13551\u201313563."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_1_14_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_1_15_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_3_2_1_16_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence. 3817\u20133824","author":"Dutta Aritra","year":"2020","unstructured":"Aritra Dutta, El Houcine Bergou, Ahmed M Abdelmoniem, Chen-Yu Ho, Atal Narayan Sahu, Marco Canini, and Panos Kalnis. 2020. 
On the discrepancy between the theoretical analysis and practical implementations of compressed communication for distributed deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence. 3817\u20133824."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"crossref","unstructured":"Jiarui Fang Haohuan Fu Guangwen Yang and Cho-Jui Hsieh. 2019. RedSync: reducing synchronization bandwidth for distributed deep learning training system. J. Parallel and Distrib. Comput. (2019) 30\u201339.","DOI":"10.1016\/j.jpdc.2019.05.016"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3452296.3472904"},{"key":"e_1_3_2_1_19_1","unstructured":"Andrew Gibiansky. 2017. Bringing HPC techniques to deep learning. Baidu Research Tech. Rep. (2017)."},{"key":"e_1_3_2_1_20_1","volume-title":"Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 249\u2013256","author":"Glorot Xavier","year":"2010","unstructured":"Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 249\u2013256."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_22_1","unstructured":"Horovod. 2024. Tensor Fusion. https:\/\/github.com\/horovod\/horovod. Online accessed on Sept-2023."},{"key":"e_1_3_2_1_23_1","first-page":"623","article-title":"dPRO: A Generic Performance Diagnosis and Optimization Toolkit for Expediting Distributed DNN Training","volume":"4","author":"Hu Hanpeng","year":"2022","unstructured":"Hanpeng Hu, Chenyu Jiang, Yuchen Zhong, Yanghua Peng, Chuan Wu, Yibo Zhu, Haibin Lin, and Chuanxiong Guo. 2022. dPRO: A Generic Performance Diagnosis and Optimization Toolkit for Expediting Distributed DNN Training. 
Proceedings of Machine Learning and Systems 4 (2022), 623\u2013637.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3503222.3507778"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/3488766.3488792"},{"key":"e_1_3_2_1_26_1","volume-title":"International Conference on Machine Learning. PMLR, 3252\u20133261","author":"Karimireddy Sai Praneeth","year":"2019","unstructured":"Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. 2019. Error feedback fixes signsgd and other gradient compression schemes. In International Conference on Machine Learning. PMLR, 3252\u20133261."},{"key":"e_1_3_2_1_27_1","unstructured":"Alex Krizhevsky Geoffrey Hinton et al. 2009. Learning multiple layers of features from tiny images. Master's thesis University of Toronto (2009)."},{"key":"e_1_3_2_1_28_1","volume-title":"Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012)."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3503221.3508399"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415530"},{"key":"e_1_3_2_1_31_1","volume-title":"Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In International Conference on Learning Representations.","author":"Lin Yujun","year":"2018","unstructured":"Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. 2018. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. 
In International Conference on Learning Representations."},{"key":"e_1_3_2_1_32_1","volume-title":"Proceedings of the ACM Web Conference","author":"Liu Ting","year":"2022","unstructured":"Ting Liu, Tianhao Miao, Qinghua Wu, Zhenyu Li, Guangxin He, Jiaoren Wu, Shengzhuo Zhang, Xingwu Yang, Gareth Tyson, and Gaogang Xie. 2022. Modeling and optimizing the scaling performance in distributed deep learning training. In Proceedings of the ACM Web Conference 2022. 1764\u20131773."},{"key":"e_1_3_2_1_33_1","volume-title":"Libra: Contention-Aware GPU Thread Allocation for Data Parallel Training in High Speed Networks. In IEEE INFOCOM 2023-IEEE Conference on Computer Communications. IEEE, 1\u201310","author":"Liu Yunzhuo","year":"2023","unstructured":"Yunzhuo Liu, Bo Jiang, Shizhen Zhao, Tao Lin, Xinbing Wang, and Chenghu Zhou. 2023. Libra: Contention-Aware GPU Thread Allocation for Data Parallel Training in High Speed Networks. In IEEE INFOCOM 2023-IEEE Conference on Computer Communications. IEEE, 1\u201310."},{"key":"e_1_3_2_1_34_1","volume-title":"Proceedings of Machine Learning and Systems. 297\u2013322","author":"Abdelmoniem Ahmed M","year":"2021","unstructured":"Ahmed M Abdelmoniem, Ahmed Elzanaty, Mohamed-Slim Alouini, and Marco Canini. 2021. An efficient statistical-based gradient compression technique for distributed training systems. In Proceedings of Machine Learning and Systems. 297\u2013322."},{"key":"e_1_3_2_1_35_1","volume-title":"Nitish Shirish Keskar, and Richard Socher","author":"Merity Stephen","year":"2017","unstructured":"Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182 (2017)."},{"key":"e_1_3_2_1_36_1","volume-title":"Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843","author":"Merity Stephen","year":"2016","unstructured":"Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. 
arXiv preprint arXiv:1609.07843 (2016)."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3625549.3658678"},{"key":"e_1_3_2_1_38_1","volume-title":"Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems. 8026\u20138037.","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems. 8026\u20138037."},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359642"},{"key":"e_1_3_2_1_40_1","unstructured":"Alec Radford Jeffrey Wu Rewon Child David Luan Dario Amodei Ilya Sutskever et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1 8 (2019) 9."},{"key":"e_1_3_2_1_41_1","volume-title":"Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822","author":"Rajpurkar Pranav","year":"2018","unstructured":"Pranav Rajpurkar and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822 (2018)."},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406703"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356222"},{"key":"e_1_3_2_1_44_1","volume-title":"19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)","author":"Romero Joshua","year":"2022","unstructured":"Joshua Romero, Junqi Yin, Nouamane Laanait, Bing Xie, M Todd Young, Sean Treichler, Vitalii Starchenko, Albina Borisevich, Alex Sergeev, and Michael Matheson. 2022. Accelerating collective communication in data parallel training across deep learning frameworks. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). 
1027\u20131040."},{"key":"e_1_3_2_1_45_1","unstructured":"Atal Sahu Aritra Dutta Ahmed M Abdelmoniem and Panos Kalnis. 2021. Rethinking gradient sparsification as total error minimization. In Advances in Neural Information Processing Systems. 8133\u20138146."},{"key":"e_1_3_2_1_46_1","volume-title":"18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21)","author":"Sapio Amedeo","year":"2021","unstructured":"Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan Ports, and Peter Richt\u00e1rik. 2021. Scaling distributed machine learning with {In-Network} aggregation. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). 785\u2013808."},{"key":"e_1_3_2_1_47_1","volume-title":"Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799","author":"Sergeev Alexander","year":"2018","unstructured":"Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018)."},{"key":"e_1_3_2_1_48_1","volume-title":"Understanding top-k sparsification in distributed deep learning. arXiv preprint arXiv:1911.08772","author":"Shi Shaohuai","year":"2019","unstructured":"Shaohuai Shi, Xiaowen Chu, Cheung, and Simon See. 2019. Understanding top-k sparsification in distributed deep learning. arXiv preprint arXiv:1911.08772 (2019)."},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM.2019.8737367"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3052862"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM41043.2020.9155269"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDCS.2019.00220"},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"crossref","unstructured":"Shaohuai Shi Kaiyong Zhao Qiang Wang Zhenheng Tang and Xiaowen Chu. 2019. 
A convergence analysis of distributed SGD with communication-efficient gradient sparsification. In IJCAI. 3411\u20133417.","DOI":"10.24963\/ijcai.2019\/473"},{"key":"e_1_3_2_1_54_1","volume-title":"Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)."},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"crossref","first-page":"909","DOI":"10.1109\/TPDS.2022.3230938","article-title":"Gossipfl: A decentralized federated learning framework with sparsified and adaptive communication","volume":"34","author":"Tang Zhenheng","year":"2022","unstructured":"Zhenheng Tang, Shaohuai Shi, Bo Li, and Xiaowen Chu. 2022. Gossipfl: A decentralized federated learning framework with sparsified and adaptive communication. IEEE Transactions on Parallel and Distributed Systems 34, 3 (2022), 909\u2013922.","journal-title":"IEEE Transactions on Parallel and Distributed Systems"},{"key":"e_1_3_2_1_56_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 6000\u20136010."},{"key":"e_1_3_2_1_57_1","volume-title":"Proceedings of the Eighteenth European Conference on Computer Systems (EuroSys '23)","author":"Wang Zhuang","unstructured":"Zhuang Wang, Haibin Lin, Yibo Zhu, and T. S. Eugene Ng. 2023. Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies. In Proceedings of the Eighteenth European Conference on Computer Systems (EuroSys '23). 867\u2013882."},{"key":"e_1_3_2_1_58_1","volume-title":"Proceedings of the Sixth Conference on Machine Learning and Systems (MLSys' 23). 
Proceedings of the Sixth Conference on Machine Learning and Systems (MLSys' 23)","author":"Wang Zhuang","year":"2023","unstructured":"Zhuang Wang, Xinyu Crystal Wu, Zhaozhuo Xu, and TS Eugene Ng. 2023. Cupcake: A Compression Scheduler for Scalable Communication-Efficient Distributed Training. In Proceedings of the Sixth Conference on Machine Learning and Systems (MLSys' 23). Proceedings of the Sixth Conference on Machine Learning and Systems (MLSys' 23)."},{"key":"e_1_3_2_1_59_1","unstructured":"Jianqiao Wangni Jialei Wang Ji Liu and Tong Zhang. 2018. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems. 1306\u20131316."},{"key":"e_1_3_2_1_60_1","volume-title":"BIRD: A Lightweight and Adaptive Compressor for Communication-Efficient Distributed Learning Using Tensor-wise Bi-Random Sampling. In 2023 IEEE 41st International Conference on Computer Design (ICCD)","author":"Wu Donglei","year":"2023","unstructured":"Donglei Wu, Weihao Yang, Cai Deng, Xiangyu Zou, Shiyi Li, and Wen Xia. 2023. BIRD: A Lightweight and Adaptive Compressor for Communication-Efficient Distributed Learning Using Tensor-wise Bi-Random Sampling. In 2023 IEEE 41st International Conference on Computer Design (ICCD). IEEE, 605\u2013613."},{"key":"e_1_3_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDCS51616.2021.00060"},{"key":"e_1_3_2_1_62_1","volume-title":"Deepreduce: A sparse-tensor communication framework for federated deep learning. In Advances in Neural Information Processing Systems. 21150\u201321163.","author":"Xu Hang","year":"2021","unstructured":"Hang Xu, Kelly Kostopoulou, Aritra Dutta, Xin Li, Alexandros Ntoulas, and Panos Kalnis. 2021. Deepreduce: A sparse-tensor communication framework for federated deep learning. In Advances in Neural Information Processing Systems. 
21150\u201321163."},{"key":"e_1_3_2_1_63_1","volume-title":"MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training. arXiv preprint arXiv:2310.00967","author":"Yoon Daegun","year":"2023","unstructured":"Daegun Yoon and Sangyoon Oh. 2023. MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training. arXiv preprint arXiv:2310.00967 (2023)."},{"key":"e_1_3_2_1_64_1","volume-title":"Preserving Near-Optimal Gradient Sparsification Cost for Scalable Distributed Deep Learning. arXiv preprint arXiv:2402.13781","author":"Yoon Daegun","year":"2024","unstructured":"Daegun Yoon and Sangyoon Oh. 2024. Preserving Near-Optimal Gradient Sparsification Cost for Scalable Distributed Deep Learning. arXiv preprint arXiv:2402.13781 (2024)."},{"key":"e_1_3_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126912"},{"key":"e_1_3_2_1_66_1","volume-title":"Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters. In USENIX Annual Technical Conference","volume":"1","author":"Zhang Hao","year":"2017","unstructured":"Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, and Eric P Xing. 2017. Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters. In USENIX Annual Technical Conference, Vol. 1. 1\u20132."},{"key":"e_1_3_2_1_67_1","volume-title":"DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining. In 2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS). IEEE, 142\u2013153","author":"Zhang Lin","year":"2023","unstructured":"Lin Zhang, Shaohuai Shi, Xiaowen Chu, Wei Wang, Bo Li, and Chengjian Liu. 2023. DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining. In 2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS). 
IEEE, 142\u2013153."},{"key":"e_1_3_2_1_68_1","volume-title":"Accelerating Distributed K-FAC with Efficient Collective Communication and Scheduling. In IEEE INFOCOM 2023-IEEE Conference on Computer Communications. IEEE, 1\u201310","author":"Zhang Lin","year":"2023","unstructured":"Lin Zhang, Shaohuai Shi, and Bo Li. 2023. Accelerating Distributed K-FAC with Efficient Collective Communication and Scheduling. In IEEE INFOCOM 2023-IEEE Conference on Computer Communications. IEEE, 1\u201310."},{"key":"e_1_3_2_1_69_1","first-page":"3053","article-title":"MIPD: An adaptive gradient sparsification framework for distributed DNNs training","volume":"33","author":"Zhang Zhaorui","year":"2022","unstructured":"Zhaorui Zhang and Choli Wang. 2022. MIPD: An adaptive gradient sparsification framework for distributed DNNs training. IEEE Transactions on Parallel and Distributed Systems 33, 11 (2022), 3053\u20133066.","journal-title":"IEEE Transactions on Parallel and Distributed Systems"}],"event":{"name":"HPDC '25: 34th International Symposium on High-Performance Parallel and Distributed Computing","location":"University of Notre Dame Conference Facilities Notre Dame IN USA","acronym":"HPDC '25","sponsor":["SIGHPC ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing","SIGARCH ACM Special Interest Group on Computer Architecture"]},"container-title":["Proceedings of the 34th International Symposium on High-Performance Parallel and Distributed 
Computing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3731545.3731581","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,9]],"date-time":"2025-09-09T12:47:38Z","timestamp":1757422058000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3731545.3731581"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,20]]},"references-count":69,"alternative-id":["10.1145\/3731545.3731581","10.1145\/3731545"],"URL":"https:\/\/doi.org\/10.1145\/3731545.3731581","relation":{},"subject":[],"published":{"date-parts":[[2025,7,20]]},"assertion":[{"value":"2025-09-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}