{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T14:03:52Z","timestamp":1768313032708,"version":"3.49.0"},"reference-count":50,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2019,4,18]],"date-time":"2019-04-18T00:00:00Z","timestamp":1555545600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Science and Technology Major Projects on Core Electronic Devices, High-End Generic Chips and Basic Software","award":["2018ZX01028101"],"award-info":[{"award-number":["2018ZX01028101"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2019,6,30]]},"abstract":"<jats:p>With the fast development of deep learning (DL), the communication is increasingly a bottleneck for distributed workloads, and a series of optimization works have been done to scale out successfully. Nevertheless, the network behavior has not been investigated much yet. We intend to analyze the network behavior and then carry out some research through network simulation. Under this circumstance, an accurate communication measurement is necessary, as it is an effective way to study the network behavior and the basis for accurate simulation. Therefore, we propose to capture the deep learning communication (DLC) trace to achieve the measurement.<\/jats:p>\n          <jats:p>To the best of our knowledge, we make the first attempt to capture the communication trace for DL training. In this article, we first provide detailed analyses about the communication mechanism of MXNet, which is a representative framework for distributed DL. Secondly, we define the DLC trace format to describe and record the communication behaviors. 
Third, we present the implementation of the method for trace capturing. Finally, we present statistics and analyses of distributed DL training, including communication pattern, overlap ratio between computation and communication, computation overhead, synchronization overhead, update overhead, and so forth. Both the statistics and analyses are based on the trace files captured in a cluster with six machines. On the one hand, our trace files provide a sketch on the DLC, which contributes to understanding the communication details. On the other hand, the captured trace files can be used for figuring out various overheads, as they record the communication behaviors of each node.<\/jats:p>","DOI":"10.1145\/3312570","type":"journal-article","created":{"date-parts":[[2019,4,19]],"date-time":"2019-04-19T16:56:23Z","timestamp":1555692983000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["SketchDLC"],"prefix":"10.1145","volume":"16","author":[{"given":"Yemao","family":"Xu","sequence":"first","affiliation":[{"name":"National University of Defense Technology"}]},{"given":"Dezun","family":"Dong","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, Changsha, Hunan, China"}]},{"given":"Weixia","family":"Xu","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, Changsha, Hunan, China"}]},{"given":"Xiangke","family":"Liao","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, Changsha, Hunan, China"}]}],"member":"320","published-online":{"date-parts":[[2019,4,18]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Retrieved","author":"Deep High-Performance","year":"2018","unstructured":"High-Performance Deep Learning (HiDL). 2018 . RDMA-TensorFlow 0.9.1 . Retrieved March 21, 2019 from http:\/\/hidl.cse.ohio-state.edu. High-Performance Deep Learning (HiDL). 2018. 
RDMA-TensorFlow 0.9.1. Retrieved March 21, 2019 from http:\/\/hidl.cse.ohio-state.edu."},{"key":"e_1_2_1_2_1","volume-title":"Retrieved","year":"2018","unstructured":"GitHub. 2018 . Caffe-MPI 2.0 . Retrieved March 21, 2019 from https:\/\/github.com\/Caffe-MPI\/. GitHub. 2018. Caffe-MPI 2.0. Retrieved March 21, 2019 from https:\/\/github.com\/Caffe-MPI\/."},{"key":"e_1_2_1_3_1","volume-title":"Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467.","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi , Ashish Agarwal , Paul Barham , Eugene Brevdo , Zhifeng Chen , Craig Citro , Greg S. Corrado , 2016 . Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467. Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER.2015.141"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018769"},{"key":"e_1_2_1_6_1","volume-title":"Theano: New features and speed improvements. arXiv:1211.5590.","author":"Bastien Fr\u00e9d\u00e9ric","year":"2012","unstructured":"Fr\u00e9d\u00e9ric Bastien , Pascal Lamblin , Razvan Pascanu , James Bergstra , Ian Goodfellow , Arnaud Bergeron , Nicolas Bouchard , 2012 . Theano: New features and speed improvements. arXiv:1211.5590. Fr\u00e9d\u00e9ric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, et al. 2012. Theano: New features and speed improvements. 
arXiv:1211.5590."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.3278"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2644865.2541967"},{"key":"e_1_2_1_9_1","unstructured":"Tianqi Chen Mu Li Yutian Li Min Lin Naiyan Wang Minjie Wang Tianjun Xiao etal 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274.  Tianqi Chen Mu Li Yutian Li Min Lin Naiyan Wang Minjie Wang Tianjun Xiao et al. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2996864"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.58"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.13"},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI\u201914)","author":"Chilimbi Trishul M.","year":"2014","unstructured":"Trishul M. Chilimbi , Yutaka Suzue , Johnson Apacible , and Karthik Kalyanaraman . 2014 . Project Adam: Building an efficient and scalable deep learning training system . In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI\u201914) . 571--582. Trishul M. Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an efficient and scalable deep learning training system. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI\u201914). 571--582."},{"key":"e_1_2_1_14_1","volume-title":"Retrieved","author":"Azure Microsoft","year":"2018","unstructured":"Microsoft Azure . 2018 . The Microsoft Cognitive Toolkit . Retrieved March 21, 2019 from https:\/\/www.cntk.ai\/. Microsoft Azure. 2018. The Microsoft Cognitive Toolkit. 
Retrieved March 21, 2019 from https:\/\/www.cntk.ai\/."},{"key":"e_1_2_1_15_1","unstructured":"Jeffrey Dean Greg Corrado Rajat Monga Kai Chen Matthieu Devin Mark Mao Andrew Senior etal 2012. Large scale distributed deep networks. In Advances in Neural Information Processing Systems. 1223--1231.   Jeffrey Dean Greg Corrado Rajat Monga Kai Chen Matthieu Devin Mark Mao Andrew Senior et al. 2012. Large scale distributed deep networks. In Advances in Neural Information Processing Systems. 1223--1231."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_2_1_17_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2018","unstructured":"NVIDIA. 2018 . NVIDIA DGX-1 . Retrieved March 21, 2019 from https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-1\/. NVIDIA. 2018. NVIDIA DGX-1. Retrieved March 21, 2019 from https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-1\/."},{"key":"e_1_2_1_18_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2018","unstructured":"NVIDIA. 2018 . NVIDIA DGX-2 . Retrieved March 21, 2019 from https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-2\/. NVIDIA. 2018. NVIDIA DGX-2. Retrieved March 21, 2019 from https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-2\/."},{"key":"e_1_2_1_19_1","volume-title":"Retrieved","year":"2018","unstructured":"Caffe2. 2018 . Gloo . Retrieved March 21, 2019 from https:\/\/caffe2.ai\/docs\/distributed-training.html. Caffe2. 2018. Gloo. Retrieved March 21, 2019 from https:\/\/caffe2.ai\/docs\/distributed-training.html."},{"key":"e_1_2_1_20_1","unstructured":"Song Han Jeff Pool John Tran and William Dally. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems. 1135--1143.   Song Han Jeff Pool John Tran and William Dally. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems. 
1135--1143."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_22_1","unstructured":"Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.   Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654889"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_2_1_25_1","volume-title":"Environments and Tools for Parallel Scientific Computing","author":"Karrels Edward","unstructured":"Edward Karrels and Ewing Lusk . 1994. Performance analysis of MPI programs . In Environments and Tools for Parallel Scientific Computing , J. J. Dongarra and B. Tourancheau (Eds.). Advances in Parallel Computing. North-Holland , 195--200. Edward Karrels and Ewing Lusk. 1994. Performance analysis of MPI programs. In Environments and Tools for Parallel Scientific Computing, J. J. Dongarra and B. Tourancheau (Eds.). Advances in Parallel Computing. North-Holland, 195--200."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-58667-0_12"},{"key":"e_1_2_1_27_1","volume-title":"Learning multiple layers of features from tiny images","author":"Krizhevsky Alex","unstructured":"Alex Krizhevsky and Geoffrey Hinton . 2009. Learning multiple layers of features from tiny images . Vol. 1 . No. 4. Technical report, University of Toronto. Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. Vol. 1. No. 4. Technical report, University of Toronto."},{"key":"e_1_2_1_28_1","volume-title":"Hinton","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky , Ilya Sutskever , and Geoffrey E . Hinton . 2012 . 
ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems . 1097--1105. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097--1105."},{"key":"e_1_2_1_29_1","volume-title":"Retrieved","author":"LeCun Yann","year":"1998","unstructured":"Yann LeCun . 1998 . The MNIST database of handwritten digits . Retrieved March 21, 2019 from http:\/\/yann.lecun.com\/exdb\/mnist\/. Yann LeCun. 1998. The MNIST database of handwritten digits. Retrieved March 21, 2019 from http:\/\/yann.lecun.com\/exdb\/mnist\/."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/2685048.2685095"},{"key":"e_1_2_1_32_1","volume-title":"Dally","author":"Lin Yujun","year":"2017","unstructured":"Yujun Lin , Song Han , Huizi Mao , Yu Wang , and William J . Dally . 2017 . Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv:1712.01887. Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. 2017. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv:1712.01887."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-008-0089-y"},{"key":"e_1_2_1_34_1","volume-title":"Retrieved","author":"Wire HPC","year":"2017","unstructured":"HPC Wire . 2017 . In-Network Computing and Next Generation HDR 200G InfiniBand . Retrieved March 21, 2019 from https:\/\/www.hpcwire.com\/2017\/10\/23\/network-computing-next-generation-hdr-200g-infiniband\/. HPC Wire. 2017. In-Network Computing and Next Generation HDR 200G InfiniBand. 
Retrieved March 21, 2019 from https:\/\/www.hpcwire.com\/2017\/10\/23\/network-computing-next-generation-hdr-200g-infiniband\/."},{"key":"e_1_2_1_35_1","volume-title":"Retrieved","author":"Platform Next","year":"2018","unstructured":"Next Platform . 2018 . Programmable Networks Train Neural Nets Faster . Retrieved March 21, 2019 from https:\/\/www.nextplatform.com\/2018\/02\/14\/programmable-networks-train-neural-nets-faster\/. Next Platform. 2018. Programmable Networks Train Neural Nets Faster. Retrieved March 21, 2019 from https:\/\/www.nextplatform.com\/2018\/02\/14\/programmable-networks-train-neural-nets-faster\/."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00063"},{"key":"e_1_2_1_37_1","volume-title":"PARAVER: A tool to visualize and analyze parallel code. In Proceedings of WoTUG-18: Transputer and Occam Developments","author":"Pillet Vincent","year":"1995","unstructured":"Vincent Pillet , Jes\u00fas Labarta , Toni Cortes , and Sergi Girona . 1995 . PARAVER: A tool to visualize and analyze parallel code. In Proceedings of WoTUG-18: Transputer and Occam Developments , Vol. 44 . IOS Press , Amsterdam, Netherlands , 17--31. Vincent Pillet, Jes\u00fas Labarta, Toni Cortes, and Sergi Girona. 1995. PARAVER: A tool to visualize and analyze parallel code. In Proceedings of WoTUG-18: Transputer and Occam Developments, Vol. 44. IOS Press, Amsterdam, Netherlands, 17--31."},{"key":"e_1_2_1_38_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2018","unstructured":"NVIDIA. 2018 . Scalable AI Platform for Autonomous Driving . Retrieved March 21, 2019 from https:\/\/www.nvidia.com\/en-us\/self-driving-cars\/drive-platform\/. NVIDIA. 2018. Scalable AI Platform for Autonomous Driving. Retrieved March 21, 2019 from https:\/\/www.nvidia.com\/en-us\/self-driving-cars\/drive-platform\/."},{"key":"e_1_2_1_39_1","volume-title":"Hogwild: A lock-free approach to parallelizing stochastic gradient descent. 
In Advances in Neural Information Processing Systems. 693--701.","author":"Recht Benjamin","year":"2011","unstructured":"Benjamin Recht , Christopher Re , Stephen Wright , and Feng Niu . 2011 . Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems. 693--701. Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems. 693--701."},{"key":"e_1_2_1_40_1","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.  Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556."},{"key":"e_1_2_1_41_1","doi-asserted-by":"crossref","unstructured":"Christian Szegedy Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan etal 2015. Going deeper with convolutions. arXiv:1409.4842.  Christian Szegedy Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan et al. 2015. Going deeper with convolutions. arXiv:1409.4842.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCSim.2011.5999914"},{"key":"e_1_2_1_43_1","volume-title":"Beck","author":"Wang Dayong","year":"2016","unstructured":"Dayong Wang , Aditya Khosla , Rishab Gargeya , Humayun Irshad , and Andrew H . Beck . 2016 . Deep learning for identifying metastatic breast cancer. arXiv:1606.05718. Dayong Wang, Aditya Khosla, Rishab Gargeya, Humayun Irshad, and Andrew H. Beck. 2016. Deep learning for identifying metastatic breast cancer. 
arXiv:1606.05718."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2017.06.003"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/2806777.2806778"},{"key":"e_1_2_1_46_1","unstructured":"Wei Wen Cong Xu Feng Yan Chunpeng Wu Yandan Wang Yiran Chen and Hai Li. 2017. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems. 1509--1519.   Wei Wen Cong Xu Feng Yan Chunpeng Wu Yandan Wang Yiran Chen and Hai Li. 2017. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems. 1509--1519."},{"key":"e_1_2_1_47_1","volume-title":"Yi Zhou, Qirong Ho, Abhimanu Kumar, Yaoliang Yu, and Eric Xing.","author":"Xie Pengtao","year":"2015","unstructured":"Pengtao Xie , Jin Kyu Kim , Yi Zhou, Qirong Ho, Abhimanu Kumar, Yaoliang Yu, and Eric Xing. 2015 . Distributed machine learning via sufficient factor broadcasting. arXiv:1511.08486. Pengtao Xie, Jin Kyu Kim, Yi Zhou, Qirong Ho, Abhimanu Kumar, Yaoliang Yu, and Eric Xing. 2015. Distributed machine learning via sufficient factor broadcasting. arXiv:1511.08486."},{"key":"e_1_2_1_48_1","doi-asserted-by":"crossref","unstructured":"Jilong Xue Youshan Miao Cheng Chen Ming Wu Lintao Zhang and Lidong Zhou. 2018. RPC considered harmful: Fast distributed deep learning on RDMA. arXiv:1805.08430.  Jilong Xue Youshan Miao Cheng Chen Ming Wu Lintao Zhang and Lidong Zhou. 2018. RPC considered harmful: Fast distributed deep learning on RDMA. arXiv:1805.08430.","DOI":"10.1145\/3302424.3303975"},{"key":"e_1_2_1_49_1","unstructured":"Wikipedia. 2018. ZeroMQ. Available at https:\/\/en.wikipedia.org\/.  Wikipedia. 2018. ZeroMQ. 
Available at https:\/\/en.wikipedia.org\/."},{"key":"e_1_2_1_50_1","volume-title":"Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC\u201917)","author":"Zhang Hao","year":"2017","unstructured":"Hao Zhang , Zeyu Zheng , Shizhen Xu , Wei Dai , Qirong Ho , Xiaodan Liang , Zhiting Hu , 2017 . Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters . In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC\u201917) . 181--193. Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, et al. 2017. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC\u201917). 181--193."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3312570","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3312570","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:54:28Z","timestamp":1750204468000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3312570"}},"subtitle":["A Sketch on Distributed Deep Learning Communication via Trace 
Capturing"],"short-title":[],"issued":{"date-parts":[[2019,4,18]]},"references-count":50,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2019,6,30]]}},"alternative-id":["10.1145\/3312570"],"URL":"https:\/\/doi.org\/10.1145\/3312570","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,4,18]]},"assertion":[{"value":"2018-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-02-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-04-18","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}