{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T16:06:39Z","timestamp":1777651599403,"version":"3.51.4"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2020,8]]},"abstract":"<jats:p>This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. Py-Torch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.<\/jats:p>","DOI":"10.14778\/3415478.3415530","type":"journal-article","created":{"date-parts":[[2020,9,14]],"date-time":"2020-09-14T18:46:35Z","timestamp":1600109195000},"page":"3005-3018","source":"Crossref","is-referenced-by-count":413,"title":["PyTorch distributed"],"prefix":"10.14778","volume":"13","author":[{"given":"Shen","family":"Li","sequence":"first","affiliation":[{"name":"Facebook AI"}]},{"given":"Yanli","family":"Zhao","sequence":"additional","affiliation":[{"name":"Facebook AI"}]},{"given":"Rohan","family":"Varma","sequence":"additional","affiliation":[{"name":"Facebook AI"}]},{"given":"Omkar","family":"Salpekar","sequence":"additional","affiliation":[{"name":"Facebook AI"}]},{"given":"Pieter","family":"Noordhuis","sequence":"additional","affiliation":[{"name":"University of Warsaw"}]},{"given":"Teng","family":"Li","sequence":"additional","affiliation":[{"name":"Facebook AI"}]},{"given":"Adam","family":"Paszke","sequence":"additional","affiliation":[{"name":"University of Warsaw"}]},{"given":"Jeff","family":"Smith","sequence":"additional","affiliation":[{"name":"Facebook AI"}]},{"given":"Brian","family":"Vaughan","sequence":"additional","affiliation":[{"name":"Facebook AI"}]},{"given":"Pritam","family":"Damania","sequence":"additional","affiliation":[{"name":"Facebook AI"}]},{"given":"Soumith","family":"Chintala","sequence":"additional","affiliation":[{"name":"Facebook AI"}]}],"member":"320","published-online":{"date-parts":[[2020,8]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"a collective communications library. https:\/\/github.com\/facebookincubator\/gloo","author":"Gloo","year":"2019","unstructured":"Gloo: a collective communications library. 
https:\/\/github.com\/facebookincubator\/gloo, 2019."},{"key":"e_1_2_1_2_1","volume-title":"https:\/\/developer.nvidia.com\/nccl","author":"Collective Communications NVIDIA","year":"2019","unstructured":"NVIDIA Collective Communications Library (NCCL). https:\/\/developer.nvidia.com\/nccl, 2019."},{"key":"e_1_2_1_3_1","volume-title":"The Building Blocks of Advanced Multi-GPU Communication. https:\/\/www.nvidia.com\/en-us\/data-center\/nvlink\/","author":"NVLINK AND","year":"2019","unstructured":"NVLINK AND NVSWITCH: The Building Blocks of Advanced Multi-GPU Communication. https:\/\/www.nvidia.com\/en-us\/data-center\/nvlink\/, 2019."},{"key":"e_1_2_1_4_1","volume-title":"A High Performance Message Passing Library. https:\/\/www.open-mpi.org\/","author":"Open","year":"2019","unstructured":"Open MPI: A High Performance Message Passing Library. https:\/\/www.open-mpi.org\/, 2019."},{"key":"e_1_2_1_5_1","volume-title":"Seamless operability between C++11 and Python. https:\/\/pybind11.readthedocs.io\/","year":"2019","unstructured":"Pybind11: Seamless operability between C++11 and Python. https:\/\/pybind11.readthedocs.io\/, 2019."},{"key":"e_1_2_1_6_1","volume-title":"https:\/\/pytorch.org\/docs\/master\/rpc.html","author":"Framework PyTorch","year":"2019","unstructured":"PyTorch Distributed RPC Framework. https:\/\/pytorch.org\/docs\/master\/rpc.html, 2019."},{"key":"e_1_2_1_7_1","volume-title":"https:\/\/pytorch.org\/docs\/stable\/nn.html#torch.nn.Module.forward","author":"Function PyTorch","year":"2019","unstructured":"PyTorch Module forward Function. https:\/\/pytorch.org\/docs\/stable\/nn.html#torch.nn.Module.forward, 2019."},{"key":"e_1_2_1_8_1","volume-title":"open-source software for mathematics, science, and engineering. https:\/\/docs.scipy.org\/","year":"2019","unstructured":"SciPy: open-source software for mathematics, science, and engineering. https:\/\/docs.scipy.org\/, 2019."},{"key":"e_1_2_1_9_1","volume-title":"https:\/\/pytorch.org\/docs\/stable\/nn.html#torch.nn.parallel.DistributedDataParallel","author":"DistributedDataParallel PyTorch","year":"2020","unstructured":"PyTorch DistributedDataParallel. https:\/\/pytorch.org\/docs\/stable\/nn.html#torch.nn.parallel.DistributedDataParallel, 2020."},{"key":"e_1_2_1_10_1","volume-title":"https:\/\/www.tensorflow.org\/guide\/distributed_training#multiworkermirroredstrategy","author":"Distributed Training TensorFlow","year":"2020","unstructured":"TensorFlow Distributed Training MultiWorkerMirroredStrategy. https:\/\/www.tensorflow.org\/guide\/distributed_training#multiworkermirroredstrategy, 2020."},{"key":"e_1_2_1_11_1","volume-title":"https:\/\/www.tensorflow.org\/guide\/distributed_training#parameterserverstrategy","author":"Distributed Training TensorFlow","year":"2020","unstructured":"TensorFlow Distributed Training ParameterServerStrategy. https:\/\/www.tensorflow.org\/guide\/distributed_training#parameterserverstrategy, 2020."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM41043.2020.9155446"},{"key":"e_1_2_1_13_1","volume-title":"End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316","author":"Bojarski M.","year":"2016","unstructured":"M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. 
arXiv preprint arXiv:1604.07316, 2016."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1147\/JRD.2019.2947013"},{"key":"e_1_2_1_15_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin J.","year":"2018","unstructured":"J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3133956.3134015"},{"key":"e_1_2_1_17_1","volume-title":"Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556","author":"Fan A.","year":"2019","unstructured":"A. Fan, E. Grave, and A. Joulin. Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556, 2019."},{"key":"e_1_2_1_18_1","first-page":"3338","volume-title":"Advances in neural information processing systems","author":"Guo X.","year":"2014","unstructured":"X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep learning for real-time atari game play using offline monte-carlo tree search planning. In Advances in neural information processing systems, pages 3338--3346, 2014."},{"key":"e_1_2_1_19_1","volume-title":"Tictac: Accelerating distributed deep learning with communication scheduling. arXiv preprint arXiv:1803.03288","author":"Hashemi S. H.","year":"2018","unstructured":"S. H. Hashemi, S. A. Jyothi, and R. H. Campbell. Tictac: Accelerating distributed deep learning with communication scheduling. arXiv preprint arXiv:1803.03288, 2018."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_21_1","first-page":"103","volume-title":"Advances in Neural Information Processing Systems","author":"Huang Y.","year":"2019","unstructured":"Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems, pages 103--112, 2019."},{"key":"e_1_2_1_22_1","first-page":"132","volume-title":"Proceedings of Machine Learning and Systems 2019","author":"Jayarajan A.","year":"2019","unstructured":"A. Jayarajan, J. Wei, G. Gibson, A. Fedorova, and G. Pekhimenko. Priority-based parameter propagation for distributed dnn training. In Proceedings of Machine Learning and Systems 2019, pages 132--145, 2019."},{"key":"e_1_2_1_23_1","volume-title":"February","author":"Jeaugey S.","year":"2019","unstructured":"S. Jeaugey. Massively Scale Your Deep Learning Training with NCCL 2.4. https:\/\/devblogs.nvidia.com\/massively-scale-deep-learning-training-nccl-2-4\/, February 2019."},{"key":"e_1_2_1_24_1","first-page":"1","volume-title":"Proceedings of the Fourteenth EuroSys Conference 2019","author":"Kim S.","year":"2019","unstructured":"S. Kim, G.-I. Yu, H. Park, S. Cho, E. Jeong, H. Ha, S. Lee, J. S. Jeong, and B.-G. Chun. Parallax: Sparsity-aware data parallel training of deep neural networks. In Proceedings of the Fourteenth EuroSys Conference 2019, pages 1--15, 2019."},{"key":"e_1_2_1_25_1","unstructured":"Y. LeCun C. Cortes and C. Burges. The MNIST Database. http:\/\/yann.lecun.com\/exdb\/mnist\/ 1999."},{"key":"e_1_2_1_26_1","first-page":"21","volume-title":"Proceedings of the 1988 connectionist models summer school","author":"LeCun Y.","year":"1988","unstructured":"Y. LeCun, D. Touresky, G. Hinton, and T. Sejnowski. 
A theoretical framework for back-propagation. In Proceedings of the 1988 connectionist models summer school, volume 1, pages 21--28. CMU, Pittsburgh, Pa: Morgan Kaufmann, 1988."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2640087.2644155"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123405"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_2_1_30_1","first-page":"8024","volume-title":"Advances in Neural Information Processing Systems 32","author":"Paszke A.","year":"2019","unstructured":"A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024--8035. Curran Associates, Inc., 2019."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359642"},{"key":"e_1_2_1_32_1","volume-title":"Zero: Memory optimization towards training a trillion parameter models. arXiv preprint arXiv:1910.02054","author":"Rajbhandari S.","year":"2019","unstructured":"S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimization towards training a trillion parameter models. arXiv preprint arXiv:1910.02054, 2019."},{"key":"e_1_2_1_33_1","volume-title":"Deep learning for the life sciences: applying deep learning to genomics, microscopy, drug discovery, and more. \"O'Reilly Media","author":"Ramsundar B.","year":"2019","unstructured":"B. Ramsundar, P. Eastman, P. Walters, and V. Pande. Deep learning for the life sciences: applying deep learning to genomics, microscopy, drug discovery, and more. \"O'Reilly Media, Inc.\", 2019."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2014-274"},{"key":"e_1_2_1_35_1","volume-title":"Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799","author":"Sergeev A.","year":"2018","unstructured":"A. Sergeev and M. D. Balso. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799, 2018."},{"key":"e_1_2_1_36_1","first-page":"10414","volume-title":"Advances in Neural Information Processing Systems","author":"Shazeer N.","year":"2018","unstructured":"N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee, M. Hong, C. Young, et al. Mesh-tensorflow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pages 10414--10423, 2018."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/TBDATA.2019.2957478"},{"key":"e_1_2_1_38_1","first-page":"2643","volume-title":"Advances in neural information processing systems","author":"den Oord A. Van","year":"2013","unstructured":"A. Van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In Advances in neural information processing systems, pages 2643--2651, 2013."},{"key":"e_1_2_1_39_1","volume-title":"Blink: Fast and generic collectives for distributed ml. arXiv preprint arXiv:1910.04940","author":"Wang G.","year":"2019","unstructured":"G. Wang, S. Venkataraman, A. Phanishayee, J. Thelin, N. Devanur, and I. Stoica. Blink: Fast and generic collectives for distributed ml. 
arXiv preprint arXiv:1910.04940, 2019."},{"key":"e_1_2_1_40_1","volume-title":"Slowmo: Improving communication-efficient distributed sgd with slow momentum. arXiv preprint arXiv:1910.00643","author":"Wang J.","year":"2019","unstructured":"J. Wang, V. Tantia, N. Ballas, and M. Rabbat. Slowmo: Improving communication-efficient distributed sgd with slow momentum. arXiv preprint arXiv:1910.00643, 2019."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3415478.3415530","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,17]],"date-time":"2025-09-17T02:27:45Z","timestamp":1758076065000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3415478.3415530"}},"subtitle":["experiences on accelerating data parallel training"],"short-title":[],"issued":{"date-parts":[[2020,8]]},"references-count":40,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2020,8]]}},"alternative-id":["10.14778\/3415478.3415530"],"URL":"https:\/\/doi.org\/10.14778\/3415478.3415530","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2020,8]]}}}
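
The three acceleration techniques named in the abstract surface directly in the public torch.nn.parallel.DistributedDataParallel API cited in the record's reference list (https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel): gradient bucketing is tuned through the bucket_cap_mb constructor argument, communication overlaps the backward pass automatically as each bucket's gradients become ready, and gradient synchronization is skipped with the no_sync() context manager. A minimal sketch, assuming a single host with two CPU processes, the gloo backend (https://github.com/facebookincubator/gloo) standing in for NCCL, a hypothetical toy model, and free localhost rendezvous settings:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int, world_size: int) -> None:
    # Rendezvous over localhost; MASTER_ADDR/MASTER_PORT are assumed free.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    torch.manual_seed(0)      # identical data and init on every process
    model = nn.Linear(16, 4)  # hypothetical toy model
    # bucket_cap_mb controls gradient bucketing: each bucket is allreduced
    # as soon as its gradients are ready, overlapping the communication
    # with the rest of the backward computation.
    ddp = DDP(model, bucket_cap_mb=25)
    opt = torch.optim.SGD(ddp.parameters(), lr=0.1)
    loss_fn = nn.MSELoss()

    for step in range(4):
        x, y = torch.randn(8, 16), torch.randn(8, 4)
        if step % 2 == 0:
            # Skip gradient synchronization: gradients accumulate locally
            # and no allreduce is issued for this micro-batch.
            with ddp.no_sync():
                loss_fn(ddp(x), y).backward()
        else:
            # Normal iteration: backward triggers the bucketed allreduce,
            # folding in the locally accumulated gradients.
            loss_fn(ddp(x), y).backward()
            opt.step()
            opt.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # two workers on one machine
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

In a real job each rank would draw a distinct data shard, typically via torch.utils.data.distributed.DistributedSampler; the fixed seed here only keeps the sketch deterministic across processes.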