{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,6]],"date-time":"2026-05-06T15:16:01Z","timestamp":1778080561881,"version":"3.51.4"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2023,8]]},"abstract":"<jats:p>It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.<\/jats:p>","DOI":"10.14778\/3611540.3611569","type":"journal-article","created":{"date-parts":[[2023,9,15]],"date-time":"2023-09-15T11:32:37Z","timestamp":1694777557000},"page":"3848-3860","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":166,"title":["PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel"],"prefix":"10.14778","volume":"16","author":[{"given":"Yanli","family":"Zhao","sequence":"first","affiliation":[{"name":"Meta AI"}]},{"given":"Andrew","family":"Gu","sequence":"additional","affiliation":[{"name":"Meta AI"}]},{"given":"Rohan","family":"Varma","sequence":"additional","affiliation":[{"name":"Meta AI"}]},{"given":"Liang","family":"Luo","sequence":"additional","affiliation":[{"name":"Meta AI"}]},{"given":"Chien-Chin","family":"Huang","sequence":"additional","affiliation":[{"name":"Meta AI"}]},{"given":"Min","family":"Xu","sequence":"additional","affiliation":[{"name":"Meta AI"}]},{"given":"Less","family":"Wright","sequence":"additional","affiliation":[{"name":"Meta AI"}]},{"given":"Hamid","family":"Shojanazeri","sequence":"additional","affiliation":[{"name":"Meta AI"}]},{"given":"Myle","family":"Ott","sequence":"additional","affiliation":[{"name":"Meta AI"}]},{"given":"Sam","family":"Shleifer","sequence":"additional","affiliation":[{"name":"Meta AI"}]},{"given":"Alban","family":"Desmaison","sequence":"additional","affiliation":[{"name":"Meta AI"}]},{"given":"Can","family":"Balioglu","sequence":"additional","affiliation":[{"name":"Meta AI"}]},{"given":"Pritam","family":"Damania","sequence":"additional","affiliation":[{"name":"Meta AI"}]},{"given":"Bernard","family":"Nguyen","sequence":"additional","affiliation":[{"name":"Meta AI"}]},{"given":"Geeta","family":"Chauhan","sequence":"additional","affiliation":[{"name":"Meta 
AI"}]},{"given":"Yuchen","family":"Hao","sequence":"additional","affiliation":[{"name":"Meta AI"}]},{"given":"Ajit","family":"Mathews","sequence":"additional","affiliation":[{"name":"Meta AI"}]},{"given":"Shen","family":"Li","sequence":"additional","affiliation":[{"name":"Meta AI"}]}],"member":"320","published-online":{"date-parts":[[2023,8]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2023. torch.amp Gradient Scaling. https:\/\/pytorch.org\/docs\/2.0\/amp.html#gradient-scaling."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3477132.3483553"},{"key":"e_1_2_1_3_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2022.3219819"},{"key":"e_1_2_1_5_1","volume-title":"Pipedream: Fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377","author":"Harlap Aaron","year":"2018","unstructured":"Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. Pipedream: Fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377 (2018)."},{"key":"e_1_2_1_6_1","volume-title":"Pipetransformer: Automated elastic pipelining for distributed training of transformers. arXiv preprint arXiv:2102.03161","author":"He Chaoyang","year":"2021","unstructured":"Chaoyang He, Shen Li, Mahdi Soltanolkotabi, and Salman Avestimehr. 2021. Pipetransformer: Automated elastic pipelining for distributed training of transformers. arXiv preprint arXiv:2102.03161 (2021)."},{"key":"e_1_2_1_7_1","volume-title":"Campo: Cost-Aware Performance Optimization for Mixed-Precision Neural Network Training. In 2022 USENIX Annual Technical Conference (USENIX ATC 22)","author":"He Xin","year":"2022","unstructured":"Xin He, Jianhua Sun, Hao Chen, and Dong Li. 2022. Campo: Cost-Aware Performance Optimization for Mixed-Precision Neural Network Training. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). USENIX Association, Carlsbad, CA, 505--518. https:\/\/www.usenix.org\/conference\/atc22\/presentation\/he"},{"key":"e_1_2_1_8_1","volume-title":"Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","unstructured":"Zhihao Jia Matei Zaharia and Alex Aiken. 2018. Beyond Data and Model Parallelism for Deep Neural Networks. 10.48550\/ARXIV.1807.05358","DOI":"10.48550\/ARXIV.1807.05358"},{"key":"e_1_2_1_10_1","unstructured":"Andrej Karpathy. 2020. MinGPT Transformer model. https:\/\/github.com\/karpathy\/minGPT."},{"key":"e_1_2_1_11_1","volume-title":"torchgpipe: On-the-fly pipeline parallelism for training giant models. arXiv preprint arXiv:2004.09910","author":"Kim Chiheon","year":"2020","unstructured":"Chiheon Kim, Heungsub Lee, Myungryong Jeong, Woonhyuk Baek, Boogeon Yoon, Ildoo Kim, Sungbin Lim, and Sungwoong Kim. 2020. 
torchgpipe: On-the-fly pipeline parallelism for training giant models. arXiv preprint arXiv:2004.09910 (2020)."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","unstructured":"Marisa Kirisame Steven Lyubomirsky Altan Haan Jennifer Brennan Mike He Jared Roesch Tianqi Chen and Zachary Tatlock. 2020. Dynamic Tensor Rematerialization. 10.48550\/ARXIV.2006.09616","DOI":"10.48550\/ARXIV.2006.09616"},{"key":"e_1_2_1_13_1","volume-title":"Reducing activation recomputation in large transformer models. arXiv preprint arXiv:2205.05198","author":"Korthikanti Vijay","year":"2022","unstructured":"Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2022. Reducing activation recomputation in large transformer models. arXiv preprint arXiv:2205.05198 (2022)."},{"key":"e_1_2_1_14_1","unstructured":"Shen Li Yanli Zhao Rohan Varma Omkar Salpekar Pieter Noordhuis Teng Li Adam Paszke Jeff Smith Brian Vaughan Pritam Damania et al. 2020. Pytorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704 (2020)."},{"key":"e_1_2_1_15_1","volume-title":"International Conference on Machine Learning. PMLR, 6543--6552","author":"Li Zhuohan","year":"2021","unstructured":"Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and Ion Stoica. 2021. Terapipe: Token-level pipeline parallelism for training large-scale language models. In International Conference on Machine Learning. PMLR, 6543--6552."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3037697.3037731"},{"key":"e_1_2_1_17_1","first-page":"82","article-title":"Plink: Discovering and exploiting locality for accelerated distributed training on the public cloud","volume":"2","author":"Luo Liang","year":"2020","unstructured":"Liang Luo, Peter West, Jacob Nelson, Arvind Krishnamurthy, and Luis Ceze. 2020. Plink: Discovering and exploiting locality for accelerated distributed training on the public cloud. Proceedings of Machine Learning and Systems 2 (2020), 82--97.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","unstructured":"Paulius Micikevicius Sharan Narang Jonah Alben Gregory Diamos Erich Elsen David Garcia Boris Ginsburg Michael Houston Oleksii Kuchaiev Ganesh Venkatesh and Hao Wu. 2017. Mixed Precision Training. 10.48550\/ARXIV.1710.03740","DOI":"10.48550\/ARXIV.1710.03740"},{"key":"e_1_2_1_19_1","unstructured":"Dheevatsa Mudigere Yuchen Hao Jianyu Huang Andrew Tulloch Srinivas Sridharan Xing Liu Mustafa Ozdal Jade Nie Jongsoo Park Liang Luo et al. 2021. High-performance distributed training of large-scale deep learning recommendation models. arXiv preprint arXiv:2104.05158 (2021)."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_2_1_22_1","unstructured":"NVIDIA. 2023. The NVIDIA Collective Communication Library (NCCL). https:\/\/developer.nvidia.com\/nccl."},{"key":"e_1_2_1_23_1","unstructured":"OpenAI. 2023. ChatGPT. 
https:\/\/chat.openai.com\/."},{"key":"e_1_2_1_24_1","volume-title":"PyTorch: An Imperative Style","author":"Paszke Adam","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 8024--8035. http:\/\/papers.nips.cc\/paper\/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf"},{"key":"e_1_2_1_25_1","unstructured":"Team PyTorch. 2023. DISTRIBUTED RPC FRAMEWORK. https:\/\/pytorch.org\/docs\/stable\/rpc.html."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.5555\/3455716.3455856"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_2_1_28_1","volume-title":"USENIX Annual Technical Conference. 551--564","author":"Ren Jie","year":"2021","unstructured":"Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training.. In USENIX Annual Technical Conference. 551--564."},{"key":"e_1_2_1_29_1","volume-title":"IEEE","author":"Schneider Nick","year":"2017","unstructured":"Nick Schneider, Florian Piewak, Christoph Stiller, and Uwe Franke. 2017. Reg-Net: Multimodal sensor registration using deep neural networks. In 2017 IEEE intelligent vehicles symposium (IV). IEEE, 1803--1810."},{"key":"e_1_2_1_30_1","volume-title":"Automatic cross-replica sharding of weight update in data-parallel training. arXiv preprint arXiv:2004.13336","author":"Xu Yuanzhong","year":"2020","unstructured":"Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Hongjun Choi, Blake Hechtman, and Shibo Wang. 2020. Automatic cross-replica sharding of weight update in data-parallel training. arXiv preprint arXiv:2004.13336 (2020)."},{"key":"e_1_2_1_31_1","unstructured":"Yuanzhong Xu HyoukJoong Lee Dehao Chen Blake Hechtman Yanping Huang Rahul Joshi Maxim Krikun Dmitry Lepikhin Andy Ly Marcello Maggioni et al. 2021. GSPMD: general and scalable parallelization for ML computation graphs. arXiv preprint arXiv:2105.04663 (2021)."},{"key":"e_1_2_1_32_1","volume-title":"Oneflow: Redesign the distributed deep learning framework from scratch. arXiv preprint arXiv:2110.15032","author":"Yuan Jinhui","year":"2021","unstructured":"Jinhui Yuan, Xinqi Li, Cheng Cheng, Juncheng Liu, Ran Guo, Shenghang Cai, Chi Yao, Fei Yang, Xiaodong Yi, Chuan Wu, et al. 2021. Oneflow: Redesign the distributed deep learning framework from scratch. arXiv preprint arXiv:2110.15032 (2021)."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2203.11014"},{"key":"e_1_2_1_34_1","volume-title":"MiCS: near-linear scaling for training gigantic model on public cloud. arXiv preprint arXiv:2205.00119","author":"Zhang Zhen","year":"2022","unstructured":"Zhen Zhang, Shuai Zheng, Yida Wang, Justin Chiu, George Karypis, Trishul Chilimbi, Mu Li, and Xin Jin. 2022. MiCS: near-linear scaling for training gigantic model on public cloud. 
arXiv preprint arXiv:2205.00119 (2022)."},{"key":"e_1_2_1_35_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating Inter-and {Intra-Operator} Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559--578."},{"key":"e_1_2_1_36_1","volume-title":"Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886","author":"Zhou Daquan","year":"2021","unstructured":"Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. 2021. Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886 (2021)."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3611540.3611569","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,10]],"date-time":"2025-09-10T22:34:03Z","timestamp":1757543643000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3611540.3611569"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8]]},"references-count":36,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2023,8]]}},"alternative-id":["10.14778\/3611540.3611569"],"URL":"https:\/\/doi.org\/10.14778\/3611540.3611569","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2023,8]]},"assertion":[{"value":"2023-08-01","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
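The abstract in the record above describes FSDP as a non-intrusive, industry-grade wrapper for sharded data-parallel training in PyTorch, with performance comparable to Distributed Data Parallel on much larger models. As a rough illustration only, and not taken from the paper itself, the following minimal sketch shows the public torch.distributed.fsdp API wrapping a toy model; the layer sizes, hyperparameters, and torchrun-provided environment variables are assumptions made for the example.

```python
# Minimal sketch (illustrative, not from the paper): sharded data-parallel
# training with PyTorch FSDP. Assumes a CUDA/NCCL setup launched via torchrun,
# which provides RANK, WORLD_SIZE, and LOCAL_RANK in the environment.
import os

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main() -> None:
    dist.init_process_group(backend="nccl")            # GPU collectives via NCCL
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Any nn.Module can be wrapped; FSDP shards parameters, gradients, and
    # optimizer state across ranks instead of replicating them as DDP does.
    model = nn.Sequential(
        nn.Linear(1024, 4096),
        nn.ReLU(),
        nn.Linear(4096, 1024),
    ).cuda()
    model = FSDP(model)

    optim = torch.optim.AdamW(model.parameters(), lr=1e-3)

    x = torch.randn(8, 1024, device="cuda")            # dummy batch for the sketch
    loss = model(x).sum()
    loss.backward()                                    # gradients are reduce-scattered to their owning shards
    optim.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=8 train.py`, each rank holds only its shard of the parameters, gradients, and optimizer state, which is what allows FSDP to fit models far larger than DDP can while keeping an essentially unchanged training loop.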