{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,21]],"date-time":"2026-03-21T19:14:22Z","timestamp":1774120462745,"version":"3.50.1"},"reference-count":50,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2023,5,26]],"date-time":"2023-05-26T00:00:00Z","timestamp":1685059200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2023,5,26]]},"abstract":"<jats:p>With the increasing data volume, there is a trend of using large-scale pre-trained models to store the knowledge into an enormous number of model parameters. The training of these models is composed of lots of dense algebras, requiring a huge amount of hardware resources. Recently, sparsely-gated Mixture-of-Experts (MoEs) are becoming more popular and have demonstrated impressive pretraining scalability in various downstream tasks. However, such a sparse conditional computation may not be effective as expected in practical systems due to the routing imbalance and fluctuation problems. Generally, MoEs are becoming a new data analytics paradigm in the data life cycle and suffering from unique challenges at scales, complexities, and granularities never before possible.<\/jats:p>\n          <jats:p>In this paper, we propose a novel DNN training framework, FlexMoE, which systematically and transparently address the inefficiency caused by dynamic dataflow. We first present an empirical analysis on the problems and opportunities of training MoE models, which motivates us to overcome the routing imbalance and fluctuation problems by a dynamic expert management and device placement mechanism. Then we introduce a novel scheduling module over the existing DNN runtime to monitor the data flow, make the scheduling plans, and dynamically adjust the model-to-hardware mapping guided by the real-time data traffic. A simple but efficient heuristic algorithm is exploited to dynamically optimize the device placement during training. We have conducted experiments on both NLP models (e.g., BERT and GPT) and vision models (e.g., Swin). 
Results show that FlexMoE achieves superior performance compared with existing systems on real-world workloads: FlexMoE outperforms DeepSpeed by 1.70x on average (up to 2.10x) and outperforms FasterMoE by 1.30x on average (up to 1.45x).<\/jats:p>","DOI":"10.1145\/3588964","type":"journal-article","created":{"date-parts":[[2023,5,30]],"date-time":"2023-05-30T17:42:05Z","timestamp":1685468525000},"page":"1-19","source":"Crossref","is-referenced-by-count":30,"title":["FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6766-757X","authenticated-orcid":false,"given":"Xiaonan","family":"Nie","sequence":"first","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9371-8358","authenticated-orcid":false,"given":"Xupeng","family":"Miao","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-4026-0104","authenticated-orcid":false,"given":"Zilong","family":"Wang","sequence":"additional","affiliation":[{"name":"Microsoft, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-7734-5766","authenticated-orcid":false,"given":"Zichao","family":"Yang","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4495-1997","authenticated-orcid":false,"given":"Jilong","family":"Xue","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-9524-5476","authenticated-orcid":false,"given":"Lingxiao","family":"Ma","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-1142-0128","authenticated-orcid":false,"given":"Gang","family":"Cao","sequence":"additional","affiliation":[{"name":"Beijing Academy of Artificial Intelligence, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1681-4677","authenticated-orcid":false,"given":"Bin","family":"Cui","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2023,5,30]]},"reference":[{"key":"e_1_2_2_1_1","unstructured":"2014. Azure A100 VMs. https:\/\/docs.microsoft.com\/en-us\/azure\/virtual-machines\/nda100-v4-series."},{"key":"e_1_2_2_2_1","unstructured":"2014. ImageNet-1K Dataset. https:\/\/huggingface.co\/datasets\/imagenet-1k."},{"key":"e_1_2_2_3_1","unstructured":"2014. Wikipedia Dataset. https:\/\/huggingface.co\/datasets\/wikipedia."},{"key":"e_1_2_2_4_1","unstructured":"2017. NCCL 2.0. https:\/\/github.com\/NVIDIA\/nccl."},{"key":"e_1_2_2_5_1","unstructured":"2021. The AI2 Reasoning Challenge (ARC). https:\/\/leaderboard.allenai.org\/arc\/submissions\/public."},{"key":"e_1_2_2_6_1","volume-title":"et al","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.
Advances in neural information processing systems 33 (2020), 1877--1901."},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-long.489"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-021-00156-2"},{"key":"e_1_2_2_9_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","volume":"1","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171--4186."},{"key":"e_1_2_2_10_1","volume-title":"International Conference on Learning Representations.","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations."},{"key":"e_1_2_2_11_1","volume-title":"International Conference on Machine Learning. PMLR, 5547--5569","author":"Du Nan","year":"2022","unstructured":"Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al . 2022. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning. PMLR, 5547--5569."},{"key":"e_1_2_2_12_1","first-page":"1","article-title":"Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity","volume":"23","author":"Fedus William","year":"2022","unstructured":"William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research 23, 120 (2022), 1--39.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_2_13_1","volume-title":"International Conference on Machine Learning. PMLR, 3304--3314","author":"Fu Fangcheng","year":"2020","unstructured":"Fangcheng Fu, Yuzheng Hu, Yihan He, Jiawei Jiang, Yingxia Shao, Ce Zhang, and Bin Cui. 2020. Don't waste your bits! squeeze activations and gradients for deep neural networks via tinyscript. In International Conference on Machine Learning. PMLR, 3304--3314."},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.14778\/3342263.3342273"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.14778\/3547305.3547316"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-021-1350-8"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3503221.3508418"},{"key":"e_1_2_2_18_1","volume-title":"et al","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al . 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32 (2019), 103--112."},{"key":"e_1_2_2_19_1","volume-title":"Tutel: Adaptive Mixture-of-Experts at Scale. 
CoRR abs\/2206.03382","author":"Hwang Changho","year":"2022","unstructured":"Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, Joe Chau, Peng Cheng, Fan Yang, Mao Yang, and Yongqiang Xiong. 2022. Tutel: Adaptive Mixture-of-Experts at Scale. CoRR abs\/2206.03382 (2022)."},{"key":"e_1_2_2_20_1","volume-title":"Scaling laws for neural language models. CoRR abs\/2001.08361","author":"Kaplan Jared","year":"2020","unstructured":"Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. CoRR abs\/2001.08361 (2020)."},{"key":"e_1_2_2_21_1","volume-title":"GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In International Conference on Learning Representations.","author":"Lepikhin Dmitry","year":"2021","unstructured":"Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In International Conference on Learning Representations."},{"key":"e_1_2_2_22_1","volume-title":"International Conference on Machine Learning. PMLR, 6265--6274","author":"Lewis Mike","year":"2021","unstructured":"Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. 2021. Base layers: Simplifying training of large, sparse models. In International Conference on Machine Learning. PMLR, 6265--6274."},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415530"},{"key":"e_1_2_2_24_1","unstructured":"Junyang Lin Rui Men An Yang Chang Zhou Ming Ding Yichang Zhang Peng Wang Ang Wang Le Jiang Xianyan Jia et al. 2021. M6: A chinese multimodal pretrainer. CoRR abs\/2103.00823 (2021)."},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3503221.3508417"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-022-00191-7"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-022-3581-9"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517902"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.14778\/3570690.3570697"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.14778\/3489496.3489511"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_2_2_34_1","volume-title":"Dense-to-sparse gate for mixture-of-experts. CoRR abs\/2112.14397","author":"Nie Xiaonan","year":"2021","unstructured":"Xiaonan Nie, Shijie Cao, Xupeng Miao, Lingxiao Ma, Jilong Xue, Youshan Miao, Zichao Yang, Zhi Yang, and Bin Cui. 2021. Dense-to-sparse gate for mixture-of-experts. CoRR abs\/2112.14397 (2021)."},{"key":"e_1_2_2_35_1","volume-title":"Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent. arXiv preprint arXiv:2303.02868","author":"Nie Xiaonan","year":"2023","unstructured":"Xiaonan Nie, Yi Liu, Fangcheng Fu, Jinbao Xue, Dian Jiao, Xupeng Miao, Yangyu Tao, and Bin Cui. 2023. Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent. 
arXiv preprint arXiv:2303.02868 (2023)."},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE53745.2022.00241"},{"key":"e_1_2_2_37_1","volume-title":"HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System. CoRR abs\/2203.14685","author":"Nie Xiaonan","year":"2022","unstructured":"Xiaonan Nie, Pinxue Zhao, Xupeng Miao, Tong Zhao, and Bin Cui. 2022. HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System. CoRR abs\/2203.14685 (2022)."},{"key":"e_1_2_2_38_1","volume-title":"Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019), 8026--8037."},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-021-00155-3"},{"key":"e_1_2_2_40_1","volume-title":"Language Models are Unsupervised Multitask Learners. OpenAI blog 1, 8","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI blog 1, 8 (2019), 9."},{"key":"e_1_2_2_41_1","first-page":"1","article-title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21 (2020), 1--67.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_2_42_1","volume-title":"International Conference on Machine Learning. PMLR","author":"Rajbhandari Samyam","year":"2022","unstructured":"Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. In International Conference on Machine Learning. PMLR, 18332--18346."},{"key":"e_1_2_2_43_1","first-page":"8583","article-title":"Scaling vision with sparse mixture of experts","volume":"34","author":"Riquelme Carlos","year":"2021","unstructured":"Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, Andr\u00e9 Susano Pinto, Daniel Keysers, and Neil Houlsby. 2021. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems 34 (2021), 8583--8595.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_44_1","first-page":"17555","article-title":"Hash layers for large sparse models","volume":"34","author":"Roller Stephen","year":"2021","unstructured":"Stephen Roller, Sainbayar Sukhbaatar, Jason Weston, et al. 2021. Hash layers for large sparse models. Advances in Neural Information Processing Systems 34 (2021), 17555--17566.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_45_1","volume-title":"Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. 
In International Conference on Learning Representations.","author":"Shazeer Noam","year":"2017","unstructured":"Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In International Conference on Learning Representations."},{"key":"e_1_2_2_46_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017), 6000--6010."},{"key":"e_1_2_2_47_1","volume-title":"Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems 32","author":"Wang Alex","year":"2019","unstructured":"Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems 32 (2019), 3266--3280."},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-020-3137-8"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-022-2140-7"},{"key":"e_1_2_2_50_1","volume-title":"ST-MoE: Designing Stable and Transferable Sparse Expert Models. CoRR abs\/2202.08906","author":"Zoph Barret","year":"2022","unstructured":"Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. 2022. ST-MoE: Designing Stable and Transferable Sparse Expert Models. CoRR abs\/2202.08906 (2022)."}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3588964","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3588964","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:47:38Z","timestamp":1750178858000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3588964"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,26]]},"references-count":50,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,5,26]]}},"alternative-id":["10.1145\/3588964"],"URL":"https:\/\/doi.org\/10.1145\/3588964","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,5,26]]}}}
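The record above is a standard Crossref REST API "work" response; the same payload is publicly retrievable from https://api.crossref.org/works/10.1145/3588964. Below is a minimal Python sketch of how one might fetch this record and extract its human-readable fields. The User-Agent mailto address and the JATS tag-stripping regex are illustrative assumptions, not part of the record; urllib is used instead of a third-party HTTP client so the sketch stays dependency-free.

import json
import re
import urllib.request

# DOI taken from the record above; the Crossref works endpoint returns the
# same {"status": "ok", "message-type": "work", ...} envelope shown here.
DOI = "10.1145/3588964"
url = f"https://api.crossref.org/works/{DOI}"

# Crossref asks clients to identify themselves for its "polite" pool; the
# mailto address below is a placeholder assumption, not a real contact.
request = urllib.request.Request(
    url,
    headers={"User-Agent": "crossref-record-example/0.1 (mailto:you@example.org)"},
)
with urllib.request.urlopen(request) as response:
    work = json.load(response)["message"]

# "title" is a list of strings; "abstract" is embedded JATS XML
# (<jats:p> ... </jats:p>), so strip the tags with a simple regex.
title = work["title"][0]
abstract = re.sub(r"</?jats:[^>]*>", " ", work.get("abstract", "")).strip()

print(work["DOI"], "-", title)
print("Cited by:", work["is-referenced-by-count"])
print(abstract)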