{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T05:09:33Z","timestamp":1775538573322,"version":"3.50.1"},"reference-count":104,"publisher":"Association for Computing Machinery (ACM)","issue":"6","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2025,12,4]]},"abstract":"<jats:p>To optimize large Transformer model training, both efficient parallel computing and advanced data management are indispensable. However, current methods often assume a stable and uniform training workload, neglecting data-induced imbalances, arising from both the sampling and packing processes, which can impede training performance. Specifically, data sampling imbalance arises from the uneven sequence length distribution of the training data, while data packing imbalance stems from the discrepancy between the linear memory complexity and the quadratic time complexity of the attention mechanism. To address these imbalance issues, we develop Hydraulis, which jointly optimizes the parallel strategies and the data assignment. On the one hand, we introduce large model training with dynamic heterogeneous parallel strategies in response to sequence length variations within and across training iterations. On the other hand, we devise a two-stage data assignment approach, which balances the training workloads both within and across model replicas. Empirical results demonstrate that Hydraulis outperforms existing systems by 1.32-2.66\u00d7. 
Our source code is available: https:\/\/github.com\/PKU-DAIR\/Hetu.<\/jats:p>","DOI":"10.1145\/3769802","type":"journal-article","created":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T04:32:13Z","timestamp":1764995533000},"page":"1-30","source":"Crossref","is-referenced-by-count":0,"title":["Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment"],"prefix":"10.1145","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-5342-0194","authenticated-orcid":false,"given":"Haoyang","family":"Li","sequence":"first","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1658-0380","authenticated-orcid":false,"given":"Fangcheng","family":"Fu","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-1495-0499","authenticated-orcid":false,"given":"Sheng","family":"Lin","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-3367-7486","authenticated-orcid":false,"given":"Hao","family":"Ge","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-6458-7033","authenticated-orcid":false,"given":"Xuanyu","family":"Wang","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-0922-1942","authenticated-orcid":false,"given":"Jiawen","family":"Niu","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-4087-9873","authenticated-orcid":false,"given":"Jinbao","family":"Xue","sequence":"additional","affiliation":[{"name":"Tencent, Beijing, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-0536-4321","authenticated-orcid":false,"given":"Yangyu","family":"Tao","sequence":"additional","affiliation":[{"name":"Tencent, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-2330-6854","authenticated-orcid":false,"given":"Di","family":"Wang","sequence":"additional","affiliation":[{"name":"Tencent, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9658-5127","authenticated-orcid":false,"given":"Jie","family":"Jiang","sequence":"additional","affiliation":[{"name":"Tencent, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1681-4677","authenticated-orcid":false,"given":"Bin","family":"Cui","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,12,5]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 265-283","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 265-283. https:\/\/www.usenix.org\/conference\/osdi16\/technical-sessions\/presentation\/abadi."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00676"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620678.3624666"},{"key":"e_1_2_1_4_1","unstructured":"Alexei Baevski Yuhao Zhou Abdelrahman Mohamed and Michael Auli. 2020. 
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Advances in Neural Information Processing Systems (NeurIPS). https:\/\/arxiv.org\/abs\/2006.11477."},{"key":"e_1_2_1_5_1","volume-title":"Modern Distributed Data-Parallel Large-Scale Pre-training Strategies For NLP models. arXiv preprint arXiv:2206.06356","author":"Bai Hao","year":"2022","unstructured":"Hao Bai. 2022. Modern Distributed Data-Parallel Large-Scale Pre-training Strategies For NLP models. arXiv preprint arXiv:2206.06356 (2022). https:\/\/arxiv.org\/abs\/2206.06356."},{"key":"e_1_2_1_6_1","volume-title":"LongAlign: A Recipe for Long Context Alignment of Large Language Models. arXiv preprint arXiv:2401.18058","author":"Bai Yushi","year":"2024","unstructured":"Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. 2024. LongAlign: A Recipe for Long Context Alignment of Large Language Models. arXiv preprint arXiv:2401.18058 (2024). https:\/\/arxiv.org\/abs\/2401.18058."},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the International Conference on Machine Learning (ICML). https:\/\/arxiv.org\/abs\/2102","author":"Bertasius Gedas","year":"2021","unstructured":"Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is Space-Time Attention All You Need for Video Understanding?. In Proceedings of the International Conference on Machine Learning (ICML). https:\/\/arxiv.org\/abs\/2102.05095."},{"key":"e_1_2_1_8_1","volume-title":"Striped Attention: Faster Ring Attention for Causal Transformers. arXiv preprint arXiv:2311.09431","author":"Brandon William","year":"2023","unstructured":"William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley. 2023. Striped Attention: Faster Ring Attention for Causal Transformers. arXiv preprint arXiv:2311.09431 (2023). https:\/\/arxiv.org\/abs\/2311.09431."},{"key":"e_1_2_1_9_1","unstructured":"Tom B. Brown Benjamin Mann Nick Ryder et al. 2020. 
Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems (NeurIPS). https:\/\/arxiv.org\/abs\/2005.14165."},{"key":"e_1_2_1_10_1","volume-title":"End-to-End Object Detection with Transformers. In European Conference on Computer Vision (ECCV). https:\/\/arxiv.org\/abs\/2005","author":"Carion Nicolas","year":"2020","unstructured":"Nicolas Carion, Francisco Massa, Gabriel Synnaeve, et al., 2020. End-to-End Object Detection with Transformers. In European Conference on Computer Vision (ECCV). https:\/\/arxiv.org\/abs\/2005.12872."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620666.3651379"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","unstructured":"Cong Chen Paolo Penna and Yinfeng Xu. 2020. Online Scheduling of Jobs with Favorite Machines. http:\/\/dx.doi.org\/10.1016\/j.cor.2019.104868.","DOI":"10.1016\/j.cor.2019.104868"},{"key":"e_1_2_1_13_1","unstructured":"Qiaoling Chen Diandian Gu Guoteng Wang Xun Chen YingTong Xiong Ting Huang Qinghao Hu Xin Jin Yonggang Wen Tianwei Zhang and Peng Sun. 2024a. InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding. https:\/\/arxiv.org\/abs\/2401.09149."},{"key":"e_1_2_1_14_1","volume-title":"TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability. arXiv preprint arXiv:2411.18211","author":"Chen Shimin","year":"2024","unstructured":"Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, and Lin Ma. 2024b. TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability. arXiv preprint arXiv:2411.18211 (2024). https:\/\/arxiv.org\/abs\/2411.18211."},{"key":"e_1_2_1_15_1","doi-asserted-by":"crossref","unstructured":"Rohan Choudhury Guanglei Zhu Sihan Liu Koichiro Niinuma Kris M. Kitani and L\u00e1szl\u00f3 Jeni. 2024. Don't Look Twice: Faster Video Transformers with Run-Length Tokenization. 
In Advances in Neural Information Processing Systems (NeurIPS). Spotlight https:\/\/arxiv.org\/abs\/2411.05222.","DOI":"10.52202\/079017-0882"},{"key":"e_1_2_1_16_1","unstructured":"CodeParrot. [n.d.]. GitHub Dataset. https:\/\/huggingface.co\/datasets\/codeparrot\/github-code."},{"key":"e_1_2_1_17_1","volume-title":"FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv preprint arXiv:2307.08691","author":"Dao Tri","year":"2023","unstructured":"Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv preprint arXiv:2307.08691 (2023). https:\/\/arxiv.org\/abs\/2307.08691."},{"key":"e_1_2_1_18_1","doi-asserted-by":"crossref","unstructured":"Tri Dao Daniel Y. Fu Stefano Ermon Atri Rudra and Christopher R\u00e9. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS). https:\/\/arxiv.org\/abs\/2205.14135.","DOI":"10.52202\/068431-1189"},{"key":"e_1_2_1_19_1","unstructured":"Harm de Vries. 2023. In the long (context) run. https:\/\/www.harmdevries.com\/post\/context-length\/"},{"key":"e_1_2_1_20_1","volume-title":"Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. arXiv preprint arXiv:2307.06304","author":"Dehghani Mostafa","year":"2023","unstructured":"Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Gritsenko, Mario Lucic, and Neil Houlsby. 2023. Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. arXiv preprint arXiv:2307.06304 (2023). https:\/\/arxiv.org\/abs\/2307.06304."},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 
https:\/\/arxiv.org\/abs\/1810","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). https:\/\/arxiv.org\/abs\/1810.04805."},{"key":"e_1_2_1_22_1","volume-title":"Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https:\/\/arxiv.org\/abs\/1804","author":"Dong Li","year":"2018","unstructured":"Li Dong, Shuang Xu, and Bo Xu. 2018. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https:\/\/arxiv.org\/abs\/1804.07133."},{"key":"e_1_2_1_23_1","volume-title":"International Conference on Learning Representations (ICLR). https:\/\/arxiv.org\/abs\/2010","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al., 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR). https:\/\/arxiv.org\/abs\/2010.11929."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-024-3767-3"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2529989"},{"key":"e_1_2_1_26_1","unstructured":"Abhimanyu Dubey Abhinav Jauhri et al. 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783 (2024). https:\/\/arxiv.org\/abs\/2407.21783."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00675"},{"key":"e_1_2_1_28_1","volume-title":"USP: A Unified Sequence Parallelism Approach for Long Context Generative AI. 
arXiv preprint arXiv:2405.07719","author":"Fang Jiarui","year":"2024","unstructured":"Jiarui Fang and Shangchun Zhao. 2024. USP: A Unified Sequence Parallelism Approach for Long Context Generative AI. arXiv preprint arXiv:2405.07719 (2024). https:\/\/arxiv.org\/abs\/2405.07719."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3694715.3695969"},{"key":"e_1_2_1_30_1","volume-title":"LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism. arXiv preprint arXiv:2406.18485","author":"Gu Diandian","year":"2024","unstructured":"Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, Yonggang Wen, Tianwei Zhang, Xin Jin, and Xuanzhe Liu. 2024. LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism. arXiv preprint arXiv:2406.18485 (2024). https:\/\/arxiv.org\/abs\/2406.18485."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-024-3872-3"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-3015"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613157"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3582016.3582018"},{"key":"e_1_2_1_35_1","volume-title":"PipeDream: Fast and Efficient Pipeline Parallel DNN Training. arXiv preprint arXiv:1806.03377","author":"Harlap Aaron","year":"2018","unstructured":"Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. PipeDream: Fast and Efficient Pipeline Parallel DNN Training. arXiv preprint arXiv:1806.03377 (2018). https:\/\/arxiv.org\/abs\/1806.03377."},{"key":"e_1_2_1_36_1","unstructured":"Cunchen Hu Heyang Huang Liangliang Xu Xusheng Chen Jiang Xu Shuang Chen Hao Feng Chenxi Wang Sa Wang Yungang Bao Ninghui Sun and Yizhou Shan. 2024. 
Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads. https:\/\/arxiv.org\/abs\/2401.11181."},{"key":"e_1_2_1_37_1","first-page":"103","article-title":"GPipe: Efficient training of giant neural networks using pipeline parallelism","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems (NeurIPS). 103-112. https:\/\/proceedings.neurips.cc\/paper\/2019\/file\/093f65e080a295f8076b1c5722a46aa2-Paper.pdf.","journal-title":"Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_2_1_38_1","volume-title":"Samyam Rajbhandari, and Yuxiong He.","author":"Jacobs Sam Ade","year":"2023","unstructured":"Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. 2023. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. arXiv preprint arXiv:2309.14509 (2023). https:\/\/arxiv.org\/abs\/2309.14509."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3503222.3507778"},{"key":"e_1_2_1_40_1","volume-title":"Proceedings of the 2nd Conference on Machine Learning and Systems (MLSys). 1-13","author":"Jia Zhihao","year":"2019","unstructured":"Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond Data and Model Parallelism for Deep Neural Networks. In Proceedings of the 2nd Conference on Machine Learning and Systems (MLSys). 1-13. 
https:\/\/proceedings.mlsys.org\/paper_files\/paper\/2019\/file\/b422680f3db0986ddd7f8f126baaf0fa-Paper.pdf."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3627703.3629585"},{"key":"e_1_2_1_42_1","volume-title":"Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI). 745-760","author":"Jiang Ziheng","year":"2024","unstructured":"Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, and Xin Liu. 2024b. MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs. In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI). 745-760. https:\/\/www.usenix.org\/conference\/nsdi24\/presentation\/jiang-ziheng."},{"key":"e_1_2_1_43_1","volume-title":"Eichenberger","author":"Jin Tian","year":"2020","unstructured":"Tian Jin, Gheorghe-Teodor Bercea, Tung D. Le, Tong Chen, Gong Su, Haruki Imai, Yasushi Negishi, Anh Leu, Kevin O'Brien, Kiyokuni Kawachiya, and Alexandre E. Eichenberger. 2020. Compiling ONNX Neural Network Models Using MLIR. https:\/\/arxiv.org\/abs\/2008.08272."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNET.2024.3355010"},{"key":"e_1_2_1_45_1","volume-title":"Scaling laws for neural language models. arXiv preprint arXiv:2001.08361","author":"Kaplan Jared","year":"2020","unstructured":"Jared Kaplan, Sam McCandlish, Tom Henighan, Tom Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020). 
https:\/\/arxiv.org\/abs\/2001.08361."},{"key":"e_1_2_1_46_1","first-page":"551","article-title":"Torchgpipe: On-the-fly pipeline parallelism for training giant models","author":"Kim Chiheon","year":"2020","unstructured":"Chiheon Kim, Heungsub Lee, Myungryong Jeong, Woonhyuk Baek, Boogeon Yoon, Ildoo Kim, Sungbin Lim, and Sungwoong Kim. 2020. Torchgpipe: On-the-fly pipeline parallelism for training giant models. In Proceedings of Machine Learning and Systems (MLSys). 551-563. https:\/\/proceedings.mlsys.org\/paper\/2020\/file\/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.","journal-title":"Proceedings of Machine Learning and Systems (MLSys)."},{"key":"e_1_2_1_47_1","volume-title":"Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations (ICLR). https:\/\/arxiv.org\/abs\/1412","author":"Diederik","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations (ICLR). https:\/\/arxiv.org\/abs\/1412.6980."},{"key":"e_1_2_1_48_1","volume-title":"Reducing Activation Recomputation in Large Transformer Models. arXiv preprint arXiv:2205.05198","author":"Korthikanti Vijay","year":"2022","unstructured":"Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2022. Reducing Activation Recomputation in Large Transformer Models. arXiv preprint arXiv:2205.05198 (2022). https:\/\/arxiv.org\/abs\/2205.05198."},{"key":"e_1_2_1_49_1","volume-title":"Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance. arXiv preprint arXiv:2107.02027","author":"Krell Mario Michael","year":"2022","unstructured":"Mario Michael Krell, Matej Kosec, Sergio P. Perez, and Andrew Fitzgibbon. 2022. Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance. 
arXiv preprint arXiv:2107.02027 (2022). https:\/\/arxiv.org\/abs\/2107.02027."},{"key":"e_1_2_1_50_1","volume-title":"Laura Wynter, Raghu Kiran Ganti, and Mayank Mishra.","author":"Kundu Achintya","year":"2024","unstructured":"Achintya Kundu, Rhui Dih Lee, Laura Wynter, Raghu Kiran Ganti, and Mayank Mishra. 2024. Enhancing Training Efficiency Using Packing with Flash Attention. arXiv preprint arXiv:2407.09105 (2024). https:\/\/arxiv.org\/abs\/2407.09105."},{"key":"e_1_2_1_51_1","first-page":"48","volume-title":"Proceedings of Machine Learning and Systems, D. Song, M. Carbin, and T. Chen (Eds.)","volume":"5","author":"Lamy-Poirier Joel","year":"2023","unstructured":"Joel Lamy-Poirier. 2023. Breadth-First Pipeline Parallelism. In Proceedings of Machine Learning and Systems, D. Song, M. Carbin, and T. Chen (Eds.), Vol. 5. Curran, 48-67. https:\/\/proceedings.mlsys.org\/paper_files\/paper\/2023\/file\/24e845415c1486dd2d582a9d639237f9-Paper-mlsys2023.pdf."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO51591.2021.9370308"},{"key":"e_1_2_1_53_1","unstructured":"Dmitry Lepikhin HyoukJoong Lee Yuanzhong Xu Dehao Chen Orhan Firat Yanping Huang Maxim Krikun Noam Shazeer and Zhifeng Chen. 2020. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv:2006.16668 [cs.CL] https:\/\/arxiv.org\/abs\/2006.16668"},{"key":"e_1_2_1_54_1","volume-title":"DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training. arXiv preprint arXiv:2310.03294","author":"Li Dacheng","year":"2024","unstructured":"Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, Xuezhe Ma, Ion Stoica, Joseph E. Gonzalez, and Hao Zhang. 2024. DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training. arXiv preprint arXiv:2310.03294 (2024). https:\/\/arxiv.org\/abs\/2310.03294."},{"key":"e_1_2_1_55_1","unstructured":"Haoyang Li Fangcheng Fu Hao Ge Sheng Lin Xuanyu Wang Jiawen Niu Xupeng Miao and Bin Cui. 2025a. 
Hetu v2: A General and Scalable Deep Learning System with Hierarchical and Heterogeneous Single Program Multiple Data Annotations. arXiv:2504.20490 [cs.DC] https:\/\/arxiv.org\/abs\/2504.20490"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3725322"},{"key":"e_1_2_1_57_1","first-page":"2236","article-title":"PyTorch distributed: Experiences on accelerating data parallel training","author":"Li Shen","year":"2020","unstructured":"Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. 2020. PyTorch distributed: Experiences on accelerating data parallel training. In Proceedings of the VLDB Endowment. 2236-2248. https:\/\/arxiv.org\/abs\/2006.15704.","journal-title":"Proceedings of the VLDB Endowment."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2012.6168955"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.14778\/3742728.3742752"},{"key":"e_1_2_1_60_1","volume-title":"Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv preprint arXiv:2310.01889","author":"Liu Hao","year":"2023","unstructured":"Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023. Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv preprint arXiv:2310.01889 (2023). https:\/\/arxiv.org\/abs\/2310.01889."},{"key":"e_1_2_1_61_1","volume-title":"PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance. arXiv preprint arXiv:2411.02327","author":"Liu Ruyang","year":"2024","unstructured":"Ruyang Liu, Haoran Tang, Haibo Liu, Yixiao Ge, Ying Shan, Chen Li, and Jiankun Yang. 2024. PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance. arXiv preprint arXiv:2411.02327 (2024). 
https:\/\/arxiv.org\/abs\/2411.02327."},{"key":"e_1_2_1_62_1","volume-title":"The Thirteenth International Conference on Learning Representations (ICLR","author":"Liu Xinyi","year":"2025","unstructured":"Xinyi Liu, Yujie Wang, Fangcheng Fu, Xupeng Miao, Shenhan Zhu, Xiaonan Nie, and Bin Cui. 2025. NetMoE: Accelerating MoE Training through Dynamic Sample Placement. In The Thirteenth International Conference on Learning Representations (ICLR 2025). https:\/\/openreview.net\/forum?id=1qP3lsatCR"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/3652892.3700781"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.14778\/3570690.3570697"},{"key":"e_1_2_1_66_1","volume-title":"Mixed Precision Training. arXiv preprint arXiv:1710.03740","author":"Micikevicius Paulius","year":"2018","unstructured":"Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed Precision Training. arXiv preprint arXiv:1710.03740 (2018). https:\/\/arxiv.org\/abs\/1710.03740."},{"key":"e_1_2_1_67_1","unstructured":"Tom\u00e1\u0161 Mikolov Ilya Sutskever Anoop Deoras Hai-Son Le Stefan Kombrink and Jan Cernocky. 2012. Subword language modeling with neural networks. (2012). https:\/\/www.fit.vut.cz\/person\/imikolov\/public\/rnnlm\/char.pdf."},{"key":"e_1_2_1_68_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning (ICML). https:\/\/arxiv.org\/abs\/2006","author":"Narayanan Deepak","year":"2021","unstructured":"Deepak Narayanan, Keshav Santhanam, Aaron Harlap, et al., 2021a. Memory-Efficient Pipeline-Parallel DNN Training. In Proceedings of the 38th International Conference on Machine Learning (ICML). 
https:\/\/arxiv.org\/abs\/2006.09503."},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_2_1_70_1","unstructured":"NVIDIA. 2023a. Mixed Precision Training. https:\/\/docs.nvidia.com\/deeplearning\/performance\/mixed-precision-training\/index.html."},{"key":"e_1_2_1_71_1","unstructured":"NVIDIA. 2023b. Sequence Packing Optimization in NeMo. https:\/\/docs.nvidia.com\/nemo-framework\/user-guide\/latest\/nemotoolkit\/features\/optimizations\/sequence_packing.html."},{"key":"e_1_2_1_72_1","volume-title":"Splitwise: Efficient generative LLM inference using phase splitting. https:\/\/arxiv.org\/abs\/2311.18677.","author":"Patel Pratyush","year":"2024","unstructured":"Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, \u00cd\u00f1igo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative LLM inference using phase splitting. https:\/\/arxiv.org\/abs\/2311.18677."},{"key":"e_1_2_1_73_1","volume-title":"S. M. Mehedi Zaman, Vinija Jain, Aman Chadha, and Amitava Das.","author":"Pawar Saurav","year":"2024","unstructured":"Saurav Pawar, S. M. Towhidul Islam Tonmoy, S. M. Mehedi Zaman, Vinija Jain, Aman Chadha, and Amitava Das. 2024. The What, Why, and How of Context Length Extension Techniques in Large Language Models - A Detailed Survey. https:\/\/arxiv.org\/abs\/2401.07872."},{"key":"e_1_2_1_74_1","volume-title":"Zero Bubble Pipeline Parallelism. arXiv preprint arXiv:2401.10241","author":"Qi Penghui","year":"2023","unstructured":"Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2023. Zero Bubble Pipeline Parallelism. arXiv preprint arXiv:2401.10241 (2023). https:\/\/arxiv.org\/abs\/2401.10241."},{"key":"e_1_2_1_75_1","unstructured":"Colin Raffel Noam Shazeer Adam Roberts et al. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. In Journal of Machine Learning Research (JMLR). 
https:\/\/arxiv.org\/abs\/1910.10683."},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406703"},{"key":"e_1_2_1_78_1","volume-title":"Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He.","author":"Ren Jie","year":"2021","unstructured":"Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. https:\/\/arxiv.org\/abs\/2101.06840."},{"key":"e_1_2_1_79_1","volume-title":"Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 69-87","author":"Shan Yizhou","year":"2018","unstructured":"Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang. 2018. LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 69-87. https:\/\/www.usenix.org\/conference\/osdi18\/presentation\/shan."},{"key":"e_1_2_1_80_1","volume-title":"Yong Jae Lee, and Yan Yan","author":"Shang Yuzhang","year":"2024","unstructured":"Yuzhang Shang, Bingxin Xu, Weitai Kang, Mu Cai, Yuheng Li, Zehao Wen, Zhen Dong, Kurt Keutzer, Yong Jae Lee, and Yan Yan. 2024. Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner. arXiv preprint arXiv:2409.12963 (2024). https:\/\/arxiv.org\/abs\/2409.12963."},{"key":"e_1_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.5555\/3113606.3113856"},{"key":"e_1_2_1_82_1","volume-title":"Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. 
Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019). https:\/\/arxiv.org\/abs\/1909.08053."},{"key":"e_1_2_1_83_1","unstructured":"TIIUAE. [n.d.]. CommonCrawl Dataset. https:\/\/huggingface.co\/datasets\/tiiuae\/falcon-refinedweb."},{"key":"e_1_2_1_84_1","unstructured":"Hugo Touvron Louis Martin Alex Stone Peter Albert Amjad Almahairi Yasmine Babaei Denis Bashlykov Siddharth Batra Akhilesh Bhargava Shruti Bhosale et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023). https:\/\/arxiv.org\/abs\/2302.13971."},{"key":"e_1_2_1_85_1","first-page":"5998","article-title":"Attention is all you need","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS). 5998-6008. https:\/\/papers.nips.cc\/paper\/7181-attention-is-all-you-need.pdf.","journal-title":"Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_2_1_86_1","volume-title":"Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, and Yuxiong He.","author":"Wang Guanhua","year":"2023","unstructured":"Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, and Yuxiong He. 2023a. ZeRO: Extremely Efficient Collective Communication for Giant Model Training. https:\/\/arxiv.org\/abs\/2306.10209."},{"key":"e_1_2_1_87_1","volume-title":"Packing Analysis: Packing Is More Appropriate for Large Models or Datasets in Supervised Fine-tuning. arXiv preprint arXiv:2410.08081","author":"Wang Shuhe","year":"2024","unstructured":"Shuhe Wang, Guoyin Wang, Yizhong Wang, Jiwei Li, Eduard Hovy, and Chen Guo. 2024b. 
Packing Analysis: Packing Is More Appropriate for Large Models or Datasets in Supervised Fine-tuning. arXiv preprint arXiv:2410.08081 (2024). https:\/\/arxiv.org\/abs\/2410.08081."},{"key":"e_1_2_1_88_1","doi-asserted-by":"publisher","DOI":"10.1145\/3567955.3567959"},{"key":"e_1_2_1_89_1","doi-asserted-by":"publisher","DOI":"10.1109\/tkde.2024.3370614"},{"key":"e_1_2_1_90_1","doi-asserted-by":"publisher","DOI":"10.1145\/3676641.3715998"},{"key":"e_1_2_1_91_1","doi-asserted-by":"publisher","DOI":"10.1145\/3676641.3715992"},{"key":"e_1_2_1_92_1","unstructured":"Houming Wu Ling Chen and Wenjie Yu. 2024. BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training. https:\/\/arxiv.org\/abs\/2410.19367."},{"key":"e_1_2_1_93_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-024-4222-0"},{"key":"e_1_2_1_94_1","unstructured":"Yifei Xia Suhan Ling Fangcheng Fu Yujie Wang Huixia Li Xuefeng Xiao and Bin Cui. 2025. Training-free and Adaptive Sparse Attention for Efficient Long Video Generation. arXiv:2502.21079 [cs.CV] https:\/\/arxiv.org\/abs\/2502.21079."},{"key":"e_1_2_1_95_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-025-00296-9"},{"key":"e_1_2_1_96_1","doi-asserted-by":"publisher","DOI":"10.1145\/322186.322187"},{"key":"e_1_2_1_97_1","volume-title":"8th International Conference on Learning Representations (ICLR","author":"You Yang","year":"2020","unstructured":"Yang You, Zhao Zhang, Cho-Jui Hsieh, et al., 2020. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. In 8th International Conference on Learning Representations (ICLR 2020). https:\/\/openreview.net\/forum?id=Syx4wnEtvH."},{"key":"e_1_2_1_98_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-023-3956-3"},{"key":"e_1_2_1_99_1","volume-title":"SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile. 
arXiv preprint arXiv:2411.00284","author":"Zhang Ruisi","year":"2024","unstructured":"Ruisi Zhang, Tianyu Liu, Will Feng, Andrew Gu, Sanket Purandare, Wanchao Liang, and Francisco Massa. 2024a. SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile. arXiv preprint arXiv:2411.00284 (2024). https:\/\/arxiv.org\/abs\/2411.00284."},{"key":"e_1_2_1_100_1","doi-asserted-by":"publisher","DOI":"10.1145\/3709703"},{"key":"e_1_2_1_101_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611540.3611569"},{"key":"e_1_2_1_102_1","volume-title":"Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 559-578","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Joseph E. Gonzalez, and Ion Stoica. 2022. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 559-578. https:\/\/www.usenix.org\/conference\/osdi22\/presentation\/zheng."},{"key":"e_1_2_1_103_1","volume-title":"Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 193-210","author":"Zhong Yinmin","year":"2024","unstructured":"Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 193-210. 
https:\/\/www.usenix.org\/conference\/osdi24\/presentation\/zhong-yinmin."},{"key":"e_1_2_1_104_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-023-00235-6"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3769802","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T04:27:20Z","timestamp":1775536040000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3769802"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,4]]},"references-count":104,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2025,12,4]]}},"alternative-id":["10.1145\/3769802"],"URL":"https:\/\/doi.org\/10.1145\/3769802","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,4]]}}}