{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T19:01:22Z","timestamp":1774983682633,"version":"3.50.1"},"reference-count":104,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2025,2,10]],"date-time":"2025-02-10T00:00:00Z","timestamp":1739145600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2025,2,10]]},"abstract":"<jats:p>Nowadays, Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. However, long context training poses great challenges considering the constraint of GPU memory. It not only leads to substantial activation memory consumption during training, but also incurs considerable memory fragmentation. To facilitate long context training, existing frameworks have adopted strategies such as recomputation and various forms of parallelisms. Nevertheless, these techniques rely on redundant computation or extensive communication, resulting in low Model FLOPS Utilization (MFU). In this paper, we propose MEMO, a novel LLM training framework designed for fine-grained activation memory management. Given the quadratic scaling of computation and linear scaling of memory with sequence lengths when using FlashAttention, we offload memory-consuming activations to CPU memory after each layer's forward pass and fetch them during the backward pass. To maximize the swapping of activations without hindering computation, and to avoid exhausting limited CPU memory, we implement a token-wise activation recomputation and swapping mechanism. Furthermore, we tackle the memory fragmentation issue by employing a bi-level Mixed Integer Programming (MIP) approach, optimizing memory reuse across transformer layers. Empirical results demonstrate that MEMO achieves an average of 1.97x and 1.80x MFU compared to Megatron-LM and DeepSpeed, respectively. This improvement is attributed to MEMO's ability to minimize memory fragmentation, reduce recomputation and intensive communication, and circumvent the delays associated with the memory reorganization process due to fragmentation. 
By leveraging fine-grained activation memory management, MEMO facilitates efficient training of 7B LLM with 1 million sequence length on just 8 A800 GPUs, achieving an MFU of 52.30%.<\/jats:p>","DOI":"10.1145\/3709703","type":"journal-article","created":{"date-parts":[[2025,2,11]],"date-time":"2025-02-11T15:45:06Z","timestamp":1739288706000},"page":"1-28","source":"Crossref","is-referenced-by-count":2,"title":["MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training"],"prefix":"10.1145","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-9492-9250","authenticated-orcid":false,"given":"Pinxue","family":"Zhao","sequence":"first","affiliation":[{"name":"School of Computer Science, Peking University, Beijing, China, &amp; Key Lab of High Confidence Software Technologies (MOE), Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-4188-7742","authenticated-orcid":false,"given":"Hailin","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Computer Science, Peking University, Beijing, China, &amp; Key Lab of High Confidence Software Technologies (MOE), Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1658-0380","authenticated-orcid":false,"given":"Fangcheng","family":"Fu","sequence":"additional","affiliation":[{"name":"School of Computer Science, Peking University, Beijing, China, &amp; Key Lab of High Confidence Software Technologies (MOE), Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6766-757X","authenticated-orcid":false,"given":"Xiaonan","family":"Nie","sequence":"additional","affiliation":[{"name":"School of Computer Science, Peking University, Beijing, China, &amp; Key Lab of High Confidence Software Technologies (MOE), Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-3871-5105","authenticated-orcid":false,"given":"Qibin","family":"Liu","sequence":"additional","affiliation":[{"name":"Tencent Inc., Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-0529-8128","authenticated-orcid":false,"given":"Fang","family":"Yang","sequence":"additional","affiliation":[{"name":"Tencent Inc., Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-6416-0943","authenticated-orcid":false,"given":"Yuanbo","family":"Peng","sequence":"additional","affiliation":[{"name":"Tencent Inc., Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-1540-971X","authenticated-orcid":false,"given":"Dian","family":"Jiao","sequence":"additional","affiliation":[{"name":"Tencent Inc., Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-6325-7114","authenticated-orcid":false,"given":"Shuaipeng","family":"Li","sequence":"additional","affiliation":[{"name":"Tencent Inc., Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-4087-9873","authenticated-orcid":false,"given":"Jinbao","family":"Xue","sequence":"additional","affiliation":[{"name":"Tencent Inc., Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-0536-4321","authenticated-orcid":false,"given":"Yangyu","family":"Tao","sequence":"additional","affiliation":[{"name":"Tencent Inc., Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1681-4677","authenticated-orcid":false,"given":"Bin","family":"Cui","sequence":"additional","affiliation":[{"name":"School of Computer Science, Peking University, Beijing, China, Key Lab of High Confidence Software Technologies (MOE), Peking University, Beijing, China, &amp; Institute of Computational Social Science, Peking University, 
Qingdao, China"}]}],"member":"320","published-online":{"date-parts":[[2025,2,11]]},"reference":[{"key":"e_1_2_2_1_1","unstructured":"Moonshot AI. 2024. KimiChat. https:\/\/kimi.moonshot.cn\/"},{"key":"e_1_2_2_2_1","volume-title":"The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4. arxiv preprint, 2311.07361","author":"Microsoft Research AI4Science","year":"2023","unstructured":"Microsoft Research AI4Science and Microsoft Azure Quantum. 2023. The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4. arxiv preprint, 2311.07361 (2023)."},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640366"},{"key":"e_1_2_2_4_1","volume-title":"Longformer: The Long-Document Transformer. arxiv preprint, 2004.05150","author":"Beltagy Iz","year":"2020","unstructured":"Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. arxiv preprint, 2004.05150 (2020)."},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-023-06185-3"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.7554\/eLife.82819"},{"key":"e_1_2_2_7_1","volume-title":"InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding. arxiv preprint, 2401.09149","author":"Chen Qiaoling","year":"2024","unstructured":"Qiaoling Chen, Diandian Gu, Guoteng Wang, Xun Chen, YingTong Xiong, Ting Huang, Qinghao Hu, Xin Jin, Yonggang Wen, Tianwei Zhang, and Peng Sun. 2024. InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding. arxiv preprint, 2401.09149 (2024)."},{"key":"e_1_2_2_8_1","volume-title":"Training Deep Nets with Sublinear Memory Cost. arxiv preprint, 1604.06174","author":"Chen Tianqi","year":"2016","unstructured":"Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training Deep Nets with Sublinear Memory Cost. arxiv preprint, 1604.06174 (2016)."},{"key":"e_1_2_2_9_1","unstructured":"Alibaba Cloud. 2024. Tongyi Qianwen. https:\/\/tongyi.aliyun.com\/qianwen\/"},{"key":"e_1_2_2_10_1","unstructured":"PyTorch Contributors. 2023. Understanding CUDA Memory Usage. https:\/\/pytorch.org\/docs\/stable\/torch_cuda_memory.html."},{"key":"e_1_2_2_11_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Dao Tri","year":"2024","unstructured":"Tri Dao. 2024. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_12_1","doi-asserted-by":"crossref","unstructured":"Tri Dao Daniel Y. Fu Stefano Ermon Atri Rudra and Christopher R\u00e9. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems (NeurIPS).","DOI":"10.52202\/068431-1189"},{"key":"e_1_2_2_13_1","volume-title":"Advances in Neural Information Processing Systems (NeurIPS).","author":"Dean Jeffrey","year":"2012","unstructured":"Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, and Andrew Y. Ng. 2012. Large Scale Distributed Deep Networks. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_2_2_14_1","volume-title":"Proceedings of the International Conference on Machine Learning (ICML).","author":"Ding Yiran","year":"2024","unstructured":"Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. 2024. LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens. In Proceedings of the International Conference on Machine Learning (ICML)."},{"key":"e_1_2_2_15_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3329785.3329918"},{"key":"e_1_2_2_17_1","volume-title":"USP: A Unified Sequence Parallelism Approach for Long Context Generative AI. arxiv preprint, 2405.07719","author":"Fang Jiarui","year":"2024","unstructured":"Jiarui Fang and Shangchun Zhao. 2024. USP: A Unified Sequence Parallelism Approach for Long Context Generative AI. arxiv preprint, 2405.07719 (2024)."},{"key":"e_1_2_2_18_1","volume-title":"Proceedings of the International Conference on Machine Learning (ICML).","author":"Fu Yao","year":"2025","unstructured":"Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. 2025. Data engineering for scaling language models to 128K context. In Proceedings of the International Conference on Machine Learning (ICML)."},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3654977"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3694715.3695969"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-024-3872-3"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640423"},{"key":"e_1_2_2_23_1","volume-title":"DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence. arxiv preprint, 2401.14196","author":"Guo Daya","year":"2024","unstructured":"Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024b. DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence. arxiv preprint, 2401.14196 (2024)."},{"key":"e_1_2_2_24_1","unstructured":"Gurobi Optimization LLC. 2024. Gurobi Optimizer Reference Manual."},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3524059.3532394"},{"key":"e_1_2_2_26_1","volume-title":"Advances in Neural Information Processing Systems (NeurIPS).","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Xu Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517848"},{"key":"e_1_2_2_28_1","volume-title":"DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. arxiv preprint, 2309.14509","author":"Jacobs Sam Ade","year":"2023","unstructured":"Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. 2023. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. arxiv preprint, 2309.14509 (2023)."},{"key":"e_1_2_2_29_1","unstructured":"Albert Q. 
Jiang Alexandre Sablayrolles Arthur Mensch Chris Bamford Devendra Singh Chaplot Diego de Las Casas Florian Bressand Gianna Lengyel Guillaume Lample Lucile Saulnier L\u00e9lio Renard Lavaud Marie-Anne Lachaux Pierre Stock Teven Le Scao Thibaut Lavril Thomas Wang Timoth\u00e9e Lacroix and William El Sayed. 2023. Mistral 7B. arxiv preprint 2310.06825 (2023)."},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/TBDATA.2019.2921572"},{"key":"e_1_2_2_31_1","unstructured":"Dhiraj D. Kalamkar Dheevatsa Mudigere Naveen Mellempudi Dipankar Das Kunal Banerjee Sasikanth Avancha Dharma Teja Vooturi Nataraj Jammalamadaka Jianyu Huang Hector Yuen Jiyan Yang Jongsoo Park Alexander Heinecke Evangelos Georganas Sudarshan Srinivasan Abhisek Kundu Misha Smelyanskiy Bharat Kaul and Pradeep Dubey. 2019. A Study of BFLOAT16 for Deep Learning Training. arxiv preprint 1905.12322 (2019)."},{"key":"e_1_2_2_32_1","volume-title":"Proceedings of the International Conference on Machine Learning (ICML).","author":"Katharopoulos Angelos","year":"2020","unstructured":"Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Fran\u00e7ois Fleuret. 2020. Transformers are RNNs: fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning (ICML)."},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3375661"},{"key":"e_1_2_2_34_1","volume-title":"Dynamic Tensor Rematerialization. In International Conference on Learning Representations (ICLR).","author":"Kirisame Marisa","year":"2021","unstructured":"Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, and Zachary Tatlock. 2021. Dynamic Tensor Rematerialization. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_35_1","volume-title":"Reducing Activation Recomputation in Large Transformer Models. arxiv preprint, 2205.05198","author":"Korthikanti Vijay","year":"2022","unstructured":"Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2022. Reducing Activation Recomputation in Large Transformer Models. arxiv preprint, 2205.05198 (2022)."},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.14778\/3467861.3467869"},{"key":"e_1_2_2_37_1","volume-title":"LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers. arxiv preprint, 2310.03294","author":"Li Dacheng","year":"2023","unstructured":"Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. 2023a. LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers. arxiv preprint, 2310.03294 (2023)."},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2640087.2644155"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588737"},{"key":"e_1_2_2_40_1","volume-title":"Clinical-Longformer and Clinical-BigBird: Transformers for long clinical sequences. arxiv preprint, 2201.11838","author":"Li Yikuan","year":"2022","unstructured":"Yikuan Li, Ramsey M. Wehbe, Faraz S. Ahmad, Hanyin Wang, and Yuan Luo. 2022. Clinical-Longformer and Clinical-BigBird: Transformers for long clinical sequences. arxiv preprint, 2201.11838 (2022)."},{"key":"e_1_2_2_41_1","volume-title":"World Model on Million-Length Video And Language With Blockwise RingAttention. 
arxiv preprint, 2402.08268","author":"Liu Hao","year":"2024","unstructured":"Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. 2024b. World Model on Million-Length Video And Language With Blockwise RingAttention. arxiv preprint, 2402.08268 (2024)."},{"key":"e_1_2_2_42_1","volume-title":"Ring Attention with Blockwise Transformers for Near-Infinite Context. arxiv preprint, 2310.01889","author":"Liu Hao","year":"2023","unstructured":"Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023. Ring Attention with Blockwise Transformers for Near-Infinite Context. arxiv preprint, 2310.01889 (2023)."},{"key":"e_1_2_2_43_1","volume-title":"Scaling Laws of RoPE-based Extrapolation. In International Conference on Learning Representations (ICLR).","author":"Liu Xiaoran","year":"2024","unstructured":"Xiaoran Liu, Hang Yan, Chenxin An, Xipeng Qiu, and Dahua Lin. 2024a. Scaling Laws of RoPE-based Extrapolation. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626246.3654683"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.14778\/3598581.3598604"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.14778\/3570690.3570697"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.14778\/3489496.3489511"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2024\/904"},{"key":"e_1_2_2_49_1","volume-title":"Mixed Precision Training. In International Conference on Learning Representations (ICLR).","author":"Micikevicius Paulius","year":"2018","unstructured":"Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed Precision Training. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.14778\/3446095.3446100"},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476311.3476374"},{"key":"e_1_2_2_52_1","volume-title":"Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI).","author":"Nakandala Supun","year":"2020","unstructured":"Supun Nakandala, Karla Saur, Gyeong-In Yu, Konstantinos Karanasos, Carlo Curino, Markus Weimer, and Matteo Interlandi. 2020. A Tensor Compiler for Unified Machine Learning Prediction Serving. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI)."},{"key":"e_1_2_2_53_1","volume-title":"Proceedings of the International Conference on Machine Learning (ICML).","author":"Narayanan Deepak","year":"2021","unstructured":"Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2021. Memory-Efficient Pipeline-Parallel DNN Training. In Proceedings of the International Conference on Machine Learning (ICML)."},{"key":"e_1_2_2_54_1","volume-title":"Proceedings of the International Conference on Machine Learning (ICML).","author":"Nguyen Tung","year":"2023","unstructured":"Tung Nguyen, Johannes Brandstetter, Ashish Kapoor, Jayesh K. Gupta, and Aditya Grover. 2023. ClimaX: A foundation model for weather and climate. In Proceedings of the International Conference on Machine Learning (ICML)."},{"key":"e_1_2_2_55_1","unstructured":"NVIDIA. 2024a. Context Parallelism. https:\/\/docs.nvidia.com\/megatron-core\/developer-guide\/latest\/api-guide\/context_parallel.html."},{"key":"e_1_2_2_56_1","unstructured":"NVIDIA. 2024b. Transformer Engine. 
https:\/\/github.com\/NVIDIA\/TransformerEngine."},{"key":"e_1_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-024-4125-9"},{"key":"e_1_2_2_58_1","unstructured":"OpenAI. 2023. GPT-4 Technical Report."},{"key":"e_1_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137629"},{"key":"e_1_2_2_60_1","volume-title":"PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (NeurIPS).","author":"Paszke Adam","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas K\u00f6pf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_2_2_61_1","volume-title":"Scalable Diffusion Models with Transformers. In IEEE\/CVF International Conference on Computer Vision (ICCV).","author":"Peebles William","year":"2023","unstructured":"William Peebles and Saining Xie. 2023. Scalable Diffusion Models with Transformers. In IEEE\/CVF International Conference on Computer Vision (ICCV)."},{"key":"e_1_2_2_62_1","volume-title":"YaRN: Efficient Context Window Extension of Large Language Models. In International Conference on Learning Representations (ICLR).","author":"Peng Bowen","year":"2024","unstructured":"Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2024. YaRN: Efficient Context Window Extension of Large Language Models. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378505"},{"key":"e_1_2_2_64_1","volume-title":"Dynamic Materialization of Query Views for Data Warehouse Workloads. In IEEE International Conference on Data Engineering (ICDE).","author":"Phan Thomas","year":"2008","unstructured":"Thomas Phan and Wen-Syan Li. 2008. Dynamic Materialization of Query Views for Data Warehouse Workloads. In IEEE International Conference on Data Engineering (ICDE)."},{"key":"e_1_2_2_65_1","volume-title":"Proceedings of the Conference on Machine Learning and Systems (MLSys).","author":"Pope Reiner","year":"2023","unstructured":"Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently Scaling Transformer Inference. In Proceedings of the Conference on Machine Learning and Systems (MLSys)."},{"key":"e_1_2_2_66_1","volume-title":"Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. In International Conference on Learning Representations (ICLR).","author":"Press Ofir","year":"2022","unstructured":"Ofir Press, Noah A. Smith, and Mike Lewis. 2022. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_67_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Qin Zhen","year":"2022","unstructured":"Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, and Yiran Zhong. 2022. cosFormer: Rethinking Softmax In Attention. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_68_1","volume-title":"ZeRO: Memory Optimization Towards Training A Trillion Parameter Models. arxiv preprint, 1910.02054","author":"Rajbhandari Samyam","year":"2019","unstructured":"Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2019. ZeRO: Memory Optimization Towards Training A Trillion Parameter Models. arxiv preprint, 1910.02054 (2019)."},{"key":"e_1_2_2_69_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406703"},{"key":"e_1_2_2_70_1","volume-title":"ZeRO-Offload: Democratizing Billion-Scale Model Training. arxiv preprint, 2101.06840","author":"Ren Jie","year":"2021","unstructured":"Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. arxiv preprint, 2101.06840 (2021)."},{"key":"e_1_2_2_71_1","volume-title":"International Symposium on Microarchitecture (MICRO).","author":"Rhu Minsoo","unstructured":"Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. 2016. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In International Symposium on Microarchitecture (MICRO)."},{"key":"e_1_2_2_72_1","unstructured":"Baptiste Rozi\u00e8re Jonas Gehring Fabian Gloeckle Sten Sootla Itai Gat Xiaoqing Ellen Tan Yossi Adi Jingyu Liu Tal Remez J\u00e9r\u00e9my Rapin Artyom Kozhevnikov Ivan Evtimov Joanna Bitton Manish Bhatt Cristian Canton-Ferrer Aaron Grattafiori Wenhan Xiong Alexandre D\u00e9fossez Jade Copet Faisal Azhar Hugo Touvron Louis Martin Nicolas Usunier Thomas Scialom and Gabriel Synnaeve. 2023. Code Llama: Open Foundation Models for Code. arxiv preprint 2308.12950 (2023)."},{"key":"e_1_2_2_73_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.aiopen.2022.01.001"},{"key":"e_1_2_2_74_1","doi-asserted-by":"publisher","DOI":"10.14778\/3436905.3436927"},{"key":"e_1_2_2_75_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588717"},{"key":"e_1_2_2_76_1","volume-title":"Profile-guided memory optimization for deep neural networks. arxiv preprint, 1804.10001","author":"Sekiyama Taro","year":"2018","unstructured":"Taro Sekiyama, Takashi Imamichi, Haruki Imai, and Rudy Raymond. 2018. Profile-guided memory optimization for deep neural networks. arxiv preprint, 1804.10001 (2018)."},{"key":"e_1_2_2_77_1","volume-title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arxiv preprint, 1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arxiv preprint, 1909.08053 (2019)."},{"key":"e_1_2_2_78_1","volume-title":"OLLA: Optimizing the Lifetime and Location of Arrays to Reduce the Memory Usage of Neural Networks. arxiv preprint, 2210.12924","author":"Steiner Benoit","year":"2022","unstructured":"Benoit Steiner, Mostafa Elhoushi, Jacob Kahn, and James Hegarty. 2022. OLLA: Optimizing the Lifetime and Location of Arrays to Reduce the Memory Usage of Neural Networks. arxiv preprint, 2210.12924 (2022)."},{"key":"e_1_2_2_79_1","volume-title":"Alpaca: A strong, replicable instruction-following model","author":"Taori Rohan","year":"2023","unstructured":"Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https:\/\/crfm.stanford.edu\/2023\/03\/13\/alpaca.html (2023)."},{"key":"e_1_2_2_80_1","unstructured":"Together.ai. 2023. LLaMA-2-7B-32K. https:\/\/huggingface.co\/togethercomputer\/LLaMA-2-7B-32K"},{"key":"e_1_2_2_81_1","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton-Ferrer Moya Chen Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang Angela Fan Melanie Kambadur Sharan Narang Aur\u00e9lien Rodriguez Robert Stojnic Sergey Edunov and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arxiv preprint 2307.09288 (2023)."},{"key":"e_1_2_2_82_1","doi-asserted-by":"publisher","DOI":"10.14778\/3579075.3579083"},{"key":"e_1_2_2_83_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-022-00202-7"},{"key":"e_1_2_2_84_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732967.2732976"},{"key":"e_1_2_2_85_1","volume-title":"Linformer: Self-Attention with Linear Complexity. arxiv preprint, 2006.04768","author":"Wang Sinong","year":"2020","unstructured":"Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-Attention with Linear Complexity. arxiv preprint, 2006.04768 (2020)."},{"key":"e_1_2_2_86_1","article-title":"Improving Automatic Parallel Training via Balanced Memory Workload Optimization","volume":"36","author":"Wang Yujie","year":"2024","unstructured":"Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, and Bin Cui. 2024. Improving Automatic Parallel Training via Balanced Memory Workload Optimization. IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 36 (2024).","journal-title":"IEEE Transactions on Knowledge and Data Engineering (TKDE)"},{"key":"e_1_2_2_87_1","volume-title":"Nystr\u00f6mformer: A Nystr\u00f6m-based Algorithm for Approximating Self-Attention. In The AAAI Conference on Artificial Intelligence (AAAI).","author":"Xiong Yunyang","year":"2021","unstructured":"Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. 2021. Nystr\u00f6mformer: A Nystr\u00f6m-based Algorithm for Approximating Self-Attention. In The AAAI Conference on Artificial Intelligence (AAAI)."},{"key":"e_1_2_2_88_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-022-1457-6"},{"key":"e_1_2_2_89_1","doi-asserted-by":"publisher","DOI":"10.14778\/3457390.3457399"},{"key":"e_1_2_2_90_1","volume-title":"Proceedings of the USENIX Conference on Networked Systems Design and Implementation (NSDI).","author":"Zaharia Matei","year":"2012","unstructured":"Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. 
In Proceedings of the USENIX Conference on Networked Systems Design and Implementation (NSDI)."},{"key":"e_1_2_2_91_1","doi-asserted-by":"publisher","DOI":"10.1145\/2934664"},{"key":"e_1_2_2_92_1","volume-title":"Big Bird: Transformers for Longer Sequences. In Advances in Neural Information Processing Systems (NeurIPS).","author":"Zaheer Manzil","year":"2020","unstructured":"Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Onta\u00f1\u00f3n, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big Bird: Transformers for Longer Sequences. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_2_2_93_1","doi-asserted-by":"publisher","DOI":"10.1145\/3639306"},{"key":"e_1_2_2_94_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-023-3956-3"},{"key":"e_1_2_2_95_1","doi-asserted-by":"publisher","DOI":"10.14778\/3636218.3636234"},{"key":"e_1_2_2_96_1","volume-title":"Coop: Memory is not a Commodity. In Advances in Neural Information Processing Systems (NeurIPS).","author":"Zhang Jianhao","year":"2023","unstructured":"Jianhao Zhang, Shihan Ma, Peihong Liu, and Jinhui Yuan. 2023a. Coop: Memory is not a Commodity. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_2_2_97_1","doi-asserted-by":"publisher","DOI":"10.14778\/3561261.3561265"},{"key":"e_1_2_2_98_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-981-99-7587-7"},{"key":"e_1_2_2_99_1","doi-asserted-by":"publisher","DOI":"10.14778\/3570690.3570703"},{"key":"e_1_2_2_100_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611540.3611569"},{"key":"e_1_2_2_101_1","unstructured":"Zangwei Zheng Xiangyu Peng Tianji Yang Chenhui Shen Shenggui Li Hongxin Liu Yukun Zhou Tianyi Li and Yang You. 2024. Open-Sora: Democratizing Efficient Video Production for All. https:\/\/github.com\/hpcaitech\/Open-Sora"},{"key":"e_1_2_2_102_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-023-00235-6"},{"key":"e_1_2_2_103_1","unstructured":"Wenhao Zhu Hongyi Liu Qingxiu Dong Jingjing Xu Shujian Huang Lingpeng Kong Jiajun Chen and Lei Li. 2024. Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis. In Findings of the Association for Computational Linguistics (NAACL)."},{"key":"e_1_2_2_104_1","unstructured":"Martin Zinkevich Markus Weimer Alexander J. Smola and Lihong Li. 2010. Parallelized Stochastic Gradient Descent. In Advances in Neural Information Processing Systems (NeurIPS)."}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3709703","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3709703","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T18:18:58Z","timestamp":1774981138000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3709703"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,10]]},"references-count":104,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,2,10]]}},"alternative-id":["10.1145\/3709703"],"URL":"https:\/\/doi.org\/10.1145\/3709703","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,10]]}}}
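
The abstract above describes MEMO's core mechanism: offload each transformer layer's activations to CPU memory after that layer's forward pass, then fetch them back before its backward pass, overlapping the copies with computation. The following is a minimal PyTorch sketch of that generic offload-and-prefetch pattern, assuming a single side CUDA stream and pinned host buffers. It is illustrative only, not the authors' MEMO implementation: the ActivationOffloader class and its offload/prefetch names are hypothetical, and the paper's token-wise recomputation scheduling and bi-level MIP memory planning are omitted.

import torch

class ActivationOffloader:
    """Moves saved activations to pinned CPU memory on a side CUDA stream."""

    def __init__(self):
        self.side_stream = torch.cuda.Stream()

    def offload(self, act):
        # Device-to-host copy after a layer's forward pass; returns the pinned CPU buffer.
        cpu_buf = torch.empty(act.shape, dtype=act.dtype, pin_memory=True)
        # The side stream must wait until the compute stream has finished writing `act`.
        self.side_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.side_stream):
            cpu_buf.copy_(act, non_blocking=True)
        # Keep `act`'s GPU memory alive until the side-stream copy completes.
        act.record_stream(self.side_stream)
        return cpu_buf

    def prefetch(self, cpu_buf):
        # Host-to-device copy, ideally issued one layer ahead of the backward pass
        # so that it hides behind compute.
        with torch.cuda.stream(self.side_stream):
            gpu = cpu_buf.to("cuda", non_blocking=True)
        # The compute stream waits until the prefetched tensor has landed.
        torch.cuda.current_stream().wait_stream(self.side_stream)
        gpu.record_stream(torch.cuda.current_stream())
        return gpu

if __name__ == "__main__" and torch.cuda.is_available():
    off = ActivationOffloader()
    act = torch.randn(2, 4096, 4096, device="cuda")  # stand-in for one layer's activations
    cpu_copy = off.offload(act)
    del act                                          # this GPU block can now be reused
    restored = off.prefetch(cpu_copy)
    print(restored.shape, restored.device)

In a real training loop, such an offloader would plausibly be wired into autograd's saved-tensor mechanism (for example via torch.autograd.graph.saved_tensors_hooks) so that each layer's activations are swapped out on save and prefetched on unpack; the paper's contribution lies in deciding, per token, which activations to swap and which to recompute, which this sketch does not attempt.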