{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,29]],"date-time":"2026-01-29T03:34:27Z","timestamp":1769657667774,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":43,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,1,28]]},"DOI":"10.1145\/3774934.3786417","type":"proceedings-article","created":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T15:25:57Z","timestamp":1769613957000},"page":"413-424","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8898-4355","authenticated-orcid":false,"given":"Geng","family":"Zhang","sequence":"first","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7966-2941","authenticated-orcid":false,"given":"Shenggan","family":"Cheng","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-4877-3115","authenticated-orcid":false,"given":"Xuanlei","family":"Zhao","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-3355-6770","authenticated-orcid":false,"given":"Ziming","family":"Liu","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, 
Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2816-4384","authenticated-orcid":false,"given":"Yang","family":"You","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,1,28]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Josh Abramson Jonas Adler Jack Dunger Richard Evans Tim Green Alexander Pritzel Olaf Ronneberger Lindsay Willmore Andrew J Ballard Joshua Bambrick et al. 2024. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 1\u20133."},{"key":"e_1_3_2_1_2_1","unstructured":"Xiao Bi Deli Chen Guanting Chen Shanhuang Chen Damai Dai Chengqi Deng Honghui Ding Kai Dong Qiushi Du Zhe Fu et al. 2024. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954."},{"key":"e_1_3_2_1_3_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877\u20131901."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3669940.3707223"},{"key":"e_1_3_2_1_5_1","first-page":"16344","article-title":"Flashattention: Fast and memory-efficient exact attention with io-awareness","volume":"35","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35 (2022), 16344\u201316359.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_6_1","unstructured":"DeepSeek-AI. 2024. 
DeepSeek-V2: A Strong Economical and Efficient Mixture-of-Experts Language Model. arxiv:2405.04434."},{"key":"e_1_3_2_1_7_1","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783."},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3437801.3441593"},{"key":"e_1_3_2_1_9_1","unstructured":"Jiarui Fang and Shangchun Zhao. 2024. A Unified Sequence Parallelism Approach for Long Context Generative AI. arXiv preprint arXiv:2405.07719."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640423"},{"key":"e_1_3_2_1_11_1","unstructured":"Daya Guo Qihao Zhu Dejian Yang Zhenda Xie Kai Dong Wentao Zhang Guanting Chen Xiao Bi Yu Wu YK Li et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming\u2013The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196."},{"key":"e_1_3_2_1_12_1","volume-title":"Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32 (2019)."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3662158.3662806"},{"key":"e_1_3_2_1_14_1","volume-title":"International Conference on Machine Learning. 16639\u201316653","author":"Kim Taebum","year":"2023","unstructured":"Taebum Kim, Hyoungjoo Kim, Gyeong-In Yu, and Byung-Gon Chun. 2023. Bpipe: Memory-balanced pipeline parallelism for training large language models. In International Conference on Machine Learning. 
16639\u201316653."},{"key":"e_1_3_2_1_15_1","unstructured":"Weijie Kong Qi Tian Zijian Zhang Rox Min Zuozhuo Dai Jin Zhou Jiangfeng Xiong Xin Li Bo Wu Jianwei Zhang et al. 2024. HunyuanVideo: A Systematic Framework For Large Video Generative Models. arXiv preprint arXiv:2412.03603."},{"key":"e_1_3_2_1_16_1","first-page":"5","volume-title":"Proceedings of Machine Learning and Systems","author":"Korthikanti Vijay Anand","year":"2023","unstructured":"Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2023. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5 (2023)."},{"key":"e_1_3_2_1_17_1","first-page":"5","volume-title":"Proceedings of Machine Learning and Systems","author":"Lamy-Poirier Joel","year":"2023","unstructured":"Joel Lamy-Poirier. 2023. Breadth-first pipeline parallelism. Proceedings of Machine Learning and Systems, 5 (2023)."},{"key":"e_1_3_2_1_18_1","volume-title":"First Conference on Language Modeling.","author":"Li Dacheng","year":"2024","unstructured":"Dacheng Li, Rulin Shao, Anze Xie, Eric P Xing, Xuezhe Ma, Ion Stoica, Joseph E Gonzalez, and Hao Zhang. 2024. Distflashattn: Distributed memory-efficient attention for long-context llms training. In First Conference on Language Modeling."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476145"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.134"},{"key":"e_1_3_2_1_21_1","volume-title":"International Conference on Machine Learning. 6543\u20136552","author":"Li Zhuohan","year":"2021","unstructured":"Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and Ion Stoica. 2021. Terapipe: Token-level pipeline parallelism for training large-scale language models. In International Conference on Machine Learning. 
6543\u20136552."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3710848.3710869"},{"key":"e_1_3_2_1_23_1","unstructured":"Aixin Liu Bei Feng Bing Xue Bingxuan Wang Bochao Wu Chengda Lu Chenggang Zhao Chengqi Deng Chenyu Zhang Chong Ruan et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437."},{"key":"e_1_3_2_1_24_1","unstructured":"Hao Liu Matei Zaharia and Pieter Abbeel. 2023. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889."},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3581784.3607073"},{"key":"e_1_3_2_1_26_1","unstructured":"Ziming Liu Shaoyu Wang Shenggan Cheng Zhongkai Zhao Kai Wang Xuanlei Zhao James Demmel and Yang You. 2024. WallFacer: Harnessing Multi-dimensional Ring Parallelism for Efficient Long Sequence Model Training. arxiv:2407.00611."},{"key":"e_1_3_2_1_27_1","volume-title":"International Conference on Machine Learning. 7937\u20137947","author":"Narayanan Deepak","year":"2021","unstructured":"Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2021. Memory-efficient pipeline-parallel dnn training. In International Conference on Machine Learning. 7937\u20137947."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_3_2_1_29_1","unstructured":"NVIDIA. 2025. Context Parallelism. https:\/\/docs.nvidia.com\/megatron-core\/developer-guide\/latest\/api-guide\/context_parallel.html"},{"key":"e_1_3_2_1_30_1","unstructured":"NVIDIA. 2025. NCCL. https:\/\/github.com\/NVIDIA\/nccl"},{"key":"e_1_3_2_1_31_1","unstructured":"NVIDIA. 2025. NVSwitch. https:\/\/developer.nvidia.com\/blog\/nvidia-nvlink-and-nvidia-nvswitch-supercharge-large-language-model-inference\/?ncid=no-ncid"},{"key":"e_1_3_2_1_32_1","volume-title":"Proceedings of the Seventeenth European Conference on Computer Systems. 
435\u2013452","author":"Oh Hyungjun","year":"2022","unstructured":"Hyungjun Oh, Junyeol Lee, Hyeongju Kim, and Jiwon Seo. 2022. Out-of-order backprop: An effective scheduling technique for deep learning. In Proceedings of the Seventeenth European Conference on Computer Systems. 435\u2013452."},{"key":"e_1_3_2_1_33_1","volume-title":"Zero Bubble (Almost) Pipeline Parallelism. In The Twelfth International Conference on Learning Representations.","author":"Qi Penghui","year":"2024","unstructured":"Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2024. Zero Bubble (Almost) Pipeline Parallelism. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_3_2_1_35_1","volume-title":"Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053.","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053."},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620666.3651359"},{"key":"e_1_3_2_1_37_1","unstructured":"Hugo Touvron and Louis Martin et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arxiv:2307.09288."},{"key":"e_1_3_2_1_38_1","unstructured":"Hugo Touvron Thibaut Lavril Gautier Izacard Xavier Martinet Marie-Anne Lachaux Timoth\u00e9e Lacroix Baptiste Rozi\u00e8re Naman Goyal Eric Hambro Faisal Azhar Aurelien Rodriguez Armand Joulin Edouard Grave and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arxiv:2302.13971."},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/2783258.2783323"},{"key":"e_1_3_2_1_40_1","volume-title":"International Conference on Learning Representations. 
https:\/\/openreview.net\/forum?id=Syx4wnEtvH","author":"You Yang","year":"2020","unstructured":"Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2020. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=Syx4wnEtvH"},{"key":"e_1_3_2_1_41_1","volume-title":"2024 USENIX Annual Technical Conference (USENIX ATC 24)","author":"Yuan Tailing","year":"2024","unstructured":"Tailing Yuan, Yuliang Liu, Xucheng Ye, Shenglong Zhang, Jianchao Tan, Bin Chen, Chengru Song, and Di Zhang. 2024. Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism. In 2024 USENIX Annual Technical Conference (USENIX ATC 24). 545\u2013561."},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2024.3385639"},{"key":"e_1_3_2_1_43_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 
559\u2013578."}],"event":{"name":"PPoPP '26: 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","location":"Sydney NSW Australia","acronym":"PPoPP '26","sponsor":["SIGHPC ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing","SIGPLAN ACM Special Interest Group on Programming Languages"]},"container-title":["Proceedings of the 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3774934.3786417","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T15:29:29Z","timestamp":1769614169000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3774934.3786417"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,28]]},"references-count":43,"alternative-id":["10.1145\/3774934.3786417","10.1145\/3774934"],"URL":"https:\/\/doi.org\/10.1145\/3774934.3786417","relation":{},"subject":[],"published":{"date-parts":[[2026,1,28]]},"assertion":[{"value":"2026-01-28","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}