{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,2]],"date-time":"2025-10-02T00:47:57Z","timestamp":1759366077669,"version":"build-2065373602"},"publisher-location":"New York, NY, USA","reference-count":63,"publisher":"ACM","funder":[{"name":"NSF","award":["2124039"],"award-info":[{"award-number":["2124039"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,10,13]]},"DOI":"10.1145\/3731569.3764798","type":"proceedings-article","created":{"date-parts":[[2025,10,1]],"date-time":"2025-10-01T12:43:24Z","timestamp":1759322604000},"page":"1046-1061","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Mercury: Unlocking Multi-GPU Operator Optimization for LLMs via Remote Memory Scheduling"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-7433-2627","authenticated-orcid":false,"given":"Yue","family":"Guan","sequence":"first","affiliation":[{"name":"UCSD, La Jolla, California, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-1182-6737","authenticated-orcid":false,"given":"Xinwei","family":"Qiang","sequence":"additional","affiliation":[{"name":"UCSD, La Jolla, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6759-2616","authenticated-orcid":false,"given":"Zaifeng","family":"Pan","sequence":"additional","affiliation":[{"name":"UCSD, La Jolla, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-1126-7445","authenticated-orcid":false,"given":"Daniels","family":"Johnson","sequence":"additional","affiliation":[{"name":"Meta, Mountain View, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5600-026X","authenticated-orcid":false,"given":"Yuanwei","family":"Fang","sequence":"additional","affiliation":[{"name":"Meta, Mountain View, 
USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7977-3182","authenticated-orcid":false,"given":"Keren","family":"Zhou","sequence":"additional","affiliation":[{"name":"George Mason University, Washington DC, USA"},{"name":"OpenAI, San Francisco, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1634-8549","authenticated-orcid":false,"given":"Yuke","family":"Wang","sequence":"additional","affiliation":[{"name":"Rice University, Houston, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0098-0670","authenticated-orcid":false,"given":"Wanlu","family":"Li","sequence":"additional","affiliation":[{"name":"UCSD, La Jolla, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8716-5793","authenticated-orcid":false,"given":"Yufei","family":"Ding","sequence":"additional","affiliation":[{"name":"UCSD, La Jolla, USA"},{"name":"Meta, Mountain View, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-5855-6861","authenticated-orcid":false,"given":"Adnan","family":"Aziz","sequence":"additional","affiliation":[{"name":"Meta, Mountain View, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,10,12]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Inc","author":"Devices Advanced Micro","year":"2025","unstructured":"Advanced Micro Devices, Inc. RCCL: ROCm Communication Collectives Library, 2025. Version 2.23.4."},{"key":"e_1_3_2_1_2_1","volume-title":"Mnemosyne: Parallelization strategies for efficiently serving multi-million context length llm inference requests without approximations. arXiv preprint arXiv:2409.17264","author":"Agrawal Amey","year":"2024","unstructured":"Amey Agrawal, Junda Chen, \u00cd\u00f1igo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, and Esha Choukse. Mnemosyne: Parallelization strategies for efficiently serving multi-million context length llm inference requests without approximations. arXiv preprint arXiv:2409.17264, 2024."},{"key":"e_1_3_2_1_3_1","volume-title":"Grouped-query attention. 
arXiv preprint arXiv:2305.13245","author":"Ainslie Joshua","year":"2023","unstructured":"Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr\u00f3n, and Sumit Sanghai. Grouped-query attention. arXiv preprint arXiv:2305.13245, 2023."},{"key":"e_1_3_2_1_4_1","first-page":"810","volume-title":"Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"1","author":"Alabed Sami","year":"2025","unstructured":"Sami Alabed, Daniel Belov, Bart Chrzaszcz, Juliana Franco, Dominik Grewe, Dougal Maclaurin, James Molloy, Tom Natan, Tamara Norman, Xiaoyue Pan, et al. Partir: Composing spmd partitioning strategies for machine learning. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pages 794\u2013810, 2025."},{"key":"e_1_3_2_1_5_1","first-page":"947","volume-title":"Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"2","author":"Ansel Jason","year":"2024","unstructured":"Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael Lazos, Mario Lezcano, Yanbo Liang, Jason Liang, Yinghai Lu, C. K. Luk, Bert Maher, Yunjie Pan, Christian Puhrsch, Matthias Reso, Mark Saroufim, Marcos Yukio Siraichi, Helen Suk, Shunting Zhang, Michael Suo, Phil Tillet, Xu Zhao, Eikan Wang, Keren Zhou, Richard Zou, Xiaodong Wang, Ajit Mathews, William Wen, Gregory Chanan, Peng Wu, and Soumith Chintala. 
Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS '24, page 929\u2013947, New York, NY, USA, 2024. Association for Computing Machinery."},{"key":"e_1_3_2_1_6_1","volume-title":"Flux: fast software-based communication overlap on gpus through kernel fusion. arXiv preprint arXiv:2406.06858","author":"Chang Li-Wen","year":"2024","unstructured":"Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, et al. Flux: fast software-based communication overlap on gpus through kernel fusion. arXiv preprint arXiv:2406.06858, 2024."},{"key":"e_1_3_2_1_7_1","first-page":"191","volume-title":"Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"3","author":"Chen Chang","year":"2024","unstructured":"Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei Duan, Peng Sun, Xingcheng Zhang, and Chao Yang. Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 178\u2013191, 2024."},{"key":"e_1_3_2_1_8_1","first-page":"594","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated End-to-End optimizing compiler for deep learning. 
In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578\u2013594, 2018."},{"key":"e_1_3_2_1_9_1","first-page":"3404","volume-title":"Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to optimize tensor programs. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, page 3393\u20133404, Red Hook, NY, USA, 2018. Curran Associates Inc."},{"key":"e_1_3_2_1_10_1","volume-title":"OSDI. USENIX","author":"Cheng Yu","year":"2025","unstructured":"Yu Cheng, Lei Wang, Yining Shi, Yuqing Xia, Lingxiao Ma, Jilong Xue, Yang Wang, Zhiwen Mo, Feiyang Chen, Fan Yang, Mao Yang, and Zhi Yang. Pipethreader: Software-defined pipelining for efficient dnn execution. In OSDI. USENIX, July 2025."},{"key":"e_1_3_2_1_11_1","volume-title":"cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759","author":"Chetlur Sharan","year":"2014","unstructured":"Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014."},{"key":"e_1_3_2_1_12_1","volume-title":"Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691","author":"Dao Tri","year":"2023","unstructured":"Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023."},{"key":"e_1_3_2_1_13_1","volume-title":"Flashattention: Fast and memory-efficient exact attention with io-awareness. 
Advances in neural information processing systems, 35:16344\u201316359","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344\u201316359, 2022."},{"issue":"1","key":"e_1_3_2_1_14_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/77626.79170","article-title":"A set of level 3 basic linear algebra subprograms","volume":"16","author":"Dongarra J. J.","year":"1990","unstructured":"J. J. Dongarra, Jeremy Du Croz, Sven Hammarling, and I. S. Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw., 16(1):1\u201317, March 1990.","journal-title":"ACM Trans. Math. Softw."},{"key":"e_1_3_2_1_15_1","volume-title":"A unified sequence parallelism approach for long context generative ai. arXiv preprint arXiv:2405.07719","author":"Fang Jiarui","year":"2024","unstructured":"Jiarui Fang and Shangchun Zhao. A unified sequence parallelism approach for long context generative ai. arXiv preprint arXiv:2405.07719, 2024."},{"key":"e_1_3_2_1_16_1","first-page":"817","volume-title":"Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"2","author":"Feng Siyuan","year":"2023","unstructured":"Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, et al. Tensorir: An abstraction for automatic tensorized program optimization. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 804\u2013817, 2023."},{"key":"e_1_3_2_1_17_1","volume-title":"The llama 3 herd of models. 
arXiv preprint arXiv:2407.21783","author":"Grattafiori Aaron","year":"2024","unstructured":"Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024."},{"key":"e_1_3_2_1_18_1","volume-title":"et al. Loongtrain: Efficient training of long-sequence llms with head-context parallelism. arXiv preprint arXiv:2406.18485","author":"Gu Diandian","year":"2024","unstructured":"Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, et al. Loongtrain: Efficient training of long-sequence llms with head-context parallelism. arXiv preprint arXiv:2406.18485, 2024."},{"key":"e_1_3_2_1_19_1","volume-title":"et al. Flashoverlap: A lightweight design for efficiently overlapping communication and computation. arXiv preprint arXiv:2504.19519","author":"Hong Ke","year":"2025","unstructured":"Ke Hong, Xiuhong Li, Minxu Liu, Qiuli Mao, Tianqi Wu, Zixiao Huang, Lufang Chen, Zhong Wang, Yichong Zhang, Zhenhua Zhu, et al. Flashoverlap: A lightweight design for efficiently overlapping communication and computation. arXiv preprint arXiv:2504.19519, 2025."},{"key":"e_1_3_2_1_20_1","volume-title":"Samyam Rajbhandari, and Yuxiong He. DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509","author":"Jacobs Sam Ade","year":"2023","unstructured":"Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence transformer models. 
arXiv preprint arXiv:2309.14509, 2023."},{"key":"e_1_3_2_1_21_1","first-page":"416","volume-title":"Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","author":"Jangda Abhinav","year":"2022","unstructured":"Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sabet, Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkowicz, and Olli Saarikivi. Breaking the computation and communication abstraction barrier in distributed machine learning workloads. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 402\u2013416, 2022."},{"key":"e_1_3_2_1_22_1","first-page":"1","article-title":"Beyond data and model parallelism for deep neural networks","volume":"1","author":"Jia Zhihao","year":"2019","unstructured":"Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond data and model parallelism for deep neural networks. Proceedings of Machine Learning and Systems, 1:1\u201313, 2019.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_23_1","first-page":"341","article-title":"Reducing activation recomputation in large transformer models","volume":"5","author":"Korthikanti Vijay Anand","year":"2023","unstructured":"Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. 
Proceedings of Machine Learning and Systems, 5:341\u2013353, 2023.","journal-title":"Proceedings of Machine Learning and Systems"},{"issue":"12","key":"e_1_3_2_1_24_1","doi-asserted-by":"crossref","first-page":"3005","DOI":"10.14778\/3415478.3415530","article-title":"experiences on accelerating data parallel training","volume":"13","author":"Li Shen","year":"2020","unstructured":"Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. Pytorch distributed: experiences on accelerating data parallel training. Proc. VLDB Endow., 13(12):3005\u20133018, August 2020.","journal-title":"Proc. VLDB Endow."},{"key":"e_1_3_2_1_25_1","first-page":"2404","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Li Shenggui","year":"2023","unstructured":"Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence parallelism: Long sequence training from system perspective. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2391\u20132404, 2023."},{"key":"e_1_3_2_1_26_1","volume-title":"et al. Torchtitan: One-stop pytorch native solution for production ready llm pre-training. arXiv preprint arXiv:2410.06511","author":"Liang Wanchao","year":"2024","unstructured":"Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, et al. Torchtitan: One-stop pytorch native solution for production ready llm pre-training. arXiv preprint arXiv:2410.06511, 2024."},{"key":"e_1_3_2_1_27_1","volume-title":"The Twelfth International Conference on Learning Representations.","author":"Liu Hao","unstructured":"Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. 
In The Twelfth International Conference on Learning Representations."},{"key":"e_1_3_2_1_28_1","first-page":"667","volume-title":"2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","author":"Luo Weile","unstructured":"Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Qiang Wang, and Xiaowen Chu. Benchmarking and dissecting the nvidia hopper gpu architecture. In 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 656\u2013667. IEEE, 2024."},{"key":"e_1_3_2_1_29_1","first-page":"15","volume-title":"Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP)","author":"Narayanan Deepak","year":"2019","unstructured":"Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP), pages 1\u201315, 2019."},{"key":"e_1_3_2_1_30_1","volume-title":"NVIDIA, mar","author":"NVIDIA.","year":"2022","unstructured":"NVIDIA. NVIDIA H100 Tensor Core GPU Architecture. Technical report, NVIDIA, mar 2022. White paper."},{"key":"e_1_3_2_1_31_1","volume-title":"Nvidia nvlink high-speed interconnect: Application performance. Technical report","author":"NVIDIA Corporation","year":"2015","unstructured":"NVIDIA Corporation. Nvidia nvlink high-speed interconnect: Application performance. Technical report, NVIDIA Corporation, 2015. Accessed: 2025-04-16."},{"key":"e_1_3_2_1_32_1","volume-title":"cuBLAS Library","author":"NVIDIA Corporation","year":"2023","unstructured":"NVIDIA Corporation. cuBLAS Library, 2023. Retrieved from https:\/\/docs.nvidia.com\/cuda\/cublas\/."},{"key":"e_1_3_2_1_33_1","volume-title":"Nvidia b100 blackwell gpu. 
https:\/\/www.cudocompute.com\/blog\/nvidias-blackwell-architecture-breaking-down-the-b100-b200-and-gb200","author":"NVIDIA Corporation","year":"2024","unstructured":"NVIDIA Corporation. Nvidia b100 blackwell gpu. https:\/\/www.cudocompute.com\/blog\/nvidias-blackwell-architecture-breaking-down-the-b100-b200-and-gb200, 2024. Accessed: 2025-04-17."},{"key":"e_1_3_2_1_34_1","volume-title":"NVIDIA Collective Communications Library (NCCL)","author":"NVIDIA Corporation","year":"2025","unstructured":"NVIDIA Corporation. NVIDIA Collective Communications Library (NCCL), 2025. Version 2.26.2."},{"key":"e_1_3_2_1_35_1","volume-title":"NVIDIA cuBLAS Library","author":"NVIDIA Corporation","year":"2025","unstructured":"NVIDIA Corporation. NVIDIA cuBLAS Library, 2025. Version 12.8."},{"issue":"6","key":"e_1_3_2_1_36_1","doi-asserted-by":"crossref","first-page":"519","DOI":"10.1145\/2499370.2462176","article-title":"a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines","volume":"48","author":"Ragan-Kelley Jonathan","year":"2013","unstructured":"Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Fr\u00e9do Durand, and Saman Amarasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. SIGPLAN Not., 48(6):519\u2013530, June 2013.","journal-title":"SIGPLAN Not."},{"key":"e_1_3_2_1_37_1","first-page":"16","volume-title":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Rajbhandari Samyam","unstructured":"Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1\u201316. 
IEEE, 2020."},{"key":"e_1_3_2_1_38_1","volume-title":"Distir: An intermediate representation and simulator for efficient neural network distribution. arXiv preprint arXiv:2111.05426","author":"Santhanam Keshav","year":"2021","unstructured":"Keshav Santhanam, Siddharth Krishna, Ryota Tomioka, Tim Harris, and Matei Zaharia. Distir: An intermediate representation and simulator for efficient neural network distribution. arXiv preprint arXiv:2111.05426, 2021."},{"key":"e_1_3_2_1_39_1","first-page":"68658","article-title":"Fast and accurate attention with asynchrony and low-precision","volume":"37","author":"Shah Jay","year":"2024","unstructured":"Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658\u201368685, 2024.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_40_1","first-page":"35783","article-title":"Tensor program optimization with probabilistic programs","volume":"35","author":"Shao Junru","year":"2022","unstructured":"Junru Shao, Xiyou Zhou, Siyuan Feng, Bohan Hou, Ruihang Lai, Hongyi Jin, Wuwei Lin, Masahiro Masuda, Cody Hao Yu, and Tianqi Chen. Tensor program optimization with probabilistic programs. Advances in Neural Information Processing Systems, 35:35783\u201335796, 2022.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_41_1","volume-title":"Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150","author":"Shazeer Noam","year":"2019","unstructured":"Noam Shazeer. Fast transformer decoding: One write-head is all you need. 
arXiv preprint arXiv:1911.02150, 2019."},{"key":"e_1_3_2_1_42_1","first-page":"100","volume-title":"Proceedings of the 20th International Conference on Parallel and Distributed Processing, IPDPS'06","author":"Shipman Galen M.","year":"2006","unstructured":"Galen M. Shipman, Tim S. Woodall, Rich L. Graham, Arthur B. Maccabe, and Patrick G. Bridges. Infiniband scalability in open mpi. In Proceedings of the 20th International Conference on Parallel and Distributed Processing, IPDPS'06, page 100, USA, 2006. IEEE Computer Society."},{"key":"e_1_3_2_1_43_1","volume-title":"Megatron-lm: Training multibillion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multibillion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019."},{"key":"e_1_3_2_1_44_1","volume-title":"Tree attention: Topology-aware decoding for long-context attention on gpu clusters. arXiv preprint arXiv:2408.04093","author":"Shyam Vasudev","year":"2024","unstructured":"Vasudev Shyam, Jonathan Pilault, Emily Shepperd, Quentin Anthony, and Beren Millidge. Tree attention: Topology-aware decoding for long-context attention on gpu clusters. arXiv preprint arXiv:2408.04093, 2024."},{"key":"e_1_3_2_1_45_1","first-page":"52","volume-title":"European Conference on Parallel Processing","author":"Soyt\u00fcrk Muhammet Abdullah","unstructured":"Muhammet Abdullah Soyt\u00fcrk, Palwisha Akhtar, Erhan Tezcan, and Didem Unat. Monitoring collective communication among gpus. In European Conference on Parallel Processing, pages 41\u201352. 
Springer, 2021."},{"key":"e_1_3_2_1_46_1","first-page":"19","volume-title":"Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2019","author":"Tillet Philippe","year":"2019","unstructured":"Philippe Tillet, H. T. Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2019, page 10\u201319, New York, NY, USA, 2019. Association for Computing Machinery."},{"key":"e_1_3_2_1_47_1","volume-title":"Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730","author":"Vasilache Nicolas","year":"2018","unstructured":"Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730, 2018."},{"key":"e_1_3_2_1_48_1","volume-title":"Attention is all you need. Advances in neural information processing systems, 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017."},{"key":"e_1_3_2_1_49_1","first-page":"55","volume-title":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","author":"Vienne Jerome","year":"2012","unstructured":"Jerome Vienne, Jitong Chen, Md. Wasi-Ur-Rahman, Nusrat S. Islam, Hari Subramoni, and Dhabaleswar K. Panda. Performance analysis and evaluation of infiniband fdr and 40gige roce on hpc and cloud computing systems. 
In 2012 IEEE 20th Annual Symposium on High-Performance Interconnects, pages 48\u201355, 2012."},{"key":"e_1_3_2_1_50_1","first-page":"817","volume-title":"Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"3","author":"Wang Haoran","year":"2024","unstructured":"Haoran Wang, Lei Wang, Haobo Xu, Ying Wang, Yuming Li, and Yinhe Han. Primepar: Efficient spatial-temporal tensor partitioning for large transformer model training. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ASPLOS '24, page 801\u2013817, New York, NY, USA, 2024. Association for Computing Machinery."},{"key":"e_1_3_2_1_51_1","first-page":"106","volume-title":"Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"1","author":"Wang Shibo","year":"2022","unstructured":"Shibo Wang, Jinliang Wei, Amit Sabne, Andy Davis, Berkin Ilbeyi, Blake Hechtman, Dehao Chen, Karthik Srinivasa Murthy, Marcello Maggioni, Qiao Zhang, et al. Overlap communication with dependent computation via decomposition in large deep learning models. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pages 93\u2013106, 2022."},{"key":"e_1_3_2_1_52_1","volume-title":"Tokenring: An efficient parallelism framework for infinite-context llms via bidirectional communication. arXiv preprint arXiv:2412.20501","author":"Wang Zongwu","year":"2024","unstructured":"Zongwu Wang, Fangxin Liu, Mingshuai Li, and Li Jiang. Tokenring: An efficient parallelism framework for infinite-context llms via bidirectional communication. 
arXiv preprint arXiv:2412.20501, 2024."},{"key":"e_1_3_2_1_53_1","first-page":"38","volume-title":"19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25)","author":"Wu Mengdi","year":"2025","unstructured":"Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Man Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia. Mirage: A Multi-Level superoptimizer for tensor programs. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pages 21\u201338, 2025."},{"key":"e_1_3_2_1_54_1","volume-title":"CoRR","author":"Wu Tongtong","year":"2024","unstructured":"Tongtong Wu, Linhao Luo, Yuan-Fang Li, Shirui Pan, Thuy-Trang Vu, and Gholamreza Haffari. Continual learning for large language models: A survey. CoRR, 2024."},{"key":"e_1_3_2_1_55_1","first-page":"300","volume-title":"Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation","author":"Yadav Rohan","year":"2022","unstructured":"Rohan Yadav, Alex Aiken, and Fredrik Kjolstad. DISTAL: the distributed tensor algebra compiler. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pages 286\u2013300, San Diego CA USA, June 2022. ACM."},{"key":"e_1_3_2_1_56_1","first-page":"678","volume-title":"Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"3","author":"Ye Zihao","year":"2023","unstructured":"Zihao Ye, Ruihang Lai, Junru Shao, Tianqi Chen, and Luis Ceze. Sparsetir: Composable abstractions for sparse compilation in deep learning. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ASPLOS 2023, page 660\u2013678, New York, NY, USA, 2023. Association for Computing Machinery."},{"key":"e_1_3_2_1_57_1","volume-title":"et al. 
Comet: Fine-grained computation-communication overlapping for mixture-of-experts. arXiv preprint arXiv:2502.19811","author":"Zhang Shulai","year":"2025","unstructured":"Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, et al. Comet: Fine-grained computation-communication overlapping for mixture-of-experts. arXiv preprint arXiv:2502.19811, 2025."},{"key":"e_1_3_2_1_58_1","volume-title":"Deepep: an efficient expert-parallel communication library. https:\/\/github.com\/deepseek-ai\/DeepEP","author":"Zhao Chenggang","year":"2025","unstructured":"Chenggang Zhao, Shangyan Zhou, Liyue Zhang, Chengqi Deng, Zhean Xu, Yuxuan Liu, Kuai Yu, Jiashi Li, and Liang Zhao. Deepep: an efficient expert-parallel communication library. https:\/\/github.com\/deepseek-ai\/DeepEP, 2025."},{"issue":"12","key":"e_1_3_2_1_59_1","doi-asserted-by":"crossref","first-page":"3848","DOI":"10.14778\/3611540.3611569","article-title":"Experiences on scaling fully sharded data parallel","volume":"16","author":"Zhao Yanli","year":"2023","unstructured":"Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. Pytorch fsdp: Experiences on scaling fully sharded data parallel. Proc. VLDB Endow., 16(12):3848\u20133860, August 2023.","journal-title":"Proc. VLDB Endow."},{"key":"e_1_3_2_1_60_1","volume-title":"Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, OSDI'20, USA","author":"Zheng Lianmin","year":"2020","unstructured":"Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. Ansor: generating high-performance tensor programs for deep learning. 
In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, OSDI'20, USA, 2020. USENIX Association."},{"key":"e_1_3_2_1_61_1","first-page":"578","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 559\u2013578, 2022."},{"key":"e_1_3_2_1_62_1","volume-title":"et al. Triton-distributed: Programming overlapping kernels on distributed ai systems with the triton compiler. arXiv preprint arXiv:2504.19442","author":"Zheng Size","year":"2025","unstructured":"Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Renze Chen, Ruifan Xu, et al. Triton-distributed: Programming overlapping kernels on distributed ai systems with the triton compiler. arXiv preprint arXiv:2504.19442, 2025."},{"key":"e_1_3_2_1_63_1","volume-title":"et al. Tilelink: Generating efficient compute-communication overlapping kernels using tile-centric primitives. arXiv preprint arXiv:2503.20313","author":"Zheng Size","year":"2025","unstructured":"Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wenlei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang Wang, Jianxi Ye, Haibin Lin, et al. Tilelink: Generating efficient compute-communication overlapping kernels using tile-centric primitives. 
arXiv preprint arXiv:2503.20313, 2025."}],"event":{"name":"SOSP '25: ACM SIGOPS 31st Symposium on Operating Systems Principles","location":"Lotte Hotel World Seoul Republic of Korea","acronym":"SOSP '25","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems","USENIX"]},"container-title":["Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles"],"original-title":[],"deposited":{"date-parts":[[2025,10,1]],"date-time":"2025-10-01T12:57:44Z","timestamp":1759323464000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3731569.3764798"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,12]]},"references-count":63,"alternative-id":["10.1145\/3731569.3764798","10.1145\/3731569"],"URL":"https:\/\/doi.org\/10.1145\/3731569.3764798","relation":{},"subject":[],"published":{"date-parts":[[2025,10,12]]},"assertion":[{"value":"2025-10-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}