{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,5]],"date-time":"2026-06-05T15:41:01Z","timestamp":1780674061538,"version":"3.54.1"},"publisher-location":"New York, NY, USA","reference-count":37,"publisher":"ACM","funder":[{"name":"NSF","award":["CNS 1956007"],"award-info":[{"award-number":["CNS 1956007"]}]},{"name":"ACE"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,6,21]]},"DOI":"10.1145\/3695053.3731077","type":"proceedings-article","created":{"date-parts":[[2025,6,20]],"date-time":"2025-06-20T16:43:11Z","timestamp":1750437791000},"page":"821-834","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8065-5151","authenticated-orcid":false,"given":"Hyoungwook","family":"Nam","sequence":"first","affiliation":[{"name":"University of Illinois at Urbana-Champaign, Champaign, Illinois, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7946-2683","authenticated-orcid":false,"given":"Gerasimos","family":"Gerogiannis","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, Champaign, Illinois, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2595-5228","authenticated-orcid":false,"given":"Josep","family":"Torrellas","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, Champaign, Illinois, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,6,20]]},"reference":[{"key":"e_1_3_3_2_2_2","doi-asserted-by":"crossref","unstructured":"Ramesh\u00a0C Agarwal Susanne\u00a0M Balle Fred\u00a0G Gustavson Mahesh Joshi and Prasad Palkar. 1995. A three-dimensional approach to parallel matrix multiplication. IBM Journal of Research and Development 39 5 (1995) 575\u2013582.","DOI":"10.1147\/rd.395.0575"},{"key":"e_1_3_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS53621.2022.00014"},{"key":"e_1_3_3_2_4_2","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared\u00a0D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et\u00a0al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877\u20131901."},{"key":"e_1_3_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.5555\/905686"},{"key":"e_1_3_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620666.3651379"},{"key":"e_1_3_3_2_7_2","unstructured":"Sharan Chetlur Cliff Woolley Philippe Vandermersch Jonathan Cohen John Tran Bryan Catanzaro and Evan Shelhamer. 2014. CuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1410.0759 (2014)."},{"key":"e_1_3_3_2_8_2","first-page":"571","volume-title":"11th USENIX symposium on operating systems design and implementation (OSDI 14)","author":"Chilimbi Trishul","year":"2014","unstructured":"Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX symposium on operating systems design and implementation (OSDI 14). 571\u2013582."},{"key":"e_1_3_3_2_9_2","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et\u00a0al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2407.21783 (2024)."},{"key":"e_1_3_3_2_10_2","unstructured":"Roy Frostig Matthew\u00a0James Johnson and Chris Leary. 2018. Compiling machine learning programs via high-level tracing. Systems for Machine Learning 4 9 (2018)."},{"key":"e_1_3_3_2_11_2","volume-title":"Cloud TPU performance guide","year":"2024","unstructured":"Google. 2024. Cloud TPU performance guide. https:\/\/cloud.google.com\/tpu\/docs\/performance-guide"},{"key":"e_1_3_3_2_12_2","volume-title":"Cloud Tensor Processing Units (TPUs)","year":"2025","unstructured":"Google. 2025. Cloud Tensor Processing Units (TPUs). https:\/\/cloud.google.com\/tpu"},{"key":"e_1_3_3_2_13_2","unstructured":"Yanping Huang Youlong Cheng Ankur Bapna Orhan Firat Dehao Chen Mia Chen HyoukJoong Lee Jiquan Ngiam Quoc\u00a0V Le Yonghui Wu et\u00a0al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_3_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589350"},{"key":"e_1_3_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_3_3_2_16_2","unstructured":"Thomas\u00a0N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1609.02907 (2016)."},{"key":"e_1_3_3_2_17_2","unstructured":"Vijay\u00a0Anand Korthikanti Jared Casper Sangkug Lym Lawrence McAfee Michael Andersch Mohammad Shoeybi and Bryan Catanzaro. 2023. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems 5 (2023) 341\u2013353."},{"key":"e_1_3_3_2_18_2","unstructured":"Dmitry Lepikhin HyoukJoong Lee Yuanzhong Xu Dehao Chen Orhan Firat Yanping Huang Maxim Krikun Noam Shazeer and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2006.16668 (2020)."},{"key":"e_1_3_3_2_19_2","doi-asserted-by":"crossref","unstructured":"Shang Li Zhiyuan Yang Dhiraj Reddy Ankur Srivastava and Bruce Jacob. 2020. DRAMsim3: A cycle-accurate thermal-capable DRAM simulator. IEEE Computer Architecture Letters 19 2 (2020) 106\u2013109.","DOI":"10.1109\/LCA.2020.2973991"},{"key":"e_1_3_3_2_20_2","volume-title":"DGX H100: AI for Enterprise","year":"2023","unstructured":"NVIDIA. 2023. DGX H100: AI for Enterprise. https:\/\/www.nvidia.com\/en-gb\/data-center\/dgx-h100\/"},{"key":"e_1_3_3_2_21_2","doi-asserted-by":"crossref","unstructured":"Suchita Pati Shaizeen Aga Mahzabeen Islam Nuwan Jayasena and Matthew\u00a0D Sinclair. 2024. T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2401.16677 (2024).","DOI":"10.1145\/3620665.3640410"},{"key":"e_1_3_3_2_22_2","unstructured":"Reiner Pope Sholto Douglas Aakanksha Chowdhery Jacob Devlin James Bradbury Anselm Levskaya Jonathan Heek Kefan Xiao Shivani Agrawal and Jeff Dean. 2022. Efficiently scaling transformer inference. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2211.05102 (2022)."},{"key":"e_1_3_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406703"},{"key":"e_1_3_3_2_24_2","doi-asserted-by":"crossref","unstructured":"Arun\u00a0F Rodrigues K\u00a0Scott Hemmert Brian\u00a0W Barrett Chad Kersey Ron Oldfield Marlo Weston Rolf Risen Jeanine Cook Paul Rosenfeld Elliot Cooper-Balis et\u00a0al. 2011. The structural simulation toolkit. ACM SIGMETRICS Performance Evaluation Review 38 4 (2011) 37\u201342.","DOI":"10.1145\/1964218.1964225"},{"key":"e_1_3_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS48437.2020.00016"},{"key":"e_1_3_3_2_26_2","doi-asserted-by":"crossref","unstructured":"Martin\u00a0D Schatz Robert\u00a0A Van\u00a0de Geijn and Jack Poulson. 2016. Parallel matrix multiplication: A systematic journey. SIAM Journal on Scientific Computing 38 6 (2016) C748\u2013C781.","DOI":"10.1137\/140993478"},{"key":"e_1_3_3_2_27_2","unstructured":"Noam Shazeer Azalia Mirhoseini Krzysztof Maziarz Andy Davis Quoc Le Geoffrey Hinton and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1701.06538 (2017)."},{"key":"e_1_3_3_2_28_2","unstructured":"Shaden Smith Mostofa Patwary Brandon Norick Patrick LeGresley Samyam Rajbhandari Jared Casper Zhun Liu Shrimai Prabhumoye George Zerveas Vijay Korthikanti et\u00a0al. 2022. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B a large-scale generative language model. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2201.11990 (2022)."},{"key":"e_1_3_3_2_29_2","first-page":"90","volume-title":"European Conference on Parallel Processing","author":"Solomonik Edgar","year":"2011","unstructured":"Edgar Solomonik and James Demmel. 2011. Communication-optimal parallel 2.5 D matrix multiplication and LU factorization algorithms. In European Conference on Parallel Processing. Springer, 90\u2013109."},{"key":"e_1_3_3_2_30_2","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et\u00a0al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2307.09288 (2023)."},{"key":"e_1_3_3_2_31_2","doi-asserted-by":"crossref","unstructured":"Robert\u00a0A Van De\u00a0Geijn and Jerrell Watts. 1997. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience 9 4 (1997) 255\u2013274.","DOI":"10.1002\/(SICI)1096-9128(199704)9:4<255::AID-CPE250>3.0.CO;2-2"},{"key":"e_1_3_3_2_32_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan\u00a0N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_3_2_33_2","volume-title":"International Conference on Learning Representations","author":"Veli\u010dkovi\u0107 Petar","year":"2018","unstructured":"Petar Veli\u010dkovi\u0107, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li\u00f2, and Yoshua Bengio. 2018. Graph Attention Networks. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=rJXMpikCZ"},{"key":"e_1_3_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620666.3651357"},{"key":"e_1_3_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3567955.3567959"},{"key":"e_1_3_3_2_36_2","unstructured":"Yuanzhong Xu HyoukJoong Lee Dehao Chen Blake Hechtman Yanping Huang Rahul Joshi Maxim Krikun Dmitry Lepikhin Andy Ly Marcello Maggioni et\u00a0al. 2021. GSPMD: general and scalable parallelization for ML computation graphs. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2105.04663 (2021)."},{"key":"e_1_3_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2006.1639613"},{"key":"e_1_3_3_2_38_2","unstructured":"Yanli Zhao Andrew Gu Rohan Varma Liang Luo Chien-Chin Huang Min Xu Less Wright Hamid Shojanazeri Myle Ott Sam Shleifer et\u00a0al. 2023. Pytorch FSDP: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2304.11277 (2023)."}],"event":{"name":"ISCA '25: Proceedings of the 52nd Annual International Symposium on Computer Architecture","location":"Tokyo Japan","acronym":"SIGARCH '25","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture"]},"container-title":["Proceedings of the 52nd Annual International Symposium on Computer Architecture"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3695053.3731077","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,21]],"date-time":"2025-06-21T11:10:02Z","timestamp":1750504202000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3695053.3731077"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,20]]},"references-count":37,"alternative-id":["10.1145\/3695053.3731077","10.1145\/3695053"],"URL":"https:\/\/doi.org\/10.1145\/3695053.3731077","relation":{},"subject":[],"published":{"date-parts":[[2025,6,20]]},"assertion":[{"value":"2025-06-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}