{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,10]],"date-time":"2026-01-10T07:43:07Z","timestamp":1768030987677,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":25,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,11,16]]},"DOI":"10.1145\/3731599.3767407","type":"proceedings-article","created":{"date-parts":[[2025,11,7]],"date-time":"2025-11-07T16:18:44Z","timestamp":1762532324000},"page":"540-544","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Batch Tiling on Attention: Efficient Mixture of Experts Training on Wafer-Scale Processors"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-2654-3767","authenticated-orcid":false,"given":"Daria","family":"Soboleva","sequence":"first","affiliation":[{"name":"Cerebras, Sunnyvale, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0120-0392","authenticated-orcid":false,"given":"Etienne","family":"Goffinet","sequence":"additional","affiliation":[{"name":"Cerebras, Abu Dabi, United Arab Emirates"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-9719-309X","authenticated-orcid":false,"given":"Hui","family":"Zeng","sequence":"additional","affiliation":[{"name":"Cerebras, Sunnyvale, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-9189-2310","authenticated-orcid":false,"given":"Sangamesh","family":"Ragate","sequence":"additional","affiliation":[{"name":"Cerebras, Sunnyvale, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-9203-0611","authenticated-orcid":false,"given":"Elif","family":"Albuz","sequence":"additional","affiliation":[{"name":"Cerebras, Sunnyvale, CA, 
USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-1038-1572","authenticated-orcid":false,"given":"Natalia","family":"Vassilieva","sequence":"additional","affiliation":[{"name":"Cerebras, Sunnyvale, CA, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,11,15]]},"reference":[{"key":"e_1_3_3_1_2_2","unstructured":"2025. Megatron-Core: GPU-Optimized Parallelism Library for Large-Scale Model Training. NVIDIA Developer Documentation and Library. https:\/\/docs.nvidia.com\/megatron-core\/developer-guide\/latest\/api-guide\/moe.html Supports MoE expert parallelism alongside tensor data pipeline sequence and context parallelism."},{"key":"e_1_3_3_1_3_2","unstructured":"Weilin Cai Juyong Jiang Le Qin Junwei Cui Sunghun Kim and Jiayi Huang. 2024. Shortcut-connected expert parallelism for accelerating mixture-of-experts. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2404.05019 (2024)."},{"key":"e_1_3_3_1_4_2","doi-asserted-by":"crossref","unstructured":"Fahao Chen Peng Li Zicong Hong Zhou Su and Song Guo. 2025. Communication-Efficient Sparsely-Activated Model Training via Sequence Migration and Token Condensation. IEEE Transactions on Networking (2025).","DOI":"10.1109\/TON.2025.3585359"},{"key":"e_1_3_3_1_5_2","unstructured":"Brian Chu Mihir Patel Less Wright Vitaliy Chiley Evan Racah Wanchao Liang Iris Zhang and Andrew Gu. 2024. Training MoEs at Scale with PyTorch. PyTorch Blog. https:\/\/pytorch.org\/blog\/training-moes\/ Accessed: 2025-08-22."},{"key":"e_1_3_3_1_6_2","unstructured":"Tri Dao Dan Fu Stefano Ermon Atri Rudra and Christopher R\u00e9. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems 35 (2022) 16344\u201316359."},{"key":"e_1_3_3_1_7_2","volume-title":"Getting Started with DeepSpeed-MoE for Inferencing Large-Scale MoE Models","author":"Team DeepSpeed","year":"2025","unstructured":"DeepSpeed Team. 2025. 
Getting Started with DeepSpeed-MoE for Inferencing Large-Scale MoE Models. DeepSpeed \/ Microsoft. https:\/\/www.deepspeed.ai\/tutorials\/mixture-of-experts-inference\/ Tutorial."},{"key":"e_1_3_3_1_8_2","unstructured":"William Fedus Barret Zoph and Noam Shazeer. 2021. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv:arXiv:2101.03961"},{"key":"e_1_3_3_1_9_2","unstructured":"William Fedus Barret Zoph and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23 120 (2022) 1\u201339."},{"key":"e_1_3_3_1_10_2","unstructured":"Trevor Gale Deepak Narayanan Cliff Young and Matei Zaharia. 2023. Megablocks: Efficient sparse training with mixture-of-experts. Proceedings of Machine Learning and Systems 5 (2023) 288\u2013304."},{"key":"e_1_3_3_1_11_2","unstructured":"Seokjin Go and Divya Mahajan. 2025. Moetuner: Optimized mixture of expert serving with balanced expert placement and token routing. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2502.06643 (2025)."},{"key":"e_1_3_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503221.3508418"},{"key":"e_1_3_3_1_13_2","unstructured":"Changho Hwang Wei Cui Yifan Xiong Ziyue Yang Ze Liu Han Hu Zilong Wang Rafael Salas Jithin Jose Prabhat Ram et\u00a0al. 2023. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning and Systems 5 (2023) 269\u2013287."},{"key":"e_1_3_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA59077.2024.00078"},{"key":"e_1_3_3_1_15_2","volume-title":"Forty-first International Conference on Machine Learning","author":"Kim Yechan","year":"2024","unstructured":"Yechan Kim, Hwijoon Lim, and Dongsu Han. 2024. Scaling beyond the GPU memory limit for large mixture-of-experts model training. 
In Forty-first International Conference on Machine Learning."},{"key":"e_1_3_3_1_16_2","unstructured":"Vijay Korthikanti Jared Casper Sangkug Lym Lawrence McAfee Michael Andersch Mohammad Shoeybi and Bryan Catanzaro. 2022. Reducing Activation Recomputation in Large Transformer Models. arXiv:arXiv:2205.05198"},{"key":"e_1_3_3_1_17_2","unstructured":"Dmitry Lepikhin HyoukJoong Lee Yuanzhong Xu Dehao Chen Orhan Firat Yanping Huang Maxim Krikun Noam Shazeer and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2006.16668 (2020)."},{"key":"e_1_3_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3605573.3605613"},{"key":"e_1_3_3_1_19_2","first-page":"18332","volume-title":"International conference on machine learning","author":"Rajbhandari Samyam","year":"2022","unstructured":"Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza\u00a0Yazdani Aminabadi, Ammar\u00a0Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In International conference on machine learning. PMLR, 18332\u201318346."},{"key":"e_1_3_3_1_20_2","unstructured":"Noam Shazeer Azalia Mirhoseini Krzysztof Maziarz Andy Davis Quoc Le Geoffrey Hinton and Jeff Dean. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv:arXiv:1701.06538"},{"key":"e_1_3_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3627703.3650083"},{"key":"e_1_3_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3577193.3593704"},{"key":"e_1_3_3_1_23_2","unstructured":"Yongji Wu Xueshen Liu Shuowei Jin Ceyu Xu Feng Qian Z\u00a0Morley Mao Matthew Lentz Danyang Zhuo and Ion Stoica. 2025. Hetermoe: Efficient training of mixture-of-experts models on heterogeneous gpus. 
arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2504.03871 (2025)."},{"key":"e_1_3_3_1_24_2","volume-title":"Qwen3 Technical Report","author":"Yang An","year":"2025","unstructured":"An Yang, Anfeng Li, Baosong Yang, et\u00a0al. 2025. Qwen3 Technical Report. arXiv:arXiv:2505.09388"},{"key":"e_1_3_3_1_25_2","first-page":"961","volume-title":"2023 USENIX Annual Technical Conference (USENIX ATC 23)","author":"Zhai Mingshu","year":"2023","unstructured":"Mingshu Zhai, Jiaao He, Zixuan Ma, Zan Zong, Runqing Zhang, and Jidong Zhai. 2023. SmartMoE: Efficiently training Sparsely-Activated models through combining offline and online parallelization. In 2023 USENIX Annual Technical Conference (USENIX ATC 23). 961\u2013975."},{"key":"e_1_3_3_1_26_2","first-page":"559","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric\u00a0P Xing, et\u00a0al. 2022. Alpa: Automating inter- and Intra-Operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 
559\u2013578."}],"event":{"name":"SC Workshops '25: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis","location":"St Louis MO USA","acronym":"SC Workshops '25","sponsor":["SIGHPC ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing"]},"container-title":["Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3731599.3767407","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T19:33:35Z","timestamp":1767987215000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3731599.3767407"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,15]]},"references-count":25,"alternative-id":["10.1145\/3731599.3767407","10.1145\/3731599"],"URL":"https:\/\/doi.org\/10.1145\/3731599.3767407","relation":{},"subject":[],"published":{"date-parts":[[2025,11,15]]},"assertion":[{"value":"2025-11-15","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}