{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,8]],"date-time":"2026-05-08T22:21:13Z","timestamp":1778278873838,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":46,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,1,27]],"date-time":"2023-01-27T00:00:00Z","timestamp":1674777600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,1,27]]},"DOI":"10.1145\/3575693.3575724","type":"proceedings-article","created":{"date-parts":[[2023,1,30]],"date-time":"2023-01-30T22:56:55Z","timestamp":1675119415000},"page":"502-514","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":30,"title":["MSCCLang: Microsoft Collective Communication Language"],"prefix":"10.1145","author":[{"given":"Meghan","family":"Cowan","sequence":"first","affiliation":[{"name":"Microsoft Research, USA"}]},{"given":"Saeed","family":"Maleki","sequence":"additional","affiliation":[{"name":"Microsoft Research, USA"}]},{"given":"Madanlal","family":"Musuvathi","sequence":"additional","affiliation":[{"name":"Microsoft Research, USA"}]},{"given":"Olli","family":"Saarikivi","sequence":"additional","affiliation":[{"name":"Microsoft Research, USA"}]},{"given":"Yifan","family":"Xiong","sequence":"additional","affiliation":[{"name":"Microsoft Research, China"}]}],"member":"320","published-online":{"date-parts":[[2023,1,30]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2022. AI and Compute. https:\/\/openai.com\/blog\/ai-and-compute\/"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"crossref","unstructured":"Michael Barnett Rick Littlefield David G Payne and Robert van de Geijn. 1993. 
Global combine on mesh architectures with wormhole routing. In [1993] Proceedings Seventh International Parallel Processing Symposium. 156\u2013162.","DOI":"10.1109\/IPPS.1993.262873"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/SHPCC.1992.232628"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3437801.3441620"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/1285358.1285359"},{"key":"e_1_3_2_1_6_1","volume-title":"Proceedings of the 2nd SysML Conference.","author":"Cho Minsik","year":"2019","unstructured":"Minsik Cho, Ulrich Finkler, and David Kung. 2019. BlueConnect: Novel hierarchical all-reduce on multi-tired network for deep learning. In Proceedings of the 2nd SysML Conference."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1147\/JRD.2019.2947013"},{"key":"e_1_3_2_1_8_1","unstructured":"2017. Cooperative Groups: Flexible CUDA Thread Programming. https:\/\/developer.nvidia.com\/blog\/cooperative-groups\/"},{"key":"e_1_3_2_1_9_1","unstructured":"2021. DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression. https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression\/"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3437963.3441727"},{"key":"e_1_3_2_1_11_1","first-page":"32","article-title":"MPI: A message-passing interface standard version 3.0","volume":"2","author":"Dongarra Jack","year":"2013","unstructured":"Jack Dongarra. 2013. MPI: A message-passing interface standard version 3.0. High Performance Computing Center Stuttgart (HLRS), 2, 5 (2013), 32.","journal-title":"High Performance Computing Center Stuttgart (HLRS)"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2363.2433"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/COMHPC.2016.006"},{"key":"e_1_3_2_1_14_1","volume-title":"Using MPI: Portable Parallel Programming with the Message-Passing Interface","author":"Gropp William","unstructured":"William Gropp, Ewing Lusk, and Anthony Skjellum. 2014. Using MPI: Portable Parallel Programming with the Message-Passing Interface. The MIT Press. isbn:0262527391"},{"key":"e_1_3_2_1_15_1","volume-title":"Sangeetha Abdu Jyothi, and Roy H Campbell","author":"Hashemi Sayed Hadi","year":"2019","unstructured":"Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and Roy H Campbell. 2019. TicTac: Accelerating distributed deep learning with communication scheduling. March."},{"key":"e_1_3_2_1_16_1","volume-title":"High Performance Computing: 36th International Conference, ISC High Performance","author":"Hashmi Jahanzeb Maqbool","year":"2021","unstructured":"Jahanzeb Maqbool Hashmi and Dhabaleswar K Panda. 2021. BluesMPI: Efficient MPI Non-blocking Alltoall Offloading Designs on Modern BlueField Smart NICs. In High Performance Computing: 36th International Conference, ISC High Performance 2021, Virtual Event, June 24\u2013July 2, 2021, Proceedings. 12728, 18."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/H2RC54759.2021.00009"},{"key":"e_1_3_2_1_19_1","unstructured":"2021. Intel Concurrent Collections for C++. https:\/\/icnc.github.io\/"},{"key":"e_1_3_2_1_20_1","unstructured":"Anand Jayarajan Jinliang Wei Garth Gibson Alexandra Fedorova and Gennady Pekhimenko. 2019. Priority-based parameter propagation for distributed DNN training. March."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.5555\/3488766.3488792"},{"key":"e_1_3_2_1_22_1","volume-title":"ATP: In-network Aggregation for Multi-tenant Learning. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21)","author":"Lao ChonLam","year":"2021","unstructured":"ChonLam Lao , Yanfang Le , Kshiteej Mahajan , Yixi Chen , Wenfei Wu , Aditya Akella , and Michael Swift . 2021 . 
ATP: In-network Aggregation for Multi-tenant Learning. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21) . 741\u2013761."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"crossref","unstructured":"Charles Leiserson and Aske Plaat. 1997. Programming Parallel Applications in Cilk. Siam News 07.","DOI":"10.1007\/3-540-63138-0_6"},{"key":"e_1_3_2_1_24_1","volume-title":"Proceedings of Machine Learning and Systems","author":"Luo Liang","year":"2020","unstructured":"Liang Luo, Peter West, Jacob Nelson, Arvind Krishnamurthy, and Luis Ceze. 2020. PLink: Discovering and Exploiting Locality for Accelerated Distributed Training on the public Cloud. In Proceedings of Machine Learning and Systems 2020. 82\u201397."},{"key":"e_1_3_2_1_25_1","unstructured":"2021. Megatron GPT-3 Large Model Inference with Triton and ONNX Runtime. https:\/\/www.nvidia.com\/en-us\/on-demand\/session\/gtcspring21-s31578\/"},{"key":"e_1_3_2_1_26_1","unstructured":"2022. NVIDIA GPUDirect: Enhancing Data Movement and Access for GPUs. https:\/\/developer.nvidia.com\/gpudirect"},{"key":"e_1_3_2_1_27_1","unstructured":"2022. 
NVIDIA Collective Communication Library (NCCL). https:\/\/github.com\/nvidia\/nccl"},{"key":"e_1_3_2_1_28_1","unstructured":"2022. ONNX Runtime Mixture of Experts. https:\/\/github.com\/pytorch\/ort"},{"key":"e_1_3_2_1_29_1","unstructured":"2022. Parameter counts in Machine Learning. https:\/\/www.alignmentforum.org\/posts\/GzoWcYibWYwJva8aL\/parameter-counts-in-machine-learning"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359642"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10586-007-0012-0"},{"key":"e_1_3_2_1_32_1","unstructured":"2022. PyTorch on Azure. https:\/\/azure.microsoft.com\/en-us\/resources\/developers\/pytorch\/"},{"key":"e_1_3_2_1_33_1","unstructured":"2022. ROCm Communication Collectives Library (RCCL). https:\/\/github.com\/ROCmSoftwarePlatform\/rccl"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-45706-2_112"},{"key":"e_1_3_2_1_35_1","volume-title":"Scaling Distributed Machine Learning with In-Network Aggregation. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21)","author":"Sapio Amedeo","year":"2021","unstructured":"Amedeo Sapio , Marco Canini , Chen-Yu Ho , Jacob Nelson , Panos Kalnis , Changhoon Kim , Arvind Krishnamurthy , Masoud Moshref , Dan Ports , and Peter Richtarik . 2021 . Scaling Distributed Machine Learning with In-Network Aggregation. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21) . 
USENIX Association, 785\u2013808. isbn:978-1-939133-21-2 https:\/\/www.usenix.org\/conference\/nsdi21\/presentation\/sapio"},{"key":"e_1_3_2_1_36_1","volume-title":"Dan RK Ports, and Peter Richt\u00e1rik","author":"Sapio Amedeo","year":"2019","unstructured":"Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan RK Ports, and Peter Richt\u00e1rik. 2019. Scaling distributed machine learning with in-network aggregation. arXiv preprint arXiv:1903.06701."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/DMCC.1991.633174"},{"key":"e_1_3_2_1_38_1","unstructured":"Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arxiv:1802.05799."},{"key":"e_1_3_2_1_39_1","volume-title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRR, abs\/1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi , Mostofa Patwary , Raul Puri , Patrick LeGresley , Jared Casper , and Bryan Catanzaro . 2019. 
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRR, abs\/1909.08053 ( 2019 ), arXiv:1909.08053. arxiv:1909.08053"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/331532.331555"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342005051521"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2003.1213188"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-45825-5_57"},{"key":"e_1_3_2_1_44_1","first-page":"172","article-title":"Blink: Fast and generic collectives for distributed ml","volume":"2","author":"Wang Guanhua","year":"2020","unstructured":"Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, Nikhil Devanur, Jorgen Thelin, and Ion Stoica. 2020. Blink: Fast and generic collectives for distributed ml. Proceedings of Machine Learning and Systems, 2 (2020), 172\u2013186.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_45_1","unstructured":"Ningning Xie Tamara Norman Dominik Grewe and Dimitrios Vytiniotis. 2021. Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems for Deep Learning. arXiv preprint arXiv:2110.10548."},{"key":"e_1_3_2_1_46_1","volume-title":"2017 USENIX Annual Technical Conference (USENIX ATC 17)","author":"Zhang Hao","year":"2017","unstructured":"Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, and Eric P Xing. 2017. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In 2017 USENIX Annual Technical Conference (USENIX ATC 17). 181\u2013193."}],"event":{"name":"ASPLOS '23: 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2","location":"Vancouver BC Canada","acronym":"ASPLOS '23","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture","SIGOPS ACM Special Interest Group on Operating Systems","SIGPLAN ACM Special Interest Group on Programming Languages"]},"container-title":["Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 
2"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3575693.3575724","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3575693.3575724","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:51:20Z","timestamp":1750182680000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3575693.3575724"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,27]]},"references-count":46,"alternative-id":["10.1145\/3575693.3575724","10.1145\/3575693"],"URL":"https:\/\/doi.org\/10.1145\/3575693.3575724","relation":{},"subject":[],"published":{"date-parts":[[2023,1,27]]},"assertion":[{"value":"2023-01-30","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}