{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T20:58:10Z","timestamp":1775854690740,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":30,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,2,17]],"date-time":"2021-02-17T00:00:00Z","timestamp":1613520000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,2,17]]},"DOI":"10.1145\/3437801.3441620","type":"proceedings-article","created":{"date-parts":[[2021,2,20]],"date-time":"2021-02-20T23:04:20Z","timestamp":1613862260000},"page":"62-75","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":47,"title":["Synthesizing optimal collective algorithms"],"prefix":"10.1145","author":[{"given":"Zixian","family":"Cai","sequence":"first","affiliation":[{"name":"Australian National University, Canberra, ACT, Australia"}]},{"given":"Zhengyang","family":"Liu","sequence":"additional","affiliation":[{"name":"University of Utah"}]},{"given":"Saeed","family":"Maleki","sequence":"additional","affiliation":[{"name":"Microsoft Research"}]},{"given":"Madanlal","family":"Musuvathi","sequence":"additional","affiliation":[{"name":"Microsoft Research"}]},{"given":"Todd","family":"Mytkowicz","sequence":"additional","affiliation":[{"name":"Microsoft Research"}]},{"given":"Jacob","family":"Nelson","sequence":"additional","affiliation":[{"name":"Microsoft Research"}]},{"given":"Olli","family":"Saarikivi","sequence":"additional","affiliation":[{"name":"Microsoft Research"}]}],"member":"320","published-online":{"date-parts":[[2021,2,17]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"AMD Radeon Instinct MI50 2020. AMD Radeon Instinct MI50 Accelerator. https:\/\/www.amd.com\/system\/files\/documents\/radeon-instinctmi50-datasheet.pdf."},{"key":"e_1_3_2_1_2_1","unstructured":"AMD RCCL Library 2020. ROCm Communication Collectives Library. https:\/\/github.com\/ROCmSoftwarePlatform\/rccl."},{"key":"e_1_3_2_1_3_1","volume-title":"Supercomputing'94: Proceedings of the 1994 ACM\/IEEE Conference on Supercomputing. IEEE, 107--116","author":"Barnett Mike","unstructured":"Mike Barnett, Satya Gupta, David G Payne, Lance Shuler, Robert van de Geijn, and Jerrell Watts. 1994. Building a high-performance collective communication library. In Supercomputing'94: Proceedings of the 1994 ACM\/IEEE Conference on Supercomputing. IEEE, 107--116."},{"key":"e_1_3_2_1_4_1","volume-title":"Global combine on mesh architectures with wormhole routing. In [1993] Proceedings Seventh International Parallel Processing Symposium","author":"Barnett Michael","unstructured":"Michael Barnett, Rick Littlefield, David G Payne, and Robert van de Geijn. 1993. Global combine on mesh architectures with wormhole routing. In [1993] Proceedings Seventh International Parallel Processing Symposium. IEEE, 156--162."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/SHPCC.1992.232628"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.5555\/1285358.1285359"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1147\/JRD.2019.2947013"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-78800-3_24"},{"key":"e_1_3_2_1_9_1","first-page":"32","article-title":"MPI: A message-passing interface standard version 3.0","volume":"2","author":"Jack Dongarra","year":"2013","unstructured":"Jack Dongarra et al. 2013. MPI: A message-passing interface standard version 3.0. High Performance Computing Center Stuttgart (HLRS) 2, 5 (2013), 32.","journal-title":"High Performance Computing Center Stuttgart (HLRS)"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-30218-6_19"},{"key":"e_1_3_2_1_11_1","volume-title":"TPU","author":"Google","year":"2020","unstructured":"Google TPU 2020. Google Cloud TPU. https:\/\/cloud.google.com\/tpu."},{"key":"e_1_3_2_1_12_1","volume-title":"IPU","author":"Graphcore","year":"2020","unstructured":"Graphcore IPU 2020. Graphcore Intelligence Processing Unit. https:\/\/www.graphcore.ai\/products\/ipu."},{"key":"e_1_3_2_1_13_1","volume-title":"Sangeetha Abdu Jyothi, and Roy H Campbell","author":"Hashemi Sayed Hadi","year":"2019","unstructured":"Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and Roy H Campbell. 2019. TicTac: Accelerating distributed deep learning with communication scheduling. (March 2019)."},{"key":"e_1_3_2_1_14_1","volume-title":"The communication challenge for MPP: Intel Paragon and Meiko CS-2. Parallel computing 20, 3","author":"Hockney Roger W","year":"1994","unstructured":"Roger W Hockney. 1994. The communication challenge for MPP: Intel Paragon and Meiko CS-2. Parallel computing 20, 3 (1994), 389--398."},{"key":"e_1_3_2_1_15_1","volume-title":"Priority-based parameter propagation for distributed DNN training. (March","author":"Jayarajan Anand","year":"2019","unstructured":"Anand Jayarajan, Jinliang Wei, Garth Gibson, Alexandra Fedorova, and Gennady Pekhimenko. 2019. Priority-based parameter propagation for distributed DNN training. (March 2019)."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2019.2928289"},{"key":"e_1_3_2_1_17_1","volume-title":"Proceedings of Machine Learning and Systems","author":"Luo Liang","year":"2020","unstructured":"Liang Luo, Peter West, Jacob Nelson, Arvind Krishnamurthy, and Luis Ceze. 2020. PLink: Discovering and Exploiting Locality for Accelerated Distributed Training on the public Cloud. In Proceedings of Machine Learning and Systems 2020. 82--97."},{"key":"e_1_3_2_1_18_1","unstructured":"NVIDIA NCCL Library 2020. NVIDIA Collective Communications Library. https:\/\/github.com\/NVIDIA\/nccl."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359642"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10586-007-0012-0"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-45706-2_112"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/DMCC.1991.633174"},{"key":"e_1_3_2_1_23_1","unstructured":"Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv:1802.05799 [cs.LG]"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/331532.331555"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342005051521"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2003.1213188"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-45825-5_57"},{"key":"e_1_3_2_1_28_1","unstructured":"UCX 2020. Unified Communication X. https:\/\/www.openucx.org\/."},{"key":"e_1_3_2_1_29_1","volume-title":"Blink: Fast and Generic Collectives for Distributed ML. In Conference on Machine Learning and Systems (MLSys","author":"Wang Guanhua","year":"2020","unstructured":"Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, Jorgen Thelin, Nikhil Devanur, and Ion Stoica. 2020. Blink: Fast and Generic Collectives for Distributed ML. In Conference on Machine Learning and Systems (MLSys 2020)."},{"key":"e_1_3_2_1_30_1","volume-title":"2017 USENIX Annual Technical Conference (USENIX ATC 17)","author":"Zhang Hao","year":"2017","unstructured":"Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, and Eric P Xing. 2017. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In 2017 USENIX Annual Technical Conference (USENIX ATC 17). 181--193."}],"event":{"name":"PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","location":"Virtual Event Republic of Korea","acronym":"PPoPP '21","sponsor":["SIGPLAN ACM Special Interest Group on Programming Languages","SIGHPC ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing"]},"container-title":["Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3437801.3441620","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3437801.3441620","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:17:26Z","timestamp":1750191446000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3437801.3441620"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,2,17]]},"references-count":30,"alternative-id":["10.1145\/3437801.3441620","10.1145\/3437801"],"URL":"https:\/\/doi.org\/10.1145\/3437801.3441620","relation":{},"subject":[],"published":{"date-parts":[[2021,2,17]]},"assertion":[{"value":"2021-02-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}