{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T12:25:24Z","timestamp":1773318324329,"version":"3.50.1"},"reference-count":37,"publisher":"SAGE Publications","issue":"2","license":[{"start":{"date-parts":[[2025,8,1]],"date-time":"2025-08-01T00:00:00Z","timestamp":1754006400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"},{"start":{"date-parts":[[2025,8,1]],"date-time":"2025-08-01T00:00:00Z","timestamp":1754006400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"funder":[{"name":"National Science Foundation","award":["2405142"],"award-info":[{"award-number":["2405142"]}]},{"name":"National Science Foundation","award":["2412182"],"award-info":[{"award-number":["2412182"]}]}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2026,3]]},"abstract":"<jats:p>Modern supercomputers feature an ever-increasing degree of parallelism, particularly in the number of cores per node. These high core counts are considered in our flexible implementation of allreduce, which was implemented specifically with shared-memory communication in mind. At a high level, our algorithm consists of a reduce_scatter stage followed by an allgather stage, or a reduce stage followed by a broadcast stage, and allows for different factors (aka multi-radix) to be applied at each. The reduce and broadcast operations are also considered as standalone functions. Where barriers are required, they are integrated into the algorithm using counters to track progress. To accommodate the complexity of this approach, our implementation is split into a setup phase and an execution phase. The setup phase occurs only once for a given set of parameters, and is responsible for determining the algorithm that will be run each time the allreduce is called within the execution phase. We present two interfaces: a persistent collective interface (an MPI 4.0 feature), which inherently aligns with our setup and execution phases, and a blocking interface, where the setup is performed the first time a specific algorithm is required. Using these methods, we achieve speedups of half an order of magnitude compared to the persistent and blocking allreduce implementations of MPICH and Open MPI on a dual-socket node with AMD EPYC processors, and almost an order of magnitude on a four-socket node with NVIDIA Grace-Hopper processors. Reductions on vectors residing on both CPU and GPU memory are performed. Our implementation also achieves good performance on multiple nodes. A standard benchmark of the application CP2K is sped up by 2.5%. Notably, for long messages, our implementation achieves the same performance as NCCL.<\/jats:p>","DOI":"10.1177\/10943420251363423","type":"journal-article","created":{"date-parts":[[2025,8,1]],"date-time":"2025-08-01T22:50:57Z","timestamp":1754088657000},"page":"219-239","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":0,"title":["A compilation-based approach to performant reduction and redistribution collective communication algorithms"],"prefix":"10.1177","volume":"40","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3327-4230","authenticated-orcid":false,"given":"Andreas","family":"Jocksch","sequence":"first","affiliation":[{"name":"Swiss National Supercomputing Centre"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-0768-4243","authenticated-orcid":false,"given":"C. Nicole","family":"Avans","sequence":"additional","affiliation":[{"name":"Tennessee Technological University"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7486-7542","authenticated-orcid":false,"given":"Riley","family":"Shipley","sequence":"additional","affiliation":[{"name":"Tennessee Technological University"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5252-6600","authenticated-orcid":false,"given":"Anthony","family":"Skjellum","sequence":"additional","affiliation":[{"name":"Tennessee Technological University"}]}],"member":"179","published-online":{"date-parts":[[2025,8,1]]},"reference":[{"key":"e_1_3_5_2_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2021.102860"},{"key":"e_1_3_5_3_1","doi-asserted-by":"publisher","DOI":"10.1142\/S012962649300037X"},{"key":"e_1_3_5_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476178"},{"key":"e_1_3_5_5_1","doi-asserted-by":"publisher","DOI":"10.23919\/ISC.2024.10528936"},{"key":"e_1_3_5_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2642769.2642773"},{"key":"e_1_3_5_7_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2408.11556"},{"key":"e_1_3_5_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-30218-6_19"},{"key":"e_1_3_5_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/HiPC62374.2024.00023"},{"key":"e_1_3_5_10_1","volume-title":"Using MPI-2: Advanced Features of the Message Passing Interface","author":"Gropp W","year":"1999","unstructured":"Gropp W, Thakur R, Lusk E (1999) Using MPI-2: Advanced Features of the Message Passing Interface. 2nd edition. Cambridge, MA: MIT Press.","edition":"2"},{"key":"e_1_3_5_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-016-1779-7"},{"key":"e_1_3_5_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2018.00111"},{"key":"e_1_3_5_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2006.1639561"},{"key":"e_1_3_5_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ExaMPI54564.2021.00007"},{"key":"e_1_3_5_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/PMBS.2018.8641622"},{"key":"e_1_3_5_16_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2021.102812"},{"key":"e_1_3_5_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-85697-6_18"},{"key":"e_1_3_5_18_1","unstructured":"Jocksch A Piccinali JG (2023a) An efficient implementation of blocking and persistent MPI collective communication. In: EuroMPI'23 Conference Bristol 2023 Poster. https:\/\/eurompi23.github.io\/assets\/papers\/EuroMPI23_paper_30.pdf"},{"key":"e_1_3_5_19_1","unstructured":"Jocksch A Piccinali JG (2023b) MPI for multi-core multi socket and GPU architectures: optimised shared memory allreduce. In: PASC 2023 Conference Davos 2023 Poster. https:\/\/pasc23.pasc-conference.org\/presentation\/?id=pos139&sess=sess110"},{"key":"e_1_3_5_20_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2019.04.015"},{"key":"e_1_3_5_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3492805.3492808"},{"key":"e_1_3_5_22_1","doi-asserted-by":"publisher","DOI":"10.1063\/5.0007045"},{"key":"e_1_3_5_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3503221.3508399"},{"key":"e_1_3_5_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10586-014-0361-4"},{"key":"e_1_3_5_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/SBAC-PAD53543.2021.00028"},{"key":"e_1_3_5_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01407956"},{"key":"e_1_3_5_27_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.7402"},{"key":"e_1_3_5_28_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-02303-3_4"},{"key":"e_1_3_5_29_1","unstructured":"NVIDIA (2025) NVIDIA collective communications library (NCCL). https:\/\/developer.nvidia.com\/nccl"},{"key":"e_1_3_5_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3581784.3607074"},{"key":"e_1_3_5_31_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2017.08.004"},{"key":"e_1_3_5_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.271"},{"key":"e_1_3_5_33_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342005051521"},{"key":"e_1_3_5_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER49012.2020.00037"},{"key":"e_1_3_5_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER52292.2023.00031"},{"key":"e_1_3_5_36_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.7879"},{"key":"e_1_3_5_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/2616498.2616532"},{"key":"e_1_3_5_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER52292.2023.00013"}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420251363423","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/10943420251363423","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420251363423","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T18:41:14Z","timestamp":1773254474000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/10943420251363423"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,1]]},"references-count":37,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,3]]}},"alternative-id":["10.1177\/10943420251363423"],"URL":"https:\/\/doi.org\/10.1177\/10943420251363423","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,8,1]]}}}