{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,10]],"date-time":"2026-06-10T01:26:23Z","timestamp":1781054783247,"version":"3.54.1"},"publisher-location":"New York, NY, USA","reference-count":49,"publisher":"ACM","funder":[{"name":"EuroHPC JU","award":["101196247"],"award-info":[{"award-number":["101196247"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,11,16]]},"DOI":"10.1145\/3731599.3767508","type":"proceedings-article","created":{"date-parts":[[2025,11,7]],"date-time":"2025-11-07T16:18:44Z","timestamp":1762532324000},"page":"1314-1329","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Redesigning GROMACS Halo Exchange: Improving Strong Scaling with GPU-initiated NVSHMEM"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-5953-0436","authenticated-orcid":false,"given":"Mahesh","family":"Doijade","sequence":"first","affiliation":[{"name":"NVIDIA, Santa Clara, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4906-7241","authenticated-orcid":false,"given":"Andrey","family":"Alekseenko","sequence":"additional","affiliation":[{"name":"KTH Royal Institute of Technology, Stockholm, Sweden"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-7116-7535","authenticated-orcid":false,"given":"Ania","family":"Brown","sequence":"additional","affiliation":[{"name":"NVIDIA, Santa Clara, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-7731-1855","authenticated-orcid":false,"given":"Alan","family":"Gray","sequence":"additional","affiliation":[{"name":"NVIDIA, Santa Clara, 
USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0603-5514","authenticated-orcid":false,"given":"Szil\u00e1rd","family":"P\u00e1ll","sequence":"additional","affiliation":[{"name":"KTH Royal Institute of Technology, Stockholm, Sweden"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,11,15]]},"reference":[{"key":"e_1_3_3_2_2_2","doi-asserted-by":"publisher","unstructured":"Mark Abraham Andrey Alekseenko Cathrine Bergh Christian Blau Eliane Briand Mahesh Doijade Stefan Fleischmann Vytautas Gapsys Gaurav Garg Sergey Gorelov Gilles Gouaillardet Alan Gray M.\u00a0Eric Irrgang Farzaneh Jalalypour Joe Jordan Christoph Junghans Prashanth Kanduri Sebastian Keller Carsten Kutzner Justin\u00a0A. Lemkul Magnus Lundborg Pascal Merz Vedran Mileti\u0107 Dmitry Morozov Szil\u00e1rd P\u00e1ll Roland Schulz Michael Shirts Alexey Shvetsov B\u00e1lint Soproni David van\u00a0der Spoel Philip Turner Carsten Uphoff Alessandra Villa Sebastian Wingberm\u00fchle Artem Zhmurov Paul Bauer Berk Hess and Erik Lindahl. 2023. GROMACS 2023 Source code. 10.5281\/zenodo.7588619","DOI":"10.5281\/zenodo.7588619"},{"key":"e_1_3_3_2_3_2","doi-asserted-by":"publisher","unstructured":"Olivier Adjoua Louis Lagard\u00e8re Luc-Henri Jolly Arnaud Durocher Thibaut Very Isabelle Dupays Zhi Wang Th\u00e9o\u00a0Jaffrelot Inizan Fr\u00e9d\u00e9ric C\u00e9lerse Pengyu Ren Jay\u00a0W. Ponder and Jean-Philip Piquemal. 2021. Tinker-HP: Accelerating Molecular Dynamics Simulations of Large Complex Systems with Advanced Point Dipole Polarizable Force Fields Using GPUs and Multi-GPU Systems. Journal of Chemical Theory and Computation 17 4 (April 2021) 2034\u20132053. 10.1021\/acs.jctc.0c01164","DOI":"10.1021\/acs.jctc.0c01164"},{"key":"e_1_3_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3725789.3725797"},{"key":"e_1_3_3_2_5_2","unstructured":"AMD Corporation. 2025. ROCm\/rocSHMEM. 
https:\/\/github.com\/ROCm\/rocSHMEM"},{"key":"e_1_3_3_2_6_2","unstructured":"Charlie Boyle. 2024. NVIDIA Eos Revealed: Peek Into Operations of a Top 10 Supercomputer. https:\/\/blogs.nvidia.com\/blog\/eos\/."},{"key":"e_1_3_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-73370-3_3"},{"key":"e_1_3_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/SCW63240.2024.00169"},{"key":"e_1_3_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431379.3464454"},{"key":"e_1_3_3_2_10_2","unstructured":"C.-H. Chu et\u00a0al. 2020. GPU-based Key-Value Store using NVSHMEM. https:\/\/developer.nvidia.com\/blog\/scaling-scientific-computing-with-nvshmem\/."},{"key":"e_1_3_3_2_11_2","unstructured":"Jan Ciesko Jeremiah Wilke and Christian Trott. 2023. Kokkos Remote Spaces. https:\/\/github.com\/kokkos\/kokkos-remote-spaces"},{"key":"e_1_3_3_2_12_2","doi-asserted-by":"publisher","unstructured":"Mahesh Doijade Andrey Alekseenko Ania Brown Alan Gray and Szil\u00e1rd P\u00e1ll. 2025. Artifact A2: Supplementary data for \u201cGPU-initiated Halo Exchange using NVSHMEM in GROMACS\u201d. 10.5281\/zenodo.17062607","DOI":"10.5281\/zenodo.17062607"},{"key":"e_1_3_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/2834892.2834897"},{"key":"e_1_3_3_2_14_2","doi-asserted-by":"publisher","unstructured":"Mark\u00a0S Friedrichs Peter Eastman Vishal Vaidyanathan Mike Houston Scott Legrand Adam\u00a0L Beberg Daniel\u00a0L Ensign Christopher\u00a0M Bruns and Vijay\u00a0S Pande. 2009. Accelerating molecular dynamic simulation on graphics processing units. Journal of computational chemistry 30 6 (April 2009) 864\u201372. 10.1002\/jcc.21209","DOI":"10.1002\/jcc.21209"},{"key":"e_1_3_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.22323\/1.430.0282"},{"key":"e_1_3_3_2_16_2","unstructured":"Alan Gray and Szil\u00e1rd P\u00e1ll. 2023. A Guide to CUDA Graphs in GROMACS 2023. 
https:\/\/developer.nvidia.com\/blog\/a-guide-to-cuda-graphs-in-gromacs-2023\/."},{"key":"e_1_3_3_2_17_2","volume-title":"GROMACS Reference Manual: Domain decomposition","author":"team GROMACS development","year":"2025","unstructured":"GROMACS development team. 2025. GROMACS Reference Manual: Domain decomposition. https:\/\/manual.gromacs.org\/current\/reference-manual\/algorithms\/parallelization-domain-decomp.html Details on DD, staggered grids under DLB, and communication range."},{"key":"e_1_3_3_2_18_2","doi-asserted-by":"publisher","unstructured":"M\u00a0J Harvey G Giupponi and G\u00a0De Fabritiis. 2009. ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale. Journal of Chemical Theory and Computation 5 6 (June 2009) 1632\u20131639. 10.1021\/ct9000685","DOI":"10.1021\/ct9000685"},{"key":"e_1_3_3_2_19_2","doi-asserted-by":"publisher","unstructured":"Berk Hess Carsten Kutzner David van\u00a0der Spoel and Erik Lindahl. 2008. GROMACS 4: Algorithms for Highly Efficient Load-Balanced and Scalable Molecular Simulation. Journal of Chemical Theory and Computation 4 3 (2008) 435\u2013447. 10.1021\/ct700301q","DOI":"10.1021\/ct700301q"},{"key":"e_1_3_3_2_20_2","unstructured":"Intel Corporation. 2025. Intel\u00ae MPI Library Developer Guide for Linux OS: Device-Initiated Communications. https:\/\/www.intel.com\/content\/www\/us\/en\/docs\/mpi-library\/developer-guide-linux\/2021-16\/device-initiated-communications.html"},{"key":"e_1_3_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3577193.3593713"},{"key":"e_1_3_3_2_22_2","doi-asserted-by":"publisher","unstructured":"Carsten Kutzner Vedran Mileti\u0107 Karen Palacio\u00a0Rodr\u00edguez Markus Rampp Gerhard Hummer Bert\u00a0L. de Groot and Helmut Grubm\u00fcller. 2025. Scaling of the GROMACS Molecular Dynamics Code to 65k CPU Cores on an HPC Cluster. Journal of Computational Chemistry 46 5 (2025) e70059. 
10.1002\/jcc.70059","DOI":"10.1002\/jcc.70059"},{"key":"e_1_3_3_2_23_2","doi-asserted-by":"publisher","unstructured":"S.\u00a0Y. Liem D. Brown and J.\u00a0H.\u00a0R. Clarke. 1991. A general and efficient method for releasing constraint forces. Molecular Physics 74 2 (1991) 397\u2013411. 10.1080\/00268979100102321","DOI":"10.1080\/00268979100102321"},{"key":"e_1_3_3_2_24_2","doi-asserted-by":"publisher","unstructured":"Erik Lindahl Berk Hess and David van\u00a0der Spoel. 2001. GROMACS 3.0: a package for molecular simulation and trajectory analysis. Molecular modeling annual 7 8 (Aug. 2001) 306\u2013317. 10.1007\/s008940100045","DOI":"10.1007\/s008940100045"},{"key":"e_1_3_3_2_25_2","unstructured":"Hannes\u00a0H. Loeffler and D. Winn Martyn. 2012. Large biomolecular simulation on HPC platforms III. AMBER CHARMM GROMACS Large biomolecular simulation on HPC platforms. Tech. rep. STFC Daresbury Laboratory Daresbury Warrington WA4 4AD United Kingdom. (August 2012) 1\u201326."},{"key":"e_1_3_3_2_26_2","unstructured":"Naveen Namashivayam Krishna Kandalla James B.\u00a0White III Larry Kaplan and Mark Pagel. 2023. Exploring Fully Offloaded GPU Stream-Aware Message Passing. http:\/\/arxiv.org\/abs\/2306.15773 arXiv:https:\/\/arXiv.org\/abs\/2306.15773 [cs]."},{"key":"e_1_3_3_2_27_2","volume-title":"NVIDIA GB200 NVL72 Blackwell Platform Datasheet","author":"Corporation NVIDIA","year":"2024","unstructured":"NVIDIA Corporation. 2024. NVIDIA GB200 NVL72 Blackwell Platform Datasheet. https:\/\/nvdam.widen.net\/s\/wwnsxrhm2w\/blackwell-datasheet-3384703"},{"key":"e_1_3_3_2_28_2","unstructured":"NVIDIA Corporation. 2025. libcu++ PTX API: cp.async.bulk. https:\/\/nvidia.github.io\/cccl\/libcudacxx\/ptx\/instructions\/cp_async_bulk.html#cp-async-bulk-global-shared-cta-bulk-group. CUDA PTX cp.async.bulk Global\u00a0\u2194 \u00a0Shared CTA bulk_group variants and usage."},{"key":"e_1_3_3_2_29_2","unstructured":"NVIDIA Corporation. 2025. Memory model \u2014 libcudacxx 3.2 documentation. 
https:\/\/nvidia.github.io\/cccl\/libcudacxx\/extended_api\/memory_model.html Describes thread scopes acquire and release semantics and memory ordering in CUDA."},{"key":"e_1_3_3_2_30_2","volume-title":"NVIDIA Fabric Manager User Guide","author":"Corporation NVIDIA","year":"2025","unstructured":"NVIDIA Corporation. 2025. NVIDIA Fabric Manager User Guide. https:\/\/docs.nvidia.com\/datacenter\/tesla\/fabric-manager-user-guide\/index.html#h100-baseboard."},{"key":"e_1_3_3_2_31_2","volume-title":"NVIDIA GB200 NVL Multi-Node Tuning Guide","author":"Corporation NVIDIA","year":"2025","unstructured":"NVIDIA Corporation. 2025. NVIDIA GB200 NVL Multi-Node Tuning Guide. https:\/\/docs.nvidia.com\/multi-node-nvlink-systems\/multi-node-tuning-guide\/system.html"},{"key":"e_1_3_3_2_32_2","unstructured":"NVIDIA Corporation. 2025. NVIDIA Hopper Tuning Guide \u2014 Hopper Tuning Guide 12.9 documentation. https:\/\/docs.nvidia.com\/cuda\/hopper-tuning-guide\/index.html#tensor-memory-accelerator"},{"key":"e_1_3_3_2_33_2","volume-title":"NVIDIA OpenSHMEM Library (NVSHMEM) Documentation","author":"Corporation NVIDIA","year":"2025","unstructured":"NVIDIA Corporation. 2025. NVIDIA OpenSHMEM Library (NVSHMEM) Documentation. NVSHMEM implements the OpenSHMEM parallel programming model for clusters of NVIDIA GPUs."},{"key":"e_1_3_3_2_34_2","volume-title":"NVSHMEM Memory Registration: nvshmemx_buffer_register","author":"Corporation NVIDIA","year":"2025","unstructured":"NVIDIA Corporation. 2025. NVSHMEM Memory Registration: nvshmemx_buffer_register. NVSHMEMX_BUFFER_REGISTER API and requirements."},{"key":"e_1_3_3_2_35_2","volume-title":"Parallel Thread Execution (PTX) ISA","author":"Corporation NVIDIA","year":"2025","unstructured":"NVIDIA Corporation. 2025. Parallel Thread Execution (PTX) ISA. 
https:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/ See Section 8.9, Release and Acquire Patterns, and PTX instruction reference."},{"key":"e_1_3_3_2_36_2","volume-title":"OpenSHMEM Application Programming Interface","author":"Consortium OpenSHMEM","year":"2020","unstructured":"OpenSHMEM Consortium. 2020. OpenSHMEM Application Programming Interface. OpenSHMEM.org. http:\/\/www.openshmem.org\/site\/Specification"},{"key":"e_1_3_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2002.10019"},{"key":"e_1_3_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/HiPC.2017.00037"},{"key":"e_1_3_3_2_39_2","doi-asserted-by":"publisher","unstructured":"Sander Pronk Szil\u00e1rd P\u00e1ll Roland Schulz Per Larsson P\u00e4r Bjelkmar Rossen Apostolov Michael\u00a0R Shirts Jeremy\u00a0C Smith Peter\u00a0M Kasson David van\u00a0der Spoel Berk Hess and Erik Lindahl. 2013. GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics (Oxford England) 29 7 (May 2013) 845\u201354. 10.1093\/bioinformatics\/btt055","DOI":"10.1093\/bioinformatics\/btt055"},{"key":"e_1_3_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-15976-8_1"},{"key":"e_1_3_3_2_41_2","doi-asserted-by":"publisher","unstructured":"Szil\u00e1rd P\u00e1ll and Berk Hess. 2013. A flexible algorithm for calculating pair interactions on SIMD architectures. Computer Physics Communications 184 12 (Dec. 2013) 2641\u20132650. 10.1016\/j.cpc.2013.06.003","DOI":"10.1016\/j.cpc.2013.06.003"},{"key":"e_1_3_3_2_42_2","doi-asserted-by":"publisher","unstructured":"Szil\u00e1rd P\u00e1ll Artem Zhmurov Paul Bauer Mark Abraham Magnus Lundborg Alan Gray Berk Hess and Erik Lindahl. 2020. Heterogeneous parallelization and acceleration of molecular dynamics simulations in GROMACS. J. Chem. Phys. 153 13 (Oct. 2020) 134110. 
10.1063\/5.0018516","DOI":"10.1063\/5.0018516"},{"key":"e_1_3_3_2_43_2","volume-title":"Proceedings of the 7th International Conference on PGAS Programming Models","author":"Reyes Ruyman","year":"2013","unstructured":"Ruyman Reyes, Andrew Turner, and Berk Hess. 2013. Introducing SHMEM into the GROMACS molecular dynamics application: experience and results. In Proceedings of the 7th International Conference on PGAS Programming Models, A\u00a0Jackson M\u00a0Weiland and N\u00a0Johnson (Eds.). The University of Edinburgh. https:\/\/www.pure.ed.ac.uk\/ws\/portalfiles\/portal\/19680805\/pgas2013proceedings.pdf"},{"key":"e_1_3_3_2_44_2","doi-asserted-by":"publisher","unstructured":"T.P. Straatsma and Daniel\u00a0G. Chavarr\u00eda-Miranda. 2013. On eliminating synchronous communication in molecular simulations to improve scalability. Computer Physics Communications 184 12 (2013) 2634\u20132640. 10.1016\/j.cpc.2013.01.009","DOI":"10.1016\/j.cpc.2013.01.009"},{"key":"e_1_3_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-26428-8_8"},{"key":"e_1_3_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/SCW63240.2024.00065"},{"key":"e_1_3_3_2_47_2","unstructured":"Matthias Wagner. 2020. GTC 2020: Overcoming Latency Barriers: Strong Scaling HPC Applications with NVSHMEM. https:\/\/developer.nvidia.com\/gtc\/2020\/video\/s21673-vid"},{"key":"e_1_3_3_2_48_2","volume-title":"Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation","author":"Wang Yuke","year":"2023","unstructured":"Yuke Wang, Boyuan Feng, Zheng Wang, Kevin Barker, Ang Li, and Yufei Ding. 2023. MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms. In Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association. 
https:\/\/par.nsf.gov\/biblio\/10467857-mgg-accelerating-graph-neural-networks-fine-grained-intra-kernel-communication-computation-pipelining-multi-gpu-platforms"},{"key":"e_1_3_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3472456.3472478"},{"key":"e_1_3_3_2_50_2","doi-asserted-by":"publisher","unstructured":"Junchao Zhang Jed Brown Satish Balay Jacob Faibussowitsch Matthew Knepley Oana Marin Richard\u00a0Tran Mills Todd Munson Barry\u00a0F. Smith and Stefano Zampini. 2022. The PetscSF Scalable Communication Layer. IEEE Transactions on Parallel and Distributed Systems 33 4 (2022) 842\u2013853. 10.1109\/TPDS.2021.3084070","DOI":"10.1109\/TPDS.2021.3084070"}],"event":{"name":"SC Workshops '25: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis","location":"St Louis MO USA","acronym":"SC Workshops '25","sponsor":["SIGHPC ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing"]},"container-title":["Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3731599.3767508","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T19:34:24Z","timestamp":1767987264000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3731599.3767508"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,15]]},"references-count":49,"alternative-id":["10.1145\/3731599.3767508","10.1145\/3731599"],"URL":"https:\/\/doi.org\/10.1145\/3731599.3767508","relation":{},"subject":[],"published":{"date-parts":[[2025,11,15]]},"assertion":[{"value":"2025-11-15","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}