{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,23]],"date-time":"2026-02-23T02:41:21Z","timestamp":1771814481483,"version":"3.50.1"},"reference-count":37,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,1,14]],"date-time":"2024-01-14T00:00:00Z","timestamp":1705190400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Center for Exascale Monte Carlo Neutron Transport"},{"DOI":"10.13039\/100000015","name":"Department of Energy","doi-asserted-by":"crossref","award":["DE-NA003967"],"award-info":[{"award-number":["DE-NA003967"]}],"id":[{"id":"10.13039\/100000015","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Model. Comput. Simul."],"published-print":{"date-parts":[[2024,1,31]]},"abstract":"<jats:p>While Monte Carlo Neutron Transport (MCNT) is near-embarrassingly parallel, the effectively unpredictable lifetime of neutrons can lead to divergence when MCNT is evaluated on GPUs. Divergence is the phenomenon of adjacent threads in a warp executing different control flow paths; on GPUs, it reduces performance because each work group may only execute one path at a time. The process of Thread Data Remapping (TDR) resolves these discrepancies by moving data across hardware such that data in the same warp will be processed through similar paths. A common issue among prior implementations of TDR is the synchronous nature of its remapping and processing cycles, which exhaustively sort data produced by prior processing passes and exhaustively evaluate the sorted data. 
In another work, we defined a method of remapping data through an asynchronous scheduler which allows for work to be stored in shared memory and deferred arbitrarily until that work is a viable option for low-divergence evaluation. This article surveys a wider set of cases, with the goal of characterizing performance trends across a more comprehensive set of parameters. These parameters include cross sections of scattering\/capturing\/fission, use of implicit capture, source neutron counts, simulation time spans, and tuned memory allocations. Across these cases, we have recorded minimum and average execution times, as well as a heuristically tuned near-optimal memory allocation size for both synchronous and asynchronous scheduling. Across the collected data, it is shown that the asynchronous method is faster and more memory efficient in the majority of cases, and that it requires less tuning to achieve competitive performance.<\/jats:p>","DOI":"10.1145\/3626957","type":"journal-article","created":{"date-parts":[[2023,10,19]],"date-time":"2023-10-19T21:30:47Z","timestamp":1697751047000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Divergence Reduction in Monte Carlo Neutron Transport with On-GPU Asynchronous Scheduling"],"prefix":"10.1145","volume":"34","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6493-0990","authenticated-orcid":false,"given":"Braxton","family":"Cuneo","sequence":"first","affiliation":[{"name":"Oregon State University, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2082-7262","authenticated-orcid":false,"given":"Mike","family":"Bailey","sequence":"additional","affiliation":[{"name":"Oregon State University, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,1,14]]},"reference":[{"issue":"8","key":"e_1_3_1_2_2","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1145\/872734.806932","article-title":"The incremental garbage collection of processes","volume":"12","author":"Jr. Henry Baker","year":"1977","unstructured":"Henry Baker Jr. and Carl Hewitt. 1977. The incremental garbage collection of processes. ACM SIGPLAN Notices 12, 8 (1977), 55\u201359.","journal-title":"ACM SIGPLAN Notices"},{"key":"e_1_3_1_3_2","doi-asserted-by":"crossref","first-page":"176","DOI":"10.1016\/j.anucene.2014.10.039","article-title":"Algorithmic choices in WARP\u2014A framework for continuous energy Monte Carlo neutron transport in general 3D geometries on GPUs","volume":"77","author":"Bergmann Ryan M.","year":"2015","unstructured":"Ryan M. Bergmann and Jasmina L. Vuji\u0107. 2015. Algorithmic choices in WARP\u2014A framework for continuous energy Monte Carlo neutron transport in general 3D geometries on GPUs. Annals of Nuclear Energy 77, C (2015), 176\u2013193.","journal-title":"Annals of Nuclear Energy"},{"key":"e_1_3_1_4_2","first-page":"941","article-title":"Investigation of portable event-based Monte Carlo transport using the NVIDIA Thrust library","volume":"114","author":"Bleile Ryan C.","year":"2016","unstructured":"Ryan C. Bleile, Patrick S. Brantley, Shawn A. Dawson, Matthew J. O\u2019Brien, and Hank Childs. 2016. Investigation of portable event-based Monte Carlo transport using the NVIDIA Thrust library. Transactions of the American Nuclear Society 114 (2016), 941.","journal-title":"Transactions of the American Nuclear Society"},{"key":"e_1_3_1_5_2","unstructured":"Per Brinch Hansen. 1998. The Search for Simplicity: Essays in Parallel Programming . IEEE Los Alamitos CA. 
96000647"},{"issue":"3","key":"e_1_3_1_6_2","doi-asserted-by":"crossref","first-page":"269","DOI":"10.1016\/0149-1970(84)90024-6","article-title":"Monte Carlo methods for radiation transport analysis on vector computers","volume":"14","author":"Brown Forrest B.","year":"1984","unstructured":"Forrest B. Brown and William R. Martin. 1984. Monte Carlo methods for radiation transport analysis on vector computers. Progress in Nuclear Energy 14, 3 (1984), 269\u2013299.","journal-title":"Progress in Nuclear Energy"},{"key":"e_1_3_1_7_2","unstructured":"NVIDIA Corporation. 2021. CUDA C++ Programming Guide. Retrieved October 27 2023 from https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html"},{"key":"e_1_3_1_8_2","doi-asserted-by":"crossref","first-page":"320","DOI":"10.1109\/PACT.2011.63","volume-title":"Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques","author":"Coutinho B.","year":"2011","unstructured":"B. Coutinho, D. Sampaio, F. M. Q Pereira, and W. Meira. 2011. Divergence analysis and optimizations. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques. IEEE, Los Alamitos, CA, 320\u2013329."},{"key":"e_1_3_1_9_2","first-page":"83","volume-title":"Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium","author":"Cui Zheng","year":"2012","unstructured":"Zheng Cui, Yun Liang, K. Rupnow, and Deming Chen. 2012. An accurate GPU performance model for effective control flow divergence optimization. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium. IEEE, Los Alamitos, CA, 83\u201394."},{"key":"e_1_3_1_10_2","unstructured":"B. Cuneo and M. Bailey. 2021. A Method for Reducing Divergence in GPU Programs Using Asynchronous Work Scheduling . (Dec2021). Under Review."},{"key":"e_1_3_1_11_2","doi-asserted-by":"crossref","unstructured":"Harold C. 
Edwards and Daniel Alejandro Ibanez. 2017. Kokkos\u2019 Task DAG Capabilities . Technical Report. U.S. Department of Energy. 10.2172\/1398234","DOI":"10.2172\/1398234"},{"key":"e_1_3_1_12_2","unstructured":"Harold C. Edwards and Christian Robert Trott. 2015. Kokkos manycore device performance portability for C++ HPC applications. In Proceedings of the 2015 GPU Technology Conference . https:\/\/www.osti.gov\/biblio\/1245917"},{"key":"e_1_3_1_13_2","unstructured":"Marc Harper Bryan Weinstein Cory Simon chebee7i Wiley Morgan Vince Knight Nick Swanson-Hysell Matthew Evans jl-bernal ZGainsforth The Gitter Badger SaxoAnglo Maximiliano Greco and Guido Zuidhof. 2019. marcharper\/python-ternary: Version 1.0.6. Retrieved October 27 2023 from 10.5281\/zenodo.594435"},{"key":"e_1_3_1_14_2","volume-title":"Unwinding Stylized Recursions into Iterations","author":"Friedman Daniel P.","year":"1975","unstructured":"Daniel P. Friedman and David S. Wise. 1975. Unwinding Stylized Recursions into Iterations. Technical Report No. 19. Computer Science Department, Indiana University, Bloomington, IN."},{"issue":"4","key":"e_1_3_1_15_2","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1109\/TC.1978.1675100","article-title":"Aspects of applicative programming for parallel processing","volume":"27","author":"Friedman Daniel P.","year":"1978","unstructured":"Daniel P. Friedman and David S. Wise. 1978. Aspects of applicative programming for parallel processing. IEEE Transactions on Computers C-27, 4 (1978), 289\u2013296.","journal-title":"IEEE Transactions on Computers"},{"key":"e_1_3_1_16_2","doi-asserted-by":"crossref","first-page":"407","DOI":"10.1109\/MICRO.2007.30","volume-title":"Proceedings of the 40th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201907)","author":"Fung Wilson","year":"2007","unstructured":"Wilson Fung, Ivan Sham, George Yuan, and Tor Aamodt. 2007. Dynamic warp formation and scheduling for efficient GPU control flow. 
In Proceedings of the 40th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201907). IEEE, Los Alamitos, CA, 407\u2013420."},{"key":"e_1_3_1_17_2","unstructured":"Khronos OpenCL Working Group. 2021. The OpenCL Specification. Retrieved October 27 2023 from https:\/\/www.khronos.org\/registry\/OpenCL\/specs\/3.0-unified\/html\/OpenCL_API.html"},{"key":"e_1_3_1_18_2","doi-asserted-by":"crossref","first-page":"506","DOI":"10.1016\/j.anucene.2017.11.032","article-title":"Multigroup Monte Carlo on GPUs: Comparison of history- and event-based algorithms","volume":"113","author":"Hamilton Steven P.","year":"2018","unstructured":"Steven P. Hamilton, Stuart R. Slattery, and Thomas M. Evans. 2018. Multigroup Monte Carlo on GPUs: Comparison of history- and event-based algorithms. Annals of Nuclear Energy 113, C (2018), 506\u2013518.","journal-title":"Annals of Nuclear Energy"},{"key":"e_1_3_1_19_2","first-page":"12","volume-title":"Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU\u201913)","author":"Han Tianyi David","year":"2013","unstructured":"Tianyi David Han and Tarek S. Abdelrahman. 2013. Reducing divergence in GPGPU programs with loop merging. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU\u201913). ACM, New York, NY, 12\u201323. 10.1145\/2458523.2458525"},{"issue":"4","key":"e_1_3_1_20_2","doi-asserted-by":"crossref","first-page":"223","DOI":"10.1145\/356622.356624","article-title":"Concurrent programming concepts","volume":"5","author":"Hansen Per","year":"1973","unstructured":"Per Hansen. 1973. Concurrent programming concepts. 
ACM Computing Surveys 5, 4 (1973), 223\u2013245.","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1002\/spe.4380200104"},{"issue":"2","key":"e_1_3_1_22_2","doi-asserted-by":"crossref","first-page":"229","DOI":"10.1016\/j.pnucene.2010.09.011","article-title":"GPU-based Monte Carlo simulation in neutron transport and finite differences heat equation evaluation","volume":"53","author":"Heimlich A.","year":"2011","unstructured":"A. Heimlich, A. C. A Mol, and C. M. N. A Pereira. 2011. GPU-based Monte Carlo simulation in neutron transport and finite differences heat equation evaluation. Progress in Nuclear Energy 53, 2 (2011), 229\u2013239.","journal-title":"Progress in Nuclear Energy"},{"key":"e_1_3_1_23_2","doi-asserted-by":"crossref","first-page":"73","DOI":"10.1145\/3200921.3200931","volume-title":"Proceedings of the 2018 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (SIGSIM-PADS\u201918)","author":"Ianni Mauro","year":"2018","unstructured":"Mauro Ianni, Romolo Marotta, Davide Cingolani, Alessandro Pellegrini, and Francesco Quaglia. 2018. The ultimate share-everything PDES system. In Proceedings of the 2018 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (SIGSIM-PADS\u201918). ACM, New York, NY, 73\u201384. 10.1145\/3200921.3200931"},{"issue":"7","key":"e_1_3_1_24_2","doi-asserted-by":"crossref","first-page":"1165","DOI":"10.1109\/TCAD.2015.2501303","article-title":"An accurate GPU performance model for effective control flow divergence optimization","volume":"35","author":"Liang Yun","year":"2016","unstructured":"Yun Liang, Muhammad Teguh Satria, Kyle Rupnow, and Deming Chen. 2016. An accurate GPU performance model for effective control flow divergence optimization. 
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35, 7 (2016), 1165\u20131178.","journal-title":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems"},{"key":"e_1_3_1_25_2","doi-asserted-by":"crossref","first-page":"51","DOI":"10.1016\/j.jpdc.2019.06.009","article-title":"Efficient low-latency packet processing using On-GPU thread-data remapping","volume":"133","author":"Lin Huanxin","year":"2019","unstructured":"Huanxin Lin and Cho-Li Wang. 2019. Efficient low-latency packet processing using On-GPU thread-data remapping. Journal of Parallel and Distributed Computing 133 (2019), 51\u201362.","journal-title":"Journal of Parallel and Distributed Computing"},{"key":"e_1_3_1_26_2","doi-asserted-by":"crossref","first-page":"75","DOI":"10.1016\/j.jpdc.2020.02.003","article-title":"On-GPU thread-data remapping for nested branch divergence","volume":"139","author":"Lin Huanxin","year":"2020","unstructured":"Huanxin Lin and Cho-Li Wang. 2020. On-GPU thread-data remapping for nested branch divergence. Journal of Parallel and Distributed Computing 139 (2020), 75\u201386.","journal-title":"Journal of Parallel and Distributed Computing"},{"issue":"3","key":"e_1_3_1_27_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3242089","article-title":"On-GPU thread-data remapping for branch divergence reduction","volume":"15","author":"Lin Huanxin","year":"2018","unstructured":"Huanxin Lin, Cho-Li Wang, and Hongyuan Liu. 2018. On-GPU thread-data remapping for branch divergence reduction. ACM Transactions on Architecture and Code Optimization 15, 3 (2018), 1\u201324.","journal-title":"ACM Transactions on Architecture and Code Optimization"},{"key":"e_1_3_1_28_2","doi-asserted-by":"crossref","first-page":"260","DOI":"10.1145\/53990.54016","volume-title":"Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation (PLDI\u201988)","author":"Liskov B.","year":"1988","unstructured":"B. 
Liskov and L. Shrira. 1988. Promises: Linguistic support for efficient asynchronous procedure calls in distributed systems. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation (PLDI\u201988). ACM, New York, NY, 260\u2013267."},{"key":"e_1_3_1_29_2","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1145\/1815961.1815992","volume-title":"Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA\u201910)","author":"Meng Jiayuan","year":"2010","unstructured":"Jiayuan Meng, David Tarjan, and Kevin Skadron. 2010. Dynamic warp subdivision for integrated branch and memory divergence tolerance. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA\u201910). ACM, New York, NY, 235\u2013246."},{"key":"e_1_3_1_30_2","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1145\/3518997.3531026","volume-title":"Proceedings of the 2022 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (SIGSIM-PADS\u201922)","author":"Montesano Federica","year":"2022","unstructured":"Federica Montesano, Romolo Marotta, and Francesco Quaglia. 2022. Spatial\/temporal locality-based load-sharing in speculative discrete event simulation on multi-core machines. In Proceedings of the 2022 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (SIGSIM-PADS\u201922). ACM, New York, NY, 81\u201392. 10.1145\/3518997.3531026"},{"key":"e_1_3_1_31_2","first-page":"733","volume-title":"Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium","author":"Ozog David","year":"2015","unstructured":"David Ozog, Allen D. Malony, and Andrew R. Siegel. 2015. A performance analysis of SIMD algorithms for monte carlo simulations of nuclear reactor cores. In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium. 
IEEE, Los Alamitos, CA, 733."},{"key":"e_1_3_1_32_2","first-page":"61","volume-title":"Proceedings of the 2012 39th Annual International Symposium on Computer Architecture (ISCA\u201912)","author":"Rhu Minsoo","year":"2012","unstructured":"Minsoo Rhu and Mattan Erez. 2012. CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures. In Proceedings of the 2012 39th Annual International Symposium on Computer Architecture (ISCA\u201912). IEEE, Los Alamitos, CA, 61\u201371."},{"key":"e_1_3_1_33_2","article-title":"Monte Carlo radiation penetration calculations on a parallel computer","author":"Troubetzkoy E.","year":"1973","unstructured":"E. Troubetzkoy, H. Steinberg, and M. Kalos. 1973. Monte Carlo radiation penetration calculations on a parallel computer. Transactions of the American Nuclear Society 17 (1973), 260\u2013261. https:\/\/www.osti.gov\/biblio\/4395508","journal-title":"Transactions of the American Nuclear Society"},{"key":"e_1_3_1_34_2","doi-asserted-by":"crossref","first-page":"368","DOI":"10.1145\/2485922.2485954","volume-title":"Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA\u201913)","author":"Vaidya Aniruddha","year":"2013","unstructured":"Aniruddha Vaidya, Anahita Shayesteh, Dong Woo, Roy Saharoy, and Mani Azimi. 2013. SIMD divergence optimization through intra-warp compaction. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA\u201913). ACM, New York, NY, 368\u2013379."},{"key":"e_1_3_1_35_2","doi-asserted-by":"crossref","first-page":"2","DOI":"10.1016\/j.anucene.2014.08.062","article-title":"ARCHER, a new Monte Carlo software tool for emerging heterogeneous computing environments","volume":"82","author":"Xu X. George","year":"2015","unstructured":"X. George Xu, Tianyu Liu, Lin Su, Xining Du, Matthew Riblett, Wei Ji, Deyang Gu, Christopher D. Carothers, Mark S. Shephard, Forrest B. Brown, Mannudeep K. 
Kalra, and Bob Liu. 2015. ARCHER, a new Monte Carlo software tool for emerging heterogeneous computing environments. Annals of Nuclear Energy 82 (2015), 2\u20139.","journal-title":"Annals of Nuclear Energy"},{"key":"e_1_3_1_36_2","doi-asserted-by":"crossref","first-page":"120","DOI":"10.1145\/3293320.3293331","volume-title":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia\u201919)","author":"Yang Yaohua","year":"2019","unstructured":"Yaohua Yang, Shiqing Zhang, and Li Shen. 2019. A lightweight method for handling control divergence in GPGPUs. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia\u201919). ACM, New York, NY, 120\u2013127. 10.1145\/3293320.3293331"},{"key":"e_1_3_1_37_2","doi-asserted-by":"crossref","first-page":"115","DOI":"10.1145\/1810085.1810104","volume-title":"Proceedings of the 24th ACM International Conference on Supercomputing (ICS\u201910)","author":"Zhang Eddy","year":"2010","unstructured":"Eddy Zhang, Yunlian Jiang, Ziyu Guo, and Xipeng Shen. 2010. Streamlining GPU applications on the fly: Thread divergence elimination through runtime thread-data remapping. In Proceedings of the 24th ACM International Conference on Supercomputing (ICS\u201910). ACM, New York, NY, 115\u2013126."},{"key":"e_1_3_1_38_2","first-page":"369","volume-title":"Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201911)","author":"Zhang Eddy","year":"2011","unstructured":"Eddy Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen. 2011. On-the-fly elimination of dynamic irregularities for GPU computing. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201911). 
ACM, New York, NY, 369\u2013380."}],"container-title":["ACM Transactions on Modeling and Computer Simulation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3626957","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3626957","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T23:44:16Z","timestamp":1750290256000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3626957"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,14]]},"references-count":37,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,1,31]]}},"alternative-id":["10.1145\/3626957"],"URL":"https:\/\/doi.org\/10.1145\/3626957","relation":{},"ISSN":["1049-3301","1558-1195"],"issn-type":[{"value":"1049-3301","type":"print"},{"value":"1558-1195","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,14]]},"assertion":[{"value":"2022-04-15","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-09-18","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-14","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}