{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,10]],"date-time":"2026-01-10T07:35:45Z","timestamp":1768030545835,"version":"3.49.0"},"reference-count":63,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2022,1,31]],"date-time":"2022-01-31T00:00:00Z","timestamp":1643587200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2022,3,31]]},"abstract":"<jats:p>Linear-time algorithms that are traditionally used to shuffle data on CPUs, such as the method of Fisher-Yates, are not well suited to implementation on GPUs due to inherent sequential dependencies, and existing parallel shuffling algorithms are unsuitable for GPU architectures because they incur a large number of read\/write operations to high latency global memory. To address this, we provide a method of generating pseudo-random permutations in parallel by fusing suitable pseudo-random bijective functions with stream compaction operations. Our algorithm, termed \u201cbijective shuffle\u201d trades increased per-thread arithmetic operations for reduced global memory transactions. It is work-efficient, deterministic, and only requires a single global memory read and write per shuffle input, thus maximising use of global memory bandwidth. To empirically demonstrate the correctness of the algorithm, we develop a statistical test for the quality of pseudo-random permutations based on kernel space embeddings. Experimental results show that the bijective shuffle algorithm outperforms competing algorithms on GPUs, showing improvements of between one and two orders of magnitude and approaching peak device bandwidth.<\/jats:p>","DOI":"10.1145\/3505287","type":"journal-article","created":{"date-parts":[[2022,1,31]],"date-time":"2022-01-31T16:35:52Z","timestamp":1643646952000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Bandwidth-Optimal Random Shuffling for GPUs"],"prefix":"10.1145","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2892-1082","authenticated-orcid":false,"given":"Rory","family":"Mitchell","sequence":"first","affiliation":[{"name":"Nvidia and Waikato University, Hamilton, New Zealand"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6995-3307","authenticated-orcid":false,"given":"Daniel","family":"Stokes","sequence":"additional","affiliation":[{"name":"Waikato University and Nyriad Ltd., Hamilton, New Zealand"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6152-7111","authenticated-orcid":false,"given":"Eibe","family":"Frank","sequence":"additional","affiliation":[{"name":"Waikato University, Hamilton, New Zealand"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0433-8925","authenticated-orcid":false,"given":"Geoffrey","family":"Holmes","sequence":"additional","affiliation":[{"name":"Waikato University, Hamilton, New Zealand"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,1,31]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1016\/0304-3975(95)00198-0"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3497623.3497674"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-19934-9_23"},{"key":"e_1_3_2_5_2","article-title":"MergeShuffle: A very fast, parallel random permutation algorithm","volume":"1508","author":"Bacher Axel","year":"2015","unstructured":"Axel Bacher, Olivier Bodini, Alexandros Hollender, and J\u00e9r\u00e9mie O. Lumbroso. 2015. MergeShuffle: A very fast, parallel random permutation algorithm. CoRR abs\/1508.03167. arxiv:1508.03167. http:\/\/arxiv.org\/abs\/1508.03167.","journal-title":"CoRR"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3009909"},{"key":"e_1_3_2_7_2","doi-asserted-by":"crossref","first-page":"359","DOI":"10.1016\/B978-0-12-385963-1.00026-5","volume-title":"GPU Cmputing Gems Jade edition","author":"Bell Nathan","year":"2012","unstructured":"Nathan Bell and Jared Hoberock. 2012. Thrust: A productivity-oriented library for CUDA. In GPU Cmputing Gems Jade edition. Elsevier, 359\u2013371."},{"key":"e_1_3_2_8_2","first-page":"129","volume-title":"Data Encryption Standard (DES)","author":"Biryukov Alex","year":"2005","unstructured":"Alex Biryukov and Christophe De Canni\u00e8re. 2005. Data Encryption Standard (DES). Springer US, Boston, MA, 129\u2013135."},{"key":"e_1_3_2_9_2","volume-title":"Prefix Sums and Their Applications","author":"Blelloch Guy E.","year":"1990","unstructured":"Guy E. Blelloch. 1990. Prefix Sums and Their Applications. Technical Report CMU-CS-90-190. School of Computer Science, Carnegie Mellon University."},{"key":"e_1_3_2_10_2","first-page":"1126\u20131129, Vol.","volume-title":"2005 IEEE International Symposium on Circuits and Systems","author":"Fang Bo","year":"2005","unstructured":"Bo Fang, Guobin Shen, Shipeng Li, and Huifang Chen. 2005. Techniques for efficient DCT\/IDCT implementation on generic GPU. In 2005 IEEE International Symposium on Circuits and Systems. 1126\u20131129, Vol. 2."},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2005.10"},{"key":"e_1_3_2_12_2","unstructured":"Sharan Chetlur Cliff Woolley Philippe Vandermersch Jonathan Cohen John Tran Bryan Catanzaro and Evan Shelhamer. 2014. cuDNN: Efficient Primitives for Deep Learning. http:\/\/arxiv.org\/abs\/1410.0759."},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.5555\/3042817.3043086"},{"key":"e_1_3_2_14_2","first-page":"27","volume-title":"Proceedings of the ISCA 18th International Conference on Parallel and Distributed Computing Systems","author":"Cong Guojing","year":"2005","unstructured":"Guojing Cong and David A. Bader. 2005. An empirical analysis of parallel random permutation algorithms ON SMPs. In Proceedings of the ISCA 18th International Conference on Parallel and Distributed Computing Systems, Michael J. Oudshoorn and Sanguthevar Rajasekaran (Eds.). ISCA, 27\u201334."},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.5555\/647906.739658"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1007\/BFb0086177"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.5555\/2161709"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/1058129.1058148"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1038\/scientificamerican0573-15"},{"key":"e_1_3_2_20_2","volume-title":"Statistical Tables for Biological, Agricultural and Medical Research","author":"Fisher Ronald A.","year":"1943","unstructured":"Ronald A. Fisher and Frank Yates. 1943. Statistical Tables for Biological, Agricultural and Medical Research. Oliver and Boyd Ltd, London."},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.5555\/1005332.1005335"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.5555\/1196379"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.5555\/2394499.2394538"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304621"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.5555\/2503308.2188410"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/777412.777454"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.5555\/111713.111744"},{"key":"e_1_3_2_28_2","unstructured":"Jesse D. Hall Nathan A. Carr and John C. Hart. 2003. Cache and bandwidth aware matrix multiplication on the GPU. (2003)."},{"issue":"39","key":"e_1_3_2_29_2","first-page":"851","article-title":"Parallel prefix sum (scan) with CUDA","volume":"3","author":"Harris Mark","year":"2007","unstructured":"Mark Harris, Shubhabrata Sengupta, and John D. Owens. 2007. Parallel prefix sum (scan) with CUDA. GPU Gems 3, 39 (2007), 851\u2013876. Retrieved on 1 November, 2021 from https:\/\/nvlabs.github.io\/cub\/.","journal-title":"GPU Gems"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/1362622.1362684"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1963.10500830"},{"key":"e_1_3_2_32_2","volume-title":"An Introduction to Parallel Algorithms","author":"J\u00e9J\u00e9 Joseph","year":"1992","unstructured":"Joseph J\u00e9J\u00e9. 1992. An Introduction to Parallel Algorithms. Addison-Wesley, Reading, MA."},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.5555\/3045118.3045324"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1966.10480879"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.5555\/270146"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/2669372"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/1268776.1268777"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1137\/0217022"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1214\/18-EJS1437"},{"key":"e_1_3_2_40_2","first-page":"307","article-title":"Converting high probability into nearly-constant time\u2014With applications to parallel hashing","author":"Matias Yossi","year":"1991","unstructured":"Yossi Matias and Uzi Vishkin. 1991. Converting high probability into nearly-constant time\u2014With applications to parallel hashing. In Procedings of the 23rd Annual ACM Symposium on Theory of Computing (Jan. 1991), 307\u2013316.","journal-title":"Procedings of the 23rd Annual ACM Symposium on Theory of Computing"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/272991.272995"},{"key":"e_1_3_2_42_2","author":"Merrill Duane","year":"2015","unstructured":"Duane Merrill. 2015. UB. NVIDIA Research (2015). Retrieved on 1 November, 2021 from https:\/\/nvlabs.github.io\/cub\/.","journal-title":"NVIDIA Research"},{"key":"e_1_3_2_43_2","article-title":"Single-Pass Parallel Prefix Scan with Decoupled Look-Back","author":"Merrill Duane","year":"2016","unstructured":"Duane Merrill and Michael Garland. 2016. Single-Pass Parallel Prefix Scan with Decoupled Look-Back. NVIDIA, Technical Report NVR-2016-002.","journal-title":"NVIDIA"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/SFCS.1985.43"},{"key":"e_1_3_2_45_2","unstructured":"Rory Mitchell Joshua Cooper Eibe Frank and Geoffrey Holmes. 2021. Sampling permutations for Shapley value estimation. https:\/\/arxiv.org\/abs\/2104.12199."},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.7717\/peerj-cs.127"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.5555\/844174.844191"},{"key":"e_1_3_2_48_2","unstructured":"NVIDIA Corporation. 2020. CUDA C++ Programming Guide. Version 11.1. Retrieved 1 November 2021 https:\/\/docs.nvidia.com\/cuda\/archive\/11.1.0\/."},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/513\/2\/022027"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.5555\/1882792.1882841"},{"key":"e_1_3_2_51_2","doi-asserted-by":"crossref","unstructured":"Lukas Prediger Niki Loppi Samuel Kaski and Antti Honkela. 2021. d3p\u2014A Python package for differentially-private probabilistic programming. https:\/\/arxiv.org\/abs\/2103.11648.","DOI":"10.2478\/popets-2022-0052"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.5555\/6771"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1137\/0218041"},{"issue":"3","key":"e_1_3_2_54_2","first-page":"305","article-title":"Generation of random permutations of given number of elements using random sampling numbers","volume":"23","author":"Rao C. Radhakrishna","year":"1961","unstructured":"C. Radhakrishna Rao. 1961. Generation of random permutations of given number of elements using random sampling numbers. Sankhy\u0101 : The Indian J. Stat., Ser. A (1961-2002) 23, 3 (1961), 305\u2013307.","journal-title":"Sankhy\u0101 : The Indian J. Stat., Ser. A (1961-2002)"},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/SFCS.1985.9"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.5555\/1461409"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063405"},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1111\/j.2517-6161.1962.tb00474.x"},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0020-0190(98)00127-6"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807207"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.5555\/2722129.2722159"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.5555\/2543987"},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.1145\/2909437.2909454"},{"key":"e_1_3_2_64_2","volume-title":"RAPIDS: Collection of Libraries for End to End GPU Data Science","author":"Team RAPIDS Development","year":"2018","unstructured":"RAPIDS Development Team. 2018. RAPIDS: Collection of Libraries for End to End GPU Data Science. Retrieved 1 November, 2021 from https:\/\/rapids.ai."}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3505287","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3505287","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:24Z","timestamp":1750182564000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3505287"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,1,31]]},"references-count":63,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,3,31]]}},"alternative-id":["10.1145\/3505287"],"URL":"https:\/\/doi.org\/10.1145\/3505287","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"value":"2329-4949","type":"print"},{"value":"2329-4957","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,1,31]]},"assertion":[{"value":"2020-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-01-31","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}