{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,5]],"date-time":"2026-05-05T07:21:04Z","timestamp":1777965664986,"version":"3.51.4"},"reference-count":29,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2017,3,31]],"date-time":"2017-03-31T00:00:00Z","timestamp":1490918400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"DFG","award":["ME 2088\/3-1"],"award-info":[{"award-number":["ME 2088\/3-1"]}]},{"DOI":"10.13039\/501100007277","name":"MADALGO","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100007277","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Sandia LDRD","award":["#13-0144"],"award-info":[{"award-number":["#13-0144"]}]},{"name":"UC Lab Fees Research Program","award":["12-LR-238449"],"award-info":[{"award-number":["12-LR-238449"]}]},{"name":"NVIDIA Graduate Fellowship"},{"DOI":"10.13039\/100000001","name":"NSF","doi-asserted-by":"publisher","award":["CCF-1017399 and OCI-1032859"],"award-info":[{"award-number":["CCF-1017399 and OCI-1032859"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2017,3,31]]},"abstract":"<jats:p>\n            Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous\n            <jats:italic>buckets<\/jats:italic>\n            or\n            <jats:italic>bins<\/jats:italic>\n            , where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on Graphics Processing Units (GPUs), programmers often choose to implement multisplit with a sort. One way is to first generate an auxiliary array of bucket IDs and then sort input data based on it. In case smaller indexed buckets possess smaller valued keys, another way for multisplit is to directly sort input data. Both methods are inefficient and require more work than necessary: The former requires more expensive data movements while the latter spends unnecessary effort in sorting elements within each bucket. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small (up to 256) number of buckets. We use warp-synchronous programming models and emphasize warpwide communications to avoid branch divergence and reduce memory usage. We also hierarchically reorder input elements to achieve better coalescing of global memory accesses. On a GeForce GTX 1080 GPU, we can reach a peak throughput of 18.93Gkeys\/s (or 11.68Gpairs\/s) for a key-only (or key-value) multisplit. Finally, we demonstrate how multisplit can be used as a building block for radix sort. In our multisplit-based sort implementation, we achieve comparable performance to the fastest GPU sort routines, sorting 32-bit keys (and key-value pairs) with a throughput of 3.0Gkeys\/s (and 2.1Gpair\/s).\n          <\/jats:p>","DOI":"10.1145\/3108139","type":"journal-article","created":{"date-parts":[[2017,8,24]],"date-time":"2017-08-24T11:49:04Z","timestamp":1503575344000},"page":"1-44","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":17,"title":["GPU Multisplit"],"prefix":"10.1145","volume":"4","author":[{"given":"Saman","family":"Ashkiani","sequence":"first","affiliation":[{"name":"University of California, Davis"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Andrew","family":"Davidson","sequence":"additional","affiliation":[{"name":"University of California, Davis"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ulrich","family":"Meyer","sequence":"additional","affiliation":[{"name":"Goethe-Universit\u00e4t Frankfurt am Main"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"John D.","family":"Owens","sequence":"additional","affiliation":[{"name":"University of California, Davis"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2017,8,23]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/1618452.1618500"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.69"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2851141.2851169"},{"key":"e_1_2_1_4_1","volume-title":"Gutin","author":"Bang-Jensen J\u00f8rgen","year":"2009","unstructured":"J\u00f8rgen Bang-Jensen and Gregory Z . Gutin . 2009 . Digraphs : Theory, Algorithms and Applications. Springer-Verlag London , 97--99. J\u00f8rgen Bang-Jensen and Gregory Z. Gutin. 2009. Digraphs: Theory, Algorithms and Applications. Springer-Verlag London, 97--99."},{"key":"e_1_2_1_5_1","volume-title":"Thrust: A productivity-oriented library for CUDA","author":"Bell Nathan","year":"2011","unstructured":"Nathan Bell and Jared Hoberock . 2011 . Thrust: A productivity-oriented library for CUDA . In GPU Computing Gems, Wen-mei W. Hwu (Ed.). Vol. 2 . Morgan Kaufmann , 359--371. Nathan Bell and Jared Hoberock. 2011. Thrust: A productivity-oriented library for CUDA. In GPU Computing Gems, Wen-mei W. Hwu (Ed.). Vol. 2. Morgan Kaufmann, 359--371."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/InPar.2012.6339589"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611972740.43"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2014.45"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2049662.2049663"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2442516.2442536"},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the 20th International Conference on High Performance Computing (HiPC","author":"Deshpande Aditya","year":"2013","unstructured":"Aditya Deshpande and P. J. Narayanan . 2013. Can GPUs sort strings efficiently? In Proceedings of the 20th International Conference on High Performance Computing (HiPC 2013 ). 305--313. Aditya Deshpande and P. J. Narayanan. 2013. Can GPUs sort strings efficiently? In Proceedings of the 20th International Conference on High Performance Computing (HiPC 2013). 305--313."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01386390"},{"key":"e_1_2_1_14_1","volume-title":"Proceedings of Innovative Parallel Computing (InPar\u201912)","author":"Gupta Kshitij","unstructured":"Kshitij Gupta , Jeff Stuart , and John D. Owens . 2012. A Study of Persistent Threads Style GPU Programming for GPGPU Workloads . In Proceedings of Innovative Parallel Computing (InPar\u201912) . Kshitij Gupta, Jeff Stuart, and John D. Owens. 2012. A Study of Persistent Threads Style GPU Programming for GPGPU Workloads. In Proceedings of Innovative Parallel Computing (InPar\u201912)."},{"key":"e_1_2_1_16_1","volume-title":"Owens","author":"Harris Mark","year":"2007","unstructured":"Mark Harris , Shubhabrata Sengupta , and John D . Owens . 2007 . Parallel prefix sum (scan) with CUDA. In GPU Gems 3, Herbert Nguyen (Ed.). Addison Wesley , 851--876. Mark Harris, Shubhabrata Sengupta, and John D. Owens. 2007. Parallel prefix sum (scan) with CUDA. In GPU Gems 3, Herbert Nguyen (Ed.). Addison Wesley, 851--876."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376670"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2010.88"},{"key":"e_1_2_1_19_1","volume-title":"10th DIMACS Implementation Challenge.","author":"Kobitzsh Moritz","unstructured":"Moritz Kobitzsh . 2010. 10th DIMACS Implementation Challenge. Retrieved from http:\/\/www.cc.gatech.edu\/dimacs10\/index.shtml. Moritz Kobitzsh. 2010. 10th DIMACS Implementation Challenge. Retrieved from http:\/\/www.cc.gatech.edu\/dimacs10\/index.shtml."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2008.31"},{"key":"e_1_2_1_21_1","unstructured":"Duane Merrill. 2015. CUDA UnBound (CUB) Library. Retrieved from https:\/\/nvlabs.github.io\/cub\/.  Duane Merrill. 2015. CUDA UnBound (CUB) Library. Retrieved from https:\/\/nvlabs.github.io\/cub\/."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0196-6774(03)00076-2"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2018323.2018338"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1365490.1365500"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/1964179.1964181"},{"key":"e_1_2_1_27_1","volume-title":"NVIDIA CUDA C Programming Guide. (Sept","author":"NVIDIA Corporation","year":"2016","unstructured":"NVIDIA Corporation . 2016. NVIDIA CUDA C Programming Guide. (Sept . 2016 ). PG- 02829-001_v8.0. NVIDIA Corporation. 2016. NVIDIA CUDA C Programming Guide. (Sept. 2016). PG-02829-001_v8.0."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2018323.2018339"},{"key":"e_1_2_1_29_1","volume-title":"Scalable Primitives for Data Mapping and Movement on the GPU. Master\u2019s thesis","author":"Patidar Suryakant","unstructured":"Suryakant Patidar . 2009. Scalable Primitives for Data Mapping and Movement on the GPU. Master\u2019s thesis . International Institute of Information Technology , Hyderabad, India . Suryakant Patidar. 2009. Scalable Primitives for Data Mapping and Movement on the GPU. Master\u2019s thesis. International Institute of Information Technology, Hyderabad, India."},{"key":"e_1_2_1_30_1","volume-title":"Proceedings of the International Conference on Signal Processing and Communications Systems (ICSPCS\u201907)","author":"Shams R.","unstructured":"R. Shams and R. A. Kennedy . 2007. Efficient histogram algorithms for NVIDIA CUDA compatible devices . In Proceedings of the International Conference on Signal Processing and Communications Systems (ICSPCS\u201907) . Gold Coast, Australia, 418--422. R. Shams and R. A. Kennedy. 2007. Efficient histogram algorithms for NVIDIA CUDA compatible devices. In Proceedings of the International Conference on Signal Processing and Communications Systems (ICSPCS\u201907). Gold Coast, Australia, 418--422."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2851141.2851145"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2018323.2018335"}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3108139","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3108139","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3108139","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T02:13:43Z","timestamp":1750212823000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3108139"}},"subtitle":["An Extended Study of a Parallel Algorithm"],"short-title":[],"issued":{"date-parts":[[2017,3,31]]},"references-count":29,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2017,3,31]]}},"alternative-id":["10.1145\/3108139"],"URL":"https:\/\/doi.org\/10.1145\/3108139","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"value":"2329-4949","type":"print"},{"value":"2329-4957","type":"electronic"}],"subject":[],"published":{"date-parts":[[2017,3,31]]},"assertion":[{"value":"2016-12-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-08-23","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}