{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,6,26]],"date-time":"2024-06-26T00:13:55Z","timestamp":1719360835691},"reference-count":30,"publisher":"Springer Science and Business Media LLC","issue":"11","license":[{"start":{"date-parts":[[2024,3,29]],"date-time":"2024-03-29T00:00:00Z","timestamp":1711670400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,3,29]],"date-time":"2024-03-29T00:00:00Z","timestamp":1711670400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Supercomput"],"published-print":{"date-parts":[[2024,7]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Top-K and selection operations are critical in data processing and analysis, and their efficient implementation on GPUs is increasingly important due to the growing demands of data analysis. Existing methods, primarily relying on the bucket partition execution model, encounter challenges such as uneven bucket distribution and latency in merging processes. To address these issues, we introduce a novel Split-Bucket Partition (SBP) execution model that specifically addresses these challenges. Additionally, we propose task and control flow optimizations targeted at top-K and selection algorithms, which further contribute to performance improvements. Our optimized algorithms significantly outperform existing approaches, delivering performance gains of up to <jats:inline-formula><jats:alternatives><jats:tex-math>$$2.3$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mrow>\n                    <mml:mn>2.3<\/mml:mn>\n                  <\/mml:mrow>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> times and <jats:inline-formula><jats:alternatives><jats:tex-math>$$1.6$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mrow>\n                    <mml:mn>1.6<\/mml:mn>\n                  <\/mml:mrow>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> times for different bucket partitioning rules. Our algorithms show robust performance improvements in non-uniform data scenarios, with gains ranging from <jats:inline-formula><jats:alternatives><jats:tex-math>$$1.9$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mrow>\n                    <mml:mn>1.9<\/mml:mn>\n                  <\/mml:mrow>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> times to <jats:inline-formula><jats:alternatives><jats:tex-math>$$15.5$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mrow>\n                    <mml:mn>15.5<\/mml:mn>\n                  <\/mml:mrow>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> times. However, it should be noted that the SBP model has limitations related to shared memory and register utilization, potentially impacting performance. Tests on TU102 and A100 GPU architectures validate the effectiveness of our approach, achieving a maximum speedup of <jats:inline-formula><jats:alternatives><jats:tex-math>$$2.9$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mrow>\n                    <mml:mn>2.9<\/mml:mn>\n                  <\/mml:mrow>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> times. The study suggests that while the SBP model is effective for top-K and selection algorithms, it also holds promise for other computational tasks, setting the stage for future research.<\/jats:p>","DOI":"10.1007\/s11227-024-06031-x","type":"journal-article","created":{"date-parts":[[2024,3,29]],"date-time":"2024-03-29T05:01:48Z","timestamp":1711688508000},"page":"15122-15160","update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Split-bucket partition (SBP): a novel execution model for top-K and selection algorithms on GPUs"],"prefix":"10.1007","volume":"80","author":[{"given":"Yiqing","family":"Yang","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Guoyin","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yanxia","family":"Wu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhixiang","family":"Zhao","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yan","family":"Fu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2024,3,29]]},"reference":[{"key":"6031_CR1","doi-asserted-by":"crossref","unstructured":"Sioulas P, Chrysogelos P, Karpathiotakis M, Appuswamy R, Ailamaki A (2019) Hardware-Conscious Hash-Joins on GPUs. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 698\u2013709. IEEE","DOI":"10.1109\/ICDE.2019.00068"},{"key":"6031_CR2","doi-asserted-by":"crossref","unstructured":"Zhao W, Tan S, Li P (2020) SONG: Approximate nearest neighbor search on GPU. In: 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1033\u20131044. IEEE","DOI":"10.1109\/ICDE48307.2020.00094"},{"key":"6031_CR3","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2019.102588","volume":"91","author":"T Ribizel","year":"2020","unstructured":"Ribizel T, Anzt H (2020) Parallel selection on GPUs. Parallel Comput 91:102588","journal-title":"Parallel Comput"},{"key":"6031_CR4","unstructured":"Gaihre A, Zheng D, Weitze S, Li L, Song SL, Ding C, Li XS, Liu H (2021) Dr. Top-k: Delegate-Centric Top-k on GPUs, 1\u201314"},{"key":"6031_CR5","unstructured":"Skrodzki M (2019) The k-d tree data structure and a proof for neighborhood computation in expected logarithmic time. arXiv preprint arXiv:1903.04936"},{"issue":"1","key":"6031_CR6","doi-asserted-by":"publisher","first-page":"237","DOI":"10.1109\/TPDS.2019.2929768","volume":"31","author":"B Lessley","year":"2019","unstructured":"Lessley B, Childs H (2019) Data-Parallel Hashing Techniques for GPU Architectures. IEEE Trans Parallel Distrib Syst 31(1):237\u2013250","journal-title":"IEEE Trans Parallel Distrib Syst"},{"key":"6031_CR7","unstructured":"Vaidya KE (2021) The case for a learned sorting algorithm. PhD thesis, Massachusetts Institute of Technology"},{"key":"6031_CR8","unstructured":"Gilbert MS, Madduri K, Boman EG, Rajamanickam, S (2023) Jet: Multilevel graph partitioning on GPUs. arXiv preprint arXiv:2304.13194"},{"key":"6031_CR9","doi-asserted-by":"crossref","unstructured":"Guo C, Chen H, Li C, Wu T (2018) A memory access reduced sort on multi-core GPU. In: 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC\/SmartCity\/DSS), pp. 586\u2013593. IEEE","DOI":"10.1109\/HPCC\/SmartCity\/DSS.2018.00108"},{"issue":"11","key":"6031_CR10","doi-asserted-by":"publisher","first-page":"11899","DOI":"10.1109\/TKDE.2022.3230744","volume":"35","author":"L Zeng","year":"2023","unstructured":"Zeng L, Zou L, \u00d6zsu MT (2023) SGSI \u2013 A scalable GPU-friendly subgraph isomorphism algorithm. IEEE Trans Knowl Data Eng 35(11):11899\u201311916. https:\/\/doi.org\/10.1109\/TKDE.2022.3230744","journal-title":"IEEE Trans Knowl Data Eng"},{"issue":"6","key":"6031_CR11","doi-asserted-by":"publisher","first-page":"884","DOI":"10.14778\/3380750.3380758","volume":"13","author":"H Funke","year":"2020","unstructured":"Funke H, Teubner J (2020) Data-parallel query processing on non-uniform data. Proc VLDB Endowment 13(6):884\u2013897","journal-title":"Proc VLDB Endowment"},{"key":"6031_CR12","doi-asserted-by":"crossref","unstructured":"He J, Lu M, He B (2013) Revisiting Co-processing for hash joins on the coupled CPU-GPU architecture","DOI":"10.14778\/2536206.2536216"},{"key":"6031_CR13","doi-asserted-by":"crossref","unstructured":"Zhang Y, Fang H, Li X (2019) Scalable Top-K query processing using graphics processing unit. In: Languages and Compilers for Parallel Computing: 30th International Workshop, LCPC 2017, College Station, TX, USA, October 11\u201313, 2017, Revised Selected Papers 30, pp. 240\u2013261. Springer","DOI":"10.1007\/978-3-030-35225-7_16"},{"key":"6031_CR14","doi-asserted-by":"crossref","unstructured":"Leischner N, Osipov V, Sanders P (2010) GPU sample sort, pp. 1\u201310. IEEE","DOI":"10.1109\/IPDPS.2010.5470444"},{"issue":"5","key":"6031_CR15","first-page":"1339","volume":"26","author":"Z Cui","year":"2019","unstructured":"Cui Z, Gao Y, Zhou C, Gao G, Mei Z, Wu Z (2019) An efficient top-K query scheme based on multilayer grouping. Tehni\u010dki vjesnik 26(5):1339\u20131345","journal-title":"Tehni\u010dki vjesnik"},{"key":"6031_CR16","first-page":"1","volume":"17","author":"T Alabi","year":"2012","unstructured":"Alabi T, Blanchard JD, Gordon B, Steinbach R (2012) Fast k-selection algorithms for graphics processing units. J Experim Algor (JEA) 17:1\u20134","journal-title":"J Experim Algor (JEA)"},{"issue":"1","key":"6031_CR17","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3505286","volume":"9","author":"M Axtmann","year":"2022","unstructured":"Axtmann M, Witt S, Ferizovic D, Sanders P (2022) Engineering In-place (shared-memory) sorting algorithms. ACM Trans Parallel Comput 9(1):1\u201362","journal-title":"ACM Trans Parallel Comput"},{"key":"6031_CR18","doi-asserted-by":"crossref","unstructured":"Zhou H, Troendle D, Jang B (2021) DACHash: A dynamic, cache-aware and concurrent hash table on GPUs. In: 2021 IEEE 33rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 1\u201310. IEEE","DOI":"10.1109\/SBAC-PAD53543.2021.00012"},{"key":"6031_CR19","doi-asserted-by":"crossref","unstructured":"Kalyan G, Junghare V, John SS, Chattopadhyay A, Mitra P, Hazra S (2019) Parsers, data structures and algorithms for macromolecular analysis toolkit (MAT): Design and implementation. bioRxiv, 605907","DOI":"10.1101\/605907"},{"key":"6031_CR20","doi-asserted-by":"crossref","unstructured":"Zhang J, Naruse A, Li X, Wang Y (2023) Parallel top-k algorithms on gpu: A comprehensive study and new methods. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1\u201313","DOI":"10.1145\/3581784.3607062"},{"key":"6031_CR21","doi-asserted-by":"crossref","unstructured":"Shanbhag A, Pirk H, Madden S (2018) Efficient top-K query processing on massively parallel hardware. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1557\u20131570","DOI":"10.1145\/3183713.3183735"},{"key":"6031_CR22","doi-asserted-by":"crossref","unstructured":"Shanbhag A, Madden S, Yu X (2020) A study of the fundamental performance characteristics of GPUs and CPUs for database analytics. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 1617\u20131632","DOI":"10.1145\/3318464.3380595"},{"issue":"1","key":"6031_CR23","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3108139","volume":"4","author":"S Ashkiani","year":"2017","unstructured":"Ashkiani S, Davidson A, Meyer U, Owens JD (2017) GPU Multisplit: An extended study of a parallel algorithm. ACM Trans Parallel Comput (TOPC) 4(1):1\u201344","journal-title":"ACM Trans Parallel Comput (TOPC)"},{"key":"6031_CR24","doi-asserted-by":"crossref","unstructured":"Lai C-C, Fan C-C, Liu C-M (2022) An effective pruning scheme for top-k dominating query processing on uncertain data streams. In: 2022 IEEE VTS Asia Pacific Wireless Communications Symposium (APWCS), pp. 104\u2013108. IEEE","DOI":"10.1109\/APWCS55727.2022.9906502"},{"key":"6031_CR25","unstructured":"Arrayfire Official Website (n.d.). https:\/\/arrayfire.com\/ Accessed 2023-05-09"},{"key":"6031_CR26","doi-asserted-by":"crossref","unstructured":"Takahashi K, Watanakeesuntorn W, Ichikawa K, Park J, Takano R, Haga J, Sugihara G, Pao GM (2021) kEDM: A performance-portable implementation of empirical dynamic modeling using Kokkos, pp. 1\u20138","DOI":"10.1145\/3437359.3465571"},{"issue":"12","key":"6031_CR27","doi-asserted-by":"publisher","first-page":"114","DOI":"10.14778\/3364324.3364327","volume":"13","author":"V Zois","year":"2019","unstructured":"Zois V, Tsotras VJ, Najjar WA (2019) Efficient main-memory top-k selection for multicore architectures. Proc VLDB Endowment 13(12):114\u2013127","journal-title":"Proc VLDB Endowment"},{"key":"6031_CR28","doi-asserted-by":"crossref","unstructured":"Monroe L, Wendelberger J, Michalak S (2011) Randomized selection on the GPU. In: Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, pp. 89\u201398","DOI":"10.1145\/2018323.2018338"},{"key":"6031_CR29","doi-asserted-by":"crossref","unstructured":"Ribizel T, Anzt H (2019) Approximate and exact selection on GPUs. In: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 471\u2013478. IEEE","DOI":"10.1109\/IPDPSW.2019.00088"},{"issue":"1","key":"6031_CR30","doi-asserted-by":"publisher","first-page":"117","DOI":"10.1109\/TPAMI.2010.57","volume":"33","author":"H Jegou","year":"2010","unstructured":"Jegou H, Douze M, Schmid C (2010) Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 33(1):117\u2013128","journal-title":"IEEE Trans Pattern Anal Mach Intell"}],"container-title":["The Journal of Supercomputing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11227-024-06031-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11227-024-06031-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11227-024-06031-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,6,25]],"date-time":"2024-06-25T11:24:30Z","timestamp":1719314670000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11227-024-06031-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,29]]},"references-count":30,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2024,7]]}},"alternative-id":["6031"],"URL":"https:\/\/doi.org\/10.1007\/s11227-024-06031-x","relation":{},"ISSN":["0920-8542","1573-0484"],"issn-type":[{"value":"0920-8542","type":"print"},{"value":"1573-0484","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,29]]},"assertion":[{"value":"27 February 2024","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 March 2024","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}