{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2022,6,1]],"date-time":"2022-06-01T18:40:05Z","timestamp":1654108805228},"reference-count":24,"publisher":"IGI Global","issue":"2","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2012,4,1]]},"abstract":"<p>Sorting is a classic algorithmic problem and its importance has led to the design and implementation of various sorting algorithms on many-core graphics processing units (GPUs). CUDPP Radix sort is the most efficient sorting on GPUs and GPU Sample sort is the best comparison-based sorting. Although the implementations of these algorithms are efficient, they either need an extra space for the data rearrangement or the atomic operation for the acceleration. Sorting applications usually deal with a large amount of data, thus the memory utilization is an important consideration. Furthermore, these sorting algorithms on GPUs without the atomic operation support can result in the performance degradation or fail to work. In this paper, an efficient implementation of a parallel shellsort algorithm, CUDA shellsort, is proposed for many-core GPUs with CUDA. Experimental results show that, on average, the performance of CUDA shellsort is nearly twice faster than GPU quicksort and 37% faster than Thrust mergesort under uniform distribution. Moreover, its performance is the same as GPU sample sort up to 32 million data elements, but only needs a constant space usage. CUDA shellsort is also robust over various data distributions and could be suitable for other many-core architectures.<\/p>","DOI":"10.4018\/jghpc.2012040101","type":"journal-article","created":{"date-parts":[[2012,4,12]],"date-time":"2012-04-12T20:48:41Z","timestamp":1334263721000},"page":"1-16","source":"Crossref","is-referenced-by-count":1,"title":["Parallel Shellsort Algorithm for Many-Core GPUs with CUDA"],"prefix":"10.4018","volume":"4","author":[{"given":"Chun-Yuan","family":"Lin","sequence":"first","affiliation":[{"name":"Chang Gung University, Taiwan"}]},{"given":"Wei Sheng","family":"Lee","sequence":"additional","affiliation":[{"name":"National Tsing Hua University, Taiwan"}]},{"given":"Chuan Yi","family":"Tang","sequence":"additional","affiliation":[{"name":"Providence University, Taiwan"}]}],"member":"2432","reference":[{"key":"jghpc.2012040101-0","author":"S. G.Akl","year":"1985","journal-title":"Parallel sorting algorithms"},{"key":"jghpc.2012040101-1","doi-asserted-by":"publisher","DOI":"10.1007\/s002240000083"},{"key":"jghpc.2012040101-2","author":"C.Breshears","year":"2009","journal-title":"The art of concurrency: A thread monkey's guide to writing parallel applications"},{"key":"jghpc.2012040101-3","doi-asserted-by":"crossref","unstructured":"Cederman, D., & Tsigas, P. (2008). A practical quicksort algorithm for graphics processors. In Proceedings of the 16th Annual European Symposium on Algorithms (pp. 246-258).","DOI":"10.1007\/978-3-540-87744-8_21"},{"key":"jghpc.2012040101-4","doi-asserted-by":"crossref","unstructured":"Chen, S., Qin, J., Xie, Y., Zhao, J., & Heng, P.-A. (2009). A fast and flexible sorting algorithm with CUDA. In Proceedings of the 9th International Conference on Algorithms and Architectures for Parallel Processing (pp. 281-290).","DOI":"10.1007\/978-3-642-03095-6_28"},{"key":"jghpc.2012040101-5","doi-asserted-by":"crossref","unstructured":"Ciura, M. (2001). Best increments for the average case of shellsort. In R. Freivalds (Eds.), Proceedings of the 13th International Symposium on Fundamentals of Computation Theory (LNCS 2138, pp. 106-117).","DOI":"10.1007\/3-540-44669-9_12"},{"key":"jghpc.2012040101-6","author":"T. H.Cormen","year":"2001","journal-title":"Introduction to algorithms"},{"key":"jghpc.2012040101-7","unstructured":"Dehne, F., & Zaboli, H. (2010). Deterministic sample sort for GPUs. Retrieved from http:\/\/arxiv.org\/abs\/1002.4464"},{"key":"jghpc.2012040101-8","doi-asserted-by":"crossref","unstructured":"Govindaraju, N., Gray, J., Kumar, R., & Manocha, D. (2006) GPUTeraSort: high performance graphics co-processor sorting for large database management. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 325-336).","DOI":"10.1145\/1142473.1142511"},{"key":"jghpc.2012040101-9","doi-asserted-by":"publisher","DOI":"10.1006\/jpdc.1998.1462"},{"key":"jghpc.2012040101-10","doi-asserted-by":"crossref","unstructured":"Leischner, N., Osipov, V., & Sanders, P. (2010). GPU sample sort. In Proceedings of the IEEE International Symposium on Parallel & Distributed Processing (pp. 1-10).","DOI":"10.1109\/IPDPS.2010.5470444"},{"key":"jghpc.2012040101-11","doi-asserted-by":"publisher","DOI":"10.1145\/356593.356594"},{"key":"jghpc.2012040101-12","doi-asserted-by":"publisher","DOI":"10.1145\/272991.272995"},{"key":"jghpc.2012040101-13","doi-asserted-by":"publisher","DOI":"10.1145\/1365490.1365500"},{"key":"jghpc.2012040101-14","year":"2009","journal-title":"CUDA programming guide, version 2.3"},{"key":"jghpc.2012040101-15","doi-asserted-by":"crossref","unstructured":"Peters, H., Schulz-Hildebrandt, O., & Luttenberger, N. (2009). Fast in-place sorting with CUDA based on bitonic sort. In R. Wyrzykowski, J. Dongarra, K. Karczewski, & J. Wasniewski (Eds.), Proceedings of the 8th International Conference on Parallel Processing and Applied Mathematics (LNCS 6067, 403-410).","DOI":"10.1007\/978-3-642-14390-8_42"},{"key":"jghpc.2012040101-16","doi-asserted-by":"crossref","unstructured":"Satish, N., Harris, M., & Garland, M. (2009). Designing efficient sorting algorithms for manycore GPUs. In Proceedings of the IEEE International Symposium on Parallel & Distributed Processing (pp. 1-10).","DOI":"10.1109\/IPDPS.2009.5161005"},{"key":"jghpc.2012040101-17","doi-asserted-by":"crossref","unstructured":"Sedgewick, R. (1996). Analysis of shellsort and related algorithms. In J. Diaz & M. Serna (Eds.), Proceedings of the 4th Annual European Symposium on Algorithms (LNCS 1136, pp. 1-11).","DOI":"10.1007\/3-540-61680-2_42"},{"key":"jghpc.2012040101-18","author":"S.Sengupta","year":"2008","journal-title":"Efficient parallel scan algorithms for GPUs (Tech. Rep.)"},{"key":"jghpc.2012040101-19","unstructured":"Sengupta, S., Harris, M., Zhang, Y., & Owens, J. D. (2007). Scan primitives for GPU computing. In Proceedings of the 22nd ACM SIGGRAPH\/EUROGRAPHICS Symposium on Graphics Hardware (pp. 97-106)."},{"key":"jghpc.2012040101-20","doi-asserted-by":"publisher","DOI":"10.1145\/368370.368387"},{"key":"jghpc.2012040101-21","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2008.05.012"},{"key":"jghpc.2012040101-22","doi-asserted-by":"crossref","unstructured":"Volkov, V., & Demmel, J. W. (2008). Benchmarking GPUs to tune dense linear algebra. In Proceedings of the ACM\/IEEE Conference on Supercomputing (pp. 1-11).","DOI":"10.1109\/SC.2008.5214359"},{"key":"jghpc.2012040101-23","unstructured":"Ye, X., Fan, D., Lin, W., Yuan, N., & Ienne, P. (2010). High performance comparison-based sorting algorithm on many-core GPUs. In Proceedings of the IEEE International Symposium on Parallel & Distributed Processing (pp. 1-10)."}],"container-title":["International Journal of Grid and High Performance Computing"],"original-title":[],"language":"ng","link":[{"URL":"https:\/\/www.igi-global.com\/viewtitle.aspx?TitleId=66353","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,6,1]],"date-time":"2022-06-01T17:59:48Z","timestamp":1654106388000},"score":1,"resource":{"primary":{"URL":"https:\/\/services.igi-global.com\/resolvedoi\/resolve.aspx?doi=10.4018\/jghpc.2012040101"}},"subtitle":[""],"short-title":[],"issued":{"date-parts":[[2012,4,1]]},"references-count":24,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2012,4]]}},"URL":"https:\/\/doi.org\/10.4018\/jghpc.2012040101","relation":{},"ISSN":["1938-0259","1938-0267"],"issn-type":[{"value":"1938-0259","type":"print"},{"value":"1938-0267","type":"electronic"}],"subject":[],"published":{"date-parts":[[2012,4,1]]}}}