{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,9]],"date-time":"2026-04-09T14:25:59Z","timestamp":1775744759855,"version":"3.50.1"},"reference-count":24,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2008,8]]},"abstract":"<jats:p>Sorting a list of input numbers is one of the most fundamental problems in the field of computer science in general and high-throughput database applications in particular. Although literature abounds with various flavors of sorting algorithms, different architectures call for customized implementations to achieve faster sorting times.<\/jats:p>\n          <jats:p>\n            This paper presents an efficient implementation and detailed analysis of\n            <jats:bold>MergeSort<\/jats:bold>\n            on current CPU architectures. Our SIMD implementation with 128-bit SSE is 3.3X faster than the scalar version. In addition, our algorithm performs an efficient multiway merge, and is not constrained by the memory bandwidth. Our multi-threaded, SIMD implementation sorts 64 million floating point numbers in\n            <jats:italic>less than<\/jats:italic>\n            0.5 seconds on a commodity 4-core Intel processor. This measured performance compares favorably with all previously published results.\n          <\/jats:p>\n          <jats:p>Additionally, the paper demonstrates performance scalability of the proposed sorting algorithm with respect to certain salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and core-count. Based on our analytical models of various architectural configurations, we see excellent scalability of our implementation with SIMD width scaling up to 16X wider than current SSE width of 128-bits, and CMP core-count scaling well beyond 32 cores. Cycle-accurate simulation of Intel's upcoming x86 many-core Larrabee architecture confirms scalability of our proposed algorithm.<\/jats:p>","DOI":"10.14778\/1454159.1454171","type":"journal-article","created":{"date-parts":[[2014,6,24]],"date-time":"2014-06-24T12:17:57Z","timestamp":1403612277000},"page":"1313-1324","source":"Crossref","is-referenced-by-count":173,"title":["Efficient implementation of sorting on multi-core SIMD CPU architecture"],"prefix":"10.14778","volume":"1","author":[{"given":"Jatin","family":"Chhugani","sequence":"first","affiliation":[{"name":"Intel Corporation"}]},{"given":"Anthony D.","family":"Nguyen","sequence":"additional","affiliation":[{"name":"Intel Corporation"}]},{"given":"Victor W.","family":"Lee","sequence":"additional","affiliation":[{"name":"Intel Corporation"}]},{"given":"William","family":"Macy","sequence":"additional","affiliation":[{"name":"Intel Corporation"}]},{"given":"Mostafa","family":"Hagog","sequence":"additional","affiliation":[{"name":"Intel Corporation"}]},{"given":"Yen-Kuang","family":"Chen","sequence":"additional","affiliation":[{"name":"Intel Corporation"}]},{"given":"Akram","family":"Baransi","sequence":"additional","affiliation":[{"name":"Intel Corporation"}]},{"given":"Sanjeev","family":"Kumar","sequence":"additional","affiliation":[{"name":"Intel Corporation"}]},{"given":"Pradeep","family":"Dubey","sequence":"additional","affiliation":[{"name":"Intel Corporation"}]}],"member":"320","published-online":{"date-parts":[[2008,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/1468075.1468121"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1137\/0218014"},{"key":"e_1_2_1_3_1","unstructured":"Intel 64 and IA-32 architectures optimization reference manual. http:\/\/www.intel.com\/products\/processor\/manuals.  Intel 64 and IA-32 architectures optimization reference manual. http:\/\/www.intel.com\/products\/processor\/manuals."},{"key":"e_1_2_1_4_1","volume-title":"An Oracle White Paper","author":"de Wiel M. V.","year":"2005","unstructured":"M. V. de Wiel and H. Daer . Sort Performance Improvements in Oracle Database 10g Release2 . An Oracle White Paper , 2005 . M. V. de Wiel and H. Daer. Sort Performance Improvements in Oracle Database 10g Release2. An Oracle White Paper, 2005."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/12.9738"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.5555\/646422.691940"},{"key":"e_1_2_1_7_1","first-page":"1286","volume-title":"CellSort: High Performance Sorting on the Cell Processor. In VLDB '07","author":"Gedik B.","year":"2007","unstructured":"B. Gedik , R. R. Bordawekar , and P. S. Yu . CellSort: High Performance Sorting on the Cell Processor. In VLDB '07 , pages 1286 -- 1297 , 2007 . B. Gedik, R. R. Bordawekar, and P. S. Yu. CellSort: High Performance Sorting on the Cell Processor. In VLDB '07, pages 1286--1297, 2007."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1142473.1142511"},{"key":"e_1_2_1_9_1","volume-title":"GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures. International Parallel and Distributed Processing Symposium, 2006","author":"Gress A.","year":"2006","unstructured":"A. Gress and G. Zachmann . GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures. International Parallel and Distributed Processing Symposium, 2006 , April 2006 . A. Gress and G. Zachmann. GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures. International Parallel and Distributed Processing Symposium, 2006, April 2006."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1128022.1128023"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.5555\/1299042.1299047"},{"key":"e_1_2_1_12_1","volume-title":"http:\/\/softwarecommunity.intel.com\/isn\/downloads\/intelavx\/Intel- AVX-Programming-Reference-31943302.pdf\/","author":"Vector Extensions Programming Reference Intel Advanced","year":"2008","unstructured":"Intel Advanced Vector Extensions Programming Reference . http:\/\/softwarecommunity.intel.com\/isn\/downloads\/intelavx\/Intel- AVX-Programming-Reference-31943302.pdf\/ , 2008 . Intel Advanced Vector Extensions Programming Reference. http:\/\/softwarecommunity.intel.com\/isn\/downloads\/intelavx\/Intel- AVX-Programming-Reference-31943302.pdf\/, 2008."},{"key":"e_1_2_1_13_1","volume-title":"The Art of Computer Programming","author":"Knuth D. E.","unstructured":"D. E. Knuth . The Art of Computer Programming , Volume 3: (2nd ed.) Sorting and Searching. 1998 . D. E. Knuth. The Art of Computer Programming, Volume 3: (2nd ed.) Sorting and Searching. 1998."},{"key":"e_1_2_1_14_1","first-page":"315","volume-title":"Easy Sort. Byte Magazine","author":"Lacey S.","year":"1991","unstructured":"S. Lacey and R. Box . A Fast , Easy Sort. Byte Magazine , pages 315 -- 320 , 1991 . S. Lacey and R. Box. A Fast, Easy Sort. Byte Magazine, pages 315--320, 1991."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/356593.356594"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/12.16506"},{"key":"e_1_2_1_17_1","volume-title":"Accelerating QuickSort on the Intel Pentium 4 Processor with Hyper-Threading Technology. softwarecommunity.intel.com\/articles\/eng\/2422.htm\/","author":"Parikh R.","year":"2008","unstructured":"R. Parikh . Accelerating QuickSort on the Intel Pentium 4 Processor with Hyper-Threading Technology. softwarecommunity.intel.com\/articles\/eng\/2422.htm\/ , 2008 . R. Parikh. Accelerating QuickSort on the Intel Pentium 4 Processor with Hyper-Threading Technology. softwarecommunity.intel.com\/articles\/eng\/2422.htm\/, 2008."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1198555.1198797"},{"key":"e_1_2_1_19_1","volume-title":"Designing Efficient Sorting Algorithms for Manycore GPUs. Submitted to SuperComputing","author":"Satish N. R.","year":"2008","unstructured":"N. R. Satish , M. Harris , and M. Garland . Designing Efficient Sorting Algorithms for Manycore GPUs. Submitted to SuperComputing 2008 . N. R. Satish, M. Harris, and M. Garland. Designing Efficient Sorting Algorithms for Manycore GPUs. Submitted to SuperComputing 2008."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1399504.1360617"},{"key":"e_1_2_1_21_1","first-page":"97","volume-title":"ACM symposium on Graphics hardware","author":"Sengupta S.","year":"2007","unstructured":"S. Sengupta , M. Harris , Y. Zhang , and J. D. Owens . Scan primitives for GPU computing . In ACM symposium on Graphics hardware , pages 97 -- 106 , 2007 . S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In ACM symposium on Graphics hardware, pages 97--106, 2007."},{"key":"e_1_2_1_22_1","volume-title":"Workshop on General Purpose Processing on Graphics Processing Units","author":"Sintorn E.","year":"2007","unstructured":"E. Sintorn and U. Assarsson . Fast Parallel GPU-Sorting Using a Hybrid Algorithm . In Workshop on General Purpose Processing on Graphics Processing Units , 2007 . E. Sintorn and U. Assarsson. Fast Parallel GPU-Sorting Using a Hybrid Algorithm. In Workshop on General Purpose Processing on Graphics Processing Units, 2007."},{"key":"e_1_2_1_23_1","volume-title":"Fast Parallel Implementation of Quicksort and its Performance Evaluation on SUN Enterprise 10000. pdp, 00:372","author":"Tsigas P.","year":"2003","unstructured":"P. Tsigas and Y. Zhang . A Simple , Fast Parallel Implementation of Quicksort and its Performance Evaluation on SUN Enterprise 10000. pdp, 00:372 , 2003 . P. Tsigas and Y. Zhang. A Simple, Fast Parallel Implementation of Quicksort and its Performance Evaluation on SUN Enterprise 10000. pdp, 00:372, 2003."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC.2007.373606"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/1454159.1454171","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T09:57:09Z","timestamp":1672221429000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/1454159.1454171"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2008,8]]},"references-count":24,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2008,8]]}},"alternative-id":["10.14778\/1454159.1454171"],"URL":"https:\/\/doi.org\/10.14778\/1454159.1454171","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2008,8]]}}}