{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,25]],"date-time":"2025-02-25T05:26:38Z","timestamp":1740461198675,"version":"3.37.3"},"reference-count":15,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,12,18]],"date-time":"2024-12-18T00:00:00Z","timestamp":1734480000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,12,18]],"date-time":"2024-12-18T00:00:00Z","timestamp":1734480000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"National Key Research and Development Program of China","award":["2023YFB3001903"],"award-info":[{"award-number":["2023YFB3001903"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62032023","42104078","6190241"],"award-info":[{"award-number":["62032023","42104078","6190241"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["CCF Trans. HPC"],"published-print":{"date-parts":[[2025,2]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Merge sort as a divide-sort-merge paradigm has been widely applied in computer science fields. As modern reduced instruction set computing architectures like the fifth generation (RISC-V) regard multiple registers as a vector register group for wide instruction parallelism, optimizing merge sort with this vectorized property is becoming increasingly common. In this paper, we overhaul the divide-sort-merge paradigm, from its register-level sort to the cache-aware merge, to develop a fine-grained RISC-V vectorized merge sort (RVMS). From the <jats:italic>register-level<\/jats:italic> view, the inline vectorized transpose instruction is missed in RISC-V, so implementing it efficiently is non-trivial. Besides, the vectorized comparisons do not always work well in the merging networks. Both issues primarily stem from the expensive data shuffle instruction. To bypass it, RVMS strides to take register data as the proxy of data shuffle to accelerate the transpose operation, and meanwhile replaces vectorized comparisons with scalar cousin for more light real value swap. On the other hand, as <jats:italic>cache-aware<\/jats:italic> merge makes larger data merge in the cache, most merge schemes have two drawbacks: the in-cache merge usually has low cache utilization, while the out-of-cache merging network remains an ineffectively symmetric structure. To this end, we propose the half-merge scheme to employ the auxiliary space of in-place merge to halve the footprint of na\u00efve merge sort, and meanwhile copy one sequence to this space to avoid the former data exchange. Furthermore, an asymmetric merging network is developed to adapt to two different input sizes. Experiments on the RISC-V processor SG2042 show that four fine-grained optimization schemes including register strided transpose, hybrid merging network, half-merge strategy, and asymmetric merging network, improve performance by 4.05%, 19.88%, 12.23%, and 11.04% respectively. Importantly, the overall performance is 1.34x faster than the parallel sorting in the Boost C++ library, and 1.85x faster than std::sort.<\/jats:p>","DOI":"10.1007\/s42514-024-00201-2","type":"journal-article","created":{"date-parts":[[2024,12,18]],"date-time":"2024-12-18T18:33:53Z","timestamp":1734546833000},"page":"58-71","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Fine-grained vectorized merge sorting on RISC-V: from register to cache"],"prefix":"10.1007","volume":"7","author":[{"given":"Jin","family":"Zhang","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jincheng","family":"Zhou","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiang","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Di","family":"Ma","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chunye","family":"Gong","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2024,12,18]]},"reference":[{"key":"201_CR1","doi-asserted-by":"publisher","first-page":"259","DOI":"10.14778\/3489496.3489507","volume":"15","author":"A Arman","year":"2021","unstructured":"Arman, A., Loguinov, D.: Origami: A high-performance mergesort framework. Proc. VLDB Endow. 15, 259\u2013271 (2021)","journal-title":"Proc. VLDB Endow."},{"key":"201_CR2","doi-asserted-by":"publisher","first-page":"85","DOI":"10.14778\/2732219.2732227","volume":"7","author":"C Balkesen","year":"2013","unstructured":"Balkesen, C., Alonso, G., Teubner, J., \u00d6zsu, M.T.: Multi-core, main-memory joins: Sort vs. hash revisited. Proc. VLDB Endow. 7, 85\u201396 (2013)","journal-title":"Proc. VLDB Endow."},{"key":"201_CR3","doi-asserted-by":"crossref","unstructured":"Batcher, K.E.: Sorting networks and their applications. Proceedings of the April 30\u2013May 2, 1968, spring joint computer conference (1968)","DOI":"10.1145\/1468075.1468121"},{"key":"201_CR4","doi-asserted-by":"publisher","first-page":"1313","DOI":"10.14778\/1454159.1454171","volume":"1","author":"J Chhugani","year":"2008","unstructured":"Chhugani, J., Nguyen, A.D., Lee, V.W., Macy, W., Hagog, M., Chen, Y.-K., Baransi, A., Kumar, S., Dubey, P.K.: Efficient implementation of sorting on multi-core simd cpu architecture. Proc. VLDB Endow. 1, 1313\u20131324 (2008)","journal-title":"Proc. VLDB Endow."},{"key":"201_CR5","unstructured":"Gamble, J.M.: Sorting network generator. http:\/\/pages.ripco.net\/~jgamble\/nw.html (2019)"},{"key":"201_CR6","doi-asserted-by":"crossref","unstructured":"Govindaraju, N.K., Gray, J., Kumar, R., Manocha, D.: Gputerasort: high performance graphics co-processor sorting for large database management. Proceedings of the 2006 ACM SIGMOD international conference on Management of data (2006)","DOI":"10.1145\/1142473.1142511"},{"issue":"3","key":"201_CR7","doi-asserted-by":"publisher","first-page":"348","DOI":"10.1145\/42392.42403","volume":"31","author":"B-C Huang","year":"1988","unstructured":"Huang, B.-C., Langston, M.A.: Practical in-place merging. Commun. ACM 31(3), 348\u2013352 (1988)","journal-title":"Commun. ACM"},{"key":"201_CR8","doi-asserted-by":"publisher","first-page":"1274","DOI":"10.14778\/2809974.2809988","volume":"8","author":"H Inoue","year":"2015","unstructured":"Inoue, H., Taura, K.: Simd- and cache-friendly algorithm for sorting an array of structures. Proc. VLDB Endow. 8, 1274\u20131285 (2015)","journal-title":"Proc. VLDB Endow."},{"key":"201_CR9","doi-asserted-by":"crossref","unstructured":"Inoue, H., Moriyama, T., Komatsu, H., Nakatani, T.: Aa-sort: A new parallel sorting algorithm for multi-core simd processors. 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007), 189\u2013198 (2007)","DOI":"10.1109\/PACT.2007.4336211"},{"key":"201_CR10","unstructured":"Knuth, D.E.: The art of computer programming: sorting and searching (volume 3). (1973)"},{"key":"201_CR11","doi-asserted-by":"crossref","unstructured":"Odeh, S., Green, O., Mwassi, Z., Shmueli, O., Birk, Y.: Merge path\u2014parallel merging made simple. 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 1611\u20131618 (2012)","DOI":"10.1109\/IPDPSW.2012.202"},{"key":"201_CR12","doi-asserted-by":"crossref","unstructured":"Satish, N., Kim, C., Chhugani, J., Nguyen, A.D., Lee, V.W., Kim, D., Dubey, P.K.: Fast sort on cpus and gpus: a case for bandwidth oblivious simd sort. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (2010)","DOI":"10.1145\/1807167.1807207"},{"key":"201_CR13","doi-asserted-by":"crossref","unstructured":"Yang, M., Zhang, P., Fang, J., Liu, W., Huang, C.: thsort: an efficient parallel sorting algorithm on multi-core dsps. CCF Transactions on High Performance Computing, 1\u201316 (2024)","DOI":"10.1007\/s42514-023-00175-7"},{"key":"201_CR14","doi-asserted-by":"crossref","unstructured":"Yin, Z., Zhang, T., M\u00fcller, A., Liu, H., Wei, Y., Schmidt, B., Liu, W.: Efficient parallel sort on avx-512-based multi-core and many-core architectures. 2019 IEEE 21st International Conference on High Performance Computing and Communications, 168\u2013176 (2019)","DOI":"10.1109\/HPCC\/SmartCity\/DSS.2019.00038"},{"key":"201_CR15","unstructured":"Zhou, J., Zhang, J., Zhang, X.: A Hybrid Vectorized Merge Sort on ARM NEON. http:\/\/arxiv.org\/abs\/2409.03970arXiv:2409.03970 [accepted] (2024)"}],"container-title":["CCF Transactions on High Performance Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42514-024-00201-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s42514-024-00201-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42514-024-00201-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,24]],"date-time":"2025-02-24T10:49:49Z","timestamp":1740394189000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s42514-024-00201-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12,18]]},"references-count":15,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,2]]}},"alternative-id":["201"],"URL":"https:\/\/doi.org\/10.1007\/s42514-024-00201-2","relation":{},"ISSN":["2524-4922","2524-4930"],"issn-type":[{"type":"print","value":"2524-4922"},{"type":"electronic","value":"2524-4930"}],"subject":[],"published":{"date-parts":[[2024,12,18]]},"assertion":[{"value":"16 July 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 October 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 December 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declaration"}},{"value":"The authors declare no Conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}]}}