{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,26]],"date-time":"2025-06-26T04:17:14Z","timestamp":1750911434254,"version":"3.37.3"},"reference-count":32,"publisher":"Springer Science and Business Media LLC","issue":"12","license":[{"start":{"date-parts":[[2021,9,23]],"date-time":"2021-09-23T00:00:00Z","timestamp":1632355200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,9,23]],"date-time":"2021-09-23T00:00:00Z","timestamp":1632355200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"EPFL Lausanne"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Sign Process Syst"],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>String sorting is a fundamental kernel of string matching and database index construction; yet, it has not been studied as extensively as fixed-length keys sorting. Because processing variable-length keys in hardware is challenging, it is no surprise that no hardware-accelerated string sorters have been proposed yet. In this paper, we present Parallel Hybrid Super Scalar String Sample Sort (pHS<jats:sup>5<\/jats:sup>) on Intel HARPv2, a heterogeneous CPU-FPGA system with a server-grade CPU. Our pHS<jats:sup>5<\/jats:sup> extends pS<jats:sup>5<\/jats:sup>, the state-of-the-art string sorting algorithm for multi-core shared memory CPUs, by adding multiple processing elements (PEs) on the FPGA. Each PE accelerates one instance of the most effectively parallelizable among the dominant kernels of pS<jats:sup>5<\/jats:sup> by up to 33% compared to a single Intel Xeon Broadwell core despite a clock frequency that is 17 times slower. Furthermore, we extended the job scheduling mechanism of pS<jats:sup>5<\/jats:sup> to schedule the accelerable kernel not only among available CPU cores but also on our PEs, while retaining the complex high-level control flow and the sorting of the smaller data sets on the CPU. Overall, we accelerate the entire algorithm by up to 10% with respect to the 28-thread software baseline running on the Xeon processor and by up to 36% at lower thread counts. Finally, we generalize our results assuming pS<jats:sup>5<\/jats:sup> as representative of software that is heavily optimized for modern multi-core CPUs and investigate the performance and energy advantage that an FPGA in a datacenter setting can offer to regular RTL users compared to additional CPU cores.<\/jats:p>","DOI":"10.1007\/s11265-021-01686-8","type":"journal-article","created":{"date-parts":[[2021,9,24]],"date-time":"2021-09-24T00:59:01Z","timestamp":1632445141000},"page":"1405-1417","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["How Many CPU Cores is an FPGA Worth? Lessons Learned from Accelerating String Sorting on a CPU-FPGA System"],"prefix":"10.1007","volume":"93","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8050-0042","authenticated-orcid":false,"given":"Mikhail","family":"Asiatici","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Damian","family":"Maiorano","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Paolo","family":"Ienne","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2021,9,23]]},"reference":[{"key":"1686_CR1","unstructured":"Knuth, D. E. (1998). The art of computer programming: sorting and searching Vol. 3. London: Pearson Education."},{"issue":"1","key":"1686_CR2","doi-asserted-by":"publisher","first-page":"107","DOI":"10.1145\/1327452.1327492","volume":"51","author":"J Dean","year":"2008","unstructured":"Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107\u2013113.","journal-title":"Communications of the ACM"},{"key":"1686_CR3","doi-asserted-by":"crossref","unstructured":"Lauterbach, C., Garland, M., Sengupta, S., Luebke, D., & Manocha, D. (2009). Fast BVH construction on GPUs. In Computer Graphics Forum, (Vol. 28 pp. 375\u2013384). Wiley Online Library.","DOI":"10.1111\/j.1467-8659.2009.01377.x"},{"key":"1686_CR4","unstructured":"Burrows, M., & Wheeler, D. J. (1994). A block-sorting lossless data compression algorithm: Systems Research Center."},{"key":"1686_CR5","doi-asserted-by":"crossref","unstructured":"Casper, J., & Olukotun, K. (2014). Hardware acceleration of database operations. In Proceedings of the 22nd ACM\/SIGDA International Symposium on Field Programmable Gate Arrays (pp. 151\u2013160). ACM.","DOI":"10.1145\/2554688.2554787"},{"issue":"2","key":"1686_CR6","doi-asserted-by":"publisher","first-page":"1313","DOI":"10.14778\/1454159.1454171","volume":"1","author":"J Chhugani","year":"2008","unstructured":"Chhugani, J., Nguyen, A. D., Lee, V. W., Macy, W., Hagog, M., Chen, Y.-K., Baransi, A., Kumar, S., & Dubey, P. (2008). Efficient implementation of sorting on multi-core simd cpu architecture. Proceedings of the VLDB Endowment, 1(2), 1313\u20131324.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"1686_CR7","doi-asserted-by":"crossref","unstructured":"Satish, N., Harris, M., & Garland, M. (2009). Designing efficient sorting algorithms for manycore GPUs. In IEEE International Symposium on Parallel & Distributed Processing (IPDPS) (pp. 1\u201310). IEEE.","DOI":"10.1109\/IPDPS.2009.5161005"},{"key":"1686_CR8","doi-asserted-by":"crossref","unstructured":"Deshpande, A., & Narayanan, P. J. (2013). Can GPUs sort strings efficiently?. In 20th International Conference on High Performance Computing (HiPC) (pp. 305\u2013313). IEEE.","DOI":"10.1109\/HiPC.2013.6799129"},{"key":"1686_CR9","doi-asserted-by":"crossref","unstructured":"Koch, D., & Torresen, J. (2011). FPGASort: A high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting. In Proceedings of the 19th ACM\/SIGDA International Symposium on Field Programmable Gate Arrays (pp. 45\u201354). ACM.","DOI":"10.1145\/1950413.1950427"},{"key":"1686_CR10","doi-asserted-by":"crossref","unstructured":"Matai, J., Richmond, D., Lee, D., Blair, Z., Wu, Q., Abazari, A., & Kastner, R. (2016). Resolve: Generation of high-performance sorting architectures from high-level synthesis. In Proceedings of the 24th ACM\/SIGDA International Symposium on Field Programmable Gate Arrays (pp. 195\u2013204). ACM.","DOI":"10.1145\/2847263.2847268"},{"issue":"1","key":"1686_CR11","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1515\/acss-2014-0013","volume":"16","author":"V Sklyarov","year":"2014","unstructured":"Sklyarov, V., Skliarova, I., & Sudnitson, A. (2014). FPGA-based accelerators for parallel data sort. Applied Computer Systems, 16(1), 53\u201363.","journal-title":"Applied Computer Systems"},{"key":"1686_CR12","doi-asserted-by":"crossref","unstructured":"Chen, R., Siriyal, S., & Prasanna, V. (2015). Energy and memory efficient mapping of bitonic sorting on FPGA. In Proceedings of the 23rd ACM\/SIGDA International Symposium on Field Programmable Gate Arrays (pp. 240\u2013249). ACM.","DOI":"10.1145\/2684746.2689068"},{"key":"1686_CR13","doi-asserted-by":"crossref","unstructured":"Bingmann, T., & Sanders, P. (2013). Parallel string sample sort. In European Symposium on Algorithms (pp. 169\u2013180). Springer.","DOI":"10.1007\/978-3-642-40450-4_15"},{"issue":"1","key":"1686_CR14","doi-asserted-by":"publisher","first-page":"235","DOI":"10.1007\/s00453-015-0071-1","volume":"77","author":"T Bingmann","year":"2017","unstructured":"Bingmann, T., Eberle, A., & Sanders, Peter (2017). Engineering parallel string sorting. Algorithmica, 77(1), 235\u2013286.","journal-title":"Algorithmica"},{"key":"1686_CR15","doi-asserted-by":"crossref","unstructured":"Neelima, B., Shamsundar, B., Narayan, A., Prabhu, R., & Gomes, C. (2017). Kepler GPU accelerated recursive sorting using dynamic parallelism. Concurrency and Computation: Practice and Experience, 29(4).","DOI":"10.1002\/cpe.3865"},{"key":"1686_CR16","doi-asserted-by":"crossref","unstructured":"Umuroglu, Y., Morrison, D., & Jahre, M. (2015). Hybrid breadth-first search on a single-chip FPGA-CPU heterogeneous platform. In Proceedings of the 25th International Conference on Field-Programmable Logic and Applications (pp. 1\u20138). IEEE.","DOI":"10.1109\/FPL.2015.7293939"},{"key":"1686_CR17","doi-asserted-by":"crossref","unstructured":"Weisz, G., Melber, J., Wang, Y., Fleming, K., Nurvitadhi, E., & Hoe, J. C. (2016). A study of pointer-chasing performance on shared-memory processor-FPGA systems. In Proceedings of the 24th ACM\/SIGDA International Symposium on Field Programmable Gate Arrays (pp. 264\u2013273). ACM.","DOI":"10.1145\/2847263.2847269"},{"key":"1686_CR18","doi-asserted-by":"crossref","unstructured":"Zhang, C., Chen, R., & Prasanna, V. (2016). High throughput large scale sorting on a CPU-FPGA heterogeneous platform. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (pp. 148\u2013155).","DOI":"10.1109\/IPDPSW.2016.117"},{"key":"1686_CR19","doi-asserted-by":"crossref","unstructured":"Chen, R., & Prasanna, V. K. (2016). Accelerating equi-join on a CPU-FPGA heterogeneous platform. In Proceedings of the 24th IEEE Symposium on Field-Programmable Custom Computing Machines (pp. 212\u2013219). IEEE.","DOI":"10.1109\/FCCM.2016.62"},{"key":"1686_CR20","unstructured":"Gupta, P. K. (2016). Accelerating Datacenter Workloads. Keynote presented at FPL 2016."},{"key":"1686_CR21","doi-asserted-by":"crossref","unstructured":"Putnam, A., Caulfield, A. M., Chung, E. S., Chiou, D., Constantinides, K., Demme, J., Esmaeilzadeh, H., Fowers, J., Gopal, G. P., Gray, J., & M., Haselman (2014). A reconfigurable fabric for accelerating large-scale datacenter services. In Proceedings of the 41st Annual International Symposium on Computer Architecture.","DOI":"10.1109\/ISCA.2014.6853195"},{"issue":"3","key":"1686_CR22","doi-asserted-by":"publisher","first-page":"496","DOI":"10.1145\/321592.321600","volume":"17","author":"WD Frazer","year":"1970","unstructured":"Frazer, W. D., & McKellar, A. C. (1970). Samplesort: A sampling approach to minimal storage tree sorting. Journal of the ACM (JACM), 17(3), 496\u2013507.","journal-title":"Journal of the ACM (JACM)"},{"key":"1686_CR23","unstructured":"Rantala, T. (2015). A collection of string sorting algorithm implementations. https:\/\/github.com\/rantala\/string-sorting."},{"key":"1686_CR24","doi-asserted-by":"crossref","unstructured":"Bingmann, T. (2014). Engineering parallel string sorting for multi-core systems. https:\/\/panthema.net\/2013\/parallel-string-sorting\/.","DOI":"10.1007\/978-3-642-40450-4_15"},{"key":"1686_CR25","unstructured":"De Gelas, J. (2016). The Intel Xeon E5 v4 review: Testing Broadwell-EP with demanding server workloads. https:\/\/www.anandtech.com\/show\/10158\/the-intel-xeon-e5-v4-review."},{"key":"1686_CR26","unstructured":"Williams, C. (2016). Here\u2019s what an Intel Broadwell Xeon with a built-in FPGA looks like. https:\/\/www.theregister.co.uk\/2016\/03\/14\/intel_xeon_fpga\/."},{"key":"1686_CR27","unstructured":"Jones, S. (2016). 10 nm SRAM projections - who will lead. https:\/\/www.semiwiki.com\/forum\/content\/5620-10nm-sram-projections-who-will-lead.html."},{"key":"1686_CR28","unstructured":"Intel Corp. (2017). Product specifications. https:\/\/ark.intel.com\/."},{"key":"1686_CR29","doi-asserted-by":"crossref","unstructured":"Chen, H., Madaminov, S., Ferdman, M., & Milder, P. (2020). FPGA-accelerated samplesort for large data sets. In Proceedings of the 28th ACM\/SIGDA International Symposium on Field Programmable Gate Arrays (pp. 222\u2013232).","DOI":"10.1145\/3373087.3375304"},{"key":"1686_CR30","doi-asserted-by":"crossref","unstructured":"Jun, S.-W., Xu, S., & et al. (2017). Terabyte sort on fpga-accelerated flash storage. In Proceedings of the 25th IEEE Symposium on Field-Programmable Custom Computing Machines (pp. 17\u201324). IEEE.","DOI":"10.1109\/FCCM.2017.53"},{"key":"1686_CR31","doi-asserted-by":"crossref","unstructured":"Chen, R., & Prasanna, V. K. (2017). Computer generation of high throughput and memory efficient sorting designs on FPGA. IEEE Transactions on Parallel and Distributed Systems, 28(11), 3100\u20133113.","DOI":"10.1109\/TPDS.2017.2705128"},{"key":"1686_CR32","doi-asserted-by":"crossref","unstructured":"Chang, M.-C. F., Chen, Y.-T., Cong, J., Huang, P.-T., Kuo, C.-L., & Yu, C. H. (2016). The SMEM seeding acceleration for DNA sequence alignment. In Proceedings of the 24th IEEE Symposium on Field-Programmable Custom Computing Machines (pp. 32\u201339). IEEE.","DOI":"10.1109\/FCCM.2016.21"}],"container-title":["Journal of Signal Processing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11265-021-01686-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11265-021-01686-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11265-021-01686-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,7,23]],"date-time":"2022-07-23T07:22:14Z","timestamp":1658560934000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11265-021-01686-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,9,23]]},"references-count":32,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["1686"],"URL":"https:\/\/doi.org\/10.1007\/s11265-021-01686-8","relation":{},"ISSN":["1939-8018","1939-8115"],"issn-type":[{"type":"print","value":"1939-8018"},{"type":"electronic","value":"1939-8115"}],"subject":[],"published":{"date-parts":[[2021,9,23]]},"assertion":[{"value":"15 October 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 April 2021","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 July 2021","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 September 2021","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}