{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T05:03:29Z","timestamp":1755839009820,"version":"3.41.0"},"reference-count":34,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2020,8,17]],"date-time":"2020-08-17T00:00:00Z","timestamp":1597622400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2020,9,30]]},"abstract":"<jats:p>We consider software-hardware acceleration of K-means clustering on the Intel Xeon+FPGA platform. We design a pipelined accelerator for K-means and combine it with CPU threads to assess performance benefits of (1) acceleration when data are only accessed from system memory and (2) cooperative CPU-FPGA acceleration. Our evaluation shows that the accelerator is up to 12.7\u00d7\/2.4\u00d7 faster than a single CPU thread for the assignment\/update step of K-means. The cooperative use of threads and FPGA is roughly 1.9\u00d7 faster than CPU threads alone or the FPGA by itself. Our approach delivers 4\u00d7\u20135\u00d7 higher throughput compared to existing offload processing approaches.<\/jats:p>","DOI":"10.1145\/3406114","type":"journal-article","created":{"date-parts":[[2020,8,17]],"date-time":"2020-08-17T13:24:45Z","timestamp":1597670685000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Cooperative Software-hardware Acceleration of K-means on a Tightly Coupled CPU-FPGA System"],"prefix":"10.1145","volume":"17","author":[{"given":"Tarek S.","family":"Abdelrahman","sequence":"first","affiliation":[{"name":"The Edward S. Rogers Sr. 
Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada"}]}],"member":"320","published-online":{"date-parts":[[2020,8,17]]},"reference":[{"doi-asserted-by":"publisher","key":"e_1_2_1_1_1","DOI":"10.1109\/ASAP.2016.7760789"},{"doi-asserted-by":"publisher","key":"e_1_2_1_2_1","DOI":"10.1145\/2400682.2400716"},{"doi-asserted-by":"publisher","key":"e_1_2_1_3_1","DOI":"10.1145\/3294054"},{"doi-asserted-by":"publisher","key":"e_1_2_1_4_1","DOI":"10.1109\/ASAP.2014.6868624"},{"doi-asserted-by":"publisher","key":"e_1_2_1_5_1","DOI":"10.1023\/A:1024495400663"},{"unstructured":"P. Gupta. 2015. Xeon+FPGA Platform for the Data Center. Retrieved from http:\/\/www.ece.cmu.edu\/~calcm\/carl\/doku.php?id=pk_gupta_intel_xeon_fpga_platform_for_the_data_center.","key":"e_1_2_1_6_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_7_1","DOI":"10.1109\/ReConFig.2011.49"},{"volume-title":"Proceedings of the NASA\/ESA Conference on Adaptive Hardware and Systems (AHS'11)","author":"Hussain Hanaa M.","unstructured":"Hanaa M. Hussain, Khaled Benkrid, Huseyin Seker, and Ahmet T. Erdogan. 2011. FPGA implementation of K-means algorithm for bioinformatics application: An accelerated approach to clustering Microarray data. In Proceedings of the NASA\/ESA Conference on Adaptive Hardware and Systems (AHS'11). 248--255.","key":"e_1_2_1_8_1"},{"unstructured":"Intel. 2020. MPF\u2014Memory Properties Factory. 
Retrieved from https:\/\/github.com\/OPAE\/intel-fpga-bbb\/tree\/master\/BBB_cci_mpf.","key":"e_1_2_1_9_1"},{"unstructured":"Intel Corp. 2019. Intel Acceleration Stack for Intel Xeon CPU with FPGAs Core Cache Interface (CCI-P) Reference Manual. Retrieved from https:\/\/www.intel.com\/content\/dam\/www\/programmable\/us\/en\/pdfs\/literature\/manual\/mnl-ias-ccip.pdf.","key":"e_1_2_1_10_1"},{"unstructured":"Intel Corp. 2020. Intel QuickAssist Technology. Retrieved from http:\/\/www.intel.com\/content\/www\/us\/en\/embedded\/technology\/quickassist\/overview.html.","key":"e_1_2_1_11_1"},{"unstructured":"Intel Corp. 2020. Power Solutions. Retrieved from https:\/\/www.intel.com\/content\/www\/us\/en\/programmable\/support\/supportresources\/support-centers\/power-support.html.","key":"e_1_2_1_12_1"},{"unstructured":"Intel Documentation. 2020. AN 856: K-Means Clustering with the Intel FPGA SDK for OpenCL. 
Retrieved from https:\/\/www.intel.com\/content\/www\/us\/en\/programmable\/documentation\/rgw1528307246592.html.","key":"e_1_2_1_13_1"},{"key":"e_1_2_1_14_1","volume-title":"LAUR #00-3079","author":"Lavenier Dominique","year":"2000","unstructured":"Dominique Lavenier. 2000. FPGA implementation of the K-means clustering algorithm for hyperspectral images. Los Alamos National Lab, LAUR #00-3079 (2000), 1--18."},{"doi-asserted-by":"publisher","key":"e_1_2_1_16_1","DOI":"10.1109\/ICCAD.2017.8203845"},{"unstructured":"M. Lichman. 2013. UCI Machine Learning Repository. Retrieved from http:\/\/archive.ics.uci.edu\/ml.","key":"e_1_2_1_17_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_18_1","DOI":"10.1109\/FPL.2012.6339141"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the Emerging Information Technology Conference. 3--5.","author":"Liu Wei-Chuan","year":"2005","unstructured":"Wei-Chuan Liu, Jiun-Long Huang, and Ming-Syan Chen. 2005. KACU: K-means with hardware centroid-updating. In Proceedings of the Emerging Information Technology Conference. 3--5."},{"doi-asserted-by":"publisher","key":"e_1_2_1_20_1","DOI":"10.1109\/TIT.1982.1056489"},{"unstructured":"Enno Luebbers, Song Liu, and Michael Chu. 2020. Simplify Software Integration for FPGA Accelerators with OPAE. 
Retrieved from https:\/\/01.org\/sites\/default\/files\/downloads\/opae\/open-programmable-acceleration-engine-paper.pdf.","key":"e_1_2_1_21_1"},{"key":"e_1_2_1_22_1","volume-title":"Using multi-core HW\/SW co-design architecture for accelerating K-means clustering algorithm. CoRR abs\/1807.09250","author":"Kamali Hadi Mardani","year":"2018","unstructured":"Hadi Mardani Kamali. 2018. Using multi-core HW\/SW co-design architecture for accelerating K-means clustering algorithm. CoRR abs\/1807.09250 (2018)."},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the Great Lakes Symposium on VLSI. 459--462","author":"Kamali Hadi Mardani","year":"2018","unstructured":"Hadi Mardani Kamali and Avesta Sasan. 2018. MUCH-SWIFT: A high-throughput multi-core HW\/SW co-design K-means clustering architecture. In Proceedings of the Great Lakes Symposium on VLSI. 459--462."},{"volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA\u201914)","author":"Putnam Andrew","unstructured":"Andrew Putnam et al. 2014. A reconfigurable fabric for accelerating large-scale datacenter services. In Proceedings of the International Symposium on Computer Architecture (ISCA\u201914). 13--24.","key":"e_1_2_1_24_1"},{"doi-asserted-by":"crossref","unstructured":"A. Rodriguez, A. Navarro, R. Asenjo, F. Corbera, R. Gran Tejero, D. Suarez Gracia, and J. Nunez-Yanez. 2019. 
Parallel multiprocessing and scheduling on the heterogeneous Xeon+FPGA platform. J. Supercomput. (June 2019).","key":"e_1_2_1_25_1","DOI":"10.1007\/s11227-019-02935-1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_26_1","DOI":"10.1109\/CAHPC.2018.8645850"},{"key":"e_1_2_1_27_1","volume-title":"Introduction to Data Mining","author":"Tan Pang-Ning","unstructured":"Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar. 2018. Introduction to Data Mining (2nd ed.). Pearson.","edition":"2"},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the 3rd Parallel Tools Workshop on Tools for High Performance Computing. 157--173","author":"Terpstra Daniel","year":"2009","unstructured":"Daniel Terpstra, Heike Jagode, Haihang You, and Jack Dongarra. 2009. Collecting performance data with PAPI-C. In Proceedings of the 3rd Parallel Tools Workshop on Tools for High Performance Computing. 157--173."},{"key":"e_1_2_1_29_1","volume-title":"Proceedings of the International Conference on Computational Science","volume":"51","author":"Vilches A.","unstructured":"A. Vilches, R. Asenjo, A. G. Navarro, F. Corbera, R. Gran Tejero, and M. Garzar\u00e1n. 2015. 
Adaptive partitioning for irregular applications on heterogeneous CPU-GPU chips. In Proceedings of the International Conference on Computational Science, Vol. 51. 140--149."},{"volume-title":"Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA\u201916)","author":"Weisz Gabriel","unstructured":"Gabriel Weisz, Joseph Melber, Yu Wang, Kermin Fleming, Eriko Nurvitadhi, and James C. Hoe. 2016. A study of pointer-chasing performance on shared-memory processor-FPGA systems. In Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA\u201916). 264--273.","key":"e_1_2_1_30_1"},{"unstructured":"Bruce Wile. 2014. CAPI is Core to POWER. Retrieved from http:\/\/www-03.ibm.com\/linux\/blogs\/capi\/.","key":"e_1_2_1_31_1"},{"unstructured":"R. Wilson. 2014. Heterogeneous Computing Meets the Data Center. Retrieved from https:\/\/www.altera.com\/solutions\/technology\/system-design\/articles\/_2014\/heterogeneous-computing.html.","key":"e_1_2_1_32_1"},{"volume-title":"Advances in K-means Clustering","author":"Wu Junjie","unstructured":"Junjie Wu. 2012. Advances in K-means Clustering. Springer-Verlag Berlin.","key":"e_1_2_1_33_1"},{"unstructured":"Xilinx Inc. 2014. Zynq-7000: All Programmable SoC. 
Retrieved from http:\/\/www.xilinx.com\/products\/silicon-devices\/soc\/zynq-7000.html.","key":"e_1_2_1_34_1"},{"volume-title":"Proceedings of the International Symposium on Computer Architecture and High Performance Computing. 137--144","author":"Zhou Shijie","unstructured":"Shijie Zhou and Viktor K. Prasanna. 2017. Accelerating graph analytics on CPU-FPGA heterogeneous platform. In Proceedings of the International Symposium on Computer Architecture and High Performance Computing. 137--144.","key":"e_1_2_1_36_1"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3406114","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3406114","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:31:52Z","timestamp":1750195912000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3406114"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,8,17]]},"references-count":34,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2020,9,30]]}},"alternative-id":["10.1145\/3406114"],"URL":"https:\/\/doi.org\/10.1145\/3406114","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2020,8,17]]},"assertion":[{"value":"2020-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},
{"value":"2020-06-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-08-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}