{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,18]],"date-time":"2026-01-18T00:41:40Z","timestamp":1768696900915,"version":"3.49.0"},"reference-count":44,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,12,5]],"date-time":"2023-12-05T00:00:00Z","timestamp":1701734400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NSERC Discovery","award":["RGPIN-2019-04613, DGECR-2019-00120, ALLRP-552042-2020"],"award-info":[{"award-number":["RGPIN-2019-04613, DGECR-2019-00120, ALLRP-552042-2020"]}]},{"name":"Canada Foundation for Innovation John R. Evans Leaders Fund and British Columbia Knowledge Development Fund"},{"name":"Simon Fraser University New Faculty Start-up"},{"name":"Huawei, Xilinx, and Nvidia"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2023,12,31]]},"abstract":"<jats:p>The k-nearest neighbors (KNN) algorithm is an essential algorithm in many applications, such as similarity search, image classification, and database query. With the rapid growth in the dataset size and the feature dimension of each data point, processing KNN becomes more compute and memory hungry. Most prior studies focus on accelerating the computation of KNN using the abundant parallel resource on FPGAs. However, they often overlook the memory access optimizations on FPGA platforms and only achieve a marginal speedup over a multi-thread CPU implementation for large datasets.<\/jats:p><jats:p>In this article, we design and implement CHIP-KNN: an HLS-based, configurable, and high-performance KNN accelerator. CHIP-KNN\u00a0 optimizes the off-chip memory access on modern HBM-based FPGAs such as the AMD\/Xilinx Alveo U280 FPGA board. CHIP-KNN\u00a0is configurable for all essential parameters used in the algorithm, including the size of the search dataset, the feature dimension and data type representation of each data point, the distance metric, and the number of nearest neighbors - K. In terms of design architecture, we explore and discuss the tradeoffs between two design versions: CHIP-KNNv1\u00a0(Ping-Pong buffer based) and CHIP-KNNv2\u00a0 (streaming-based). Moreover, we investigate the routing congestion issue in our accelerator design, implement hierarchical structures to shorten critical paths, and integrate an open-source floorplanning optimization tool called TAPA\/AutoBridge to eliminate the place-and-route issues. To explore the design space and balance the computation and memory access performance, we also build an analytical performance model. Given a user configuration of the KNN parameters, our tool can automatically generate TAPA HLS C code for the optimal accelerator design and the corresponding host code, on the HBM-based FPGA platform.<\/jats:p><jats:p>Our experimental results on the Alveo U280 show that, compared to a 48-thread CPU implementation, CHIP-KNNv2\u00a0achieves a geomean performance speedup of 15\u00d7, with a maximum speedup of 45\u00d7. Additionally, we show that CHIP-KNNv2\u00a0achieves up to 2.1\u00d7 performance speedup over CHIP-KNNv1\u00a0 while increasing configurability. Compared with the state-of-the-art Facebook AI Similarity Search (FAISS)\u00a0[<jats:xref ref-type=\"bibr\">23<\/jats:xref>] GPU implementation running on a Nvidia Tesla V100 GPU, CHIP-KNNv2\u00a0achieves an average latency reduction of 30.6\u00d7 while requiring 34.3% of GPU power consumption.<\/jats:p>","DOI":"10.1145\/3616873","type":"journal-article","created":{"date-parts":[[2023,9,13]],"date-time":"2023-09-13T12:19:42Z","timestamp":1694607582000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["CHIP-KNNv2: A<u>C<\/u>onfigurable and<u>Hi<\/u>gh-<u>P<\/u>erformance<u>K<\/u>-<u>N<\/u>earest<u>N<\/u>eighbors Accelerator on HBM-based FPGAs"],"prefix":"10.1145","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-1348-6583","authenticated-orcid":false,"given":"Kenneth","family":"Liu","sequence":"first","affiliation":[{"name":"School of Engineering Science, Simon Fraser University, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3315-7368","authenticated-orcid":false,"given":"Alec","family":"Lu","sequence":"additional","affiliation":[{"name":"School of Engineering Science, Simon Fraser University, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-6309-7995","authenticated-orcid":false,"given":"Kartik","family":"Samtani","sequence":"additional","affiliation":[{"name":"School of Engineering Science, Simon Fraser University, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0603-9697","authenticated-orcid":false,"given":"Zhenman","family":"Fang","sequence":"additional","affiliation":[{"name":"School of Engineering Science, Simon Fraser University, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0705-9510","authenticated-orcid":false,"given":"Licheng","family":"Guo","sequence":"additional","affiliation":[{"name":"Computer Science Department, University of California, Los Angeles, United States"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,12,5]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICFPT47387.2019.00019"},{"issue":"3","key":"e_1_3_1_3_2","doi-asserted-by":"crossref","first-page":"175","DOI":"10.1080\/00031305.1992.10475879","article-title":"An introduction to kernel and nearest-neighbor nonparametric regression","volume":"46","author":"Altman N. S.","year":"1992","unstructured":"N. S. Altman. 1992. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician 46, 3 (1992), 175\u2013185.","journal-title":"The American Statistician"},{"key":"e_1_3_1_4_2","first-page":"225","volume-title":"Proceedings of the High Performance Computing for Computational Science.","author":"Apar\u00edcio G.","year":"2007","unstructured":"G. Apar\u00edcio, I. Blanquer, and V. Hern\u00e1ndez. 2007. A parallel implementation of the K nearest neighbors classifier in three levels: Threads, MPI processes and the grid. In Proceedings of the High Performance Computing for Computational Science.Michel Dayd\u00e9, Jos\u00e9 M. L. M. Palma, \u00c1lvaro L. G. A. Coutinho, Esther Pacitti, and Jo\u00e3o Correia Lopes (Eds.), Springer, Berlin, 225\u2013235."},{"key":"e_1_3_1_5_2","first-page":"33","volume-title":"Proceedings of the IEEE CGC Workshop on Computational Geometry","author":"Arya Sunil","year":"1998","unstructured":"Sunil Arya and David M. Mount. 1998. ANN: Library for approximate nearest neighbor searching. In Proceedings of the IEEE CGC Workshop on Computational Geometry. 33\u201340."},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/293347.293348"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2022.3147743"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_3_1_9_2","first-page":"204","volume-title":"Proceedings of the 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines","author":"Chi Yuze","year":"2021","unstructured":"Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, and Jason Cong. 2021. Extending high-level synthesis for task-parallel programs. In Proceedings of the 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines. 204\u2013213."},{"key":"e_1_3_1_10_2","unstructured":"Jason Cong Zhenman Fang Yuchen Hao Peng Wei Cody Hao Yu Chen Zhang and Peipei Zhou. 2018. Best-effort FPGA programming: A few steps can go a long way. arXiv:1807.01340 [cs.AR]"},{"key":"e_1_3_1_11_2","first-page":"93","volume-title":"Proceedings of the 26th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines","author":"Cong Jason","year":"2018","unstructured":"Jason Cong, Zhenman Fang, Michael Lo, Hanrui Wang, Jingxian Xu, and Shaochong Zhang. 2018. Understanding performance differences of FPGAs and GPUs. In Proceedings of the 26th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE Computer Society, 93\u201396."},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3061639.3062208"},{"key":"e_1_3_1_13_2","unstructured":"Intel Corporation. 2021. AN 584: Timing closure methodology for advanced FPGA designs. Retrieved July 28 2023 from https:\/\/www.intel.com\/content\/dam\/www\/programmable\/us\/en\/pdfs\/literature\/an\/an584.pdf"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/ReCoSoC48741.2019.9034938"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2008.4563100"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2010.5654017"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/SOCC56010.2022.9908102"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-39964-3_62"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439289"},{"key":"e_1_3_1_20_2","first-page":"10","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Guo Ruiqi","year":"2020","unstructured":"Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating large-scale inference with anisotropic vector quantization. In Proceedings of the International Conference on Machine Learning. 10 pages. arXiv:1908.10396. Retrieved from https:\/\/arxiv.org\/abs\/1908.10396"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/AHS.2012.6268651"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/BigData.2016.7840682"},{"key":"e_1_3_1_23_2","unstructured":"Intel. 2022. Intel (R) Memory Latency Checker. (2022). Retrieved July 3 2022 from https:\/\/www.intel.com\/content\/www\/us\/en\/download\/736633\/intel-memory-latency-checker-intel-mlc.html"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/TBDATA.2019.2921572"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2018.00099"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA53966.2022.00021"},{"key":"e_1_3_1_27_2","volume-title":"Acceleration of k-Nearest Neighbor and SRAD Algorithms Using Intel FPGA SDK for OpenCL","author":"Liu Liyuan","year":"2018","unstructured":"Liyuan Liu. 2018. Acceleration of k-Nearest Neighbor and SRAD Algorithms Using Intel FPGA SDK for OpenCL. Master\u2019s Thesis. University of Windsor."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICFPT51103.2020.00027"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3517131"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2015.125"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC.2016.7761587"},{"key":"e_1_3_1_32_2","first-page":"331","volume-title":"Proceedings of the VISAPP International Conference on Computer Vision Theory and Applications","author":"Muja Marius","year":"2009","unstructured":"Marius Muja and David G. Lowe. 2009. Fast approximate nearest neighbors with automatic algorithm configuration. In Proceedings of the VISAPP International Conference on Computer Vision Theory and Applications. 331\u2013340."},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2015.7"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/BigData.2018.8622087"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/276305.276319"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD46524.2019.00030"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2955864"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10115-007-0114-2"},{"key":"e_1_3_1_39_2","unstructured":"Xilinx. 2020. Alveo U200 and U250 Data Center Accelerator Cards Data Sheet. (2020). https:\/\/www.xilinx.com\/support\/documentation\/data_sheets\/ds962-u200-u250.pdfLast accessed July 28 2020."},{"key":"e_1_3_1_40_2","unstructured":"Xilinx. 2020. Alveo U280 Data Center Accelerator Cards Data Sheet. (2020). Retrieved July 28 2020 from https:\/\/www.xilinx.com\/support\/documentation\/data_sheets\/ds963-u280.pdf"},{"key":"e_1_3_1_41_2","unstructured":"Xilinx. 2020. Vitis Unified Software Platform. (2020). Retrieved July 28 2020 from https:\/\/www.xilinx.com\/products\/design-tools\/vitis\/vitis-platform.html#development"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2010.5447837"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/2807591.2807601"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2017.2785257"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/2554688.2554775"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3616873","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3616873","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:49:13Z","timestamp":1750286953000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3616873"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,5]]},"references-count":44,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,12,31]]}},"alternative-id":["10.1145\/3616873"],"URL":"https:\/\/doi.org\/10.1145\/3616873","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,12,5]]},"assertion":[{"value":"2023-03-14","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-08-14","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-12-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}