{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T13:38:02Z","timestamp":1767965882204,"version":"3.49.0"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,1,15]],"date-time":"2024-01-15T00:00:00Z","timestamp":1705276800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NSF","award":["1652866"],"award-info":[{"award-number":["1652866"]}]},{"name":"Intel ISRA program on FPGA"},{"name":"Intel\/VMware Crossroads 3D-FPGA Academic Research Center"},{"name":"Intel\/NSERC Industrial Research Chair in Programmable Silicon"},{"name":"Vector Institute for Artificial Intelligence"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2024,3,31]]},"abstract":"<jats:p>Object detection and classification is a key task in many computer vision applications such as smart surveillance and autonomous vehicles. Recent advances in deep learning have significantly improved the quality of results achieved by these systems, making them more accurate and reliable in complex environments. Modern object detection systems make use of lightweight convolutional neural networks (CNNs) for feature extraction, coupled with single-shot multi-box detectors (SSDs) that generate bounding boxes around the identified objects along with their classification confidence scores. Subsequently, a non-maximum suppression (NMS) module removes any redundant detection boxes from the final output. Typical NMS algorithms must wait for all box predictions to be generated by the SSD-based feature extractor before processing them. This sequential dependency between box predictions and NMS results in a significant latency overhead and degrades the overall system throughput, even if a high-performance CNN accelerator is used for the SSD feature extraction component. In this paper, we present a novel pipelined NMS algorithm that eliminates this sequential dependency and associated NMS latency overhead. We then use our novel NMS algorithm to implement an end-to-end fully pipelined FPGA system for low-latency SSD-MobileNet-V1 object detection. Our system, implemented on an Intel Stratix 10 FPGA, runs at 400 MHz and achieves a throughput of 2,167 frames per second with an end-to-end batch-1 latency of 2.13 ms. Our system achieves 5.3\u00d7 higher throughput and 5\u00d7 lower latency compared to the best prior FPGA-based solution with comparable accuracy.<\/jats:p>","DOI":"10.1145\/3634919","type":"journal-article","created":{"date-parts":[[2023,12,4]],"date-time":"2023-12-04T11:50:52Z","timestamp":1701690652000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["High Throughput FPGA-Based Object Detection via Algorithm-Hardware Co-Design"],"prefix":"10.1145","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4991-188X","authenticated-orcid":false,"given":"Anupreetham","family":"Anupreetham","sequence":"first","affiliation":[{"name":"Arizona State University, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-8930-0692","authenticated-orcid":false,"given":"Mohamed","family":"Ibrahim","sequence":"additional","affiliation":[{"name":"University of Toronto, Intel Corporation, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2134-8247","authenticated-orcid":false,"given":"Mathew","family":"Hall","sequence":"additional","affiliation":[{"name":"University of Toronto, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8044-1644","authenticated-orcid":false,"given":"Andrew","family":"Boutros","sequence":"additional","affiliation":[{"name":"University of Toronto, Vector Institute for AI, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-6780-6451","authenticated-orcid":false,"given":"Ajay","family":"Kuzhively","sequence":"additional","affiliation":[{"name":"Arizona State University, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-9524-0366","authenticated-orcid":false,"given":"Abinash","family":"Mohanty","sequence":"additional","affiliation":[{"name":"Arizona State University, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2347-9590","authenticated-orcid":false,"given":"Eriko","family":"Nurvitadhi","sequence":"additional","affiliation":[{"name":"Intel Corporation, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0528-6493","authenticated-orcid":false,"given":"Vaughn","family":"Betz","sequence":"additional","affiliation":[{"name":"University of Toronto, Vector Institute for AI, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5689-0768","authenticated-orcid":false,"given":"Yu","family":"Cao","sequence":"additional","affiliation":[{"name":"Arizona State University, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4551-7789","authenticated-orcid":false,"given":"Jae-Sun","family":"Seo","sequence":"additional","affiliation":[{"name":"Arizona State University, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,1,15]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"crossref","first-page":"411","DOI":"10.1109\/FPL.2018.00077","volume-title":"2018 28th International Conference on Field Programmable Logic and Applications (FPL)","author":"Abdelfattah Mohamed S.","year":"2018","unstructured":"Mohamed S. Abdelfattah, David Han, Andrew Bitar, Roberto DiCecco, Shane O\u2019Connell, Nitika Shanker, Joseph Chu, Ian Prins, Joshua Fender, Andrew C. Ling, and Gordon R. Chiu. 2018. DLA: Compiler and FPGA overlay for neural network inference acceleration. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL). 411\u20134117. 10.1109\/FPL.2018.00077"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL53798.2021.00021"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/MCAS.2021.3071607"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICFPT51103.2020.00011"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3242898"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSICT49897.2020.9278177"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.195"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPT.2018.00014"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00012"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.81"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICFPT51103.2020.00017"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_14_2","article-title":"MobileNets: Efficient convolutional neural networks for mobile vision applications","volume":"1704","author":"Howard Andrew G.","year":"2017","unstructured":"Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR abs\/1704.04861 (2017). arXiv:1704.04861http:\/\/arxiv.org\/abs\/1704.04861","journal-title":"CoRR"},{"key":"e_1_3_1_15_2","volume-title":"Extending Data Flow Architectures for Convolutional Neural Networks to Object Detection and Multiple FPGAs","author":"Ibrahim Mohamed","year":"2022","unstructured":"Mohamed Ibrahim and Vaughn Betz. 2022. Extending Data Flow Architectures for Convolutional Neural Networks to Object Detection and Multiple FPGAs. Master\u2019s thesis. The University of Toronto. https:\/\/tspace.library.utoronto.ca\/handle\/1807\/123335"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2939201"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2019.00034"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439293"},{"key":"e_1_3_1_19_2","article-title":"Microsoft COCO: Common objects in context","volume":"1405","author":"Lin Tsung-Yi","year":"2014","unstructured":"Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. CoRR abs\/1405.0312 (2014). arXiv:1405.0312http:\/\/arxiv.org\/abs\/1405.0312","journal-title":"CoRR"},{"key":"e_1_3_1_20_2","article-title":"SSD: Single shot multibox detector","volume":"1512","author":"Liu Wei","year":"2015","unstructured":"Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. 2015. SSD: Single shot multibox detector. CoRR abs\/1512.02325 (2015). arxiv:1512.02325http:\/\/arxiv.org\/abs\/1512.02325","journal-title":"CoRR"},{"key":"e_1_3_1_21_2","volume-title":"IEEE International Conference on Computer-Aided Design (ICCAD)","author":"Ma Yufei","year":"2018","unstructured":"Yufei Ma, Tu Zheng, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. 2018. Algorithm-hardware co-design of single shot detector for fast object detection on FPGAs. In IEEE International Conference on Computer-Aided Design (ICCAD)."},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL53798.2021.00010"},{"key":"e_1_3_1_23_2","volume-title":"2018 ACM\/IEEE 45th Annual International Symposium on Computer Architecture (ISCA)","year":"2019","unstructured":"NVIDIA. 2019. NVIDIA Tesla deep learning product performance. In 2018 ACM\/IEEE 45th Annual International Symposium on Computer Architecture (ISCA)."},{"key":"e_1_3_1_24_2","article-title":"MLPerf inference benchmark","volume":"1911","author":"Reddi Vijay Janapa","year":"2019","unstructured":"Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, Ramesh Chukka, Cody Coleman, Sam Davis, Pan Deng, Greg Diamos, Jared Duke, Dave Fick, J. Scott Gardner, Itay Hubara, Sachin Idgunji, Thomas B. Jablin, Jeff Jiao, Tom St. John, Pankaj Kanwar, David Lee, Jeffery Liao, Anton Lokhmotov, Francisco Massa, Peng Meng, Paulius Micikevicius, Colin Osborne, Gennady Pekhimenk, Arun Tejusve Raghunath Rajan, Dilip Sequeira, Ashish Sirasao, Fei Sun, Hanlin Tang, Michael Thomson, Frank Wei, Ephrem Wu, Lingjie Xu, Koichi Yamada, Bing Yu, George Yuan, Aaron Zhong, Peizhao Zhang, and Yuchen Zhou. 2019. MLPerf inference benchmark. CoRR abs\/1911.02549 (2019). arXiv:1911.02549http:\/\/arxiv.org\/abs\/1911.02549","journal-title":"CoRR"},{"key":"e_1_3_1_25_2","article-title":"You only look once: Unified, real-time object detection","volume":"1506","author":"Redmon Joseph","year":"2015","unstructured":"Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. 2015. You only look once: Unified, real-time object detection. CoRR abs\/1506.02640 (2015). arXiv:1506.02640http:\/\/arxiv.org\/abs\/1506.02640","journal-title":"CoRR"},{"key":"e_1_3_1_26_2","article-title":"YOLOv3: An incremental improvement","volume":"1804","author":"Redmon Joseph","year":"2018","unstructured":"Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An incremental improvement. CoRR abs\/1804.02767 (2018). arxiv:1804.02767http:\/\/arxiv.org\/abs\/1804.02767","journal-title":"CoRR"},{"key":"e_1_3_1_27_2","article-title":"Faster R-CNN: Towards real-time object detection with region proposal networks","volume":"1506","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR abs\/1506.01497 (2015). arXiv:1506.01497http:\/\/arxiv.org\/abs\/1506.01497","journal-title":"CoRR"},{"key":"e_1_3_1_28_2","article-title":"Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation","volume":"1801","author":"Sandler Mark","year":"2018","unstructured":"Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR abs\/1801.04381 (2018). arXiv:1801.04381http:\/\/arxiv.org\/abs\/1801.04381","journal-title":"CoRR"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSII.2019.2893527"},{"key":"e_1_3_1_30_2","first-page":"1","volume-title":"International Conference on Field-Programmable Technology (FPT)","author":"Stan Marius","year":"2022","unstructured":"Marius Stan, Mathew Hall, Mohamed Ibrahim, and Vaughn Betz. 2022. HPIPE NX: Boosting CNN inference acceleration performance with AI-optimized FPGAs. In International Conference on Field-Programmable Technology (FPT). IEEE, 1\u20139."},{"key":"e_1_3_1_31_2","article-title":"Towards closing the energy gap between HOG and CNN features for embedded vision","volume":"1703","author":"Suleiman Amr","year":"2017","unstructured":"Amr Suleiman, Yu-Hsin Chen, Joel S. Emer, and Vivienne Sze. 2017. Towards closing the energy gap between HOG and CNN features for embedded vision. CoRR abs\/1703.05853 (2017). arXiv:1703.05853http:\/\/arxiv.org\/abs\/1703.05853","journal-title":"CoRR"},{"key":"e_1_3_1_32_2","unstructured":"Mingxing Tan Bo Chen Ruoming Pang Vijay Vasudevan Mark Sandler Andrew Howard and Quoc V. Le. 2019. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition . 2820\u20132828."},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.3004198"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2019.00030"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISVLSI49217.2020.00089"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/1486\/2\/022045"},{"key":"e_1_3_1_37_2","article-title":"Object detection in 20 years: A survey","volume":"1905","author":"Zou Zhengxia","year":"2019","unstructured":"Zhengxia Zou, Zhenwei Shi, Yuhong Guo, and Jieping Ye. 2019. Object detection in 20 years: A survey. CoRR abs\/1905.05055 (2019). arXiv:1905.05055http:\/\/arxiv.org\/abs\/1905.05055","journal-title":"CoRR"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3634919","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3634919","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:51:07Z","timestamp":1750287067000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3634919"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,15]]},"references-count":36,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,3,31]]}},"alternative-id":["10.1145\/3634919"],"URL":"https:\/\/doi.org\/10.1145\/3634919","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,15]]},"assertion":[{"value":"2023-05-27","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-11-16","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}