{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T16:25:08Z","timestamp":1774628708302,"version":"3.50.1"},"reference-count":33,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2024,4,30]],"date-time":"2024-04-30T00:00:00Z","timestamp":1714435200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"European Union's Horizon 2020 Research and Innovation Program under the Marie Sk\u0142odowska Curie","award":["956090"],"award-info":[{"award-number":["956090"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2024,6,30]]},"abstract":"<jats:p>High-fan-in dot product computations are ubiquitous in highly relevant application domains, such as signal processing and machine learning. Particularly, the diverse set of data formats used in machine learning poses a challenge for flexible efficient design solutions. Ideally, a dot product summation is composed from a carry-free compressor tree followed by a terminal carry-propagate addition. On FPGA, these compressor trees are constructed from generalized parallel counters whose architecture is closely tied to the underlying reconfigurable fabric. This work reviews known counter designs and proposes new ones in the context of the new AMD Versal\u2122 fabric. On this basis, we develop a compressor generator featuring variable-sized counters, novel counter composition heuristics, explicit clustering strategies, and case-specific optimizations like logic gate absorption. 
In comparison to the Vivado\u2122 default implementation, the combination of such a compressor with a novel, highly efficient quaternary adder reduces the LUT footprint across different bit matrix input shapes by 45% for a plain summation and by 46% for a terminal accumulation at a slight cost in critical path delay still allowing an operation well above 500\u00a0MHz. We demonstrate the aptness of our solution at examples of low-precision integer dot product accumulation units.<\/jats:p>","DOI":"10.1145\/3645097","type":"journal-article","created":{"date-parts":[[2024,2,10]],"date-time":"2024-02-10T10:18:51Z","timestamp":1707560331000},"page":"1-32","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["High-efficiency Compressor Trees for Latest AMD FPGAs"],"prefix":"10.1145","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-9542-3317","authenticated-orcid":false,"given":"Konstantin","family":"Ho\u00dffeld","sequence":"first","affiliation":[{"name":"AMD Research, Dresden, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8409-0282","authenticated-orcid":false,"given":"Hans Jakob","family":"Damsgaard","sequence":"additional","affiliation":[{"name":"Tampere University, Tampere, Finland and AMD Research, Dresden, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2169-4606","authenticated-orcid":false,"given":"Jari","family":"Nurmi","sequence":"additional","affiliation":[{"name":"Tampere University, Tampere, Finland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7833-4057","authenticated-orcid":false,"given":"Michaela","family":"Blott","sequence":"additional","affiliation":[{"name":"AMD Research, Dublin, Ireland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3998-7896","authenticated-orcid":false,"given":"Thomas B.","family":"Preu\u00dfer","sequence":"additional","affiliation":[{"name":"AMD Research, Dresden, 
Germany"}]}],"member":"320","published-online":{"date-parts":[[2024,4,30]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"AMD Inc. 2023. Versal ACAP Configurable Logic Block. Retrieved from https:\/\/docs.xilinx.com\/r\/en-US\/am005-versal-clb"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/2228360.2228584"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3242897"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/3506713"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2013.6645544"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2023.3238128"},{"key":"e_1_3_1_8_2","first-page":"349","article-title":"Some schemes for parallel multipliers","volume":"34","author":"Dadda L.","year":"1965","unstructured":"L. Dadda. 1965. Some schemes for parallel multipliers. Alta Frequenza 34 (1965), 349\u2013356.","journal-title":"Alta Frequenza"},{"key":"e_1_3_1_9_2","unstructured":"F. de Dinechin. FloPoCo Project Website. (n.d.). Retrieved from http:\/\/flopoco.gforge.inria.fr"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1088\/1748-0221\/13\/07\/P07027"},{"key":"e_1_3_1_11_2","first-page":"84","volume-title":"Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201921)","author":"Hojabr Reza","year":"2021","unstructured":"Reza Hojabr, Ali Sedaghati, Amirali Sharifian, Ahmad Khonsari, and Arrvindh Shriraman. 2021. SPAGHETTI: Streaming accelerators for highly sparse GEMM on FPGAs. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201921). IEEE, 84\u201396."},{"key":"e_1_3_1_12_2","first-page":"1404","volume-title":"Proceedings of the Design, Automation & Test in Europe Conference and Exhibition (DATE\u201921)","author":"Kerner Madis","year":"2021","unstructured":"Madis Kerner, Kalle Tammem\u00e4e, Jaan Raik, and Thomas Hollstein. 2021. 
Triple fixed-point MAC unit for deep learning. In Proceedings of the Design, Automation & Test in Europe Conference and Exhibition (DATE\u201921). IEEE, 1404\u20131407."},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2018.2795611"},{"key":"e_1_3_1_14_2","volume-title":"Proceedings of the Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (MBMV\u201914)","author":"Kumm Martin","year":"2014","unstructured":"Martin Kumm and Peter Zipf. 2014. Efficient high speed compression trees on Xilinx FPGAs. In Proceedings of the Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (MBMV\u201914), Matteo Michel J\u00fcrgen Ruf, Dirk Allmendinger (Ed.). Cuvillier Verlag."},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2014.6927468"},{"key":"e_1_3_1_16_2","first-page":"111","volume-title":"Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays","author":"Oguntebi Tayo","year":"2016","unstructured":"Tayo Oguntebi and Kunle Olukotun. 2016. Graphops: A dataflow library for graph analytics acceleration. In Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays. 
ACM, 111\u2013117."},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASPDAC.2008.4483927"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2009.5272301"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/2068716.2068725"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.23919\/FPL.2017.8056834"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3270764"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46493-0_32"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3337929"},{"key":"e_1_3_1_24_2","article-title":"FP8 versus INT8 for efficient deep learning inference","author":"Baalen Mart van","year":"2023","unstructured":"Mart van Baalen, Andrey Kuzmin, Suparna S. Nair, Yuwei Ren, Eric Mahurin, Chirag Patel, Sundar Subramanian, Sanghyuk Lee, Markus Nagel, Joseph Soriaga et\u00a0al. 2023. FP8 versus INT8 for efficient deep learning inference. Retrieved from https:\/\/arXiv:2303.17951","journal-title":"R"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3546182"},{"key":"e_1_3_1_26_2","first-page":"1","volume-title":"Proceedings of the 27th International Conference on Field Programmable Logic and Applications (FPL\u201917)","author":"V\u00e9stias M\u00e1rio P.","year":"2017","unstructured":"M\u00e1rio P. V\u00e9stias, Rui Policarpo Duarte, Jos\u00e9 T. de Sousa, and Hor\u00e1cio Neto. 2017. Parallel dot-products for deep learning on FPGA. In Proceedings of the 27th International Conference on Field Programmable Logic and Applications (FPL\u201917). IEEE, 1\u20134."},{"key":"e_1_3_1_27_2","first-page":"350","volume-title":"Proceedings of the 29th International Conference on Field Programmable Logic and Applications (FPL\u201919)","author":"V\u00e9stias M\u00e1rio P.","year":"2019","unstructured":"M\u00e1rio P. V\u00e9stias, Rui Policarpo Duarte, Jos\u00e9 T. de Sousa, and Hor\u00e1cio Neto. 2019. 
Hybrid dot-product calculation for convolutional neural networks in FPGA. In Proceedings of the 29th International Conference on Field Programmable Logic and Applications (FPL\u201919). IEEE, 350\u2013353."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/PGEC.1964.263830"},{"key":"e_1_3_1_29_2","doi-asserted-by":"crossref","first-page":"763","DOI":"10.1109\/ASPDAC.2012.6165057","volume-title":"Proceedings of the 17th Asia and South Pacific Design Automation Conference","author":"Whatmough Paul N.","year":"2012","unstructured":"Paul N. Whatmough, Shidhartha Das, David M. Bull, and Izzat Darzaweh. 2012. Selective time borrowing for DSP pipelines with hybrid voltage control loop. In Proceedings of the 17th Asia and South Pacific Design Automation Conference. IEEE, 763\u2013768."},{"key":"e_1_3_1_30_2","unstructured":"Xilinx Inc. 2016. 7 Series FPGAs Configurable Logic Block. Xilinx Inc. Retrieved from https:\/\/docs.xilinx.com\/v\/u\/en-US\/ug474_7Series_CLB."},{"key":"e_1_3_1_31_2","unstructured":"Xilinx Inc. 2017. UltraScale Architecture Configurable Logic Block. Xilinx Inc. Retrieved from https:\/\/docs.xilinx.com\/v\/u\/en-US\/ug574-ultrascale-clb"},{"key":"e_1_3_1_32_2","first-page":"27168","article-title":"ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers","volume":"35","author":"Yao Zhewei","year":"2023","unstructured":"Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. 2023. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. Adv. Neural Info. Process. Syst. 35 (2023), 27168\u201327183.","journal-title":"Adv. Neural Info. Process. 
Syst."},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2941985"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPADS51040.2020.00026"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3645097","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3645097","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:03:27Z","timestamp":1750291407000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3645097"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,4,30]]},"references-count":33,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,6,30]]}},"alternative-id":["10.1145\/3645097"],"URL":"https:\/\/doi.org\/10.1145\/3645097","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,4,30]]},"assertion":[{"value":"2023-09-12","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-27","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-04-30","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}