{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,16]],"date-time":"2026-01-16T00:21:22Z","timestamp":1768522882454,"version":"3.49.0"},"reference-count":44,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,1,24]],"date-time":"2023-01-24T00:00:00Z","timestamp":1674518400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NSF","award":["CNS-1718160"],"award-info":[{"award-number":["CNS-1718160"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2023,3,31]]},"abstract":"<jats:p>Post-Moore\u2019s law area-constrained systems rely on accelerators to deliver performance enhancements. Coarse-grained accelerators can offer substantial domain acceleration, but manual, ad hoc identification of code to accelerate is prohibitively expensive. Because cycle-accurate simulators and high-level synthesis (HLS) flows are so time-consuming, the manual creation of high-utilization accelerators that exploit control and data flow patterns at optimal granularities is rarely successful. To address these challenges, we present AccelMerger, the first automated methodology to create coarse-grained, control- and data-flow-rich merged accelerators. AccelMerger uses sequence alignment matching to recognize similar function call-graphs and loops, and neural networks to quickly evaluate their post-HLS characteristics. It accurately identifies which functions to accelerate, and it merges accelerators to respect an area budget and to accommodate system communication characteristics like latency and bandwidth. Merging two accelerators can save as much as 99% of the area of one. The space saved is used by a globally optimal integer linear program to allocate more accelerators for increased performance. We demonstrate AccelMerger\u2019s effectiveness using HLS flows without any manual effort to fine-tune the resulting designs. On FPGA-based systems, AccelMerger yields application performance improvements of up to 16.7\u00d7 over software implementations, and 1.91\u00d7 on average with respect to state-of-the-art early-stage design space exploration tools.<\/jats:p>","DOI":"10.1145\/3546070","type":"journal-article","created":{"date-parts":[[2022,7,18]],"date-time":"2022-07-18T12:16:56Z","timestamp":1658146616000},"page":"1-29","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Early DSE and Automatic Generation of Coarse-grained Merged Accelerators"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0403-856X","authenticated-orcid":false,"given":"Iulian","family":"Brumar","sequence":"first","affiliation":[{"name":"Harvard University, Cambridge, MA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6644-5200","authenticated-orcid":false,"given":"Georgios","family":"Zacharopoulos","sequence":"additional","affiliation":[{"name":"Harvard University, Cambridge, MA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7479-9263","authenticated-orcid":false,"given":"Yuan","family":"Yao","sequence":"additional","affiliation":[{"name":"Harvard University, Cambridge, MA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9528-5966","authenticated-orcid":false,"given":"Saketh","family":"Rama","sequence":"additional","affiliation":[{"name":"Harvard University, Cambridge, MA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0662-7889","authenticated-orcid":false,"given":"David","family":"Brooks","sequence":"additional","affiliation":[{"name":"Harvard University, Cambridge, MA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5730-9904","authenticated-orcid":false,"given":"Gu-Yeon","family":"Wei","sequence":"additional","affiliation":[{"name":"Harvard University, Cambridge, MA, USA"}]}],"member":"320","published-online":{"date-parts":[[2023,1,24]]},"reference":[{"key":"e_1_3_3_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/390013.808479"},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/996566.996679"},{"key":"e_1_3_3_4_2","unstructured":"Cadence. 2016. Stratus High-Level Synthesis. https:\/\/www.cadence.com\/en_US\/home\/tools\/digital-design-and-signoff\/synthesis\/stratus-high-level-synthesis.html."},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/2514740"},{"key":"e_1_3_3_6_2","volume-title":"Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201914)","author":"Chen Tianshi","year":"2014","unstructured":"Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201914)."},{"key":"e_1_3_3_7_2","volume-title":"International Symposium on Field-Programmable Gate Arrays (FPGA\u201908)","author":"Cong Jason","year":"2008","unstructured":"Jason Cong and Wei Jiang. 2008. Pattern-based behavior synthesis for FPGA resource reduction. In International Symposium on Field-Programmable Gate Arrays (FPGA\u201908)."},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2017.24"},{"key":"e_1_3_3_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/115372.115320"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3385412.3385983"},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/1186736.1186737"},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3192366.3192379"},{"key":"e_1_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178493"},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/2925426.2926269"},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3289602.3293910"},{"key":"e_1_3_3_16_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2008.06.003"},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2004.1281665"},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/2786763.2694358"},{"key":"e_1_3_3_19_2","volume-title":"2016 ACM\/SIGDA International Symposium On Field-Programmable Gate Arrays (FPGA\u201916)","author":"Liu Xinheng","year":"2016","unstructured":"Xinheng Liu, Yao Chen, Tan Nguyen, Swathi Gurumani, Kyle Rupnow, and Deming Chen. 2016. High level synthesis of complex applications: An h. 264 video decoder. In 2016 ACM\/SIGDA International Symposium On Field-Programmable Gate Arrays (FPGA\u201916)."},{"key":"e_1_3_3_20_2","unstructured":"Hans Mittelmann. 2022. http:\/\/plato.asu.edu\/ftp\/lpsimp.html."},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/217474.217560"},{"key":"e_1_3_3_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2005.850844"},{"key":"e_1_3_3_23_2","doi-asserted-by":"publisher","DOI":"10.1016\/0022-2836(70)90057-4"},{"key":"e_1_3_3_24_2","unstructured":"Andrew Ng. 2017. Machine learning yearning. http:\/\/www.mlyearning.org\/(96). 139 (2017)."},{"key":"e_1_3_3_25_2","first-page":"2011","article-title":"Bambu: A free framework for the high-level synthesis of complex applications","volume":"29","author":"Pilato Christian","year":"2012","unstructured":"Christian Pilato and Fabrizio Ferrandi. 2012. Bambu: A free framework for the high-level synthesis of complex applications. University Booth of DATE 29 (2012), 2011.","journal-title":"University Booth of DATE"},{"key":"e_1_3_3_26_2","doi-asserted-by":"crossref","first-page":"110","DOI":"10.1109\/IISWC.2014.6983050","volume-title":"2014 IEEE International Symposium on Workload Characterization (IISWC\u201914)","author":"Reagen Brandon","year":"2014","unstructured":"Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David Brooks. 2014. Machsuite: Benchmarks for accelerator design and customized architectures. In 2014 IEEE International Symposium on Workload Characterization (IISWC\u201914). IEEE, 110\u2013119."},{"key":"e_1_3_3_27_2","volume-title":"Programming Language Design and Implementation (PLDI\u201920)","author":"Rocha Rodrigo C. O.","year":"2020","unstructured":"Rodrigo C. O. Rocha, Pavlos Petoumenos, Zheng Wang, Murray Cole, and Hugh Leather. 2020. Effective function merging in the SSA form. In Programming Language Design and Implementation (PLDI\u201920)."},{"key":"e_1_3_3_28_2","volume-title":"2019 International Symposium on Code Generation and Optimization (CGO\u201919)","author":"Rocha R. C. O.","year":"2019","unstructured":"R. C. O. Rocha, P. Petoumenos, Z. Wang, M. Cole, and H. Leather. 2019. Function merging by sequence alignment. In 2019 International Symposium on Code Generation and Optimization (CGO\u201919)."},{"key":"e_1_3_3_29_2","doi-asserted-by":"crossref","first-page":"471","DOI":"10.1109\/MICRO50266.2020.00047","volume-title":"2020 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201920)","author":"Rogers Samuel","year":"2020","unstructured":"Samuel Rogers, Joshua Slycord, Mohammadreza Baharani, and Hamed Tabkhi. 2020. gem5-SALAM: A system architecture for LLVM-based accelerator modeling. In 2020 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201920). IEEE, 471\u2013482."},{"key":"e_1_3_3_30_2","volume-title":"2014 ACM\/IEEE Annual International Symposium on Computer Architecture (ISCA\u201914)","author":"Shao Yakun Sophia","year":"2014","unstructured":"Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures. In 2014 ACM\/IEEE Annual International Symposium on Computer Architecture (ISCA\u201914)."},{"key":"e_1_3_3_31_2","first-page":"1","volume-title":"2016 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)","author":"Shao Yakun Sophia","year":"2016","unstructured":"Yakun Sophia Shao, Sam Likun Xi, Vijayalakshmi Srinivasan, Gu-Yeon Wei, and David Brooks. 2016. Co-designing accelerators and soc interfaces using gem5-aladdin. In 2016 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916). IEEE, 1\u201312."},{"key":"e_1_3_3_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3494534"},{"key":"e_1_3_3_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2018.2886342"},{"key":"e_1_3_3_34_2","volume-title":"2018 ACM\/IEEE Annual International Symposium on Computer Architecture (ISCA\u201918)","author":"Tan Cheng","year":"2018","unstructured":"Cheng Tan, Manupa Karunaratne, Tulika Mitra, and Li-Shiuan Peh. 2018. Stitch: Fusible heterogeneous accelerators enmeshed with many-core architecture for wearables. In 2018 ACM\/IEEE Annual International Symposium on Computer Architecture (ISCA\u201918)."},{"key":"e_1_3_3_35_2","unstructured":"T\u00falio A. M. Toffolo and Haroldo G. Santos. 2022. MIP. https:\/\/python-mip.readthedocs.io\/en\/latest\/intro.html."},{"key":"e_1_3_3_36_2","volume-title":"Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201910)","author":"Venkatesh Ganesh","year":"2010","unstructured":"Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, and Michael Bedford Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201910)."},{"key":"e_1_3_3_37_2","volume-title":"IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201911)","author":"Venkatesh Ganesh","year":"2011","unstructured":"Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading dark silicon for scalable energy efficiency with quasi-specific cores. In IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201911)."},{"key":"e_1_3_3_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/216585.216588"},{"key":"e_1_3_3_39_2","unstructured":"Xilinx. 2021. Vivado High-Level Synthesis. www.xilinx.com\/products\/design-tools\/vivado\/integration\/esl-design.html."},{"key":"e_1_3_3_40_2","unstructured":"Xilinx. 2021. Xilinx All Programmable SoC Portfolio. https:\/\/www.xilinx.com\/support\/documentation\/data_sheets\/ds190-Zynq-7000-Overview.pdf."},{"key":"e_1_3_3_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3410463.3414670"},{"key":"e_1_3_3_42_2","article-title":"Trireme: Exploring hierarchical multi-level parallelism for domain specific hardware acceleration","author":"Zacharopoulos Georgios","year":"2022","unstructured":"Georgios Zacharopoulos, Adel Ejjeh, Ying Jing, En-Yu Yang, Tianyu Jia, Iulian Brumar, Jeremy Intan, Muhammad Huzaifa, Sarita Adve, Vikram Adve, et\u00a0al. 2022. Trireme: Exploring hierarchical multi-level parallelism for domain specific hardware acceleration. arXiv preprint arXiv:2201.08603 (2022).","journal-title":"arXiv preprint arXiv:2201.08603"},{"key":"e_1_3_3_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD46524.2019.00024"},{"key":"e_1_3_3_44_2","article-title":"RegionSeeker: Automatically identifying and selecting accelerators from application source code","author":"Zacharopoulos Georgios","year":"2018","unstructured":"Georgios Zacharopoulos, Lorenzo Ferretti, Emanuele Giaquinta, Giovanni Ansaloni, and Laura Pozzi. 2018. RegionSeeker: Automatically identifying and selecting accelerators from application source code. IEEE Transactions on Computer-Aided Design Of Integrated Circuits and Systems (TCAD) (2018), 1\u20136.","journal-title":"IEEE Transactions on Computer-Aided Design Of Integrated Circuits and Systems (TCAD)"},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3061639.3062195"}],"container-title":["ACM Transactions on Embedded Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3546070","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3546070","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:30:19Z","timestamp":1750188619000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3546070"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,24]]},"references-count":44,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,3,31]]}},"alternative-id":["10.1145\/3546070"],"URL":"https:\/\/doi.org\/10.1145\/3546070","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"value":"1539-9087","type":"print"},{"value":"1558-3465","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,1,24]]},"assertion":[{"value":"2021-10-15","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-06-05","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-01-24","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}