{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,20]],"date-time":"2025-08-20T13:00:01Z","timestamp":1755694801722,"version":"3.41.0"},"reference-count":57,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,3,1]],"date-time":"2023-03-01T00:00:00Z","timestamp":1677628800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"DARPA\u2019s DSSoC","award":["FA8650-18-2-7861"],"award-info":[{"award-number":["FA8650-18-2-7861"]}]},{"name":"Stanford AHA Center"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,6,30]]},"abstract":"<jats:p>\n            Image processing and machine learning applications benefit tremendously from hardware acceleration. Existing compilers target either FPGAs, which sacrifice power and performance for programmability, or ASICs, which become obsolete as applications change. Programmable domain-specific accelerators, such as coarse-grained reconfigurable arrays (CGRAs), have emerged as a promising middle-ground, but they have traditionally been difficult compiler targets since they use a different memory abstraction. In contrast to CPUs and GPUs, the memory hierarchies of domain-specific accelerators use\n            <jats:italic>push memories<\/jats:italic>\n            : memories that send input data streams to computation kernels or to higher or lower levels in the memory hierarchy and store the resulting output data streams. 
To address the compilation challenge caused by push memories, we propose that the representation of these memories in the compiler be altered to directly represent them by combining storage with address generation and control logic in a single structure\u2014a unified buffer.\n          <\/jats:p>\n          <jats:p>The unified buffer abstraction enables the compiler to separate generic push memory optimizations from the mapping to specific memory implementations in the backend. This separation allows our compiler to map high-level Halide applications to different CGRA memory designs, including some with a ready-valid interface. The separation also opens the opportunity for optimizing push memory elements on reconfigurable arrays. Our optimized memory implementation, the Physical Unified Buffer, uses a wide-fetch, single-port SRAM macro with built-in address generation logic to implement a buffer with two read and two write ports. It is 18% smaller and consumes 31% less energy than a physical buffer implementation using a dual-port memory that only supports two ports.<\/jats:p>\n          <jats:p>Finally, our system evaluation shows that enabling a compiler to support CGRAs leads to performance and energy benefits. 
Over a wide range of image processing and machine learning applications, our CGRA achieves 4.7\u00d7 better runtime and 3.5\u00d7 better energy-efficiency compared to an FPGA.<\/jats:p>","DOI":"10.1145\/3572908","type":"journal-article","created":{"date-parts":[[2022,11,29]],"date-time":"2022-11-29T12:05:35Z","timestamp":1669723535000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":15,"title":["Unified Buffer: Compiling Image Processing and Machine Learning Applications to Push-Memory Accelerators"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1083-9953","authenticated-orcid":false,"given":"Qiaoyi","family":"Liu","sequence":"first","affiliation":[{"name":"Stanford University, Stanford, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2327-646X","authenticated-orcid":false,"given":"Jeff","family":"Setter","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9055-3490","authenticated-orcid":false,"given":"Dillon","family":"Huff","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5945-1349","authenticated-orcid":false,"given":"Maxwell","family":"Strange","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9860-4942","authenticated-orcid":false,"given":"Kathleen","family":"Feng","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3245-7542","authenticated-orcid":false,"given":"Mark","family":"Horowitz","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA, 
USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8834-8663","authenticated-orcid":false,"given":"Priyanka","family":"Raina","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2267-903X","authenticated-orcid":false,"given":"Fredrik","family":"Kjolstad","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA, USA"}]}],"member":"320","published-online":{"date-parts":[[2023,3]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/3306346.3322967"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2012.2207748"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/1375581.1375595"},{"key":"e_1_3_2_5_2","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1145\/1950413.1950423","volume-title":"Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA\u201911)","author":"Canis Andrew","year":"2011","unstructured":"Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H. Anderson, Stephen Brown, and Tomasz Czajkowski. 2011. LegUp: High-level synthesis for FPGA-based processor\/accelerator systems. In Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA\u201911). 
Association for Computing Machinery, New York, NY, 33\u201336."},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2016.2616357"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240765.3240850"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/2967938.2967969"},{"key":"e_1_3_2_9_2","first-page":"408","volume-title":"Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI\u201920)","author":"Durst David","year":"2020","unstructured":"David Durst, Matthew Feldman, Dillon Huff, David Akeley, Ross Daly, Gilbert Louis Bernstein, Marco Patrignani, Kayvon Fatahalian, and Pat Hanrahan. 2020. Type-directed scheduling of streaming accelerators. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI\u201920). Association for Computing Machinery, New York, NY, 408\u2013422."},{"key":"e_1_3_2_10_2","doi-asserted-by":"crossref","first-page":"199","DOI":"10.1145\/3174243.3174251","volume-title":"Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA\u201918)","author":"Escobedo Juan","year":"2018","unstructured":"Juan Escobedo and Mingjie Lin. 2018. Graph-theoretically optimal memory banking for stencil-based computing kernels. In Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA\u201918). 
Association for Computing Machinery, New York, NY, 199\u2013208."},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2018.2797600"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF01407835"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.30"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/2601097.2601174"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/2897824.2925892"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM51124.2021.00030"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3519939.3523446"},{"key":"e_1_3_2_18_2","unstructured":"Intel Inc. 2022. Altera OpenCL. Retrieved from https:\/\/www.intel.com\/content\/www\/us\/en\/software\/programmable\/sdk-for-opencl\/overview.html."},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654889"},{"key":"e_1_3_2_20_2","first-page":"1","volume-title":"Proceedings of the ACM\/IEEE International Symposium on Computer Architecture (ISCA\u201917)","author":"Jouppi Norman","year":"2017","unstructured":"Norman Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et\u00a0al. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the ACM\/IEEE International Symposium on Computer Architecture (ISCA\u201917). Association for Computing Machinery, New York, NY, 1\u201312."},{"key":"e_1_3_2_21_2","doi-asserted-by":"crossref","first-page":"296","DOI":"10.1145\/3192366.3192379","volume-title":"Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI\u201918)","author":"Koeplinger David","year":"2018","unstructured":"David Koeplinger, Matthew Feldman, Raghu Prabhakar, Yaqi Zhang, Stefan Hadjis, Ruben Fiszel, Tian Zhao, Luigi Nardi, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. 2018. 
Spatial: A language and compiler for application accelerators. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI\u201918). Association for Computing Machinery, New York, NY, 296\u2013311."},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL53798.2021.00074"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3289602.3293910"},{"key":"e_1_3_2_24_2","first-page":"51","volume-title":"Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA\u201920)","author":"Li Jiajie","year":"2020","unstructured":"Jiajie Li, Yuze Chi, and Jason Cong. 2020. HeteroHalide: From image processing DSL to efficient FPGA acceleration. In Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA\u201920). Association for Computing Machinery, New York, NY, 51\u201357."},{"key":"e_1_3_2_25_2","unstructured":"Maxeler Inc. 2022. MaxCompiler. Retrieved from https:\/\/www.maxeler.com\/products\/software\/maxcompiler."},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10617-012-9096-8"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-45234-8_7"},{"volume-title":"Catapult Synthesis User and Reference Manual","year":"2019","key":"e_1_3_2_28_2","unstructured":"Mentor. 2019. Catapult Synthesis User and Reference Manual. Mentor, Wilsonville, OR."},{"key":"e_1_3_2_29_2","unstructured":"Mentor Graphics Inc. 2022. Catapult High Level Synthesis."},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPGA.1996.564808"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3229762.3229766"},{"key":"e_1_3_2_32_2","unstructured":"Thierry Moreau, Tianqi Chen, Ziheng Jiang, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. VTA: An open hardware-software stack for deep learning. arXiv:1807.04188. 
Retrieved from http:\/\/arxiv.org\/abs\/1807.04188."},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/2897824.2925952"},{"key":"e_1_3_2_34_2","first-page":"12","volume-title":"Proceedings of the IEEE International System-on-Chip Conference (SOCC\u201917)","author":"Nautiyal Vivek","year":"2017","unstructured":"Vivek Nautiyal, Gaurav Singla, Lalit Gupta, Sagar Dwivedi, and Martin Kinkade. 2017. An ultra high density pseudo dual-port SRAM in 16nm FINFET process for graphics processors. In Proceedings of the IEEE International System-on-Chip Conference (SOCC\u201917). IEEE, New York, NY, 12\u201317."},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3140659.3080255"},{"key":"e_1_3_2_36_2","volume-title":"Proceedings of the 11th International Workshop on Polyhedral Compilation Techniques (IMPACT\u201921)","author":"Parashar Angshuman","year":"2021","unstructured":"Angshuman Parashar, Prasanth Chatarasi, and Po-An Tsai. 2021. Hardware abstractions for targeting EDDO Architectures with the Polyhedral Model. In Proceedings of the 11th International Workshop on Polyhedral Compilation Techniques (IMPACT\u201921). HiPEAC."},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304025"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.3012084"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080256"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3107953"},{"issue":"6","key":"e_1_3_2_41_2","doi-asserted-by":"crossref","first-page":"519","DOI":"10.1145\/2499370.2462176","article-title":"Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines","volume":"48","author":"Ragan-Kelley Jonathan","year":"2013","unstructured":"Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Fr\u00e9do Durand, and Saman Amarasinghe. 2013. 
Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM Sigplan Not. 48, 6 (2013), 519\u2013530.","journal-title":"ACM Sigplan Not."},{"key":"e_1_3_2_42_2","first-page":"1","volume-title":"Proceedings of the International Conference on Hardware\/Software Codesign and System Synthesis (CODES+ISSS\u201914)","author":"Reiche Oliver","year":"2014","unstructured":"Oliver Reiche, Moritz Schmid, Frank Hannig, Richard Membarth, and J\u00fcrgen Teich. 2014. Code generation from a domain-specific language for C-based HLS of hardware accelerators. In Proceedings of the International Conference on Hardware\/Software Codesign and System Synthesis (CODES+ISSS\u201914). IEEE, New York, NY, 1\u201310."},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358302"},{"key":"e_1_3_2_44_2","first-page":"1","volume-title":"Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)","author":"Sharma Hardik","year":"2016","unstructured":"Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916). IEEE, New York, NY, 1\u201312."},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3494534"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00042"},{"key":"e_1_3_2_47_2","volume-title":"Evaluating Spatially Programmable Architecture for Imaging and Vision Applications","author":"Vasilyev Artem","year":"2019","unstructured":"Artem Vasilyev. 2019. Evaluating Spatially Programmable Architecture for Imaging and Vision Applications. Ph.D. 
Dissertation. Stanford University."},{"key":"e_1_3_2_48_2","first-page":"299","volume-title":"Proceedings of the International Congress of Mathematical Software (ICMS\u201910)","author":"Verdoolaege Sven","year":"2010","unstructured":"Sven Verdoolaege. 2010. ISL: An integer set library for the polyhedral model. In Proceedings of the International Congress of Mathematical Software (ICMS\u201910), Komei Fukuda, Joris van der Hoeven, Michael Joswig, and Nobuki Takayama (Eds.). Springer, Berlin, 299\u2013302."},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439292"},{"volume-title":"Vivado Design Suite User Guide High-Level Synthesis","year":"2019","key":"e_1_3_2_50_2","unstructured":"Xilinx. 2019. Vivado Design Suite User Guide High-Level Synthesis. Xilinx, San Jose, CA."},{"key":"e_1_3_2_51_2","unstructured":"Xilinx Inc. 2022. Vivado High Level Synthesis. Retrieved from https:\/\/www.xilinx.com\/products\/design-tools\/vivado\/integration\/esl-design.html."},{"key":"e_1_3_2_52_2","first-page":"369","volume-title":"Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201920)","author":"Yang Xuan","year":"2020","unstructured":"Xuan Yang, Mingyu Gao, Qiaoyi Liu, Jeff Setter, Jing Pu, Ankita Nayak, Steven Bell, Kaidi Cao, Heonjae Ha, Priyanka Raina, Christos Kozyrakis, and Mark Horowitz. 2020. Interstellar: Using Halide\u2019s scheduling language to analyze DNN accelerators. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201920). Association for Computing Machinery, New York, NY, 369\u2013383."},{"key":"e_1_3_2_53_2","unstructured":"Benjamin Ylvisaker, Carl Ebeling, and Scott Hauck. 2010. Enhanced loop flattening for software pipelining of arbitrary loop nests. Technical Report. 
University of Washington, Seattle."},{"key":"e_1_3_2_54_2","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1145\/2684746.2689060","volume-title":"Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA\u201915)","author":"Zhang Chen","year":"2015","unstructured":"Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA\u201915). Association for Computing Machinery, New York, NY, 161\u2013170."},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240765.3240801"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00085"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD.2013.6691121"},{"key":"e_1_3_2_58_2","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1145\/2435264.2435271","volume-title":"Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA\u201913)","author":"Zuo Wei","year":"2013","unstructured":"Wei Zuo, Yun Liang, Peng Li, Kyle Rupnow, Deming Chen, and Jason Cong. 2013. Improving high level synthesis optimization opportunity through polyhedral transformations. In Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA\u201913). 
Association for Computing Machinery, New York, NY, 9\u201318."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3572908","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3572908","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:51:38Z","timestamp":1750182698000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3572908"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3]]},"references-count":57,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,6,30]]}},"alternative-id":["10.1145\/3572908"],"URL":"https:\/\/doi.org\/10.1145\/3572908","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2023,3]]},"assertion":[{"value":"2022-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-11-07","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-03-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}