{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,4]],"date-time":"2025-12-04T09:58:19Z","timestamp":1764842299820,"version":"3.41.0"},"reference-count":45,"publisher":"Association for Computing Machinery (ACM)","issue":"5s","license":[{"start":{"date-parts":[[2019,10,7]],"date-time":"2019-10-07T00:00:00Z","timestamp":1570406400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001381","name":"National Research Foundation Singapore","doi-asserted-by":"publisher","award":["NRF2015-IIP003"],"award-info":[{"award-number":["NRF2015-IIP003"]}],"id":[{"id":"10.13039\/501100001381","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Huawei International Pte.Ltd."}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2019,10,31]]},"abstract":"<jats:p>A Coarse-Grained Reconfigurable Array (CGRA) is a promising high-performance low-power accelerator for compute-intensive loop kernels. While the mapping of the computations on the CGRA is a well-studied problem, bringing the data into the array at a high throughput remains a challenge. A conventional CGRA design involves on-array computations to generate memory addresses for data access undermining the attainable throughput. 
A decoupled access-execute architecture, on the other hand, isolates the memory access from the actual computations resulting in a significantly higher throughput.<\/jats:p>\n          <jats:p>\n            We propose a novel decoupled access-execute CGRA design called\n            <jats:italic>CASCADE<\/jats:italic>\n            with full architecture and compiler support for high-throughput data streaming from an on-chip multi-bank memory.\n            <jats:italic>CASCADE<\/jats:italic>\n            offloads the address computations for the multi-bank data memory access to a custom designed programmable hardware. An end-to-end fully-automated compiler synchronizes the conflict-free movement of data between the memory banks and the CGRA. Experimental evaluations show on average 3\u00d7 performance benefit and 2.2\u00d7 performance per watt improvement for\n            <jats:italic>CASCADE<\/jats:italic>\n            compared to an iso-area conventional CGRA with a bigger processing array in lieu of a dedicated hardware memory address generation logic.\n          <\/jats:p>","DOI":"10.1145\/3358177","type":"journal-article","created":{"date-parts":[[2019,10,10]],"date-time":"2019-10-10T13:13:05Z","timestamp":1570713185000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":40,"title":["CASCADE"],"prefix":"10.1145","volume":"18","author":[{"given":"Dhananjaya","family":"Wijerathne","sequence":"first","affiliation":[{"name":"National University of Singapore, Singapore"}]},{"given":"Zhaoying","family":"Li","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore"}]},{"given":"Manupa","family":"Karunarathne","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore"}]},{"given":"Anuj","family":"Pathania","sequence":"additional","affiliation":[{"name":"National University of Singapore, 
Singapore"}]},{"given":"Tulika","family":"Mitra","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore"}]}],"member":"320","published-online":{"date-parts":[[2019,10,7]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2019. MediaBench 2 Benchmark. http:\/\/mathstat.slu.edu\/~fritts\/mediabench\/.  2019. MediaBench 2 Benchmark. http:\/\/mathstat.slu.edu\/~fritts\/mediabench\/."},{"key":"e_1_2_1_2_1","unstructured":"2019. PolyLib - A Library of Polyhedral Functions. http:\/\/icps.u-strasbg.fr\/polylib\/.  2019. PolyLib - A Library of Polyhedral Functions. http:\/\/icps.u-strasbg.fr\/polylib\/."},{"key":"e_1_2_1_3_1","unstructured":"2019. The Polyhedral Benchmark Suite. http:\/\/web.cse.ohio-state.edu\/~pouchet.2\/software\/polybench\/.  2019. The Polyhedral Benchmark Suite. http:\/\/web.cse.ohio-state.edu\/~pouchet.2\/software\/polybench\/."},{"key":"e_1_2_1_4_1","volume-title":"Ullman","author":"Aho Alfred V.","year":"2007","unstructured":"Alfred V. Aho , Monica S. Lam , Ravi Sethi , and Jeffrey D . Ullman . 2007 . Compilers : Principles, Techniques, and Tools Second Edition . Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. 2007. Compilers: Principles, Techniques, and Tools Second Edition."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3203217.3203267"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD.2017.8203842"},{"key":"e_1_2_1_7_1","first-page":"21","article-title":"Graph minor approach for application mapping on CGRAs","volume":"7","author":"Chen Liang","year":"2014","unstructured":"Liang Chen and Tulika Mitra . 2014 . Graph minor approach for application mapping on CGRAs . Transactions on Reconfigurable Technology and Systems (TRETS) 7 , 3 (2014), 21 . Liang Chen and Tulika Mitra. 2014. Graph minor approach for application mapping on CGRAs. 
Transactions on Reconfigurable Technology and Systems (TRETS) 7, 3 (2014), 21.","journal-title":"Transactions on Reconfigurable Technology and Systems (TRETS)"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.5555\/956417.956540"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1008069920230"},{"volume-title":"2015 52nd ACM\/EDAC\/IEEE Design Automation Conference (DAC). IEEE, 1--6.","author":"Cota Emilio G.","key":"e_1_2_1_10_1","unstructured":"Emilio G. Cota , Paolo Mantovani , Giuseppe Di Guglielmo , and Luca P. Carloni . 2015. An analysis of accelerator coupling in heterogeneous architectures . In 2015 52nd ACM\/EDAC\/IEEE Design Automation Conference (DAC). IEEE, 1--6. Emilio G. Cota, Paolo Mantovani, Giuseppe Di Guglielmo, and Luca P. Carloni. 2015. An analysis of accelerator coupling in heterogeneous architectures. In 2015 52nd ACM\/EDAC\/IEEE Design Automation Conference (DAC). IEEE, 1--6."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/DAC.2018.8465892"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.micpro.2014.05.009"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/EUC.2014.26"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1508128.1508158"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2228360.2228600"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463209.2488756"},{"key":"e_1_2_1_17_1","first-page":"8","article-title":"Power-efficient predication techniques for acceleration of control flow execution on CGRA","volume":"10","author":"Han Kyuseung","year":"2013","unstructured":"Kyuseung Han , Junwhan Ahn , and Kiyoung Choi . 2013 . Power-efficient predication techniques for acceleration of control flow execution on CGRA . ACM Transactions on Architecture and Code Optimization (TACO) 10 , 2 (2013), 8 . Kyuseung Han, Junwhan Ahn, and Kiyoung Choi. 2013. 
Power-efficient predication techniques for acceleration of control flow execution on CGRA. ACM Transactions on Architecture and Code Optimization (TACO) 10, 2 (2013), 8.","journal-title":"ACM Transactions on Architecture and Code Optimization (TACO)"},{"key":"e_1_2_1_18_1","volume-title":"Sung Jin Kim, and Karthikeyan Sankaralingam","author":"Ho Chen-Han","year":"2015","unstructured":"Chen-Han Ho , Sung Jin Kim, and Karthikeyan Sankaralingam . 2015 . Efficient execution of memory access phases using dataflow specialization. In SIGARCH Computer Architecture News, Vol. 43 . ACM , 118--130. Chen-Han Ho, Sung Jin Kim, and Karthikeyan Sankaralingam. 2015. Efficient execution of memory access phases using dataflow specialization. In SIGARCH Computer Architecture News, Vol. 43. ACM, 118--130."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3061639.3062262"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/DAC.2018.8465833"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2016.2595560"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1755888.1755892"},{"key":"e_1_2_1_23_1","first-page":"42","article-title":"Memory access optimization in compilation for coarse-grained reconfigurable architectures","volume":"16","author":"Kim Yongjoo","year":"2011","unstructured":"Yongjoo Kim , Jongeun Lee , Aviral Shrivastava , and Yunheung Paek . 2011 . Memory access optimization in compilation for coarse-grained reconfigurable architectures . Transactions on design automation of electronic systems (TODAES) 16 , 4 (2011), 42 . Yongjoo Kim, Jongeun Lee, Aviral Shrivastava, and Yunheung Paek. 2011. Memory access optimization in compilation for coarse-grained reconfigurable architectures. 
Transactions on design automation of electronic systems (TODAES) 16, 4 (2011), 42.","journal-title":"Transactions on design automation of electronic systems (TODAES)"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2004.1281665"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2656075.2656085"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463209.2488757"},{"volume-title":"The Livermore Fortran Kernels: A Computer Test of the Numerical Performance Range","author":"McMahon Frank H.","key":"e_1_2_1_27_1","unstructured":"Frank H. McMahon . 1986. The Livermore Fortran Kernels: A Computer Test of the Numerical Performance Range . Technical Report. Lawrence Livermore National Lab., CA (USA) . Frank H. McMahon. 1986. The Livermore Fortran Kernels: A Computer Test of the Numerical Performance Range. Technical Report. Lawrence Livermore National Lab., CA (USA)."},{"key":"e_1_2_1_28_1","volume-title":"DRESC: Architecture and compiler for coarse-grain reconfigurable processors. In Fine-and Coarse-Grain Reconfigurable Computing","author":"Mei Bingfeng","year":"2007","unstructured":"Bingfeng Mei , M. Berekovic , and J. Y. Mignolet . 2007 . ADRES & DRESC: Architecture and compiler for coarse-grain reconfigurable processors. In Fine-and Coarse-Grain Reconfigurable Computing . Springer , 255--297. Bingfeng Mei, M. Berekovic, and J. Y. Mignolet. 2007. ADRES & DRESC: Architecture and compiler for coarse-grain reconfigurable processors. In Fine-and Coarse-Grain Reconfigurable Computing. Springer, 255--297."},{"key":"e_1_2_1_29_1","volume-title":"International Conference on Field-Programmable Technology, 2002 (FPT). Proceedings. IEEE, 166--173","author":"Mei Bingfeng","year":"2002","unstructured":"Bingfeng Mei , Serge Vernalde , Diederik Verkest , Hugo De Man , and Rudy Lauwereins . 2002 . DRESC: A retargetable compiler for coarse-grained reconfigurable architectures . 
In International Conference on Field-Programmable Technology, 2002 (FPT). Proceedings. IEEE, 166--173 . Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, and Rudy Lauwereins. 2002. DRESC: A retargetable compiler for coarse-grained reconfigurable architectures. In International Conference on Field-Programmable Technology, 2002 (FPT). Proceedings. IEEE, 166--173."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2744769.2744831"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080255"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSI.2016.2647322"},{"key":"e_1_2_1_33_1","first-page":"435","article-title":"System-level optimization of accelerator local memory for heterogeneous systems-on-chip","volume":"36","author":"Pilato Christian","year":"2016","unstructured":"Christian Pilato , Paolo Mantovani , Giuseppe Di Guglielmo , and Luca P Carloni . 2016 . System-level optimization of accelerator local memory for heterogeneous systems-on-chip . Transactions on Computer-Aided Design of Integrated Circuits and Systems 36 , 3 (2016), 435 -- 448 . Christian Pilato, Paolo Mantovani, Giuseppe Di Guglielmo, and Luca P Carloni. 2016. System-level optimization of accelerator local memory for heterogeneous systems-on-chip. Transactions on Computer-Aided Design of Integrated Circuits and Systems 36, 3 (2016), 435--448.","journal-title":"Transactions on Computer-Aided Design of Integrated Circuits and Systems"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/192724.192731"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/12.859540"},{"volume-title":"ACM SIGARCH Computer Architecture News","author":"Smith James E.","key":"e_1_2_1_36_1","unstructured":"James E. Smith . 1982. Decoupled access\/execute computer architectures . In ACM SIGARCH Computer Architecture News , Vol. 10 . IEEE Computer Society Press , 112--119. James E. Smith. 1982. 
Decoupled access\/execute computer architectures. In ACM SIGARCH Computer Architecture News, Vol. 10. IEEE Computer Society Press, 112--119."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/2554688.2554780"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463209.2488748"},{"key":"e_1_2_1_39_1","volume-title":"Kanwen Wang, Hao Yu, and Mingbin Yu.","author":"Xu Dongjun","year":"2015","unstructured":"Dongjun Xu , Ningmei Yu , PD Sai Manoj , Kanwen Wang, Hao Yu, and Mingbin Yu. 2015 . A 2.5-D memory-logic integration with data-pattern-aware memory controller. Design & Test 32, 4 (2015), 1--10. Dongjun Xu, Ningmei Yu, PD Sai Manoj, Kanwen Wang, Hao Yu, and Mingbin Yu. 2015. A 2.5-D memory-logic integration with data-pattern-aware memory controller. Design & Test 32, 4 (2015), 1--10."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.5555\/1854205.1854208"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/2966986.2967056"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2017.2693274"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2015.2474129"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2017.2682241"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/2966986.2967049"}],"container-title":["ACM Transactions on Embedded Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3358177","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3358177","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:32:58Z","timestamp":1750199578000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3358177"}},"subtitle":["High Throughput Data 
Streaming via Decoupled Access-Execute CGRA"],"short-title":[],"issued":{"date-parts":[[2019,10,7]]},"references-count":45,"journal-issue":{"issue":"5s","published-print":{"date-parts":[[2019,10,31]]}},"alternative-id":["10.1145\/3358177"],"URL":"https:\/\/doi.org\/10.1145\/3358177","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"type":"print","value":"1539-9087"},{"type":"electronic","value":"1558-3465"}],"subject":[],"published":{"date-parts":[[2019,10,7]]},"assertion":[{"value":"2019-04-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-10-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}