{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T23:16:23Z","timestamp":1776122183015,"version":"3.50.1"},"reference-count":31,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2012,1,1]],"date-time":"2012-01-01T00:00:00Z","timestamp":1325376000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100004085","name":"Ministry of Education, Science and Technology","doi-asserted-by":"publisher","award":["2010-0011534"],"award-info":[{"award-number":["2010-0011534"]}],"id":[{"id":"10.13039\/501100004085","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004085","name":"Ministry of Education, Science and Technology","doi-asserted-by":"publisher","award":["2011-0018609"],"award-info":[{"award-number":["2011-0018609"]}],"id":[{"id":"10.13039\/501100004085","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004085","name":"Ministry of Education, Science and Technology","doi-asserted-by":"publisher","award":["2011-0000975"],"award-info":[{"award-number":["2011-0000975"]}],"id":[{"id":"10.13039\/501100004085","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2012,1]]},"abstract":"<jats:p>Pipelining algorithms are typically concerned with improving only the steady-state performance, or the kernel time. The pipeline setup time happens only once and therefore can be negligible compared to the kernel time. However, for Coarse-Grained Reconfigurable Architectures (CGRAs) used as a coprocessor to a main processor, pipeline setup can take much longer due to the communication delay between the two processors, and can become significant if it is repeated in an outer loop of a loop nest. In this paper we evaluate the overhead of such non-kernel execution times when mapping nested loops for CGRAs, and propose a novel architecture-compiler cooperative scheme to reduce the overhead, while also minimizing the number of extra configurations required. Our experimental results using loops from multimedia and scientific domains demonstrate that our proposed techniques can greatly increase the performance of nested loops by up to 2.87 times compared to the conventional approach of accelerating only the innermost loops. Moreover, the mappings generated by our techniques require only a modest number of configurations that can fit in recent reconfigurable architectures.<\/jats:p>","DOI":"10.1145\/2086696.2086711","type":"journal-article","created":{"date-parts":[[2012,1,24]],"date-time":"2012-01-24T16:47:14Z","timestamp":1327423634000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":28,"title":["Improving performance of nested loops on reconfigurable array processors"],"prefix":"10.1145","volume":"8","author":[{"given":"Yongjoo","family":"Kim","sequence":"first","affiliation":[{"name":"Seoul National University, Korea"}]},{"given":"Jongeun","family":"Lee","sequence":"additional","affiliation":[{"name":"UNIST, South Korea"}]},{"given":"Toan X.","family":"Mai","sequence":"additional","affiliation":[{"name":"UNIST, South Korea"}]},{"given":"Yunheung","family":"Paek","sequence":"additional","affiliation":[{"name":"Seoul National University, Korea"}]}],"member":"320","published-online":{"date-parts":[[2012,1,26]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"ADS. ARM Developer Suite. ARM Limited. http:\/\/www.arm.com\/.  ADS. ARM Developer Suite. ARM Limited. http:\/\/www.arm.com\/."},{"key":"e_1_2_1_2_1","unstructured":"AMBA2. Advanced Microcontroller Bus Architecture 2. ARM Limited. http:\/\/www.arm.com\/.  AMBA2. Advanced Microcontroller Bus Architecture 2. ARM Limited. http:\/\/www.arm.com\/."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/378239.378483"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/354880.354889"},{"key":"e_1_2_1_5_1","unstructured":"DongbuAnam Semiconductor. http:\/\/www.dsemi.com.  DongbuAnam Semiconductor. http:\/\/www.dsemi.com."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.micpro.2008.08.007"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-92990-1_8"},{"key":"e_1_2_1_8_1","unstructured":"H.264. H.264\/AVC JM reference software v15.0. http:\/\/iphome.hhi.de\/suehring\/tml\/.  H.264. H.264\/AVC JM reference software v15.0. http:\/\/iphome.hhi.de\/suehring\/tml\/."},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of the 18th International Symposium on Parallel and Distributed Processing. 148","author":"Hannig F.","unstructured":"Hannig , F. , Dutta , H. , and Teich , J . 2004. Mapping of regular nested loop programs to coarse-grained reconfigurable arrays - constraints and methodology . In Proceedings of the 18th International Symposium on Parallel and Distributed Processing. 148 . Hannig, F., Dutta, H., and Teich, J. 2004. Mapping of regular nested loop programs to coarse-grained reconfigurable arrays - constraints and methodology. In Proceedings of the 18th International Symposium on Parallel and Distributed Processing. 148."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.5555\/367072.367839"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.869367"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/40.877947"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/DATE.2005.260"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1165573.1165646"},{"key":"e_1_2_1_15_1","unstructured":"L220. ARM L220 cache controller technical reference manual. ARM Limited. http:\/\/infocenter.arm.com\/.  L220. ARM L220 cache controller technical reference manual. ARM Limited. http:\/\/infocenter.arm.com\/."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1450135.1450143"},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the IEEE International Conference on Field-Programmable Technology (FPT). 166--173","author":"Mei B.","unstructured":"Mei , B. , Vernalde , S. , Verkest , D. , De Man , H. , and Lauwereins , R . 2002. DRESC: A retargetable compiler for coarse-grained reconfigurable architectures . In Proceedings of the IEEE International Conference on Field-Programmable Technology (FPT). 166--173 . Mei, B., Vernalde, S., Verkest, D., De Man, H., and Lauwereins, R. 2002. DRESC: A retargetable compiler for coarse-grained reconfigurable architectures. In Proceedings of the IEEE International Conference on Field-Programmable Technology (FPT). 166--173."},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the Conference on Design, Automation and Test in Europe (DATE'03)","author":"Mei B.","unstructured":"Mei , B. , Vernalde , S. , Verkest , D. , De Man , H. , and Lauwereins , R . 2003. Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling . In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'03) . IEEE Computer Society, Los Alamitos, CA. Mei, B., Vernalde, S., Verkest, D., De Man, H., and Lauwereins, R. 2003. Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'03). IEEE Computer Society, Los Alamitos, CA."},{"key":"e_1_2_1_19_1","article-title":"Remarc: Reconfigurable multimedia array coprocessor. IEICE","author":"Miyamori T.","year":"1998","unstructured":"Miyamori , T. and Olukotun , K. 1998 . Remarc: Reconfigurable multimedia array coprocessor. IEICE Trans. Info. Syst. E82-D, 389--397. Miyamori, T. and Olukotun, K. 1998. Remarc: Reconfigurable multimedia array coprocessor. IEICE Trans. Info. Syst. E82-D, 389--397.","journal-title":"Trans. Info. Syst. E82-D, 389--397."},{"key":"e_1_2_1_20_1","volume-title":"Advanced Compiler Design Implementation","author":"Muchnick S. S.","unstructured":"Muchnick , S. S. 1997. Advanced Compiler Design Implementation . Morgan Kaufmann Publishers . Muchnick, S. S. 1997. Advanced Compiler Design Implementation. Morgan Kaufmann Publishers."},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the 10th International Conference on Compiler Construction (CC'01)","author":"Muthukumar K.","unstructured":"Muthukumar , K. and Doshi , G . 2001. Software pipelining of nested loops . In Proceedings of the 10th International Conference on Compiler Construction (CC'01) . Springer-Verlag, Berlin, 165--181. Muthukumar, K. and Doshi, G. 2001. Software pipelining of nested loops. In Proceedings of the 10th International Conference on Compiler Construction (CC'01). Springer-Verlag, Berlin, 165--181."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1454115.1454140"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1669112.1669160"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1629395.1629433"},{"key":"e_1_2_1_25_1","unstructured":"Patterson D. and Hennessy J. 2008. Computer Organization and Design: The Hardware\/Software Interface 4 Ed. Morgan Kaufmann.   Patterson D. and Hennessy J. 2008. Computer Organization and Design: The Hardware\/Software Interface 4 Ed. Morgan Kaufmann."},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the International Symposium on Parallel and Distributed Processing Symposium (IPDPS'02)","author":"Petkov D.","unstructured":"Petkov , D. , Harr , R. , and Amarasinghe , S . 2002. Efficient pipelining of nested loops: unroll-and-squash . In Proceedings of the International Symposium on Parallel and Distributed Processing Symposium (IPDPS'02) . Abstracts and CD-ROM. Petkov, D., Harr, R., and Amarasinghe, S. 2002. Efficient pipelining of nested loops: unroll-and-squash. In Proceedings of the International Symposium on Parallel and Distributed Processing Symposium (IPDPS'02). Abstracts and CD-ROM."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/192724.192731"},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the International Symposium on Code Generation and Optimization (CGO'04)","author":"Rong H.","unstructured":"Rong , H. , Tang , Z. , Govindarajan , R. , Douillet , A. , and Gao , G . 2004. Single-dimension software pipelining for multi-dimensional loops . In Proceedings of the International Symposium on Code Generation and Optimization (CGO'04) . 163--174. Rong, H., Tang, Z., Govindarajan, R., Douillet, A., and Gao, G. 2004. Single-dimension software pipelining for multi-dimensional loops. In Proceedings of the International Symposium on Code Generation and Optimization (CGO'04). 163--174."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/12.859540"},{"key":"e_1_2_1_30_1","unstructured":"Thoziyoor S. Muralimanohar N. Ahn J. and Jouppi N. 2008. Cacti 5.1. Tech. rep.  Thoziyoor S. Muralimanohar N. Ahn J. and Jouppi N. 2008. Cacti 5.1. Tech. rep."},{"key":"e_1_2_1_31_1","volume-title":"Proceedings of the 6th International Conference on Compiler Construction. Springer-Verlag","author":"Wang J.","unstructured":"Wang , J. and Gao , G. R . 1996. Pipelining-dovetailing: A transformation to enhance software pipelining for nested loops . In Proceedings of the 6th International Conference on Compiler Construction. Springer-Verlag , Berlin, 1--17. Wang, J. and Gao, G. R. 1996. Pipelining-dovetailing: A transformation to enhance software pipelining for nested loops. In Proceedings of the 6th International Conference on Compiler Construction. Springer-Verlag, Berlin, 1--17."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2086696.2086711","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2086696.2086711","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T10:06:42Z","timestamp":1750241202000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2086696.2086711"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,1]]},"references-count":31,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2012,1]]}},"alternative-id":["10.1145\/2086696.2086711"],"URL":"https:\/\/doi.org\/10.1145\/2086696.2086711","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2012,1]]},"assertion":[{"value":"2011-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2011-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2012-01-26","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}