{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:27:42Z","timestamp":1750220862790,"version":"3.41.0"},"reference-count":37,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2020,3,4]],"date-time":"2020-03-04T00:00:00Z","timestamp":1583280000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2020,3,31]]},"abstract":"<jats:p>In heterogeneous multicore systems, the memory subsystem plays a critical role, since most core-to-core communications are conducted through the main memory. Memory efficiency has a substantial impact on system performance. Although memory traffic from multimedia cores generally manifests high row-buffer locality, which is beneficial to memory efficiency, the locality is often lost as memory streams are forwarded through networks-on-chip (NoC). Previous studies have discussed the techniques that improve memory visibility to reveal scattered row-buffer hit opportunities to the memory scheduler. However, extending local memory visibility introduces little benefit after the locality has been severely diluted. As the alternative approach, preserving row-buffer locality in the NoC has not been well explored. What is worse, it remains to be studied how to perform network traffic scheduling with the awareness of both memory efficiency and quality-of-service (QoS). In this article, we propose a router design with embedded row-index caches to enable locality-aware packet forwarding. The proposed design requires minor modifications to existing router microarchitecture and can be easily implemented with priority arbiters to integrate QoS support. 
Extensive evaluations show that the proposed design achieves higher memory efficiency than prior memory-aware routers, in addition to providing QoS support. On the basis of extant QoS-aware routers, locality-aware forwarding helps to increase row-buffer hits by 58.32% and reduce memory latency by 14.45% on average. It also introduces a net reduction in DRAM and NoC energy cost by 27.82%.<\/jats:p>","DOI":"10.1145\/3377149","type":"journal-article","created":{"date-parts":[[2020,3,4]],"date-time":"2020-03-04T12:50:12Z","timestamp":1583326212000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Improving Memory Efficiency in Heterogeneous MPSoCs through Row-Buffer Locality-aware Forwarding"],"prefix":"10.1145","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3455-2885","authenticated-orcid":false,"given":"Yang","family":"Song","sequence":"first","affiliation":[{"name":"University of California San Diego, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0965-7247","authenticated-orcid":false,"given":"Bill","family":"Lin","sequence":"additional","affiliation":[{"name":"University of California San Diego, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2020,3,4]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.5555\/2337159.2337207"},{"key":"e_1_2_1_2_1","unstructured":"Karthik Chandrasekar, Christian Weis, Yonghui Li, Sven Goossens, Matthias Jung, Omar Naji, Benny Akesson, Norbert Wehn, and Kees Goossens. 2014. DRAMPower: Open-source DRAM Power and Energy Estimation Tool. 
Retrieved from http:\/\/www.drampower.info."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-662-44917-2_15"},{"volume-title":"Principles and Practices of Interconnection Networks","author":"Dally William","key":"e_1_2_1_4_1","unstructured":"William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann, San Francisco, CA."},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA\u201910)","author":"Das Reetuparna","year":"2010","unstructured":"Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita R. Das. 2010. Aergia: Exploiting packet latency slack in on-chip networks. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA\u201910). ACM, New York, NY, 106--116. DOI:https:\/\/doi.org\/10.1145\/1815961.1815976"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1998582.1998590"},{"volume-title":"Proceedings of the 44th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201911)","author":"Ebrahimi Eiman","key":"e_1_2_1_7_1","unstructured":"Eiman Ebrahimi, Rustam Miftakhutdinov, Chris Fallin, Chang Joo Lee, Jos\u00e9 A. Joao, Onur Mutlu, and Yale N. Patt. 2011. Parallel application memory scheduling. In Proceedings of the 44th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201911). ACM, New York, NY, 362--373. 
DOI:https:\/\/doi.org\/10.1145\/2155620.2155663"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000112"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1669112.1669149"},{"key":"e_1_2_1_10_1","volume-title":"Proceedings of the 10th IEEE International Symposium on Parallel and Distributed Processing with Applications","author":"Heisswolf Jan","year":"2012","unstructured":"Jan Heisswolf, Ralf K\u00f6nig, and Jurgen Becker. 2012. A scalable NoC router design providing QoS support using weighted round robin scheduling. In Proceedings of the 10th IEEE International Symposium on Parallel and Distributed Processing with Applications. 625--632. DOI:https:\/\/doi.org\/10.1109\/ISPA.2012.93"},{"key":"e_1_2_1_11_1","volume-title":"Memory Systems: Cache, DRAM, Disk","author":"Jacob Bruce","year":"2007","unstructured":"Bruce Jacob, Spencer Ng, and David Wang. 2007. Memory Systems: Cache, DRAM, Disk. 
Morgan Kaufmann, San Francisco, CA."},{"volume-title":"Proceedings of the 46th Annual Design Automation Conference (DAC\u201909)","author":"Jang Wooyoung","key":"e_1_2_1_12_1","unstructured":"Wooyoung Jang and David Z. Pan. 2009. An SDRAM-aware router for networks-on-chip. In Proceedings of the 46th Annual Design Automation Conference (DAC\u201909). ACM, New York, NY, 800--805. DOI:https:\/\/doi.org\/10.1145\/1629911.1630117"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2541228.2555300"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2228360.2228513"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155624"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2333660.2333751"},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the 16th International Symposium on High Performance Computer Architecture (HPCA\u201910)","author":"Kim Yoongu","year":"2010","unstructured":"Yoongu Kim, Dongsu Han, Onur Mutlu, and Mor Harchol-Balter. 2010. ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers. In Proceedings of the 16th International Symposium on High Performance Computer Architecture (HPCA\u201910). 1--12. 
DOI:https:\/\/doi.org\/10.1109\/HPCA.2010.5416658"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2010.51"},{"key":"e_1_2_1_19_1","volume-title":"DRAM-aware last-level cache replacement","author":"Lee Chang Joo","year":"2010","unstructured":"Chang Joo Lee, Eiman Ebrahimi, Veynu Narasiman, Onur Mutlu, and Yale N. Patt. 2010. DRAM-aware last-level cache replacement. High Performance Systems Technical Report TR-HPS-2010-007. University of Texas, Austin, TX."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2008.31"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1394608.1382128"},{"key":"e_1_2_1_22_1","unstructured":"NVIDIA. 2015. Tegra X1. Retrieved from http:\/\/www.nvidia.com\/object\/tegra-x1-processor.html."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2010.21"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/VLSID.2014.47"},{"key":"e_1_2_1_25_1","unstructured":"Qualcomm. 2017. Snapdragon 845. Retrieved from https:\/\/www.qualcomm.com\/products\/snapdragon\/processors\/845."},{"volume-title":"Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA\u201900)","author":"Rixner Scott","key":"e_1_2_1_26_1","unstructured":"Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, and John D. Owens. 2000. Memory access scheduling. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA\u201900). ACM, New York, NY, 128--138. 
DOI:https:\/\/doi.org\/10.1145\/339647.339668"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.23919\/DATE.2018.8342112"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3195970.3196110"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897937.2898093"},{"key":"e_1_2_1_30_1","volume-title":"Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA\u201910)","author":"Stuecheli Jeffrey","year":"2010","unstructured":"Jeffrey Stuecheli, Dimitris Kaseridis, David Daly, Hillery C. Hunter, and Lizy K. John. 2010. The virtual write queue: Coordinating DRAM and last-level cache policies. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA\u201910). ACM, New York, NY, 72--82. DOI:https:\/\/doi.org\/10.1145\/1815961.1815972"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/NOCS.2012.31"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2847255"},{"key":"e_1_2_1_33_1","doi-asserted-by":"crossref","unstructured":"
David Wang, Brinda Ganesh, Nuengwong Tuaycharoen, Katie Baynes, Aamer Jaleel, and Bruce Jacob. 2005. DRAMSim: A memory-system simulator. In SIGARCH Computer Architecture News. 100--107.","DOI":"10.1145\/1105734.1105748"},{"volume-title":"Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA\u201912)","author":"Wang Zhe","key":"e_1_2_1_34_1","unstructured":"Zhe Wang, Samira M. Khan, and Daniel A. Jim\u00e9nez. 2012. Improving writeback efficiency with decoupled last-write prediction. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA\u201912). IEEE Computer Society, Washington, DC, 309--320. http:\/\/dl.acm.org\/citation.cfm?id=2337159.2337195"},{"volume-title":"Proceedings of the 42nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201909)","author":"Yuan George L.","key":"e_1_2_1_35_1","unstructured":"George L. Yuan, Ali Bakhoda, and Tor M. Aamodt. 2009. Complexity effective memory access scheduling for many-core accelerator architectures. In Proceedings of the 42nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201909). ACM, New York, NY, 34--44. 
DOI:https:\/\/doi.org\/10.1145\/1669112.1669119"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.5555\/3195638.3195672"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.47"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3377149","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3377149","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:23:38Z","timestamp":1750202618000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3377149"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,3,4]]},"references-count":37,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2020,3,31]]}},"alternative-id":["10.1145\/3377149"],"URL":"https:\/\/doi.org\/10.1145\/3377149","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2020,3,4]]},"assertion":[{"value":"2018-12-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-12-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-03-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}