{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:29:03Z","timestamp":1750220943113,"version":"3.41.0"},"reference-count":55,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2019,4,26]],"date-time":"2019-04-26T00:00:00Z","timestamp":1556236800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100011033","name":"Spanish State Research Agency","doi-asserted-by":"crossref","award":["TIN2015-66972-C5-3-R, TIN2016-75344-R"],"award-info":[{"award-number":["TIN2015-66972-C5-3-R, TIN2016-75344-R"]}],"id":[{"id":"10.13039\/501100011033","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Center for Future Architecture Research"},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["SHF-1617732"],"award-info":[{"award-number":["SHF-1617732"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2019,6,30]]},"abstract":"<jats:p>\n            Decoupling techniques have been proposed to reduce the amount of memory latency exposed to high-performance accelerators as they fetch data. Although decoupled access-execute (DAE) and more recent decoupled data supply approaches offer promising single-threaded performance improvements, little work has considered how to extend them into parallel scenarios. This article explores the opportunities and challenges of designing parallel, high-performance, resource-efficient decoupled data supply systems. 
We propose M\n            <jats:sc>ercury<\/jats:sc>\n            , a parallel decoupled data supply system that utilizes thread-level parallelism for high-throughput data supply with good portability attributes. Additionally, we introduce some microarchitectural improvements for data supply units to efficiently handle long-latency indirect loads.\n          <\/jats:p>","DOI":"10.1145\/3310332","type":"journal-article","created":{"date-parts":[[2019,4,29]],"date-time":"2019-04-29T17:12:14Z","timestamp":1556557934000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Efficient Data Supply for Parallel Heterogeneous Architectures"],"prefix":"10.1145","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2669-6849","authenticated-orcid":false,"given":"Tae Jun","family":"Ham","sequence":"first","affiliation":[{"name":"Seoul National University, Seoul, Republic of Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Juan L.","family":"Arag\u00f3n","sequence":"additional","affiliation":[{"name":"University of Murcia, Murcia, SPAIN"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Margaret","family":"Martonosi","sequence":"additional","affiliation":[{"name":"Princeton University, NJ"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2019,4,26]]},"reference":[{"volume-title":"Proceedings of the 36th Annual International Symposium on Microarchitecture (MICRO\u201903)","author":"Akkary Haitham","key":"e_1_2_1_1_1","unstructured":"Haitham Akkary , Ravi Rajwar , and Srikanth T. Srinivasan . 2003. Checkpoint processing and recovery: Towards scalable large instruction window processors . In Proceedings of the 36th Annual International Symposium on Microarchitecture (MICRO\u201903) . http:\/\/dl.acm.org\/citation.cfm?id=956417.956554 Haitham Akkary, Ravi Rajwar, and Srikanth T. Srinivasan. 2003. 
Checkpoint processing and recovery: Towards scalable large instruction window processors. In Proceedings of the 36th Annual International Symposium on Microarchitecture (MICRO\u201903). http:\/\/dl.acm.org\/citation.cfm?id=956417.956554"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/165939.165952"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063454"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2629677"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/300979.300995"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1555754.1555814"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"volume-title":"Proceedings of the 49th Annual International Symposium on Microarchitecture (MICRO\u201916)","author":"Chen T.","key":"e_1_2_1_8_1","unstructured":"T. Chen and G. E. Suh . 2016. Efficient data supply for hardware accelerators with prefetching and access\/execute decoupling . In Proceedings of the 49th Annual International Symposium on Microarchitecture (MICRO\u201916) . T. Chen and G. E. Suh. 2016. Efficient data supply for hardware accelerators with prefetching and access\/execute decoupling. In Proceedings of the 49th Annual International Symposium on Microarchitecture (MICRO\u201916)."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000079"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1044823.1044825"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/WWC.2003.1249063"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/L-CA.2013.9"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2008.4771800"},{"key":"e_1_2_1_14_1","unstructured":"J. D. Gindele. 1977. Buffer Block Prefetching Method. IBM.  J. D. Gindele. 1977. Buffer Block Prefetching Method. 
IBM."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830800"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3075620"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.46"},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)","author":"Hashemi Milad","year":"2016","unstructured":"Milad Hashemi , Onur Mutlu , and Yale N. Patt . 2016. Continuous runahead: Transparent hardware acceleration for memory intensive workloads . In Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916) . 12. http:\/\/dl.acm.org\/citation.cfm?id=3195638.3195712 Milad Hashemi, Onur Mutlu, and Yale N. Patt. 2016. Continuous runahead: Transparent hardware acceleration for memory intensive workloads. In Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916). 12. http:\/\/dl.acm.org\/citation.cfm?id=3195638.3195712"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the 35th International Conference on Machine Learning (ICML\u201918)","author":"Hashemi Milad","year":"2018","unstructured":"Milad Hashemi , Kevin Swersky , Jamie A. Smith , Grant Ayers , Heiner Litz , Jichuan Chang , Christos Kozyrakis , 2018 . Learning memory access patterns . In Proceedings of the 35th International Conference on Machine Learning (ICML\u201918) . Milad Hashemi, Kevin Swersky, Jamie A. Smith, Grant Ayers, Heiner Litz, Jichuan Chang, Christos Kozyrakis, et al. 2018. Learning memory access patterns. In Proceedings of the 35th International Conference on Machine Learning (ICML\u201918)."},{"key":"e_1_2_1_20_1","volume-title":"Retrieved","author":"AMD.","year":"2015","unstructured":"AMD. 2015 . High-Bandwidth Memory (HBM) . Retrieved March 22, 2019 from https:\/\/www.amd.com\/Documents\/High-Bandwidth-Memory-HBM.pdf. AMD. 2015. High-Bandwidth Memory (HBM). 
Retrieved March 22, 2019 from https:\/\/www.amd.com\/Documents\/High-Bandwidth-Memory-HBM.pdf."},{"volume-title":"Proceedings of the 16th International Symposium on High-Performance Computer Architecture (HPCA\u201910)","author":"Hilton A.","key":"e_1_2_1_21_1","unstructured":"A. Hilton and A. Roth . 2010. BOLT: Energy-efficient out-of-order latency-tolerant execution . In Proceedings of the 16th International Symposium on High-Performance Computer Architecture (HPCA\u201910) . A. Hilton and A. Roth. 2010. BOLT: Energy-efficient out-of-order latency-tolerant execution. In Proceedings of the 16th International Symposium on High-Performance Computer Architecture (HPCA\u201910)."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750390"},{"key":"e_1_2_1_23_1","volume-title":"Retrieved","author":"Hybrid Memory Cube Consortium","year":"2018","unstructured":"Hybrid Memory Cube Consortium . 2018 . Hybrid Memory Cube (HMC) . Retrieved March 22, 2019 from http:\/\/hybridmemorycube.org. Hybrid Memory Cube Consortium. 2018. Hybrid Memory Cube (HMC). 
Retrieved March 22, 2019 from http:\/\/hybridmemorycube.org."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2540708.2540730"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2581122.2544161"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/264107.264207"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/325164.325162"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/1950365.1950411"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.36"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/605397.605415"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/545215.545223"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/1669112.1669172"},{"volume-title":"Proceedings of the International Conference on Computer-Aided Design (ICCAD\u201911)","author":"Li Sheng","key":"e_1_2_1_33_1","unstructured":"Sheng Li , Ke Chen , Jung Ho Ahn , Jay B. Brockman , and Norman P. Jouppi . 2011. CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques . In Proceedings of the International Conference on Computer-Aided Design (ICCAD\u201911) . Sheng Li, Ke Chen, Jung Ho Ahn, Jay B. Brockman, and Norman P. Jouppi. 2011. CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques. In Proceedings of the International Conference on Computer-Aided Design (ICCAD\u201911)."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2005.18"},{"key":"e_1_2_1_35_1","volume-title":"Proceedings of the 9th Annual International Symposium on High-Performance Computer Architecture (HPCA\u201903)","author":"Mutlu Onur","year":"2003","unstructured":"Onur Mutlu , Jared Stark , Chris Wilkerson , and Yale N. Patt . 2003. Runahead execution: An alternative to very large instruction windows for out-of-order processors . 
In Proceedings of the 9th Annual International Symposium on High-Performance Computer Architecture (HPCA\u201903) . http:\/\/dl.acm.org\/citation.cfm?id=822080.822823 Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N. Patt. 2003. Runahead execution: An alternative to very large instruction windows for out-of-order processors. In Proceedings of the 9th Annual International Symposium on High-Performance Computer Architecture (HPCA\u201903). http:\/\/dl.acm.org\/citation.cfm?id=822080.822823"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2004.10030"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080255"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/191995.192014"},{"volume-title":"Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201917)","author":"Parihar R.","key":"e_1_2_1_39_1","unstructured":"R. Parihar and M. C. Huang . 2017. DRUT: An efficient turbo boost solution via load balancing in decoupled look-ahead architecture . In Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201917) . 91--104. R. Parihar and M. C. Huang. 2017. DRUT: An efficient turbo boost solution via load balancing in decoupled look-ahead architecture. In Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201917). 91--104."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.5555\/1299042.1299107"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2006.1598112"},{"volume-title":"Proceedings of 13th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201904)","author":"Rangan Ram","key":"e_1_2_1_42_1","unstructured":"Ram Rangan , Neil Vachharajani , Manish Vachharajani , and David I. August . 2004. Decoupled software pipelining with the synchronization array . 
In Proceedings of 13th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201904) . Ram Rangan, Neil Vachharajani, Manish Vachharajani, and David I. August. 2004. Decoupled software pipelining with the synchronization array. In Proceedings of 13th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201904)."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.45"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.5555\/800048.801719"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/357401.357403"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/1024393.1024407"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/378993.379247"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/224170.224301"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/232973.232993"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1815965"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.5555\/800078.802557"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/216585.216588"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2007.346187"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2005.18"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/379240.379246"}],"container-title":["ACM Transactions on Architecture and Code 
Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3310332","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3310332","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3310332","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:53:37Z","timestamp":1750204417000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3310332"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,4,26]]},"references-count":55,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2019,6,30]]}},"alternative-id":["10.1145\/3310332"],"URL":"https:\/\/doi.org\/10.1145\/3310332","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2019,4,26]]},"assertion":[{"value":"2018-09-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-01-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-04-26","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}