{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,21]],"date-time":"2025-06-21T11:02:25Z","timestamp":1750503745104,"version":"3.41.0"},"reference-count":44,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2018,7,31]],"date-time":"2018-07-31T00:00:00Z","timestamp":1532995200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Key R&D Program of China","award":["2017YFB0902600"],"award-info":[{"award-number":["2017YFB0902600"]}]},{"DOI":"10.13039\/501100007129","name":"Shandong Provincial Natural Science Foundation","doi-asserted-by":"crossref","award":["ZR2017MF034 and ZR2016FQ22"],"award-info":[{"award-number":["ZR2017MF034 and ZR2016FQ22"]}],"id":[{"id":"10.13039\/501100007129","id-type":"DOI","asserted-by":"crossref"}]},{"name":"State Key Program of NSFC","award":["61533011"],"award-info":[{"award-number":["61533011"]}]},{"name":"Research and Application of Key Technology for Intelligent Dispatching and Security Early-Warning of Large Power Grid"},{"name":"Young Scholars Program of Shandong University"},{"DOI":"10.13039\/501100010880","name":"State Grid Corporation of China","doi-asserted-by":"crossref","award":["SGJS0000DKJS1700840"],"award-info":[{"award-number":["SGJS0000DKJS1700840"]}],"id":[{"id":"10.13039\/501100010880","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61602274"],"award-info":[{"award-number":["61602274"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. 
Syst."],"published-print":{"date-parts":[[2018,7,31]]},"abstract":"<jats:p>Memory intensive workloads become increasingly popular on general purpose graphics processing units (GPGPUs), and impose great challenges on the GPGPU memory subsystem design. On the other hand, with the recent development of non-volatile memory (NVM) technologies, hybrid memory combining both DRAM and NVM achieves high performance, low power, and high density simultaneously, which provides a promising main memory design for GPGPUs. In this article, we explore the shared last-level cache management for GPGPUs with consideration of the underlying hybrid main memory. To improve the overall memory subsystem performance, we exploit the characteristics of both the asymmetric read\/write latency of the hybrid main memory architecture, as well as the memory coalescing feature of GPGPUs. In particular, to reduce the average cost of L2 cache misses, we prioritize cache blocks from DRAM or NVM based on observations that operations to NVM part of main memory have a large impact on the system performance. Furthermore, the cache management scheme also integrates the GPU memory coalescing and cache bypassing techniques to improve the overall system performance. To minimize the impact of memory divergence behaviors among simultaneously executed groups of threads, we propose a hybrid main memory and warp aware memory scheduling mechanism for GPGPUs. 
Experimental results show that in the context of a hybrid main memory system, our proposed L2 cache management policy and memory scheduling mechanism improve performance by 15.69% on average for memory intensive benchmarks, whereas the maximum gain can be up to 29% and achieve an average memory subsystem energy reduction of 21.27%.<\/jats:p>","DOI":"10.1145\/3230643","type":"journal-article","created":{"date-parts":[[2018,7,31]],"date-time":"2018-07-31T15:56:23Z","timestamp":1533052583000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Shared Last-Level Cache Management and Memory Scheduling for GPGPUs with Hybrid Main Memory"],"prefix":"10.1145","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0469-3647","authenticated-orcid":false,"given":"Guan","family":"Wang","sequence":"first","affiliation":[{"name":"Shandong University, Qingdao, China"}]},{"given":"Chuanqi","family":"Zang","sequence":"additional","affiliation":[{"name":"Shandong University, Ji\u2019nan, China"}]},{"given":"Lei","family":"Ju","sequence":"additional","affiliation":[{"name":"Shandong University, Ji\u2019nan, China"}]},{"given":"Mengying","family":"Zhao","sequence":"additional","affiliation":[{"name":"Shandong University, Qingdao, China"}]},{"given":"Xiaojun","family":"Cai","sequence":"additional","affiliation":[{"name":"Shandong University, Qingdao, China"}]},{"given":"Zhiping","family":"Jia","sequence":"additional","affiliation":[{"name":"Shandong University, Qingdao, China"}]}],"member":"320","published-online":{"date-parts":[[2018,7,31]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.5555\/2337159.2337207"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2015.38"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2145816.2145820"},{"volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of 
Systems and Software. IEEE, 163--174","author":"Bakhoda Ali","key":"e_1_2_1_4_1"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2012.6402918"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2012.6168943"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.16"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2011.16"},{"volume-title":"Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann.","year":"2010","author":"Jacob Bruce","key":"e_1_2_1_10_1"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1815971"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835938"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588768.2576780"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2451116.2451158"},{"volume-title":"Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA\u201914)","author":"Khan Samira","key":"e_1_2_1_15_1"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2016.2615845"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2228360.2228519"},{"key":"e_1_2_1_18_1","unstructured":"David B. Kirk and W. Hwu Wen-Mei. 2016. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann."},{"volume-title":"Proceedings of the Workshop on Language, Compiler, and Architecture Support for GPGPU. 
1--10","author":"Nagesh","key":"e_1_2_1_19_1"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2012.6168947"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2010.44"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2013.98"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2807591.2807606"},{"volume-title":"Proceedings of the 21st International Symposium on High Performance Computer Architecture (HPCA\u201915)","author":"Li Dong","key":"e_1_2_1_24_1"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2015.2424962"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.5555\/2523721.2523753"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1815992"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2013.2278025"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.40"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2008.7"},{"key":"e_1_2_1_31_1","unstructured":"NVIDIA. 2011. NVIDIA CUDA SDK. (May 2011). https:\/\/developer.nvidia.com\/cuda-toolkit-40."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2006.49"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/1555815.1555760"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/1995896.1995911"},{"volume-title":"Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques. 
IEEE, 93--102","author":"Wang Bin","key":"e_1_2_1_35_1"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751239"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/LES.2014.2325878"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD.2014.6974658"},{"key":"e_1_2_1_39_1","unstructured":"Nicholas Wilt. 2013. The Cuda Handbook: A Comprehensive Guide to GPU Programming. Pearson Education."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056023"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/1555754.1555778"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2039370.2039420"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897937.2898110"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2429384.2429400"}],"container-title":["ACM Transactions on Embedded Computing 
Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3230643","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3230643","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T01:39:47Z","timestamp":1750210787000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3230643"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,7,31]]},"references-count":44,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2018,7,31]]}},"alternative-id":["10.1145\/3230643"],"URL":"https:\/\/doi.org\/10.1145\/3230643","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"type":"print","value":"1539-9087"},{"type":"electronic","value":"1558-3465"}],"subject":[],"published":{"date-parts":[[2018,7,31]]},"assertion":[{"value":"2017-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-05-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-07-31","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}