{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,12]],"date-time":"2026-05-12T16:37:07Z","timestamp":1778603827097,"version":"3.51.4"},"reference-count":38,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2019,6,17]],"date-time":"2019-06-17T00:00:00Z","timestamp":1560729600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Fundamental Research Funds for the Central Universities of Civil Aviation University of China","award":["3122018C023, 3122018C021"],"award-info":[{"award-number":["3122018C023, 3122018C021"]}]},{"name":"Scientific Research Foundation of Civil Aviation University of China","award":["2017QD12S"],"award-info":[{"award-number":["2017QD12S"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61702521"],"award-info":[{"award-number":["61702521"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000001","name":"U.S. National Science Foundation","doi-asserted-by":"crossref","award":["CNS 17-05047"],"award-info":[{"award-number":["CNS 17-05047"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100006606","name":"Natural Science Foundation of Tianjin","doi-asserted-by":"crossref","award":["18JCQNJC00400"],"award-info":[{"award-number":["18JCQNJC00400"]}],"id":[{"id":"10.13039\/501100006606","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. 
Code Optim."],"published-print":{"date-parts":[[2019,9,30]]},"abstract":"<jats:p>GPUs provide high-bandwidth\/low-latency on-chip shared memory and L1 cache to efficiently service a large number of concurrent memory requests. Specifically, concurrent memory requests accessing contiguous memory space are coalesced into warp-wide accesses. To support such large accesses to L1 cache with low latency, the size of L1 cache line is no smaller than that of warp-wide accesses. However, such L1 cache architecture cannot always be efficiently utilized when applications generate many memory requests with irregular access patterns especially due to branch and memory divergences that make requests uncoalesced and small. Furthermore, unlike L1 cache, the shared memory of GPUs is not often used in many applications, which essentially depends on programmers. In this article, we propose Elastic-Cache, which can efficiently support both fine- and coarse-grained L1 cache line management for applications with both regular and irregular memory access patterns to improve the L1 cache efficiency. Specifically, it can store 32- or 64-byte words in non-contiguous memory space to a single 128-byte cache line. Furthermore, it neither requires an extra memory structure nor reduces the capacity of L1 cache for tag storage, since it stores auxiliary tags for fine-grained L1 cache line managements in the shared memory space that is not fully used in many applications. To improve the bandwidth utilization of L1 cache with Elastic-Cache for fine-grained accesses, we further propose Elastic-Plus to issue 32-byte memory requests in parallel, which can reduce the processing latency of memory instructions and improve the throughput of GPUs. Our experiment result shows that Elastic-Cache improves the geometric-mean performance of applications with irregular memory access patterns by 104% without degrading the performance of applications with regular memory access patterns. 
Elastic-Plus outperforms Elastic-Cache and improves the performance of applications with irregular memory access patterns by 131%.<\/jats:p>","DOI":"10.1145\/3322127","type":"journal-article","created":{"date-parts":[[2019,6,18]],"date-time":"2019-06-18T12:14:26Z","timestamp":1560860066000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":16,"title":["An Efficient GPU Cache Architecture for Applications with Irregular Memory Access Patterns"],"prefix":"10.1145","volume":"16","author":[{"given":"Bingchao","family":"Li","sequence":"first","affiliation":[{"name":"Civil Aviation University of China & Tianjin University, China"}]},{"given":"Jizeng","family":"Wei","sequence":"additional","affiliation":[{"name":"Tianjin University, Tianjin, China"}]},{"given":"Jizhou","family":"Sun","sequence":"additional","affiliation":[{"name":"Tianjin University, Tianjin, China"}]},{"given":"Murali","family":"Annavaram","sequence":"additional","affiliation":[{"name":"University of Southern California, Los Angeles, California, USA"}]},{"given":"Nam Sung","family":"Kim","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, Urbana, Illinois, USA"}]}],"member":"320","published-online":{"date-parts":[[2019,6,17]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201909)","author":"Bakhoda A.","unstructured":"A. Bakhoda , G. L. Yuan , W. W. L. Fung , H. Wong , and T. M. Aamodt . 2009. Analyzing CUDA workloads using a detailed GPU simulator . In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201909) . 163--174. A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. 
In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201909). 163--174."},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the Symposium on VLSI Circuits. 152--153","author":"Chang J.","unstructured":"J. Chang , S. Chen , W. Chen , S. Chiu , R. Faber , R. Ganesan , M. Grgek , V. Lukka , W. W. Mar , J. Vash , S. Rusu , and K. Zhang . 2009. A 45nm 24MB on-die L3 cache for the 8-core multi-threaded Xeon processor . In Proceedings of the Symposium on VLSI Circuits. 152--153 . J. Chang, S. Chen, W. Chen, S. Chiu, R. Faber, R. Ganesan, M. Grgek, V. Lukka, W. W. Mar, J. Vash, S. Rusu, and K. Zhang. 2009. A 45nm 24MB on-die L3 cache for the 8-core multi-threaded Xeon processor. In Proceedings of the Symposium on VLSI Circuits. 152--153."},{"key":"e_1_2_1_3_1","volume-title":"Proceedings of the IEEE\/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201914)","author":"Chatterjee N.","unstructured":"N. Chatterjee , M. O\u2019Connor , G. H. Loh , N. Jayasena , and R. Balasubramonia . 2014. Managing DRAM latency divergence in irregular GPGPU applications . In Proceedings of the IEEE\/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201914) . 128--139. N. Chatterjee, M. O\u2019Connor, G. H. Loh, N. Jayasena, and R. Balasubramonia. 2014. Managing DRAM latency divergence in irregular GPGPU applications. In Proceedings of the IEEE\/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201914). 
128--139."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2013.6704684"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.11"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1400181.1400197"},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201912)","author":"Gebhart Mark","unstructured":"Mark Gebhart , Stephen W. Keckler , Brucek Khailany , Ronny Krashinsky , and William J. Dally . 2012. Unifying primary cache, scratch, and register file memories in a throughput processor . In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201912) . 96--106. Mark Gebhart, Stephen W. Keckler, Brucek Khailany, Ronny Krashinsky, and William J. Dally. 2012. Unifying primary cache, scratch, and register file memories in a throughput processor. In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201912). 96--106."},{"key":"e_1_2_1_8_1","volume-title":"Proceedings of the Conference on Innovative Parallel Computing (InPar\u201912)","author":"Grauer-Gray S.","unstructured":"S. Grauer-Gray , L. Xu , R. Searles , S. Ayalasomayajula , and J. Cavazos . 2012. Auto-tuning a high-level language targeted to GPU codes . In Proceedings of the Conference on Innovative Parallel Computing (InPar\u201912) . 1--10. S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. 2012. Auto-tuning a high-level language targeted to GPU codes. In Proceedings of the Conference on Innovative Parallel Computing (InPar\u201912). 1--10."},{"key":"e_1_2_1_9_1","unstructured":"Mark Harris. 2013. Using shared memory in CUDA C\/C++. NVIDIA Developer Blog. Retrieved from https:\/\/devblogs.nvidia.com\/parallelforall\/using-shared-memory-cuda-cc\/.  Mark Harris. 2013. Using shared memory in CUDA C\/C++. NVIDIA Developer Blog. 
Retrieved from https:\/\/devblogs.nvidia.com\/parallelforall\/using-shared-memory-cuda-cc\/."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1454115.1454152"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830784"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835938"},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)","author":"Jing N.","unstructured":"N. Jing , J. Wang , F. Fan , W. Yu , L. Jiang , C. Li , and X. Liang . 2016. Cache-emulated register file: An integrated on-chip memory architecture for high performance GPGPUs . In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916) . 1--12. N. Jing, J. Wang, F. Fan, W. Yu, L. Jiang, C. Li, and X. Liang. 2016. Cache-emulated register file: An integrated on-chip memory architecture for high performance GPGPUs. In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916). 1--12."},{"key":"e_1_2_1_14_1","unstructured":"M. Khairy J. Akshay T. Aamodt and T. G. Rogers. 2018. Exploring modern GPU memory system design challenges through accurate modeling. Computing Research Repository (CoRR) vol. abs\/1810.07269.  M. Khairy J. Akshay T. Aamodt and T. G. Rogers. 2018. Exploring modern GPU memory system design challenges through accurate modeling. Computing Research Repository (CoRR) vol. abs\/1810.07269."},{"key":"e_1_2_1_15_1","volume-title":"Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201915)","author":"Khorasani F.","unstructured":"F. Khorasani , R. Gupta , and L. N. Bhuyan . 2015. Efficient warp execution in presence of divergence with collaborative context collection . In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201915) . 204--215. F. Khorasani, R. Gupta, and L. N. Bhuyan. 2015. 
Efficient warp execution in presence of divergence with collaborative context collection. In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201915). 204--215."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.42"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485964"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751237"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2008.31"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1147\/sj.71.0015"},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the IEEE International Symposium on Workload Characterization (IISWC\u201912)","author":"Pingali Keshav","year":"2012","unstructured":"Keshav Pingali , Martin Burtscher , Rupesh Nasre . 2012 . A quantitative study of irregular programs on GPUs . In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC\u201912) . 141--151. Keshav Pingali, Martin Burtscher, Rupesh Nasre. 2012. A quantitative study of irregular programs on GPUs. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC\u201912). 141--151."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1815992"},{"key":"e_1_2_1_23_1","volume-title":"Jouppi","author":"Balasubramonian Rajeev","year":"2009","unstructured":"Rajeev Balasubramonian , Naveen Muralimanohar , and Norman P . Jouppi . 2009 . CACTI 6.0: A Tool to Model Large Caches. Technical Report. HP Laboratories . https:\/\/www.hpl.hp.com\/techreports\/2009\/HPL-2009-85.pdf. Rajeev Balasubramonian, Naveen Muralimanohar, and Norman P. Jouppi. 2009. CACTI 6.0: A Tool to Model Large Caches. Technical Report. HP Laboratories. https:\/\/www.hpl.hp.com\/techreports\/2009\/HPL-2009-85.pdf."},{"key":"e_1_2_1_24_1","unstructured":"NVIDIA Corporation. 2014. NVIDIA GeForce GTX 980. 
https:\/\/www.techpowerup.com\/gpu-specs\/docs\/nvidia-gtx-980.pdf.  NVIDIA Corporation. 2014. NVIDIA GeForce GTX 980. https:\/\/www.techpowerup.com\/gpu-specs\/docs\/nvidia-gtx-980.pdf."},{"key":"e_1_2_1_25_1","unstructured":"NVIDIA Corporation. 2015. NVIDIA CUDA C Programming Guide. https:\/\/docs.nvidia.com\/cuda\/archive\/8.0\/pdf\/CUDA_C_Programming_Guide.pdf.  NVIDIA Corporation. 2015. NVIDIA CUDA C Programming Guide. https:\/\/docs.nvidia.com\/cuda\/archive\/8.0\/pdf\/CUDA_C_Programming_Guide.pdf."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2540708.2540717"},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201912)","author":"Rogers T. G.","unstructured":"T. G. Rogers , M. O\u2019Connor , and T. M. Aamodt . 2012. Cache-conscious wavefront scheduling . In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201912) . 72--83. T. G. Rogers, M. O\u2019Connor, and T. M. Aamodt. 2012. Cache-conscious wavefront scheduling. In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201912). 72--83."},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201913)","author":"Rogers Timothy G.","unstructured":"Timothy G. Rogers , Mike O\u2019Connor , and Tor M. Aamodt . 2013. Divergence-aware warp scheduling . In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201913) . 99--110. Timothy G. Rogers, Mike O\u2019Connor, and Tor M. Aamodt. 2013. Divergence-aware warp scheduling. In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201913). 
99--110."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.1994.288133"},{"key":"e_1_2_1_30_1","volume-title":"Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA\u201913)","author":"Singh I.","unstructured":"I. Singh , A. Shriraman , W. W. L. Fung , M. O\u2019Connor , and T. M. Aamodt . 2013. Cache coherence for GPU architectures . In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA\u201913) . 578--590. I. Singh, A. Shriraman, W. W. L. Fung, M. O\u2019Connor, and T. M. Aamodt. 2013. Cache coherence for GPU architectures. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA\u201913). 578--590."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1147\/JRD.2014.2376112"},{"key":"e_1_2_1_32_1","volume-title":"Inside Kepler. GPGPU workshop of Department of Computer Science","author":"Ujaldon Manuel","year":"2015","unstructured":"Manuel Ujaldon . 2015 . Inside Kepler. GPGPU workshop of Department of Computer Science , University of Cape Town. Retrieved from http:\/\/gpu.cs.uct.ac.za\/Slides\/Kepler.pdf. Manuel Ujaldon. 2015. Inside Kepler. GPGPU workshop of Department of Computer Science, University of Cape Town. Retrieved from http:\/\/gpu.cs.uct.ac.za\/Slides\/Kepler.pdf."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/305138.305188"},{"key":"e_1_2_1_34_1","volume-title":"Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA\u201915)","author":"Xie X.","unstructured":"X. Xie , Y. Liang , Y. Wang , G. Sun , and T. Wang . 2015. Coordinated static and dynamic cache bypassing for GPUs . In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA\u201915) . 76--88. X. Xie, Y. Liang, Y. Wang, G. Sun, and T. Wang. 2015. Coordinated static and dynamic cache bypassing for GPUs. 
In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA\u201915). 76--88."},{"key":"e_1_2_1_35_1","volume-title":"Proceedings of the IEEE International Symposium on Workload Characterization (IISWC\u201914)","author":"Xu Qiumin","unstructured":"Qiumin Xu , Hyeran Jeon , and M. Annavaram . 2014. Graph processing on GPUs: Where are the bottlenecks? In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC\u201914) . 140--149. Qiumin Xu, Hyeran Jeon, and M. Annavaram. 2014. Graph processing on GPUs: Where are the bottlenecks? In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC\u201914). 140--149."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2370816.2370858"},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the IEEE\/ACM International Symposium on Computer Architecture (ISCA\u201916)","author":"Yoon M. K.","unstructured":"M. K. Yoon , K. Kim , S. Lee , W. W. Ro , and M. Annavaram . 2016. Virtual thread: Maximizing thread-level parallelism beyond GPU scheduling limit . In Proceedings of the IEEE\/ACM International Symposium on Computer Architecture (ISCA\u201916) . 609--621. M. K. Yoon, K. Kim, S. Lee, W. W. Ro, and M. Annavaram. 2016. Virtual thread: Maximizing thread-level parallelism beyond GPU scheduling limit. In Proceedings of the IEEE\/ACM International Symposium on Computer Architecture (ISCA\u201916). 
609--621."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2014.2359882"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3322127","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3322127","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T17:49:05Z","timestamp":1750268945000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3322127"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,6,17]]},"references-count":38,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2019,9,30]]}},"alternative-id":["10.1145\/3322127"],"URL":"https:\/\/doi.org\/10.1145\/3322127","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,6,17]]},"assertion":[{"value":"2018-11-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-03-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-06-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}