{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,18]],"date-time":"2025-09-18T23:23:30Z","timestamp":1758237810916,"version":"3.44.0"},"reference-count":74,"publisher":"Association for Computing Machinery (ACM)","issue":"3","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2025,9,30]]},"abstract":"<jats:p>Memory hierarchy in Graphics Processing Units (GPUs) is conventionally designed to provide high bandwidth rather than low latency. In particular, because of the high tolerance to load-to-use latency (i.e., the time that warps wait for data fetched by memory loads), GPU L1D caches are optimized for density, capacity, and low power with latencies that are often orders of magnitude longer than conventional CPU caches. However, there are many important classes of data-parallel applications (e.g., graph, tree, priority queue processing, and sparse deep learning applications) that benefit from lower load-to-use latency than that offered by modern GPUs due to their inherent divergence and low effective Thread-Level Parallelism (TLP). This article introduces an innovative on-chip cache hierarchy that incorporates a decoupled L1D cache with reduced latency (LoTUS) and its management scheme. LoTUS is a minimally sized fully associative cache placed in each GPU subcore that captures the primary working set of data-parallel applications. It exploits conventional high-performance low-density SRAM cells and dramatically reduces load-to-use latency. We also propose an intelligent extension of LoTUS, called LoTUSage, which employs a lightweight learning-based model to predict the utility of caching requests in LoTUS. 
Evaluation results show that LoTUS and LoTUSage improve the average performance by 23.9% and 35.4% and reduce the average energy consumption by 27.8% and 38.5%, respectively, for the applications suffering from high load-to-use stalls with negligible area and power overheads.<\/jats:p>","DOI":"10.1145\/3760782","type":"journal-article","created":{"date-parts":[[2025,8,18]],"date-time":"2025-08-18T11:26:44Z","timestamp":1755516404000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["A Low-latency On-chip Cache Hierarchy for Load-to-use Stall Reduction in GPUs"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5232-3539","authenticated-orcid":false,"given":"Negin (Sadat)","family":"(Nematollahi zadeh) Mahani","sequence":"first","affiliation":[{"name":"Computer Science, Barcelona Supercomputing Center","place":["Barcelona, Spain"]},{"name":"Sharif University of Technology","place":["Barcelona, Spain"]},{"name":"Shahid Bahonar University of Kerman","place":["Barcelona, Spain"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8375-3339","authenticated-orcid":false,"given":"Hajar","family":"Falahati","sequence":"additional","affiliation":[{"name":"Barcelona Supercomputing Center","place":["Barcelona, Spain"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7082-123X","authenticated-orcid":false,"given":"Sina","family":"Darabi","sequence":"additional","affiliation":[{"name":"Universit\u00e0 della Svizzera italiana","place":["Lugano, Switzerland"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-4444-150X","authenticated-orcid":false,"given":"Ahmad","family":"Javadi-Nezhad","sequence":"additional","affiliation":[{"name":"Sharif University of Technology","place":["Tehran, Iran (the Islamic Republic 
of)"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6442-3705","authenticated-orcid":false,"given":"Yunho","family":"Oh","sequence":"additional","affiliation":[{"name":"School of Electrical Engineering, Korea University","place":["Seoul, Korea (the Republic of)"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4029-0175","authenticated-orcid":false,"given":"Mohammad","family":"Sadrosadati","sequence":"additional","affiliation":[{"name":"ETH","place":["Zurich, Switzerland"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4079-8603","authenticated-orcid":false,"given":"Hamid","family":"Sarbazi-Azad","sequence":"additional","affiliation":[{"name":"Electrical & Computer Engineering, Sharif University of Technology","place":["Tehran, Iran (the Islamic Republic of)"]},{"name":"IPM","place":["Tehran, Iran (the Islamic Republic of)"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5916-8068","authenticated-orcid":false,"given":"Babak","family":"Falsafi","sequence":"additional","affiliation":[{"name":"EPFL","place":["Lausanne, Switzerland"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,9,18]]},"reference":[{"issue":"11","key":"e_1_3_2_2_2","doi-asserted-by":"crossref","first-page":"3348","DOI":"10.1109\/TCAD.2020.3012210","article-title":"Dynamic memory bandwidth allocation for real-time GPU-based SoC platforms","volume":"39","author":"Aghilinasab Homa","year":"2020","unstructured":"Homa Aghilinasab, Waqar Ali, Heechul Yun, and Rodolfo Pellizzoni. 2020. Dynamic memory bandwidth allocation for real-time GPU-based SoC platforms. 
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39, 11 (2020), 3348\u20133360.","journal-title":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2018.2873679"},{"key":"e_1_3_2_4_2","volume-title":"A COMPILER FRAMEWORK FOR OPTIMIZING DYNAMIC PARALLELISM ON GPUS","author":"Olabi Mhd Ghaith","year":"2021","unstructured":"Mhd Ghaith Olabi. 2021. A COMPILER FRAMEWORK FOR OPTIMIZING DYNAMIC PARALLELISM ON GPUS. Ph.D. Dissertation."},{"key":"e_1_3_2_5_2","first-page":"1","volume-title":"Proceedings of the 2015 IEEE Hot Chips 27 Symposium (HCS)","author":"Macri Joe","year":"2015","unstructured":"Joe Macri. 2015. AMD\u2019s next generation GPU and high bandwidth memory architecture: FURY. In Proceedings of the 2015 IEEE Hot Chips 27 Symposium (HCS). IEEE, 1\u201326."},{"key":"e_1_3_2_6_2","doi-asserted-by":"crossref","first-page":"141","DOI":"10.1109\/IISWC.2012.6402918","volume-title":"Proceedings of the 2012 IEEE International Symposium on Workload Characterization (IISWC)","author":"Burtscher Martin","year":"2012","unstructured":"Martin Burtscher, Rupesh Nasre, and Keshav Pingali. 2012. A quantitative study of irregular programs on GPUs. In Proceedings of the 2012 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 141\u2013151."},{"key":"e_1_3_2_7_2","unstructured":"Baidu. 2017. DeepBench. (2017). Retrieved February 2023 from https:\/\/github.com\/baidu-research\/DeepBench"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2020.3012514"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3429981"},{"key":"e_1_3_2_10_2","unstructured":"NVIDIA. 2017. Volta architecture Whitepaper - NVIDIA File Downloads. (2017). 
Retrieved February 2023 from https:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf"},{"key":"e_1_3_2_11_2","first-page":"1","volume-title":"Proceedings of the 51st Annual Design Automation Conference","author":"Samavatian Mohammad Hossein","year":"2014","unstructured":"Mohammad Hossein Samavatian, Hamed Abbasitabar, Mohammad Arjomand, and Hamid Sarbazi-Azad. 2014. An efficient STT-RAM last level cache architecture for GPUs. In Proceedings of the 51st Annual Design Automation Conference. 1\u20136."},{"key":"e_1_3_2_12_2","first-page":"163","volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2009, ISPASS 2009.","author":"Bakhoda Ali","year":"2009","unstructured":"Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2009, ISPASS 2009. IEEE, 163\u2013174."},{"issue":"1","key":"e_1_3_2_13_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3508036","article-title":"NURA: A framework for supporting non-uniform resource accesses in GPUs","volume":"6","author":"Darabi Sina","year":"2022","unstructured":"Sina Darabi, Negin Mahani, Hazhir Baxishi, Ehsan Yousefzadeh-Asl-Miandoab, Mohammad Sadrosadati, and Hamid Sarbazi-Azad. 2022. NURA: A framework for supporting non-uniform resource accesses in GPUs. Proceedings of the ACM on Measurement and Analysis of Computing Systems 6, 1 (2022), 1\u201327.","journal-title":"Proceedings of the ACM on Measurement and Analysis of Computing Systems"},{"key":"e_1_3_2_14_2","first-page":"37","volume-title":"Proceedings of the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","author":"Nugteren Cedric","year":"2014","unstructured":"Cedric Nugteren, Gert-Jan Van den Braak, Henk Corporaal, and Henri Bal. 
2014. A detailed GPU cache model based on reuse distance theory. In Proceedings of the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 37\u201348."},{"key":"e_1_3_2_15_2","first-page":"424","volume-title":"Proceedings of the 2019 ACM\/IEEE 46th Annual International Symposium on Computer Architecture (ISCA)","author":"Segura Albert","year":"2019","unstructured":"Albert Segura, Jose-Maria Arnau, and Antonio Gonz\u00e1lez. 2019. SCU: A GPU stream compaction unit for graph processing. In Proceedings of the 2019 ACM\/IEEE 46th Annual International Symposium on Computer Architecture (ISCA). IEEE, 424\u2013435."},{"issue":"2","key":"e_1_3_2_16_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3444844","article-title":"Grus: Toward unified-memory-efficient high-performance graph processing on GPU","volume":"18","author":"Wang Pengyu","year":"2021","unstructured":"Pengyu Wang, Jing Wang, Chao Li, Jianzong Wang, Haojin Zhu, and Minyi Guo. 2021. Grus: Toward unified-memory-efficient high-performance graph processing on GPU. ACM Transactions on Architecture and Code Optimization 18, 2 (2021), 1\u201325.","journal-title":"ACM Transactions on Architecture and Code Optimization"},{"key":"e_1_3_2_17_2","doi-asserted-by":"crossref","unstructured":"Ali Mohammadpur-Fard Sina Darabi Hajar Falahati Negin Mahani and Hamid Sarbazi-Azad. 2024. Exploiting direct memory operands in GPU instructions. IEEE Computer Architecture Letters 23 2 (2024) 162\u2013165.","DOI":"10.1109\/LCA.2024.3371062"},{"key":"e_1_3_2_18_2","unstructured":"Zhe Jia Marco Maggioni Benjamin Staiger and Daniele P. Scarpazza. 2018. Dissecting the nvidia volta GPU architecture via microbenchmarking. 
arXiv preprint arXiv:1804.06826 (2018)."},{"key":"e_1_3_2_19_2","first-page":"44","volume-title":"Proceedings of the IEEE International Symposium on Workload Characterization, 2009, IISWC 2009.","author":"Che Shuai","year":"2009","unstructured":"Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization, 2009, IISWC 2009. IEEE, 44\u201354."},{"key":"e_1_3_2_20_2","first-page":"1","volume-title":"Proceedings of the 2012 Innovative Parallel Computing (InPar)","author":"Grauer-Gray Scott","year":"2012","unstructured":"Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, and John Cavazos. 2012. Auto-tuning a high-level language targeted to GPU codes. In Proceedings of the 2012 Innovative Parallel Computing (InPar). IEEE, 1\u201310."},{"key":"e_1_3_2_21_2","unstructured":"John A. Stratton Christopher Rodrigues I-Jui Sung Nady Obeid Li-Wen Chang Nasser Anssari Geng Daniel Liu and Wen-mei W. Hwu. 2012. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing 127 7.2 (2012)."},{"key":"e_1_3_2_22_2","first-page":"185","volume-title":"Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC)","author":"Che Shuai","year":"2013","unstructured":"Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt, and Kevin Skadron. 2013. Pannotia: Understanding irregular GPGPU graph applications. In Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 185\u2013195."},{"key":"e_1_3_2_23_2","first-page":"140","volume-title":"Proceedings of the 2014 IEEE International Symposium on Workload Characterization (IISWC)","author":"Xu Qiumin","year":"2014","unstructured":"Qiumin Xu, Hyeran Jeon, and Murali Annavaram. 2014. 
Graph processing on GPUs: Where are the bottlenecks?. In Proceedings of the 2014 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 140\u2013149."},{"key":"e_1_3_2_24_2","first-page":"260","volume-title":"Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques","author":"He Bingsheng","year":"2008","unstructured":"Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong Wang. 2008. Mars: A MapReduce framework on graphics processors. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. 260\u2013269."},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/1735688.1735702"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1007\/s42979-021-00592-x"},{"key":"e_1_3_2_27_2","first-page":"1083","volume-title":"Proceedings of the 2021 ACM\/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)","author":"Wang Yang","year":"2021","unstructured":"Yang Wang, Chen Zhang, Zhiqiang Xie, Cong Guo, Yunxin Liu, and Jingwen Leng. 2021. Dual-side sparse tensor core. In Proceedings of the 2021 ACM\/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 1083\u20131095."},{"key":"e_1_3_2_28_2","first-page":"473","volume-title":"Proceedings of the 2020 ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)","author":"Khairy Mahmoud","year":"2020","unstructured":"Mahmoud Khairy, Zhesheng Shen, Tor M. Aamodt, and Timothy G. Rogers. 2020. Accel-Sim: An extensible simulation framework for validated GPU modeling. In Proceedings of the 2020 ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). 
IEEE, 473\u2013486."},{"key":"e_1_3_2_29_2","doi-asserted-by":"crossref","first-page":"407","DOI":"10.1109\/MICRO.2007.30","volume-title":"Proceedings of the 40th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO 2007)","author":"Fung Wilson WL","year":"2007","unstructured":"Wilson WL Fung, Ivan Sham, George Yuan, and Tor M. Aamodt. 2007. Dynamic warp formation and scheduling for efficient GPU control flow. In Proceedings of the 40th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO 2007). IEEE, 407\u2013420."},{"key":"e_1_3_2_30_2","first-page":"25","volume-title":"Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture","author":"Fung Wilson WL","year":"2011","unstructured":"Wilson WL Fung and Tor M Aamodt. 2011. Thread block compaction for efficient SIMT control flow. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture. IEEE, 25\u201336."},{"key":"e_1_3_2_31_2","doi-asserted-by":"crossref","first-page":"99","DOI":"10.1145\/2540708.2540718","volume-title":"Proceedings of the 46th Annual IEEE\/ACM International Symposium on Microarchitecture","author":"Rogers Timothy G.","year":"2013","unstructured":"Timothy G. Rogers, Mike O\u2019Connor, and Tor M. Aamodt. 2013. Divergence-aware warp scheduling. In Proceedings of the 46th Annual IEEE\/ACM International Symposium on Microarchitecture. 99\u2013110."},{"key":"e_1_3_2_32_2","unstructured":"NVIDIA. January 2025. CUDA C++ programming guide. (2025). 
Retrieved February 2023 from https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3085572"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3322127"},{"key":"e_1_3_2_35_2","first-page":"1","volume-title":"Proceedings of the 2019 IEEE\/ACM International Symposium on Low Power Electronics and Design (ISLPED)","author":"Do Cong Thuan","year":"2019","unstructured":"Cong Thuan Do, Young-Ho Gong, Cheol Hong Kim, Seon Wook Kim, and Sung Woo Chung. 2019. Exploring the relation between monolithic 3D L1 GPU cache capacity and warp scheduling efficiency. In Proceedings of the 2019 IEEE\/ACM International Symposium on Low Power Electronics and Design (ISLPED). IEEE, 1\u20136."},{"key":"e_1_3_2_36_2","doi-asserted-by":"crossref","first-page":"74","DOI":"10.1016\/j.vlsi.2017.02.002","article-title":"Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm","volume":"58","author":"Stillmaker Aaron","year":"2017","unstructured":"Aaron Stillmaker and Bevan Baas. 2017. Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm. Integration 58 (2017), 74\u201381.","journal-title":"Integration"},{"key":"e_1_3_2_37_2","first-page":"111","volume-title":"Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques","author":"Lee Jungseob","year":"2011","unstructured":"Jungseob Lee, Vijay Sathisha, Michael Schulte, Katherine Compton, and Nam Sung Kim. 2011. Improving throughput of power-constrained GPUs using dynamic voltage\/frequency and core scaling. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques. 
IEEE, 111\u2013120."},{"issue":"11","key":"e_1_3_2_38_2","doi-asserted-by":"crossref","first-page":"58","DOI":"10.1145\/1839676.1839694","article-title":"Understanding throughput-oriented architectures","volume":"53","author":"Garland Michael","year":"2010","unstructured":"Michael Garland and David B. Kirk. 2010. Understanding throughput-oriented architectures. Communications of the ACM 53, 11 (2010), 58\u201366.","journal-title":"Communications of the ACM"},{"key":"e_1_3_2_39_2","first-page":"1","volume-title":"Proceedings of the 2016 IEEE\/ACM International Conference on Computer-Aided Design (ICCAD)","author":"Guthaus Matthew R.","year":"2016","unstructured":"Matthew R. Guthaus, James E. Stine, Samira Ataei, Brian Chen, Bin Wu, and Mehedi Sarwar. 2016. OpenRAM: An open-source memory compiler. In Proceedings of the 2016 IEEE\/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 1\u20136."},{"issue":"2","key":"e_1_3_2_40_2","doi-asserted-by":"crossref","first-page":"421","DOI":"10.31577\/cai_2019_2_421","article-title":"RDGC: A reuse distance-based approach to GPU cache performance analysis","volume":"38","author":"Kiani Mohsen","year":"2019","unstructured":"Mohsen Kiani and Amir Rajabzadeh. 2019. RDGC: A reuse distance-based approach to GPU cache performance analysis. Computing and Informatics 38, 2 (2019), 421\u2013453.","journal-title":"Computing and Informatics"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3291051"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/986537.986601"},{"key":"e_1_3_2_43_2","first-page":"67","volume-title":"Proceedings of the 29th ACM on International Conference on Supercomputing","author":"Li Chao","year":"2015","unstructured":"Chao Li, Shuaiwen Leon Song, Hongwen Dai, Albert Sidelnik, Siva Kumar Sastry Hari, and Huiyang Zhou. 2015. Locality-driven dynamic GPU cache bypassing. In Proceedings of the 29th ACM on International Conference on Supercomputing. 
67\u201377."},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2013.6522351"},{"key":"e_1_3_2_45_2","first-page":"582","volume-title":"Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","author":"Ren Xiaowei","year":"2020","unstructured":"Xiaowei Ren, Daniel Lustig, Evgeny Bolotin, Aamer Jaleel, Oreste Villa, and David Nellans. 2020. Hmg: Extending cache coherence protocols across modern hierarchical multi-GPU systems. In Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 582\u2013595."},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/2889488"},{"key":"e_1_3_2_47_2","first-page":"647","volume-title":"Proceedings of the 2015 48th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO)","author":"Sinclair Matthew D.","year":"2015","unstructured":"Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2015. Efficient GPU synchronization without scopes: Saying no to complex consistency models. In Proceedings of the 2015 48th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO). 647\u2013659. DOI:10.1145\/2830772.2830821"},{"issue":"2","key":"e_1_3_2_48_2","first-page":"1","article-title":"Efficient processing of deep neural networks","volume":"15","author":"Sze Vivienne","year":"2020","unstructured":"Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2020. Efficient processing of deep neural networks. 
Synthesis Lectures on Computer Architecture 15, 2 (2020), 1\u2013341.","journal-title":"Synthesis Lectures on Computer Architecture"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358319"},{"key":"e_1_3_2_50_2","doi-asserted-by":"crossref","first-page":"728","DOI":"10.1145\/3613424.3623782","volume-title":"Proceedings of the 56th Annual IEEE\/ACM International Symposium on Microarchitecture","author":"Mostofi Saba","year":"2023","unstructured":"Saba Mostofi, Hajar Falahati, Negin Mahani, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. 2023. Snake: A variable-length chain-based prefetching for GPUs. In Proceedings of the 56th Annual IEEE\/ACM International Symposium on Microarchitecture. 728\u2013741."},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/3466752.3480063"},{"key":"e_1_3_2_52_2","unstructured":"Zhe Jia Marco Maggioni Jeffrey Smith and Daniele Paolo Scarpazza. 2019. Dissecting the NVidia turing T4 GPU via microbenchmarking. arXiv preprint arXiv:1903.07486 (2019)."},{"key":"e_1_3_2_53_2","first-page":"183","volume-title":"Proceedings of the 46th International Symposium on Computer Architecture","author":"Oh Yunho","year":"2019","unstructured":"Yunho Oh, Gunjae Koo, Murali Annavaram, and Won Woo Ro. 2019. Linebacker: Preserving victim cache lines in idle register files of GPUs. In Proceedings of the 46th International Symposium on Computer Architecture. 183\u2013196."},{"key":"e_1_3_2_54_2","first-page":"137","volume-title":"Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","author":"Koo Gunjae","year":"2018","unstructured":"Gunjae Koo, Hyeran Jeon, Zhenhong Liu, Nam Sung Kim, and Murali Annavaram. 2018. Cta-aware prefetching and scheduling for GPU. In Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 137\u2013148."},{"key":"e_1_3_2_55_2","unstructured":"AMD. 2021. RDNA architecture. (2021). 
Retrieved February 2023 from https:\/\/GPUopen.com\/wp-content\/uploads\/2019\/08\/RDNA_Architecture_public.pdf"},{"key":"e_1_3_2_56_2","unstructured":"ISS Group at the University of Texas. 2020. LonestarGPU. (2020). Retrieved February 2023 from https:\/\/iss.oden.utexas.edu\/?p=projects\/galois\/lonestargpu"},{"key":"e_1_3_2_57_2","volume-title":"Proceedings of the WMT","author":"others Bojar, Ondrej and Chatterjee, Rajen and Federmann, Christian and Graham, Yvette and Haddow, Barry and Huck, Matthias and Yepes, Antonio Jimeno and Koehn, Philipp and Logacheva, Varvara and Monz, Christof and","year":"2016","unstructured":"Bojar, Ondrej and Chatterjee, Rajen and Federmann, Christian and Graham, Yvette and Haddow, Barry and Huck, Matthias and Yepes, Antonio Jimeno and Koehn, Philipp and Logacheva, Varvara and Monz, Christof and others. 2016. Findings of the 2016 conference on machine translation (wmt16). In Proceedings of the WMT."},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.596"},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_60_2","doi-asserted-by":"crossref","unstructured":"John S. Garofolo Lori F. Lamel William M. Fisher Jonathan G. Fiscus and David S. Pallett. 1993. DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI\/Recon Technical Report no. 93 (1993) 27403.","DOI":"10.6028\/NIST.IR.4930"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1145\/3337821.3337886"},{"key":"e_1_3_2_62_2","first-page":"1","volume-title":"Proceedings of the 47th International Conference on Parallel Processing","author":"Zhu Xian","year":"2018","unstructured":"Xian Zhu, Robert Wernsman, and Joseph Zambreno. 2018. Improving first level cache efficiency for GPUs using dynamic line protection. In Proceedings of the 47th International Conference on Parallel Processing. 
1\u201310."},{"key":"e_1_3_2_63_2","first-page":"272","volume-title":"Proceedings of the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","author":"Jia Wenhao","year":"2014","unstructured":"Wenhao Jia, Kelly A. Shaw, and Margaret Martonosi. 2014. MRPB: Memory request prioritization for massively parallel processors. In Proceedings of the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 272\u2013283."},{"key":"e_1_3_2_64_2","first-page":"332","volume-title":"Proceedings of the 40th Annual International Symposium on Computer Architecture","author":"Jog Adwait","year":"2013","unstructured":"Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. Orchestrated scheduling and prefetching for GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture. 332\u2013343."},{"key":"e_1_3_2_65_2","volume-title":"Design and Analysis of Memory Management Techniques for Next-Generation GPUS","author":"Wang Haonan","year":"2020","unstructured":"Haonan Wang. 2020. Design and Analysis of Memory Management Techniques for Next-Generation GPUS. The College of William and Mary."},{"issue":"2","key":"e_1_3_2_66_2","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1007\/s10766-022-00729-2","article-title":"A quantitative study of locality in GPU caches for memory-divergent workloads","volume":"50","author":"Lal Sohan","year":"2022","unstructured":"Sohan Lal, Bogaraju Sharatchandra Varma, and Ben Juurlink. 2022. A quantitative study of locality in GPU caches for memory-divergent workloads. 
International Journal of Parallel Programming 50, 2 (2022), 189\u2013216.","journal-title":"International Journal of Parallel Programming"},{"issue":"5","key":"e_1_3_2_67_2","doi-asserted-by":"crossref","first-page":"5421","DOI":"10.1007\/s11227-022-04878-6","article-title":"Aggressive GPU cache bypassing with monolithic 3D-based NoC","volume":"79","author":"Do Cong Thuan","year":"2023","unstructured":"Cong Thuan Do, Cheol Hong Kim, and Sung Woo Chung. 2023. Aggressive GPU cache bypassing with monolithic 3D-based NoC. The Journal of Supercomputing 79, 5 (2023), 5421\u20135442.","journal-title":"The Journal of Supercomputing"},{"issue":"5","key":"e_1_3_2_68_2","doi-asserted-by":"crossref","first-page":"1479","DOI":"10.1109\/TPDS.2023.3247808","article-title":"LAS: Locality-aware scheduling for GEMM-accelerated convolutions in GPUs","volume":"34","author":"Kim Hyeonjin","year":"2023","unstructured":"Hyeonjin Kim and William J. Song. 2023. LAS: Locality-aware scheduling for GEMM-accelerated convolutions in GPUs. IEEE Transactions on Parallel and Distributed Systems 34, 5 (2023), 1479\u20131494.","journal-title":"IEEE Transactions on Parallel and Distributed Systems"},{"key":"e_1_3_2_69_2","first-page":"863","volume-title":"Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","author":"Li Jialin","year":"2022","unstructured":"Jialin Li, Huang Ye, Shaobo Tian, Xinyuan Li, and Jian Zhang. 2022. A fine-grained prefetching scheme for DGEMM kernels on GPU with auto-tuning compatibility. In Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 
IEEE, 863\u2013874."},{"key":"e_1_3_2_70_2","doi-asserted-by":"publisher","DOI":"10.1145\/280756.280788"},{"key":"e_1_3_2_71_2","doi-asserted-by":"crossref","first-page":"96","DOI":"10.1109\/MICRO.2012.18","volume-title":"Proceedings of the 2012 45th Annual IEEE\/ACM International Symposium on Microarchitecture","author":"Gebhart Mark","year":"2012","unstructured":"Mark Gebhart, Stephen W. Keckler, Brucek Khailany, Ronny Krashinsky, and William J. Dally. 2012. Unifying primary cache, scratch, and register file memories in a throughput processor. In Proceedings of the 2012 45th Annual IEEE\/ACM International Symposium on Microarchitecture. IEEE, 96\u2013106."},{"key":"e_1_3_2_72_2","first-page":"1919","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Hashemi Milad","year":"2018","unstructured":"Milad Hashemi, Kevin Swersky, Jamie Smith, Grant Ayers, Heiner Litz, Jichuan Chang, Christos Kozyrakis, and Parthasarathy Ranganathan. 2018. Learning memory access patterns. In Proceedings of the International Conference on Machine Learning. PMLR, 1919\u20131928."},{"key":"e_1_3_2_73_2","doi-asserted-by":"publisher","DOI":"10.1145\/3307650.3322207"},{"key":"e_1_3_2_74_2","doi-asserted-by":"crossref","first-page":"291","DOI":"10.1109\/HPCA51647.2021.00033","volume-title":"Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)","author":"Sethumurugan Subhash","year":"2021","unstructured":"Subhash Sethumurugan, Jieming Yin, and John Sartori. 2021. Designing a cost-effective cache replacement policy using machine learning. In Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 291\u2013303."},{"key":"e_1_3_2_75_2","doi-asserted-by":"crossref","unstructured":"Xinjian Long Xiangyang Gong Bo Zhang and Huiyang Zhou. 2023. Deep learning based data prefetching in CPU-GPU unified virtual memory. 
Journal of Parallel and Distributed Computing 174 (2023) 19\u201331.","DOI":"10.1016\/j.jpdc.2022.12.004"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3760782","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,18]],"date-time":"2025-09-18T14:08:52Z","timestamp":1758204532000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3760782"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,18]]},"references-count":74,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,9,30]]}},"alternative-id":["10.1145\/3760782"],"URL":"https:\/\/doi.org\/10.1145\/3760782","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2025,9,18]]},"assertion":[{"value":"2024-09-23","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-21","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-18","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}