{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T15:33:30Z","timestamp":1772724810640,"version":"3.50.1"},"reference-count":117,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2024,9,14]],"date-time":"2024-09-14T00:00:00Z","timestamp":1726272000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2024,9,30]]},"abstract":"<jats:p>\n            Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application domains, because they can accelerate massively parallel workloads and can be easily programmed using general-purpose programming frameworks such as CUDA and OpenCL. Each Streaming Multiprocessor (SM) contains an L1 data cache (L1D) to exploit the locality in data accesses. L1D misses are costly for GPUs for two reasons. First, L1D misses consume a lot of energy as they need to access the L2 cache (L2) via an on-chip network and the off-chip DRAM in case of L2 misses. Second, L1D misses impose performance overhead if the GPU does not have enough active warps to hide the long memory access latency. We observe that threads running on different SMs share 55% of the data they read from the memory. Unfortunately, as the L1Ds are in the non-coherent memory domain, each SM independently fetches data from the L2 or the off-chip memory into its L1D, even though the data may be currently available in the L1D of another SM. Our goal is to service L1D read misses via other SMs, as much as possible, to cut down costly accesses to the L2 or the off-chip DRAM. 
To this end, we propose a new data-sharing mechanism, called\n            <jats:italic>Cross-Core Data Sharing (CCDS)<\/jats:italic>\n            . CCDS employs a predictor to estimate whether the required cache block exists in another SM. If the block is predicted to exist in another SM\u2019s L1D, then CCDS fetches the data from the L1D that contains the block. Our experiments on a suite of 26 workloads show that CCDS improves average energy and performance by 1.30\u00d7 and 1.20\u00d7, respectively, compared to the baseline GPU. Compared to the state-of-the-art data-sharing mechanism, CCDS improves average energy and performance by 1.37\u00d7 and 1.11\u00d7, respectively.\n          <\/jats:p>","DOI":"10.1145\/3653019","type":"journal-article","created":{"date-parts":[[2024,3,18]],"date-time":"2024-03-18T14:25:56Z","timestamp":1710771956000},"page":"1-32","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Cross-core Data Sharing for Energy-efficient GPUs"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8375-3339","authenticated-orcid":false,"given":"Hajar","family":"Falahati","sequence":"first","affiliation":[{"name":"Sharif University of Technology, School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4029-0175","authenticated-orcid":false,"given":"Mohammad","family":"Sadrosadati","sequence":"additional","affiliation":[{"name":"School of Computer Science, IPM, Tehran, Iran"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1391-3397","authenticated-orcid":false,"given":"Qiumin","family":"Xu","sequence":"additional","affiliation":[{"name":"University of Southern California, Los Angeles, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6514-1571","authenticated-orcid":false,"given":"Juan","family":"G\u00f3mez-Luna","sequence":"additional","affiliation":[{"name":"ETH Z\u00fcrich, Z\u00fcrich, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3735-9191","authenticated-orcid":false,"given":"Banafsheh","family":"Saber Latibari","sequence":"additional","affiliation":[{"name":"Sharif University of Technology, Tehran, Iran"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1767-8198","authenticated-orcid":false,"given":"Hyeran","family":"Jeon","sequence":"additional","affiliation":[{"name":"San Jos\u00e9 State University, San Jose, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3193-2567","authenticated-orcid":false,"given":"Shaahin","family":"Hesaabi","sequence":"additional","affiliation":[{"name":"Sharif University of Technology, Tehran, Iran"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4079-8603","authenticated-orcid":false,"given":"Hamid","family":"Sarbazi-Azad","sequence":"additional","affiliation":[{"name":"Sharif University of Technology, School of Computer Science, IPM, Tehran, Iran"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0075-2312","authenticated-orcid":false,"given":"Onur","family":"Mutlu","sequence":"additional","affiliation":[{"name":"ETH Z\u00fcrich, Carnegie Mellon University, Z\u00fcrich, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4633-6867","authenticated-orcid":false,"given":"Murali","family":"Annavaram","sequence":"additional","affiliation":[{"name":"University of Southern California, Los Angeles, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2677-7307","authenticated-orcid":false,"given":"Masoud","family":"Pedram","sequence":"additional","affiliation":[{"name":"University of Southern California, Los Angeles, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,9,14]]},"reference":[{"key":"e_1_3_2_2_2","volume-title":"Whitepaper: NVIDIA\u2019s Next Generation CUDA Compute Architecture: Fermi","year":"2009","unstructured":"2009. Whitepaper: NVIDIA\u2019s Next Generation CUDA Compute Architecture: Fermi. Technical Report. NVIDIA."},{"key":"e_1_3_2_3_2","volume-title":"Whitepaper: NVIDIA\u2019s Next Generation CUDA Compute Architecture: Kepler GK110","year":"2012","unstructured":"2012. Whitepaper: NVIDIA\u2019s Next Generation CUDA Compute Architecture: Kepler GK110. Technical Report. NVIDIA."},{"key":"e_1_3_2_4_2","volume-title":"Whitepaper: NVIDIA GeForce GTX980","year":"2014","unstructured":"2014. Whitepaper: NVIDIA GeForce GTX980. Technical Report. NVIDIA."},{"key":"e_1_3_2_5_2","volume-title":"Whitepaper: NVIDIA GeForce GP100","year":"2016","unstructured":"2016. Whitepaper: NVIDIA GeForce GP100. Technical Report. NVIDIA."},{"key":"e_1_3_2_6_2","volume-title":"HPCA","author":"Abdel-Majeed Mohammad","year":"2013","unstructured":"Mohammad Abdel-Majeed and Murali Annavaram. 2013. Warped register file: A power efficient register file for GPGPUs. In HPCA."},{"key":"e_1_3_2_7_2","volume-title":"HPCA","author":"Abdel-Majeed Mohammad","year":"2017","unstructured":"Mohammad Abdel-Majeed, Alireza Shafaei, Hyeran Jeon, Massoud Pedram, and Murali Annavaram. 2017. Pilot register file: Energy efficient partitioned register file for GPUs. In HPCA."},{"key":"e_1_3_2_8_2","volume-title":"MICRO","author":"Abdel-Majeed Mohammad","year":"2013","unstructured":"Mohammad Abdel-Majeed, Daniel Wong, and Murali Annavaram. 2013. 
Warped gates: Gating aware scheduling and power gating for GPGPUs. In MICRO."},{"key":"e_1_3_2_9_2","volume-title":"ICS","author":"Abdel-Majeed Mohammad","year":"2016","unstructured":"Mohammad Abdel-Majeed, Daniel Wong, Justin Kuang, and Murali Annavaram. 2016. Origami: Folding warps for energy efficient GPUs. In ICS."},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2016.7446089"},{"key":"e_1_3_2_11_2","unstructured":"Ehoud Ahronovitz Jean-Pierre Aubert and Christophe Fiorio. 1995. The star-topology: A topology for image analysis. In DGCI\u201905: 5th International Conference on Discrete Geometry for Computer Imagery."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD.2009.5413147"},{"key":"e_1_3_2_13_2","volume-title":"ISCA","year":"2017","unstructured":"Akhil Arunkumar, Evgeny Bolotin, Benjamin Cho, Ugljesa Milic, Eiman Ebrahimi, Oreste Villa, Aamer Jaleel, Carole-Jean Wu, and David Nellans. 2017. MCM-GPU: Multi-chip-module GPUs for continued performance scalability. In ISCA."},{"key":"e_1_3_2_14_2","volume-title":"ISCA","author":"Ausavarungnirun Rachata","year":"2012","unstructured":"Rachata Ausavarungnirun, Kevin Kai-Wei Chang, Lavanya Subramanian, Gabriel H. Loh, and Onur Mutlu. 2012. Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems. In ISCA."},{"key":"e_1_3_2_15_2","volume-title":"PACT","author":"Ausavarungnirun R.","year":"2015","unstructured":"R. Ausavarungnirun, S. Ghose, O. Kayiran, G. H. Loh, C. R. Das, M. T. Kandemir, and O. Mutlu. 2015. Exploiting inter-warp heterogeneity to improve GPGPU performance. In PACT."},{"key":"e_1_3_2_16_2","volume-title":"PACT","author":"Ausavarungnirun Rachata","year":"2015","unstructured":"Rachata Ausavarungnirun, Saugata Ghose, Onur Kayiran, Gabriel H. Loh, Chita R. Das, Mahmut T. Kandemir, and Onur Mutlu. 2015. Exploiting inter-warp heterogeneity to improve GPGPU performance. 
In PACT."},{"key":"e_1_3_2_17_2","volume-title":"MICRO","author":"Ausavarungnirun Rachata","year":"2017","unstructured":"Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J. Rossbach, and Onur Mutlu. 2017. Mosaic: A GPU memory manager with application-transparent support for multiple page sizes. In MICRO."},{"key":"e_1_3_2_18_2","volume-title":"ASPLOS","author":"Ausavarungnirun Rachata","year":"2018","unstructured":"Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Christopher J. Rossbach, and Onur Mutlu. 2018. Mask: Redesigning the GPU memory hierarchy to support multi-application concurrency. In ASPLOS."},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2009.4919648"},{"key":"e_1_3_2_20_2","article-title":"Enabling GPGPU low-level hardware explorations with MIAOW: An open-source RTL implementation of a GPGPU","author":"others Balasubramanian, Raghuraman and Gangadhar, Vinay and Guo, Ziliang and Ho, Chen-Han and Joseph, Cherin and Menon, Jaikrishnan and Drumond, Mario Paulo and Paul, Robin and Prasad, Sharath and Valathol, Pradip and","year":"2015","unstructured":"Balasubramanian, Raghuraman and Gangadhar, Vinay and Guo, Ziliang and Ho, Chen-Han and Joseph, Cherin and Menon, Jaikrishnan and Drumond, Mario Paulo and Paul, Robin and Prasad, Sharath and Valathol, Pradip and others. 2015. Enabling GPGPU low-level hardware explorations with MIAOW: An open-source RTL implementation of a GPGPU. ACM Trans. Arch. Code Optim. 12, 2 (2015), 21\u20131.","journal-title":"ACM Trans. Arch. Code Optim."},{"key":"e_1_3_2_21_2","volume-title":"IISWC","author":"Keshav Burtscher, Martin and Nasre, Rupesh and Pingali,","year":"2012","unstructured":"Burtscher, Martin and Nasre, Rupesh and Pingali, Keshav. 2012. A quantitative study of irregular programs on GPUs. 
In IISWC."},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2013.6704684"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_3_2_24_2","article-title":"Improving GPGPU performance via cache locality aware thread block scheduling","author":"Chen Li-Jhan","year":"2017","unstructured":"Li-Jhan Chen, Hsiang-Yun Cheng, Po-Han Wang, and Chia-Lin Yang. 2017. Improving GPGPU performance via cache locality aware thread block scheduling. CAL (2017).","journal-title":"CAL"},{"key":"e_1_3_2_25_2","unstructured":"Design Compiler. 2000. Synopsys inc."},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2022.3154315"},{"key":"e_1_3_2_27_2","unstructured":"Sina Darabi Negin Mahani Hazhir Baxishi Ehsan Yousefzadeh-Asl-Miandoab Mohammad Sadrosadati and Hamid Sarbazi-Azad. 2022. NURA: A framework for supporting non-uniform resource accesses in GPUs. Proc. ACM Meas. Anal. Comput. Syst. (2022)."},{"key":"e_1_3_2_28_2","volume-title":"MICRO","year":"2022","unstructured":"Sina Darabi, Mohammad Sadrosadati, Negar Akbarzadeh, Jo\u00ebl Lindegger, Mohammad Hosseini, Jisung Park, Juan G\u00f3mez-Luna, Onur Mutlu, and Hamid Sarbazi-Azad. 2022. Morpheus: Extending the last level cache capacity in gpu systems using idle gpu core resources. In MICRO."},{"key":"e_1_3_2_29_2","volume-title":"IISWC","year":"2016","unstructured":"Saumay Dublish, Vijay Nagarajan, and Nigel Topham. 2016. Characterizing memory bottlenecks in GPGPU workloads. In IISWC."},{"key":"e_1_3_2_30_2","doi-asserted-by":"crossref","unstructured":"Saumay Dublish Vijay Nagarajan and Nigel Topham. 2016. Cooperative caching for GPUs. ACM TOPC.","DOI":"10.1145\/3001589"},{"key":"e_1_3_2_31_2","volume-title":"ISPASS","year":"2017","unstructured":"Saumay Dublish, Vijay Nagarajan, and Nigel Topham. 2017. Evaluating and mitigating bandwidth bottlenecks across the memory hierarchy in GPUs. 
In ISPASS."},{"key":"e_1_3_2_32_2","volume-title":"HPCA","year":"2019","unstructured":"Saumay Dublish, Vijay Nagarajan, and Nigel Topham. 2019. Poise: Balancing thread-level parallelism and memory system performance in GPUs using machine learning. In HPCA."},{"key":"e_1_3_2_33_2","volume-title":"CADS","author":"Falahati Hajar","year":"2013","unstructured":"Hajar Falahati, Mania Abdi, Amirali Baniasadi, and Shaahin Hessabi. 2013. ISP: Using idle SMs in hardware-based prefetching. In CADS."},{"key":"e_1_3_2_34_2","doi-asserted-by":"crossref","unstructured":"Hajar Falahati Mania Abdi Amirali Baniasadi and Shaahin Hessabi. 2015. Power-efficient prefetching in GPGPUs. The Journal of Supercomputing 71 (2015) 2808\u20132829.","DOI":"10.1007\/s11227-014-1331-6"},{"key":"e_1_3_2_35_2","unstructured":"Hajar Falahati Pejman Lotfi-Kamran Mohammad Sadrosadati and Hamid Sarbazi-Azad. 2018. ORIGAMI: A heterogeneous split architecture for in-memory acceleration of learning. arXiv preprint arXiv:1812.11473 (2018)."},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2021.3096191"},{"key":"e_1_3_2_37_2","volume-title":"ISCA","author":"Gebhart Mark","year":"2011","unstructured":"Mark Gebhart, Daniel R. Johnson, David Tarjan, Stephen W. Keckler, William J. Dally, Erik Lindholm, and Kevin Skadron. 2011. Energy-efficient mechanisms for managing thread context in throughput processors. In ISCA."},{"key":"e_1_3_2_38_2","volume-title":"MICRO","author":"Gilani Syed Zohaib","year":"2013","unstructured":"Syed Zohaib Gilani, Nam Sung Kim, and Michael J. Schulte. 2013. Exploiting GPU peak-power and performance tradeoffs through reduced effective pipeline latency. In MICRO."},{"key":"e_1_3_2_39_2","volume-title":"HPCA","author":"J Gilani, Syed Zohaib and Kim, Nam Sung and Schulte, Michael","year":"2013","unstructured":"Gilani, Syed Zohaib and Kim, Nam Sung and Schulte, Michael J. 2013. Power-efficient computing for compute-intensive GPGPU applications. 
In HPCA."},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2015.30"},{"key":"e_1_3_2_41_2","volume-title":"HPCA","author":"Goswami Nilanjan","year":"2013","unstructured":"Nilanjan Goswami, Bingyi Cao, and Tao Li. 2013. Power-performance co-optimization of throughput core architecture using resistive memory. In HPCA."},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/InPar.2012.6339595"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/1454115.1454152"},{"key":"e_1_3_2_44_2","volume-title":"PACT","year":"2019","unstructured":"Mohamed Assem Ibrahim, Hongyuan Liu, Onur Kayiran, and Adwait Jog. 2019. Analyzing and leveraging remote-core bandwidth for enhanced performance in GPUs. In PACT."},{"key":"e_1_3_2_45_2","volume-title":"ISCA","author":"Indrani Paul","year":"2015","unstructured":"Paul Indrani, Huang Wei, Manish Arora, and Sudhakar Yalmanchili. 2015. Harmonia: Balancing compute and memory power in high-performance GPUs. In ISCA."},{"key":"e_1_3_2_46_2","volume-title":"MICRO","author":"Jeon Hyeran","year":"2012","unstructured":"Hyeran Jeon and M. Annavaram. 2012. Warped-DMR: Light-weight error detection for GPGPU. In MICRO."},{"key":"e_1_3_2_47_2","volume-title":"MICRO","author":"Jing Naifeng","year":"2016","unstructured":"Naifeng Jing, Jianfei Wang, Fengfeng Fan, Wenkang Yu, Li Jiang, Chao Li, and Xiaoyao Liang. 2016. Cache-emulated register file: An integrated on-chip memory architecture for high performance GPGPUs. In MICRO."},{"key":"e_1_3_2_48_2","volume-title":"ASPLOS","author":"Jog Adwait","year":"2013","unstructured":"Adwait Jog, Onur Kayiran, Nachiappan Chidambaram, Asit K.Mishra, Mahmut T.Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R.Das. 2013. OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance. In ASPLOS."},{"key":"e_1_3_2_49_2","volume-title":"ISCA","author":"Jog Adwait","year":"2013","unstructured":"Adwait Jog, Onur Kayiran, Asit K. 
Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. Orchestrated scheduling and prefetching for GPGPUs. In ISCA."},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/2896377.2901468"},{"key":"e_1_3_2_51_2","volume-title":"MICRO","year":"2021","unstructured":"Vijay Kandiah, Scott Peverelle, Mahmoud Khairy, Junrui Pan, Amogh Manjunath, Timothy G. Rogers, Tor M. Aamodt, and Nikos Hardavellas. 2021. AccelWattch: A power modeling framework for modern GPUs. In MICRO."},{"key":"e_1_3_2_52_2","volume-title":"PACT","author":"Kayiran Onur","year":"2013","unstructured":"Onur Kayiran, Await Jog, Mahmut T. Kandemir, and Chita R. Das. 2013. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In PACT."},{"key":"e_1_3_2_53_2","volume-title":"MICRO","author":"Kayiran Onur","year":"2014","unstructured":"Onur Kayiran, Nachiappan Chidambaram Nachiappan, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, and Chita R. Das. 2014. Managing GPU concurrency in heterogeneous architectures. In MICRO."},{"key":"e_1_3_2_54_2","doi-asserted-by":"crossref","unstructured":"M. M. Keshtegar H. Falahati and S. Hessabi. 2015. Cluster-based approach for improving graphics processing unit performance by inter streaming multiprocessors locality. IET Computers & Digital Techniques 9 5 (2015) 275\u2013282.","DOI":"10.1049\/iet-cdt.2014.0092"},{"key":"e_1_3_2_55_2","volume-title":"ISCA","author":"G Khairy, Mahmoud and Shen, Zhesheng and Aamodt, Tor M and Rogers, Timothy","year":"2020","unstructured":"Khairy, Mahmoud and Shen, Zhesheng and Aamodt, Tor M and Rogers, Timothy G. 2020. Accel-sim: An extensible simulation framework for validated GPU modeling. In ISCA."},{"key":"e_1_3_2_56_2","volume-title":"IISWC","author":"Koo Gunjae","year":"2015","unstructured":"Gunjae Koo, Hyeran Jeon, and Murali Annavaram. 2015. Revealing critical loads and hidden data locality in GPGPU applications. 
In IISWC."},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835970"},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2012.6168947"},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2010.44"},{"key":"e_1_3_2_60_2","volume-title":"HPCA","year":"2014","unstructured":"Minseok Lee, Seokwoo Song, Joosik Moon, John Kim, Woong Seo, Yeongon Cho, and Soojung Ryu. 2014. Improving GPGPU resource utilization through alternative thread block scheduling. In HPCA."},{"key":"e_1_3_2_61_2","volume-title":"ISCA","author":"Leng Jingwen","year":"2013","unstructured":"Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, and Vijay Janapa Reddi. 2013. GPUWattch: Enabling energy optimizations in GPGPUs. In ISCA."},{"key":"e_1_3_2_62_2","volume-title":"SC","author":"Li Ang","year":"2015","unstructured":"Ang Li, Gert-Jan van den Braak, Akash Kumar, and Henk Corporaal. 2015. Adaptive and transparent cache bypassing for GPUs. In SC."},{"key":"e_1_3_2_63_2","volume-title":"ICS","author":"Li Chao","year":"2015","unstructured":"Chao Li, Shuaiwen Leon Song, Hongwen Dai, Albert Sidelnik, Siva Kumar Sastry Hari, and Huiyang Zhou. 2015. Locality-driven dynamic GPU cache bypassing. In ICS."},{"key":"e_1_3_2_64_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056024"},{"key":"e_1_3_2_65_2","volume-title":"HPCA","author":"Liu Zhenhong","year":"2017","unstructured":"Zhenhong Liu, Syed Gilani, Murali Annavaram, and Nam Sung Kim. 2017. G-scalar: Cost-effective generalized scalar execution architecture for power-efficient GPUs. In HPCA."},{"key":"e_1_3_2_66_2","volume-title":"Cuda9.0 Programming Guide","author":"M.Aamodt Tor","year":"2018","unstructured":"Tor M.Aamodt, Wilson W. L. Fung, and Tayler H. Hetherington. 2018. Cuda9.0 Programming Guide. 
Retrieved from http:\/\/gpgpu-sim.org\/manual\/index.php5\/GPGPU-Sim_3.x_Manual"},{"key":"e_1_3_2_67_2","article-title":"Power-efficient cache design using dual-edge clocking scheme in sun OpenSPARC T1 and Alpha AXP processors","author":"Daniel Megalingam R.Kannan and M. Arunkumar and V. A.Ashok and Krishnan Nived and C. J.","year":"2010","unstructured":"Megalingam R.Kannan and M. Arunkumar and V. A.Ashok and Krishnan Nived and C. J. Daniel. 2010. Power-efficient cache design using dual-edge clocking scheme in sun OpenSPARC T1 and Alpha AXP processors. J. Commun. Comput. Inf. Sci. (2010).","journal-title":"J. Commun. Comput. Inf. Sci."},{"key":"e_1_3_2_68_2","volume-title":"NOCS","author":"Mirhosseini Amirhossein","year":"2017","unstructured":"Amirhossein Mirhosseini, Mohammad Sadrosadati, Behnaz Soltani, Hamid Sarbazi-Azad, and Thomas F. Wenisch. 2017. BiNoCHS: Bimodal network-on-chip for CPU-GPU heterogeneous systems. In NOCS."},{"key":"e_1_3_2_69_2","article-title":"BARAN: Bimodal adaptive reconfigurable-allocator network-on-chip","author":"Hamid Mirhosseini, Amirhossein and Sadrosadati, Mohammad and Aghamohammadi, Fatemeh and Modarressi, Mehdi and Sarbazi-Azad,","year":"2019","unstructured":"Mirhosseini, Amirhossein and Sadrosadati, Mohammad and Aghamohammadi, Fatemeh and Modarressi, Mehdi and Sarbazi-Azad, Hamid. 2019. BARAN: Bimodal adaptive reconfigurable-allocator network-on-chip. ACM Trans. Parallel Comput. (2019).","journal-title":"ACM Trans. Parallel Comput."},{"key":"e_1_3_2_70_2","article-title":"A survey of cache bypassing techniques","author":"Mittal Sparsh","year":"2016","unstructured":"Sparsh Mittal. 2016. A survey of cache bypassing techniques. J. Low Power Electr. Appl. (2016).","journal-title":"J. Low Power Electr. Appl."},{"key":"e_1_3_2_71_2","first-page":"728","volume-title":"MICRO","author":"Mostofi Saba","year":"2023","unstructured":"Saba Mostofi, Hajar Falahati, Negin Mahani, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. 2023. 
Snake: A variable-length chain-based prefetching for GPUs. In MICRO. 728\u2013741."},{"key":"e_1_3_2_72_2","volume-title":"Khronos OpenCL Working Group","author":"Munshi Aaftab","year":"2008","unstructured":"Aaftab Munshi. 2008. The OpenCL specification. In Khronos OpenCL Working Group."},{"key":"e_1_3_2_73_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3124538"},{"key":"e_1_3_2_74_2","volume-title":"MICRO","author":"Narasiman Veynu","year":"2011","unstructured":"Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N. Patt. 2011. Improving GPU performance via large warps and two-level warp scheduling. In MICRO."},{"key":"e_1_3_2_75_2","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2018.2873679"},{"issue":"1","key":"e_1_3_2_76_2","first-page":"1","article-title":"Efficient nearest-neighbor data sharing in GPUs","volume":"18","author":"Babak Nematollahi, Negin and Sadrosadati, Mohammad and Falahati, Hajar and Barkhordar, Marzieh and Drumond, Mario Paulo and Sarbazi-Azad, Hamid and Falsafi,","year":"2020","unstructured":"Nematollahi, Negin and Sadrosadati, Mohammad and Falahati, Hajar and Barkhordar, Marzieh and Drumond, Mario Paulo and Sarbazi-Azad, Hamid and Falsafi, Babak. 2020. Efficient nearest-neighbor data sharing in GPUs. ACM Trans. Arch. Code Optim. 18, 1 (2020), 1\u201326.","journal-title":"ACM Trans. Arch. Code Optim."},{"key":"e_1_3_2_77_2","volume-title":"CUDA SDK 2.3. Retrieved from","year":"2009","unstructured":"NVIDIA. 2009. CUDA SDK 2.3. Retrieved from https:\/\/developer.nvidia.com\/cuda-toolkit-23-downloads"},{"key":"e_1_3_2_78_2","volume-title":"ISCA","author":"Park Chang Hyun","year":"2016","unstructured":"Chang Hyun Park, Taekyung Heo, and Jaehyuk Huh. 2016. Efficient intra-SM slicing through dynamic resource partitioning for gpu multiprogramming. In ISCA."},{"key":"e_1_3_2_79_2","volume-title":"AIPR","author":"Park Seung In","year":"2008","unstructured":"Seung In Park, Sean P. 
Ponce, Jing Huang, Yong Cao, and Francis Quek. 2008. Low-cost, high-speed computer vision using NVIDIA\u2019s CUDA architecture. In AIPR."},{"key":"e_1_3_2_80_2","volume-title":"ASP-AC","author":"Pedram Massoud","year":"1998","unstructured":"Massoud Pedram, Qing Wu, and Xunwei Wu. 1998. A new design for double edge triggered flip-flops. In ASP-AC."},{"key":"e_1_3_2_81_2","doi-asserted-by":"crossref","DOI":"10.1109\/LCA.2015.2430853","article-title":"Toggle-aware compression for GPUs.","author":"Pekhimenko Gennady","year":"2015","unstructured":"Gennady Pekhimenko, Evgeny Bolotin, Mike O\u2019Connor, Onur Mutlu, Todd C. Mowry, and Stephen W. Keckler. 2015. Toggle-aware compression for GPUs. IEEE Comput. Arch. Lett. (2015).","journal-title":"IEEE Comput. Arch. Lett."},{"key":"e_1_3_2_82_2","doi-asserted-by":"crossref","DOI":"10.1118\/1.3578605","article-title":"GPU computing in medical physics: A review","author":"Pratx Guillem","year":"2011","unstructured":"Guillem Pratx and Lei Xing. 2011. GPU computing in medical physics: A review. Med. Phys. (2011).","journal-title":"Med. Phys."},{"key":"e_1_3_2_83_2","volume-title":"MICRO","author":"Rogers Timothy G.","year":"2012","unstructured":"Timothy G. Rogers, Mike O\u2019Connor, and Tor M. Aamodt. 2012. Cache-conscious wavefront scheduling. In MICRO."},{"key":"e_1_3_2_84_2","doi-asserted-by":"publisher","DOI":"10.1145\/3291606"},{"key":"e_1_3_2_85_2","volume-title":"ASPLOS","author":"Sadrosadati Mohammad","year":"2018","unstructured":"Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed Borna Ehsani, Hamid Sarbazi-Azad, Mario Drumond, Babak Falsafi, Rachata Ausavarungnirun, and Onur Mutlu. 2018. LTRF: Enabling high-capacity register files for gpus via hardware\/software cooperative register prefetching. 
In ASPLOS."},{"key":"e_1_3_2_86_2","doi-asserted-by":"publisher","DOI":"10.1145\/3419973"},{"key":"e_1_3_2_87_2","volume-title":"DATE","author":"Hamid Sadrosadati, Mohammad and Mirhosseini, Amirhossein and Roozkhosh, Shahin and Bakhishi, Hazhir and Sarbazi-Azad,","year":"2017","unstructured":"Sadrosadati, Mohammad and Mirhosseini, Amirhossein and Roozkhosh, Shahin and Bakhishi, Hazhir and Sarbazi-Azad, Hamid. 2017. Effective cache bank placement for GPUs. In DATE."},{"key":"e_1_3_2_88_2","volume-title":"DAC","author":"Samavatian Mohammad Hossein","year":"2014","unstructured":"Mohammad Hossein Samavatian, Hamed Abbasitabar, Mohammad Arjomand, and Hamid Sarbazi-Azad. 2014. An efficient STT-RAM last level cache architecture for GPUs. In DAC."},{"key":"e_1_3_2_89_2","article-title":"Wall street accelerates options analysis with GPU technology","author":"Schmerken I.","year":"2009","unstructured":"I. Schmerken. 2009. Wall street accelerates options analysis with GPU technology. Wall Street Technol. (2009).","journal-title":"Wall Street Technol."},{"key":"e_1_3_2_90_2","volume-title":"PACT","author":"Seething Ankit","year":"2010","unstructured":"Ankit Seething, Ganesh Dasika, Mehrzad Samadi, and Scott Mahlke. 2010. Apogee: Adaptive prefetching on GPUs for energy efficiency. In PACT."},{"key":"e_1_3_2_91_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056031"},{"key":"e_1_3_2_92_2","volume-title":"MICRO","author":"Sethia Ankit","year":"2014","unstructured":"Ankit Sethia and Scott Mahlke. 2014. Equalizer: Dynamic tuning of GPU resources for efficient execution. In MICRO."},{"key":"e_1_3_2_93_2","volume-title":"PACT","author":"Skadron Kevin","year":"2009","unstructured":"Kevin Skadron, Margaret Martonosi, and Douglas W. Clark. 2009. A taxonomy of branch mispredictions, and alloyed prediction as a robust solution to wrong-history mispredictions. In PACT."},{"key":"e_1_3_2_94_2","doi-asserted-by":"crossref","unstructured":"Sam S. Stone Justin P. 
Haldar Stephanie C. Tsao B. P. Sutton Z.-P. Liang et\u00a0al. 2008. Accelerating advanced MRI reconstructions on GPUs. In Proceedings of the 5th Conference on Computing Frontiers. 261\u2013272.","DOI":"10.1145\/1366230.1366276"},{"key":"e_1_3_2_95_2","volume-title":"Parboil: A Revised Benchmark Suite for Scientific and Commercial throughput Computing","author":"Stratton John A.","year":"2012","unstructured":"John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniek Liu, and Wen Mei Hwu. 2012. Parboil: A Revised Benchmark Suite for Scientific and Commercial throughput Computing. Technical Report."},{"key":"e_1_3_2_96_2","volume-title":"NOCS","year":"2012","unstructured":"Chen Sun, Chia-Hsin Owen Chen, George Kurian, Lan Wei, Jason Miller, Anant Agarwal, Li-Shiuan Peh, and Vladimir Stojanovic. 2012. DSENT-a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling. In NOCS."},{"key":"e_1_3_2_97_2","volume-title":"TTSMC-28nm","year":"2022","unstructured":"Synopsys. 2022. TTSMC-28nm. Retrieved from https:\/\/www.synopsys.com\/dw\/emllselector.php?f=TSMC&n=28&s=wMkRWA"},{"key":"e_1_3_2_98_2","volume-title":"IPDPS","author":"Tabbakh Abdulaziz","year":"2017","unstructured":"Abdulaziz Tabbakh, Murali Annavaram, and Xuehai Qian. 2017. Power efficient sharing-aware GPU data management. In IPDPS."},{"key":"e_1_3_2_99_2","volume-title":"SC","author":"Tarjan David","year":"2010","unstructured":"David Tarjan and Kevin Skadron. 2010. The sharing tracker: Using ideas from cache coherence hardware to reduce off-chip memory traffic with non-coherent caches. In SC."},{"key":"e_1_3_2_100_2","volume-title":"GPGPU","author":"Tian Yingying","year":"2015","unstructured":"Yingying Tian, Sooraj Puthoor, Joseph L. Greathouse, Bradford M. Beckmann, and Daniel A. Jim\u00e9nez. 2015. Adaptive GPU cache bypassing. In GPGPU."},{"key":"e_1_3_2_101_2","volume-title":"MICRO","author":"Aamodt Timothy G. 
Rogers and Mike O\u2019Connor and Tor M.","year":"2013","unstructured":"Timothy G. Rogers and Mike O\u2019Connor and Tor M. Aamodt. 2013. Divergence-aware warp scheduling. In MICRO."},{"key":"e_1_3_2_102_2","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750399"},{"key":"e_1_3_2_103_2","doi-asserted-by":"publisher","DOI":"10.5555\/2523721.2523737"},{"key":"e_1_3_2_104_2","volume-title":"ISCA","author":"Wang Jin","year":"2015","unstructured":"Jin Wang, Norm Rubin, Albert Sidelnik, and Sudhakar Yalamanchili. 2015. Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs. In ISCA."},{"key":"e_1_3_2_105_2","volume-title":"ISCA","author":"Sudhakar Wang, Jin and Rubin, Norm and Sidelnik, Albert and Yalamanchili,","year":"2016","unstructured":"Wang, Jin and Rubin, Norm and Sidelnik, Albert and Yalamanchili, Sudhakar. 2016. Laperm: Locality aware scheduler for dynamic parallelism on gpus. In ISCA."},{"key":"e_1_3_2_106_2","volume-title":"IPDPS","author":"Lieven Wang, Lu and Zhao, Xia and Kaeli, David and Wang, Zhiying and Eeckhout,","year":"2018","unstructured":"Wang, Lu and Zhao, Xia and Kaeli, David and Wang, Zhiying and Eeckhout, Lieven. 2018. Intra-cluster coalescing to reduce gpu noc pressure. In IPDPS."},{"key":"e_1_3_2_107_2","doi-asserted-by":"crossref","DOI":"10.1109\/4.509850","article-title":"CACTI: An enhanced cache access and cycle time model","author":"Wilton Steven J. E.","year":"1996","unstructured":"Steven J. E. Wilton and Norman P. Jouppi. 1996. CACTI: An enhanced cache access and cycle time model. IEEE J. Solid-State Circ. 31, 5 (1996), 677\u2013688.","journal-title":"IEEE J. Solid-State Circ."},{"key":"e_1_3_2_108_2","volume-title":"ISCA","author":"Wing-Kei S Yu","year":"2011","unstructured":"S Yu Wing-Kei, Ruirui Huang, Sarah Q. Xu, Sung-En Wang, Edwin Kan, and G Edward Suh. 2011. SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading. 
In ISCA."},{"key":"e_1_3_2_109_2","doi-asserted-by":"publisher","DOI":"10.5555\/2561828.2561929"},{"key":"e_1_3_2_110_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056023"},{"key":"e_1_3_2_111_2","doi-asserted-by":"publisher","DOI":"10.1145\/2628071.2628105"},{"key":"e_1_3_2_112_2","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2014.6983053"},{"key":"e_1_3_2_113_2","volume-title":"IISWC","author":"Xu Qiumin","year":"2014","unstructured":"Qiumin Xu, Hyeran Jeon, and Murali Annavaram. 2014. Graph processing on GPUs: Where are the bottlenecks? In IISWC."},{"key":"e_1_3_2_114_2","volume-title":"ISLPED","author":"Yin Jieming","year":"2012","unstructured":"Jieming Yin, Pingqiang Zhou, Anup Holey, Sachin S. Sapatnekar, and Antonia Zhai. 2012. Energy-efficient non-minimal path on-chip interconnection network for heterogeneous systems. In ISLPED."},{"key":"e_1_3_2_115_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783731"},{"key":"e_1_3_2_116_2","doi-asserted-by":"crossref","unstructured":"Xia Zhao, Yuxi Liu, Almutaz Adileh, and Lieven Eeckhout. 2016. LA-LLC: Inter-core locality-aware last-level cache to exploit many-to-many traffic in GPGPUs. IEEE Comput. Arch. Lett. (2016).","DOI":"10.1109\/LCA.2016.2611663"},{"key":"e_1_3_2_117_2","volume-title":"ICCD","author":"Zhao Xia","year":"2016","unstructured":"Xia Zhao, Sheng Ma, Chen Li, Lieven Eeckhout, and Zhiying Wang. 2016. A heterogeneous low-cost and low-latency ring-chain network for GPGPUs. In ICCD."},{"key":"e_1_3_2_118_2","volume-title":"NOCS","author":"Ziabari Amir Kavyan","year":"2015","unstructured":"Amir Kavyan Ziabari, Jos\u00e9 L. Abell\u00e1n, Yenai Ma, Ajay Joshi, and David Kaeli. 2015. Asymmetric NoC architectures for GPU systems. 
In NOCS."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3653019","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3653019","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:17:59Z","timestamp":1750295879000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3653019"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,14]]},"references-count":117,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,9,30]]}},"alternative-id":["10.1145\/3653019"],"URL":"https:\/\/doi.org\/10.1145\/3653019","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,9,14]]},"assertion":[{"value":"2022-06-30","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-03-05","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-09-14","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}