{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,27]],"date-time":"2025-08-27T16:04:48Z","timestamp":1756310688077,"version":"3.41.0"},"reference-count":56,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2020,8,17]],"date-time":"2020-08-17T00:00:00Z","timestamp":1597622400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["CCF 13-02641,CCF 16-19245"],"award-info":[{"award-number":["CCF 13-02641,CCF 16-19245"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2020,9,30]]},"abstract":"<jats:p>As GPUs have become more programmable, their performance and energy benefits have made them increasingly popular. However, while GPU compute units continue to improve in performance, on-chip memories lag behind and data accesses are becoming increasingly expensive in performance and energy. Emerging GPU coherence protocols can mitigate this bottleneck by exploiting data reuse in GPU caches across kernel boundaries. Unfortunately, current GPU thread block schedulers are typically not designed to expose such reuse. This article proposes new hardware thread block schedulers that optimize inter-kernel reuse while using work stealing to preserve load balance. Our schedulers are simple, decentralized, and have extremely low overhead. 
Compared to a baseline round-robin scheduler, the best performing scheduler reduces average execution time and energy by 19% and 11%, respectively, in regular applications, and 10% and 8%, respectively, in irregular applications.<\/jats:p>","DOI":"10.1145\/3406538","type":"journal-article","created":{"date-parts":[[2020,8,17]],"date-time":"2020-08-17T13:24:45Z","timestamp":1597670685000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":15,"title":["Inter-kernel Reuse-aware Thread Block Scheduling"],"prefix":"10.1145","volume":"17","author":[{"given":"Muhammad","family":"Huzaifa","sequence":"first","affiliation":[{"name":"University of Illinois at Urbana-Champaign, Urbana, IL, USA"}]},{"given":"Johnathan","family":"Alsop","sequence":"additional","affiliation":[{"name":"AMD Research, Bellevue, WA, USA"}]},{"given":"Abdulrahman","family":"Mahmoud","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, Urbana, IL, USA"}]},{"given":"Giordano","family":"Salvador","sequence":"additional","affiliation":[{"name":"Unaffiliated"}]},{"given":"Matthew D.","family":"Sinclair","sequence":"additional","affiliation":[{"name":"University of Wisconsin-Madison, USA and AMD Research, Bellevue, WA, USA"}]},{"given":"Sarita V.","family":"Adve","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, Urbana, IL, USA"}]}],"member":"320","published-online":{"date-parts":[[2020,8,17]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3123976"},{"volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software.","author":"Agarwal N.","key":"e_1_2_1_2_1","unstructured":"N. Agarwal, T. Krishna, Li-Shiuan Peh, and N. K. Jha. 2009. GARNET: A detailed on-chip network model inside a full-system simulator. 
In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software."},{"volume-title":"Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS\u201910)","author":"Agrawal K.","key":"e_1_2_1_3_1","unstructured":"K. Agrawal, C. E. Leiserson, and J. Sukha. 2010. Executing task graphs using work-stealing. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS\u201910)."},{"volume-title":"Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)","author":"Alsop Johnathan","key":"e_1_2_1_4_1","unstructured":"Johnathan Alsop, Marc S. Orr, Bradford M. Beckmann, and David A. Wood. 2016. Lazy release consistency for GPUs. In Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)."},{"volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software.","author":"Bakhoda Ali","key":"e_1_2_1_5_1","unstructured":"Ali Bakhoda, George L. Yuan, Wilson W. L. 
Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2013.6704684"},{"volume-title":"Proceedings of the IEEE International Symposium on Workload Characterization (IISWC\u201910)","author":"Che Shuai","key":"e_1_2_1_7_1","unstructured":"Shuai Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, Liang Wang, and K. Skadron. 2010. A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC\u201910). 1--11."},{"key":"e_1_2_1_8_1","first-page":"1","article-title":"Improving GPGPU performance via cache locality aware thread block scheduling","volume":"99","author":"Chen Li-Jhan","year":"2017","unstructured":"Li-Jhan Chen, Hsiang-Yun Cheng, Po-Han Wang, and Chia-Lin Yang. 2017. Improving GPGPU performance via cache locality aware thread block scheduling. IEEE Comput. Archit. Lett. PP, 99 (2017), 1--1.","journal-title":"IEEE Comput. Archit. Lett. 
PP"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.11"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2049662.2049663"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00012"},{"volume-title":"Proceedings of the IEEE 27th International Symposium on Parallel and Distributed Processing.","author":"Gautier T.","key":"e_1_2_1_12_1","unstructured":"T. Gautier, J. V. F. Lima, N. Maillard, and B. Raffin. 2013. XKaapi: A runtime system for data-flow task programming on heterogeneous architectures. In Proceedings of the IEEE 27th International Symposium on Parallel and Distributed Processing."},{"volume-title":"Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS\u201910)","author":"Guo Y.","key":"e_1_2_1_13_1","unstructured":"Y. Guo, J. Zhao, V. Cave, and V. Sarkar. 2010. SLAW: A scalable locality-aware adaptive work-stealing scheduler. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS\u201910)."},{"volume-title":"Proceedings of the IEEE 20th International Symposium on High Performance Computer Architecture.","author":"Hechtman Blake A.","key":"e_1_2_1_14_1","unstructured":"Blake A. Hechtman, Shuai Che, Derek R. Hower, Yingying Tian, Bradford M. Beckmann, Mark D. Hill, Steven K. Reinhardt, and David A. Wood. 2014. QuickRelease: A throughput-oriented approach to release consistency on GPUs. 
In Proceedings of the IEEE 20th International Symposium on High Performance Computer Architecture. 189--200."},{"volume-title":"Version 2.0","author":"Howes Lee","key":"e_1_2_1_15_1","unstructured":"Lee Howes and Aaftab Munshi. 2015. The OpenCL Specification, Version 2.0. Khronos Group. Retrieved from https:\/\/www.khronos.org\/registry\/OpenCL\/specs\/2.2\/pdf\/OpenCL_C.pdf."},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the 5th Conference on Partitioned Global Address Space Programming Models.","author":"Min Seung Jai","year":"2011","unstructured":"Seung Jai Min, Costin Iancu, and Katherine Yelick. 2011. Hierarchical work stealing on manycore clusters. In Proceedings of the 5th Conference on Partitioned Global Address Space Programming Models."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835938"},{"volume-title":"Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201913)","author":"Jog Adwait","key":"e_1_2_1_18_1","unstructured":"Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. 
OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201913)."},{"volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA\u201913)","author":"Jog Adwait","key":"e_1_2_1_19_1","unstructured":"Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. Orchestrated scheduling and prefetching for GPGPUs. In Proceedings of the International Symposium on Computer Architecture (ISCA\u201913)."},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT\u201913)","author":"Kayiran Onur","year":"2013","unstructured":"Onur Kayiran, Adwait Jog, Mahmut Taylan Kandemir, and Chita Ranjan Das. 2013. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. 
In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT\u201913)."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2011.89"},{"volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA\u201915)","author":"Komuravelli Rakesh","key":"e_1_2_1_23_1","unstructured":"Rakesh Komuravelli, Matthew D. Sinclair, Johnathan Alsop, Muhammad Huzaifa, Prakalp Srivastava, Maria Kotsifakou, Sarita V. Adve, and Vikram S. Adve. 2015. Stash: Have your scratchpad and cache it too. In Proceedings of the International Symposium on Computer Architecture (ISCA\u201915)."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080239"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2889488"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835937"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2628071.2628107"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485964"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3037697.3037709"},{"volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201915)","author":"Li Ang","key":"e_1_2_1_30_1","unstructured":"Ang Li, Gert-Jan van den Braak, Akash Kumar, and Henk Corporaal. 2015. 
Adaptive and transparent cache bypassing for GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201915)."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751237"},{"volume-title":"Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA\u201915)","author":"Li Dong","key":"e_1_2_1_32_1","unstructured":"Dong Li, Minsoo Rhu, Daniel R. Johnson, Mike O\u2019Connor, Mattan Erez, Doug Burger, Donald S. Fussell, and Stephen W. Keckler. 2015. Priority-based cache allocation in throughput processors. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA\u201915)."},{"volume-title":"Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201909)","author":"Li Sheng","key":"e_1_2_1_33_1","unstructured":"Sheng Li, Jung-Ho Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201909)."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/1105734.1105747"},{"key":"e_1_2_1_35_1","unstructured":"NVIDIA. 2010. CUDA SDK 3.1. 
Retrieved from http:\/\/developer.nvidia.com\/object\/cuda_3_1_downloads.html."},{"key":"e_1_2_1_36_1","unstructured":"NVIDIA. 2017. NVIDIA Tesla V100 GPU architecture. Retrieved from https:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf."},{"volume-title":"Proceedings of the IEEE International Conference on Cluster Computing.","author":"Perez J. M.","key":"e_1_2_1_37_1","unstructured":"J. M. Perez, R. M. Badia, and J. Labarta. 2008. A dependency-aware task-based programming environment for multi-core architectures. In Proceedings of the IEEE International Conference on Cluster Computing."},{"volume-title":"Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201912)","author":"Rogers Timothy G.","key":"e_1_2_1_38_1","unstructured":"Timothy G. Rogers, Mike O\u2019Connor, and Tor M. Aamodt. 2012. Cache-conscious wavefront scheduling. In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201912)."},{"volume-title":"Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201913)","author":"Rogers Timothy G.","key":"e_1_2_1_39_1","unstructured":"Timothy G. Rogers, Mike O\u2019Connor, and Tor M. Aamodt. 2013. Divergence-aware warp scheduling. 
In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201913)."},{"key":"e_1_2_1_40_1","volume-title":"Adve","author":"Salvador Giordano","year":"2020","unstructured":"Giordano Salvador, Wesley H. Darvin, Muhammad Huzaifa, Johnathan Alsop, Matthew D. Sinclair, and Sarita V. Adve. 2020. Specializing Coherence, Consistency, and Push\/Pull for GPU Graph Analytics. Retrieved from arxiv:cs.DC\/2002.10245."},{"volume-title":"Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201915)","author":"Sinclair Matthew D.","key":"e_1_2_1_41_1","unstructured":"Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2015. Efficient GPU synchronization without scopes: Saying no to complex consistency models. In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201915)."},{"volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA\u201917)","author":"Sinclair Matthew D.","key":"e_1_2_1_42_1","unstructured":"Matthew D. 
Sinclair, Johnathan Alsop, and Sarita V. Adve. 2017. Chasing away RAts: Semantics and evaluation for relaxed atomics on heterogeneous systems. In Proceedings of the International Symposium on Computer Architecture (ISCA\u201917)."},{"key":"e_1_2_1_43_1","volume-title":"Proceedings of the 19th International Symposium on High Performance Computer Architecture.","author":"Singh I.","year":"2013","unstructured":"I. Singh, A. Shriraman, W. W. L. Fung, M. O\u2019Connor, and T. M. Aamodt. 2013. Cache coherence for GPU architectures. In Proceedings of the 19th International Symposium on High Performance Computer Architecture. DOI:https:\/\/doi.org\/10.1109\/HPCA.2013.6522351"},{"key":"e_1_2_1_44_1","volume-title":"Geng Daniel Liu, and WMW Hwu","author":"Stratton John A.","year":"2012","unstructured":"John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and WMW Hwu. 2012. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. Technical Report. Department of ECE and CS, University of Illinois at Urbana-Champaign."},{"key":"e_1_2_1_45_1","unstructured":"SuiteSparse Matrix Collection. 2010. cond-mat. 
Retrieved from https:\/\/sparse.tamu.edu\/Newman\/cond-mat."},{"key":"e_1_2_1_46_1","unstructured":"SuiteSparse Matrix Collection. 2010. olesnik0. Retrieved from https:\/\/sparse.tamu.edu\/GHS_indef\/olesnik0."},{"volume-title":"Proceedings of the 8th Workshop on General Purpose Processing Using GPUs (GPGPU\u201915)","author":"Tian Yingying","key":"e_1_2_1_47_1","unstructured":"Yingying Tian, Sooraj Puthoor, Joseph L. Greathouse, Bradford M. Beckmann, and Daniel A. Jim\u00e9nez. 2015. Adaptive GPU cache bypassing. In Proceedings of the 8th Workshop on General Purpose Processing Using GPUs (GPGPU\u201915)."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/2638554"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00074"},{"key":"e_1_2_1_50_1","unstructured":"Virtutech. 2006. Simics full system simulator. Retrieved from http:\/\/www.simics.net."},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001199"},{"volume-title":"Proceedings of the IEEE International Conference on Big Data (Big Data\u201914)","author":"Wang K.","key":"e_1_2_1_52_1","unstructured":"K. Wang, X. Zhou, T. Li, D. Zhao, M. Lang, and I. Raicu. 2014. Optimizing load balancing and data-locality with data-aware scheduling. 
In Proceedings of the IEEE International Conference on Big Data (Big Data\u201914). 119--128."},{"key":"e_1_2_1_53_1","unstructured":"WikiChip. 2019. Exynos 9820 - Samsung. Retrieved from https:\/\/en.wikichip.org\/wiki\/samsung\/exynos\/9820."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751213"},{"volume-title":"Proceedings of the 21st IEEE Symposium on High Performance Computer Architecture (HPCA\u201915)","author":"Xie X.","key":"e_1_2_1_55_1","unstructured":"X. Xie, Y. Liang, Y. Wang, G. Sun, and T. Wang. 2015. Coordinated static and dynamic cache bypassing for GPUs. In Proceedings of the 21st IEEE Symposium on High Performance Computer Architecture (HPCA\u201915)."},{"key":"e_1_2_1_56_1","volume-title":"Mowry","author":"Yazdanbakhsh Amir","year":"2019","unstructured":"Amir Yazdanbakhsh, Gennady Pekhimenko, Hadi Esmaeilzadeh, Onur Mutlu, and Todd C. Mowry. 2019. Towards Breaking the Memory Bandwidth Wall Using Approximate Value Prediction. Springer International Publishing, 417--441."},{"key":"e_1_2_1_57_1","first-page":"1","article-title":"Adaptive cache and concurrency allocation on GPGPUs","volume":"99","author":"Zheng Z.","year":"2014","unstructured":"Z. 
Zheng, Z. Wang, and M. Lipasti. 2014. Adaptive cache and concurrency allocation on GPGPUs. Comput. Archit. Lett. PP, 99 (2014), 1--1. DOI:https:\/\/doi.org\/10.1109\/LCA.2014.2359882","journal-title":"Comput. Archit. Lett. PP"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3406538","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3406538","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3406538","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:01:35Z","timestamp":1750197695000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3406538"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,8,17]]},"references-count":56,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2020,9,30]]}},"alternative-id":["10.1145\/3406538"],"URL":"https:\/\/doi.org\/10.1145\/3406538","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2020,8,17]]},"assertion":[{"value":"2019-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-06-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-08-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}