{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,13]],"date-time":"2026-05-13T19:03:37Z","timestamp":1778699017578,"version":"3.51.4"},"reference-count":64,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2021,6,8]],"date-time":"2021-06-08T00:00:00Z","timestamp":1623110400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NSF","award":["CCF-1815643 and CCF-1907401"],"award-info":[{"award-number":["CCF-1815643 and CCF-1907401"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2021,9,30]]},"abstract":"<jats:p>The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache sizes per thread, leading to serious cache contention problems such as thrashing. Hence, the data access locality of an application should be considered during thread scheduling to improve execution time and energy consumption. Recent works have tried to use the locality behavior of regular and structured applications in thread scheduling, but the difficult case of irregular and unstructured parallel applications remains to be explored.<\/jats:p>\n<jats:p>We present <jats:bold>PAVER<\/jats:bold>, a <jats:bold>P<\/jats:bold>riority-<jats:bold>A<\/jats:bold>ware <jats:bold>V<\/jats:bold>ertex schedul<jats:bold>ER<\/jats:bold>, which takes a graph-theoretic approach toward thread scheduling. We analyze the cache locality behavior among <jats:bold>thread blocks<\/jats:bold> (<jats:bold>TBs<\/jats:bold>) through a just-in-time compilation, and represent the problem using a graph representing the TBs and the locality among them. This graph is then partitioned to TB groups that display maximum data sharing, which are then assigned to the same streaming multiprocessor by the locality-aware TB scheduler. Through exhaustive simulation in Fermi, Pascal, and Volta architectures using a number of scheduling techniques, we show that PAVER reduces L2 accesses by 43.3%, 48.5%, and 40.21% and increases the average performance benefit by 29%, 49.1%, and 41.2% for the benchmarks with high inter-TB locality.<\/jats:p>","DOI":"10.1145\/3451164","type":"journal-article","created":{"date-parts":[[2021,6,8]],"date-time":"2021-06-08T16:21:19Z","timestamp":1623169279000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":27,"title":["PAVER"],"prefix":"10.1145","volume":"18","author":[{"given":"Devashree","family":"Tripathy","sequence":"first","affiliation":[{"name":"University of California, Riverside, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Amirali","family":"Abdolrashidi","sequence":"additional","affiliation":[{"name":"University of California, Riverside, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Laxmi Narayan","family":"Bhuyan","sequence":"additional","affiliation":[{"name":"University of California, Riverside, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Liang","family":"Zhou","sequence":"additional","affiliation":[{"name":"University of California, Riverside, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Daniel","family":"Wong","sequence":"additional","affiliation":[{"name":"University of California, Riverside, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,6,8]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2009. Retrieved April 11 2018 from https:\/\/github.com\/gpgpu-sim\/ispass2009-benchmarks.  2009. Retrieved April 11 2018 from https:\/\/github.com\/gpgpu-sim\/ispass2009-benchmarks."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3123976"},{"key":"e_1_2_1_3_1","volume-title":"Proceedings of the 12th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM, 1\u201312","author":"Acar Umut A.","unstructured":"Umut A. Acar , Guy E. Blelloch , and Robert D. Blumofe . 2000. The data locality of work stealing . In Proceedings of the 12th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM, 1\u201312 . Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. 2000. The data locality of work stealing. In Proceedings of the 12th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM, 1\u201312."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.5555\/1753228.1753233"},{"key":"e_1_2_1_5_1","volume-title":"IEEE Symposium on Performance Analysis of Systems and Software (ISPASS\u201909)","author":"Bakhoda Ali","unstructured":"Ali Bakhoda , George L. Yuan , Wilson W. L. Fung , Henry Wong , and Tor M. Aamodt . 2009. Analyzing CUDA workloads using a detailed GPU simulator . In IEEE Symposium on Performance Analysis of Systems and Software (ISPASS\u201909) . IEEE, 163\u2013174. Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In IEEE Symposium on Performance Analysis of Systems and Software (ISPASS\u201909). IEEE, 163\u2013174."},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 54\u201367","author":"Belviranli Mehmet E.","unstructured":"Mehmet E. 
Belviranli , Seyong Lee , Jeffrey S. Vetter , and Laxmi N. Bhuyan . 2018. Juggler: A dependence-aware task-based execution framework for GPUs . In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 54\u201367 . Mehmet E. Belviranli, Seyong Lee, Jeffrey S. Vetter, and Laxmi N. Bhuyan. 2018. Juggler: A dependence-aware task-based execution framework for GPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 54\u201367."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2017.2693371"},{"key":"e_1_2_1_9_1","volume-title":"2018 USENIX Annual Technical Conference (USENIXATC\u201918)","author":"Chen Yanhao","unstructured":"Yanhao Chen , Ari B. Hayes , Chi Zhang , Timothy Salmon , and Eddy Z. Zhang . 2018. Locality-aware software throttling for sparse matrix operation on GPUs . In 2018 USENIX Annual Technical Conference (USENIXATC\u201918) . 413\u2013426. Yanhao Chen, Ari B. Hayes, Chi Zhang, Timothy Salmon, and Eddy Z. Zhang. 2018. Locality-aware software throttling for sparse matrix operation on GPUs. In 2018 USENIX Annual Technical Conference (USENIXATC\u201918). 413\u2013426."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1854273.1854318"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO50266.2020.00084"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1188455.1188543"},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the 13th Annual IEEE\/ACM International Symposium on Code Generation and Optimization. IEEE Computer Society, 12\u201322","author":"Fauzia Naznin","unstructured":"Naznin Fauzia , Louis-No\u00ebl Pouchet , and P. Sadayappan . 2015. Characterizing and enhancing global memory data coalescing on GPUs . 
In Proceedings of the 13th Annual IEEE\/ACM International Symposium on Code Generation and Optimization. IEEE Computer Society, 12\u201322 . Naznin Fauzia, Louis-No\u00ebl Pouchet, and P. Sadayappan. 2015. Characterizing and enhancing global memory data coalescing on GPUs. In Proceedings of the 13th Annual IEEE\/ACM International Symposium on Code Generation and Optimization. IEEE Computer Society, 12\u201322."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1693453.1693504"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3406538"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2020.3023723"},{"key":"e_1_2_1_18_1","volume-title":"Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das.","author":"Jog Adwait","year":"2013","unstructured":"Adwait Jog , Onur Kayiran , Nachiappan Chidambaram Nachiappan , Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013 . OWL : Cooperative thread array aware scheduling techniques for improving GPGPU performance. In ACM SIGPLAN Notices, Vol. 48 . ACM , 395\u2013406. Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance. In ACM SIGPLAN Notices, Vol. 48. ACM, 395\u2013406."},{"key":"e_1_2_1_19_1","volume-title":"GPU Technology Conference Presentation S","volume":"338","author":"Jones Stephen","year":"2012","unstructured":"Stephen Jones . 2012 . Introduction to dynamic parallelism . In GPU Technology Conference Presentation S , Vol. 338 . 2012. Stephen Jones. 2012. Introduction to dynamic parallelism. In GPU Technology Conference Presentation S, Vol. 338. 
2012."},{"key":"e_1_2_1_20_1","volume-title":"Mary Jane Irwin, and Yuanrui Zhang","author":"Kandemir Mahmut","year":"2010","unstructured":"Mahmut Kandemir , Taylan Yemliha , SaiPrashanth Muralidhara , Shekhar Srikantaiah , Mary Jane Irwin, and Yuanrui Zhang . 2010 . Cache topology aware computation mapping for multicores. In ACM Sigplan Notices, Vol. 45 . ACM , 74\u201385. Mahmut Kandemir, Taylan Yemliha, SaiPrashanth Muralidhara, Shekhar Srikantaiah, Mary Jane Irwin, and Yuanrui Zhang. 2010. Cache topology aware computation mapping for multicores. In ACM Sigplan Notices, Vol. 45. ACM, 74\u201385."},{"key":"e_1_2_1_21_1","volume-title":"Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices","author":"Karypis George","unstructured":"George Karypis and Vipin Kumar . 1998. A Software Package for Partitioning Unstructured Graphs , Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices . University of Minnesota, Department of Computer Science and Engineering, Army HPC Research Center , Minneapolis, MN. George Karypis and Vipin Kumar. 1998. A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices. University of Minnesota, Department of Computer Science and Engineering, Army HPC Research Center, Minneapolis, MN."},{"key":"e_1_2_1_22_1","volume-title":"2020 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201920)","author":"Khairy Mahmoud","unstructured":"Mahmoud Khairy , Vadim Nikiforov , David Nellans , and Timothy G. Rogers . 2020. Locality-centric data and threadblock management for massive GPUs . In 2020 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201920) . IEEE, 1022\u20131036. Mahmoud Khairy, Vadim Nikiforov, David Nellans, and Timothy G. Rogers. 2020. Locality-centric data and threadblock management for massive GPUs. 
In 2020 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201920). IEEE, 1022\u20131036."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00073"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3337821.3337886"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2015.23"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1090\/S0002-9939-1956-0078686-7"},{"key":"e_1_2_1_27_1","volume-title":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA\u201914)","author":"Nagesh","unstructured":"Nagesh B. Lakshminarayana and Hyesoon Kim. 2014. Spare register aware prefetching for graph algorithms on GPUs . In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA\u201914) . IEEE, 614\u2013625. Nagesh B. Lakshminarayana and Hyesoon Kim. 2014. Spare register aware prefetching for graph algorithms on GPUs. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA\u201914). IEEE, 614\u2013625."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2004.1281665"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835937"},{"key":"e_1_2_1_30_1","unstructured":"Nikolaj Leischner Vitaly Osipov and Peter Sanders. 2009. Fermi architecture white paper.  Nikolaj Leischner Vitaly Osipov and Peter Sanders. 2009. Fermi architecture white paper."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3037697.3037709"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2017.2764886"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3062341.3062385"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3124538"},{"key":"e_1_2_1_35_1","volume-title":"Proceedings of the 44th Annual IEEE\/ACM International Symposium on Microarchitecture. 
ACM, 308\u2013317","author":"Narasiman Veynu","unstructured":"Veynu Narasiman , Michael Shebanow , Chang Joo Lee , Rustam Miftakhutdinov , Onur Mutlu , and Yale N. Patt . 2011. Improving GPU performance via large warps and two-level warp scheduling . In Proceedings of the 44th Annual IEEE\/ACM International Symposium on Microarchitecture. ACM, 308\u2013317 . Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N. Patt. 2011. Improving GPU performance via large warps and two-level warp scheduling. In Proceedings of the 44th Annual IEEE\/ACM International Symposium on Microarchitecture. ACM, 308\u2013317."},{"key":"e_1_2_1_36_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2007","unstructured":"NVIDIA. 2007 . CUDA Toolkit . Retrieved April 11, 2018 from https:\/\/developer.nvidia.com\/cuda-toolkit. NVIDIA. 2007. CUDA Toolkit. Retrieved April 11, 2018 from https:\/\/developer.nvidia.com\/cuda-toolkit."},{"key":"e_1_2_1_37_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2009","unstructured":"NVIDIA. 2009 . NVIDIA\u2019s Next Generation CUDA Compute Architecture: Fermi . Retrieved June 19, 2018 from https:\/\/www.nvidia.com\/content\/PDF\/fermi_white_papers\/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf. NVIDIA. 2009. NVIDIA\u2019s Next Generation CUDA Compute Architecture: Fermi. Retrieved June 19, 2018 from https:\/\/www.nvidia.com\/content\/PDF\/fermi_white_papers\/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf."},{"key":"e_1_2_1_38_1","unstructured":"NVIDIA. 2016. GeForce GTX 1080. http:\/\/international.download.nvidia.com\/geforce-com\/international\/pdfs\/GeForce_GTX_1080_Whitepaper_FINAL.pdf.  NVIDIA. 2016. GeForce GTX 1080. http:\/\/international.download.nvidia.com\/geforce-com\/international\/pdfs\/GeForce_GTX_1080_Whitepaper_FINAL.pdf."},{"key":"e_1_2_1_39_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2017","unstructured":"NVIDIA. 2017 . NVIDIA TESLA V100 GPU ARCHITECTURE . 
Retrieved November 26, 2018 from http:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf. NVIDIA. 2017. NVIDIA TESLA V100 GPU ARCHITECTURE. Retrieved November 26, 2018 from http:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf."},{"key":"e_1_2_1_40_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2020","unstructured":"NVIDIA. 2020 . CUDA Toolkit Documentation . Retrieved September 23, 2020 from https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html#just-in-time-compilation. NVIDIA. 2020. CUDA Toolkit Documentation. Retrieved September 23, 2020 from https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html#just-in-time-compilation."},{"key":"e_1_2_1_41_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2020","unstructured":"NVIDIA. 2020 . NVIDIA A100 Tensor Core GPU Architecture . Retrieved October 10, 2020 from https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/Data-Center\/nvidia-ampere-architecture-whitepaper.pdf. NVIDIA. 2020. NVIDIA A100 Tensor Core GPU Architecture. Retrieved October 10, 2020 from https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/Data-Center\/nvidia-ampere-architecture-whitepaper.pdf."},{"key":"e_1_2_1_42_1","volume-title":"April","author":"CUDA NVIDIA.","year":"2012","unstructured":"CUDA NVIDIA. 2012. C Programming Guide, v4.2 , April 2012 . CUDA NVIDIA. 2012. C Programming Guide, v4.2, April 2012."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001158"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2019.00042"},{"key":"e_1_2_1_45_1","volume-title":"Polybench: The polyhedral benchmark suite","author":"Pouchet Louis-No\u00ebl","year":"2012","unstructured":"Louis-No\u00ebl Pouchet . 2012 . Polybench: The polyhedral benchmark suite . http:\/\/www. cs. ucla. edu\/pouchet\/software\/polybench. Louis-No\u00ebl Pouchet. 2012. 
Polybench: The polyhedral benchmark suite. http:\/\/www. cs. ucla. edu\/pouchet\/software\/polybench."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1002\/j.1538-7305.1957.tb01515.x"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2019.2933842"},{"key":"e_1_2_1_48_1","volume-title":"Proceedings of the 2012 45th Annual IEEE\/ACM International Symposium on Microarchitecture. IEEE Computer Society, 72\u201383","author":"Rogers Timothy G.","unstructured":"Timothy G. Rogers , Mike O\u2019Connor , and Tor M. Aamodt . 2012. Cache-conscious wavefront scheduling . In Proceedings of the 2012 45th Annual IEEE\/ACM International Symposium on Microarchitecture. IEEE Computer Society, 72\u201383 . Timothy G. Rogers, Mike O\u2019Connor, and Tor M. Aamodt. 2012. Cache-conscious wavefront scheduling. In Proceedings of the 2012 45th Annual IEEE\/ACM International Symposium on Microarchitecture. IEEE Computer Society, 72\u201383."},{"key":"e_1_2_1_49_1","volume-title":"Proceedings of the 2012 45th Annual IEEE\/ACM International Symposium on Microarchitecture. IEEE Computer Society, 72\u201383","author":"Rogers Timothy G.","unstructured":"Timothy G. Rogers , Mike O\u2019Connor , and Tor M. Aamodt . 2012. Cache-conscious wavefront scheduling . In Proceedings of the 2012 45th Annual IEEE\/ACM International Symposium on Microarchitecture. IEEE Computer Society, 72\u201383 . Timothy G. Rogers, Mike O\u2019Connor, and Tor M. Aamodt. 2012. Cache-conscious wavefront scheduling. In Proceedings of the 2012 45th Annual IEEE\/ACM International Symposium on Microarchitecture. IEEE Computer Society, 72\u201383."},{"key":"e_1_2_1_50_1","volume-title":"Proceedings of the 46th Annual IEEE\/ACM International Symposium on Microarchitecture. ACM, 99\u2013110","author":"Rogers Timothy G.","unstructured":"Timothy G. Rogers , Mike O\u2019Connor , and Tor M. Aamodt . 2013. Divergence-aware warp scheduling . 
In Proceedings of the 46th Annual IEEE\/ACM International Symposium on Microarchitecture. ACM, 99\u2013110 . Timothy G. Rogers, Mike O\u2019Connor, and Tor M. Aamodt. 2013. Divergence-aware warp scheduling. In Proceedings of the 46th Annual IEEE\/ACM International Symposium on Microarchitecture. ACM, 99\u2013110."},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/1070891.1065927"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178442.3178445"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3168831"},{"key":"e_1_2_1_54_1","volume-title":"Geng Daniel Liu, and Wen-mei W. Hwu","author":"Stratton John A.","year":"2012","unstructured":"John A. Stratton , Christopher Rodrigues , I- Jui Sung , Nady Obeid , Li-Wen Chang , Nasser Anssari , Geng Daniel Liu, and Wen-mei W. Hwu . 2012 . Parboil : A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing 127 (2012). John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W. Hwu. 2012. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing 127 (2012)."},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2017.106"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3370748.3406577"},{"key":"e_1_2_1_57_1","volume-title":"2019 32nd International Conference on VLSI Design and 2019 18th International Conference on Embedded Systems (VLSID\u201919)","author":"Tripathy S.","year":"2019","unstructured":"S. Tripathy , D. Sahoo , and M. Satpathy . 2019. Multidimensional grid aware address prediction for GPGPU . In 2019 32nd International Conference on VLSI Design and 2019 18th International Conference on Embedded Systems (VLSID\u201919) . 263\u2013268. DOI:https:\/\/doi.org\/10.1109\/VLSID. 2019 .00064 S. Tripathy, D. 
Sahoo, and M. Satpathy. 2019. Multidimensional grid aware address prediction for GPGPU. In 2019 32nd International Conference on VLSI Design and 2019 18th International Conference on Embedded Systems (VLSID\u201919). 263\u2013268. DOI:https:\/\/doi.org\/10.1109\/VLSID.2019.00064"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1002\/spe.4380230407"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00074"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001199"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/2486159.2486175"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3330345.3330373"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3370748.3406553"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/1993744.1993748"},{"key":"e_1_2_1_65_1","volume-title":"Intel VTune amplifier","author":"Zone Intel Developer","year":"2017","unstructured":"Intel Developer Zone . 2017. Intel VTune amplifier , 2017 . https:\/\/software.intel.com\/en-us\/intel-vtune-amplifier-xe-support\/documentation. Intel Developer Zone. 2017. Intel VTune amplifier, 2017. https:\/\/software.intel.com\/en-us\/intel-vtune-amplifier-xe-support\/documentation."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3451164","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3451164","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3451164","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T17:49:25Z","timestamp":1750268965000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3451164"}},"subtitle":["Locality Graph-Based Thread Block Scheduling for GPUs"],"short-title":[],"issued":{"date-parts":[[2021,6,8]]},"references-count":64,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,9,30]]}},"alternative-id":["10.1145\/3451164"],"URL":"https:\/\/doi.org\/10.1145\/3451164","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,6,8]]},"assertion":[{"value":"2020-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-02-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-06-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}