{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T07:08:26Z","timestamp":1767856106574,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":96,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,1,27]],"date-time":"2023-01-27T00:00:00Z","timestamp":1674777600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62102438"],"award-info":[{"award-number":["62102438"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100005090","name":"Beijing Nova Program","doi-asserted-by":"publisher","award":[""],"award-info":[{"award-number":[""]}],"id":[{"id":"10.13039\/501100005090","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Research Council of Norway","award":["286596"],"award-info":[{"award-number":["286596"]}]},{"name":"UGent-BOF-GOA grant","award":["No. 01G01421"],"award-info":[{"award-number":["No. 
01G01421"]}]},{"DOI":"10.13039\/501100000781","name":"European Research Council","doi-asserted-by":"publisher","award":["741097"],"award-info":[{"award-number":["741097"]}],"id":[{"id":"10.13039\/501100000781","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,1,27]]},"DOI":"10.1145\/3575693.3575745","type":"proceedings-article","created":{"date-parts":[[2023,1,30]],"date-time":"2023-01-30T22:56:55Z","timestamp":1675119415000},"page":"544-559","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["NUBA: Non-Uniform Bandwidth GPUs"],"prefix":"10.1145","author":[{"given":"Xia","family":"Zhao","sequence":"first","affiliation":[{"name":"Academy of Military Sciences, China"}]},{"given":"Magnus","family":"Jahre","sequence":"additional","affiliation":[{"name":"NTNU, Norway"}]},{"given":"Yuhua","family":"Tang","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Guangda","family":"Zhang","sequence":"additional","affiliation":[{"name":"Academy of Military Sciences, China"}]},{"given":"Lieven","family":"Eeckhout","sequence":"additional","affiliation":[{"name":"Ghent University, Belgium"}]}],"member":"320","published-online":{"date-parts":[[2023,1,30]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Rogers","author":"Aamodt Tor M.","year":"2018","unstructured":"Tor M. Aamodt, Wilson W. L. Fung, and Timothy G. Rogers. 2018. General-Purpose Graphics Processor Architectures. Morgan & Claypool Publishers."},{"key":"e_1_3_2_1_2_1","unstructured":"AMD. 2012. 
AMD Graphics Core Next. https:\/\/www.techpowerup.com\/gpu-specs\/docs\/amd-gcn1-architecture.pdf"},{"key":"e_1_3_2_1_3_1","unstructured":"AMD. 2019. Introducing RDNA Architecture. https:\/\/www.amd.com\/system\/files\/documents\/rdna-whitepaper.pdf"},{"key":"e_1_3_2_1_4_1","unstructured":"AMD. 2020. Introducing AMD CDNA Architecture. https:\/\/www.amd.com\/system\/files\/documents\/amd-cdna-whitepaper.pdf"},{"key":"e_1_3_2_1_5_1","unstructured":"AMD. 2021. AMD Radeon PRO V620. https:\/\/www.amd.com\/en\/products\/server-accelerators\/amd-radeon-pro-v620"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080231"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00028"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3123975"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173162.3173169"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1854273.1854314"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1531793.1531803"},{"key":"e_1_3_2_1_12_1","volume-title":"Proceedings of the International Symposium on Microarchitecture (MICRO). 421\u2013432","author":"Bakhoda Ali","unstructured":"Ali Bakhoda, John Kim, and Tor M. Aamodt. 2010. Throughput-Effective On-Chip Networks for Manycore Accelerators. 
In Proceedings of the International Symposium on Microarchitecture (MICRO). 421\u2013432."},{"key":"e_1_3_2_1_13_1","volume-title":"Proceeding of the International Symposium on Performance Analysis of Systems and Software (ISPASS). 163\u2013174","author":"Bakhoda Ali","unstructured":"Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In Proceeding of the International Symposium on Performance Analysis of Systems and Software (ISPASS). 163\u2013174."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00055"},{"key":"e_1_3_2_1_15_1","volume-title":"Proceedings of the International Symposium on Microarchitecture (MICRO). 443\u2013454","author":"Beckmann Bradford M.","unstructured":"Bradford M. Beckmann, Michael R. Marty, and David A. Wood. 2006. ASR: Adaptive Selective Replication for CMP Caches. In Proceedings of the International Symposium on Microarchitecture (MICRO). 443\u2013454."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2010.5470442"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-010-0136-3"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/195473.195485"},{"key":"e_1_3_2_1_19_1","volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA). 264\u2013276","author":"Chang Jichuan","unstructured":"Jichuan Chang and Gurindar S. Sohi. 2006. 
Cooperative Caching for Chip Multiprocessors. In Proceedings of the International Symposium on Computer Architecture (ISCA). 264\u2013276."},{"key":"e_1_3_2_1_20_1","volume-title":"International Symposium on High Performance Computer Architecture (HPCA). 73\u201384","author":"Chatterjee Niladrish","unstructured":"Niladrish Chatterjee, Mike O\u2019Connor, Donghyuk Lee, Daniel R. Johnson, Stephen W. Keckler, Minsoo Rhu, and William J. Dally. 2017. Architecting an Energy-Efficient DRAM System for GPUs. In International Symposium on High Performance Computer Architecture (HPCA). 73\u201384."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_3_2_1_22_1","first-page":"634","article-title":"Scalable crossbar apparatus and method for arranging crossbar circuits","volume":"9","author":"Chen Gregory K.","year":"2017","unstructured":"Gregory K. Chen, Mark A. Anders, and Himanshu Kaul. 2017. Scalable crossbar apparatus and method for arranging crossbar circuits. US Patent 9,577,634","journal-title":"US Patent"},{"key":"e_1_3_2_1_23_1","volume-title":"Proceedings of the International Conference on Supercomputing (ICS). 353\u2013360","author":"Chen Hu","unstructured":"Hu Chen, Wenguang Chen, Jian Huang, Bob Robert, and H. Kuhn. 2006. 
MPIPP: An Automatic Profile-Guided Parallel Process Placement Toolset for SMP Clusters and Multiclusters. In Proceedings of the International Conference on Supercomputing (ICS). 353\u2013360."},{"key":"e_1_3_2_1_24_1","volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA). 357\u2013368","author":"Chishti Zeshan","unstructured":"Zeshan Chishti, Michael D. Powell, and T. N. Vijaykumar. 2005. Optimizing Replication, Communication, and Capacity Allocation in CMPs. In Proceedings of the International Symposium on Computer Architecture (ISCA). 357\u2013368."},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC42613.2021.9365803"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CCGrid.2016.91"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2451116.2451157"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2628071.2628085"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCC.2010.114"},{"key":"e_1_3_2_1_30_1","volume-title":"Proceedings of the International Symposium on Code Generation and Optimization (CGO). 1\u201312","author":"Ding Wei","year":"2013","unstructured":"
Wei Ding, Yuanrui Zhang, Mahmut Kandemir, Jithendra Srinivas, and Praveen Yedlapalli. 2013. Locality-aware Mapping and Scheduling for Multicores. In Proceedings of the International Symposium on Code Generation and Optimization (CGO). 1\u201312."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2007.346180"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2814328"},{"key":"e_1_3_2_1_33_1","unstructured":"David B. Glasco, Peter B. Holmqvist, George R. Lynch, Patrick R. Marchand, Karan Mehra, and James Roberts. 2012. Cache-based Control of Atomic Operations in Conjunction With an External ALU Block. US Patent 8,135,926 B1"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/InPar.2012.6339595"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/1555754.1555779"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/1454115.1454152"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2010.83"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER.2011.59"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3410463.3414623"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00047"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2008.5216918"},{"key":"e_1_3_2_1_42_1","volume-title":"Proceedings of the International Symposium on Microarchitecture (MICRO). 1022\u20131036","author":"Khairy Mahmoud","unstructured":"Mahmoud Khairy, Vadim Nikiforov, David Nellans, and Timothy G. Rogers. 2020. Locality-Centric Data and Threadblock Management for Massive GPUs. 
In Proceedings of the International Symposium on Microarchitecture (MICRO). 1022\u20131036."},{"key":"e_1_3_2_1_43_1","volume-title":"Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 211\u2013222","author":"Kim Changkyu","unstructured":"Changkyu Kim, Doug Burger, and Stephen W. Keckler. 2002. An Adaptive, Non-uniform Cache Structure for Wire-delay Dominated On-chip Caches. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 211\u2013222."},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2015.2414456"},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080239"},{"key":"e_1_3_2_1_46_1","unstructured":"Argonne National Laboratory. 2013. Using the Hydra Process Manager. https:\/\/wiki.mpich.org\/mpich\/index.php\/Using_the_Hydra_Process_Manager"},{"key":"e_1_3_2_1_47_1","volume-title":"Proceedings of the International Conference on Parallel Processing and Applied Mathematics (PPAM). 576\u2013585","author":"Lankes Stefan","year":"2009","unstructured":"Stefan Lankes, Boris Bierbaum, and Thomas Bemmerl. 2009. Affinity-on-next-Touch: An Extension to the Linux Kernel for NUMA Architectures. 
In Proceedings of the International Conference on Parallel Processing and Applied Mathematics (PPAM). 576\u2013585."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485964"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00024"},{"key":"e_1_3_2_1_50_1","volume-title":"Proceedings of the International Conference on Supercomputing (ICS). 387\u2013392","author":"L\u00f6f Henrik","year":"2005","unstructured":"Henrik L\u00f6f and Sverker Holmgren. 2005. Affinity-on-next-Touch: Increasing the Performance of an Industrial PDE Solver on a Cc-NUMA System. In Proceedings of the International Conference on Supercomputing (ICS). 387\u2013392."},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/1122971.1122987"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2010.08.015"},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3124534"},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2000.10025"},{"key":"e_1_3_2_1_55_1","unstructured":"NVIDIA. 2009. NVIDIA\u2019s Next Generation CUDA Compute Architecture. 
https:\/\/www.nvidia.com\/content\/PDF\/fermi_white_papers\/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf"},{"key":"e_1_3_2_1_56_1","unstructured":"NVIDIA. 2012. NVIDIA GeForce GTX 680. https:\/\/www.nvidia.com\/content\/PDF\/product-specifications\/GeForce_GTX_680_Whitepaper_FINAL.pdf"},{"key":"e_1_3_2_1_57_1","unstructured":"NVIDIA. 2014. NVIDIA GeForce GTX 980. https:\/\/www.microway.com\/download\/whitepaper\/NVIDIA_Maxwell_GM204_Architecture_Whitepaper.pdf"},{"key":"e_1_3_2_1_58_1","unstructured":"NVIDIA. 2016. NVIDIA Tesla P100. https:\/\/images.nvidia.com\/content\/pdf\/tesla\/whitepaper\/pascal-architecture-whitepaper.pdf"},{"key":"e_1_3_2_1_59_1","unstructured":"NVIDIA. 2016. NVIDIA Turing GPU Architecture. https:\/\/images.nvidia.cn\/aem-dam\/en-zz\/Solutions\/design-visualization\/technologies\/turing-architecture\/NVIDIA-Turing-Architecture-Whitepaper.pdf"},{"key":"e_1_3_2_1_60_1","unstructured":"NVIDIA. 2017. NVIDIA Tesla V100 Volta Architecture. http:\/\/www.nvidia.com\/object\/volta-architecture-whitepaper.html"},{"key":"e_1_3_2_1_61_1","unstructured":"NVIDIA. 2018. VOLTA Architecture and performance optimization. 
http:\/\/on-demand.gputechconf.com\/gtc\/2018\/presentation\/s81006-volta-architecture-and-performance-optimization.pdf"},{"key":"e_1_3_2_1_62_1","unstructured":"NVIDIA. 2019. Parallel Thread Execution ISA Version 6.5. https:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/index.html"},{"key":"e_1_3_2_1_63_1","unstructured":"NVIDIA. 2020. CUDA COMPILER DRIVER NVCC. https:\/\/docs.nvidia.com\/pdf\/CUDA_Compiler_Driver_NVCC.pdf"},{"key":"e_1_3_2_1_64_1","unstructured":"NVIDIA. 2020. NVIDIA A100 Tensor Core GPU Architecture. https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/Data-Center\/nvidia-ampere-architecture-whitepaper.pdf"},{"key":"e_1_3_2_1_65_1","unstructured":"NVIDIA. 2022. NVIDIA CUDA SDK Code Samples. https:\/\/developer.nvidia.com\/cuda-downloads"},{"key":"e_1_3_2_1_66_1","unstructured":"NVIDIA. 2022. NVLink and NVSwitch. https:\/\/www.nvidia.com\/en-us\/data-center\/nvlink\/"},{"key":"e_1_3_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/1640089.1640117"},{"key":"e_1_3_2_1_68_1","unstructured":"Oracle. 2010. 
Solaris OS Tuning Features. https:\/\/docs.oracle.com\/cd\/E18659_01\/html\/821-1381\/aewda.html"},{"key":"e_1_3_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1109\/NOCS.2010.37"},{"key":"e_1_3_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1145\/1999946.1999981"},{"key":"e_1_3_2_1_71_1","volume-title":"Proceedings of the International Conference on Parallel Architectures and Compilation (PACT). 369\u2013380","author":"Piccoli Guilherme","unstructured":"Guilherme Piccoli, Henrique N. Santos, Raphael E. Rodrigues, Christiane Pousa, Edson Borin, and Fernando M. Quint\u00e3o Pereira. 2014. Compiler Support for Selective Page Migration in NUMA Architectures. In Proceedings of the International Conference on Parallel Architectures and Compilation (PACT). 369\u2013380."},{"key":"e_1_3_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1145\/2541940.2541942"},{"key":"e_1_3_2_1_73_1","volume-title":"Proceedings of the International Symposium on High Performance Computer Architecture (HPCA). 568\u2013578","author":"Power Jason","unstructured":"Jason Power, Mark D. Hill, and David A. Wood. 2014. Supporting x86-64 Address Translation for 100s of GPU Lanes. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA). 
568\u2013578."},{"key":"e_1_3_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2009.4798236"},{"key":"e_1_3_2_1_75_1","volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA). 167\u2013178","author":"Qureshi Moinuddin K.","unstructured":"Moinuddin K. Qureshi, Daniel N. Lynch, Onur Mutlu, and Yale N. Patt. 2006. A Case for MLP-Aware Cache Replacement. In Proceedings of the International Symposium on Computer Architecture (ISCA). 167\u2013178."},{"key":"e_1_3_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1145\/2150976.2151002"},{"key":"e_1_3_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00054"},{"key":"e_1_3_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCC.2009.5202271"},{"key":"e_1_3_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1109\/JETCAS.2012.2193936"},{"key":"e_1_3_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00025"},{"key":"e_1_3_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2018.00036"},{"key":"e_1_3_2_1_82_1","volume-title":"Geng Daniel Liu, and Wen-mei W Hwu","author":"Stratton John A","year":"2012","unstructured":"John A Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W Hwu. 2012. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. 
University of Illinois at Urbana-Champaign."},{"key":"e_1_3_2_1_83_1","doi-asserted-by":"publisher","DOI":"10.1109\/NOCS.2012.31"},{"key":"e_1_3_2_1_84_1","doi-asserted-by":"publisher","DOI":"10.1145\/1272996.1273004"},{"key":"e_1_3_2_1_85_1","volume-title":"Proceedings of the International Conference on Supercomputing (SC). 46\u201346","author":"Mustafa","unstructured":"Mustafa M. Tikir and Jeffrey K. Hollingsworth. 2004. Using Hardware Counters to Automatically Improve Memory Performance. In Proceedings of the International Conference on Supercomputing (SC). 46\u201346."},{"key":"e_1_3_2_1_86_1","volume-title":"Tango: A Deep Neural Network Benchmark Suite for Various Accelerators. https:\/\/gitlab.com\/Tango-DNNbench\/Tango","author":"San Jose State University","year":"2019","unstructured":"San Jose State University. 2019. Tango: A Deep Neural Network Benchmark Suite for Various Accelerators. 
https:\/\/gitlab.com\/Tango-DNNbench\/Tango"},{"key":"e_1_3_2_1_87_1","doi-asserted-by":"publisher","DOI":"10.1145\/237090.237205"},{"key":"e_1_3_2_1_88_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2016.7482091"},{"key":"e_1_3_2_1_89_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2018.00108"},{"key":"e_1_3_2_1_90_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2018.00035"},{"key":"e_1_3_2_1_91_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2019.2944790"},{"key":"e_1_3_2_1_92_1","doi-asserted-by":"publisher","DOI":"10.1145\/1080695.1069998"},{"key":"e_1_3_2_1_93_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155677"},{"key":"e_1_3_2_1_94_1","doi-asserted-by":"publisher","DOI":"10.1145\/3307650.3322235"},{"key":"e_1_3_2_1_95_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO50266.2020.00082"},{"key":"e_1_3_2_1_96_1","volume-title":"Proceedings of the International Symposium on High Performance Computer Architecture (HPCA). 345\u2013357","author":"Zheng Tianhao","unstructured":"Tianhao Zheng, David Nellans, Arslan Zulfiqar, Mark Stephenson, and Stephen W. Keckler. 2016. Towards High Performance Paged Memory for GPUs. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA). 
345\u2013357."}],"event":{"name":"ASPLOS '23: 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2","location":"Vancouver BC Canada","acronym":"ASPLOS '23","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture","SIGOPS ACM Special Interest Group on Operating Systems","SIGPLAN ACM Special Interest Group on Programming Languages"]},"container-title":["Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3575693.3575745","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3575693.3575745","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:51:20Z","timestamp":1750182680000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3575693.3575745"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,27]]},"references-count":96,"alternative-id":["10.1145\/3575693.3575745","10.1145\/3575693"],"URL":"https:\/\/doi.org\/10.1145\/3575693.3575745","relation":{},"subject":[],"published":{"date-parts":[[2023,1,27]]},"assertion":[{"value":"2023-01-30","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}