{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T06:18:08Z","timestamp":1767853088003,"version":"3.49.0"},"reference-count":29,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,12,14]],"date-time":"2023-12-14T00:00:00Z","timestamp":1702512000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"CSC scholarship","award":["201903170128"],"award-info":[{"award-number":["201903170128"]}]},{"name":"Magnus Jahre is supported by the Research Council of Norway","award":["286596"],"award-info":[{"award-number":["286596"]}]},{"name":"UGent-BOF-GOA","award":["01G01421"],"award-info":[{"award-number":["01G01421"]}]},{"name":"European Research Council","award":["741097"],"award-info":[{"award-number":["741097"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,12,31]]},"abstract":"<jats:p>\n            Multi-chip Graphics Processing Unit (GPU) systems are critical to scale performance beyond a single GPU chip for a wide variety of important emerging applications. A key challenge for multi-chip GPUs, though, is how to overcome the bandwidth gap between inter-chip and intra-chip communication. Accesses to shared data, i.e., data accessed by multiple chips, pose a major performance challenge as they incur remote memory accesses possibly congesting the inter-chip links and degrading overall system performance. This article characterizes the shared dataset in multi-chip GPUs in terms of (1) truly versus falsely shared data, (2) how the shared dataset scales with input size, (3) along which dimensions the shared dataset scales, and (4) how sensitive the shared dataset is with respect to the input\u2019s characteristics, i.e., node degree and connectivity in graph workloads. We observe significant variety in scaling behavior across workloads: some workloads feature a shared dataset that scales linearly with input size, whereas others feature sublinear scaling (following a\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\sqrt {2}\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            or\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\sqrt [3]{2}\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            relationship). We further demonstrate how the shared dataset affects the optimum last-level cache organization (memory-side versus SM-side) in multi-chip GPUs, as well as optimum memory page allocation and thread scheduling policy. Sensitivity analyses demonstrate the insights across the broad design space.\n          <\/jats:p>","DOI":"10.1145\/3629521","type":"journal-article","created":{"date-parts":[[2023,10,20]],"date-time":"2023-10-20T21:45:06Z","timestamp":1697838306000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Characterizing Multi-Chip GPU Data Sharing"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6690-3718","authenticated-orcid":false,"given":"Shiqing","family":"Zhang","sequence":"first","affiliation":[{"name":"Ghent University, Belgium"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-7762-2878","authenticated-orcid":false,"given":"Mahmood","family":"Naderan-Tahan","sequence":"additional","affiliation":[{"name":"Ghent University, Belgium"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9147-5228","authenticated-orcid":false,"given":"Magnus","family":"Jahre","sequence":"additional","affiliation":[{"name":"Norwegian University of Science and Technology (NTNU), Norway"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8792-4473","authenticated-orcid":false,"given":"Lieven","family":"Eeckhout","sequence":"additional","affiliation":[{"name":"Ghent University, Belgium"}]}],"member":"320","published-online":{"date-parts":[[2023,12,14]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"AMD. 2021. AMD Instinct MI200 Series Accelerator. https:\/\/www.amd.com\/system\/files\/documents\/amd-instinct-mi200-datasheet.pdf[Online; accessed 2023-03-02]."},{"key":"e_1_3_2_3_2","first-page":"320","volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA)","author":"Arunkumar Akhil","year":"2017","unstructured":"Akhil Arunkumar, Evgeny Bolotin, Benjamin Cho, Ugljesa Milic, Eiman Ebrahimi, Oreste Villa, Aamer Jaleel, Carole-Jean Wu, and David Nellans. 2017. MCM-GPU: Multi-chip-module GPUs for continued performance scalability. In Proceedings of the International Symposium on Computer Architecture (ISCA). IEEE, 320\u2013332."},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2009.4919648"},{"key":"e_1_3_2_5_2","first-page":"596","volume-title":"Proceedings of the International Symposium on High Performance Computer Architecture (HPCA)","author":"Baruah Trinayan","year":"2020","unstructured":"Trinayan Baruah, Yifan Sun, Ali Tolga Din\u00e7er, Saiful A. Mojumder, Jos\u00e9 L. Abell\u00e1n, Yash Ukidave, Ajay Joshi, Norman Rubin, John Kim, and David Kaeli. 2020. Griffin: Hardware-software support for efficient page migration in multi-GPU systems. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA). IEEE, 596\u2013609."},{"key":"e_1_3_2_6_2","first-page":"1","volume-title":"Proceedings of Hot Chips 33 Symposium (HCS)","author":"Blythe David","year":"2021","unstructured":"David Blythe. 2021. Xe-HPC Ponte Vecchio. In Proceedings of Hot Chips 33 Symposium (HCS). IEEE Computer Society, 1\u201334."},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_3_2_8_2","first-page":"3","volume-title":"Proceedings of the International Symposium on VLSI Technology and Circuits (VLSI)","author":"Dally William J.","year":"2018","unstructured":"William J. Dally, C. Thomas Gray, John Poulton, Brucek Khailany, John Wilson, and Larry Dennison. 2018. Hardware-enabled artificial intelligence. In Proceedings of the International Symposium on VLSI Technology and Circuits (VLSI). IEEE, 3\u20136."},{"key":"e_1_3_2_9_2","unstructured":"John Cavazos and Scott Grauer-Gray. 2015. PolyBench\/GPU 1.0. https:\/\/web.cse.ohio-state.edu\/pouchet.2\/software\/polybench\/[Online; accessed 2022-04-16]."},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10586-014-0400-1"},{"key":"e_1_3_2_11_2","first-page":"1","article-title":"Graphics processing requirements for enabling immersive VR","author":"Kanter David","year":"2015","unstructured":"David Kanter. 2015. Graphics processing requirements for enabling immersive VR. AMD White Paper (2015), 1\u201312.","journal-title":"AMD White Paper"},{"key":"e_1_3_2_12_2","first-page":"1022","volume-title":"Proceedings of the International Symposium on Microarchitecture (MICRO)","author":"Khairy Mahmoud","year":"2020","unstructured":"Mahmoud Khairy, Vadim Nikiforov, David Nellans, and Timothy G. Rogers. 2020. Locality-centric data and threadblock management for massive GPUs. In Proceedings of the International Symposium on Microarchitecture (MICRO). IEEE, 1022\u20131036."},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3232521"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00024"},{"key":"e_1_3_2_15_2","first-page":"123","volume-title":"Proceedings of the International Symposium on Microarchitecture (MICRO)","author":"Milic Ugljesa","year":"2017","unstructured":"Ugljesa Milic, Oreste Villa, Evgeny Bolotin, Akhil Arunkumar, Eiman Ebrahimi, Aamer Jaleel, Alex Ramirez, and David Nellans. 2017. Beyond the socket: NUMA-aware GPUs. In Proceedings of the International Symposium on Microarchitecture (MICRO). IEEE, 123\u2013135."},{"key":"e_1_3_2_16_2","first-page":"122","volume-title":"Proceedings of the International Symposium on Workload Characterization (IISWC)","author":"Mojumder Saiful A.","year":"2018","unstructured":"Saiful A. Mojumder, Marcia S. Louis, Yifan Sun, Amir Kavyan Ziabari, Jos\u00e9 L. Abell\u00e1n, John Kim, David Kaeli, and Ajay Joshi. 2018. Profiling DNN workloads on a volta-based DGX-1 system. In Proceedings of the International Symposium on Workload Characterization (IISWC). IEEE, 122\u2013133."},{"key":"e_1_3_2_17_2","unstructured":"NVIDIA. 2016. NVIDIA DGX-1: Essential Instrument of AI Research. https:\/\/www.nvidia.com\/en-gb\/data-center\/dgx-systems\/dgx-1\/[Online; accessed 2022-04-16]."},{"key":"e_1_3_2_18_2","unstructured":"NVIDIA. 2016. NVIDIA Tesla P100: The Most Advanced Datacenter Accelerator Ever Built. https:\/\/images.nvidia.com\/content\/pdf\/tesla\/whitepaper\/pascal-architecture-whitepaper.pdf[Online; accessed 2022-04-16]."},{"key":"e_1_3_2_19_2","unstructured":"NVIDIA. 2018. NVIDIA DGX-2: Break Through the Barriers to AI Speed and Scale. https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-2\/[Online; accessed 2022-04-16]."},{"key":"e_1_3_2_20_2","unstructured":"NVIDIA. 2022. NVIDIA H100 Tensor Core GPU: Unprecedented Performance Scalability and Security for Every Data Center. https:\/\/resources.nvidia.com\/en-us-tensor-core\/nvidia-tensor-core-gpu-datasheet[Online; accessed 2023-03-02]."},{"key":"e_1_3_2_21_2","unstructured":"NVIDIA. 2022. NVIDIA NVLINK: High-speed GPU Interconnect. https:\/\/www.nvidia.com\/en-us\/design-visualization\/nvlink-bridges\/[Online; accessed 2022-04-16]."},{"key":"e_1_3_2_22_2","first-page":"72","volume-title":"Proceedings of the International Symposium on Microarchitecture (MICRO)","author":"Rogers Timothy G.","year":"2012","unstructured":"Timothy G. Rogers, Mike O\u2019Connor, and Tor M. Aamodt. 2012. Cache-conscious wavefront scheduling. In Proceedings of the International Symposium on Microarchitecture (MICRO). IEEE, 72\u201383."},{"key":"e_1_3_2_23_2","first-page":"14","volume-title":"Proceedings of the International Symposium on Microarchitecture (MICRO)","author":"Shao Yakun Sophia","year":"2019","unstructured":"Yakun Sophia Shao, Jason Clemons, Rangharajan Venkatesan, Brian Zimmer, Matthew Fojtik, Nan Jiang, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Priyanka Raina, Stephen G. Tell, Yanqing Zhang, William J. Dally, Joel Emer, C. Thomas Gray, Brucek Khailany, and Stephen W. Keckler. 2019. Simba: Scaling deep-learning inference with multi-chip-module-based architecture. In Proceedings of the International Symposium on Microarchitecture (MICRO). 14\u201327."},{"key":"e_1_3_2_24_2","first-page":"29","volume-title":"Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing","author":"Stratton John A.","year":"2012","unstructured":"John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W. Hwu. 2012. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. Technical Report. IMPACT-12-01, Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign. 29 pages."},{"key":"e_1_3_2_25_2","first-page":"829","volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA)","author":"Vijaykumar Nandita","year":"2018","unstructured":"Nandita Vijaykumar, Eiman Ebrahimi, Kevin Hsieh, Phillip B. Gibbons, and Onur Mutlu. 2018. The locality descriptor: A holistic cross-layer abstraction to express data locality in GPUs. In Proceedings of the International Symposium on Computer Architecture (ISCA). IEEE, 829\u2013842."},{"key":"e_1_3_2_26_2","first-page":"339","volume-title":"Proceedings of the International Symposium on Microarchitecture (MICRO)","author":"Young Vinson","year":"2018","unstructured":"Vinson Young, Aamer Jaleel, Evgeny Bolotin, Eiman Ebrahimi, David Nellans, and Oreste Villa. 2018. Combining HW\/SW mechanisms to improve NUMA performance of multi-GPU systems. In Proceedings of the International Symposium on Microarchitecture (MICRO). IEEE, 339\u2013351."},{"key":"e_1_3_2_27_2","first-page":"1","volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA)","author":"Zhang Shiqing","year":"2023","unstructured":"Shiqing Zhang, Mahmood Naderan-Tahan, Magnus Jahre, and Lieven Eeckhout. 2023. SAC: Sharing-aware caching in multi-chip GPUs. In Proceedings of the International Symposium on Computer Architecture (ISCA). 1\u201313."},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3307650.3322235"},{"key":"e_1_3_2_29_2","first-page":"967","volume-title":"Proceedings of the International Symposium on Microarchitecture (MICRO)","author":"Zhao Xia","year":"2020","unstructured":"Xia Zhao, Magnus Jahre, and Lieven Eeckhout. 2020. Selective replication in memory-side GPU caches. In Proceedings of the International Symposium on Microarchitecture (MICRO). IEEE, 967\u2013980."},{"key":"e_1_3_2_30_2","first-page":"544","volume-title":"Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)","author":"Zhao Xia","year":"2023","unstructured":"Xia Zhao, Magnus Jahre, Yuhua Tang, Guangda Zhang, and Lieven Eeckhout. 2023. NUBA: Non-uniform bandwidth GPUs. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 544\u2013559."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3629521","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3629521","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:36:01Z","timestamp":1750178161000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3629521"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,14]]},"references-count":29,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,12,31]]}},"alternative-id":["10.1145\/3629521"],"URL":"https:\/\/doi.org\/10.1145\/3629521","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,12,14]]},"assertion":[{"value":"2023-03-16","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-10-12","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-12-14","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}