{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,19]],"date-time":"2026-06-19T02:42:14Z","timestamp":1781836934062,"version":"3.54.5"},"reference-count":81,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2021,12,6]],"date-time":"2021-12-06T00:00:00Z","timestamp":1638748800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2022,3,31]]},"abstract":"<jats:p>\n            As GPUs scale their low-precision matrix math throughput to boost deep learning (DL) performance, they upset the balance between math throughput and memory system capabilities. We demonstrate that a converged GPU design trying to address diverging architectural requirements between FP32 (or larger)-based HPC and FP16 (or smaller)-based DL workloads results in sub-optimal configurations for either of the application domains. We argue that a\n            <jats:bold>C<\/jats:bold>\n            omposable\n            <jats:bold>O<\/jats:bold>\n            n-\n            <jats:bold>PA<\/jats:bold>\n            ckage\n            <jats:bold>GPU<\/jats:bold>\n            (COPA-GPU) architecture to provide domain-specialized GPU products is the most practical solution to these diverging requirements. A COPA-GPU leverages multi-chip-module disaggregation to support maximal design reuse, along with memory system specialization per application domain. We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4\u00d7 higher off-die bandwidth, 32\u00d7 larger on-package cache, and 2.3\u00d7 higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs. 
This work explores the microarchitectural design necessary to enable composable GPUs and evaluates the benefits composability can provide to HPC, DL training, and DL inference. We show that when compared to a converged GPU design, a DL-optimized COPA-GPU featuring a combination of 16\u00d7 larger cache capacity and 1.6\u00d7 higher DRAM bandwidth scales per-GPU training and inference performance by 31% and 35%, respectively, and reduces the number of GPU instances by 50% in scale-out training scenarios.\n          <\/jats:p>","DOI":"10.1145\/3484505","type":"journal-article","created":{"date-parts":[[2021,12,6]],"date-time":"2021-12-06T21:29:34Z","timestamp":1638826174000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["GPU Domain Specialization via Composable On-Package Architecture"],"prefix":"10.1145","volume":"19","author":[{"given":"Yaosheng","family":"Fu","sequence":"first","affiliation":[{"name":"NVIDIA, Santa Clara, CA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Evgeny","family":"Bolotin","sequence":"additional","affiliation":[{"name":"NVIDIA, Santa Clara, CA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Niladrish","family":"Chatterjee","sequence":"additional","affiliation":[{"name":"NVIDIA, Santa Clara, CA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"David","family":"Nellans","sequence":"additional","affiliation":[{"name":"NVIDIA, Santa Clara, CA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Stephen W.","family":"Keckler","sequence":"additional","affiliation":[{"name":"NVIDIA, Santa Clara, CA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2021,12,6]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"crossref","unstructured":"Mark James Abraham Teemu Murtola Roland Schulz Szil\u00e1rd P\u00e1ll Jeremy C. 
Smith Berk Hess and Erik Lindahl. 2015. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1\u20132 (2015) 19\u201325.","DOI":"10.1016\/j.softx.2015.06.001"},{"key":"e_1_3_1_3_2","doi-asserted-by":"crossref","unstructured":"Sjors H.W. Scheres. 2012. RELION: Implementation of a Bayesian approach to cryo-EM structure determination. Journal of Structural Biology 180 3 (2012) 519\u2013530.","DOI":"10.1016\/j.jsb.2012.09.006"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.crme.2010.11.007"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1002\/wcms.1121"},{"key":"e_1_3_1_6_2","first-page":"172","volume-title":"International Symposium on Performance Analysis of Systems and Software (ISPASS\u201916)","author":"Alsop Johnathan","year":"2016","unstructured":"Johnathan Alsop, Matthew D. Sinclair, Rakesh Komuravelli, and Sarita V. Adve. 2016. GSI: A GPU stall inspector to characterize the sources of memory stalls for tightly coupled GPUs. In International Symposium on Performance Analysis of Systems and Software (ISPASS\u201916). 172\u2013182."},{"key":"e_1_3_1_7_2","unstructured":"AMD. 2020. AMD INSTINCT MI100 ACCELERATOR. https:\/\/www.amd.com\/system\/files\/documents\/instinct-mi100-brochure.pdf."},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.5555\/3045390.3045410"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3140659.3080231"},{"key":"e_1_3_1_10_2","volume-title":"International Symposium on High-Performance Computer Architecture (HPCA\u201919)","author":"Arunkumar Akhil","year":"2019","unstructured":"Akhil Arunkumar, Evgeny Bolotin, David Nellans, and Carole-Jean Wu. 2019. Understanding the future of energy efficiency in multi-module GPUs. 
In International Symposium on High-Performance Computer Architecture (HPCA\u201919)."},{"key":"e_1_3_1_11_2","volume-title":"International Solid-State Circuits Conference (ISSCC\u201920)","year":"2020","unstructured":"Christopher Berry, Brian Bell, Adam Jatkowski, Jesse Surprise, John Isakson, Ofer Geva, Brian Deskin, Mark Cichanowski, Dina Hamid, Chris Cavitt, Gregory Fredeman, Anthony Saporito, Ashutosh Mishra, Alper Buyuktosunoglu, Tobias Webel, Preetham Lobo, Pradeep Parashurama, Ramon Bertran, Dureseti Chidambarrao, David Wolpert, and Brandon Bruen. 2020. IBM z15: A 12-Core 5.2GHz Microprocessor. In International Solid-State Circuits Conference (ISSCC\u201920)."},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2015.72"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2018.2873584"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC19947.2020.9062967"},{"key":"e_1_3_1_15_2","volume-title":"Electronic Components and Technology Conference (ECTC\u201919)","year":"2019","unstructured":"Ming-Fa Chen, Fang-Cheng Chen, Wen-Chih Chiou, and Doug C. H. Yu. 2019. System on Integrated Chips (SoIC) for 3D heterogeneous integration. In Electronic Components and Technology Conference (ECTC\u201919)."},{"key":"e_1_3_1_16_2","volume-title":"International Symposium on Microarchitecture (MICRO\u201914)","year":"2014","unstructured":"Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A machine-learning supercomputer. In International Symposium on Microarchitecture (MICRO\u201914)."},{"key":"e_1_3_1_17_2","first-page":"493","volume-title":"International Symposium on High-Performance Computer Architecture (HPCA\u201921)","author":"Choi Yujeong","year":"2021","unstructured":"Yujeong Choi, Yunseong Kim, and Minsoo Rhu. 2021. Lazy Batching: An SLA-aware batching system for cloud machine learning inference. 
In International Symposium on High-Performance Computer Architecture (HPCA\u201921). 493\u2013506."},{"key":"e_1_3_1_18_2","first-page":"220","volume-title":"International Symposium on High-Performance Computer Architecture (HPCA\u201920)","author":"Choi Yujeong","year":"2020","unstructured":"Yujeong Choi and Minsoo Rhu. 2020. PREMA: A predictive multi-task scheduling algorithm for preemptible neural processing units. In International Symposium on High-Performance Computer Architecture (HPCA\u201920). 220\u2013233."},{"key":"e_1_3_1_19_2","unstructured":"Data Center Knowledge. 2018. Google Brings Liquid Cooling to Data Centers to Cool Latest AI Chips. https:\/\/www.datacenterknowledge.com\/google-alphabet\/google-brings-liquid-cooling-data-centers-cool-latest-ai-chips."},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/MSSC.2019.2910619"},{"key":"e_1_3_1_21_2","volume-title":"arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In arXiv preprint arXiv:1810.04805."},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3190508.3190541"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC.2017.7870257"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2017.7989385"},{"key":"e_1_3_1_25_2","volume-title":"International Solid-State Circuits Conference (ISSCC\u201918)","author":"Guo Zheng","year":"2018","unstructured":"Zheng Guo, Daeyeon Kim, Satyanand Nalam, Jami Wiedemer, Xiaofei Wang, and Eric Karl. 2018. A 23.6-Mb\/mm2 SRAM in 10-nm FinFET technology with pulsed-pMOS TVC and stepped-WL for low-voltage applications. 
In International Solid-State Circuits Conference (ISSCC\u201918)."},{"key":"e_1_3_1_26_2","volume-title":"Microprocessor Report by the Linley Group","author":"Gwennap Linley","year":"2020","unstructured":"Linley Gwennap. 2020. Groq rocks neural networks. In Microprocessor Report by the Linley Group."},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3038912.3052569"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/TED.2017.2737644"},{"key":"e_1_3_1_30_2","volume-title":"Symposium on VLSI Technology","author":"Hu C. C.","year":"2019","unstructured":"C. C. Hu, M. F. Chen, W. C. Chiou, and Doug Yu. 2019. 3D Multi-chip integration with system on integrated chips (SoIC). In Symposium on VLSI Technology."},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.5555\/3157382.3157557"},{"key":"e_1_3_1_32_2","volume-title":"IEEE International Electron Devices Meeting (IEDM\u201919)","author":"Ingerly D.","year":"2019","unstructured":"D. Ingerly, K. Enamul, W. Gomes, D. Jones, K. Kolluru, A. Kandas, G.-S Kim, H. Ma, D. Pantuso, C. F. Petersburg, M. Phen-givoni, A. Pillai, A. Sairam, P. Shekhar, P. Sinha, P. Stover, A. Telang, Z. Zell, and R. Criss. 2019. Foveros: 3D Integration and the use of face-to-face chip stacking for logic devices. In IEEE International Electron Devices Meeting (IEDM\u201919)."},{"key":"e_1_3_1_33_2","unstructured":"Intel. 2015. 14 nm Technology Announcement. https:\/\/www.intel.com\/content\/dam\/www\/public\/us\/en\/documents\/presentation\/advancing-moores-law-in-2014-presentation.pdf."},{"key":"e_1_3_1_34_2","unstructured":"Intel. 2019. Lakefield: Hybrid cores in 3D package. 
https:\/\/www.hotchips.org\/hc31\/HC31_2.10_LKF_HC_2019_Final_v7.pdf."},{"key":"e_1_3_1_35_2","volume-title":"arXiv preprint arXiv:1912.03413","author":"Jia Zhe","year":"2019","unstructured":"Zhe Jia, Blake Tillman, Marco Maggioni, and Daniele Paolo Scarpazza. 2019. Dissecting the graphcore IPU architecture via microbenchmarking. In arXiv preprint arXiv:1912.03413."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC19947.2020.9062984"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00010"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3360307"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3140659.3080246"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830808"},{"key":"e_1_3_1_41_2","volume-title":"arXiv preprint arXiv:2001.08361","author":"Kaplan Jared","year":"2020","unstructured":"Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. In arXiv preprint arXiv:2001.08361."},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2011.89"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00085"},{"key":"e_1_3_1_44_2","volume-title":"HotChips-33","author":"Knowles Simon","year":"2021","unstructured":"Simon Knowles. 2021. Graphcore Colossus Mk2 IPU. In HotChips-33."},{"key":"e_1_3_1_45_2","volume-title":"arXiv:1605.04711","author":"Li Fengfu","year":"2016","unstructured":"Fengfu Li, Bo Zhang, and Bin Liu. 2016. Ternary weight networks. In arXiv:1605.04711."},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3307650.3322259"},{"key":"e_1_3_1_47_2","volume-title":"HotChips-31","author":"Lie Sean","year":"2019","unstructured":"Sean Lie. 2019. Wafer scale deep learning. 
In HotChips-31."},{"key":"e_1_3_1_48_2","volume-title":"HotChips-33","author":"Lie Sean","year":"2021","unstructured":"Sean Lie. 2021. The multi-million core, multi-wafer AI cluster. In HotChips-33."},{"key":"e_1_3_1_49_2","volume-title":"Symposium on VLSI Circuits","author":"Lin Mu-Shan","year":"2019","unstructured":"Mu-Shan Lin, Tze-Chiang Huang, Chien-Chun Tsai, King-Ho Tam, Cheng-Hsiang Hsieh, Tom Chen, Wen-Hung Huang, Jack Hu, Yu-Chi Chen, Sandeep Goel, Chin-Ming Fu, Stefan Rusu, Chao-Chieh Li, Sheng-Yao Yang, Mei Wong, Shu-Chun Yang, and Frank Lee. 2019. A 7nm 4GHz Arm-core-based CoWoS chiplet design for high performance computing. In Symposium on VLSI Circuits."},{"key":"e_1_3_1_50_2","unstructured":"LLNL. [n.d.]. Laghos. https:\/\/computing.llnl.gov\/projects\/co-design\/laghos."},{"key":"e_1_3_1_51_2","unstructured":"LLNL. 2014. The CORAL Benchmarks. https:\/\/asc.llnl.gov\/CORAL-benchmarks\/."},{"key":"e_1_3_1_52_2","unstructured":"LLNL. 2017. The CORAL2 Benchmarks. https:\/\/asc.llnl.gov\/coral-2-benchmarks."},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1145\/2818950.2818951"},{"key":"e_1_3_1_54_2","volume-title":"Machine Learning and Systems (MLSys\u201920)","author":"Mattson Peter","year":"2020","unstructured":"Peter Mattson, Christine Cheng, Gregory Diamos, Cody Coleman, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debo Dutta, Udit Gupta, Kim Hazelwood, Andy Hock, Xinyuan Huang, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost, Vijay Janapa Reddi, Taylor Robie, Tom St John, Carole-Jean Wu, Lingjie Xu, Cliff Young, and Matei Zaharia. 2020. MLPerf training benchmark. 
In Machine Learning and Systems (MLSys\u201920)."},{"key":"e_1_3_1_55_2","volume-title":"International Conference on Learning Representations (ICLR\u201918)","author":"Micikevicius Paulius","year":"2018","unstructured":"Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed precision training. In International Conference on Learning Representations (ICLR\u201918)."},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3124534"},{"key":"e_1_3_1_57_2","unstructured":"MLPerf. 2019. MLPerf Inference Results v0.5. https:\/\/mlperf.org\/inference-results\/."},{"key":"e_1_3_1_58_2","unstructured":"MLPerf. 2019. MLPerf Training Results v0.6. https:\/\/mlperf.org\/training-results-0-6\/."},{"issue":"8","key":"e_1_3_1_59_2","first-page":"114","article-title":"Cramming more components onto integrated circuits","volume":"38","author":"Moore Gordon E.","year":"1965","unstructured":"Gordon E. Moore. 1965. Cramming more components onto integrated circuits. Electronics 38, 8 (1965), 114\u2013117.","journal-title":"Electronics"},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC19947.2020.9063103"},{"key":"e_1_3_1_61_2","unstructured":"NASA. [n.d.]. FUN3D. https:\/\/fun3d.larc.nasa.gov\/."},{"key":"e_1_3_1_62_2","unstructured":"NVIDIA. 2012. NVIDIA Kepler GK110 Architecture. https:\/\/www.nvidia.com\/content\/PDF\/kepler\/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf."},{"key":"e_1_3_1_63_2","unstructured":"NVIDIA. 2016. NVIDIA NVLink. https:\/\/www.nvidia.com\/en-us\/data-center\/nvlink\/."},{"key":"e_1_3_1_64_2","unstructured":"NVIDIA. 2016. NVIDIA Tesla P100 Architecture. https:\/\/images.nvidia.com\/content\/pdf\/tesla\/whitepaper\/pascal-architecture-whitepaper.pdff."},{"key":"e_1_3_1_65_2","unstructured":"NVIDIA. 2017. NVIDIA Tesla V100 Architecture. 
http:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf."},{"key":"e_1_3_1_66_2","unstructured":"NVIDIA. 2019. NVIDIA Turing GPU Architecture. https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/design-visualization\/technologies\/turing-architecture\/NVIDIA-Turing-Architecture-Whitepaper.pdf."},{"key":"e_1_3_1_67_2","unstructured":"NVIDIA. 2020. NVIDIA A100 Tensor Core GPU Architecture. https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/Data-Center\/nvidia-ampere-architecture-whitepaper.pdf."},{"key":"e_1_3_1_68_2","unstructured":"Open Standards Harmonization Working Group. 2018. Open Specification for a Liquid Cooled Server Rack - Progress Update. https:\/\/datacenters.lbl.gov\/sites\/default\/files\/OpenSpecification.pdf."},{"key":"e_1_3_1_69_2","doi-asserted-by":"crossref","unstructured":"IEEE Micro 2019 39 5 Optimizing Multi-GPU parallelization strategies for deep learning training","DOI":"10.1109\/MM.2019.2935967"},{"key":"e_1_3_1_70_2","doi-asserted-by":"publisher","DOI":"10.1145\/3313231.3352380"},{"key":"e_1_3_1_71_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00045"},{"key":"e_1_3_1_72_2","volume-title":"International Symposium on High-Performance Computer Architecture (HPCA\u201920)","author":"Ren Xiaowei","year":"2020","unstructured":"Xiaowei Ren, Daniel Lustig, Evgeny Bolotin, Aamer Jaleel, Oreste Villa, and David Nellans. 2020. HMG: Extending cache coherence protocols across modern hierarchical multi-GPU systems. In International Symposium on High-Performance Computer Architecture (HPCA\u201920)."},{"key":"e_1_3_1_73_2","volume-title":"arXiv:1505.04597","author":"Ronneberger Olaf","year":"2015","unstructured":"Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: convolutional networks for biomedical image segmentation. 
In arXiv:1505.04597."},{"key":"e_1_3_1_74_2","doi-asserted-by":"publisher","DOI":"10.2352\/ISSN.2470-1173.2017.19.AVM-023"},{"key":"e_1_3_1_75_2","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358302"},{"key":"e_1_3_1_76_2","doi-asserted-by":"publisher","DOI":"10.5555\/2969033.2969049"},{"key":"e_1_3_1_77_2","volume-title":"IEEE Custom Integrated Circuits Conference (CICC\u201918)","year":"2018","unstructured":"Walker J. Turner, John W. Poulton, John M. Wilson, Xi Chen, Stephen G. Tell, Matthew Fojtik, Thomas H. Greer, Brian Zimmer, Sanquan Song, Nikola Nedovic, Sudhir S. Kudva, Sunil R. Sudhakaran, Rizwan Bashirullah, Wenxu Zhao, William J. Dally, and C. Thomas Gray. 2018. Ground-referenced signaling for intra-chip and short-reach chip-to-chip interconnects. In IEEE Custom Integrated Circuits Conference (CICC\u201918)."},{"key":"e_1_3_1_78_2","volume-title":"International Symposium on Circuits and Systems (ISCAS\u201917)","author":"Vashishtha V.","year":"2017","unstructured":"V. Vashishtha, M. Vangala, P. Sharma, and L. T. Clark. 2017. Robust 7-nm SRAM design on a predictive PDK. In International Symposium on Circuits and Systems (ISCAS\u201917)."},{"key":"e_1_3_1_79_2","volume-title":"International Symposium on High-Performance Computer Architecture (HPCA\u201921)","author":"Villa Oreste","year":"2021","unstructured":"Oreste Villa, Daniel Lustig, Zi Yan, Evgeny Bolotin, Yaosheng Fu, Niladrish Chatterjee, Nan Jiang, and David Nellans. 2021. Need for speed: Experiences building a trustworthy system-level GPU simulator. In International Symposium on High-Performance Computer Architecture (HPCA\u201921)."},{"key":"e_1_3_1_80_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC19947.2020.9062927"},{"key":"e_1_3_1_81_2","first-page":"479","volume-title":"International Symposium on High-Performance Computer Architecture (HPCA\u201921)","author":"Yeh Tsung Tai","year":"2021","unstructured":"Tsung Tai Yeh, Matthew D. Sinclair, Bradford M. 
Beckmann, and Timothy G. Rogers. 2021. Deadline-aware offloading for high-throughput accelerators. In International Symposium on High-Performance Computer Architecture (HPCA\u201921). 479\u2013492."},{"key":"e_1_3_1_82_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2018.00035"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3484505","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3484505","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:17:14Z","timestamp":1750191434000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3484505"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,12,6]]},"references-count":81,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,3,31]]}},"alternative-id":["10.1145\/3484505"],"URL":"https:\/\/doi.org\/10.1145\/3484505","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,12,6]]},"assertion":[{"value":"2021-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-12-06","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}