{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T00:17:59Z","timestamp":1777421879562,"version":"3.51.4"},"reference-count":144,"publisher":"Association for Computing Machinery (ACM)","issue":"9","license":[{"start":{"date-parts":[[2025,4,4]],"date-time":"2025-04-04T00:00:00Z","timestamp":1743724800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Guangdong Major Project of Basic and Applied Basic Research","award":["2019B030302002"],"award-info":[{"award-number":["2019B030302002"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62402198"],"award-info":[{"award-number":["62402198"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Guangxi Key Research and Development Project","award":["2024AB02018"],"award-info":[{"award-number":["2024AB02018"]}]},{"name":"Guangzhou Development Zone Science and Technology Project","award":["2023GH02"],"award-info":[{"award-number":["2023GH02"]}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"crossref","award":["21624348"],"award-info":[{"award-number":["21624348"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Major Key Project of PCL","award":["PCL2023A09"],"award-info":[{"award-number":["PCL2023A09"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2025,9,30]]},"abstract":"<jats:p>The effective and efficient utilization of AI accelerators represents a critical issue for the practitioners engaged in the field of deep learning. 
Practical evidence from companies such as Alibaba, SenseTime, and Microsoft reveals that the utilization of production GPU clusters in the industry is generally between 25% and 50%. This indicates a significant opportunity for improvement. To this end, AI accelerator resource sharing has emerged as a promising approach to the performance optimization of multi-tenant clusters. This survey covers this line of studies from 2016 to 2024, focusing primarily on system efficiency while also including discussion on fairness, interference, and security in AI accelerator sharing. We revisit the fundamentals and key concepts, followed by a comprehensive review of recent advances in the field. We find that over 70% of the studies focus on efficiency improvement. We also observe that approximately half of the reviewed studies have made their source code publicly available, and that more than two-thirds of the studies used a physical machine for experimentation. Finally, based on the limitations of existing research, we outline several directions for future research concerning the integration of sharing with large language models (LLMs), coordination between schedulers and application-layer metrics, and collaboration among heterogeneous accelerators.<\/jats:p>","DOI":"10.1145\/3721427","type":"journal-article","created":{"date-parts":[[2025,3,3]],"date-time":"2025-03-03T11:54:11Z","timestamp":1741002851000},"page":"1-35","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["On Efficiency, Fairness and Security in AI Accelerator Resource Sharing: A Survey"],"prefix":"10.1145","volume":"57","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-7801-9831","authenticated-orcid":false,"given":"Jiahua","family":"Huang","sequence":"first","affiliation":[{"name":"South China University of Technology, China and Pengcheng Laboratory, Guangzhou, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6876-1795","authenticated-orcid":false,"given":"Weiwei","family":"Lin","sequence":"additional","affiliation":[{"name":"South China University of Technology, China and Pengcheng Laboratory, Guangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5851-327X","authenticated-orcid":false,"given":"Wentai","family":"Wu","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Jinan University, Guangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9438-6060","authenticated-orcid":false,"given":"Yang","family":"Wang","sequence":"additional","affiliation":[{"name":"Shenzhen Institutes of Advanced Technology Chinese Academy of Sciences, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-4975-668X","authenticated-orcid":false,"given":"Haocheng","family":"Zhong","sequence":"additional","affiliation":[{"name":"South China University of Technology, Guangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7346-0823","authenticated-orcid":false,"given":"Xinhua","family":"Wang","sequence":"additional","affiliation":[{"name":"South China University of Technology, Guangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5224-4048","authenticated-orcid":false,"given":"Keqin","family":"Li","sequence":"additional","affiliation":[{"name":"Department of Computer Science, State University of New York, New Paltz, United States"}]}],"member":"320","published-online":{"date-parts":[[2025,4,4]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"2024. Apple\u2019s Neural Engine. Retrieved from https:\/\/www.apple.com\/newsroom\/2024\/05\/apple-introduces-m4-chip\/"},{"key":"e_1_3_1_3_2","unstructured":"2024. Google Edge TPU. Retrieved from https:\/\/cloud.google.com\/edge-tpu?hl=zh-cn"},{"key":"e_1_3_1_4_2","unstructured":"2024. Huawei\u2019s Ascend. Retrieved from https:\/\/e.huawei.com\/en\/products\/computing\/ascend"},{"key":"e_1_3_1_5_2","unstructured":"2024. 
NVIDIA Management Library. Retrieved from https:\/\/developer.nvidia.com\/management-library-nvml"},{"key":"e_1_3_1_6_2","unstructured":"2024. NVIDIA Multi-instance GPU. Retrieved from https:\/\/www.nvidia.com\/en-us\/technologies\/multi-instance-gpu\/"},{"key":"e_1_3_1_7_2","unstructured":"2024. NVIDIA Multi-process Service. Retrieved from https:\/\/docs.nvidia.com\/deploy\/mps\/"},{"issue":"3","key":"e_1_3_1_8_2","first-page":"18","article-title":"GPU fast convolution via the overlap-and-save method in shared memory","volume":"17","author":"Ad\u00e1mek Karel","year":"2020","unstructured":"Karel Ad\u00e1mek, Sofia Dimoudi, Mike Giles, and Wesley Armour. 2020. GPU fast convolution via the overlap-and-save method in shared memory. ACM Trans. Archit. Code Optim. 17, 3, Article 18 (Aug. 2020), 20 pages.","journal-title":"ACM Trans. Archit. Code Optim."},{"key":"e_1_3_1_9_2","doi-asserted-by":"crossref","unstructured":"Hyunho Ahn, Munkyu Lee, Sihoon Seong, Gap-Joo Na, In-Geol Chun, Blesson Varghese, and Cheol-Ho Hong. 2024. ScissionLite: Accelerating distributed deep learning with lightweight data compression for IIoT. IEEE Transactions on Industrial Informatics 20, 10 (2024), 11950\u201311960.","DOI":"10.1109\/TII.2024.3413340"},{"key":"e_1_3_1_10_2","first-page":"503","volume-title":"23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201918)","author":"Ausavarungnirun Rachata","year":"2018","unstructured":"Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Christopher J. Rossbach, and Onur Mutlu. 2018. MASK: Redesigning the GPU memory hierarchy to support multi-application concurrency. In 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201918). 
Association for Computing Machinery, New York, NY, USA, 503\u2013518."},{"key":"e_1_3_1_11_2","first-page":"76","volume-title":"IEEE 40th International Conference on Computer Design (ICCD\u201922)","author":"Barrera Javier","year":"2022","unstructured":"Javier Barrera, Leonidas Kosmidis, Hamid Tabani, Jaume Abella, and Francisco J. Cazorla. 2022. Contention tracking in GPU last-level cache. In IEEE 40th International Conference on Computer Design (ICCD\u201922). 76\u201379."},{"key":"e_1_3_1_12_2","volume-title":"15th European Conference on Computer Systems (EuroSys\u201920)","author":"Chaudhary Shubham","year":"2020","unstructured":"Shubham Chaudhary, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, and Srinidhi Viswanatha. 2020. Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning. In 15th European Conference on Computer Systems (EuroSys\u201920). Association for Computing Machinery, New York, NY, USA, Article 1, 16 pages. DOI: 10.1145\/3342195.3387555"},{"issue":"1","key":"e_1_3_1_13_2","doi-asserted-by":"crossref","first-page":"854","DOI":"10.1109\/TCC.2021.3119205","article-title":"Gemini: Enabling multi-tenant GPU sharing based on kernel burst estimation","volume":"11","author":"Chen Hung-Hsin","year":"2023","unstructured":"Hung-Hsin Chen, En-Te Lin, Yu-Min Chou, and Jerry Chou. 2023. Gemini: Enabling multi-tenant GPU sharing based on kernel burst estimation. IEEE Trans. Cloud Comput. 11, 1 (2023), 854\u2013867.","journal-title":"IEEE Trans. Cloud Comput."},{"issue":"1","key":"e_1_3_1_14_2","doi-asserted-by":"crossref","first-page":"17","DOI":"10.1145\/3093337.3037700","article-title":"Prophet: Precise QoS prediction on non-preemptive accelerators to improve utilization in warehouse-scale computers","volume":"45","author":"Chen Quan","year":"2017","unstructured":"Quan Chen, Hailong Yang, Minyi Guo, Ram Srivatsa Kannan, Jason Mars, and Lingjia Tang. 2017. 
Prophet: Precise QoS prediction on non-preemptive accelerators to improve utilization in warehouse-scale computers. SIGARCH Comput. Archit. News 45, 1 (Apr. 2017), 17\u201332.","journal-title":"SIGARCH Comput. Archit. News"},{"key":"e_1_3_1_15_2","first-page":"681","volume-title":"21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201916)","author":"Chen Quan","year":"2016","unstructured":"Quan Chen, Hailong Yang, Jason Mars, and Lingjia Tang. 2016. Baymax: QoS awareness and increased utilization for non-preemptive accelerators in warehouse scale computers. In 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201916). Association for Computing Machinery, New York, NY, USA, 681\u2013696."},{"issue":"1","key":"e_1_3_1_16_2","doi-asserted-by":"crossref","first-page":"269","DOI":"10.1145\/2654822.2541967","article-title":"DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning","volume":"42","author":"Chen Tianshi","year":"2014","unstructured":"Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGARCH Comput. Archit. News 42, 1 (2014), 269\u2013284.","journal-title":"ACM SIGARCH Comput. Archit. News"},{"key":"e_1_3_1_17_2","volume-title":"International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201923)","author":"Chen Wenyan","year":"2023","unstructured":"Wenyan Chen, Zizhao Mo, Huanle Xu, Kejiang Ye, and Chengzhong Xu. 2023. Interference-aware multiplexing for deep learning in GPU clusters: A middleware approach. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201923). 
Association for Computing Machinery, New York, NY, USA, Article 30, 15 pages."},{"key":"e_1_3_1_18_2","first-page":"609","volume-title":"47th Annual IEEE\/ACM International Symposium on Microarchitecture","author":"Chen Yunji","year":"2014","unstructured":"Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun et\u00a0al. 2014. DaDianNao: A machine-learning supercomputer. In 47th Annual IEEE\/ACM International Symposium on Microarchitecture. IEEE, 609\u2013622."},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.40"},{"issue":"2","key":"e_1_3_1_20_2","doi-asserted-by":"crossref","first-page":"292","DOI":"10.1109\/JETCAS.2019.2910232","article-title":"Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices","volume":"9","author":"Chen Yu-Hsin","year":"2019","unstructured":"Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, and Vivienne Sze. 2019. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE J. Emerg. Select. Topics Circ. Syst. 9, 2 (2019), 292\u2013308.","journal-title":"IEEE J. Emerg. Select. Topics Circ. Syst."},{"issue":"9","key":"e_1_3_1_21_2","doi-asserted-by":"crossref","first-page":"2553","DOI":"10.1109\/TPDS.2023.3293835","article-title":"DeepBoot: Dynamic scheduling system for training and inference deep learning tasks in GPU cluster","volume":"34","author":"Chen Zhenqian","year":"2023","unstructured":"Zhenqian Chen, Xinkui Zhao, Chen Zhi, and Jianwei Yin. 2023. DeepBoot: Dynamic scheduling system for training and inference deep learning tasks in GPU cluster. IEEE Trans. Parallel Distrib. Syst. 34, 9 (2023), 2553\u20132567.","journal-title":"IEEE Trans. Parallel Distrib. 
Syst."},{"issue":"12","key":"e_1_3_1_22_2","doi-asserted-by":"crossref","first-page":"3383","DOI":"10.1109\/TC.2023.3299030","article-title":"Enabling fine-grained spatial multitasking on systolic-array NPUs using dataflow mirroring","volume":"72","author":"Choi Jinwoo","year":"2023","unstructured":"Jinwoo Choi, Yeonan Ha, Jounghoo Lee, Sangsu Lee, Jinho Lee, Hanhwi Jang, and Youngsok Kim. 2023. Enabling fine-grained spatial multitasking on systolic-array NPUs using dataflow mirroring. IEEE Trans. Comput. 72, 12 (2023), 3383\u20133398.","journal-title":"IEEE Trans. Comput."},{"key":"e_1_3_1_23_2","first-page":"199","volume-title":"USENIX Annual Technical Conference (USENIX ATC\u201922)","author":"Choi Seungbeom","year":"2022","unstructured":"Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, and Jaehyuk Huh. 2022. Serving heterogeneous machine learning models on multi-GPU servers with spatio-temporal sharing. In USENIX Annual Technical Conference (USENIX ATC\u201922). USENIX Association, 199\u2013216."},{"key":"e_1_3_1_24_2","first-page":"220","volume-title":"IEEE International Symposium on High Performance Computer Architecture (HPCA\u201920)","author":"Choi Yujeong","year":"2020","unstructured":"Yujeong Choi and Minsoo Rhu. 2020. PREMA: A predictive multi-task scheduling algorithm for preemptible neural processing units. In IEEE International Symposium on High Performance Computer Architecture (HPCA\u201920). 220\u2013233."},{"key":"e_1_3_1_25_2","first-page":"624","volume-title":"IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201923)","author":"Chow Marcus","year":"2023","unstructured":"Marcus Chow, Ali Jahanshahi, and Daniel Wong. 2023. KRISP: Enabling kernel-wise right-sizing for spatial partitioned GPU inference servers. In IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201923). 
624\u2013637."},{"issue":"2","key":"e_1_3_1_26_2","doi-asserted-by":"crossref","first-page":"163","DOI":"10.1109\/TC.2020.2981068","article-title":"PermCNN: Energy-efficient convolutional neural network hardware architecture with permuted diagonal structure","volume":"70","author":"Deng Chunhua","year":"2021","unstructured":"Chunhua Deng, Siyu Liao, and Bo Yuan. 2021. PermCNN: Energy-efficient convolutional neural network hardware architecture with permuted diagonal structure. IEEE Trans. Comput. 70, 2 (2021), 163\u2013173.","journal-title":"IEEE Trans. Comput."},{"key":"e_1_3_1_27_2","doi-asserted-by":"crossref","first-page":"492","DOI":"10.1145\/3419111.3421284","volume-title":"11th ACM Symposium on Cloud Computing (SoCC\u201920)","author":"Dhakal Aditya","year":"2020","unstructured":"Aditya Dhakal, Sameer G. Kulkarni, and K. K. Ramakrishnan. 2020. GSLICE: Controlled spatial sharing of GPUs for a scalable inference platform. In 11th ACM Symposium on Cloud Computing (SoCC\u201920). Association for Computing Machinery, New York, NY, USA, 492\u2013506."},{"key":"e_1_3_1_28_2","doi-asserted-by":"crossref","first-page":"658","DOI":"10.1145\/3351095.3372847","volume-title":"Conference on Fairness, Accountability, and Transparency (FAT*\u201920)","author":"Donahue Kate","year":"2020","unstructured":"Kate Donahue and Jon Kleinberg. 2020. Fairness and utilization in allocating resources with uncertain demand. In Conference on Fairness, Accountability, and Transparency (FAT*\u201920). Association for Computing Machinery, New York, NY, USA, 658\u2013668."},{"key":"e_1_3_1_29_2","first-page":"146","volume-title":"IEEE 40th International Conference on Computer Design (ICCD\u201922)","author":"Du Yajuan","year":"2022","unstructured":"Yajuan Du, Mingyang Liu, Yuqi Yang, Mingzhe Zhang, and Xulong Tang. 2022. Enhancing GPU performance via neighboring directory table based inter-TLB sharing. In IEEE 40th International Conference on Computer Design (ICCD\u201922). 
146\u2013153."},{"key":"e_1_3_1_30_2","first-page":"1121","volume-title":"21st USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201924)","author":"Duan Jiangfei","year":"2024","unstructured":"Jiangfei Duan, Ziang Song, Xupeng Miao, Xiaoli Xi, Dahua Lin, Harry Xu, Minjia Zhang, and Zhihao Jia. 2024. Parcae: Proactive, liveput-optimized DNN training on preemptible instances. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201924). USENIX Association, 1121\u20131139."},{"key":"e_1_3_1_31_2","first-page":"263","volume-title":"IEEE 38th International Conference on Computer Design (ICCD\u201920)","author":"Eshratifar Amir Erfan","year":"2020","unstructured":"Amir Erfan Eshratifar and Massoud Pedram. 2020. Runtime deep model multiplexing for reduced latency and energy consumption inference. In IEEE 38th International Conference on Computer Design (ICCD\u201920). 263\u2013270."},{"key":"e_1_3_1_32_2","first-page":"353","volume-title":"56th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201923)","author":"Fan Hongxiang","year":"2023","unstructured":"Hongxiang Fan, Stylianos I. Venieris, Alexandros Kouris, and Nicholas Lane. 2023. Sparse-DySta: Sparsity-aware dynamic and static scheduling for sparse multi-DNN workloads. In 56th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201923). Association for Computing Machinery, New York, NY, USA, 353\u2013366."},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS53621.2022.00077"},{"key":"e_1_3_1_34_2","first-page":"1212","volume-title":"Design, Automation & Test in Europe Conference & Exhibition (DATE\u201921)","author":"Ganguly Debashis","year":"2021","unstructured":"Debashis Ganguly, Rami Melhem, and Jun Yang. 2021. An adaptive framework for oversubscription management in CPU-GPU unified memory. In Design, Automation & Test in Europe Conference & Exhibition (DATE\u201921). 
1212\u20131217."},{"key":"e_1_3_1_35_2","first-page":"1","volume-title":"Design, Automation & Test in Europe Conference & Exhibition (DATE\u201923)","author":"Gao Chengsi","year":"2023","unstructured":"Chengsi Gao, Ying Wang, Cheng Liu, Mengdi Wang, Weiwei Chen, Yinhe Han, and Lei Zhang. 2023. Layer-Puzzle: Allocating and scheduling multi-task on multi-core NPUs by using layer heterogeneity. In Design, Automation & Test in Europe Conference & Exhibition (DATE\u201923). 1\u20136."},{"key":"e_1_3_1_36_2","first-page":"15","volume-title":"Asia Conference on Computer and Communications Security (ASIACCS\u201918)","author":"Giechaskiel Ilias","year":"2018","unstructured":"Ilias Giechaskiel, Kasper B. Rasmussen, and Ken Eguro. 2018. Leaky wires: Information leakage and covert communication between FPGA long wires. In Asia Conference on Computer and Communications Security (ASIACCS\u201918). Association for Computing Machinery, New York, NY, USA, 15\u201327. DOI: 10.1145\/3196494.3196518"},{"key":"e_1_3_1_37_2","doi-asserted-by":"crossref","first-page":"151","DOI":"10.1145\/3352460.3358291","volume-title":"52nd Annual IEEE\/ACM International Symposium on Microarchitecture","author":"Gondimalla Ashish","year":"2019","unstructured":"Ashish Gondimalla, Noah Chesnut, Mithuna Thottethodi, and T. N. Vijaykumar. 2019. SparTen: A sparse tensor accelerator for convolutional neural networks. In 52nd Annual IEEE\/ACM International Symposium on Microarchitecture. 151\u2013165."},{"key":"e_1_3_1_38_2","first-page":"635","volume-title":"52nd International Conference on Parallel Processing (ICPP\u201923)","author":"Gu Jianfeng","year":"2023","unstructured":"Jianfeng Gu, Yichao Zhu, Puxuan Wang, Mohak Chadha, and Michael Gerndt. 2023. FaST-GShare: Enabling efficient spatio-temporal GPU sharing in serverless computing for deep learning inference. In 52nd International Conference on Parallel Processing (ICPP\u201923). 
Association for Computing Machinery, New York, NY, USA, 635\u2013644."},{"key":"e_1_3_1_39_2","first-page":"539","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201922)","author":"Han Mingcong","year":"2022","unstructured":"Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022. Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201922). USENIX Association, 539\u2013558."},{"issue":"3","key":"e_1_3_1_40_2","doi-asserted-by":"crossref","first-page":"243","DOI":"10.1145\/3007787.3001163","article-title":"EIE: Efficient inference engine on compressed deep neural network","volume":"44","author":"Han Song","year":"2016","unstructured":"Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient inference engine on compressed deep neural network. ACM SIGARCH Comput. Archit. News 44, 3 (2016), 243\u2013254.","journal-title":"ACM SIGARCH Comput. Archit. News"},{"issue":"3","key":"e_1_3_1_41_2","first-page":"35","article-title":"GPU virtualization and scheduling methods: A comprehensive survey","volume":"50","author":"Hong Cheol-Ho","year":"2017","unstructured":"Cheol-Ho Hong, Ivor Spence, and Dimitrios S. Nikolopoulos. 2017. GPU virtualization and scheduling methods: A comprehensive survey. ACM Comput. Surv. 50, 3, Article 35 (June 2017), 37 pages.","journal-title":"ACM Comput. Surv."},{"issue":"4","key":"e_1_3_1_42_2","first-page":"60","article-title":"A closer look at GPGPU","volume":"48","author":"Hu Liang","year":"2016","unstructured":"Liang Hu, Xilong Che, and Si-Qing Zheng. 2016. A closer look at GPGPU. ACM Comput. Surv. 48, 4, Article 60 (Mar. 2016), 20 pages.","journal-title":"ACM Comput. 
Surv."},{"key":"e_1_3_1_43_2","first-page":"1","volume-title":"International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201921)","author":"Hu Qinghao","year":"2021","unstructured":"Qinghao Hu, Peng Sun, Shengen Yan, Yonggang Wen, and Tianwei Zhang. 2021. Characterization and prediction of deep learning workloads in large-scale GPU datacenters. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201921). 1\u201315."},{"key":"e_1_3_1_44_2","first-page":"457","volume-title":"28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS\u201923)","author":"Hu Qinghao","year":"2023","unstructured":"Qinghao Hu, Meng Zhang, Peng Sun, Yonggang Wen, and Tianwei Zhang. 2023. Lucid: A non-intrusive, scalable and interpretable scheduler for deep learning training jobs. In 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS\u201923). Association for Computing Machinery, New York, NY, USA, 457\u2013472."},{"key":"e_1_3_1_45_2","doi-asserted-by":"crossref","first-page":"705","DOI":"10.1145\/3605573.3605593","volume-title":"52nd International Conference on Parallel Processing (ICPP\u201923)","author":"Huang Weiming","year":"2023","unstructured":"Weiming Huang, Yajuan Du, and Mingyang Liu. 2023. GPU performance acceleration via intra-group sharing TLB. In 52nd International Conference on Parallel Processing (ICPP\u201923). Association for Computing Machinery, New York, NY, USA, 705\u2013714."},{"key":"e_1_3_1_46_2","article-title":"GPipe: Efficient training of giant neural networks using pipeline parallelism","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu et\u00a0al. 2019. 
GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advan. Neural Inf. Process. Syst., Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d\u2019Alche-Buc, Emily B. Fox, and Roman Garnett (Eds.). Vol. 32. NeurIPS, 103\u2013112.","journal-title":"Advan. Neural Inf. Process. Syst."},{"key":"e_1_3_1_47_2","article-title":"Dynamic space-time scheduling for GPU inference","author":"Jain Paras","year":"2018","unstructured":"Paras Jain, Xiangxi Mo, Ajay Jain, Harikaran Subbaraj, Rehan Sohail Durrani, Alexey Tumanov, Joseph Gonzalez, and Ion Stoica. 2018. Dynamic space-time scheduling for GPU inference. arXiv preprint arXiv:1901.00041 (2018).","journal-title":"arXiv preprint arXiv:1901.00041"},{"key":"e_1_3_1_48_2","first-page":"947","volume-title":"USENIX Annual Technical Conference (USENIX ATC\u201919)","author":"Jeon Myeongjae","year":"2019","unstructured":"Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2019. Analysis of large-scale multi-tenant GPU clusters for DNN training workloads. In USENIX Annual Technical Conference (USENIX ATC\u201919). 947\u2013960."},{"key":"e_1_3_1_49_2","first-page":"89","volume-title":"European Conference on Parallel Processing","author":"Ji Zhuoran","year":"2021","unstructured":"Zhuoran Ji and Cho-Li Wang. 2021. Collaborative GPU preemption via spatial multitasking for efficient GPU sharing. In European Conference on Parallel Processing. Springer, 89\u2013104."},{"key":"e_1_3_1_50_2","first-page":"463","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201920)","author":"Jiang Yimin","year":"2020","unstructured":"Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. 2020. A unified architecture for accelerating distributed DNN training in heterogeneous GPU\/CPU clusters. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201920). 
USENIX Association, 463\u2013479."},{"key":"e_1_3_1_51_2","first-page":"745","volume-title":"21st USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201924)","author":"Jiang Ziheng","year":"2024","unstructured":"Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong et al. 2024. MegaScale: Scaling large language model training to more than 10,000 GPUs. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201924). USENIX Association, 745\u2013760. Retrieved from https:\/\/www.usenix.org\/conference\/nsdi24\/presentation\/jiang-ziheng"},{"key":"e_1_3_1_52_2","volume-title":"50th Annual International Symposium on Computer Architecture (ISCA\u201923)","author":"Jouppi Norm","year":"2023","unstructured":"Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles et al. 2023. TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In 50th Annual International Symposium on Computer Architecture (ISCA\u201923). Association for Computing Machinery, New York, NY, USA, Article 82, 14 pages."},{"issue":"2","key":"e_1_3_1_53_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3140659.3080246","article-title":"In-datacenter performance analysis of a tensor processing unit","volume":"45","author":"Jouppi Norman P.","year":"2017","unstructured":"Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers et al. 2017. In-datacenter performance analysis of a tensor processing unit. SIGARCH Comput. Archit. News 45, 2 (June 2017), 1\u201312.","journal-title":"SIGARCH Comput. Archit. News"},{"key":"e_1_3_1_54_2","first-page":"72","article-title":"A survey on techniques for cooperative CPU-GPU computing","volume":"19","author":"K. 
Raju","year":"2018","unstructured":"Raju K. and Niranjan N. Chiplunkar. 2018. A survey on techniques for cooperative CPU-GPU computing. Sustain. Comput.: Inform. Syst. 19 (2018), 72\u201385.","journal-title":"Sustain. Comput.: Inform. Syst."},{"key":"e_1_3_1_55_2","first-page":"1","volume-title":"57th ACM\/IEEE Design Automation Conference (DAC\u201920)","author":"Kang Donghyun","year":"2020","unstructured":"Donghyun Kang and Soonhoi Ha. 2020. Tensor virtualization technique to support efficient data reorganization for CNN accelerators. In 57th ACM\/IEEE Design Automation Conference (DAC\u201920). 1\u20136."},{"key":"e_1_3_1_56_2","first-page":"814","volume-title":"IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201922)","author":"Kao Sheng-Chun","year":"2022","unstructured":"Sheng-Chun Kao and Tushar Krishna. 2022. MAGMA: An optimization framework for mapping multiple DNNs on multiple accelerator cores. In IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201922). 814\u2013830."},{"issue":"7","key":"e_1_3_1_57_2","doi-asserted-by":"crossref","first-page":"2107","DOI":"10.1109\/TPDS.2023.3274957","article-title":"Multi-tier GPU virtualization for deep learning in cloud-edge systems","volume":"34","author":"Kennedy Jason","year":"2023","unstructured":"Jason Kennedy, Vishal Sharma, Blesson Varghese, and Carlos Rea\u00f1o. 2023. Multi-tier GPU virtualization for deep learning in cloud-edge systems. IEEE Trans. Parallel Distrib. Syst. 34, 7 (2023), 2107\u20132123.","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"e_1_3_1_58_2","first-page":"828","volume-title":"IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201923)","author":"Kim Seah","year":"2023","unstructured":"Seah Kim, Hasan Genc, Vadim Vadimovich Nikiforov, Krste Asanovi\u0107, Borivoje Nikoli\u0107, and Yakun Sophia Shao. 2023. MoCA: Memory-centric, adaptive execution for multi-tenant deep neural networks. 
In IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201923). 828\u2013841."},{"key":"e_1_3_1_59_2","first-page":"62","volume-title":"56th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201923)","author":"Kim Seah","year":"2023","unstructured":"Seah Kim, Jerry Zhao, Krste Asanovic, Borivoje Nikolic, and Yakun Sophia Shao. 2023. AuRORA: Virtualized accelerator orchestration for multi-tenant workloads. In 56th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201923). Association for Computing Machinery, New York, NY, USA, 62\u201376."},{"key":"e_1_3_1_60_2","first-page":"1082","volume-title":"53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201920)","author":"Kim Young Geun","year":"2020","unstructured":"Young Geun Kim and Carole-Jean Wu. 2020. AutoScale: Energy efficiency optimization for stochastic edge inference using reinforcement learning. In 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201920). 1082\u20131096."},{"key":"e_1_3_1_61_2","first-page":"1","volume-title":"International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201924)","author":"Lee Munkyu","year":"2024","unstructured":"Munkyu Lee, Sihoon Seong, Minki Kang, Jihyuk Lee, Gap-Joo Na, In-Geol Chun, Dimitrios Nikolopoulos, and Cheol-Ho Hong. 2024. ParvaGPU: Efficient spatial GPU sharing for large-scale DNN inference in cloud environments. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201924). 1\u201314. DOI: 10.1109\/SC41406.2024.00048"},{"key":"e_1_3_1_62_2","first-page":"66","volume-title":"IEEE International Conference on Cloud Engineering (IC2E\u201920)","author":"LeMay Matthew","year":"2020","unstructured":"Matthew LeMay, Shijian Li, and Tian Guo. 2020. PERSEUS: Characterizing performance and cost of multi-tenant serving for CNN models. 
In IEEE International Conference on Cloud Engineering (IC2E\u201920). 66\u201372."},{"key":"e_1_3_1_63_2","first-page":"173","volume-title":"13th Symposium on Cloud Computing","author":"Li Baolin","year":"2022","unstructured":"Baolin Li, Tirthak Patel, Siddharth Samsi, Vijay Gadepally, and Devesh Tiwari. 2022. MISO: Exploiting multi-instance GPU capability on multi-tenant GPU clusters. In 13th Symposium on Cloud Computing. 173\u2013189."},{"key":"e_1_3_1_64_2","first-page":"49","volume-title":"24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201919)","author":"Li Chen","year":"2019","unstructured":"Chen Li, Rachata Ausavarungnirun, Christopher J. Rossbach, Youtao Zhang, Onur Mutlu, Yang Guo, and Jun Yang. 2019. A framework for memory oversubscription management in graphics processing units. In 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201919). Association for Computing Machinery, New York, NY, USA, 49\u201363."},{"key":"e_1_3_1_65_2","article-title":"PyTorch distributed: Experiences on accelerating data parallel training","author":"Li Shen","year":"2020","unstructured":"Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania et\u00a0al. 2020. PyTorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704 (2020).","journal-title":"arXiv preprint arXiv:2006.15704"},{"key":"e_1_3_1_66_2","first-page":"231","volume-title":"28th International Symposium on High-Performance Parallel and Distributed Computing (HPDC\u201919)","author":"Li Yusen","year":"2019","unstructured":"Yusen Li, Chuxu Shan, Ruobing Chen, Xueyan Tang, Wentong Cai, Shanjiang Tang, Xiaoguang Liu, Gang Wang, Xiaoli Gong, and Ying Zhang. 2019. GAugur: Quantifying performance interference of colocated games for improving resource utilization in cloud gaming. 
In 28th International Symposium on High-Performance Parallel and Distributed Computing (HPDC\u201919). Association for Computing Machinery, New York, NY, USA, 231\u2013242."},{"key":"e_1_3_1_67_2","article-title":"Resource allocation and workload scheduling for large-scale distributed deep learning: A survey","author":"Liang Feng","year":"2024","unstructured":"Feng Liang, Zhen Zhang, Haifeng Lu, Chengming Li, Victor Leung, Yanyi Guo, and Xiping Hu. 2024. Resource allocation and workload scheduling for large-scale distributed deep learning: A survey. arXiv preprint arXiv:2406.08115 (2024).","journal-title":"arXiv preprint arXiv:2406.08115"},{"key":"e_1_3_1_68_2","first-page":"161","volume-title":"USENIX Annual Technical Conference (USENIX ATC\u201921)","author":"Lim Gangmuk","year":"2021","unstructured":"Gangmuk Lim, Jeongseob Ahn, Wencong Xiao, Youngjin Kwon, and Myeongjae Jeon. 2021. Zico: Efficient GPU memory sharing for concurrent DNN training. In USENIX Annual Technical Conference (USENIX ATC\u201921). USENIX Association, 161\u2013175."},{"key":"e_1_3_1_69_2","first-page":"1140","volume-title":"IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201923)","author":"Lin Shao-Fu","year":"2023","unstructured":"Shao-Fu Lin, Yi-Jung Chen, Hsiang-Yun Cheng, and Chia-Lin Yang. 2023. Tensor movement orchestration in multi-GPU training systems. In IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201923). 1140\u20131152."},{"key":"e_1_3_1_70_2","volume-title":"51st International Conference on Parallel Processing (ICPP\u201922)","author":"Liu Liu","year":"2023","unstructured":"Liu Liu, Jian Yu, and Zhijun Ding. 2023. Adaptive and efficient GPU time sharing for hyperparameter tuning in cloud. In 51st International Conference on Parallel Processing (ICPP\u201922). 
Association for Computing Machinery, New York, NY, USA, Article 5, 11 pages."},{"issue":"3","key":"e_1_3_1_71_2","doi-asserted-by":"crossref","first-page":"393","DOI":"10.1145\/3007787.3001179","article-title":"Cambricon: An instruction set architecture for neural networks","volume":"44","author":"Liu Shaoli","year":"2016","unstructured":"Shaoli Liu, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, Yunji Chen, and Tianshi Chen. 2016. Cambricon: An instruction set architecture for neural networks. ACM SIGARCH Comput. Archit. News 44, 3 (2016), 393\u2013405.","journal-title":"ACM SIGARCH Comput. Archit. News"},{"issue":"2","key":"e_1_3_1_72_2","doi-asserted-by":"crossref","first-page":"48","DOI":"10.1109\/MWC.006.2200407","article-title":"Machine learning for 6G enhanced ultra-reliable and low-latency services","volume":"30","author":"Liu Yan","year":"2023","unstructured":"Yan Liu, Yansha Deng, Arumugam Nallanathan, and Jinhong Yuan. 2023. Machine learning for 6G enhanced ultra-reliable and low-latency services. IEEE Wirel. Commun. 30, 2 (2023), 48\u201354.","journal-title":"IEEE Wirel. Commun."},{"key":"e_1_3_1_73_2","first-page":"1","volume-title":"IEEE Conference on Computer Communications (INFOCOM\u201923)","author":"Liu Yunzhuo","year":"2023","unstructured":"Yunzhuo Liu, Bo Jiang, Shizhen Zhao, Tao Lin, Xinbing Wang, and Chenghu Zhou. 2023. Libra: Contention-aware GPU thread allocation for data parallel training in high speed networks. In IEEE Conference on Computer Communications (INFOCOM\u201923). 1\u201310."},{"key":"e_1_3_1_74_2","first-page":"388","volume-title":"27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201922)","author":"Liu Zihan","year":"2022","unstructured":"Zihan Liu, Jingwen Leng, Zhihui Zhang, Quan Chen, Chao Li, and Minyi Guo. 2022. VELTAIR: Towards high-performance multi-tenant deep learning services via adaptive compilation and scheduling. 
In 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201922). Association for Computing Machinery, New York, NY, USA, 388\u2013401."},{"issue":"4","key":"e_1_3_1_75_2","doi-asserted-by":"crossref","first-page":"843","DOI":"10.1109\/TPDS.2019.2948753","article-title":"gQoS: A QoS-oriented GPU virtualization with adaptive capacity sharing","volume":"31","author":"Lu Qiumin","year":"2020","unstructured":"Qiumin Lu, Jianguo Yao, Haibing Guan, and Ping Gao. 2020. gQoS: A QoS-oriented GPU virtualization with adaptive capacity sharing. IEEE Trans. Parallel Distrib. Syst. 31, 4 (2020), 843\u2013855.","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"e_1_3_1_76_2","first-page":"605","volume-title":"52nd International Conference on Parallel Processing (ICPP\u201923)","author":"Luo Diaohan","year":"2023","unstructured":"Diaohan Luo, Tian Yu, Yuewen Wu, Heng Wu, Tao Wang, and Wenbo Zhang. 2023. SPLIT: QoS-aware DNN inference on shared GPU via evenly-sized model splitting. In 52nd International Conference on Parallel Processing (ICPP\u201923). Association for Computing Machinery, New York, NY, USA, 605\u2013614."},{"key":"e_1_3_1_77_2","first-page":"827","volume-title":"25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201920)","author":"Ma Jiacheng","year":"2020","unstructured":"Jiacheng Ma, Gefei Zuo, Kevin Loughlin, Xiaohe Cheng, Yanqiang Liu, Abel Mulugeta Eneyew, Zhengwei Qi, and Baris Kasikci. 2020. A hypervisor for shared-memory FPGA platforms. In 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201920). Association for Computing Machinery, New York, NY, USA, 827\u2013844. 
DOI:10.1145\/3373376.3378482"},{"key":"e_1_3_1_78_2","first-page":"289","volume-title":"17th USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201920)","author":"Mahajan Kshiteej","year":"2020","unstructured":"Kshiteej Mahajan, Arjun Balasubramanian, Arjun Singhvi, Shivaram Venkataraman, Aditya Akella, Amar Phanishayee, and Shuchi Chawla. 2020. Themis: Fair and efficient GPU cluster scheduling. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201920). USENIX Association, 289\u2013304."},{"issue":"6","key":"e_1_3_1_79_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3460972","article-title":"An energy-efficient inference method in convolutional neural networks based on dynamic adjustment of the pruning level","volume":"26","author":"Maleki Mohammad-Ali","year":"2021","unstructured":"Mohammad-Ali Maleki, Alireza Nabipour-Meybodi, Mehdi Kamal, Ali Afzali-Kusha, and Massoud Pedram. 2021. An energy-efficient inference method in convolutional neural networks based on dynamic adjustment of the pruning level. ACM Trans. Des. Autom. Electron. Syst. 26, 6 (2021), 1\u201320.","journal-title":"ACM Trans. Des. Autom. Electron. Syst."},{"issue":"1","key":"e_1_3_1_80_2","first-page":"3","article-title":"Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools","volume":"53","author":"Mayer Ruben","year":"2020","unstructured":"Ruben Mayer and Hans-Arno Jacobsen. 2020. Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools. ACM Comput. Surv. 53, 1, Article 3 (Feb. 2020), 37 pages.","journal-title":"ACM Comput. Surv."},{"key":"e_1_3_1_81_2","article-title":"Galvatron: Efficient transformer training over multiple GPUs using automatic parallelism","author":"Miao Xupeng","year":"2022","unstructured":"Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. 2022. 
Galvatron: Efficient transformer training over multiple GPUs using automatic parallelism. arXiv preprint arXiv:2211.13878 (2022).","journal-title":"arXiv preprint arXiv:2211.13878"},{"key":"e_1_3_1_82_2","doi-asserted-by":"publisher","DOI":"10.1145\/3648475"},{"key":"e_1_3_1_83_2","first-page":"212","volume-title":"21st ACM\/IEEE International Symposium on Code Generation and Optimization (CGO\u201923)","author":"Min Hyemi","year":"2023","unstructured":"Hyemi Min, Jungyoon Kwon, and Bernhard Egger. 2023. Flexer: Out-of-order scheduling for multi-NPUs. In 21st ACM\/IEEE International Symposium on Code Generation and Optimization (CGO\u201923). Association for Computing Machinery, New York, NY, USA, 212\u2013223."},{"key":"e_1_3_1_84_2","first-page":"579","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201922)","author":"Mohan Jayashree","year":"2022","unstructured":"Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, and Vijay Chidambaram. 2022. Looking beyond GPUs for DNN scheduling on multi-tenant clusters. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201922). USENIX Association, 579\u2013596."},{"key":"e_1_3_1_85_2","first-page":"204","volume-title":"IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201924)","author":"Na Seonjin","year":"2024","unstructured":"Seonjin Na, Jungwoo Kim, Sunho Lee, and Jaehyuk Huh. 2024. Supporting secure multi-GPU computing with dynamic and batched metadata management. In IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201924). 204\u2013217."},{"issue":"12","key":"e_1_3_1_86_2","doi-asserted-by":"crossref","first-page":"2159","DOI":"10.14778\/3407790.3407816","article-title":"Cerebro: A data system for optimized deep learning model selection","volume":"13","author":"Nakandala Supun","year":"2020","unstructured":"Supun Nakandala, Yuhao Zhang, and Arun Kumar. 2020. 
Cerebro: A data system for optimized deep learning model selection. VLDB Endow. 13, 12 (2020), 2159\u20132173.","journal-title":"VLDB Endow."},{"key":"e_1_3_1_87_2","first-page":"1","volume-title":"International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201921)","author":"Narayanan Deepak","year":"2021","unstructured":"Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro et\u00a0al. 2021. Efficient large-scale language model training on GPU clusters using Megatron-LM. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201921). 1\u201315."},{"issue":"2","key":"e_1_3_1_88_2","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1109\/MM.2021.3058217","article-title":"The design process for Google\u2019s training chips: TPUv2 and TPUv3","volume":"41","author":"Norrie Thomas","year":"2021","unstructured":"Thomas Norrie, Nishant Patil, Doe Hyun Yoon, George Kurian, Sheng Li, James Laudon, Cliff Young, Norman Jouppi, and David Patterson. 2021. The design process for Google\u2019s training chips: TPUv2 and TPUv3. IEEE Micro 41, 2 (2021), 56\u201363.","journal-title":"IEEE Micro"},{"key":"e_1_3_1_89_2","first-page":"584","volume-title":"IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201921)","author":"Oh Young H.","year":"2021","unstructured":"Young H. Oh, Seonghak Kim, Yunho Jin, Sam Son, Jonghyun Bae, Jongsung Lee, Yeonhong Park, Dong Uk Kim, Tae Jun Ham, and Jae W. Lee. 2021. Layerweaver: Maximizing resource utilization of neural processing units via layer-wise scheduling. In IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201921). 
584\u2013597."},{"key":"e_1_3_1_90_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080254"},{"issue":"4","key":"e_1_3_1_91_2","doi-asserted-by":"crossref","first-page":"527","DOI":"10.1145\/3093336.3037707","article-title":"Dynamic resource management for efficient utilization of multitasking GPUs","volume":"52","author":"Park Jason Jong Kyu","year":"2017","unstructured":"Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke. 2017. Dynamic resource management for efficient utilization of multitasking GPUs. SIGPLAN Not. 52, 4 (Apr. 2017), 527\u2013540.","journal-title":"SIGPLAN Not."},{"key":"e_1_3_1_92_2","doi-asserted-by":"publisher","DOI":"10.1145\/3542929.3563467"},{"key":"e_1_3_1_93_2","first-page":"313","volume-title":"25th International Middleware Conference (MIDDLEWARE\u201924)","author":"Pavlidakis Manos","year":"2024","unstructured":"Manos Pavlidakis, Giorgos Vasiliadis, Stelios Mavridis, Anargyros Argyros, Antony Chazapis, and Angelos Bilas. 2024. Guardian: Safe GPU sharing in multi-tenant environments. In 25th International Middleware Conference (MIDDLEWARE\u201924). Association for Computing Machinery, New York, NY, USA, 313\u2013326. DOI:10.1145\/3652892.3700768"},{"key":"e_1_3_1_94_2","first-page":"1511","volume-title":"21st USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201924)","author":"Peng Yajuan","year":"2024","unstructured":"Yajuan Peng, Shuang Chen, Yi Zhao, and Zhibin Yu. 2024. UFO: The ultimate QoS-Aware core management for virtualized and oversubscribed public clouds. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201924). USENIX Association, 1511\u20131530."},{"key":"e_1_3_1_95_2","article-title":"ConServe: Harvesting GPUs for low-latency and high-throughput large language model serving","author":"Qiao Yifan","year":"2024","unstructured":"Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Yang Wang, Miryung Kim, and Harry Xu. 2024. 
ConServe: Harvesting GPUs for low-latency and high-throughput large language model serving. arXiv preprint arXiv:2410.01228 (2024).","journal-title":"arXiv preprint arXiv:2410.01228"},{"key":"e_1_3_1_96_2","first-page":"58","volume-title":"IEEE International Symposium on High Performance Computer Architecture (HPCA\u201920)","author":"Qin Eric","year":"2020","unstructured":"Eric Qin, Ananda Samajdar, Hyoukjun Kwon, Vineet Nadella, Sudarshan Srinivasan, Dipankar Das, Bharat Kaul, and Tushar Krishna. 2020. SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training. In IEEE International Symposium on High Performance Computer Architecture (HPCA\u201920). IEEE, 58\u201370."},{"key":"e_1_3_1_97_2","volume-title":"International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201921)","author":"Ranganath Kiran","year":"2021","unstructured":"Kiran Ranganath, Joshua D. Suetterlein, Joseph B. Manzano, Shuaiwen Leon Song, and Daniel Wong. 2021. MAPA: Multi-accelerator pattern allocation policy for multi-tenant GPU servers. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201921). Association for Computing Machinery, New York, NY, USA, Article 99, 14 pages."},{"issue":"1","key":"e_1_3_1_98_2","doi-asserted-by":"crossref","first-page":"219","DOI":"10.1109\/JSAC.2020.3036971","article-title":"Accelerating DNN training in wireless federated edge learning systems","volume":"39","author":"Ren Jinke","year":"2021","unstructured":"Jinke Ren, Guanding Yu, and Guangyao Ding. 2021. Accelerating DNN training in wireless federated edge learning systems. IEEE J. Select Areas Commun. 39, 1 (2021), 219\u2013232.","journal-title":"IEEE J. 
Select Areas Commun."},{"key":"e_1_3_1_99_2","first-page":"1","volume-title":"49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)","author":"Rhu Minsoo","year":"2016","unstructured":"Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. 2016. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916). 1\u201313."},{"key":"e_1_3_1_100_2","first-page":"185","volume-title":"IEEE International Conference on Cluster Computing (CLUSTER\u201923)","author":"Saroliya Urvij","year":"2023","unstructured":"Urvij Saroliya, Eishi Arima, Dai Liu, and Martin Schulz. 2023. Hierarchical resource partitioning on modern GPUs: A reinforcement learning approach. In IEEE International Conference on Cluster Computing (CLUSTER\u201923). 185\u2013196."},{"key":"e_1_3_1_101_2","article-title":"Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model","author":"Smith Shaden","year":"2022","unstructured":"Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti et\u00a0al. 2022. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022).","journal-title":"arXiv preprint arXiv:2201.11990"},{"key":"e_1_3_1_102_2","first-page":"654","volume-title":"Design, Automation & Test in Europe Conference & Exhibition (DATE\u201918)","author":"Stangl Jakob","year":"2018","unstructured":"Jakob Stangl, Thomas Lor\u00fcnser, and Sai Manoj Pudukotai Dinakarrao. 2018. A fast and resource efficient FPGA implementation of secret sharing for storage applications. In Design, Automation & Test in Europe Conference & Exhibition (DATE\u201918). 
IEEE, 654\u2013659."},{"key":"e_1_3_1_103_2","first-page":"654","volume-title":"Design, Automation & Test in Europe Conference & Exhibition (DATE\u201918)","author":"Stangl Jakob","year":"2018","unstructured":"Jakob Stangl, Thomas Lor\u00fcnser, and Sai Manoj Pudukotai Dinakarrao. 2018. A fast and resource efficient FPGA implementation of secret sharing for storage applications. In Design, Automation & Test in Europe Conference & Exhibition (DATE\u201918). 654\u2013659. DOI:10.23919\/DATE.2018.8342091"},{"key":"e_1_3_1_104_2","first-page":"1075","volume-title":"19th European Conference on Computer Systems (EuroSys\u201924)","author":"Strati Foteini","year":"2024","unstructured":"Foteini Strati, Xianzhe Ma, and Ana Klimovic. 2024. Orion: Interference-aware, fine-grained GPU sharing for ML applications. In 19th European Conference on Computer Systems (EuroSys\u201924). Association for Computing Machinery, New York, NY, USA, 1075\u20131092."},{"key":"e_1_3_1_105_2","article-title":"Llumnix: Dynamic scheduling for large language model serving","author":"Sun Biao","year":"2024","unstructured":"Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic scheduling for large language model serving. arXiv preprint arXiv:2406.03243 (2024).","journal-title":"arXiv preprint arXiv:2406.03243"},{"key":"e_1_3_1_106_2","doi-asserted-by":"crossref","first-page":"102958","DOI":"10.1016\/j.parco.2022.102958","article-title":"QoS-aware dynamic resource allocation with improved utilization and energy efficiency on GPU","volume":"113","author":"Sun Qingxiao","year":"2022","unstructured":"Qingxiao Sun, Liu Yi, Hailong Yang, Mingzhen Li, Zhongzhi Luan, and Depei Qian. 2022. QoS-aware dynamic resource allocation with improved utilization and energy efficiency on GPU. Parallel Comput. 
113 (2022), 102958.","journal-title":"Parallel Comput."},{"key":"e_1_3_1_107_2","article-title":"Serving DNN models with multi-instance GPUs: A case of the reconfigurable machine scheduling problem","author":"Tan Cheng","year":"2021","unstructured":"Cheng Tan, Zhichao Li, Jian Zhang, Yu Cao, Sikai Qi, Zherui Liu, Yibo Zhu, and Chuanxiong Guo. 2021. Serving DNN models with multi-instance GPUs: A case of the reconfigurable machine scheduling problem. arXiv preprint arXiv:2109.11067 (2021).","journal-title":"arXiv preprint arXiv:2109.11067"},{"issue":"2","key":"e_1_3_1_108_2","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1109\/TPDS.2018.2865341","article-title":"A virtual multi-channel GPU fair scheduling method for virtual machines","volume":"30","author":"Tan Huailiang","year":"2019","unstructured":"Huailiang Tan, Yanjie Tan, Xiaofei He, Kenli Li, and Keqin Li. 2019. A virtual multi-channel GPU fair scheduling method for virtual machines. IEEE Trans. Parallel Distrib. Syst. 30, 2 (2019), 257\u2013270.","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"e_1_3_1_109_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373087.3375322"},{"key":"e_1_3_1_110_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2016.73"},{"key":"e_1_3_1_111_2","first-page":"681","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918)","author":"Volos Stavros","year":"2018","unstructured":"Stavros Volos, Kapil Vaswani, and Rodrigo Bruno. 2018. Graviton: Trusted execution environments on GPUs. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918). USENIX Association, 681\u2013696. 
Retrieved from https:\/\/www.usenix.org\/conference\/osdi18\/presentation\/volos"},{"issue":"3","key":"e_1_3_1_112_2","first-page":"653","article-title":"Online scheduling of distributed machine learning jobs for incentivizing sharing in multi-tenant systems","volume":"72","author":"Wang Ne","year":"2023","unstructured":"Ne Wang, Ruiting Zhou, Ling Han, Hao Chen, and Zongpeng Li. 2023. Online scheduling of distributed machine learning jobs for incentivizing sharing in multi-tenant systems. IEEE Trans. Comput. 72, 3 (2023), 653\u2013667.","journal-title":"IEEE Trans. Comput."},{"key":"e_1_3_1_113_2","first-page":"1","volume-title":"International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201920)","author":"Wang Shaoqi","year":"2020","unstructured":"Shaoqi Wang, Oscar J. Gonzalez, Xiaobo Zhou, Thomas Williams, Brian D. Friedman, Martin Havemann, and Thomas Woo. 2020. An efficient and non-intrusive GPU scheduling framework for deep learning training systems. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201920). IEEE, 1\u201313."},{"issue":"5","key":"e_1_3_1_114_2","doi-asserted-by":"crossref","first-page":"4403","DOI":"10.1109\/JIOT.2020.2976702","article-title":"FANN-on-MCU: An open-source toolkit for energy-efficient neural network inference at the edge of the Internet of Things","volume":"7","author":"Wang Xiaying","year":"2020","unstructured":"Xiaying Wang, Michele Magno, Lukas Cavigelli, and Luca Benini. 2020. FANN-on-MCU: An open-source toolkit for energy-efficient neural network inference at the edge of the Internet of Things. IEEE Internet Things J. 7, 5 (2020), 4403\u20134417.","journal-title":"IEEE Internet Things J."},{"key":"e_1_3_1_115_2","first-page":"158","volume-title":"IEEE 41st International Conference on Computer Design (ICCD\u201923)","author":"Wang Xuhang","year":"2023","unstructured":"Xuhang Wang, Zhuoran Song, and Xiaoyao Liang. 2023. 
RealArch: A real-time scheduler for mapping multi-tenant DNNs on multi-core accelerators. In IEEE 41st International Conference on Computer Design (ICCD\u201923). 158\u2013165."},{"key":"e_1_3_1_116_2","first-page":"945","volume-title":"19th USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201922)","author":"Weng Qizhen","year":"2022","unstructured":"Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. 2022. MLaaS in the wild: Workload analysis and scheduling in large-scale heterogeneous GPU clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201922). USENIX Association, 945\u2013960."},{"key":"e_1_3_1_117_2","first-page":"995","volume-title":"USENIX Annual Technical Conference (USENIX ATC\u201923)","author":"Weng Qizhen","year":"2023","unstructured":"Qizhen Weng, Lingyun Yang, Yinghao Yu, Wei Wang, Xiaochuan Tang, Guodong Yang, and Liping Zhang. 2023. Beware of fragmentation: Scheduling GPU-sharing workloads with fragmentation gradient descent. In USENIX Annual Technical Conference (USENIX ATC\u201923). USENIX Association, 995\u20131008."},{"key":"e_1_3_1_118_2","first-page":"69","volume-title":"20th USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201923)","author":"Wu Bingyang","year":"2023","unstructured":"Bingyang Wu, Zili Zhang, Zhihao Bai, Xuanzhe Liu, and Xin Jin. 2023. Transparent GPU sharing in container clouds for deep learning workloads. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201923). USENIX Association, 69\u201385."},{"key":"e_1_3_1_119_2","first-page":"595","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918)","author":"Xiao Wencong","year":"2018","unstructured":"Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang et al. 2018. 
Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918). USENIX Association, 595\u2013610. Retrieved from https:\/\/www.usenix.org\/conference\/osdi18\/presentation\/xiao"},{"key":"e_1_3_1_120_2","first-page":"533","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201920)","author":"Xiao Wencong","year":"2020","unstructured":"Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. 2020. AntMan: Dynamic scaling on GPU clusters for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201920). USENIX Association, 533\u2013548."},{"issue":"3","key":"e_1_3_1_121_2","doi-asserted-by":"crossref","first-page":"812","DOI":"10.1109\/TPDS.2022.3232715","article-title":"iGniter: Interference-aware GPU resource provisioning for predictable DNN inference in the cloud","volume":"34","author":"Xu Fei","year":"2023","unstructured":"Fei Xu, Jianian Xu, Jiabin Chen, Li Chen, Ruitao Shang, Zhi Zhou, and Fangming Liu. 2023. iGniter: Interference-aware GPU resource provisioning for predictable DNN inference in the cloud. IEEE Trans. Parallel Distrib. Syst. 34, 3 (2023), 812\u2013827.","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"e_1_3_1_122_2","volume-title":"11th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud\u201919)","author":"Xu Xin","year":"2019","unstructured":"Xin Xu, Na Zhang, Michael Cui, Michael He, and Ridhi Surana. 2019. Characterization and prediction of performance interference on mediated passthrough GPUs for interference-aware scheduler. In 11th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud\u201919). 
USENIX Association."},{"issue":"4","key":"e_1_3_1_123_2","first-page":"799","article-title":"Energy-aware inference offloading for DNN-driven applications in mobile edge clouds","volume":"32","author":"Xu Zichuan","year":"2020","unstructured":"Zichuan Xu, Liqian Zhao, Weifa Liang, Omer F. Rana, Pan Zhou, Qiufen Xia, Wenzheng Xu, and Guowei Wu. 2020. Energy-aware inference offloading for DNN-driven applications in mobile edge clouds. IEEE Trans. Parallel Distrib. Syst. 32, 4 (2020), 799\u2013814.","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"issue":"8","key":"e_1_3_1_124_2","doi-asserted-by":"crossref","first-page":"1823","DOI":"10.1109\/TPDS.2018.2789883","article-title":"Scalable GPU virtualization with dynamic sharing of graphics memory space","volume":"29","author":"Xue Mochi","year":"2018","unstructured":"Mochi Xue, Jiacheng Ma, Wentai Li, Kun Tian, Yaozu Dong, Jinyu Wu, Zhengwei Qi, Bingsheng He, and Haibing Guan. 2018. Scalable GPU virtualization with dynamic sharing of graphics memory space. IEEE Trans. Parallel Distrib. Syst. 29, 8 (2018), 1823\u20131836.","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"e_1_3_1_125_2","volume-title":"50th Annual International Symposium on Computer Architecture (ISCA\u201923)","author":"Xue Yuqi","year":"2023","unstructured":"Yuqi Xue, Yiqi Liu, Lifeng Nai, and Jian Huang. 2023. V10: Hardware-assisted NPU multi-tenancy for improved resource utilization and fairness. In 50th Annual International Symposium on Computer Architecture (ISCA\u201923). Association for Computing Machinery, New York, NY, USA, Article 24, 15 pages."},{"key":"e_1_3_1_126_2","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1145\/2370816.2370858","volume-title":"21st International Conference on Parallel Architectures and Compilation Techniques (PACT\u201912)","author":"Yang Yi","year":"2012","unstructured":"Yi Yang, Ping Xiang, Mike Mantor, Norm Rubin, and Huiyang Zhou. 2012. 
Shared memory multiplexing: A novel way to improve GPGPU throughput. In 21st International Conference on Parallel Architectures and Compilation Techniques (PACT\u201912). Association for Computing Machinery, New York, NY, USA, 283\u2013292."},{"issue":"5","key":"e_1_3_1_127_2","doi-asserted-by":"crossref","first-page":"1371","DOI":"10.1109\/TC.2022.3199998","article-title":"An economy-oriented GPU virtualization with dynamic and adaptive oversubscription","volume":"72","author":"Yao Jianguo","year":"2023","unstructured":"Jianguo Yao, Qiumin Lu, Run Tian, Keqin Li, and Haibing Guan. 2023. An economy-oriented GPU virtualization with dynamic and adaptive oversubscription. IEEE Trans. Comput. 72, 5 (2023), 1371\u20131383.","journal-title":"IEEE Trans. Comput."},{"key":"e_1_3_1_128_2","first-page":"476","volume-title":"18th Conference on Embedded Networked Sensor Systems","author":"Yao Shuochao","year":"2020","unstructured":"Shuochao Yao, Jinyang Li, Dongxin Liu, Tianshi Wang, Shengzhong Liu, Huajie Shao, and Tarek Abdelzaher. 2020. Deep compressive offloading: Speeding up neural network inference by trading edge computation for network latency. In 18th Conference on Embedded Networked Sensor Systems. 476\u2013488."},{"issue":"6","key":"e_1_3_1_129_2","first-page":"146","article-title":"Deep learning workload scheduling in GPU datacenters: A survey","volume":"56","author":"Ye Zhisheng","year":"2024","unstructured":"Zhisheng Ye, Wei Gao, Qinghao Hu, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, and Yonggang Wen. 2024. Deep learning workload scheduling in GPU datacenters: A survey. ACM Comput. Surv. 56, 6, Article 146 (Jan. 2024), 38 pages.","journal-title":"ACM Comput. Surv."},{"key":"e_1_3_1_130_2","first-page":"173","volume-title":"29th International Symposium on High-Performance Parallel and Distributed Computing (HPDC\u201920)","author":"Yeh Ting-An","year":"2020","unstructured":"Ting-An Yeh, Hung-Hsin Chen, and Jerry Chou. 2020. 
KubeShare: A framework to manage GPUs as first-class and shared resources in container cloud. In 29th International Symposium on High-Performance Parallel and Distributed Computing (HPDC\u201920). Association for Computing Machinery, New York, NY, USA, 173\u2013184."},{"key":"e_1_3_1_131_2","article-title":"A survey of multi-tenant deep learning inference on GPU","author":"Yu Fuxun","year":"2022","unstructured":"Fuxun Yu, Di Wang, Longfei Shangguan, Minjia Zhang, Chenchen Liu, and Xiang Chen. 2022. A survey of multi-tenant deep learning inference on GPU. arXiv preprint arXiv:2203.09040 (2022).","journal-title":"arXiv preprint arXiv:2203.09040"},{"key":"e_1_3_1_132_2","first-page":"807","volume-title":"25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201920)","author":"Yu Hangchen","year":"2020","unstructured":"Hangchen Yu, Arthur Michener Peters, Amogh Akshintala, and Christopher J. Rossbach. 2020. AvA: Accelerated virtualization of accelerators. In 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201920). Association for Computing Machinery, New York, NY, USA, 807\u2013825."},{"key":"e_1_3_1_133_2","first-page":"98","volume-title":"Proceedings of Machine Learning and Systems","volume":"2","author":"Yu Peifeng","year":"2020","unstructured":"Peifeng Yu and Mosharaf Chowdhury. 2020. Fine-grained GPU sharing primitives for deep learning applications. In Proceedings of Machine Learning and Systems, I. Dhillon, D. Papailiopoulos, and V. Sze (Eds.), Vol. 2. 98\u2013111."},{"key":"e_1_3_1_134_2","first-page":"472","volume-title":"IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201920)","author":"Yu Qi","year":"2020","unstructured":"Qi Yu, Bruce Childers, Libo Huang, Cheng Qian, Hui Guo, and Zhiying Wang. 2020. Coordinated page prefetch and eviction for memory oversubscription management in GPUs. 
In IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201920). 472\u2013482."},{"issue":"10","key":"e_1_3_1_135_2","doi-asserted-by":"crossref","first-page":"2963","DOI":"10.1109\/TC.2023.3278541","article-title":"Enabling efficient spatio-temporal GPU sharing for network function virtualization","volume":"72","author":"Zeng Deze","year":"2023","unstructured":"Deze Zeng, Andong Zhu, Lin Gu, Peng Li, Quan Chen, and Minyi Guo. 2023. Enabling efficient spatio-temporal GPU sharing for network function virtualization. IEEE Trans. Comput. 72, 10 (2023), 2963\u20132977.","journal-title":"IEEE Trans. Comput."},{"key":"e_1_3_1_136_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2022.3214113"},{"issue":"9","key":"e_1_3_1_137_2","doi-asserted-by":"crossref","first-page":"2580","DOI":"10.1109\/TPDS.2023.3287883","article-title":"GraphAGILE: An FPGA-based overlay accelerator for low-latency GNN inference","volume":"34","author":"Zhang Bingyi","year":"2023","unstructured":"Bingyi Zhang, Hanqing Zeng, and Viktor K. Prasanna. 2023. GraphAGILE: An FPGA-based overlay accelerator for low-latency GNN inference. IEEE Trans. Parallel Distrib. Syst. 34, 9 (2023), 2580\u20132597.","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"issue":"9","key":"e_1_3_1_138_2","doi-asserted-by":"crossref","first-page":"2262","DOI":"10.1109\/TPDS.2021.3059108","article-title":"An efficient parallel secure machine learning framework on GPUs","volume":"32","author":"Zhang Feng","year":"2021","unstructured":"Feng Zhang, Zheng Chen, Chenyang Zhang, Amelie Chi Zhou, Jidong Zhai, and Xiaoyong Du. 2021. An efficient parallel secure machine learning framework on GPUs. IEEE Trans. Parallel Distrib. Syst. 32, 9 (2021), 2262\u20132276.","journal-title":"IEEE Trans. Parallel Distrib. 
Syst."},{"key":"e_1_3_1_139_2","doi-asserted-by":"crossref","first-page":"395","DOI":"10.1145\/3613424.3614309","volume-title":"56th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201923)","author":"Zhang Haoyang","year":"2023","unstructured":"Haoyang Zhang, Yirui Zhou, Yuqi Xue, Yiqi Liu, and Jian Huang. 2023. G10: Enabling an efficient unified GPU memory and storage architecture with smart tensor migrations. In 56th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201923). Association for Computing Machinery, New York, NY, USA, 395\u2013410."},{"key":"e_1_3_1_140_2","first-page":"187","volume-title":"15th USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201918)","author":"Zhang Kai","year":"2018","unstructured":"Kai Zhang, Bingsheng He, Jiayu Hu, Zeke Wang, Bei Hua, Jiayi Meng, and Lishan Yang. 2018. G-NET: Effective GPU sharing in NFV systems. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201918). USENIX Association, 187\u2013200."},{"key":"e_1_3_1_141_2","first-page":"1","volume-title":"49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)","author":"Zhang Shijin","year":"2016","unstructured":"Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916). IEEE, 1\u201312."},{"issue":"6","key":"e_1_3_1_142_2","doi-asserted-by":"crossref","first-page":"1451","DOI":"10.1109\/TPDS.2021.3115630","article-title":"A survey of GPU multitasking methods supported by hardware architecture","volume":"33","author":"Zhao Chen","year":"2022","unstructured":"Chen Zhao, Wu Gao, Feiping Nie, and Huiyang Zhou. 2022. A survey of GPU multitasking methods supported by hardware architecture. IEEE Trans. Parallel Distrib. Syst. 
33, 6 (2022), 1451\u20131463.","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"e_1_3_1_143_2","first-page":"742","volume-title":"IEEE 38th International Conference on Distributed Computing Systems (ICDCS\u201918)","author":"Zhao Xiaohui","year":"2018","unstructured":"Xiaohui Zhao, Jianguo Yao, Ping Gao, and Haibing Guan. 2018. Efficient sharing and fine-grained scheduling of virtualized GPU resources. In IEEE 38th International Conference on Distributed Computing Systems (ICDCS\u201918). 742\u2013752."},{"key":"e_1_3_1_144_2","first-page":"428","volume-title":"ACM SIGCOMM Conference (SIGCOMM\u201922)","author":"Zhao Yihao","year":"2022","unstructured":"Yihao Zhao, Yuanqiang Liu, Yanghua Peng, Yibo Zhu, Xuanzhe Liu, and Xin Jin. 2022. Multi-resource interleaving for deep learning training. In ACM SIGCOMM Conference (SIGCOMM\u201922). Association for Computing Machinery, New York, NY, USA, 428\u2013440."},{"issue":"2","key":"e_1_3_1_145_2","first-page":"610","article-title":"FPGA resource pooling in cloud computing","volume":"9","author":"Zhu Zhuangdi","year":"2018","unstructured":"Zhuangdi Zhu, Alex X. Liu, Fan Zhang, and Fei Chen. 2018. FPGA resource pooling in cloud computing. IEEE Trans. Cloud Comput. 9, 2 (2018), 610\u2013626.","journal-title":"IEEE Trans. 
Cloud Comput."}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3721427","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:09:47Z","timestamp":1750295387000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3721427"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4,4]]},"references-count":144,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2025,9,30]]}},"alternative-id":["10.1145\/3721427"],"URL":"https:\/\/doi.org\/10.1145\/3721427","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,4,4]]},"assertion":[{"value":"2024-08-28","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-02-24","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-04-04","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}