{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,17]],"date-time":"2026-02-17T12:39:47Z","timestamp":1771331987097,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":81,"publisher":"ACM","license":[{"start":{"date-parts":[[2025,3,30]],"date-time":"2025-03-30T00:00:00Z","timestamp":1743292800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,3,30]]},"DOI":"10.1145\/3669940.3707282","type":"proceedings-article","created":{"date-parts":[[2025,2,6]],"date-time":"2025-02-06T12:28:01Z","timestamp":1738844881000},"page":"1052-1068","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0622-9258","authenticated-orcid":false,"given":"Wei","family":"Zhao","sequence":"first","affiliation":[{"name":"Stanford University, Stanford, USA, & CentML, Toronto, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2118-5935","authenticated-orcid":false,"given":"Anand","family":"Jayarajan","sequence":"additional","affiliation":[{"name":"University of Toronto, Toronto, Canada, Vector Institute, Toronto, Canada, & CentML, Toronto, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3839-0919","authenticated-orcid":false,"given":"Gennady","family":"Pekhimenko","sequence":"additional","affiliation":[{"name":"University of Toronto, Toronto, Canada, Vector Institute, Toronto, Canada, & CentML, Toronto, Canada"}]}],"member":"320","published-online":{"date-parts":[[2025,3,30]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Available online","year":"2024","unstructured":"Multi-process 
service. Available online, 2024. https:\/\/docs.nvidia.com\/deploy\/mps\/."},{"key":"e_1_3_2_1_2_1","first-page":"265","volume-title":"12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16)","author":"Abadi Martin","year":"2016","unstructured":"Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265--283, 2016."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.10626"},{"key":"e_1_3_2_1_4_1","first-page":"252","volume-title":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","author":"Allen Tyler","year":"2019","unstructured":"Tyler Allen, Xizhou Feng, and Rong Ge. Slate: Enabling workload-aware efficient multiprocessing for modern gpgpus. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 252--261, 2019."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640366"},{"key":"e_1_3_2_1_6_1","first-page":"358","volume-title":"Proceedings of the 2023 ACM Symposium on Cloud Computing, SoCC '23","author":"Chen Yang","year":"2023","unstructured":"Andrew Audibert, Yang Chen, Dan Graur, Ana Klimovic, Ji\u0159\u00ed \u0160im\u0161a, and Chandramohan A. Thekkath. tf.data service: A case for disaggregating ml input data processing. In Proceedings of the 2023 ACM Symposium on Cloud Computing, SoCC '23, page 358--375, New York, NY, USA, 2023. 
Association for Computing Machinery."},{"key":"e_1_3_2_1_7_1","volume-title":"3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7--9, 2015, Conference Track Proceedings","author":"Bahdanau Dzmitry","year":"2015","unstructured":"Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7--9, 2015, Conference Track Proceedings, 2015."},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/IC2E.2019.00-10"},{"key":"e_1_3_2_1_9_1","volume-title":"March","author":"Black Sid","year":"2021","unstructured":"Sid Black, Gao Leo, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, March 2021. If you use this software, please cite it using these metadata."},{"key":"e_1_3_2_1_10_1","volume-title":"Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs","author":"Bradbury James","year":"2018","unstructured":"James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018."},{"key":"e_1_3_2_1_11_1","volume-title":"Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. 
Advances in neural information processing systems, 33:1877--1901, 2020."},{"key":"e_1_3_2_1_12_1","volume-title":"et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012","author":"Chang Angel X","year":"2015","unstructured":"Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015."},{"key":"e_1_3_2_1_13_1","first-page":"17","volume-title":"Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '22","author":"Chen Chao","year":"2022","unstructured":"Chao Chen, Chris Porter, and Santosh Pande. Case: a compiler-assisted scheduling framework for multi-gpu systems. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '22, page 17--31, New York, NY, USA, 2022. Association for Computing Machinery."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3155284.3018748"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.350"},{"key":"e_1_3_2_1_16_1","first-page":"2024","author":"NVIDIA Corporation","year":"2024","unstructured":"NVIDIA Corporation. CUDA C++ Programming Guide. NVIDIA Corporation, 2024. Accessed: 2024-10-26.","journal-title":"NVIDIA Corporation"},{"key":"e_1_3_2_1_17_1","first-page":"2024","author":"NVIDIA Corporation","year":"2024","unstructured":"NVIDIA Corporation. CUDA C++ Programming Guide. NVIDIA Corporation, 2024. Accessed: 2024-10-26.","journal-title":"NVIDIA Corporation"},{"key":"e_1_3_2_1_18_1","first-page":"613","volume-title":"14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17)","author":"Crankshaw Daniel","year":"2017","unstructured":"Daniel Crankshaw, Xin Wang, Guilio Zhou, Michael J. Franklin, Joseph E. Gonzalez, and Ion Stoica. 
Clipper: A Low-Latency online prediction serving system. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 613--627, Boston, MA, March 2017. USENIX Association."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD46524.2019.00075"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_1_21_1","volume-title":"Onnx runtime. https:\/\/onnxruntime.ai\/","author":"ONNX","year":"2021","unstructured":"ONNX Runtime developers. Onnx runtime. https:\/\/onnxruntime.ai\/, 2021. Version: x.y.z."},{"key":"e_1_3_2_1_22_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","volume":"1","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171--4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575702"},{"key":"e_1_3_2_1_24_1","volume-title":"Aug","author":"Eadline Doug","year":"2023","unstructured":"Doug Eadline. Nvidia h100: Are 550,000 gpus enough for this year?, Aug 2023."},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.peva.2021.102234"},{"key":"e_1_3_2_1_26_1","volume-title":"Kubernetes: Production-grade container orchestration. https:\/\/kubernetes.io","author":"Inc. Google","year":"2014","unstructured":"Inc. Google. Kubernetes: Production-grade container orchestration. https:\/\/kubernetes.io, 2014. 
Accessed: 2024-06-24."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/InPar.2012.6339596"},{"key":"e_1_3_2_1_28_1","first-page":"539","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Han Mingcong","year":"2022","unstructured":"Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 539--558, Carlsbad, CA, July 2022. USENIX Association."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476223"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575705"},{"key":"e_1_3_2_1_32_1","volume-title":"CUTLASS: Fast Linear Algebra in CUDA C++","author":"Kerr Andrew","year":"2017","unstructured":"Andrew Kerr, Duane Merrill, Julien Demouth, and John Tran. CUTLASS: Fast Linear Algebra in CUDA C++. NVIDIA Corporation, December 2017. Version 2.10."},{"key":"e_1_3_2_1_33_1","volume-title":"Advances in Neural Information Processing Systems","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012."},{"key":"e_1_3_2_1_34_1","volume-title":"May","author":"Lee Kevin","year":"2024","unstructured":"Kevin Lee, Kevin Lee, Adi Gangidi, and Mathew Oldham. 
Building meta's genai infrastructure, May 2024."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.703"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO53902.2022.9741270"},{"key":"e_1_3_2_1_37_1","volume-title":"Yolov6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976","author":"Li Chuyi","year":"2022","unstructured":"Chuyi Li, Lulu Li, Hongliang Jiang, Kaiheng Weng, Yifei Geng, Liang Li, Zaidan Ke, Qingyuan Li, Meng Cheng, Weiqiang Nie, et al. Yolov6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976, 2022."},{"key":"e_1_3_2_1_38_1","first-page":"663","volume-title":"17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)","author":"Li Zhuohan","year":"2023","unstructured":"Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 663--679, Boston, MA, July 2023. USENIX Association."},{"key":"e_1_3_2_1_39_1","volume-title":"Pointer sentinel mixture models","author":"Merity Stephen","year":"2016","unstructured":"Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1206"},{"key":"e_1_3_2_1_41_1","first-page":"481","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Narayanan Deepak","year":"2020","unstructured":"Deepak Narayanan, Keshav Santhanam, Fiodar Kazhamiaka, Amar Phanishayee, and Matei Zaharia. {Heterogeneity-Aware} cluster scheduling policies for deep learning workloads. 
In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 481--498, 2020."},{"key":"e_1_3_2_1_42_1","first-page":"481","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Narayanan Deepak","year":"2020","unstructured":"Deepak Narayanan, Keshav Santhanam, Fiodar Kazhamiaka, Amar Phanishayee, and Matei Zaharia. Heterogeneity-Aware cluster scheduling policies for deep learning workloads. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 481--498. USENIX Association, November 2020."},{"key":"e_1_3_2_1_43_1","first-page":"12","author":"NVIDIA Corporation","year":"2024","unstructured":"NVIDIA Corporation. Basic Linear Algebra on NVIDIA GPUs, 2024. Version 12.0.","journal-title":"Basic Linear Algebra on NVIDIA GPUs"},{"key":"e_1_3_2_1_44_1","volume-title":"June","author":"NVIDIA Corporation","year":"2024","unstructured":"NVIDIA Corporation. CUDA C++ Programming Guide, version 12.5 edition, June 2024. https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/."},{"key":"e_1_3_2_1_45_1","volume-title":"NVIDIA A100 Tensor Core GPU","author":"NVIDIA Corporation","year":"2024","unstructured":"NVIDIA Corporation. NVIDIA A100 Tensor Core GPU, 2024. Accessed: 2024-12-13."},{"key":"e_1_3_2_1_46_1","volume-title":"June","author":"NVIDIA Corporation","year":"2024","unstructured":"NVIDIA Corporation. NVIDIA Multi-Instance GPU, June 2024. https:\/\/www.nvidia.com\/en-us\/technologies\/multi-instance-gpu\/."},{"key":"e_1_3_2_1_47_1","volume-title":"Available online","author":"Chatgpt AI.","year":"2024","unstructured":"OpenAI. Chatgpt. Available online, 2024. https:\/\/chat.openai.com\/chat."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"e_1_3_2_1_49_1","volume-title":"et al. Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems, 32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019."},{"key":"e_1_3_2_1_50_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Qi Charles R.","year":"2017","unstructured":"Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017."},{"key":"e_1_3_2_1_51_1","first-page":"28492","volume-title":"International Conference on Machine Learning","author":"Radford Alec","year":"2023","unstructured":"Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492--28518. PMLR, 2023."},{"issue":"8","key":"e_1_3_2_1_52_1","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.","journal-title":"OpenAI blog"},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1264"},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_2_1_55_1","volume-title":"Proximal policy optimization algorithms. 
arXiv preprint arXiv:1707.06347","author":"Schulman John","year":"2017","unstructured":"John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017."},{"key":"e_1_3_2_1_56_1","volume-title":"et al. Singularity: Planet-scale, preemptive and elastic scheduling of ai workloads. arXiv preprint arXiv:2202.07848","author":"Shukla Dharma","year":"2022","unstructured":"Dharma Shukla, Muthian Sivathanu, Srinidhi Viswanatha, Bhargav Gulavani, Rimma Nehme, Amey Agrawal, Chen Chen, Nipun Kwatra, Ramachandran Ramjee, Pankaj Sharma, et al. Singularity: Planet-scale, preemptive and elastic scheduling of ai workloads. arXiv preprint arXiv:2202.07848, 2022."},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1038\/nature16961"},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3627703.3629578"},{"key":"e_1_3_2_1_59_1","first-page":"3104","volume-title":"Proceedings of the 27th International Conference on Neural Information Processing Systems -","volume":"2","author":"Sutskever Ilya","year":"2014","unstructured":"Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, page 3104--3112, Cambridge, MA, USA, 2014. MIT Press."},{"key":"e_1_3_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_3_2_1_61_1","first-page":"317","volume-title":"Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, PACT '22","author":"Tan Xiaodan Serina","year":"2023","unstructured":"Xiaodan Serina Tan, Pavel Golikov, Nandita Vijaykumar, and Gennady Pekhimenko. Gpupool: A holistic approach to fine-grained gpu sharing in the cloud. 
In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, PACT '22, page 317--332, New York, NY, USA, 2023. Association for Computing Machinery."},{"key":"e_1_3_2_1_62_1","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic Sergey Edunov and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models 2023."},{"key":"e_1_3_2_1_63_1","volume-title":"International Conference on Learning Representations","author":"Veli\u010dkovi\u0107 Petar","year":"2018","unstructured":"Petar Veli\u010dkovi\u0107, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li\u00f2, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018."},{"key":"e_1_3_2_1_64_1","first-page":"344","volume-title":"Kernel fusion: An effective method for better power efficiency on multithreaded gpu","author":"Wang Guibin","year":"2010","unstructured":"Guibin Wang, YiSong Lin, and Wei Yi. Kernel fusion: An effective method for better power efficiency on multithreaded gpu. 
pages 344--350, 2010."},{"key":"e_1_3_2_1_65_1","first-page":"599","article-title":"Horizontally fused training array: An effective hardware utilization squeezer for training novel deep learning models","volume":"3","author":"Wang Shang","year":"2021","unstructured":"Shang Wang, Peiming Yang, Yuxuan Zheng, Xin Li, and Gennady Pekhimenko. Horizontally fused training array: An effective hardware utilization squeezer for training novel deep learning models. Proceedings of Machine Learning and Systems, 3:599--623, 2021.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2016.7446078"},{"key":"e_1_3_2_1_67_1","first-page":"945","volume-title":"19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)","author":"Weng Qizhen","year":"2022","unstructured":"Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. MLaaS in the wild: Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), pages 945--960, Renton, WA, April 2022. USENIX Association."},{"key":"e_1_3_2_1_68_1","first-page":"69","volume-title":"20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)","author":"Zhang Zili","year":"2023","unstructured":"Bingyang Wu, Zili Zhang, Zhihao Bai, Xuanzhe Liu, and Xin Jin. Transparent GPU sharing in container clouds for deep learning workloads. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 69--85, Boston, MA, April 2023. 
USENIX Association."},{"key":"e_1_3_2_1_69_1","first-page":"595","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Xiao Wencong","year":"2018","unstructured":"Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 595--610, Carlsbad, CA, October 2018. USENIX Association."},{"key":"e_1_3_2_1_70_1","first-page":"533","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Xiao Wencong","year":"2020","unstructured":"Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. AntMan: Dynamic scaling on GPU clusters for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 533--548. USENIX Association, November 2020."},{"key":"e_1_3_2_1_71_1","first-page":"98","volume-title":"Proceedings of Machine Learning and Systems","volume":"2","author":"Yu Peifeng","year":"2020","unstructured":"Peifeng Yu and Mosharaf Chowdhury. Fine-grained gpu sharing primitives for deep learning applications. In I. Dhillon, D. Papailiopoulos, and V. Sze, editors, Proceedings of Machine Learning and Systems, volume 2, pages 98--111, 2020."},{"key":"e_1_3_2_1_72_1","volume-title":"Curran Associates Inc.","author":"Yun Seongjun","year":"2019","unstructured":"Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J. Kim. Graph transformer networks. 
Curran Associates Inc., Red Hook, NY, USA, 2019."},{"key":"e_1_3_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.2983149"},{"key":"e_1_3_2_1_74_1","first-page":"1049","volume-title":"2019 USENIX Annual Technical Conference (USENIX ATC 19)","author":"Zhang Chengliang","year":"2019","unstructured":"Chengliang Zhang, Minchen Yu, Wei Wang, and Feng Yan. MArk: Exploiting cloud services for Cost-Effective, SLO-Aware machine learning inference serving. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 1049--1062, Renton, WA, July 2019. USENIX Association."},{"key":"e_1_3_2_1_75_1","first-page":"11328","volume-title":"International conference on machine learning","author":"Zhang Jingqing","year":"2020","unstructured":"Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International conference on machine learning, pages 11328--11339. PMLR, 2020."},{"key":"e_1_3_2_1_76_1","volume-title":"Deep learning based recommender system: A survey and new perspectives. ACM Comput. Surv., 52(1), feb","author":"Zhang Shuai","year":"2019","unstructured":"Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. Deep learning based recommender system: A survey and new perspectives. ACM Comput. 
Surv., 52(1), feb 2019."},{"key":"e_1_3_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3477132.3483580"},{"key":"e_1_3_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2022.3214088"},{"key":"e_1_3_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA53966.2022.00064"},{"key":"e_1_3_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2013.257"},{"key":"e_1_3_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1109\/RTAS.2015.7108420"}],"event":{"name":"ASPLOS '25: 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","location":"Rotterdam Netherlands","acronym":"ASPLOS '25","sponsor":["SIGPLAN ACM Special Interest Group on Programming Languages","SIGOPS ACM Special Interest Group on Operating Systems","SIGARCH ACM Special Interest Group on Computer Architecture"]},"container-title":["Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3669940.3707282","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3669940.3707282","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T14:49:42Z","timestamp":1755787782000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3669940.3707282"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,30]]},"references-count":81,"alternative-id":["10.1145\/3669940.3707282","10.1145\/3669940"],"URL":"https:\/\/doi.org\/10.1145\/3669940.3707282","relation":{},"subject":[],"published":{"date-parts":[[2025,3,30]]},"assertion":[{"value":"2025-03-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}