{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,22]],"date-time":"2026-07-22T15:26:24Z","timestamp":1784733984362,"version":"3.55.0"},"publisher-location":"New York, NY, USA","reference-count":50,"publisher":"ACM","funder":[{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["2024YFB4505904"],"award-info":[{"award-number":["2024YFB4505904"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Guangdong Basic and Applied Basic Research Foundation","award":["2023B1515020054"],"award-info":[{"award-number":["2023B1515020054"]}]},{"DOI":"10.13039\/100017052","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62272495"],"award-info":[{"award-number":["62272495"]}],"id":[{"id":"10.13039\/100017052","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Ant Group"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,11,19]]},"DOI":"10.1145\/3772052.3772242","type":"proceedings-article","created":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T16:19:00Z","timestamp":1768321140000},"page":"402-415","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Defragmentation Scheduling with Deep Reinforcement Learning in Shared GPU Clusters"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-9767-8366","authenticated-orcid":false,"given":"Qingfu","family":"Wu","sequence":"first","affiliation":[{"name":"Sun Yat-sen University, Guangzhou, Guangdong, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0972-6900","authenticated-orcid":false,"given":"Pengfei","family":"Chen","sequence":"additional","affiliation":[{"name":"Sun Yat-sen University, Guangzhou, Guangdong, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9262-8652","authenticated-orcid":false,"given":"Yilun","family":"Wang","sequence":"additional","affiliation":[{"name":"Sun Yat-sen University, Guangzhou, Guangdong, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2026,1,13]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1002\/spe.3066"},{"key":"e_1_3_2_1_2_1","volume-title":"Hadamard product in deep learning: introduction, advances and challenges","author":"Chrysos Grigorios G","year":"2025","unstructured":"Grigorios G Chrysos, Yongtao Wu, Razvan Pascanu, Philip Torr, and Volkan Cevher. 2025. Hadamard product in deep learning: introduction, advances and challenges. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)."},{"key":"e_1_3_2_1_3_1","unstructured":"Jianbo Dong Bin Luo Jun Zhang Pengcheng Zhang Fei Feng Yikai Zhu Ang Liu Zian Chen Yi Shi Hairong Jiao et al. 2024. Boosting large-scale parallel training efficiency with c4: A communication-driven approach. arXiv preprint arXiv:2406.04594 (2024)."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2740070.2626334"},{"key":"e_1_3_2_1_5_1","unstructured":"Aaron Grattafiori Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Alex Vaughan et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575721"},{"key":"e_1_3_2_1_7_1","volume-title":"16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19)","author":"Gu Juncheng","year":"2019","unstructured":"Juncheng Gu, Mosharaf Chowdhury, Kang G Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Liu, and Chuanxiong Guo. 2019. Tiresias: A GPU cluster manager for distributed deep learning. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). 485\u2013500."},{"key":"e_1_3_2_1_8_1","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Hadary Ori","year":"2020","unstructured":"Ori Hadary, Luke Marshall, Ishai Menache, Abhisek Pan, Esaias E Greeff, David Dion, Star Dorminey, Shailesh Joshi, Yang Chen, Mark Russinovich, et al. 2020. Protean: VM allocation service at scale. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 845\u2013861."},{"key":"e_1_3_2_1_9_1","unstructured":"HAMi. 2025. Heterogeneous AI computing virtualization middleware. https:\/\/github.com\/Project-HAMi\/HAMi."},{"key":"e_1_3_2_1_10_1","volume-title":"Generative adversarial imitation learning. Advances in neural information processing systems 29","author":"Ho Jonathan","year":"2016","unstructured":"Jonathan Ho and Stefano Ermon. 2016. Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016)."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476223"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.5555\/3691825.3691864"},{"key":"e_1_3_2_1_13_1","volume-title":"Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"2","author":"Hu Qinghao","year":"2023","unstructured":"Qinghao Hu, Meng Zhang, Peng Sun, Yonggang Wen, and Tianwei Zhang. 2023. Lucid: A non-intrusive, scalable and interpretable scheduler for deep learning training jobs. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 457\u2013472."},{"key":"e_1_3_2_1_14_1","volume-title":"A closer look at invalid action masking in policy gradient algorithms. arXiv preprint arXiv:2006.14171","author":"Huang Shengyi","year":"2020","unstructured":"Shengyi Huang and Santiago Onta\u00f1\u00f3n. 2020. A closer look at invalid action masking in policy gradient algorithms. arXiv preprint arXiv:2006.14171 (2020)."},{"key":"e_1_3_2_1_15_1","volume-title":"PARALLELGPUOS: A concurrent OS-level GPU checkpoint and restore system using validated speculation. arXiv preprint arXiv:2405.12079","author":"Huang Zhuobin","year":"2024","unstructured":"Zhuobin Huang, Xingda Wei, Yingyi Hao, Rong Chen, Mingcong Han, Jinyu Gu, and Haibo Chen. 2024. PARALLELGPUOS: A concurrent OS-level GPU checkpoint and restore system using validated speculation. arXiv preprint arXiv:2405.12079 (2024)."},{"key":"e_1_3_2_1_16_1","volume-title":"2019 USENIX Annual Technical Conference (USENIX ATC 19)","author":"Jeon Myeongjae","year":"2019","unstructured":"Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2019. Analysis of large-scale multi-tenant GPU clusters for DNN training workloads. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). 947\u2013960."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"crossref","first-page":"2102","DOI":"10.1002\/spe.3284","article-title":"DRS: A deep reinforcement learning enhanced Kubernetes scheduler for microservice-based system. Software","volume":"54","author":"Jian Zhaolong","year":"2024","unstructured":"Zhaolong Jian, Xueshuo Xie, Yaozheng Fang, Yibing Jiang, Ye Lu, Ankan Dash, Tao Li, and Guiling Wang. 2024. DRS: A deep reinforcement learning enhanced Kubernetes scheduler for microservice-based system. Software: Practice and Experience 54, 10 (2024), 2102\u20132126.","journal-title":"Practice and Experience"},{"key":"e_1_3_2_1_18_1","volume-title":"L4: Diagnosing large-scale llm training failures via automated log analysis. arXiv preprint arXiv:2503.20263","author":"Jiang Zhihan","year":"2025","unstructured":"Zhihan Jiang, Junjie Huang, Zhuangbin Chen, Yichen Li, Guangba Yu, Cong Feng, Yongqiang Yang, Zengyin Yang, and Michael R Lyu. 2025. L4: Diagnosing large-scale llm training failures via automated log analysis. arXiv preprint arXiv:2503.20263 (2025)."},{"key":"e_1_3_2_1_19_1","volume-title":"21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)","author":"Jiang Ziheng","year":"2024","unstructured":"Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al. 2024. MegaScale: Scaling large language model training to more than 10,000 GPUs. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). 745\u2013760."},{"key":"e_1_3_2_1_20_1","unstructured":"Kubernetes. 2025. Descheduler for kubernetes. https:\/\/github.com\/kubernetes-sigs\/descheduler."},{"key":"e_1_3_2_1_21_1","unstructured":"Kubernetes. 2025. Extending kubernetes. https:\/\/kubernetes.io\/docs\/concepts\/extend-kubernetes."},{"key":"e_1_3_2_1_22_1","unstructured":"Kubernetes. 2025. Production-grade container orchestration. https:\/\/kubernetes.io."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3552326.3587445"},{"key":"e_1_3_2_1_24_1","volume-title":"Decentralized task offloading in edge computing: an offline-to-online reinforcement learning approach","author":"Lin Hongcai","year":"2024","unstructured":"Hongcai Lin, Lei Yang, Hao Guo, and Jiannong Cao. 2024. Decentralized task offloading in edge computing: an offline-to-online reinforcement learning approach. IEEE Trans. Comput. (2024)."},{"key":"e_1_3_2_1_25_1","first-page":"1","article-title":"MuxFlow: efficient GPU sharing in production-level clusters with more than 10000 GPUs","volume":"67","author":"Liu Xuanzhe","year":"2024","unstructured":"Xuanzhe Liu, Yihao Zhao, Shufan Liu, Xiang Li, Yibo Zhu, Xin Liu, and Xin Jin. 2024. MuxFlow: efficient GPU sharing in production-level clusters with more than 10000 GPUs. Science China Information Sciences 67, 12 (2024), 1\u201317.","journal-title":"Science China Information Sciences"},{"key":"e_1_3_2_1_26_1","volume-title":"Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602","author":"Mnih Volodymyr","year":"2013","unstructured":"Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)."},{"key":"e_1_3_2_1_27_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Mohan Jayashree","year":"2022","unstructured":"Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, and Vijay Chidambaram. 2022. Looking beyond GPUs for DNN scheduling on multi-tenant clusters. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 579\u2013596."},{"key":"e_1_3_2_1_28_1","volume-title":"Australasian Joint Conference on Artificial Intelligence. Springer, 144\u2013154","author":"Mougouei Davoud","year":"2017","unstructured":"Davoud Mougouei, David MW Powers, and Asghar Moeini. 2017. An integer linear programming model for binary knapsack problem with dependent item values. In Australasian Joint Conference on Artificial Intelligence. Springer, 144\u2013154."},{"key":"e_1_3_2_1_29_1","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Narayanan Deepak","year":"2020","unstructured":"Deepak Narayanan, Keshav Santhanam, Fiodar Kazhamiaka, Amar Phanishayee, and Matei Zaharia. 2020. Heterogeneity-aware cluster scheduling policies for deep learning workloads. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 481\u2013498."},{"key":"e_1_3_2_1_30_1","unstructured":"Nvidia. 2025. NVIDIA multi-instance GPU: seven independent instances in a single GPU. https:\/\/www.nvidia.com\/en-us\/technologies\/multi-instance-gpu\/."},{"key":"e_1_3_2_1_31_1","volume-title":"Heuristics for vector bin packing. research. microsoft. com","author":"Panigrahy Rina","year":"2011","unstructured":"Rina Panigrahy, Kunal Talwar, Lincoln Uyeda, and Udi Wieder. 2011. Heuristics for vector bin packing. research. microsoft. com (2011)."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3052895"},{"key":"e_1_3_2_1_33_1","volume-title":"Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347","author":"Schulman John","year":"2017","unstructured":"John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)."},{"key":"e_1_3_2_1_34_1","first-page":"804","article-title":"vCUDA: GPU-accelerated high-performance computing in virtual machines","volume":"61","author":"Shi Lin","year":"2011","unstructured":"Lin Shi, Hao Chen, Jianhua Sun, and Kenli Li. 2011. vCUDA: GPU-accelerated high-performance computing in virtual machines. IEEE Transactions on computers 61, 6 (2011), 804\u2013816.","journal-title":"IEEE Transactions on computers"},{"key":"e_1_3_2_1_35_1","unstructured":"Richard S Sutton Andrew G Barto et al. 1998. Reinforcement learning: An introduction. Vol. 1. MIT press Cambridge."},{"key":"e_1_3_2_1_36_1","volume-title":"Behavioral cloning from observation. arXiv preprint arXiv:1805.01954","author":"Torabi Faraz","year":"2018","unstructured":"Faraz Torabi, Garrett Warnell, and Peter Stone. 2018. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954 (2018)."},{"key":"e_1_3_2_1_37_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_2_1_38_1","volume-title":"ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25)","author":"Wan Borui","year":"2025","unstructured":"Borui Wan, Mingji Han, Yiyao Sheng, Yanghua Peng, Haibin Lin, Mofan Zhang, Zhichao Lai, Menghan Yu, Junda Zhang, Zuquan Song, et al. 2025. ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25). 559\u2013578."},{"key":"e_1_3_2_1_39_1","volume-title":"Proceedings of the 2024 ACM Symposium on Cloud Computing. 1\u201317","author":"Wang Qinghe","year":"2024","unstructured":"Qinghe Wang, Futian Wang, and Xinwei Zheng. 2024. Hops: Fine-grained heterogeneous sensing, efficient and fair Deep Learning cluster scheduling system. In Proceedings of the 2024 ACM Symposium on Cloud Computing. 1\u201317."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3691620.3695477"},{"key":"e_1_3_2_1_41_1","volume-title":"19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)","author":"Weng Qizhen","year":"2022","unstructured":"Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. 2022. MLaaS in the wild: Workload analysis and scheduling in large-scale heterogeneous GPU clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). 945\u2013960."},{"key":"e_1_3_2_1_42_1","volume-title":"2023 USENIX Annual Technical Conference (USENIX ATC 23)","author":"Weng Qizhen","year":"2023","unstructured":"Qizhen Weng, Lingyun Yang, Yinghao Yu, Wei Wang, Xiaochuan Tang, Guodong Yang, and Liping Zhang. 2023. Beware of fragmentation: Scheduling GPU-sharing workloads with fragmentation gradient descent. In 2023 USENIX Annual Technical Conference (USENIX ATC 23). 995\u20131008."},{"key":"e_1_3_2_1_43_1","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Xiao Wencong","year":"2018","unstructured":"Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, et al. 2018. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 595\u2013610."},{"key":"e_1_3_2_1_44_1","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Xiao Wencong","year":"2020","unstructured":"Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. 2020. AntMan: Dynamic scaling on GPU clusters for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 533\u2013548."},{"key":"e_1_3_2_1_45_1","volume-title":"Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2776\u20132788","author":"Xing Mingzhe","year":"2023","unstructured":"Mingzhe Xing, Hangyu Mao, Shenglin Yin, Lichen Pan, Zhengchao Zhang, Zhen Xiao, and Jieyi Long. 2023. A dual-agent scheduler for distributed deep learning jobs on public cloud via reinforcement learning. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2776\u20132788."},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3136245"},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3485447.3511979"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"crossref","unstructured":"Shulai Zhang Quan Chen Weihao Cui Han Zhao Chunyu Xue Zhen Zheng Wei Lin and Minyi Guo. 2025. Improving GPU sharing performance through adaptive bubbleless spatial-temporal sharing. (2025).","DOI":"10.1145\/3689031.3696070"},{"key":"e_1_3_2_1_49_1","volume-title":"14th USENIX symposium on operating systems design and implementation (OSDI 20)","author":"Zhao Hanyu","year":"2020","unstructured":"Hanyu Zhao, Zhenhua Han, Zhi Yang, Quanlu Zhang, Fan Yang, Lidong Zhou, Mao Yang, Francis CM Lau, Yuqi Wang, Yifan Xiong, et al. 2020. HiveD: Sharing a GPU cluster for deep learning with guarantees. In 14th USENIX symposium on operating systems design and implementation (OSDI 20). 515\u2013532."},{"key":"e_1_3_2_1_50_1","volume-title":"20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)","author":"Zheng Pengfei","year":"2023","unstructured":"Pengfei Zheng, Rui Pan, Tarannum Khan, Shivaram Venkataraman, and Aditya Akella. 2023. Shockwave: Fair and efficient cluster scheduling for dynamic adaptation in machine learning. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 703\u2013723."}],"event":{"name":"SoCC '25: ACM Symposium on Cloud Computing","location":"Online USA","acronym":"SoCC '25","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems","SIGMOD ACM Special Interest Group on Management of Data"]},"container-title":["Proceedings of the 2025 ACM Symposium on Cloud Computing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3772052.3772242","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T16:26:08Z","timestamp":1768321568000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3772052.3772242"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,19]]},"references-count":50,"alternative-id":["10.1145\/3772052.3772242","10.1145\/3772052"],"URL":"https:\/\/doi.org\/10.1145\/3772052.3772242","relation":{},"subject":[],"published":{"date-parts":[[2025,11,19]]},"assertion":[{"value":"2026-01-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}