{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T23:07:18Z","timestamp":1768345638144,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":68,"publisher":"ACM","funder":[{"DOI":"10.13039\/100000001","name":"NSF (National Science Foundation)","doi-asserted-by":"publisher","award":["CSR-2106634, CSR-2312785"],"award-info":[{"award-number":["CSR-2106634, CSR-2312785"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["Grant No. 62202382"],"award-info":[{"award-number":["Grant No. 62202382"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,11,19]]},"DOI":"10.1145\/3772052.3772257","type":"proceedings-article","created":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T16:19:00Z","timestamp":1768321140000},"page":"557-570","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-6148-9953","authenticated-orcid":false,"given":"Shruti","family":"Dongare","sequence":"first","affiliation":[{"name":"Virginia Tech, Blacksburg, Virginia, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3228-6384","authenticated-orcid":false,"given":"Redwan Ibne Seraj","family":"Khan","sequence":"additional","affiliation":[{"name":"Virginia Tech, Seattle, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4732-2707","authenticated-orcid":false,"given":"Hadeel","family":"Albahar","sequence":"additional","affiliation":[{"name":"Kuwait University, Kuwait City, Kuwait"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6059-1154","authenticated-orcid":false,"given":"Nannan","family":"Zhao","sequence":"additional","affiliation":[{"name":"Northwestern Polytechnical University, China, Xi'an, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-7385-9879","authenticated-orcid":false,"given":"Diego","family":"Mel\u00e9ndez-Maita","sequence":"additional","affiliation":[{"name":"Virginia Tech, Blacksburg, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0871-7263","authenticated-orcid":false,"given":"Ali R.","family":"Butt","sequence":"additional","affiliation":[{"name":"Virginia Tech, Blacksburg, Virginia, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,1,13]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2024. 2024 Data Center Trends: A Glimpse into the Future of Rack Density AI and Workload Migration. https:\/\/tinyurl.com\/datacentertrend."},{"key":"e_1_3_2_1_2_1","unstructured":"2024. Announcing A3 supercomputers with NVIDIA H100 GPUs purpose-built for AI. https:\/\/cloud.google.com\/blog\/products\/compute\/introducing-a3-supercomputers-with-nvidia-h100-gpus."},{"key":"e_1_3_2_1_3_1","unstructured":"2024. Introducing the AI Research SuperCluster \u2014 Meta's cutting-edge AI supercomputer for AI research. https:\/\/ai.meta.com\/blog\/ai- rsc\/."},{"key":"e_1_3_2_1_4_1","unstructured":"2024. Microsoft announces new supercomputer lays out vision for future AI work. 
https:\/\/news.microsoft.com\/source\/features\/ai\/openai-azure-supercomputer\/."},{"key":"e_1_3_2_1_5_1","volume-title":"Slurm Workload Manager. https:\/\/slurm.schedmd.com. Accessed","year":"2024","unstructured":"2024. Slurm Workload Manager. https:\/\/slurm.schedmd.com. Accessed: 2024."},{"key":"e_1_3_2_1_6_1","unstructured":"2024. Tesla's Dojo Supercomputer: A Paradigm Shift In Supercomputing? https:\/\/www.forbes.com\/sites\/stevendickens\/2023\/09\/11\/teslas-dojo-supercomputer-a-paradigm-shift-in- supercomputing\/."},{"key":"e_1_3_2_1_7_1","volume-title":"TOP500 Supercomputer Sites. https:\/\/www.top500.org. Accessed","year":"2024","unstructured":"2024. TOP500 Supercomputer Sites. https:\/\/www.top500.org. Accessed: 2024."},{"key":"e_1_3_2_1_8_1","unstructured":"2025. RLTune GitHub Link. https:\/\/github.com\/dshruti20\/RLTune."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"crossref","first-page":"96","DOI":"10.1186\/s40537-023-00765-w","article-title":"Large scale performance analysis of distributed deep learning frameworks for convolutional neural networks","volume":"10","author":"Aach Marcel","year":"2023","unstructured":"Marcel Aach, Eray Inanc, Rakesh Sarma, Morris Riedel, and Andreas Lintermann. 2023. Large scale performance analysis of distributed deep learning frameworks for convolutional neural networks. Journal of Big Data 10, 1 (2023), 96.","journal-title":"Journal of Big Data"},{"key":"e_1_3_2_1_10_1","volume-title":"12th USENIX symposium on operating systems design and implementation (OSDI 16)","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. {TensorFlow}: a system for {Large-Scale} machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16). 265\u2013283."},{"key":"e_1_3_2_1_11_1","unstructured":"Joshua Achiam. 2018. 
Spinning up in deep reinforcement learning."},{"key":"e_1_3_2_1_12_1","volume-title":"2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, 695\u2013705","author":"Albahar Hadeel","year":"2022","unstructured":"Hadeel Albahar, Shruti Dongare, Yanlin Du, Nannan Zhao, Arnab K Paul, and Ali R Butt. 2022. Schedtune: A heterogeneity-aware gpu scheduler for deep learning. In 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, 695\u2013705."},{"key":"e_1_3_2_1_13_1","volume-title":"2018 USENIX Annual Technical Conference (USENIX ATC 18)","author":"Amvrosiadis George","year":"2018","unstructured":"George Amvrosiadis, Jun Woo Park, Gregory R Ganger, Garth A Gibson, Elisabeth Baseman, and Nathan DeBardeleben. 2018. On the diversity of cluster workloads and its impact on research results. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). 533\u2013546."},{"key":"e_1_3_2_1_14_1","volume-title":"International Conference on Smart Cities, Infrastructure, Technologies and Applications. Springer, 139\u2013154","author":"Aqib Muhammad","year":"2017","unstructured":"Muhammad Aqib, Rashid Mehmood, Aiiad Albeshri, and Ahmed Alzahrani. 2017. Disaster management in smart cities by forecasting traffic plan using deep learning and GPUs. In International Conference on Smart Cities, Infrastructure, Technologies and Applications. Springer, 139\u2013154."},{"key":"e_1_3_2_1_15_1","volume-title":"MARS: Malleable actor-critic reinforcement learning scheduler. In 2022 IEEE International Performance, Computing, and Communications Conference (IPCCC)","author":"Baheri Betis","year":"2022","unstructured":"Betis Baheri, Jacob Tronge, Bo Fang, Ang Li, Vipin Chaudhary, and Qiang Guan. 2022. MARS: Malleable actor-critic reinforcement learning scheduler. In 2022 IEEE International Performance, Computing, and Communications Conference (IPCCC). IEEE, 217\u2013226."},{"key":"e_1_3_2_1_16_1","unstructured":"S. 
Banerjee et al. 2021. Handling the Challenge of Heterogeneous Clusters for Deep Learning: A Case Study of AlphaFold. J. Parallel and Distrib. Comput. (2021)."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126955"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3342195.3387555"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3342195.3387555"},{"key":"e_1_3_2_1_20_1","volume-title":"John Tomaszewski, Fabio A Gonz\u00e1lez, and Anant Madabhushi.","author":"Cruz-Roa Angel","year":"2017","unstructured":"Angel Cruz-Roa, Hannah Gilmore, Ajay Basavanhally, Michael Feldman, Shridar Ganesan, Natalie NC Shih, John Tomaszewski, Fabio A Gonz\u00e1lez, and Anant Madabhushi. 2017. Accurate and reproducible invasive breast cancer detection in whole-slide images: A Deep Learning approach for quantifying tumor extent. Scientific reports 7 (2017), 46450."},{"key":"e_1_3_2_1_21_1","first-page":"1","article-title":"CVXPY: A Python-embedded modeling language for convex optimization","volume":"17","author":"Diamond Steven","year":"2016","unstructured":"Steven Diamond and Stephen Boyd. 2016. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research 17, 83 (2016), 1\u20135.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_1_22_1","volume-title":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 807\u2013816","author":"Fan Yuping","year":"2021","unstructured":"Yuping Fan, Zhiling Lan, Taylor Childers, Paul Rich, William Allcock, and Michael E Papka. 2021. Deep reinforcement agent for scheduling in HPC. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 
IEEE, 807\u2013816."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/MLHPC54614.2021.00009"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.5555\/646377.689375"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/646379.689529"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2014.06.013"},{"key":"e_1_3_2_1_27_1","volume-title":"Deep learning workload scheduling in gpu datacenters: Taxonomy, challenges and vision. arXiv preprint arXiv:2205.11913","author":"Gao Wei","year":"2022","unstructured":"Wei Gao, Qinghao Hu, Zhisheng Ye, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, and Yonggang Wen. 2022. Deep learning workload scheduling in gpu datacenters: Taxonomy, challenges and vision. arXiv preprint arXiv:2205.11913 (2022)."},{"key":"e_1_3_2_1_28_1","volume-title":"16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19)","author":"Gu Juncheng","year":"2019","unstructured":"Juncheng Gu, Mosharaf Chowdhury, Kang G Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Liu, and Chuanxiong Guo. 2019. Tiresias: A {GPU} cluster manager for distributed deep learning. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). 485\u2013500."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"crossref","first-page":"44","DOI":"10.1109\/MCOM.2018.1700577","article-title":"Environment classification for urban big data using deep learning","volume":"56","author":"Shamim Hossain M","year":"2018","unstructured":"M Shamim Hossain and Ghulam Muhammad. 2018. Environment classification for urban big data using deep learning. 
IEEE Communications Magazine 56, 11 (2018), 44\u201350.","journal-title":"IEEE Communications Magazine"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476223"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/3691825.3691864"},{"key":"e_1_3_2_1_32_1","volume-title":"Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"2","author":"Hu Qinghao","year":"2023","unstructured":"Qinghao Hu, Meng Zhang, Peng Sun, Yonggang Wen, and Tianwei Zhang. 2023. Lucid: A non-intrusive, scalable and interpretable scheduler for deep learning training jobs. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 457\u2013472."},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613175"},{"key":"e_1_3_2_1_34_1","volume-title":"2019 USENIX Annual Technical Conference (USENIX ATC 19)","author":"Jeon Myeongjae","year":"2019","unstructured":"Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2019. Analysis of {Large-Scale} {Multi-Tenant} {GPU} clusters for {DNN} training workloads. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). 947\u2013960."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btx531"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.5555\/1622737.1622748"},{"key":"e_1_3_2_1_37_1","unstructured":"Kate Keahey Jason Anderson Zhuo Zhen Pierre Riteau Paul Ruth and Dan Stanzione et al. 2020. Lessons learned from the chameleon testbed. In 2020 USENIX annual technical conference (USENIX ATC 20). 
219\u2013233."},{"key":"e_1_3_2_1_38_1","volume-title":"Rujia Wang, et al.","author":"Seraj Khan Redwan Ibne","year":"2024","unstructured":"Redwan Ibne Seraj Khan, Kunal Jain, Haiying Shen, Ankur Mallick, Anjaly Parayil, Anoop Kulkarni, Steve Kofsky, Pankhuri Choudhary, Renee St Amant, Rujia Wang, et al. 2024. Ensuring Fair LLM Serving Amid Diverse Applications. arXiv preprint arXiv:2411.15997 (2024)."},{"key":"e_1_3_2_1_39_1","volume-title":"Proceedings of the 2024 ACM Symposium on Cloud Computing. 52\u201368","author":"Seraj Khan Redwan Ibne","year":"2024","unstructured":"Redwan Ibne Seraj Khan, Arnab K Paul, Yue Cheng, Xun Steve Jian, and Ali R Butt. 2024. FedCaSe: Enhancing Federated Learning with Heterogeneity-aware Caching and Scheduling. In Proceedings of the 2024 ACM Symposium on Cloud Computing. 52\u201368."},{"key":"e_1_3_2_1_40_1","volume-title":"SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training. In 21st USENIX Conference on File and Storage Technologies (FAST 23)","author":"Seraj Khan Redwan Ibne","unstructured":"Redwan Ibne Seraj Khan, Ahmad Hossein Yazdani, Yuqi Fu, Arnab K. Paul, Bo Ji, Xun Jian, Yue Cheng, and Ali R. Butt. 2023. SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training. In 21st USENIX Conference on File and Storage Technologies (FAST 23). USENIX Association, Santa Clara, CA, 135\u2013152. https:\/\/www.usenix.org\/conference\/fast23\/presentation\/khan"},{"key":"e_1_3_2_1_41_1","volume-title":"Proceedings of the Fifteenth European Conference on Computer Systems. 1\u201316","author":"Le Tan N","year":"2020","unstructured":"Tan N Le, Xiao Sun, Mosharaf Chowdhury, and Zhenhua Liu. 2020. Allox: compute allocation in hybrid clusters. In Proceedings of the Fifteenth European Conference on Computer Systems. 
1\u201316."},{"key":"e_1_3_2_1_42_1","volume-title":"Learning IoT in edge: Deep learning for the Internet of Things with edge computing","author":"Li He","year":"2018","unstructured":"He Li, Kaoru Ota, and Mianxiong Dong. 2018. Learning IoT in edge: Deep learning for the Internet of Things with edge computing. IEEE network 32, 1 (2018), 96\u2013101."},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3552326.3587445"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3581784.3607054"},{"key":"e_1_3_2_1_45_1","first-page":"169","article-title":"Batch jobs load balancing scheduling in cloud computing using distributional reinforcement learning","volume":"35","author":"Li Tiangang","year":"2023","unstructured":"Tiangang Li, Shi Ying, Yishi Zhao, and Jianga Shang. 2023. Batch jobs load balancing scheduling in cloud computing using distributional reinforcement learning. IEEE Transactions on Parallel and Distributed Systems 35, 1 (2023), 169\u2013185.","journal-title":"IEEE Transactions on Parallel and Distributed Systems"},{"key":"e_1_3_2_1_46_1","volume-title":"17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20)","author":"Mahajan Kshiteej","year":"2020","unstructured":"Kshiteej Mahajan, Arjun Balasubramanian, Arjun Singhvi, Shivaram Venkataraman, Aditya Akella, Amar Phanishayee, and Shuchi Chawla. 2020. Themis: Fair and efficient {GPU} cluster scheduling. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20). 289\u2013304."},{"key":"e_1_3_2_1_47_1","volume-title":"http:\/\/www.gnu.org\/s\/glpk\/glpk.html","author":"Makhorin Andrew","year":"2008","unstructured":"Andrew Makhorin. 2008. GLPK (GNU linear programming kit). http:\/\/www.gnu.org\/s\/glpk\/glpk.html (2008)."},{"key":"e_1_3_2_1_48_1","unstructured":"A. Mathuriya A. Bard P. Mendygral L. Meadows J. Arnemann et al. 2021. Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster. 
Cluster Computing (2021)."},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.5555\/3488766.3488793"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00088"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3190508.3190517"},{"key":"e_1_3_2_1_52_1","volume-title":"Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R Ganger, and Eric P Xing.","author":"Qiao Aurick","year":"2021","unstructured":"Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R Ganger, and Eric P Xing. 2021. Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning. In 15th { USENIX} Symposium on Operating Systems Design and Implementation ({ OSDI} 21)."},{"key":"e_1_3_2_1_53_1","volume-title":"21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)","author":"Rajasekaran Sudarsanan","year":"2024","unstructured":"Sudarsanan Rajasekaran, Manya Ghobadi, and Aditya Akella. 2024. { CASSINI} : { Network-Aware} job scheduling in machine learning clusters. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). 1403\u20131420."},{"key":"e_1_3_2_1_54_1","volume-title":"2017 American Control Conference (ACC). IEEE, 4914\u20134919","author":"Rausch Viktor","year":"2017","unstructured":"Viktor Rausch, Andreas Hansen, Eugen Solowjow, Chang Liu, Edwin Kreuzer, and J Karl Hedrick. 2017. Learning a deep neural net policy for end-to-end control of autonomous vehicles. In 2017 American Control Conference (ACC). IEEE, 4914\u20134919."},{"key":"e_1_3_2_1_55_1","volume-title":"Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347","author":"Schulman John","year":"2017","unstructured":"John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. 
arXiv preprint arXiv:1707.06347 (2017)."},{"key":"e_1_3_2_1_56_1","volume-title":"2009 IEEE International Conference on Cluster Computing and Workshops. IEEE, 1\u201310","author":"Tang Wei","year":"2009","unstructured":"Wei Tang, Zhiling Lan, Narayan Desai, and Daniel Buettner. 2009. Fault-aware, utility-based job scheduling on Blue Gene\/P systems. In 2009 IEEE International Conference on Cluster Computing and Workshops. IEEE, 1\u201310."},{"key":"e_1_3_2_1_57_1","volume-title":"Multilayer perceptron (MLP). Geomatic approaches for modeling land change scenarios","author":"Taud Hind","year":"2018","unstructured":"Hind Taud and Jean-Fran\u00e7ois Mas. 2018. Multilayer perceptron (MLP). Geomatic approaches for modeling land change scenarios (2018), 451\u2013455."},{"key":"e_1_3_2_1_58_1","volume-title":"Optimizing Distributed Training Deployment in Heterogeneous GPU Clusters. In The 34th ACM International Conference on Supercomputing (ICS).","author":"Wang F.","unstructured":"F. Wang, G. Yang, H. Xu, X. Hu, and Y. Zhou. 2020. Optimizing Distributed Training Deployment in Heterogeneous GPU Clusters. In The 34th ACM International Conference on Supercomputing (ICS)."},{"key":"e_1_3_2_1_59_1","volume-title":"MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)","author":"Weng Qizhen","year":"2022","unstructured":"Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. 2022. MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). USENIX Association, Renton, WA, 945\u2013960. https:\/\/www.usenix.org\/conference\/nsdi22\/presentation\/weng"},{"key":"e_1_3_2_1_60_1","volume-title":"Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent. 
In 2023 USENIX Annual Technical Conference (USENIX ATC 23)","author":"Weng Qizhen","year":"2023","unstructured":"Qizhen Weng, Lingyun Yang, Yinghao Yu, Wei Wang, Xiaochuan Tang, Guodong Yang, and Liping Zhang. 2023. Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent. In 2023 USENIX Annual Technical Conference (USENIX ATC 23). USENIX Association, Boston, MA, 995\u20131008. https:\/\/www.usenix.org\/conference\/atc23\/presentation\/weng"},{"key":"e_1_3_2_1_61_1","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Xiao Wencong","year":"2018","unstructured":"Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, et al. 2018. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 595\u2013610."},{"key":"e_1_3_2_1_62_1","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Xiao Wencong","year":"2020","unstructured":"Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. 2020. {AntMan}: Dynamic scaling on {GPU} clusters for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 533\u2013548."},{"key":"e_1_3_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00035"},{"key":"e_1_3_2_1_64_1","volume-title":"Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing. 97\u2013109","author":"Zhang Di","year":"2022","unstructured":"Di Zhang, Dong Dai, and Bing Xie. 2022. Schedinspector: A batch job scheduling inspector using reinforcement learning. In Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing. 
97\u2013109."},{"key":"e_1_3_2_1_65_1","unstructured":"Z. Zhang Y. Wang et al. 2020. Optimal Resource Efficiency with Fairness in Heterogeneous GPU Clusters. arXiv preprint arXiv:2403.18545 (2020)."},{"key":"e_1_3_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/3552326.3567499"},{"key":"e_1_3_2_1_67_1","volume-title":"14th USENIX symposium on operating systems design and implementation (OSDI 20)","author":"Zhao Hanyu","year":"2020","unstructured":"Hanyu Zhao, Zhenhua Han, Zhi Yang, Quanlu Zhang, Fan Yang, Lidong Zhou, Mao Yang, Francis CM Lau, Yuqi Wang, Yifan Xiong, et al. 2020. { HiveD} : Sharing a { GPU} cluster for deep learning with guarantees. In 14th USENIX symposium on operating systems design and implementation (OSDI 20). 515\u2013532."},{"key":"e_1_3_2_1_68_1","volume-title":"20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)","author":"Zheng Pengfei","year":"2023","unstructured":"Pengfei Zheng, Rui Pan, Tarannum Khan, Shivaram Venkataraman, and Aditya Akella. 2023. Shockwave: Fair and efficient cluster scheduling for dynamic adaptation in machine learning. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 
703\u2013723."}],"event":{"name":"SoCC '25: ACM Symposium on Cloud Computing","location":"Online USA","acronym":"SoCC '25","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems","SIGMOD ACM Special Interest Group on Management of Data"]},"container-title":["Proceedings of the 2025 ACM Symposium on Cloud Computing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3772052.3772257","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T16:23:49Z","timestamp":1768321429000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3772052.3772257"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,19]]},"references-count":68,"alternative-id":["10.1145\/3772052.3772257","10.1145\/3772052"],"URL":"https:\/\/doi.org\/10.1145\/3772052.3772257","relation":{},"subject":[],"published":{"date-parts":[[2025,11,19]]},"assertion":[{"value":"2026-01-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}