{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,15]],"date-time":"2026-05-15T01:23:51Z","timestamp":1778808231177,"version":"3.51.4"},"reference-count":49,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2025,11,10]],"date-time":"2025-11-10T00:00:00Z","timestamp":1762732800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>On GPU-based clusters, the training workloads of machine learning (ML) models, particularly neural networks (NNs), are often structured as Directed Acyclic Graphs (DAGs) and typically deployed for parallel execution across heterogeneous GPU resources. Efficient scheduling of these workloads is crucial for optimizing performance metrics such as execution time, under various constraints including GPU heterogeneity, network capacity, and data dependencies. DAG-structured ML workload scheduling could be modeled as a Nonlinear Integer Program (NIP) problem, and is shown to be NP-complete. By leveraging a positive correlation between Scheduling Plan Distance (SPD) and Finish Time Gap (FTG) identified through an empirical study, we propose to develop a Running Time Gap Strategy for scheduling based on Whale Optimization Algorithm (WOA) and Reinforcement Learning, referred to as WORL-RTGS. The proposed method integrates the global search capabilities of WOA with the adaptive decision-making of Double Deep Q-Networks (DDQN). Particularly, we derive a novel function to generate effective scheduling plans using DDQN, enhancing adaptability to complex DAG structures. 
Comprehensive evaluations on practical ML workload traces collected from Alibaba on simulated GPU-enabled platforms demonstrate that WORL-RTGS significantly improves WOA\u2019s stability for DAG-structured ML workload scheduling and reduces completion time by up to 66.56% compared with five state-of-the-art scheduling algorithms.<\/jats:p>","DOI":"10.3390\/bdcc9110284","type":"journal-article","created":{"date-parts":[[2025,11,10]],"date-time":"2025-11-10T13:51:08Z","timestamp":1762782668000},"page":"284","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Efficient Scheduling for GPU-Based Neural Network Training via Hybrid Reinforcement Learning and Metaheuristic Optimization"],"prefix":"10.3390","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-0689-4279","authenticated-orcid":false,"given":"Nana","family":"Du","sequence":"first","affiliation":[{"name":"School of Computer, Northwest University, Xi\u2019an 710100, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8218-1209","authenticated-orcid":false,"given":"Chase","family":"Wu","sequence":"additional","affiliation":[{"name":"Department of Data Science, New Jersey Institute of Technology, Newark, NJ 07102, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0802-7991","authenticated-orcid":false,"given":"Aiqin","family":"Hou","sequence":"additional","affiliation":[{"name":"School of Computer, Northwest University, Xi\u2019an 710100, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2092-3083","authenticated-orcid":false,"given":"Weike","family":"Nie","sequence":"additional","affiliation":[{"name":"School of Computer, Northwest University, Xi\u2019an 710100, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8617-5474","authenticated-orcid":false,"given":"Ruiqi","family":"Song","sequence":"additional","affiliation":[{"name":"School of Computer, Northwest University, Xi\u2019an 710100, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,11,10]]},"reference":[{"key":"ref_1","unstructured":"Garefalakis, P., Karanasos, K., Pietzuch, P., Suresh, A., and Rao, S. (2018, January 23\u201326). Medea: Scheduling of long running applications in shared production clusters. Proceedings of the Thirteenth EuroSys Conference, Porto, Portugal."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Sun, P., Guo, Z., Wang, J., Li, J., Lan, J., and Hu, Y. (2021, January 11\u201317). Deepweave: Accelerating job completion time with deep reinforcement learning-based coflow scheduling. Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence (IJCAI 29), Yokohama, Japan.","DOI":"10.24963\/ijcai.2020\/458"},{"key":"ref_3","unstructured":"Google, C.N.C.F. (2025, November 06). What Is Kubernetes. Website, 2017. Available online: https:\/\/kubernetes.io\/docs\/concepts\/overview\/what-is-kubernetes\/."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., and Seth, S. (2013, January 1\u20133). Apache hadoop yarn: Yet another resource negotiator. Proceedings of the 4th Annual Symposium on Cloud Computing, Santa Clara, CA, USA.","DOI":"10.1145\/2523616.2523633"},{"key":"ref_5","unstructured":"Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A.D., Katz, R., Shenker, S., and Stoica, I. (April, January 30). Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. 
Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI 11), Boston, MA, USA."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Mao, H., Schwarzkopf, M., Venkatakrishnan, S.B., Meng, Z., and Alizadeh, M. (2019, January 19\u201323). Learning scheduling algorithms for data processing clusters. Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM 2019), Beijing, China.","DOI":"10.1145\/3341302.3342080"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"51","DOI":"10.1016\/j.advengsoft.2016.01.008","article-title":"The whale optimization algorithm","volume":"95","author":"Mirjalili","year":"2016","journal-title":"Adv. Eng. Softw."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"911","DOI":"10.1007\/s11227-025-07415-3","article-title":"Scheduling DAG-structured workloads based on whale optimization algorithm","volume":"81","author":"Du","year":"2025","journal-title":"J. Supercomput."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1007\/s44443-025-00092-5","article-title":"A dual scheduling framework for task and resource allocation in clouds using deep reinforcement learning","volume":"37","author":"Pan","year":"2025","journal-title":"J. King Saud Univ. Comput. Inf. Sci."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Gao, Y., Yi, H., Chen, H., Fang, X., and Zhao, S. (2024, January 23\u201325). A structure-aware DAG scheduling and allocation on heterogeneous multicore systems. Proceedings of the 2024 IEEE 14th International Symposium on Industrial Embedded Systems (SIES), Chengdu, China.","DOI":"10.1109\/SIES62473.2024.10767927"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Gao, W., Ye, Z., Sun, P., Wen, Y., and Zhang, T. (2021, January 1\u20134). Chronus: A novel deadline-aware scheduler for deep learning training jobs. 
Proceedings of the ACM Symposium on Cloud Computing, Seattle, WA, USA.","DOI":"10.1145\/3472883.3486978"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Hu, Q., Sun, P., Yan, S., Wen, Y., and Zhang, T. (2021, January 14\u201319). Characterization and prediction of deep learning workloads in large-scale gpu datacenters. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, MO, USA.","DOI":"10.1145\/3458817.3476223"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"18579","DOI":"10.1007\/s00521-022-07477-x","article-title":"Cost-aware real-time job scheduling for hybrid cloud using deep reinforcement learning","volume":"34","author":"Cheng","year":"2022","journal-title":"Neural Comput. Appl."},{"key":"ref_14","first-page":"1","article-title":"Dag-order: An order-based dynamic dag scheduling for real-time networks-on-chip","volume":"21","author":"Chen","year":"2023","journal-title":"ACM Trans. Archit. Code Optim."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"4019","DOI":"10.1109\/TPDS.2022.3177046","article-title":"DAG scheduling and analysis on multi-core systems by modelling parallelism and dependency","volume":"33","author":"Zhao","year":"2022","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"ref_16","unstructured":"Fan, S., Rong, Y., Meng, C., Cao, Z., Wang, S., Zheng, Z., Wu, C., Long, G., Yang, J., and Xia, L. (March, January 27). DAPPLE: A pipelined data parallel approach for training large models. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Virtually."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Liang, S., Yang, Z., Jin, F., and Chen, Y. (2020, January 11\u201314). Data centers job scheduling with deep reinforcement learning. 
Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore.","DOI":"10.1007\/978-3-030-47436-2_68"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1186\/s13677-021-00276-0","article-title":"Deep reinforcement learning-based workload scheduling for edge computing","volume":"11","author":"Zheng","year":"2022","journal-title":"J. Cloud Comput."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"1947","DOI":"10.1109\/TPDS.2021.3052895","article-title":"DL2: A deep learning-driven scheduler for deep learning clusters","volume":"32","author":"Peng","year":"2021","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"ref_20","first-page":"1","article-title":"Deep learning workload scheduling in gpu datacenters: A survey","volume":"56","author":"Ye","year":"2024","journal-title":"ACM Comput. Surv."},{"key":"ref_21","unstructured":"Li, Z., Zhuang, S., Guo, S., Zhuo, D., Zhang, H., Song, D., and Stoica, I. (2021, January 18\u201324). Terapipe: Token-level pipeline parallelism for training large-scale language models. Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"126661","DOI":"10.1016\/j.neucom.2023.126661","article-title":"PipePar: Enabling fast DNN pipeline parallel training in heterogeneous GPU clusters","volume":"555","author":"Zhang","year":"2023","journal-title":"Neurocomputing"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Mondal, S.S., Sheoran, N., and Mitra, S. (2021, January 19\u201321). Scheduling of time-varying workloads using reinforcement learning. 
Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event.","DOI":"10.1609\/aaai.v35i10.17088"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1695","DOI":"10.1109\/TPDS.2021.3124670","article-title":"Performance and cost-efficient spark job scheduling based on deep reinforcement learning in cloud computing environments","volume":"33","author":"Islam","year":"2021","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Hegde, S.N., Srinivas, D., Rajan, M., Rani, S., Kataria, A., and Min, H. (2024). Multi-objective and multi constrained task scheduling framework for computational grids. Sci. Rep., 14.","DOI":"10.1038\/s41598-024-56957-8"},{"key":"ref_26","first-page":"640","article-title":"Multi-Objective Reinforcement Learning Based Algorithm for Dynamic Workflow Scheduling in Cloud Computing","volume":"12","author":"Sudhakar","year":"2024","journal-title":"Indones. J. Electr. Eng. Inform. (IJEEI)"},{"key":"ref_27","first-page":"2808","article-title":"Liquid: Intelligent resource estimation and network-efficient scheduling for deep learning jobs on distributed GPU clusters","volume":"33","author":"Gu","year":"2021","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"88","DOI":"10.1109\/TPDS.2021.3079202","article-title":"Horus: Interference-aware and prediction-based scheduling in deep learning systems","volume":"33","author":"Yeung","year":"2021","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"4903","DOI":"10.1109\/TPDS.2022.3205325","article-title":"Dras: Deep reinforcement learning for cluster scheduling in high performance computing","volume":"33","author":"Fan","year":"2022","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Saroliya, U., Arima, E., Liu, D., and Schulz, M. 
(November, January 31). Hierarchical resource partitioning on modern gpus: A reinforcement learning approach. Proceedings of the 2023 IEEE International Conference on Cluster Computing (CLUSTER), Santa Fe, NM, USA.","DOI":"10.1109\/CLUSTER52292.2023.00023"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"4962","DOI":"10.1109\/TNSM.2021.3139607","article-title":"Large-scale machine learning cluster scheduling via multi-agent graph reinforcement learning","volume":"19","author":"Zhao","year":"2021","journal-title":"IEEE Trans. Netw. Serv. Manag."},{"key":"ref_32","first-page":"4375","article-title":"Multi-Agent Deep Reinforcement Learning-Based Resource Allocation in HPC\/AI Converged Cluster","volume":"72","author":"Narantuya","year":"2022","journal-title":"Comput. Mater. Contin."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Ryu, J., and Eo, J. (2023, January 17\u201321). Network contention-aware cluster scheduling with reinforcement learning. Proceedings of the 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), Danzhou, China.","DOI":"10.1109\/ICPADS60453.2023.00367"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Zhang, L., Shi, S., Chu, X., Wang, W., Li, B., and Liu, C. (2023, January 18\u201321). Dear: Accelerating distributed deep learning with fine-grained all-reduce pipelining. Proceedings of the 2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS), Hong Kong, China.","DOI":"10.1109\/ICDCS57875.2023.00054"},{"key":"ref_35","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","author":"Brown","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_36","unstructured":"Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., and Anadkat, S. (2023). Gpt-4 technical report. 
arXiv."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1145\/103162.103163","article-title":"What every computer scientist should know about floating-point arithmetic","volume":"23","author":"Goldberg","year":"1991","journal-title":"ACM Comput. Surv. (CSUR)"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"29","DOI":"10.1109\/MM.2021.3061394","article-title":"Nvidia a100 tensor core gpu: Performance and innovation","volume":"41","author":"Choquette","year":"2021","journal-title":"IEEE Micro"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Karp, R.M. (1972). Reducibility among combinatorial problems. Complexity of Computer Computations, Springer.","DOI":"10.1007\/978-1-4684-2001-2_9"},{"key":"ref_40","unstructured":"Cormen, T.H., Leiserson, C.E., Rivest, R.L., and Stein, C. (2001). Topological sort. Introduction to Algorithms, MIT Press. [2nd ed.]. Chapter 22.4."},{"key":"ref_41","unstructured":"Cormen, T.H., Leiserson, C.E., Rivest, R.L., and Stein, C. (2001). Single-source shortest paths in directed acyclic graphs. Introduction to Algorithms, MIT Press. [2nd ed.]. Chapter 24.2."},{"key":"ref_42","first-page":"246","article-title":"Regression towards mediocrity in hereditary stature","volume":"15","author":"Galton","year":"1886","journal-title":"J. Anthropol. Inst. Great Br. Irel."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"5840","DOI":"10.1007\/s11227-020-03494-6","article-title":"A dynamic VM consolidation approach based on load balancing using Pearson correlation in cloud computing","volume":"77","author":"Mapetu","year":"2021","journal-title":"J. Supercomput."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"118714","DOI":"10.1016\/j.eswa.2022.118714","article-title":"Improvement of tasks scheduling algorithm based on load balancing candidate method under cloud computing environment","volume":"212","author":"Chiang","year":"2023","journal-title":"Expert Syst. 
Appl."},{"key":"ref_45","first-page":"18","article-title":"On random graphs I","volume":"6","author":"Erd\u0151s","year":"1959","journal-title":"Publ. Math. Debr."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"90","DOI":"10.1525\/bio.2013.63.2.5","article-title":"Integrative approaches to the study of baleen whale diving behavior, feeding performance, and foraging ecology","volume":"63","author":"Goldbogen","year":"2013","journal-title":"BioScience"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"155","DOI":"10.2307\/1379766","article-title":"Aerial observation of feeding behavior in four baleen whales: Eubalaena glacialis, Balaenoptera borealis, Megaptera novaeangliae, and Balaenoptera physalus","volume":"60","author":"Watkins","year":"1979","journal-title":"J. Mammal."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"675","DOI":"10.1080\/01621459.1937.10503522","article-title":"The use of ranks to avoid the assumption of normality implicit in the analysis of variance","volume":"32","author":"Friedman","year":"1937","journal-title":"J. Am. Stat. Assoc."},{"key":"ref_49","unstructured":"Nemenyi, P.B. (1963). 
Distribution-Free Multiple Comparisons, Princeton University."}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/11\/284\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T05:15:48Z","timestamp":1762924548000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/11\/284"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,10]]},"references-count":49,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2025,11]]}},"alternative-id":["bdcc9110284"],"URL":"https:\/\/doi.org\/10.3390\/bdcc9110284","relation":{},"ISSN":["2504-2289"],"issn-type":[{"value":"2504-2289","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,10]]}}}