{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,11]],"date-time":"2025-11-11T15:55:40Z","timestamp":1762876540623,"version":"3.41.2"},"publisher-location":"New York, NY, USA","reference-count":47,"publisher":"ACM","funder":[{"name":"HORIZON EUROPE Digital, Industry and Space","award":["101070568"],"award-info":[{"award-number":["101070568"]}]},{"name":"HORIZON EUROPE Food, Bioeconomy, Natural Resources, Agriculture and Environment","award":["101084642"],"award-info":[{"award-number":["101084642"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,6,10]]},"DOI":"10.1145\/3701717.3734461","type":"proceedings-article","created":{"date-parts":[[2025,7,16]],"date-time":"2025-07-16T13:52:43Z","timestamp":1752673963000},"page":"164-175","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["IgNITE: Scheduling Pipeline-Parallel DNN Training Jobs on Heterogeneous Infrastructures"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2260-5196","authenticated-orcid":false,"given":"Phivos","family":"Dadamis","sequence":"first","affiliation":[{"name":"Athens University of Economics and Business, Athens, Greece"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6972-9567","authenticated-orcid":false,"given":"Dimitrios","family":"Tomaras","sequence":"additional","affiliation":[{"name":"Athens University of Economics and Business, Athens, Greece"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6421-9947","authenticated-orcid":false,"given":"Vana","family":"Kalogeraki","sequence":"additional","affiliation":[{"name":"Athens University of Economics and Business, Athens, Greece"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6339-1879","authenticated-orcid":false,"given":"Dimitrios","family":"Gunopulos","sequence":"additional","affiliation":[{"name":"National and Kapodistrian University of Athens, Athens, Greece"}]}],"member":"320","published-online":{"date-parts":[[2025,6,9]]},"reference":[{"key":"e_1_3_3_1_2_2","unstructured":"Ebtesam Almazrouei and et al.2023. Falcon-40B: an open large language model with state-of-the-art performance. (2023)."},{"key":"e_1_3_3_1_3_2","volume-title":"USENIX FAST, virtual event, February 23-25, 2021","author":"Bae Jonghyun","year":"2021","unstructured":"Jonghyun Bae, Jongsung Lee, Yunho Jin, Sam Son, Shine Kim, Hakbeom Jang, Tae\u00a0Jun Ham, and Jae\u00a0W Lee. 2021. FlashNeuron: SSD-Enabled Large-Batch training of very deep neural networks. In USENIX FAST, virtual event, February 23-25, 2021."},{"key":"e_1_3_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDCSW.2017.30"},{"key":"e_1_3_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDCS.2019.00204"},{"key":"e_1_3_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICGHPC.2016.7508071"},{"key":"e_1_3_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3447818.3460372"},{"key":"e_1_3_3_1_8_2","unstructured":"Lequn\u00a0Chen et al. 2023. Symphony: Optimized DNN Model Serving using Deferred Batch Scheduling. preprint arXiv:https:\/\/arXiv.org\/abs\/2308.07470 (2023)."},{"key":"e_1_3_3_1_9_2","volume-title":"EuroSys 2024, Athens, Greece, April 22-25, 2024","author":"al Zhang\u00a0Shiwei et","year":"2024","unstructured":"Zhang\u00a0Shiwei et al. 2024. HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis. In EuroSys 2024, Athens, Greece, April 22-25, 2024."},{"key":"e_1_3_3_1_10_2","volume-title":"USENIX OSDI, Santa Clara, CA, USA, July 10-12","author":"Faisal Abdullah\u00a0Bin","year":"2024","unstructured":"Abdullah\u00a0Bin Faisal, Noah Martin, Hafiz\u00a0Mohsin Bashir, Swaminathan Lamelas, and Fahad\u00a0R. Dogar. 2024. When will my ML Job finish? Toward providing Completion Time Estimates through Predictability-Centric Scheduling. In USENIX OSDI, Santa Clara, CA, USA, July 10-12."},{"key":"e_1_3_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3313808.3313819"},{"key":"e_1_3_3_1_12_2","doi-asserted-by":"crossref","unstructured":"Soumendu\u00a0Kumar Ghosh Arnab Raha Vijay Raghunathan and Anand Raghunathan. 2024. PArtNNer: Platform-agnostic adaptive edge-cloud DNN partitioning for minimizing end-to-end latency. ACM TECS 23 1 (2024).","DOI":"10.1145\/3630266"},{"key":"e_1_3_3_1_13_2","unstructured":"Lei Guan Wotao Yin Dongsheng Li and Xicheng Lu. 2019. XPipe: Efficient pipeline model parallelism for multi-GPU DNN training. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1911.04610 (2019)."},{"key":"e_1_3_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3524059.3532394"},{"key":"e_1_3_3_1_15_2","unstructured":"Yanping Huang Youlong Cheng Ankur Bapna Orhan Firat Dehao Chen et\u00a0al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. NIPS Vancouver BC Canada (2019)."},{"key":"e_1_3_3_1_16_2","unstructured":"Forrest\u00a0N. Iandola Matthew\u00a0W. Moskewicz Khalid Ashraf Song Han William\u00a0J. Dally and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR abs\/1602.07360 (2016). arXiv:https:\/\/arXiv.org\/abs\/1602.07360"},{"key":"e_1_3_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517848"},{"key":"e_1_3_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613175"},{"key":"e_1_3_3_1_19_2","unstructured":"Byungsoo Jeon Mengdi Wu Shiyi Cao Sunghyun Kim Sunghyun Park et\u00a0al. 2024. Graphpipe: Improving performance and scalability of dnn training with graph pipeline parallelism. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2406.17145 (2024)."},{"key":"e_1_3_3_1_20_2","volume-title":"NAACL-HLT, Minneapolis, MN, USA, June 2-7, 2019","author":"Kenton Jacob Devlin Ming-Wei\u00a0Chang","year":"2019","unstructured":"Jacob Devlin Ming-Wei\u00a0Chang Kenton and Lee\u00a0Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Minneapolis, MN, USA, June 2-7, 2019."},{"key":"e_1_3_3_1_21_2","unstructured":"Alex Krizhevsky Ilya Sutskever and Geoffrey\u00a0E Hinton. 2012. Imagenet classification with deep convolutional neural networks. NIPS 25 (2012)."},{"key":"e_1_3_3_1_22_2","doi-asserted-by":"crossref","unstructured":"Matthias Langer Zhen He Wenny Rahayu and Yanbo Xue. 2020. Distributed training of deep learning models: A taxonomic perspective. IEEE TPDS 31 12 (2020) 2802\u20132818.","DOI":"10.1109\/TPDS.2020.3003307"},{"key":"e_1_3_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3552326.3587445"},{"key":"e_1_3_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER51413.2022.00042"},{"key":"e_1_3_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE53745.2022.00165"},{"key":"e_1_3_3_1_26_2","doi-asserted-by":"crossref","unstructured":"Imen\u00a0Ben Mansour Matthieu Basseur and Fr\u00e9d\u00e9ric Saubion. 2018. A multi-population algorithm for multi-objective knapsack problem. Applied Soft Computing 70 (2018) 814\u2013825.","DOI":"10.1016\/j.asoc.2018.06.024"},{"key":"e_1_3_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM41043.2020.9155237"},{"key":"e_1_3_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_3_1_29_2","first-page":"481","volume-title":"OSDI 2020, Virtual Event, November 4-6, 2020","author":"Narayanan Deepak","year":"2020","unstructured":"Deepak Narayanan, Keshav Santhanam, Fiodar Kazhamiaka, Amar Phanishayee, and Matei Zaharia. 2020. Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads. In OSDI 2020, Virtual Event, November 4-6, 2020. 481\u2013498."},{"key":"e_1_3_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3652892.3700767"},{"key":"e_1_3_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE53745.2022.00241"},{"key":"e_1_3_3_1_32_2","first-page":"307","volume-title":"USENIX ATC 2020, July 15-17, 2020","author":"Park Jay\u00a0H.","year":"2020","unstructured":"Jay\u00a0H. Park, Gyeongchan Yun, Chang\u00a0M. Yi, Nguyen\u00a0T. Nguyen, Seungmin Lee, Jaesik Choi, Sam\u00a0H. Noh, and Young-ri Choi. 2020. HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism. In USENIX ATC 2020, July 15-17, 2020. 307\u2013321."},{"key":"e_1_3_3_1_33_2","volume-title":"NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023","author":"Penedo Guilherme","year":"2023","unstructured":"Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, et\u00a0al. 2023. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only. In NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023."},{"key":"e_1_3_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406703"},{"key":"e_1_3_3_1_35_2","volume-title":"ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings."},{"key":"e_1_3_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00293"},{"key":"e_1_3_3_1_37_2","unstructured":"Hugo Touvron Thibaut Lavril Gautier Izacard Xavier Martinet Marie-Anne Lachaux et\u00a0al. 2023. LLaMA: Open and Efficient Foundation Language Models. CoRR abs\/2302.13971 (2023). arXiv:https:\/\/arXiv.org\/abs\/2302.13971"},{"key":"e_1_3_3_1_38_2","volume-title":"NSDI 2022, Renton, WA, USA, April 4-6, 2022","author":"Weng Qizhen","year":"2022","unstructured":"Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, et\u00a0al. 2022. MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters. In NSDI 2022, Renton, WA, USA, April 4-6, 2022."},{"key":"e_1_3_3_1_39_2","unstructured":"BigScience Workshop Teven\u00a0Le Scao Angela Fan Christopher Akiki Ellie Pavlick et\u00a0al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2211.05100 (2022)."},{"key":"e_1_3_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00145"},{"key":"e_1_3_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3442442.3452055"},{"key":"e_1_3_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3636534.3649363"},{"key":"e_1_3_3_1_43_2","volume-title":"NeurIPS, Vancouver, BC, Canada, December 10 - 15","author":"Yu Lu","year":"2024","unstructured":"Lu Yu, Haiyang Zhang, and Changsheng Xu. 2024. Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models. In NeurIPS, Vancouver, BC, Canada, December 10 - 15."},{"key":"e_1_3_3_1_44_2","first-page":"173","volume-title":"BIC-TA 2018, Beijing, China, November 2-4","author":"Yu Pengfei","year":"2018","unstructured":"Pengfei Yu, Juanjuan He, Xiaoming Liu, and Kai Zhang. 2018. Industrial Air Pollution Prediction Using Deep Neural Network. In BIC-TA 2018, Beijing, China, November 2-4. 173\u2013185."},{"key":"e_1_3_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICAC.2016.58"},{"key":"e_1_3_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3097983.3098033"},{"key":"e_1_3_3_1_47_2","unstructured":"Shixiong Zhao Fanxin Li Xusheng Chen Xiuxian Guan Jianyu Jiang et\u00a0al. 2021. vPIPE: A virtualized acceleration system for achieving efficient and scalable pipeline parallel dnn training. IEEE TPDS (2021)."},{"key":"e_1_3_3_1_48_2","doi-asserted-by":"crossref","unstructured":"Guangyao Zhou Wenhong Tian Rajkumar Buyya and Kui Wu. 2025. UMPIPE: Unequal Microbatches-Based Pipeline Parallelism for Deep Neural Network Training. IEEE TPDS 36 2 (2025) 293\u2013307.","DOI":"10.1109\/TPDS.2024.3515804"}],"event":{"name":"DEBS '25: The 19th ACM International Conference on Distributed and Event-based Systems","location":"Gothenburg Sweden","acronym":"DEBS '25","sponsor":["SIGMOD ACM Special Interest Group on Management of Data","SIGSOFT ACM Special Interest Group on Software Engineering"]},"container-title":["Proceedings of the 19th ACM International Conference on Distributed and Event-based Systems"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3701717.3734461","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,16]],"date-time":"2025-07-16T13:54:01Z","timestamp":1752674041000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3701717.3734461"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,9]]},"references-count":47,"alternative-id":["10.1145\/3701717.3734461","10.1145\/3701717"],"URL":"https:\/\/doi.org\/10.1145\/3701717.3734461","relation":{},"subject":[],"published":{"date-parts":[[2025,6,9]]},"assertion":[{"value":"2025-06-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}