{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,25]],"date-time":"2025-12-25T12:38:37Z","timestamp":1766666317132,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":75,"publisher":"ACM","license":[{"start":{"date-parts":[[2024,11,20]],"date-time":"2024-11-20T00:00:00Z","timestamp":1732060800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,11,20]]},"DOI":"10.1145\/3698038.3698537","type":"proceedings-article","created":{"date-parts":[[2024,11,14]],"date-time":"2024-11-14T06:32:43Z","timestamp":1731565963000},"page":"792-810","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Dynamic Idle Resource Leasing To Safely Oversubscribe Capacity At Meta"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-0087-1834","authenticated-orcid":false,"given":"Nishant","family":"Gupta","sequence":"first","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7762-4554","authenticated-orcid":false,"given":"Iyswarya","family":"Narayanan","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-7389-2276","authenticated-orcid":false,"given":"Shivam","family":"Handa","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8773-5808","authenticated-orcid":false,"given":"Sayak","family":"Chakraborti","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-3461-9407","authenticated-orcid":false,"given":"Pankit","family":"Thapar","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7564-7755","authenticated-orcid":false,"given":"Baohua","family":"Shan","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-8514-0747","authenticated-orcid":false,"given":"Ariel","family":"Rao","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-2693-8518","authenticated-orcid":false,"given":"Yuanlai","family":"Liu","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-9755-4283","authenticated-orcid":false,"given":"Pengyuan","family":"Wang","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-1879-8940","authenticated-orcid":false,"given":"Yuqing","family":"Wu","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-5960-1036","authenticated-orcid":false,"given":"Qingyi","family":"Gao","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-2690-1061","authenticated-orcid":false,"given":"Chris Chao-Chun","family":"Cheng","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-4420-2368","authenticated-orcid":false,"given":"Sihan","family":"You","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-3524-4813","authenticated-orcid":false,"given":"Louis","family":"Huang","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-4043-6599","authenticated-orcid":false,"given":"Jingyuan","family":"Fan","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-2483-0062","authenticated-orcid":false,"given":"Kenny","family":"Yu","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-8900-6170","authenticated-orcid":false,"given":"Kevin","family":"Lin","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-2065-3270","authenticated-orcid":false,"given":"Tengfei","family":"Mu","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-0589-5048","authenticated-orcid":false,"given":"Parth","family":"Malani","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-4301-3665","authenticated-orcid":false,"given":"Haiying","family":"Wang","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2691-0136","authenticated-orcid":false,"given":"Trey","family":"Lu","sequence":"additional","affiliation":[{"name":"Meta, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-2206-3615","authenticated-orcid":false,"given":"Peter","family":"Zhang","sequence":"additional","affiliation":[{"name":"Meta, United States"}]}],"member":"320","published-online":{"date-parts":[[2024,11,20]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)","author":"Agarwal Anup","year":"2023","unstructured":"Anup Agarwal, Shadi Noghabi, \u00cd\u00f1igo Goiri, Srinivasan Seshan, and Anirudh Badam. 2023. Unlocking unallocated cloud capacity for long, uninterruptible workloads. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, 457--478."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/2627422"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2674025.2576197"},{"key":"e_1_3_2_1_4_1","volume-title":"Open Catalyst Project: Using AI to model and discover new catalysts to address the energy challenges posed by climate change. https:\/\/opencatalystproject.org\/ Accessed","author":"Meta AI and Carnegie Mellon University","year":"2023","unstructured":"Meta AI and Carnegie Mellon University. 2022. Open Catalyst Project: Using AI to model and discover new catalysts to address the energy challenges posed by climate change. https:\/\/opencatalystproject.org\/ Accessed: December 6, 2023."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSC.2017.2711009"},{"key":"e_1_3_2_1_6_1","volume-title":"Providing SLOs for Resource-Harvesting VMs in Cloud Platforms. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Ambati Pradeep","year":"2020","unstructured":"Pradeep Ambati, Inigo Goiri, Felipe Frujeri, Alper Gun, Ke Wang, Brian Dolan, Brian Corell, Sekhar Pasupuleti, Thomas Moscibroda, Sameh Elnikety, Marcus Fontoura, and Ricardo Bianchini. 2020. Providing SLOs for Resource-Harvesting VMs in Cloud Platforms. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 735--751. https:\/\/www.usenix.org\/conference\/osdi20\/presentation\/ambati"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341617.3326143"},{"key":"e_1_3_2_1_8_1","unstructured":"Elisa Shibley Aravind Narayanan and Mayank Pundir. 2020. Fault Tolerance through Optimal Workload Placement. https:\/\/engineering.fb.com\/2020\/09\/08\/data-center-engineering\/fault-tolerance-through-optimal-workload-placement\/"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2670979.2670999"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2017.07.019"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3342195.3387555"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359655"},{"key":"e_1_3_2_1_13_1","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Choudhury Arnab","year":"2024","unstructured":"Arnab Choudhury, Yang Wang, Tuomas Pelkonen, Kutta Srinivasan, Abha Jain, Shenghao Lin, Delia David, Siavash Soleimanifard, Michael Chen, Abhishek Yadav, Ritesh Tijoriwala, Denis Samoylov, and Chunqiang Tang. 2024. MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 563--580. https:\/\/www.usenix.org\/conference\/osdi24\/presentation\/choudhury"},{"key":"e_1_3_2_1_14_1","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Chow Mike","year":"2024","unstructured":"Mike Chow, Yang Wang, William Wang, Ayichew Hailu, Rohan Bopardikar, Bin Zhang, Jialiang Qu, David Meisner, Santosh Sonawane, Yunqi Zhang, Rodrigo Paim, Mack Ward, Ivor Huang, Matt Mc-Nally, Daniel Hodges, Zoltan Farkas, Caner Gocmen, Elvis Huang, and Chunqiang Tang. 2024. ServiceLab: Preventing Tiny Performance Regressions at Hyperscale through Pre-Production Testing. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 545--562. https:\/\/www.usenix.org\/conference\/osdi24\/presentation\/chow"},{"key":"e_1_3_2_1_15_1","volume-title":"HUG: Multi-Resource Fairness for Correlated and Elastic Demands. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16)","author":"Chowdhury Mosharaf","year":"2016","unstructured":"Mosharaf Chowdhury, Zhenhua Liu, Ali Ghodsi, and Ion Stoica. 2016. HUG: Multi-Resource Fairness for Correlated and Elastic Demands. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16). USENIX Association, Santa Clara, CA, 407--424. https:\/\/www.usenix.org\/conference\/nsdi16\/technical-sessions\/presentation\/chowdhury"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132747.3132772"},{"key":"e_1_3_2_1_17_1","volume-title":"Dynolog: An Open-Source System Observability Tool. https:\/\/developers.facebook.com\/blog\/post\/2022\/11\/16\/dynolog-open-source-system-observability\/","author":"Coutinho Brian","year":"2022","unstructured":"Brian Coutinho. 2022. Dynolog: An Open-Source System Observability Tool. https:\/\/developers.facebook.com\/blog\/post\/2022\/11\/16\/dynolog-open-source-system-observability\/"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSPEC.2017.7864753"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCWorkshops50388.2021.9473607"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2012.68"},{"key":"e_1_3_2_1_21_1","volume-title":"Global Capacity Management With Flux. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)","author":"Eriksen Marius","year":"2023","unstructured":"Marius Eriksen, Kaushik Veeraraghavan, Yusuf Abdulghani, Andrew Birchall, Po-Yen Chou, Richard Cornew, Adela Kabiljo, Ranjith Kumar S, Maroo Lieuw, Justin Meza, Scott Michelson, Thomas Rohloff, Hayley Russell, Jeff Qin, and Chunqiang Tang. 2023. Global Capacity Management With Flux. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, Boston, MA, 589--606. https:\/\/www.usenix.org\/conference\/osdi23\/presentation\/eriksen"},{"key":"e_1_3_2_1_22_1","volume-title":"Power provisioning for a warehouse-sized computer. ACM SIGARCH computer architecture news 35, 2","author":"Fan Xiaobo","year":"2007","unstructured":"Xiaobo Fan, Wolf-Dietrich Weber, and Luiz Andre Barroso. 2007. Power provisioning for a warehouse-sized computer. ACM SIGARCH computer architecture news 35, 2 (2007), 13--23."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3503222.3507725"},{"key":"e_1_3_2_1_24_1","volume-title":"Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. In 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI 11)","author":"Ghodsi Ali","year":"2011","unstructured":"Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica. 2011. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. In 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI 11). USENIX Association, Boston, MA. https:\/\/www.usenix.org\/conference\/nsdi11\/dominant-resource-fairness-fair-allocation-multiple-resource-types"},{"key":"e_1_3_2_1_25_1","volume-title":"17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)","author":"Grubic Boris","year":"2023","unstructured":"Boris Grubic, Yang Wang, Tyler Petrochko, Ran Yaniv, Brad Jones, David Callies, Matt Clarke-Lauer, Dan Kelley, Soteris Demetriou, Kenny Yu, and Chunqiang Tang. 2023. Conveyor: One-Tool-Fits-All Continuous Software Deployment at Meta. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, Boston, MA, 325--342. https:\/\/www.usenix.org\/conference\/osdi23\/presentation\/grubic"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICC45855.2022.9838440"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00076"},{"key":"e_1_3_2_1_28_1","volume-title":"8th USENIX Symposium on Networked Systems Design and Implementation (NSDI 11)","author":"Hindman Benjamin","year":"2011","unstructured":"Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D Joseph, Randy Katz, Scott Shenker, and Ion Stoica. 2011. Mesos: A platform for Fine-Grained resource sharing in the data center. In 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI 11). USENIX Association."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173162.3173190"},{"key":"e_1_3_2_1_30_1","volume-title":"Metastable Failures in the Wild. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Huang Lexiang","year":"2022","unstructured":"Lexiang Huang, Matthew Magnusson, Abishek Bangalore Muralikrishna, Salman Estyak, Rebecca Isaacs, Abutalib Aghayev, Timothy Zhu, and Aleksey Charapko. 2022. Metastable Failures in the Wild. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 73--90. https:\/\/www.usenix.org\/conference\/osdi22\/presentation\/huang-lexiang"},{"key":"e_1_3_2_1_31_1","volume-title":"18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21)","author":"Hwang Changho","year":"2021","unstructured":"Changho Hwang, Taehyun Kim, Sunghyun Kim, Jinwoo Shin, and KyoungSoo Park. 2021. Elastic resource sharing for distributed deep learning. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). 721--739."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3492321.3527539"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2592798.2592821"},{"key":"e_1_3_2_1_34_1","volume-title":"Aryl: An Elastic Cluster Scheduler for Deep Learning. arXiv preprint arXiv:2202.07896","author":"Li Jiamin","year":"2022","unstructured":"Jiamin Li, Hong Xu, Yibo Zhu, Zherui Liu, Chuanxiong Guo, and Cong Wang. 2022. Aryl: An Elastic Cluster Scheduler for Deep Learning. arXiv preprint arXiv:2202.07896 (2022)."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2014.6853237"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2749475"},{"key":"e_1_3_2_1_37_1","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Mahajan Kshiteej","year":"2018","unstructured":"Kshiteej Mahajan, Mosharaf Chowdhury, Aditya Akella, and Shuchi Chawla. 2018. Dynamic Query Re-Planning using QOOP. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 253--267."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341302.3342080"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/CCGrid.2011.56"},{"volume-title":"Scribe: Transporting petabytes per hour via a distributed, buffered queueing system. https:\/\/engineering.fb.com\/2019\/10\/07\/core-infra\/scribe\/ Accessed","year":"2019","key":"e_1_3_2_1_40_1","unstructured":"Meta. 2019. Scribe: Transporting petabytes per hour via a distributed, buffered queueing system. https:\/\/engineering.fb.com\/2019\/10\/07\/core-infra\/scribe\/ Accessed: December 6, 2023."},{"volume-title":"How Facebook encodes your videos. https:\/\/engineering.fb.com\/2021\/04\/05\/video-engineering\/how-facebook-encodes-your-videos\/ Accessed","year":"2023","key":"e_1_3_2_1_41_1","unstructured":"Meta. 2021. How Facebook encodes your videos. https:\/\/engineering.fb.com\/2021\/04\/05\/video-engineering\/how-facebook-encodes-your-videos\/ Accessed: December 6, 2023."},{"key":"e_1_3_2_1_42_1","unstructured":"Justin J Meza. 2018. Large scale studies of memory storage and network failures in a modern data center. Ph.D. Dissertation. Carnegie Mellon University."},{"key":"e_1_3_2_1_43_1","volume-title":"Defcon: Preventing Overload with Graceful Feature Degradation. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)","author":"Meza Justin J.","year":"2023","unstructured":"Justin J. Meza, Thote Gowda, Ahmed Eid, Tomiwa Ijaware, Dmitry Chernyshev, Yi Yu, Md Nazim Uddin, Rohan Das, Chad Nachiappan, Sari Tran, Shuyang Shi, Tina Luo, David Ke Hong, Sankaralingam Panneerselvam, Hans Ragas, Svetlin Manavski, Weidong Wang, and Francois Richard. 2023. Defcon: Preventing Overload with Graceful Feature Degradation. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, Boston, MA, 607--622. https:\/\/www.usenix.org\/conference\/osdi23\/presentation\/meza"},{"volume-title":"B-series burstable virtual machine sizes. https:\/\/learn.microsoft.com\/en-us\/azure\/virtual-machines\/sizes-b-series-burstable Accessed","year":"2023","key":"e_1_3_2_1_44_1","unstructured":"Microsoft. 2022. B-series burstable virtual machine sizes. https:\/\/learn.microsoft.com\/en-us\/azure\/virtual-machines\/sizes-b-series-burstable Accessed: December 6, 2023."},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3102980.3102983"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/BigData52589.2021.9671521"},{"key":"e_1_3_2_1_47_1","volume-title":"6th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 14)","author":"Narayanan Iyswarya","year":"2014","unstructured":"Iyswarya Narayanan, Aman Kansal, Anand Sivasubramaniam, Bhuvan Urgaonkar, and Sriram Govindan. 2014. Towards a Leaner Geo-distributed Cloud Infrastructure. In 6th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 14). USENIX Association, Philadelphia, PA. https:\/\/www.usenix.org\/conference\/hotcloud14\/workshop-program\/presentation\/narayanan"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/1755913.1755938"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3477132.3483578"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3190508.3190517"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3617232.3624853"},{"volume-title":"Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21)","author":"Qiao Aurick","key":"e_1_3_2_1_52_1","unstructured":"Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, and Eric P. Xing. 2021. Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21). USENIX Association, 1--18. https:\/\/www.usenix.org\/conference\/osdi21\/presentation\/qiao"},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613155"},{"key":"e_1_3_2_1_54_1","volume-title":"17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)","author":"Saokar Harshit","year":"2023","unstructured":"Harshit Saokar, Soteris Demetriou, Nick Magerko, Max Kontorovich, Josh Kirstein, Margot Leibold, Dimitrios Skarlatos, Hitesh Khandelwal, and Chunqiang Tang. 2023. ServiceRouter: Hyperscale and Minimal Cost Service Mesh at Meta. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, Boston, MA, 969--985. https:\/\/www.usenix.org\/conference\/osdi23\/presentation\/saokar"},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2465351.2465386"},{"key":"e_1_3_2_1_56_1","volume-title":"https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/using-spot-instances.html Accessed","author":"Services Amazon Web","year":"2023","unstructured":"Amazon Web Services. 2019. Spot Instances. https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/using-spot-instances.html Accessed: December 6, 2023."},{"key":"e_1_3_2_1_57_1","volume-title":"Burstable Performance Instances. https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/burstable-performance-instances.html Accessed","author":"Services Amazon Web","year":"2023","unstructured":"Amazon Web Services. 2022. Burstable Performance Instances. https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/burstable-performance-instances.html Accessed: December 6, 2023."},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00196"},{"key":"e_1_3_2_1_59_1","volume-title":"Twine: A Unified Cluster Management System for Shared Infrastructure. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Tang Chunqiang","year":"2020","unstructured":"Chunqiang Tang, Kenny Yu, Kaushik Veeraraghavan, Jonathan Kaldor, Scott Michelson, Thawan Kooburat, Aravind Anbudurai, Matthew Clark, Kabir Gogia, Long Cheng, Ben Christensen, Alex Gartrell, Maxim Khutornenko, Sachin Kulkarni, Marcin Pawlowski, Tuomas Pelkonen, Andre Rodrigues, Rounak Tibrewal, Vaishnavi Venkatesan, and Peter Zhang. 2020. Twine: A Unified Cluster Management System for Shared Infrastructure. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 787--803. https:\/\/www.usenix.org\/conference\/osdi20\/presentation\/tang"},{"key":"e_1_3_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1080\/00031305.2017.1380080"},{"key":"e_1_3_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3342195.3387517"},{"key":"e_1_3_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/844128.844151"},{"key":"e_1_3_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/2523616.2523633"},{"key":"e_1_3_2_1_64_1","volume-title":"12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16)","author":"Veeraraghavan Kaushik","year":"2016","unstructured":"Kaushik Veeraraghavan, Justin Meza, David Chou, Wonho Kim, Sonia Margulis, Scott Michelson, Rajesh Nishtala, Daniel Obenshain, Dmitri Perelman, and Yee Jiun Song. 2016. Kraken: Leveraging Live Traffic Tests to Identify and Resolve Resource Utilization Bottlenecks in Large Scale Web Services. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, Savannah, GA, 635--651. https:\/\/www.usenix.org\/conference\/osdi16\/technical-sessions\/presentation\/veeraraghavan"},{"key":"e_1_3_2_1_65_1","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Veeraraghavan Kaushik","year":"2018","unstructured":"Kaushik Veeraraghavan, Justin Meza, Scott Michelson, Sankaralingam Panneerselvam, Alex Gyori, David Chou, Sonia Margulis, Daniel Obenshain, Shruti Padmanabha, Ashish Shah, Yee Jiun Song, and Tianyin Xu. 2018. Maelstrom: Mitigating Datacenter-level Disasters by Draining Interdependent Traffic Safely and Efficiently. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 373--389. https:\/\/www.usenix.org\/conference\/osdi18\/presentation\/veeraraghavan"},{"key":"e_1_3_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/2741948.2741964"},{"key":"e_1_3_2_1_67_1","volume-title":"Vigraham and Benjamin Leonhardi","author":"Saranyan","year":"2024","unstructured":"Saranyan A. Vigraham and Benjamin Leonhardi. 2024. Maintaining Large-Scale AI Capacity at Meta. https:\/\/engineering.fb.com\/2024\/06\/12\/production-engineering\/maintaining-large-scale-ai-capacity-meta\/"},{"key":"e_1_3_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447786.3456225"},{"key":"e_1_3_2_1_69_1","volume-title":"MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)","author":"Weng Qizhen","year":"2022","unstructured":"Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. 2022. MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). USENIX Association, Renton, WA, 945--960. https:\/\/www.usenix.org\/conference\/nsdi22\/presentation\/weng"},{"key":"e_1_3_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001187"},{"key":"e_1_3_2_1_71_1","volume-title":"Gandiva: Introspective Cluster Scheduling for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Xiao Wencong","year":"2018","unstructured":"Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. 2018. Gandiva: Introspective Cluster Scheduling for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 595--610. https:\/\/www.usenix.org\/conference\/osdi18\/presentation\/xiao"},{"key":"e_1_3_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISIT.2019.8849212"},{"key":"e_1_3_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1145\/1755913.1755940"},{"key":"e_1_3_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/2541940.2541962"},{"key":"e_1_3_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1145\/3477132.3483580"}],"event":{"name":"SoCC '24: ACM Symposium on Cloud Computing","sponsor":["SIGMOD ACM Special Interest Group on Management of Data","SIGOPS ACM Special Interest Group on Operating Systems"],"location":"Redmond WA USA","acronym":"SoCC '24"},"container-title":["Proceedings of the ACM Symposium on Cloud Computing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3698038.3698537","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3698038.3698537","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T18:58:12Z","timestamp":1755889092000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3698038.3698537"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,20]]},"references-count":75,"alternative-id":["10.1145\/3698038.3698537","10.1145\/3698038"],"URL":"https:\/\/doi.org\/10.1145\/3698038.3698537","relation":{},"subject":[],"published":{"date-parts":[[2024,11,20]]},"assertion":[{"value":"2024-11-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}