{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:12:47Z","timestamp":1750219967806,"version":"3.41.0"},"reference-count":9,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2022,8,30]],"date-time":"2022-08-30T00:00:00Z","timestamp":1661817600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGMETRICS Perform. Eval. Rev."],"published-print":{"date-parts":[[2022,8,30]]},"abstract":"<jats:p>Most large-scale ML implementations scale to large amounts of data by utilizing multiple servers or virtual machines (VMs) that iteratively compute model updates on local data that are periodically synchronized. Due to the complexity of managing the resulting computing infrastructure, many companies run their ML jobs on external cloud providers' servers. However, cloud resources can be expensive, particularly for large ML jobs with long runtimes.<\/jats:p>\n          <jats:p>A particularly popular method to limit the costs of training ML jobs is to utilize preemptible cloud instances. These may be interrupted at the cloud provider's discretion, but they are significantly (up to 90%) cheaper than conventional on-demand instances. Most studies of these ML methods, however, assume the availability of large datasets at training time. In practice, training data may arrive at irregular intervals and models may be trained online as new data samples arrive, e.g., when monitoring data from IoT sensors. While some software frameworks like Apache Kafka can feed online data arrivals to ML algorithms, they provide little insight into the resulting costs of ML training. 
We extend prior work on provisioning preemptible instances to analyze available pools of data in order to run online ML on incoming datastreams, which presents new challenges due to the need to carefully handle data arrivals. We design, analyze, and optimize DOLL, which to the best of our knowledge is the first system that provides provable performance guarantees for Distributed OnLine Learning over preemptible instances.<\/jats:p>\n          <jats:p>Research Challenges and Our Contributions: When pools of data are readily available, the bottleneck to distributed ML training often lies in the time required for each VM to compute its model updates. In our scenario, however, the arrival rate of incoming data may also bottleneck data processing. An intuitive strategy would then be for each VM to process each data point as it arrives. However, since arrivals at different VMs may not be coordinated, synchronizing the model parameters at each VM between data arrivals may introduce additional delays, while asynchronous SGD methods can lead to slow convergence [1]. DOLL uses a batching and grouping process to limit the synchronization delay, which naturally realizes traditional mini-batch SGD so as to provide provable model convergence guarantees.<\/jats:p>\n          <jats:p>Handling online data arrivals becomes particularly challenging when we use preemptible instances to compute model updates. Existing methods utilizing preemptible instances for ML jobs largely focus on mitigating training interruptions [2] and their effects on model convergence [3]. When used on datastreams, we face an additional challenge of interruptions pausing the data arrival process, which impedes the rate at which we can compute model updates and thus model convergence. Thus, one should ensure that preemptions do not happen \"too often,\" e.g., by computing some updates on on-demand instances. 
Our work is the first to optimize the number of preemptible VMs used and demonstrate that we can meet ML convergence guarantees.<\/jats:p>","DOI":"10.1145\/3561074.3561082","type":"journal-article","created":{"date-parts":[[2022,8,31]],"date-time":"2022-08-31T05:48:43Z","timestamp":1661924923000},"page":"21-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["DOLL"],"prefix":"10.1145","volume":"50","author":[{"given":"Harry","family":"Jiang","sequence":"first","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiaoxi","family":"Zhang","sequence":"additional","affiliation":[{"name":"Sun Yat-sen University, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Carlee","family":"Joe-Wong","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,8,30]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Slow and stale gradients can win the race: Error-runtime trade-offs in distributed sgd,\" in Proceedings of AISTATS","author":"Dutta S.","year":"2018","unstructured":"S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar, \"Slow and stale gradients can win the race: Error-runtime trade-offs in distributed sgd,\" in Proceedings of AISTATS, 2018."},{"key":"e_1_2_1_2_1","first-page":"484","volume-title":"ACM","author":"Yan Y.","year":"2016","unstructured":"Y. Yan, Y. Gao, Y. Chen, Z. Guo, B. Chen, and T. Moscibroda, \"Tr-spark: Transient computing for big data analytics,\" in Proceedings of the Seventh ACM SoCC. ACM, 2016, pp. 484--496."},{"key":"e_1_2_1_3_1","volume-title":"Machine learning on volatile instances,\" in Proc. of INFOCOM","author":"Zhang X.","year":"2020","unstructured":"X. Zhang, J. Wang, G. Joshi, and C. Joe-Wong, \"Machine learning on volatile instances,\" in Proc. of INFOCOM, 2020."},{"key":"e_1_2_1_4_1","volume-title":"Proteus: Agile ml elasticity through tiered reliability in dynamic resource markets,\" in Proc. of EuroSys","author":"Harlap A.","year":"2017","unstructured":"A. Harlap, A. Tumanov, A. Chung, G. R. Ganger, and P. B. Gibbons, \"Proteus: Agile ml elasticity through tiered reliability in dynamic resource markets,\" in Proc. of EuroSys, 2017."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.2307\/2333008"},{"issue":"7","key":"e_1_2_1_7_1","first-page":"1046","article-title":"Aggregated traffic models for real-world data in the internet of things","volume":"9","author":"L\u00f3pez-Ben\u00edtez M.","year":"2020","unstructured":"M. L\u00f3pez-Ben\u00edtez, C. Majumdar, and S. N. Merchant, \"Aggregated traffic models for real-world data in the internet of things,\" IEEE Wireless Communications Letters, vol. 9, no. 7, pp. 1046--1050, 2020.","journal-title":"IEEE Wireless Communications Letters"},{"key":"e_1_2_1_8_1","unstructured":"Amazon EC2, \"Amazon ec2 spot instances,\" https:\/\/aws.amazon.com\/ec2\/spot\/, 2021."},{"key":"e_1_2_1_9_1","unstructured":"Google Cloud Platform, \"Preemptible vms,\" https:\/\/cloud.google.com\/preemptible-vms\/, 2021."},{"key":"e_1_2_1_10_1","unstructured":"Amazon EC2, \"Spot instance advisor,\" https:\/\/aws.amazon.com\/ec2\/spot\/instance-advisor\/, 2021."}],"container-title":["ACM SIGMETRICS Performance Evaluation Review"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3561074.3561082","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3561074.3561082","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:15Z","timestamp":1750182555000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3561074.3561082"}},"subtitle":["Distributed OnLine Learning Using Preemptible Cloud Instances"],"short-title":[],"issued":{"date-parts":[[2022,8,30]]},"references-count":9,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,8,30]]}},"alternative-id":["10.1145\/3561074.3561082"],"URL":"https:\/\/doi.org\/10.1145\/3561074.3561082","relation":{},"ISSN":["0163-5999"],"issn-type":[{"type":"print","value":"0163-5999"}],"subject":[],"published":{"date-parts":[[2022,8,30]]},"assertion":[{"value":"2022-08-30","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication 
History"}}]}}