{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T14:10:56Z","timestamp":1773843056180,"version":"3.50.1"},"reference-count":79,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,8]]},"abstract":"<jats:p>\n            Deep learning recommendation models (DLRM) rely on large embedding tables to manage categorical sparse features. Expanding such embedding tables can significantly enhance model performance, but at the cost of increased GPU\/CPU\/memory usage. Meanwhile, tech companies have built extensive cloud-based services to accelerate training DLRM models at scale. In this paper, we conduct a deep investigation of the DLRM training platforms at AntGroup and reveal two critical challenges:\n            <jats:italic>low resource utilization<\/jats:italic>\n            due to suboptimal configurations by users and\n            <jats:italic>the tendency to encounter abnormalities<\/jats:italic>\n            due to an unstable cloud environment. To overcome them, we introduce DLRover, an elastic training framework for DLRMs designed to increase resource utilization and handle the instability of a cloud environment. DLRover develops a resource-performance model by considering the unique characteristics of DLRMs and a three-stage heuristic strategy to automatically allocate and dynamically adjust resources for DLRM training jobs for higher resource utilization. Further, DLRover develops multiple mechanisms to ensure efficient and reliable execution of DLRM training jobs. 
Our extensive evaluation shows that DLRover reduces job completion times by 31%, increases the job completion rate by 6%, enhances CPU usage by 15%, and improves memory utilization by 20%, compared to state-of-the-art resource scheduling frameworks. DLRover has been widely deployed at AntGroup and processes thousands of DLRM training jobs on a daily basis. DLRover is open-sourced and has been adopted by 10+ companies.\n          <\/jats:p>","DOI":"10.14778\/3685800.3685832","type":"journal-article","created":{"date-parts":[[2024,11,8]],"date-time":"2024-11-08T17:25:21Z","timestamp":1731086721000},"page":"4130-4144","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud"],"prefix":"10.14778","volume":"17","author":[{"given":"Qinlong","family":"Wang","sequence":"first","affiliation":[{"name":"Independent Researcher"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tingfeng","family":"Lan","sequence":"additional","affiliation":[{"name":"Sichuan University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yinghao","family":"Tang","sequence":"additional","affiliation":[{"name":"Sichuan University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Bo","family":"Sang","sequence":"additional","affiliation":[{"name":"Independent Researcher"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ziling","family":"Huang","sequence":"additional","affiliation":[{"name":"Sichuan University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yiheng","family":"Du","sequence":"additional","affiliation":[{"name":"Sichuan University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Haitao","family":"Zhang","sequence":"additional","affiliation":[{"name":"Independent 
Researcher"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jian","family":"Sha","sequence":"additional","affiliation":[{"name":"Independent Researcher"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hui","family":"Lu","sequence":"additional","affiliation":[{"name":"The University of Texas at Arlington"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yuanchun","family":"Zhou","sequence":"additional","affiliation":[{"name":"Chinese Academy of Science"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ke","family":"Zhang","sequence":"additional","affiliation":[{"name":"Independent Researcher"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mingjie","family":"Tang","sequence":"additional","affiliation":[{"name":"Sichuan University"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,11,8]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2014. Criteo. http:\/\/labs.criteo.com\/downloads\/2014-kaggle-display-advertising-challenge-dataset\/"},{"key":"e_1_2_1_2_1","unstructured":"2023. Kubeflow: The Machine Learning Toolkit for Kubernetes. https:\/\/www.kubeflow.org\/"},{"key":"e_1_2_1_3_1","unstructured":"2023. Pymoo NSGA-II. https:\/\/pymoo.org\/algorithms\/moo\/nsga2.html"},{"key":"e_1_2_1_4_1","unstructured":"2023. SciPy NNLS. https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.optimize.nnls.html"},{"key":"e_1_2_1_5_1","unstructured":"2024. Accelerate Enterprise AI Path to Production. https:\/\/www.alluxio.io\/"},{"key":"e_1_2_1_6_1","unstructured":"2024. DeepRec. https:\/\/github.com\/DeepRec-AI\/DeepRec"},{"key":"e_1_2_1_7_1","unstructured":"Mart\u00edn Abadi and Ashish Agarwal et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 
https:\/\/www.tensorflow.org\/ Software available from tensorflow.org."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00072"},{"key":"e_1_2_1_9_1","volume-title":"Divya Mahajan, and Prashant J Nair.","author":"Adnan Muhammad","year":"2021","unstructured":"Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, and Prashant J Nair. 2021. Accelerating recommendation system training by leveraging popular choices. arXiv preprint arXiv:2103.00686 (2021)."},{"key":"e_1_2_1_10_1","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Annamalai Muthukaruppan","year":"2018","unstructured":"Muthukaruppan Annamalai, Kaushik Ravichandran, Harish Srinivas, Igor Zinkovsky, Luning Pan, Tony Savor, David Nagle, and Michael Stumm. 2018. Sharding the Shards: Managing Datastore Locality at Scale with Akkio. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 445--460. https:\/\/www.usenix.org\/conference\/osdi18\/presentation\/annamalai"},{"key":"e_1_2_1_11_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3394138","article-title":"Practical privacy preserving POI recommendation","volume":"11","author":"Chen Chaochao","year":"2020","unstructured":"Chaochao Chen, Jun Zhou, Bingzhe Wu, Wenjing Fang, Li Wang, Yuan Qi, and Xiaolin Zheng. 2020. Practical privacy preserving POI recommendation. ACM Transactions on Intelligent Systems and Technology (TIST) 11, 5 (2020), 1--20.","journal-title":"ACM Transactions on Intelligent Systems and Technology (TIST)"},{"key":"e_1_2_1_12_1","unstructured":"Jianmin Chen Xinghao Pan Rajat Monga Samy Bengio and Rafal Jozefowicz. 2017. Revisiting Distributed Synchronous SGD. 
arXiv:1604.00981 [cs.LG]"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2988450.2988454"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620678.3624661"},{"key":"e_1_2_1_15_1","volume-title":"Notes from the AI frontier: Insights from hundreds of use cases","author":"Chui Michael","year":"2018","unstructured":"Michael Chui, James Manyika, Mehdi Miremadi, Nicolaus Henke, Rita Chung, Pieter Nel, and Sankalp Malhotra. 2018. Notes from the AI frontier: Insights from hundreds of use cases. McKinsey Global Institute 2 (2018)."},{"key":"e_1_2_1_16_1","volume-title":"Toward understanding the impact of staleness in distributed machine learning. arXiv preprint arXiv:1810.03264","author":"Dai Wei","year":"2018","unstructured":"Wei Dai, Yi Zhou, Nanqing Dong, Hao Zhang, and Eric P Xing. 2018. Toward understanding the impact of staleness in distributed machine learning. arXiv preprint arXiv:1810.03264 (2018)."},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation","author":"Gu Juncheng","year":"2019","unstructured":"Juncheng Gu, Mosharaf Chowdhury, Kang G. Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Liu, and Chuanxiong Guo. 2019. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. In Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation (Boston, MA, USA) (NSDI'19). 
USENIX Association, USA, 485--500."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2017\/239"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3326285.3329074"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00047"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447548.3467080"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2987550.2987554"},{"key":"e_1_2_1_23_1","unstructured":"Zhaoxin Huan Ke Ding Ang Li Xiaolu Zhang Xu Min Yong He Liang Zhang Jun Zhou Linjian Mo Jinjie Gu et al. 2023. AntM2C: A Large Scale Dataset For Multi-Scenario Multi-Modal CTR Prediction. arXiv preprint arXiv:2308.16437 (2023)."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2749432"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447548.3467084"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10639-019-10063-9"},{"key":"e_1_2_1_27_1","volume-title":"17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)","author":"Lai Fan","year":"2023","unstructured":"Fan Lai, Wei Zhang, Rui Liu, William Tsai, Xiaohan Wei, Yuxi Hu, Sabin Devkota, Jianyu Huang, Jongsoo Park, Xing Liu, et al. 2023. {AdaEmbed}: Adaptive Embedding for {Large-Scale} Recommendation Models. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). 817--831."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.14778\/3659437.3659449"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3552326.3587445"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611540.3611565"},{"key":"e_1_2_1_31_1","volume-title":"Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI '14","author":"Li Mu","year":"2014","unstructured":"Mu Li, David G. 
Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI '14, Broomfield, CO, USA, October 6--8, 2014, Jason Flinn and Hank Levy (Eds.). USENIX Association, 583--598. https:\/\/www.usenix.org\/conference\/osdi14\/technicalsessions\/presentation\/li_mu"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","unstructured":"Mingzhen Li Wencong Xiao Hailong Yang Biao Sun Hanyu Zhao Shiru Ren Zhongzhi Luan Xianyan Jia Yi Liu Yong Li Wei Lin and Depei Qian. 2023. EasyScale: Elastic Training with Consistent Accuracy and Improved Utilization on GPUs. In Proceedings of the International Conference for High Performance Computing Networking Storage and Analysis (Denver, CO, USA) (SC '23). Association for Computing Machinery New York NY USA Article 55 14 pages. 10.1145\/3581784.3607054","DOI":"10.1145\/3581784.3607054"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3220023"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3267809.3267830"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS51385.2021.00033"},{"key":"e_1_2_1_36_1","volume-title":"CNN Money http:\/\/tech.fortune.cnn.com\/2012\/07\/30\/amazon-5","author":"Mangalindan JP","year":"2012","unstructured":"JP Mangalindan. 2012. Amazon's recommendation secret. CNN Money http:\/\/tech.fortune.cnn.com\/2012\/07\/30\/amazon-5 (2012)."},{"key":"e_1_2_1_37_1","article-title":"MLlib: Machine Learning in Apache Spark","volume":"17","author":"Meng Xiangrui","year":"2016","unstructured":"Xiangrui Meng, Joseph K. Bradley, Burak Yavuz, Evan R. Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, D. B. 
Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. 2016. MLlib: Machine Learning in Apache Spark. J. Mach. Learn. Res. 17 (2016), 34:1--34:7. http:\/\/jmlr.org\/papers\/v17\/15-237.html","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3470496.3533727"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSC.2018.2816644"},{"key":"e_1_2_1_40_1","unstructured":"Maxim Naumov John Kim Dheevatsa Mudigere Srinivas Sridharan Xiaodong Wang Whitney Zhao Serhat Yilmaz Changkyu Kim Hector Yuen Mustafa Ozdal et al. 2020. Deep learning training in facebook data centers: Design of scale-up and scale-out systems. arXiv preprint arXiv:2003.09518 (2020)."},{"key":"e_1_2_1_41_1","volume-title":"Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G Azzolini, et al.","author":"Naumov Maxim","year":"2019","unstructured":"Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G Azzolini, et al. 2019. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091 (2019)."},{"key":"e_1_2_1_42_1","unstructured":"Xiaonan Nie Yi Liu Fangcheng Fu Jinbao Xue Dian Jiao Xupeng Miao Yangyu Tao and Bin Cui. 2023. Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent. arXiv:2303.02868 [cs.LG]"},{"key":"e_1_2_1_43_1","volume-title":"Proceedings of Machine Learning and Systems 2020","author":"Or Andrew","year":"2020","unstructured":"Andrew Or, Haoyu Zhang, and Michael J. Freedman. 2020. Resource Elasticity in Distributed Deep Learning. In Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2--4, 2020, Inderjit S. Dhillon, Dimitris S. Papailiopoulos, and Vivienne Sze (Eds.). mlsys.org. 
https:\/\/proceedings.mlsys.org\/book\/314.pdf"},{"key":"e_1_2_1_44_1","unstructured":"Adam Paszke and Gross et al. 2019. PyTorch: An Imperative Style High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32 H. Wallach H. Larochelle A. Beygelzimer F. d'Alch\u00e9-Buc E. Fox and R. Garnett (Eds.). Curran Associates Inc. 8024--8035."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3190508.3190517"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2017.12.020"},{"key":"e_1_2_1_47_1","unstructured":"PyTorch. 2020. Pytorch with Elastic. https:\/\/pytorch.org\/elastic\/0.1.0rc2\/overview.html."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-10665-1_63"},{"key":"e_1_2_1_49_1","volume-title":"Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning. In 15th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2021","author":"Qiao Aurick","year":"2021","unstructured":"Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, and Eric P. Xing. 2021. Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning. In 15th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2021, July 14--16, 2021, Angela Demke Brown and Jay R. Lorch (Eds.). USENIX Association. https:\/\/www.usenix.org\/conference\/osdi21\/presentation\/qiao"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/2391229.2391236"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2020.03.284"},{"key":"e_1_2_1_52_1","volume-title":"Cougar: A General Framework for Jobs Optimization In Cloud. In 2023 IEEE 39th International Conference on Data Engineering (ICDE). 
IEEE, 3417--3429","author":"Sang Bo","year":"2023","unstructured":"Bo Sang, Shuwei Gu, Xiaojun Zhan, Mingjie Tang, Jian Liu, Xuan Chen, Jie Tan, Haoyuan Ge, Ke Zhang, Ruoyi Ruan, et al. 2023. Cougar: A General Framework for Jobs Optimization In Cloud. In 2023 IEEE 39th International Conference on Data Engineering (ICDE). IEEE, 3417--3429."},{"key":"e_1_2_1_53_1","volume-title":"Horovod: fast and easy distributed deep learning in TensorFlow. CoRR abs\/1802.05799","author":"Sergeev Alexander","year":"2018","unstructured":"Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR abs\/1802.05799 (2018). arXiv:1802.05799 http:\/\/arxiv.org\/abs\/1802.05799"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE48307.2020.00146"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSST.2010.5496972"},{"key":"e_1_2_1_56_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Sima Chijun","year":"2022","unstructured":"Chijun Sima, Yao Fu, Man-Kit Sit, Liyi Guo, Xuri Gong, Feng Lin, Junyu Wu, Yongsheng Li, Haidong Rong, Pierre-Louis Aublin, et al. 2022. Ekko: A {Large-Scale} deep learning recommender system with {Low-Latency} model update. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 821--839."},{"key":"e_1_2_1_57_1","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Tang Chunqiang","year":"2020","unstructured":"Chunqiang Tang, Kenny Yu, Kaushik Veeraraghavan, Jonathan Kaldor, Scott Michelson, Thawan Kooburat, Aravind Anbudurai, Matthew Clark, Kabir Gogia, Long Cheng, et al. 2020. Twine: A unified cluster management system for shared infrastructure. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 
787--803."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/2901318.2901355"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.14778\/3579075.3579083"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/2741948.2741964"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3472883.3486987"},{"key":"e_1_2_1_62_1","doi-asserted-by":"crossref","unstructured":"Qinlong Wang Tingfeng Lan Yinghao Tang Ziling Huang Yiheng Du Haitao Zhang Jian Sha Hui Lu Yuanchun Zhou Ke Zhang and Mingjie Tang. 2024. DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud. arXiv:2304.01468 [cs.DC] https:\/\/arxiv.org\/abs\/2304.01468","DOI":"10.14778\/3685800.3685832"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3124749.3124754"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613145"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613145"},{"key":"e_1_2_1_66_1","volume-title":"19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)","author":"Weng Qizhen","year":"2022","unstructured":"Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. 2022. {MLaaS} in the wild: Workload analysis and scheduling in {Large-Scale } heterogeneous {GPU} clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). 945--960."},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3064966"},{"key":"e_1_2_1_68_1","volume-title":"Ping Tak Peter Tang, and Andrew Tulloch","author":"Yang Jie Amy","year":"2020","unstructured":"Jie Amy Yang, Jianyu Huang, Jongsoo Park, Ping Tak Peter Tang, and Andrew Tulloch. 2020. Mixed-precision embedding using a cache. 
arXiv preprint arXiv:2010.11305 (2020)."},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSC.2022.3188414"},{"key":"e_1_2_1_70_1","first-page":"448","article-title":"Tt-rec: Tensor train compression for deep learning recommendation models","volume":"3","author":"Yin Chunxing","year":"2021","unstructured":"Chunxing Yin, Bilge Acun, Carole-Jean Wu, and Xing Liu. 2021. Tt-rec: Tensor train compression for deep learning recommendation models. Proceedings of Machine Learning and Systems 3 (2021), 448--462.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_2_1_71_1","volume-title":"Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12)","author":"Zaharia Matei","year":"2012","unstructured":"Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). USENIX Association, San Jose, CA, 15--28."},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1145\/3534678.3539034"},{"key":"e_1_2_1_73_1","first-page":"15190","article-title":"Dreamshard: Generalizable embedding table placement for recommender systems","volume":"35","author":"Zha Daochen","year":"2022","unstructured":"Daochen Zha, Louis Feng, Qiaoyu Tan, Zirui Liu, Kwei-Herng Lai, Bhargav Bhushanam, Yuandong Tian, Arun Kejariwal, and Xia Hu. 2022. Dreamshard: Generalizable embedding table placement for recommender systems. 
Advances in Neural Information Processing Systems 35 (2022), 15190--15203.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/3437963.3441761"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1145\/3542929.3563465"},{"key":"e_1_2_1_76_1","first-page":"412","article-title":"Distributed hierarchical gpu parameter server for massive scale deep learning ads systems","volume":"2","author":"Zhao Weijie","year":"2020","unstructured":"Weijie Zhao, Deping Xie, Ronglai Jia, Yulei Qian, Ruiquan Ding, Mingming Sun, and Ping Li. 2020. Distributed hierarchical gpu parameter server for massive scale deep learning ads systems. Proceedings of Machine Learning and Systems 2 (2020), 412--428.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3357384.3358045"},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33015941"},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3219823"}],"container-title":["Proceedings of the VLDB 
Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3685800.3685832","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,12,31]],"date-time":"2024-12-31T05:31:07Z","timestamp":1735623067000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3685800.3685832"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8]]},"references-count":79,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2024,8]]}},"alternative-id":["10.14778\/3685800.3685832"],"URL":"https:\/\/doi.org\/10.14778\/3685800.3685832","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,8]]},"assertion":[{"value":"2024-11-08","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}