{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T21:17:40Z","timestamp":1776115060282,"version":"3.50.1"},"reference-count":56,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,7]]},"abstract":"<jats:p>\n            Despite the increasing success of Machine Learning (ML) techniques in real-world applications, their maintenance over time remains challenging. In particular, the prediction accuracy of deployed ML models can suffer due to significant changes between training and serving data over time, known as\n            <jats:italic>data drift.<\/jats:italic>\n            Traditional data drift solutions primarily focus on detecting drift, and then retraining the ML models, but do not discern whether the detected drift is harmful to model performance. In this paper, we observe that not all data drifts lead to degradation in prediction accuracy. We then introduce a novel approach for identifying portions of data distributions in serving data where drift can be potentially harmful to model performance, which we term Data Distributions with Low Accuracy (DDLA). Our approach, using decision trees, precisely pinpoints low-accuracy zones within ML models, especially Blackbox models. By focusing on these DDLAs, we effectively assess the impact of data drift on model performance and make informed decisions in the ML pipeline. In contrast to existing data drift techniques, we advocate for model retraining only in cases of harmful drifts that detrimentally affect model performance. 
Through extensive experimental evaluations on various datasets and models, our findings demonstrate that our approach significantly improves cost-efficiency over baselines, while achieving comparable accuracy.\n          <\/jats:p>","DOI":"10.14778\/3681954.3681984","type":"journal-article","created":{"date-parts":[[2024,8,30]],"date-time":"2024-08-30T16:23:36Z","timestamp":1725035016000},"page":"3072-3081","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Efficiently Mitigating the Impact of Data Drift on Machine Learning Pipelines"],"prefix":"10.14778","volume":"17","author":[{"given":"Sijie","family":"Dong","sequence":"first","affiliation":[{"name":"Universit\u00e9 Paris Cit\u00e9, Paris, France"}]},{"given":"Qitong","family":"Wang","sequence":"additional","affiliation":[{"name":"Universit\u00e9 Paris Cit\u00e9, Paris, France"}]},{"given":"Soror","family":"Sahri","sequence":"additional","affiliation":[{"name":"Universit\u00e9 Paris Cit\u00e9, Paris, France"}]},{"given":"Themis","family":"Palpanas","sequence":"additional","affiliation":[{"name":"Universit\u00e9 Paris Cit\u00e9, Paris, France"}]},{"given":"Divesh","family":"Srivastava","sequence":"additional","affiliation":[{"name":"AT&T, Bedminster, NJ, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,8,30]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Active Learning Playground. https:\/\/github.com\/google\/active-learning\/tree\/master Last accessed on","year":"2024","unstructured":"2017. Active Learning Playground. https:\/\/github.com\/google\/active-learning\/tree\/master Last accessed on July 28, 2024."},{"key":"e_1_2_1_2_1","volume-title":"House Sales in King County Data. https:\/\/www.kaggle.com\/datasets\/harlfoxem\/housesalesprediction. Last accessed on","year":"2024","unstructured":"2019. House Sales in King County Data. https:\/\/www.kaggle.com\/datasets\/harlfoxem\/housesalesprediction. 
Last accessed on July 28, 2024."},{"key":"e_1_2_1_3_1","volume-title":"Detection of data drift and outliers affecting machine learning model performance over time. arXiv preprint arXiv:2012.09258","author":"Ackerman Samuel","year":"2020","unstructured":"Samuel Ackerman, Eitan Farchi, Orna Raz, Marcel Zalmanovici, and Parijat Dube. 2020. Detection of data drift and outliers affecting machine learning model performance over time. arXiv preprint arXiv:2012.09258 (2020)."},{"key":"e_1_2_1_4_1","volume-title":"Automatically detecting data drift in machine learning classifiers. arXiv preprint arXiv:2111.05672","author":"Ackerman Samuel","year":"2021","unstructured":"Samuel Ackerman, Orna Raz, Marcel Zalmanovici, and Aviad Zlotnick. 2021. Automatically detecting data drift in machine learning classifiers. arXiv preprint arXiv:2111.05672 (2021)."},{"key":"e_1_2_1_5_1","first-page":"435","article-title":"Data classification using Support vector Machine (SVM), a simplified approach","volume":"3","author":"Amarappa S","year":"2014","unstructured":"S Amarappa and SV Sathyanarayana. 2014. Data classification using Support vector Machine (SVM), a simplified approach. Int. J. Electron. Comput. Sci. Eng 3 (2014), 435--445.","journal-title":"Int. J. Electron. Comput. Sci. Eng"},{"key":"e_1_2_1_6_1","volume-title":"international conference on machine learning. PMLR, 301--310","author":"Bachman Philip","year":"2017","unstructured":"Philip Bachman, Alessandro Sordoni, and Adam Trischler. 2017. Learning algorithms for active learning. In international conference on machine learning. PMLR, 301--310."},{"key":"e_1_2_1_7_1","volume-title":"Handling Concept Drift for Predictions in Business Process Mining. 2020 IEEE 22nd Conference on Business Informatics (CBI) 1","author":"Baier Lucas","year":"2020","unstructured":"Lucas Baier, Josua Reimold, and Niklas K\u00fchl. 2020. Handling Concept Drift for Predictions in Business Process Mining. 
2020 IEEE 22nd Conference on Business Informatics (CBI) 1 (2020), 76--83."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.24432\/C5XW20"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.5555\/3379393"},{"key":"e_1_2_1_10_1","volume-title":"Hendrik Patzlaff, Hazar Harmouch, and Felix Naumann.","author":"Budach Lukas","year":"2022","unstructured":"Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Sina Noack, Hendrik Patzlaff, Hazar Harmouch, and Felix Naumann. 2022. The Effects of Data Quality on Machine Learning Performance."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.2022.3175691"},{"key":"e_1_2_1_12_1","volume-title":"Fast two-sample testing with analytic representations of probability measures. Advances in Neural Information Processing Systems 28","author":"Chwialkowski Kacper P","year":"2015","unstructured":"Kacper P Chwialkowski, Aaditya Ramdas, Dino Sejdinovic, and Arthur Gretton. 2015. Fast two-sample testing with analytic representations of probability measures. Advances in Neural Information Processing Systems 28 (2015)."},{"key":"e_1_2_1_13_1","volume-title":"CER Smart Metering Project - Electricity Customer Behaviour Trial","author":"Commission for Energy Regulation (CER). 2012.","year":"2009","unstructured":"Commission for Energy Regulation (CER). 2012. CER Smart Metering Project - Electricity Customer Behaviour Trial, 2009--2010 [dataset]. Irish Social Science Data Archive. https:\/\/www.ucd.ie\/issda\/data\/commissionforenergyregulationcer\/ SN: 0012-00, Last accessed on July 28, 2024."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3359786"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1093\/mnras\/225.1.155"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00176"},{"key":"e_1_2_1_17_1","volume-title":"Distilling a neural network into a soft decision tree. 
arXiv preprint arXiv:1711.09784","author":"Frosst Nicholas","year":"2017","unstructured":"Nicholas Frosst and Geoffrey Hinton. 2017. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784 (2017)."},{"key":"e_1_2_1_18_1","volume-title":"Proceedings 17","author":"Gama Joao","year":"2004","unstructured":"Joao Gama, Pedro Medas, Gladys Castillo, and Pedro Rodrigues. 2004. Learning with drift detection. In Advances in Artificial Intelligence-SBIA 2004: 17th Brazilian Symposium on Artificial Intelligence, Sao Luis, Maranhao, Brazil, September 29-October 1, 2004. Proceedings 17. Springer, 286--295."},{"key":"e_1_2_1_19_1","volume-title":"A survey on concept drift adaptation. ACM Comput. Surv. 46, 4","author":"Gama Jo\u00e3o","year":"2014","unstructured":"Jo\u00e3o Gama, Indre Zliobaite, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. 2014. A survey on concept drift adaptation. ACM Comput. Surv. 46, 4 (2014), 44:1--44:37."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.14778\/2735461.2735467"},{"key":"e_1_2_1_21_1","volume-title":"Krishnan","author":"Ginsberg Tom","year":"2023","unstructured":"Tom Ginsberg, Zhongyuan Liang, and Rahul G. Krishnan. 2023. A Learning Based Hypothesis Test for Harmful Covariate Shift. arXiv:2212.02742 [cs.LG]"},{"key":"e_1_2_1_22_1","first-page":"723","article-title":"A kernel two-sample test","volume":"13","author":"Gretton Arthur","year":"2012","unstructured":"Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Sch\u00f6lkopf, and Alexander Smola. 2012. A kernel two-sample test. The Journal of Machine Learning Research 13, 1 (2012), 723--773.","journal-title":"The Journal of Machine Learning Research"},{"key":"e_1_2_1_23_1","doi-asserted-by":"crossref","unstructured":"Arthur Gretton Alex Smola Jiayuan Huang Marcel Schmittfull Karsten Borgwardt Bernhard Sch\u00f6lkopf et al. 2009. Covariate shift by kernel mean matching. 
Dataset shift in machine learning 3 4 (2009) 5.","DOI":"10.7551\/mitpress\/9780262170055.003.0008"},{"key":"e_1_2_1_24_1","volume-title":"Shanmukha Guttula, Abhinav Jain, Lokesh Nagalapatti, Sameep Mehta, Sandeep Hans, et al.","author":"Gupta Nitin","year":"2021","unstructured":"Nitin Gupta, Hima Patel, Shazia Afzal, Naveen Panwar, Ruhi Sharma Mittal, Shanmukha Guttula, Abhinav Jain, Lokesh Nagalapatti, Sameep Mehta, Sandeep Hans, et al. 2021. Data Quality Toolkit: Automatic assessment of data quality and remediation for machine learning datasets. arXiv preprint arXiv:2108.05935 (2021)."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1609\/hcomp.v7i1.5265"},{"key":"e_1_2_1_26_1","volume-title":"Active learning by querying informative and representative examples. Advances in neural information processing systems 23","author":"Huang Sheng-Jun","year":"2010","unstructured":"Sheng-Jun Huang, Rong Jin, and Zhi-Hua Zhou. 2010. Active learning by querying informative and representative examples. Advances in neural information processing systems 23 (2010)."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/s12650-019-00607-z"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939502.2939503"},{"key":"e_1_2_1_29_1","volume-title":"Xu Chu, Wentao Wu, and Ce Zhang.","author":"Karla\u0161 Bojan","year":"2020","unstructured":"Bojan Karla\u0161, Peng Li, Renzhi Wu, Nezihe Merve G\u00fcrel, Xu Chu, Wentao Wu, and Ce Zhang. 2020. Nearest neighbor classifiers over incomplete information: From certain answers to certain predictions. arXiv preprint arXiv:2005.05117 (2020)."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994514"},{"key":"e_1_2_1_31_1","volume-title":"KDD'17","author":"Lakkaraju Himabindu","year":"2017","unstructured":"Himabindu Lakkaraju, Ece Kamar, Rich Caruana, and Jure Leskovec. 2017. Interpretable & explorable approximations of black box models. 
KDD'17, Workshop on Fairness, Accountability, and Transparency in Machine Learning (2017)."},{"key":"e_1_2_1_32_1","volume-title":"CleanML: A Benchmark for Joint Data Cleaning and Machine Learning [Experiments and Analysis]. CoRR abs\/1904.09483","author":"Li Peng","year":"2019","unstructured":"Peng Li, Xi Rao, Jennifer Blase, Yue Zhang, Xu Chu, and Ce Zhang. 2019. CleanML: A Benchmark for Joint Data Cleaning and Machine Learning [Experiments and Analysis]. CoRR abs\/1904.09483 (2019)."},{"key":"e_1_2_1_33_1","volume-title":"International conference on machine learning. PMLR, 3122--3130","author":"Lipton Zachary","year":"2018","unstructured":"Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. 2018. Detecting and correcting for label shift with black box predictors. In International conference on machine learning. PMLR, 3122--3130."},{"key":"e_1_2_1_34_1","volume-title":"International conference on machine learning. PMLR, 6316--6326","author":"Liu Feng","year":"2020","unstructured":"Feng Liu, Wenkai Xu, Jie Lu, Guangquan Zhang, Arthur Gretton, and Danica J Sutherland. 2020. Learning deep kernels for non-parametric two-sample tests. In International conference on machine learning. PMLR, 6316--6326."},{"key":"e_1_2_1_35_1","volume-title":"Revisiting classifier two-sample tests. ICLR","author":"Lopez-Paz David","year":"2017","unstructured":"David Lopez-Paz and Maxime Oquab. 2017. Revisiting classifier two-sample tests. ICLR (2017)."},{"key":"e_1_2_1_36_1","first-page":"77","article-title":"Matchmaker: Data Drift Mitigation in Machine Learning for Large-Scale Systems","volume":"4","author":"Mallick Ankur","year":"2022","unstructured":"Ankur Mallick, Kevin Hsieh, Behnaz Arzani, and Gauri Joshi. 2022. Matchmaker: Data Drift Mitigation in Machine Learning for Large-Scale Systems. 
Proceedings of Machine Learning and Systems 4 (2022), 77--94.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10618-016-0484-8"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.3102\/1076998619872761"},{"key":"e_1_2_1_39_1","unstructured":"Aleksandr Podkopaev and Aaditya Ramdas. 2022. Tracking the risk of a deployed model and detecting harmful distribution shifts. arXiv:2110.06177 [stat.ML]"},{"key":"e_1_2_1_40_1","volume-title":"Active learning: an empirical study of common baselines. Data mining and knowledge discovery 31","author":"Ramirez-Loaiza Maria E","year":"2017","unstructured":"Maria E Ramirez-Loaiza, Manali Sharma, Geet Kumar, and Mustafa Bilgic. 2017. Active learning: an empirical study of common baselines. Data mining and knowledge discovery 31 (2017), 287--313."},{"key":"e_1_2_1_41_1","volume-title":"Shreyas Padhy, and Balaji Lakshminarayanan.","author":"Ren Jie","year":"2021","unstructured":"Jie Ren, Stanislav Fort, Jeremiah Liu, Abhijit Guha Roy, Shreyas Padhy, and Balaji Lakshminarayanan. 2021. A simple fix to mahalanobis distance for improving near-ood detection. arXiv preprint arXiv:2106.09022 (2021)."},{"key":"e_1_2_1_42_1","volume-title":"Bojan Karla\u0161, Wentao Wu, and Ce Zhang.","author":"Renggli Cedric","year":"2021","unstructured":"Cedric Renggli, Luka Rimanic, Nezihe Merve G\u00fcrel, Bojan Karla\u0161, Wentao Wu, and Ce Zhang. 2021. A Data Quality-Driven View of MLOps. 
arXiv:2102.07750 [cs.LG]"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1214\/21-SS133"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.14778\/3229863.3229867"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380604"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10618-016-0460-3"},{"key":"e_1_2_1_48_1","first-page":"820","article-title":"A Survey on Active Learning: State-of-the-Art, Practical Challenges and Research Directions","volume":"11","author":"Tharwat Alaa","year":"2023","unstructured":"Alaa Tharwat and Wolfram Schenck. 2023. A Survey on Active Learning: State-of-the-Art, Practical Challenges and Research Directions. Mathematics 11, 4 (2023), 820.","journal-title":"Mathematics"},{"key":"e_1_2_1_49_1","volume-title":"Concept drift detection for streaming data. In 2015 international joint conference on neural networks (IJCNN)","author":"Wang Heng","unstructured":"Heng Wang and Zubin Abraham. 2015. Concept drift detection for streaming data. In 2015 international joint conference on neural networks (IJCNN). IEEE, 1--9."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10618-018-0554-1"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1018046501280"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-36669-7_83"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457566"},{"key":"e_1_2_1_54_1","volume-title":"Mechanical MNIST - Distribution Shift. https:\/\/open.bu.edu\/handle\/2144\/44485 Last accessed on","author":"Yuan Lingxiao","year":"2024","unstructured":"Lingxiao Yuan, S. Park, Harold, and Emma Lejeune. 2022. Mechanical MNIST - Distribution Shift. 
https:\/\/open.bu.edu\/handle\/2144\/44485 Last accessed on July 28, 2024."},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cie.2019.106031"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00642"},{"key":"e_1_2_1_57_1","volume-title":"International Conference on Learning Representations.","author":"Zhao Shengjia","year":"2021","unstructured":"Shengjia Zhao, Abhishek Sinha, Yutong He, Aidan Perreault, Jiaming Song, and Stefano Ermon. 2021. Comparing distributions by measuring differences that affect decision making. In International Conference on Learning Representations."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3681954.3681984","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,4]],"date-time":"2024-09-04T18:43:10Z","timestamp":1725475390000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3681954.3681984"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7]]},"references-count":56,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2024,7]]}},"alternative-id":["10.14778\/3681954.3681984"],"URL":"https:\/\/doi.org\/10.14778\/3681954.3681984","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,7]]},"assertion":[{"value":"2024-08-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}