{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,20]],"date-time":"2026-01-20T10:28:27Z","timestamp":1768904907306,"version":"3.49.0"},"reference-count":48,"publisher":"Association for Computing Machinery (ACM)","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2021,11]]},"abstract":"<jats:p>Data imputation has been extensively explored to solve the missing data problem. The dramatically rising volume of missing data makes the training of imputation models computationally infeasible in real-life scenarios. In this paper, we propose an efficient and effective data imputation system with <jats:italic>influence functions<\/jats:italic>, named EDIT, which quickly trains a parametric imputation model with representative samples under imputation accuracy guarantees. EDIT mainly consists of two modules, i.e., an <jats:italic>imputation influence evaluation<\/jats:italic> (IIE) module and a <jats:italic>representative sample selection<\/jats:italic> (RSS) module. IIE leverages the influence functions to estimate the effect of (in)complete samples on the prediction result of parametric imputation models. RSS builds a minimum set of the high-effect samples to satisfy a user-specified imputation accuracy. Moreover, we introduce a weighted loss function that drives the parametric imputation model to pay more attention to the high-effect samples. 
Extensive experiments on ten state-of-the-art imputation methods demonstrate that EDIT uses only about 5% of the samples to speed up model training by 4x on average, with more than 11% accuracy gain.<\/jats:p>","DOI":"10.14778\/3494124.3494143","type":"journal-article","created":{"date-parts":[[2022,2,5]],"date-time":"2022-02-05T00:31:46Z","timestamp":1644021106000},"page":"624-632","source":"Crossref","is-referenced-by-count":29,"title":["Efficient and effective data imputation with influence functions"],"prefix":"10.14778","volume":"15","author":[{"given":"Xiaoye","family":"Miao","sequence":"first","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]},{"given":"Yangyang","family":"Wu","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]},{"given":"Lu","family":"Chen","sequence":"additional","affiliation":[{"name":"The Hong Kong University of Science and Technology, Hong Kong, China"}]},{"given":"Yunjun","family":"Gao","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]},{"given":"Jun","family":"Wang","sequence":"additional","affiliation":[{"name":"The Hong Kong University of Science and Technology, Hong Kong, China"}]},{"given":"Jianwei","family":"Yin","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]}],"member":"320","published-online":{"date-parts":[[2022,2,4]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"crossref","first-page":"175","DOI":"10.1080\/00031305.1992.10475879","article-title":"An introduction to kernel and nearest-neighbor non-parametric regression","volume":"46","author":"Altman Naomi S","year":"1992","unstructured":"Naomi S Altman . 1992 . An introduction to kernel and nearest-neighbor non-parametric regression . The American Statistician 46 , 3 (1992), 175 -- 185 . Naomi S Altman. 1992. An introduction to kernel and nearest-neighbor non-parametric regression. 
The American Statistician 46, 3 (1992), 175--185.","journal-title":"The American Statistician"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.5555\/3305381.3305404"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.14778\/3204028.3204032"},{"key":"e_1_2_1_4_1","first-page":"1","article-title":"DataWig: Missing value imputation for tables","volume":"20","author":"Biessmann Felix","year":"2019","unstructured":"Felix Biessmann , Tammo Rukat , Philipp Schmidt , Prathik Naidu , Sebastian Schelter , Andrey Taptunov , Dustin Lange , and David Salinas . 2019 . DataWig: Missing value imputation for tables . Journal of Machine Learning Research 20 , 1 (2019), 1 -- 6 . Felix Biessmann, Tammo Rukat, Philipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, and David Salinas. 2019. DataWig: Missing value imputation for tables. Journal of Machine Learning Research 20, 1 (2019), 1--6.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_1_5_1","unstructured":"Muzellec Boris Josse Julie Boyer Claire and Cuturi Marco. 2020. Missing data imputation using optimal transport. In ICML. 1--18. Muzellec Boris Josse Julie Boyer Claire and Cuturi Marco. 2020. Missing data imputation using optimal transport. In ICML. 1--18."},{"key":"e_1_2_1_6_1","volume-title":"Importance weighted autoencoders. ArXiv Preprint ArXiv:1509.00519","author":"Burda Yuri","year":"2015","unstructured":"Yuri Burda , Roger Grosse , and Ruslan Salakhutdinov . 2015. Importance weighted autoencoders. ArXiv Preprint ArXiv:1509.00519 ( 2015 ). Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. 2015. Importance weighted autoencoders. 
ArXiv Preprint ArXiv:1509.00519 (2015)."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939785"},{"key":"e_1_2_1_8_1","volume-title":"Jastin Pompeu Soares, and Pedro Henriques Abreu.","author":"Costa Adriana Fonseca","year":"2018","unstructured":"Adriana Fonseca Costa , Miriam Seoane Santos , Jastin Pompeu Soares, and Pedro Henriques Abreu. 2018 . Missing data imputation via denoising autoencoders: The untold story. In IDA. 87--98. Adriana Fonseca Costa, Miriam Seoane Santos, Jastin Pompeu Soares, and Pedro Henriques Abreu. 2018. Missing data imputation via denoising autoencoders: The untold story. In IDA. 87--98."},{"key":"e_1_2_1_9_1","unstructured":"CriteoLabs. 2014. http:\/\/labs.criteo.com\/2014\/02\/download-kaggle-display-advertising-challenge-dataset\/. (2014). CriteoLabs. 2014. http:\/\/labs.criteo.com\/2014\/02\/download-kaggle-display-advertising-challenge-dataset\/. (2014)."},{"key":"e_1_2_1_10_1","unstructured":"Whiteson Daniel. 2014. https:\/\/archive.ics.uci.edu\/ml\/datasets\/HIGGS. (2014). Whiteson Daniel. 2014. https:\/\/archive.ics.uci.edu\/ml\/datasets\/HIGGS. (2014)."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-021-00159-z"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSMCA.2007.902631"},{"key":"e_1_2_1_13_1","unstructured":"Hebrail Georges and Berard Alice. 2012. https:\/\/archive.ics.uci.edu\/ml\/datasets\/Individual+household+electric+power+consumption. (2012). Hebrail Georges and Berard Alice. 2012. https:\/\/archive.ics.uci.edu\/ml\/datasets\/Individual+household+electric+power+consumption. (2012)."},{"key":"e_1_2_1_14_1","volume-title":"Multiple imputation using deep denoising autoencoders. ArXiv Preprint ArXiv:1705.02737","author":"Gondara Lovedeep","year":"2017","unstructured":"Lovedeep Gondara and Ke Wang . 2017. Multiple imputation using deep denoising autoencoders. ArXiv Preprint ArXiv:1705.02737 ( 2017 ). Lovedeep Gondara and Ke Wang. 2017. 
Multiple imputation using deep denoising autoencoders. ArXiv Preprint ArXiv:1705.02737 (2017)."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969033.2969125"},{"key":"e_1_2_1_16_1","volume-title":"Reducing the dimensionality of data with neural networks. Science 313, 5786","author":"Hinton Geoffrey E","year":"2006","unstructured":"Geoffrey E Hinton and Ruslan R Salakhutdinov . 2006. Reducing the dimensionality of data with neural networks. Science 313, 5786 ( 2006 ), 504--507. Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science 313, 5786 (2006), 504--507."},{"key":"e_1_2_1_17_1","unstructured":"Burgu Javier. 2019. https:\/\/archive.ics.uci.edu\/ml\/datasets\/Gas+sensor+array+temperature+modulation. (2019). Burgu Javier. 2019. https:\/\/archive.ics.uci.edu\/ml\/datasets\/Gas+sensor+array+temperature+modulation. (2019)."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.artmed.2010.05.002"},{"key":"e_1_2_1_19_1","volume-title":"Hubert HK Teo, and Percy Liang","author":"Koh Pang Wei","year":"2019","unstructured":"Pang Wei Koh , Kai-Siang Ang , Hubert HK Teo, and Percy Liang . 2019 . On the accuracy of influence functions for measuring group effects. ArXiv Preprint ArXiv :1905.13289 (2019). Pang Wei Koh, Kai-Siang Ang, Hubert HK Teo, and Percy Liang. 2019. On the accuracy of influence functions for measuring group effects. ArXiv Preprint ArXiv:1905.13289 (2019)."},{"key":"e_1_2_1_20_1","volume-title":"Understanding black-box predictions via influence functions. ArXiv Preprint ArXiv:1703.04730","author":"Koh Pang Wei","year":"2017","unstructured":"Pang Wei Koh and Percy Liang . 2017. Understanding black-box predictions via influence functions. ArXiv Preprint ArXiv:1703.04730 ( 2017 ). Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. 
ArXiv Preprint ArXiv:1703.04730 (2017)."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.14778\/3450980.3450989"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-021-00165-1"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407801"},{"key":"e_1_2_1_24_1","volume-title":"MIWAE: Deep generative modelling and imputation of incomplete data sets. In ICML. 4413--4423.","author":"Mattei Pierre-Alexandre","year":"2019","unstructured":"Pierre-Alexandre Mattei and Jes Frellsen . 2019 . MIWAE: Deep generative modelling and imputation of incomplete data sets. In ICML. 4413--4423. Pierre-Alexandre Mattei and Jes Frellsen. 2019. MIWAE: Deep generative modelling and imputation of incomplete data sets. In ICML. 4413--4423."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ifacol.2018.09.406"},{"key":"e_1_2_1_26_1","doi-asserted-by":"crossref","unstructured":"Xiaoye Miao Yangyang Wu Jun Wang Yunjun Gao Xudong Mao and Jianwei Yin. 2021. Generative semi-supervised learning for multivariate time series imputation. In AAAI. 8983--8991. Xiaoye Miao Yangyang Wu Jun Wang Yunjun Gao Xudong Mao and Jianwei Yin. 2021. Generative semi-supervised learning for multivariate time series imputation. In AAAI. 8983--8991.","DOI":"10.1609\/aaai.v35i10.17086"},{"key":"e_1_2_1_27_1","volume-title":"Handling incomplete heterogeneous data using VAEs. ArXiv Preprint ArXiv:1807.03653","author":"Nazabal Alfredo","year":"2018","unstructured":"Alfredo Nazabal , Pablo M Olmos , Zoubin Ghahramani , and Isabel Valera . 2018. Handling incomplete heterogeneous data using VAEs. ArXiv Preprint ArXiv:1807.03653 ( 2018 ). Alfredo Nazabal, Pablo M Olmos, Zoubin Ghahramani, and Isabel Valera. 2018. Handling incomplete heterogeneous data using VAEs. 
ArXiv Preprint ArXiv:1807.03653 (2018)."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137631"},{"key":"e_1_2_1_29_1","volume-title":"Schwing","author":"Ren Zhongzheng","year":"2020","unstructured":"Zhongzheng Ren , Raymond A. Yeh , and Alexander G . Schwing . 2020 . Not all unlabeled data are equal: Learning to weight data in semi-supervised learning. In NeurIPS. 1--12. Zhongzheng Ren, Raymond A. Yeh, and Alexander G. Schwing. 2020. Not all unlabeled data are equal: Learning to weight data in semi-supervised learning. In NeurIPS. 1--12."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.18637\/jss.v045.i04"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/63.3.581"},{"key":"e_1_2_1_32_1","volume-title":"An overview of gradient descent optimization algorithms. ArXiv Preprint ArXiv:1609.04747","author":"Ruder Sebastian","year":"2016","unstructured":"Sebastian Ruder . 2016. An overview of gradient descent optimization algorithms. ArXiv Preprint ArXiv:1609.04747 ( 2016 ). Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. ArXiv Preprint ArXiv:1609.04747 (2016)."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536274.2536320"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-009-0176-8"},{"key":"e_1_2_1_35_1","volume-title":"Missing data imputation with adversarially-trained graph convolutional networks. ArXiv Preprint ArXiv:1905.01907","author":"Spinelli Indro","year":"2019","unstructured":"Indro Spinelli , Simone Scardapane , and Aurelio Uncini . 2019. Missing data imputation with adversarially-trained graph convolutional networks. ArXiv Preprint ArXiv:1905.01907 ( 2019 ). Indro Spinelli, Simone Scardapane, and Aurelio Uncini. 2019. Missing data imputation with adversarially-trained graph convolutional networks. 
ArXiv Preprint ArXiv:1905.01907 (2019)."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btr597"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1080\/08839510902872223"},{"key":"e_1_2_1_38_1","unstructured":"B Twala M Cartwright and M Shepperd. 2005. Comparison of various methods for handling incomplete data in software engineering databases. In ESEM. 234--239. B Twala M Cartwright and M Shepperd. 2005. Comparison of various methods for handling incomplete data in software engineering databases. In ESEM. 234--239."},{"key":"e_1_2_1_39_1","unstructured":"O. Wahltinez et al. 2020. COVID-19 Open-Data: Curating a fine-grained global-scale data repository for SARS-CoV-2. (2020). https:\/\/goo.gle\/covid-19-open-data Work in progress. O. Wahltinez et al. 2020. COVID-19 Open-Data: Curating a fine-grained global-scale data repository for SARS-CoV-2. (2020). https:\/\/goo.gle\/covid-19-open-data Work in progress."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536206.2536212"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/3342263.3342626"},{"key":"e_1_2_1_42_1","doi-asserted-by":"crossref","unstructured":"Mike Wojnowicz Ben Cruz Xuan Zhao Brian Wallace Matt Wolff Jay Luan and Caleb Crable. 2016. Influence sketching: Finding influential samples in large-scale regressions. In Big Data. 3601--3612. Mike Wojnowicz Ben Cruz Xuan Zhao Brian Wallace Matt Wolff Jay Luan and Caleb Crable. 2016. Influence sketching: Finding influential samples in large-scale regressions. In Big Data. 3601--3612.","DOI":"10.1109\/BigData.2016.7841024"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2017.04.005"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-021-00154-4"},{"key":"e_1_2_1_45_1","volume-title":"GAIN: Missing data imputation using generative adversarial nets. In ICML. 
5675--5684.","author":"Yoon Jinsung","year":"2018","unstructured":"Jinsung Yoon , James Jordon , and Mihaela Schaar . 2018 . GAIN: Missing data imputation using generative adversarial nets. In ICML. 5675--5684. Jinsung Yoon, James Jordon, and Mihaela Schaar. 2018. GAIN: Missing data imputation using generative adversarial nets. In ICML. 5675--5684."},{"key":"e_1_2_1_46_1","doi-asserted-by":"crossref","unstructured":"Aoqian Zhang Shaoxu Song Yu Sun and Jianmin Wang. 2019. Learning individual models for imputation. In ICDE. 160--171. Aoqian Zhang Shaoxu Song Yu Sun and Jianmin Wang. 2019. Learning individual models for imputation. In ICDE. 160--171.","DOI":"10.1109\/ICDE.2019.00023"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-020-00144-y"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.5555\/520809.796139"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3494124.3494143","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,17]],"date-time":"2024-09-17T22:34:54Z","timestamp":1726612494000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3494124.3494143"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,11]]},"references-count":48,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,11]]}},"alternative-id":["10.14778\/3494124.3494143"],"URL":"https:\/\/doi.org\/10.14778\/3494124.3494143","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2021,11]]}}}