{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T01:19:35Z","timestamp":1760059175992,"version":"build-2065373602"},"reference-count":30,"publisher":"MDPI AG","issue":"6","license":[{"start":{"date-parts":[[2025,5,29]],"date-time":"2025-05-29T00:00:00Z","timestamp":1748476800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Shanghai Baiyulan Talent Project Pujiang Program","award":["24PJD115"],"award-info":[{"award-number":["24PJD115"]}]},{"name":"Shanghai Yangpu District Postdoctoral Innovation & Practice Base Project","award":["24PJD115"],"award-info":[{"award-number":["24PJD115"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Future Internet"],"abstract":"<jats:p>Data corruption, including missing and noisy entries, is a common challenge in real-world machine learning. This paper examines its impact and mitigation strategies through two experimental setups: supervised NLP tasks (NLP-SL) and deep reinforcement learning for traffic signal control (Signal-RL). This study analyzes how varying corruption levels affect model performance, evaluates imputation strategies, and assesses whether expanding datasets can counteract corruption effects. The results indicate that performance degradation follows a diminishing-return pattern, well modeled by an exponential function. Noisy data harm performance more than missing data, especially in sequential tasks like Signal-RL where errors may compound. Imputation helps recover missing data but can introduce noise, with its effectiveness depending on corruption severity and imputation accuracy. This study identifies clear boundaries between when imputation is beneficial versus harmful, and classifies tasks as either noise-sensitive or noise-insensitive. 
Larger datasets reduce corruption effects but offer diminishing gains at high corruption levels. These insights guide the design of robust systems, emphasizing smart data collection, imputation decisions, and preprocessing strategies in noisy environments.<\/jats:p>","DOI":"10.3390\/fi17060241","type":"journal-article","created":{"date-parts":[[2025,5,29]],"date-time":"2025-05-29T09:47:34Z","timestamp":1748512054000},"page":"241","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Navigating Data Corruption in Machine Learning: Balancing Quality, Quantity, and Imputation Strategies"],"prefix":"10.3390","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1587-8524","authenticated-orcid":false,"given":"Qi","family":"Liu","sequence":"first","affiliation":[{"name":"Key Laboratory of Road and Traffic Engineering of the Ministry of Education, College of Transportation, Tongji University, Shanghai 200092, China"}]},{"given":"Wanjing","family":"Ma","sequence":"additional","affiliation":[{"name":"Key Laboratory of Road and Traffic Engineering of the Ministry of Education, College of Transportation, Tongji University, Shanghai 200092, China"}]}],"member":"1968","published-online":{"date-parts":[[2025,5,29]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Little, R.J.A., and Rubin, D.B. (2019). Statistical Analysis with Missing Data, Wiley. [3rd ed.].","DOI":"10.1002\/9781119482260"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s40537-021-00516-9","article-title":"A Survey on Missing Data in Machine Learning","volume":"8","author":"Emmanuel","year":"2021","journal-title":"J. Big Data"},{"key":"ref_3","unstructured":"Brown, T.B. (2020). Language Models Are Few-shot Learners. arXiv."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Pathak, D., Agrawal, P., Efros, A.A., and Darrell, T. (2017, January 6\u201311). 
Curiosity-driven Exploration by Self-supervised Prediction. Proceedings of the International Conference on Machine Learning, Sydney, Australia.","DOI":"10.1109\/CVPRW.2017.70"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"581","DOI":"10.1093\/biomet\/63.3.581","article-title":"Inference and Missing Data","volume":"63","author":"Rubin","year":"1976","journal-title":"Biometrika"},{"key":"ref_6","unstructured":"Bishop, C.M., and Nasrabadi, N.M. (2006). Pattern Recognition and Machine Learning, Springer."},{"key":"ref_7","unstructured":"Goodfellow, I.J., Shlens, J., and Szegedy, C. (2014). Explaining and Harnessing Adversarial Examples. arXiv."},{"key":"ref_8","unstructured":"Moon, T.K., and Stirling, W.C. (2000). Mathematical Methods and Algorithms for Signal Processing, Prentice Hall."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"8135","DOI":"10.1109\/TNNLS.2022.3152527","article-title":"Learning from Noisy Labels with Deep Neural Networks: A Survey","volume":"34","author":"Song","year":"2022","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"64","DOI":"10.1162\/tacl_a_00300","article-title":"SpanBERT: Improving Pre-training by Representing and Predicting Spans","volume":"8","author":"Joshi","year":"2020","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_11","unstructured":"Devlin, J. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv."},{"key":"ref_12","first-page":"arXiv:1907.11692","article-title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","volume":"364","author":"Liu","year":"2019","journal-title":"arXiv"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Petroni, F., Piktus, A., Fan, A., Lewis, P., Yazdani, M., De Cao, N., Thorne, J., Jernite, Y., Karpukhin, V., and Maillard, J. (2020). KILT: A Benchmark for Knowledge Intensive Language Tasks. 
arXiv.","DOI":"10.18653\/v1\/2021.naacl-main.200"},{"key":"ref_14","first-page":"141","article-title":"Deep Recurrent Q-learning for Partially Observable MDPs","volume":"45","author":"Hausknecht","year":"2015","journal-title":"AAAI Fall Symp. Ser."},{"key":"ref_15","unstructured":"Bai, X., Guan, J., and Wang, H. (2019, January 8\u201314). A Model-Based Reinforcement Learning with Adversarial Training for Online Recommendation. Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"529","DOI":"10.1038\/nature14236","article-title":"Human-level Control Through Deep Reinforcement Learning","volume":"518","author":"Mnih","year":"2015","journal-title":"Nature"},{"key":"ref_17","first-page":"1633","article-title":"Transfer Learning for Reinforcement Learning Domains: A Survey","volume":"10","author":"Taylor","year":"2009","journal-title":"J. Mach. Learn. Res."},{"key":"ref_18","unstructured":"Zhou, Y., Aryal, S., and Bouadjenek, M.R. (2024). Review for Handling Missing Data with Special Missing Mechanism. arXiv."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"147","DOI":"10.1037\/1082-989X.7.2.147","article-title":"Missing Data: Our View of the State of the Art","volume":"7","author":"Schafer","year":"2002","journal-title":"Psychol. Methods"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys, Wiley.","DOI":"10.1002\/9780470316696"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"520","DOI":"10.1093\/bioinformatics\/17.6.520","article-title":"Missing Value Estimation Methods for DNA Microarrays","volume":"17","author":"Troyanskaya","year":"2001","journal-title":"Bioinformatics"},{"key":"ref_22","unstructured":"Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984). 
Classification and Regression Trees, Wadsworth International Group."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random Forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach. Learn."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008, January 5\u20139). Extracting and Composing Robust Features with Denoising Autoencoders. Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland.","DOI":"10.1145\/1390156.1390294"},{"key":"ref_25","first-page":"2672","article-title":"Generative Adversarial Nets","volume":"27","author":"Goodfellow","year":"2014","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_26","unstructured":"Yuan, J., Wang, R., and Zhang, Y. (2021, January 7\u201311). Missing Token Imputation Using Masked Language Models. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Virtual Event."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Che, Z., Purushotham, S., Cho, K., Sontag, D., and Liu, Y. (2018). Recurrent Neural Networks for Multivariate Time Series with Missing Values. Sci. Rep., 8.","DOI":"10.1038\/s41598-018-24271-9"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"10823","DOI":"10.1109\/ACCESS.2019.2891073","article-title":"Artificial Intelligence for Vehicle-to-Everything: A Survey","volume":"7","author":"Tong","year":"2019","journal-title":"IEEE Access"},{"key":"ref_29","unstructured":"Feller, W. (1991). An Introduction to Probability Theory and Its Applications, Wiley. [3rd ed.]."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Rakhmanov, A., and Wiseman, Y. (2023). Compression of GNSS Data with the Aim of Speeding Up Communication to Autonomous Vehicles. 
Remote Sens., 15.","DOI":"10.3390\/rs15082165"}],"container-title":["Future Internet"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-5903\/17\/6\/241\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:43:17Z","timestamp":1760031797000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-5903\/17\/6\/241"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,29]]},"references-count":30,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2025,6]]}},"alternative-id":["fi17060241"],"URL":"https:\/\/doi.org\/10.3390\/fi17060241","relation":{},"ISSN":["1999-5903"],"issn-type":[{"type":"electronic","value":"1999-5903"}],"subject":[],"published":{"date-parts":[[2025,5,29]]}}}