{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,4]],"date-time":"2025-12-04T18:22:26Z","timestamp":1764872546778,"version":"build-2065373602"},"reference-count":31,"publisher":"MDPI AG","issue":"6","license":[{"start":{"date-parts":[[2023,6,4]],"date-time":"2023-06-04T00:00:00Z","timestamp":1685836800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education","award":["NRF-2022R1A2C1011937","IITP-2023-2018-08-01417"],"award-info":[{"award-number":["NRF-2022R1A2C1011937","IITP-2023-2018-08-01417"]}]},{"name":"MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program","award":["NRF-2022R1A2C1011937","IITP-2023-2018-08-01417"],"award-info":[{"award-number":["NRF-2022R1A2C1011937","IITP-2023-2018-08-01417"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>Recently, data-driven decision-making has attracted great interest; this requires high-quality datasets. However, real-world datasets often feature missing values for unknown or intentional reasons, rendering data-driven decision-making inaccurate. If a machine learning model is trained using incomplete datasets with missing values, the inferred results may be biased. In this case, a commonly used technique is the missing value imputation (MVI), which fills missing data with possible values estimated based on observed values. Various data imputation methods using machine learning, statistical inference, and relational database theories have been developed. Among them, conventional machine learning based imputation methods that handle tabular data can deal with only numerical columns or are time-consuming and cumbersome because they create an individualized predictive model for each column. Therefore, we have developed a novel imputational neural network that we term the Denoising Self-Attention Network (DSAN). Our proposed DSAN can deal with tabular datasets containing both numerical and categorical columns; it considers discretized numerical values as categorical values for embedding and self-attention layers. Furthermore, the DSAN learns robust feature expression vectors by combining self-attention and denoising techniques, and can predict multiple, appropriate substituted values simultaneously (via multi-task learning). To verify the validity of the method, we performed data imputation experiments after arbitrarily generating missing values for several real-world tabular datasets. We evaluated both imputational and downstream task performances, and we have seen that the DSAN outperformed the other models, especially in terms of category variable imputation.<\/jats:p>","DOI":"10.3390\/data8060102","type":"journal-article","created":{"date-parts":[[2023,6,5]],"date-time":"2023-06-05T01:53:25Z","timestamp":1685930005000},"page":"102","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["A Self-Attention-Based Imputation Technique for Enhancing Tabular Data Quality"],"prefix":"10.3390","volume":"8","author":[{"given":"Do-Hoon","family":"Lee","sequence":"first","affiliation":[{"name":"School of Electrical and Computer Engineering, University of Seoul, 163 Seoulsiripdaero, Seoul 02504, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4510-5685","authenticated-orcid":false,"given":"Han-joon","family":"Kim","sequence":"additional","affiliation":[{"name":"School of Electrical and Computer Engineering, University of Seoul, 163 Seoulsiripdaero, Seoul 02504, Republic of Korea"}]}],"member":"1968","published-online":{"date-parts":[[2023,6,4]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"519","DOI":"10.1080\/713827181","article-title":"An analysis of four missing data treatment methods for supervised learning","volume":"17","author":"Batista","year":"2003","journal-title":"Appl. Artif. Intell."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1487","DOI":"10.1007\/s10462-019-09709-4","article-title":"Missing value imputation: A review and analysis of the literature (2006\u20132017)","volume":"53","author":"Lin","year":"2020","journal-title":"Artif. Intell. Rev."},{"key":"ref_3","unstructured":"Yoon, J., Jordon, J., and Schaar, M. (2018, January 10). Gain: Missing data imputation using generative adversarial nets. Proceedings of the 34th International Conference on Machine Learning, Stockholm, Sweden."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Gondara, L., and Wang, K. (2017). Multiple imputation using deep denoising autoencoders. arXiv.","DOI":"10.1007\/978-3-319-93040-4_21"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"112","DOI":"10.1093\/bioinformatics\/btr597","article-title":"MissForest\u2014Non-parametric missing value imputation for mixed-type data","volume":"28","author":"Stekhoven","year":"2012","journal-title":"Bioinformatics"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Biessmann, F., Salinas, D., Schelter, S., Schmidt, P., and Lange, D. (2018, January 22\u201326). Deep learning for missing value imputation in tables with non-numerical data. Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Torino, Italy.","DOI":"10.1145\/3269206.3272005"},{"key":"ref_7","unstructured":"Somepalli, G., Goldblum, M., Schwarzschild, A., Bruss, C.B., and Goldstein, T. (2021). Saint: Improved neural networks for tabular data via row attention and contrastive pre-training. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Rekatsinas, T., Chu, X., Ilyas, I.F., and R\u00e9, C. (2017). Holoclean: Holistic data repairs with probabilistic inference. arXiv.","DOI":"10.14778\/3137628.3137631"},{"key":"ref_9","unstructured":"Breve, B., Caruccio, L., Deufemia, V., and Polese, G. (April, January 29). RENUVER: A missing value imputation algorithm based on relaxed functional dependencies. Proceedings of the International Conference on Extending Database Technology, Edinburgh, UK."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"275","DOI":"10.1109\/TKDE.2018.2883103","article-title":"Enriching data imputation under similarity rule constraints","volume":"32","author":"Song","year":"2018","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.knosys.2021.107114","article-title":"Missing data imputation for traffic congestion data based on joint matrix factorization","volume":"225","author":"Jia","year":"2021","journal-title":"Knowl.-Based Syst."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Borisov, V., Leemann, T., Se\u00dfler, K., Haug, J., Pawelczyk, M., and Kasneci, G. (2021). Deep neural networks and tabular data: A survey. arXiv.","DOI":"10.1109\/TNNLS.2022.3229161"},{"key":"ref_13","unstructured":"Kim, J., Kim, T., Choi, J.H., and Choo, J. (2020, January 10\u201315). End-to-end multi-task learning of missing value imputation and forecasting in time-series data. Proceedings of the 25th IEEE International Conference on Pattern Recognition, Milan, Italy."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"106838","DOI":"10.1016\/j.asoc.2020.106838","article-title":"Autoencoder-based multi-task learning for imputation and classification of incomplete data","volume":"98","author":"Lai","year":"2021","journal-title":"Appl. Soft Comput."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Lall, R., and Thomas, R. Efficient Multiple Imputation for Diverse Data in Python and R: MIDASpy and rMIDAS. J. Stat. Softw., 2023. in press.","DOI":"10.18637\/jss.v107.i09"},{"key":"ref_16","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv."},{"key":"ref_17","unstructured":"Arik, S.\u00d6., and Pfister, T. (March, January 22). Tabnet: Attentive interpretable tabular learning. Proceedings of the AAAI Conference on Artificial Intelligence (Virtual Conference), Virtual."},{"key":"ref_18","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_19","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., and Houlsby, N. (2020). An image is worth 16 \u00d7 16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_20","unstructured":"Huang, X., Khetan, A., Cvitkovic, M., and Karnin, Z. (2020). Tabtransformer: Tabular data modeling using contextual embeddings. arXiv."},{"key":"ref_21","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the 31st Conference on Neural Information Processing System, Long Beach, CA, USA."},{"key":"ref_22","unstructured":"Ruder, S. (2017). An overview of multi-task learning in deep neural networks. arXiv."},{"key":"ref_23","unstructured":"Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning, MIT Press."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.A. (2008, January 5\u20139). Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland.","DOI":"10.1145\/1390156.1390294"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Wong, S.C., Gatt, A., Stamatescu, V., and McDonnell, M.D. (2016, January 6\u20138). Understanding data augmentation for classification: When to warp?. Proceedings of the International Conference on Digital Image Computing: Techniques and Applications, Gold Coast, Australia.","DOI":"10.1109\/DICTA.2016.7797091"},{"key":"ref_26","unstructured":"Ba, J.L., Kiros, J.R., and Hinton, G.E. (2016). Layer normalization. arXiv."},{"key":"ref_27","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (July, January 26). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_28","unstructured":"Dua, D., and Graff, C. (2021, December 10). UCI Machine Learning Repository. Available online: http:\/\/archive.ics.uci.edu\/ml."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"2088","DOI":"10.1093\/bioinformatics\/btg287","article-title":"A Bayesian missing value estimation method for gene expression profile data","volume":"19","author":"Oba","year":"2003","journal-title":"Bioinformatics"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"581","DOI":"10.1093\/biomet\/63.3.581","article-title":"Inference and missing data","volume":"63","author":"Rubin","year":"1976","journal-title":"Biometrika"},{"key":"ref_31","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/8\/6\/102\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:48:02Z","timestamp":1760125682000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/8\/6\/102"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,4]]},"references-count":31,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2023,6]]}},"alternative-id":["data8060102"],"URL":"https:\/\/doi.org\/10.3390\/data8060102","relation":{},"ISSN":["2306-5729"],"issn-type":[{"type":"electronic","value":"2306-5729"}],"subject":[],"published":{"date-parts":[[2023,6,4]]}}}