{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T03:37:46Z","timestamp":1760240266330,"version":"build-2065373602"},"reference-count":33,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2019,4,19]],"date-time":"2019-04-19T00:00:00Z","timestamp":1555632000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"the new century talent supporting project of education ministry in China","award":["B43451914"],"award-info":[{"award-number":["B43451914"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"<jats:p>In banks, governments, and internet companies, the increasing demand for data in various information systems and the continuous shortening of the cycle for data collection and update can give rise to a variety of data quality issues in a database. As data scales expand, methods such as pre-specifying business rules or introducing expert experience into a repair process are no longer applicable to some information systems requiring rapid responses. In this paper, we divided data cleaning into supervised and unsupervised forms according to whether there were interventions in the repair process, and put forward a new dimension suitable for unsupervised cleaning. For weak logic errors in unsupervised data cleaning, we proposed an attribute correlation-based (ACB)-Framework under blocking, and designed three different data blocking methods to reduce the time complexity and test the impact of clustering accuracy on data cleaning. The experiments showed that the blocking methods could effectively reduce the repair time while maintaining the repair validity. 
Moreover, we concluded that blocking methods with excessively high clustering accuracy tended to put tuples with the same elements into the same data block, which reduced the cleaning ability. In summary, the ACB-Framework with blocking can reduce the corresponding time cost and does not need the guidance of domain knowledge or interventions in repair, so it can be applied in information systems requiring rapid responses, such as internet web pages, network servers, and sensor information acquisition.<\/jats:p>","DOI":"10.3390\/sym11040575","type":"journal-article","created":{"date-parts":[[2019,4,22]],"date-time":"2019-04-22T03:15:53Z","timestamp":1555902953000},"page":"575","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["When Considering More Elements: Attribute Correlation in Unsupervised Data Cleaning under Blocking"],"prefix":"10.3390","volume":"11","author":[{"given":"Pei","family":"Li","sequence":"first","affiliation":[{"name":"Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha 410073, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chaofan","family":"Dai","sequence":"additional","affiliation":[{"name":"Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha 410073, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wenqian","family":"Wang","sequence":"additional","affiliation":[{"name":"Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha 410073, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2019,4,19]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1145\/2935694.2935702","article-title":"Cleanix: A Parallel Big Data Cleaning 
System","volume":"44","author":"Wang","year":"2015","journal-title":"SIGMOD Rec."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"453","DOI":"10.1515\/revce-2015-0022","article-title":"Data cleaning in the process industries","volume":"31","author":"Xu","year":"2015","journal-title":"Rev. Chem. Eng."},{"key":"ref_3","first-page":"1727","article-title":"Consistent Estimation of Query Result in Inconsistent Data","volume":"9","author":"Liu","year":"2015","journal-title":"Chin. J. Comput."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"6","DOI":"10.7566\/JPSJ.86.063801","article-title":"Statistical-Mechanical Analysis Connecting Supervised Learning and Semi-Supervised Learning","volume":"86","author":"Fujii","year":"2017","journal-title":"J. Phys. Soc. Jpn."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"171","DOI":"10.1007\/s10522-017-9683-y","article-title":"A review of supervised machine learning applied to ageing research","volume":"18","author":"Fabris","year":"2017","journal-title":"Biogerontology"},{"key":"ref_6","first-page":"665","article-title":"Classification Algorithm Combined with Unsupervised Learning for Data Stream","volume":"29","author":"Xu","year":"2016","journal-title":"Pattern Recognit. Artif. Intell."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Kim, J., Jang, G.J., and Lee, M. (2016, January 16\u201321). Investigation of the Efficiency of Unsupervised Learning for Multi-task Classification in Convolutional Neural Network. Proceedings of the International Conference on Neural Information Processing, Kyoto, Japan.","DOI":"10.1007\/978-3-319-46675-0_60"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Can, B., and Manandhar, S. (2014, January 6\u201312). Methods and Algorithms for Unsupervised Learning of Morphology. 
Proceedings of the International Conference on Intelligent Text Processing and Computational, Kathmandu, Nepal.","DOI":"10.1007\/978-3-642-54906-9_15"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"537","DOI":"10.1587\/transinf.2015EDL8170","article-title":"An Optimization Strategy for CFDMiner: An Algorithm of Discovering Constant Conditional Functional Dependencies","volume":"E99.D","author":"Zhou","year":"2016","journal-title":"IEICE Trans. Inf. Syst."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"2447","DOI":"10.1587\/transinf.2017EDP7378","article-title":"Uncertain Rule Based Method for Determining Data Currency","volume":"E101-D","author":"Li","year":"2018","journal-title":"IEICE Trans. Inf. Syst."},{"key":"ref_11","unstructured":"Mcgilvray, D. (2008). Executing Data Quality Projects, Elsevier LTD Press."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"1296","DOI":"10.1109\/TKDE.2018.2791607","article-title":"Multi-View Missing Data Completion","volume":"30","author":"Zhang","year":"2018","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_13","first-page":"3134","article-title":"Research on Online Cleaning and Repair Methods of Large-Scale Distribution Network Load Data","volume":"11","author":"Diao","year":"2015","journal-title":"Power Syst. Technol."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Benbernou, S., and Ouziri, M. (2017, January 11\u201314). Enhancing Data Quality by Cleaning Inconsistent Big RDF Data. Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA.","DOI":"10.1109\/BigData.2017.8257913"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Fisher, J., Christen, P., Wang, Q., and Rahm, E. (2015, January 10\u201313). A Clustering-Based Framework to Control Block Sizes for Entity Resolution. 
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia.","DOI":"10.1145\/2783258.2783396"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"593","DOI":"10.1007\/s10619-018-7240-6","article-title":"An effective weighted rule-based method for entity resolution","volume":"36","author":"Ahmad","year":"2018","journal-title":"Distrib. Parallel Databases"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"285","DOI":"10.1007\/s10115-015-0818-7","article-title":"Efficient entity resolution based on subgraph cohesion","volume":"46","author":"Wang","year":"2016","journal-title":"Knowl. Inf. Syst."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"280","DOI":"10.1080\/13658816.2014.965711","article-title":"Rank-based strategies for cleaning inconsistent spatial databases","volume":"29","author":"Brisaboa","year":"2015","journal-title":"Int. J. Geogr. Inf. Sci."},{"key":"ref_19","first-page":"1685","article-title":"Repairing Inconsistent Relational Data Based on Possible World Model","volume":"27","author":"Xu","year":"2016","journal-title":"J. Softw."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"54","DOI":"10.1109\/TEVC.2013.2285016","article-title":"A New Multiobjective Evolutionary Algorithm for Mining a Reduced Set of Interesting Positive and Negative Quantitative Association Rules","volume":"18","author":"Martin","year":"2014","journal-title":"IEEE Trans. Evol. Comput."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"117","DOI":"10.3233\/IDA-150434","article-title":"Incremental maintenance of discovered association rules and approximate dependencies","volume":"21","author":"Medina","year":"2017","journal-title":"Int. Data Anal."},{"key":"ref_22","first-page":"104","article-title":"An Accurate Method for Mining top-k Frequent Pattern under Differential Privacy","volume":"51","author":"Zhang","year":"2014","journal-title":"J. Comput. Res. 
Dev."},{"key":"ref_23","unstructured":"Zhang, C.S., and Diao, Y.F. (2015, January 15\u201317). Conditional Functional Dependency Discovery and Data Repair Based on Decision Tree. Proceedings of the 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Zhangjiajie, China."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"104","DOI":"10.1016\/j.knosys.2018.06.012","article-title":"Handling missing values: A study of popular imputation packages in R","volume":"160","author":"Yadav","year":"2018","journal-title":"Knowl.-Based Syst."},{"key":"ref_25","unstructured":"Krishnan, S., Franklin, M.J., Goldberg, K., and Wu, E. (2017). Boostclean: Automated error detection and repair for machine learning. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"177","DOI":"10.1016\/j.csda.2013.11.015","article-title":"A Bayesian semiparametric regression model for reliability data using effective age","volume":"73","author":"Li","year":"2014","journal-title":"Comput. Stat. Data Anal."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Karakasidis, A., Koloniari, G., and Verykios, V.S. (2015, January 10\u201313). Scalable Blocking for Privacy Preserving Record Linkage. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia.","DOI":"10.1145\/2783258.2783290"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"1929","DOI":"10.14778\/2733085.2733098","article-title":"Supervised Meta-blocking","volume":"7","author":"Papadakis","year":"2014","journal-title":"Proc. VLDB Endow."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"198","DOI":"10.1109\/TCSVT.2013.2270366","article-title":"Multiscale Saliency Detection Using Random Walk with Restart","volume":"24","author":"Kim","year":"2014","journal-title":"IEEE Trans. Circuits Syst. 
Video Technol."},{"key":"ref_30","first-page":"2303","article-title":"Entity Resolution Oriented Clustering Algorithm","volume":"27","author":"Sun","year":"2016","journal-title":"J. Softw."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Tong, H.H., Faloutsos, C., and Pan, J.Y. (2006, January 18\u201322). Fast random walk with restart and its applications. Proceedings of the Sixth International Conference on Data Mining, Hong Kong, China.","DOI":"10.1109\/ICDM.2006.70"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"309","DOI":"10.1007\/s11760-016-0938-x","article-title":"Improving retrieval framework using information gain models","volume":"11","author":"Le","year":"2017","journal-title":"Signal Image Video Process."},{"key":"ref_33","first-page":"429","article-title":"Informative Gene Selection Method Based on Symmetric Uncertainty and SVM Recursive Feature Elimination","volume":"30","author":"Ye","year":"2017","journal-title":"Pattern Recognit. Artif. Intell."}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/11\/4\/575\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T12:46:51Z","timestamp":1760186811000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/11\/4\/575"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,4,19]]},"references-count":33,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2019,4]]}},"alternative-id":["sym11040575"],"URL":"https:\/\/doi.org\/10.3390\/sym11040575","relation":{},"ISSN":["2073-8994"],"issn-type":[{"type":"electronic","value":"2073-8994"}],"subject":[],"published":{"date-parts":[[2019,4,19]]}}}