{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,17]],"date-time":"2026-02-17T14:59:33Z","timestamp":1771340373545,"version":"3.50.1"},"reference-count":120,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2018,7,1]],"date-time":"2018-07-01T00:00:00Z","timestamp":1530403200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"<jats:p>Data preprocessing is an essential step in knowledge discovery projects. Experts affirm that preprocessing tasks take between 50% and 70% of the total time of the knowledge discovery process. In this sense, several authors consider data cleaning one of the most cumbersome and critical tasks. Failure to provide high data quality in the preprocessing stage will significantly reduce the accuracy of any data analytics project. In this paper, we propose DQF4CT, a framework to address data quality issues in classification tasks. Our approach is composed of: (i) a conceptual framework that gives the user guidance on how to deal with data problems in classification tasks; and (ii) an ontology that represents data cleaning knowledge and suggests the proper data cleaning approaches. We present two case studies with real datasets: physical activity monitoring (PAM) and occupancy detection of an office room (OD). To evaluate our proposal, the datasets cleaned by DQF4CT were used to train the same algorithms that the authors of PAM and OD used in their classification tasks. Additionally, we evaluated DQF4CT on datasets from the Repository of Machine Learning Databases of the University of California, Irvine (UCI). 
In addition, 84% of the results achieved by the models trained on the datasets cleaned by DQF4CT are better than those of the models of the datasets' authors.<\/jats:p>","DOI":"10.3390\/sym10070248","type":"journal-article","created":{"date-parts":[[2018,7,2]],"date-time":"2018-07-02T10:56:52Z","timestamp":1530529012000},"page":"248","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":23,"title":["From Theory to Practice: A Data Quality Framework for Classification Tasks"],"prefix":"10.3390","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4717-3040","authenticated-orcid":false,"given":"David Camilo","family":"Corrales","sequence":"first","affiliation":[{"name":"Grupo de Ingenier\u00eda Telem\u00e1tica, Universidad del Cauca, Campus Tulc\u00e1n, 190002 Popay\u00e1n, Colombia"},{"name":"Departamento de Inform\u00e1tica, Universidad Carlos III de Madrid, Avenida de la Universidad, 30, 28911 Legan\u00e9s, Spain"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0041-6829","authenticated-orcid":false,"given":"Agapito","family":"Ledezma","sequence":"additional","affiliation":[{"name":"Departamento de Inform\u00e1tica, Universidad Carlos III de Madrid, Avenida de la Universidad, 30, 28911 Legan\u00e9s, Spain"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5608-9097","authenticated-orcid":false,"given":"Juan Carlos","family":"Corrales","sequence":"additional","affiliation":[{"name":"Grupo de Ingenier\u00eda Telem\u00e1tica, Universidad del Cauca, Campus Tulc\u00e1n, 190002 Popay\u00e1n, Colombia"}]}],"member":"1968","published-online":{"date-parts":[[2018,7,1]]},"reference":[{"key":"ref_1","unstructured":"Gantz, J., and Reinsel, D. (2018, April 20). The Digital Universe in 2020: Big Data, Bigger Digital Shadows, And Biggest Growth in the Far East. 
Available online: https:\/\/www.emc-technology.com\/collateral\/analyst-reports\/idc-the-digital-universe-in-2020.pdf."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"652","DOI":"10.1109\/ACCESS.2014.2332453","article-title":"Toward Scalable Systems for Big Data Analytics: A Technology Tutorial","volume":"2","author":"Hu","year":"2014","journal-title":"IEEE Access"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Rajaraman, A., and Ullman, J.D. (2011). Mining of Massive Datasets, Cambridge University Press.","DOI":"10.1017\/CBO9781139058452"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Pacheco, F., Rangel, C., Aguilar, J., Cerrada, M., and Altamiranda, J. (2014, January 15\u201319). Methodological framework for data processing based on the Data Science paradigm. Proceedings of the 2014 XL Latin American Computing Conference (CLEI), Montevideo, Uruguay.","DOI":"10.1109\/CLEI.2014.6965184"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Sebastian-Coleman, L. (2012). Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework, Morgan Kaufmann Publishers Inc.","DOI":"10.1016\/B978-0-12-397033-6.00020-1"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Eyob, E. (2009). Social Implications of Data Mining and Information Privacy: Interdisciplinary Frameworks and Solutions: Interdisciplinary Frameworks and Solutions, Information Science Reference.","DOI":"10.4018\/978-1-60566-196-4"},{"key":"ref_7","unstructured":"Piateski, G., and Frawley, W. (1991). Knowledge Discovery in Databases, MIT Press."},{"key":"ref_8","unstructured":"Chapman, P. (2018, April 20). CRISP-DM 1.0: Step-By-Step Data Mining Guide. 
Available online: http:\/\/www.crisp-dm.org\/CRISPWP-0800.pdf."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1145\/1656274.1656278","article-title":"The WEKA Data Mining Software: An Update","volume":"11","author":"Hall","year":"2009","journal-title":"SIGKDD Explor. Newsl."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., and Euler, T. (2006, January 20\u201323). YALE: Rapid Prototyping for Complex Data Mining Tasks. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA.","DOI":"10.1145\/1150402.1150531"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"26","DOI":"10.1145\/1656274.1656280","article-title":"KNIME\u2014The Konstanz information miner: Version 2.0 and Beyond","volume":"11","author":"Berthold","year":"2009","journal-title":"ACM SIGKDD Explor. Newsl."},{"key":"ref_12","unstructured":"MATHWORKS (2004). Matlab, The MathWorks Inc."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"299","DOI":"10.1080\/10618600.1996.10474713","article-title":"R: A language for data analysis and graphics","volume":"5","author":"Ihaka","year":"1996","journal-title":"J. Comput. Graph. Stat."},{"key":"ref_14","unstructured":"Eaton, J.W. (2002). GNU Octave Manual, Network Theory Limited."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"396","DOI":"10.17706\/jcp.10.6.396-405","article-title":"A Conceptual Framework for Data Quality in Knowledge Discovery Tasks (FDQ-KDT): A Proposal","volume":"10","author":"Corrales","year":"2015","journal-title":"J. Comput."},{"key":"ref_16","unstructured":"Caballero, I., Verbo, E., Calero, C., and Piattini, M. (2007). 
A Data Quality Measurement Information Model Based on ISO\/IEC 15939, ICIQ."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"150","DOI":"10.1287\/mnsc.31.2.150","article-title":"Modeling Data and Process Quality in Multi-Input, Multi-Output Information Systems","volume":"31","author":"Ballou","year":"1985","journal-title":"Manag. Sci."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Guillet, F.J., and Hamilton, H.J. (2007). Measuring and Modelling Data Quality for Quality-Awareness in Data Mining. Quality Measures in Data Mining, Springer.","DOI":"10.1007\/978-3-540-44918-8"},{"key":"ref_19","unstructured":"Kerr, K., and Norris, T. (2004, January 5\u20137). The Development of a Healthcare Data Quality Framework and Strategy. Proceedings of the Ninth International Conference on Information Quality (ICIQ-04), Cambridge, MA, USA."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1080\/07421222.1996.11518099","article-title":"Beyond accuracy: What data quality means to data consumers","volume":"12","author":"Wang","year":"1996","journal-title":"J. Manag. Inf. Syst."},{"key":"ref_21","unstructured":"Eppler, M.J., and Wittig, D. (2000, January 20\u201322). Conceptualizing Information Quality: A Review of Information Quality Frameworks from the Last Ten Years. Proceedings of the 2000 International Conference on Information Quality (IQ 2000), Cambridge, MA, USA."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"907","DOI":"10.1006\/ijhc.1995.1081","article-title":"Toward principles for the design of ontologies used for knowledge sharing?","volume":"43","author":"Gruber","year":"1995","journal-title":"Int. J. Hum. Comput. Stud."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"93","DOI":"10.1017\/S0269888900007797","article-title":"Ontologies: Principles, methods and applications","volume":"11","author":"Uschold","year":"1996","journal-title":"Knowl. Eng. 
Rev."},{"key":"ref_24","first-page":"18:1","article-title":"Ontology-Based Data Quality Management for Data Streams","volume":"7","author":"Geisler","year":"2016","journal-title":"J. Data Inf. Qual."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Wang, J., Cellary, W., Wang, D., Wang, H., Chen, S.C., Li, T., and Zhang, Y. (2015, January 1\u20133). A Data Quality Framework for Customer Relationship Analytics. Proceedings of the WISE 2015 16th International Conference on Web Information Systems Engineering, Miami, FL, USA.","DOI":"10.1007\/978-3-319-26187-4"},{"key":"ref_26","unstructured":"Galhard, H., Florescu, D., Shasha, D., and Simon, E. (March, January 28). An extensible Framework for Data Cleaning. Proceedings of the 16th International Conference on Data Engineering, Washington, DC, USA."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"8304","DOI":"10.1016\/j.eswa.2015.06.050","article-title":"DQ2S\u2014A framework for data quality-aware information management","volume":"42","author":"Sampaio","year":"2015","journal-title":"Expert Syst. Appl."},{"key":"ref_28","unstructured":"Yang, Q., and Webb, G. (2006, January 7\u201311). An Object-Oriented Framework for Data Quality Management of Enterprise Data Warehouse. Proceedings of the 9th Pacific Rim International Conference on Artificial Intelligence Trends in Artificial Intelligence (PRICAI 2006), Guilin, China."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Taleb, I., Dssouli, R., and Serhani, M.A. (July, January 27). Big Data Pre-processing: A Quality Framework. Proceedings of the 2015 IEEE International Congress on Big Data, New York, NY, USA.","DOI":"10.1109\/BigDataCongress.2015.35"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"40","DOI":"10.1016\/j.ijmedinf.2016.03.006","article-title":"Data quality assessment framework to assess electronic medical record data for use in research","volume":"90","author":"Reimer","year":"2016","journal-title":"Int. J. 
Med. Inform."},{"key":"ref_31","unstructured":"Almutiry, O., Wills, G., and Alwabel, A. (2013, January 24\u201326). Toward a framework for data quality in cloud-based health information system. Proceedings of the International Conference on Information Society (i-Society 2013), Toronto, ON, Canada."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"600","DOI":"10.1197\/jamia.M1087","article-title":"Defining and improving data quality in medical registries: A literature review, case study, and generic framework","volume":"9","author":"Arts","year":"2002","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Myrseth, P., Stang, J., and Dalberg, V. (2011, January 19\u201324). A data quality framework applied to e-government metadata: A prerequsite to establish governance of interoperable e-services. Proceedings of the 2011 International Conference on E-Business and E-Government (ICEE), Maui, Hawaii.","DOI":"10.1109\/ICEBEG.2011.5881298"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"325","DOI":"10.1016\/j.giq.2016.02.001","article-title":"Open data quality measurement framework: Definition and application to Open Government Data","volume":"33","author":"Vetro","year":"2016","journal-title":"Gov. Inf. Q."},{"key":"ref_35","first-page":"4422","article-title":"A Framework to Construct Data Quality Dimensions Relationships","volume":"6","author":"Panahy","year":"2013","journal-title":"Indian J. Sci. Technol."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"623","DOI":"10.1109\/69.404034","article-title":"A framework for analysis of data quality research","volume":"7","author":"Wang","year":"1995","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Corrales, D.C., Corrales, J.C., and Ledezma, A. (2018). How to Address the Data Quality Issues in Regression Models: A Guided Process for Data Cleaning. 
Symmetry, 10.","DOI":"10.3390\/sym10040099"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Rasta, K., Nguyen, T.H., and Prinz, A. (2013, January 29\u201331). A framework for data quality handling in enterprise service bus. Proceedings of the 2013 Third International Conference on Innovative Computing Technology (INTECH), London, UK.","DOI":"10.1109\/INTECH.2013.6653640"},{"key":"ref_39","unstructured":"Olson, D.L., and Delen, D. (2008). Advanced Data Mining Techniques, Springer Science & Business Media."},{"key":"ref_40","unstructured":"Schutt, R., and O\u2019Neil, C. (2013). Doing Data Science: Straight Talk from the Frontline, O\u2019Reilly Media, Inc."},{"key":"ref_41","unstructured":"Wang, X., Hamilton, H.J., and Bither, Y. (2005). An Ontology-Based Approach to Data Cleaning, Department of Computer Science, University of Regina. Technical Report CS-2005-05."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Almeida, R., Oliveira, P., Braga, L., and Barroso, J. (2012, January 19\u201321). Ontologies for Reusing Data Cleaning Knowledge. Proceedings of the 2012 IEEE Sixth International Conference on Semantic Computing, Palermo, Italy.","DOI":"10.1109\/ICSC.2012.19"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Yu, G., Bertino, E., and Xu, G. (2008, January 26\u201328). Rule Mining for Automatic Ontology Based Data Cleaning. Proceedings of the 10th Asia-Pacific Web Conference ON Progress in WWW Research and Development, Shenyang, China.","DOI":"10.1007\/978-3-540-78849-2"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Andersson, B., Bergholtz, M., and Johannesson, P. (2002). Ontology-Based Data Cleaning. 
Natural Language Processing and Information Systems, Proceedings of the 6th International Conference on Applications of Natural Language to Information Systems, NLDB 2002, Stockholm, Sweden, 27\u201328 June 2002, Springer.","DOI":"10.1007\/3-540-36271-1"},{"key":"ref_45","first-page":"1937","article-title":"A Data Quality Ontology for the Secondary Use of EHR Data","volume":"2015","author":"Johnson","year":"2015","journal-title":"AMIA Ann. Symp. Proc."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Abarza, R.G., Motz, R., and Urrutia, A. (2014, January 8\u201314). Quality Assessment Using Data Ontologies. Proceedings of the 2014 33rd International Conference of the Chilean Computer Science Society (SCCC), Talca, Chile.","DOI":"10.1109\/SCCC.2014.26"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Da Silva Jacinto, A., da Silva Santos, R., and de Oliveira, J.M.P. (2014, January 10\u201312). Automatic and semantic pre-Selection of features using ontology for data mining on datasets related to cancer. Proceedings of the International Conference on Information Society (i-Society 2014), London, UK.","DOI":"10.1109\/i-Society.2014.7009060"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Garcia, L.F., Graciolli, V.M., Ros, L.F.D., and Abel, M. (2016, January 6\u20138). An Ontology-Based Conceptual Framework to Improve Rock Data Quality in Reservoir Models. Proceedings of the 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI), San Jose, CA, USA.","DOI":"10.1109\/ICTAI.2016.0166"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Coulet, A., Smail-Tabbone, M., Benlian, P., Napoli, A., and Devignes, M.D. (2008). Ontology-guided data preparation for discovering genotype-phenotype relationships. 
BMC Bioinform., 9.","DOI":"10.1186\/1471-2105-9-S4-S3"},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1177\/160940690900800406","article-title":"Building a conceptual framework: Philosophy, definitions, and procedure","volume":"8","author":"Jabareen","year":"2009","journal-title":"Int. J. Qual. Methods"},{"key":"ref_51","first-page":"105","article-title":"Competing paradigms in qualitative research","volume":"2","author":"Guba","year":"1994","journal-title":"Handb. Qual. Res."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Corrales, D.C., Ledezma, A., and Corrales, J.C. (2016). A systematic review of data quality issues in knowledge discovery tasks. Rev. Ing. Univ. Medel., 15.","DOI":"10.22395\/rium.v15n28a7"},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"304","DOI":"10.1109\/TKDE.2006.46","article-title":"Enhancing data analysis with noise removal","volume":"18","author":"Xiong","year":"2006","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"15:1","DOI":"10.1145\/1541880.1541882","article-title":"Anomaly Detection: A Survey","volume":"41","author":"Chandola","year":"2009","journal-title":"ACM Comput. Surv."},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1016\/j.ins.2013.01.021","article-title":"A hybrid method for imputation of missing values using optimized fuzzy c-means with support vector regression and a genetic algorithm","volume":"233","author":"Aydilek","year":"2013","journal-title":"Inf. Sci."},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Hawkins, D.M. (1980). Identification of Outliers, Springer.","DOI":"10.1007\/978-94-015-3994-4"},{"key":"ref_57","unstructured":"Barnett, V., and Lewis, T. (1994). Outliers in Statistical Data, Wiley."},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Johnson, R.A., and Wichern, D.W. (2014). 
Applied Multivariate Statistical Analysis, Prentice-Hall.","DOI":"10.1002\/9781118445112.stat02623"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Khalid, S., Khalil, T., and Nasreen, S. (2014, January 27\u201329). A survey of feature selection and feature extraction techniques in machine learning. Proceedings of the Science and Information Conference (SAI), London, UK.","DOI":"10.1109\/SAI.2014.6918213"},{"key":"ref_60","unstructured":"Tang, J., Alelyani, S., and Liu, H. (2014). Feature selection for classification: A review. Data Classification: Algorithms and Applications, Chapman and Hall\/CRC."},{"key":"ref_61","doi-asserted-by":"crossref","first-page":"1263","DOI":"10.1109\/TKDE.2008.239","article-title":"Learning from Imbalanced Data","volume":"21","author":"He","year":"2009","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Chairi, I., Alaoui, S., and Lyhyaoui, A. (2012, January 10\u201312). Learning from imbalanced data using methods of sample selection. Proceedings of the 2012 International Conference on Multimedia Computing and Systems (ICMCS), Tangier, Morocco.","DOI":"10.1109\/ICMCS.2012.6320291"},{"key":"ref_63","doi-asserted-by":"crossref","unstructured":"Bosu, M.F., and MacDonell, S.G. (2013, January 4\u20137). A Taxonomy of Data Quality Challenges in Empirical Software Engineering. Proceedings of the 2013 22nd Australian Software Engineering Conference, Melbourne, Australia.","DOI":"10.1109\/ASWEC.2013.21"},{"key":"ref_64","doi-asserted-by":"crossref","first-page":"297","DOI":"10.1145\/505168.505196","article-title":"Resolving Semantic Heterogeneity in Schema Integration","volume":"Volume 2001","author":"Hakimpour","year":"2001","journal-title":"Proceedings of the International Conference on Formal Ontology in Information Systems"},{"key":"ref_65","unstructured":"Finger, M., and Silva, F.S.D. (1998, January 16\u201317). Temporal data obsolescence: Modelling problems. 
Proceedings of the Fifth International Workshop on Temporal Representation and Reasoning (Cat. No. 98EX157), Sanibel Island, FL, USA."},{"key":"ref_66","unstructured":"Maydanchik, A. (2007). Data Quality Assessment, Technics Publications."},{"key":"ref_67","doi-asserted-by":"crossref","unstructured":"Aljuaid, T., and Sasi, S. (2016, January 23\u201325). Proper imputation techniques for missing values in datasets. Proceedings of the 2016 International Conference on Data Science and Engineering (ICDSE), Cochin, India.","DOI":"10.1109\/ICDSE.2016.7823957"},{"key":"ref_68","doi-asserted-by":"crossref","first-page":"890","DOI":"10.1109\/32.962560","article-title":"Software cost estimation with incomplete data","volume":"27","author":"Strike","year":"2001","journal-title":"IEEE Trans. Softw. Eng."},{"key":"ref_69","first-page":"2007","article-title":"Techniques for dealing with missing data in knowledge discovery tasks","volume":"15","author":"Magnani","year":"2004","journal-title":"Obtido"},{"key":"ref_70","doi-asserted-by":"crossref","unstructured":"Breunig, M.M., Kriegel, H.P., Ng, R.T., and Sander, J. (2000). LOF: Identifying Density-Based Local Outliers, ACM. ACM Sigmod Record.","DOI":"10.1145\/342009.335388"},{"key":"ref_71","unstructured":"Ester, M., Kriegel, H.P., Sander, J., and Xu, X. (1996, January 2\u20134). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96 Proceedings), Portland, OR, USA."},{"key":"ref_72","doi-asserted-by":"crossref","unstructured":"Kriegel, H.P., Zimek, A., and Hubert, M.S. (2008, January 24\u201327). Angle-based outlier detection in high-dimensional data. Proceedings of the 14th ACM SIGKDD International Conference On Knowledge Discovery and Data Mining, Las Vegas, NV, USA.","DOI":"10.1145\/1401890.1401946"},{"key":"ref_73","unstructured":"Fayyad, U.M., Piatetsky-Shapiro, G., and Smyth, P. (1996). 
Advances in Knowledge Discovery and Data Mining, American Association for Artificial Intelligence. Chapter from Data Mining to Knowledge Discovery: An Overview."},{"key":"ref_74","first-page":"1787","article-title":"Feature Selection Methods And Algorithms","volume":"3","author":"Ladha","year":"2011","journal-title":"Int. J. Comput. Sci. Eng."},{"key":"ref_75","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1016\/j.compeleceng.2013.11.024","article-title":"A survey on feature selection methods","volume":"40","author":"Chandrashekar","year":"2014","journal-title":"Comput. Electr. Eng."},{"key":"ref_76","doi-asserted-by":"crossref","first-page":"245","DOI":"10.1016\/S0004-3702(97)00063-5","article-title":"Selection of relevant features and examples in machine learning","volume":"97","author":"Blum","year":"1997","journal-title":"Artif. Intell."},{"key":"ref_77","unstructured":"Jolliffe, I. (2002). Principal Component Analysis, Wiley Online Library."},{"key":"ref_78","doi-asserted-by":"crossref","unstructured":"Wang, J., Xu, M., Wang, H., and Zhang, J. (2006, January 16\u201320). Classification of Imbalanced Data by Using the SMOTE Algorithm and Locally Linear Embedding. Proceedings of the 2006 8th international Conference on Signal Processing, Beijing, China.","DOI":"10.1109\/ICOSP.2006.345752"},{"key":"ref_79","doi-asserted-by":"crossref","unstructured":"He, H., and Ma, Y. (2013). Imbalanced Learning: Foundations, Algorithms, and Applications, John Wiley and Sons.","DOI":"10.1002\/9781118646106"},{"key":"ref_80","doi-asserted-by":"crossref","first-page":"845","DOI":"10.1109\/TNNLS.2013.2292894","article-title":"Classification in the Presence of Label Noise: A Survey","volume":"25","author":"Frenay","year":"2014","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_81","doi-asserted-by":"crossref","unstructured":"Huang, L., Jin, H., Yuan, P., and Chu, F. (2008, January 3\u20135). 
Duplicate Records Cleansing with Length Filtering and Dynamic Weighting. Proceedings of the 2008 Fourth International Conference on Semantics, Knowledge and Grid, Beijing, China.","DOI":"10.1109\/SKG.2008.88"},{"key":"ref_82","doi-asserted-by":"crossref","unstructured":"Pav\u00f3n, J., Duque-M\u00e9ndez, N.D., and Fuentes-Fern\u00e1ndez, R. (2012). Improving SMOTE with Fuzzy Rough Prototype Selection to Detect Noise in Imbalanced Classification Data. Advances in Artificial Intelligence\u2014IBERAMIA 2012, Proceedings of the 13th Ibero-American Conference on AI, Cartagena de Indias, Colombia, 13\u201316 November 2012, Springer.","DOI":"10.1007\/978-3-642-34654-5"},{"key":"ref_83","doi-asserted-by":"crossref","first-page":"359","DOI":"10.2307\/2097958","article-title":"Entropy measure of diversification and corporate growth","volume":"27","author":"Jacquemin","year":"1979","journal-title":"J. Ind. Econ."},{"key":"ref_84","unstructured":"Asuncion, A., Newman, D., and UCI Machine Learning Repository (2018, March 15). Irvine, CA: University of California, School of Information and Computer Science. Available online: http:\/\/www.ics.uci.edu\/~{}mlearn\/MLRepository.html."},{"key":"ref_85","doi-asserted-by":"crossref","first-page":"28","DOI":"10.1016\/j.enbuild.2015.11.071","article-title":"Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models","volume":"112","author":"Candanedo","year":"2016","journal-title":"Energy Build."},{"key":"ref_86","doi-asserted-by":"crossref","unstructured":"Reiss, A., and Stricker, D. (2012, January 6\u20138). Creating and Benchmarking a New Dataset for Physical Activity Monitoring. 
Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments, Heraklion, Greece.","DOI":"10.1145\/2413097.2413148"},{"key":"ref_87","doi-asserted-by":"crossref","first-page":"264","DOI":"10.1016\/j.sbspro.2015.02.063","article-title":"Methodologies to Build Ontologies for Terminological Purposes","volume":"173","year":"2015","journal-title":"Procedia Soc. Behav. Sci."},{"key":"ref_88","unstructured":"G\u00f3mez-P\u00e9rez, A., Fern\u00e1ndez-L\u00f3pez, M., and Corcho, O. (2007). Ontological Engineering: With Examples from the Areas of Knowledge Management, e-Commerce and the Semantic Web. (Advanced Information and Knowledge Processing), Springer-Verlag New York, Inc."},{"key":"ref_89","unstructured":"Horrocks, I., Patel-Schneider, P.F., Bole, H., Tabet, S., Grosof, B., and Dean, M. (2018, May 01). SWRL: A Semantic Web Rule Language Combining OWL and RuleML. Available online: https:\/\/www.w3.org\/Submission\/SWRL\/."},{"key":"ref_90","doi-asserted-by":"crossref","unstructured":"Rodr\u00edguez, J.P., Gir\u00f3n, E.J., Corrales, D.C., and Corrales, J.C. (2017, January 22\u201324). A Guideline for Building Large Coffee Rust Samples Applying Machine Learning Methods. Proceedings of the International Conference of ICT for Adapting Agriculture to Climate Change, Popay\u00e1n, Colombia.","DOI":"10.1007\/978-3-319-70187-5_8"},{"key":"ref_91","doi-asserted-by":"crossref","unstructured":"Juddoo, S. (2015, January 4\u20135). Overview of data quality challenges in the context of Big Data. Proceedings of the 2015 International Conference on Computing, Communication and Security (ICCCS), Pamplemousses, Mauritius.","DOI":"10.1109\/CCCS.2015.7374131"},{"key":"ref_92","doi-asserted-by":"crossref","unstructured":"Cai, L., and Zhu, Y. (2015). The challenges of data quality and data quality assessment in the big data era. Data Sci. 
J., 14.","DOI":"10.5334\/dsj-2015-002"},{"key":"ref_93","doi-asserted-by":"crossref","first-page":"2825","DOI":"10.3233\/JIFS-169470","article-title":"Feature selection for classification tasks: Expert knowledge or traditional methods?","volume":"34","author":"Corrales","year":"2018","journal-title":"J. Intell. Fuzzy Syst."},{"key":"ref_94","doi-asserted-by":"crossref","first-page":"1","DOI":"10.18637\/jss.v028.i05","article-title":"Building predictive models in R using the caret package","volume":"28","author":"Kuhn","year":"2008","journal-title":"J. Stat. Softw."},{"key":"ref_95","doi-asserted-by":"crossref","first-page":"222","DOI":"10.1186\/2193-1801-2-222","article-title":"Principled missing data methods for researchers","volume":"2","author":"Dong","year":"2013","journal-title":"SpringerPlus"},{"key":"ref_96","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1177\/096228029900800102","article-title":"Multiple imputation: A primer","volume":"8","author":"Schafer","year":"1999","journal-title":"Stat. Methods Med. Res."},{"key":"ref_97","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1080\/00401706.1969.10490657","article-title":"Procedures for detecting outlying observations in samples","volume":"11","author":"Grubbs","year":"1969","journal-title":"Technometrics"},{"key":"ref_98","unstructured":"Rennie, J.D.M., Shih, L., Teevan, J., and Karger, D.R. (2003, January 21\u201324). Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Proceedings of the Twentieth International Conference on Machine Learning, Washington, DC, USA."},{"key":"ref_99","doi-asserted-by":"crossref","first-page":"7367","DOI":"10.1016\/j.eswa.2015.05.030","article-title":"An incremental technique for real-time bioacoustic signal segmentation","volume":"42","author":"Colonna","year":"2015","journal-title":"Expert Syst. 
Appl."},{"key":"ref_100","doi-asserted-by":"crossref","unstructured":"Luaces, O., G\u00e1mez, J.A., Barrenechea, E., Troncoso, A., Galar, M., Quinti\u00e1n, H., and Corchado, E. (2016). How to Correctly Evaluate an Automatic Bioacoustics Classification Method. Advances in Artificial Intelligence, Springer International Publishing.","DOI":"10.1007\/978-3-319-44636-3"},{"key":"ref_101","doi-asserted-by":"crossref","unstructured":"Calders, T., Ceci, M., and Malerba, D. (2016). Recognizing Family, Genus, and Species of Anuran Using a Hierarchical Classification Approach. Discovery Science, Springer International Publishing.","DOI":"10.1007\/978-3-319-46307-0"},{"key":"ref_102","doi-asserted-by":"crossref","unstructured":"Thabtah, F. (2017, January 20\u201322). Autism Spectrum Disorder Screening: Machine Learning Adaptation and DSM-5 Fulfillment. Proceedings of the 1st International Conference on Medical and Health Informatics, Taichung City, Taiwan.","DOI":"10.1145\/3107514.3107515"},{"key":"ref_103","doi-asserted-by":"crossref","first-page":"26","DOI":"10.1007\/BF02344684","article-title":"Classification of breast tissue by electrical impedance spectroscopy","volume":"38","author":"Jossinet","year":"2000","journal-title":"Med. Biol. Eng. Comput."},{"key":"ref_104","doi-asserted-by":"crossref","first-page":"311","DOI":"10.1002\/1520-6661(200009\/10)9:5<311::AID-MFM12>3.0.CO;2-9","article-title":"SisPorto 2.0: A program for automated analysis of cardiotocograms","volume":"9","author":"Bernardes","year":"2000","journal-title":"J. Matern.-Fetal Med."},{"key":"ref_105","doi-asserted-by":"crossref","first-page":"2473","DOI":"10.1016\/j.eswa.2007.12.020","article-title":"The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients","volume":"36","author":"Yeh","year":"2009","journal-title":"Expert Syst. 
Appl."},{"key":"ref_106","doi-asserted-by":"crossref","first-page":"754","DOI":"10.1016\/j.neucom.2015.07.085","article-title":"Transition-aware human activity recognition using smartphones","volume":"171","author":"Oneto","year":"2016","journal-title":"Neurocomputing"},{"key":"ref_107","doi-asserted-by":"crossref","first-page":"299","DOI":"10.1007\/s10115-007-0095-1","article-title":"Forecasting skewed biased stochastic ozone days: Analyses, solutions and beyond","volume":"14","author":"Zhang","year":"2008","journal-title":"Knowl. Inf. Syst."},{"key":"ref_108","doi-asserted-by":"crossref","first-page":"5948","DOI":"10.1016\/j.eswa.2014.03.019","article-title":"Phishing detection based Associative Classification data mining","volume":"41","author":"Abdelhamid","year":"2014","journal-title":"Expert Syst. Appl."},{"key":"ref_109","doi-asserted-by":"crossref","first-page":"93","DOI":"10.1016\/j.eswa.2016.04.001","article-title":"Ensemble Boosted Trees with Synthetic Features Generation in Application to Bankruptcy Prediction","volume":"58","author":"Zikeba","year":"2016","journal-title":"Expert Syst. Appl."},{"key":"ref_110","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1016\/j.dss.2014.03.001","article-title":"A data-driven approach to predict the success of bank telemarketing","volume":"62","author":"Moro","year":"2014","journal-title":"Decis. Support Syst."},{"key":"ref_111","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1007\/s00521-013-1490-z","article-title":"Predicting phishing websites based on self-structuring neural network","volume":"25","author":"Mohammad","year":"2014","journal-title":"Neural Comput. Appl."},{"key":"ref_112","doi-asserted-by":"crossref","first-page":"867","DOI":"10.1021\/ci4000213","article-title":"Quantitative structure\u2013activity relationship models for ready biodegradability of chemicals","volume":"53","author":"Mansouri","year":"2013","journal-title":"J. Chem. Inf. 
Model."},{"key":"ref_113","doi-asserted-by":"crossref","unstructured":"Alexandre, L.A., Salvador S\u00e1nchez, J., and Rodrigues, J.M.F. (2017). Transfer Learning with Partial Observability Applied to Cervical Cancer Screening. Pattern Recognition and Image Analysis, Springer International Publishing.","DOI":"10.1007\/978-3-319-58838-4"},{"key":"ref_114","first-page":"115","article-title":"Enhanced Classification Model for Cervical Cancer Dataset based on Cost Sensitive Classifier","volume":"4","author":"Fatlawi","year":"2007","journal-title":"Int. J. Comput. Tech."},{"key":"ref_115","first-page":"262","article-title":"Application of rule-based models for seismic hazard prediction in coal mines","volume":"18","author":"Kabiesz","year":"2013","journal-title":"Acta Montan. Slovaca"},{"key":"ref_116","doi-asserted-by":"crossref","first-page":"487","DOI":"10.1109\/TLA.2009.5349049","article-title":"On the Application of Ensembles of Classifiers to the Diagnosis of Pathologies of the Vertebral Column: A Comparative Analysis","volume":"7","year":"2009","journal-title":"IEEE Latin Am. Trans."},{"key":"ref_117","doi-asserted-by":"crossref","unstructured":"Vitri\u00e0, J., Sanches, J.M., and Hern\u00e1ndez, M. (2011). Diagnostic of Pathology on the Vertebral Column with Embedded Reject Option. Pattern Recognition and Image Analysis, Springer.","DOI":"10.1007\/978-3-642-21257-4"},{"key":"ref_118","doi-asserted-by":"crossref","first-page":"181","DOI":"10.1109\/TNSRE.2013.2293575","article-title":"Objective Automatic Assessment of Rehabilitative Speech Treatment in Parkinson\u2019s Disease","volume":"22","author":"Tsanas","year":"2014","journal-title":"IEEE Trans. Neural Syst. Rehabil. Eng."},{"key":"ref_119","first-page":"1","article-title":"A Feature Subset Selection Algorithm Automatic Recommendation Method","volume":"47","author":"Wang","year":"2013","journal-title":"J. Artif. Int. Res."},{"key":"ref_120","unstructured":"Reif, M., Shafait, F., and Dengel, A. 
(2012, January 24). Meta2-features: Providing meta-learners more information. Proceedings of the 35th German Conference on Artificial Intelligence, Saarbr\u00fccken, Germany."}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/10\/7\/248\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T15:10:55Z","timestamp":1760195455000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/10\/7\/248"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,7,1]]},"references-count":120,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2018,7]]}},"alternative-id":["sym10070248"],"URL":"https:\/\/doi.org\/10.3390\/sym10070248","relation":{},"ISSN":["2073-8994"],"issn-type":[{"value":"2073-8994","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,7,1]]}}}