{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,30]],"date-time":"2025-10-30T07:14:41Z","timestamp":1761808481790,"version":"build-2065373602"},"reference-count":65,"publisher":"MDPI AG","issue":"12","license":[{"start":{"date-parts":[[2021,12,1]],"date-time":"2021-12-01T00:00:00Z","timestamp":1638316800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>In the era of the Internet of Things and big data, we are faced with the management of a flood of information. The complexity and amount of data presented to the decision-maker are enormous, and existing methods often fail to derive nonredundant information quickly. Thus, the selection of the most satisfactory set of solutions is often a struggle. This article investigates the possibilities of using the entropy measure as an indicator of data difficulty. To do so, we focus on real-world data covering various fields related to markets (the real estate market and financial markets), sports data, fake news data, and more. The problem is twofold: First, since we deal with unprocessed, inconsistent data, it is necessary to perform additional preprocessing. Therefore, the second step of our research is using the entropy-based measure to capture the nonredundant, noncorrelated core information from the data. Research is conducted using well-known algorithms from the classification domain to investigate the quality of solutions derived based on initial preprocessing and the information indicated by the entropy measure. 
Eventually, the best 25% (in the sense of entropy measure) attributes are selected to perform the whole classification procedure once again, and the results are compared.<\/jats:p>","DOI":"10.3390\/e23121621","type":"journal-article","created":{"date-parts":[[2021,12,2]],"date-time":"2021-12-02T02:40:02Z","timestamp":1638412802000},"page":"1621","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":19,"title":["Real-World Data Difficulty Estimation with the Use of Entropy"],"prefix":"10.3390","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7893-5410","authenticated-orcid":false,"given":"Przemys\u0142aw","family":"Juszczuk","sequence":"first","affiliation":[{"name":"Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01-447 Warsaw, Poland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2128-6998","authenticated-orcid":false,"given":"Jan","family":"Kozak","sequence":"additional","affiliation":[{"name":"Faculty of Informatics and Communication, Department of Machine Learning, University of Economics in Katowice, 1 Maja 50, 40-287 Katowice, Poland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5307-0067","authenticated-orcid":false,"given":"Grzegorz","family":"Dziczkowski","sequence":"additional","affiliation":[{"name":"Faculty of Informatics and Communication, Department of Machine Learning, University of Economics in Katowice, 1 Maja 50, 40-287 Katowice, Poland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1137-4350","authenticated-orcid":false,"given":"Szymon","family":"G\u0142owania","sequence":"additional","affiliation":[{"name":"Faculty of Informatics and Communication, Department of Machine Learning, University of Economics in Katowice, 1 Maja 50, 40-287 Katowice, Poland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9463-5562","authenticated-orcid":false,"given":"Tomasz","family":"Jach","sequence":"additional","affiliation":[{"name":"Faculty of Informatics and Communication, 
Department of Machine Learning, University of Economics in Katowice, 1 Maja 50, 40-287 Katowice, Poland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5122-2645","authenticated-orcid":false,"given":"Barbara","family":"Probierz","sequence":"additional","affiliation":[{"name":"Faculty of Informatics and Communication, Department of Machine Learning, University of Economics in Katowice, 1 Maja 50, 40-287 Katowice, Poland"}]}],"member":"1968","published-online":{"date-parts":[[2021,12,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"115561","DOI":"10.1016\/j.eswa.2021.115561","article-title":"Big data analytics and machine learning: A retrospective overview and bibliometric analysis","volume":"184","author":"Zhang","year":"2021","journal-title":"Expert Syst. Appl."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"44","DOI":"10.1016\/j.inffus.2020.01.005","article-title":"Overview and comparative study of dimensionality reduction techniques for high dimensional data","volume":"59","author":"Ayesha","year":"2020","journal-title":"Inf. Fusion"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"107353","DOI":"10.1016\/j.asoc.2021.107353","article-title":"Attribute reduction methods in fuzzy rough set theory: An overview, comparative experiments, and new directions","volume":"107","author":"Yuan","year":"2021","journal-title":"Appl. Soft Comput."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Jolliffe, I. (2021). A 50-year personal journey through time with principal component analysis. J. Multivar. 
Anal., 104820.","DOI":"10.1016\/j.jmva.2021.104820"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"107633","DOI":"10.1016\/j.knosys.2021.107633","article-title":"A self-adaptive weighted differential evolution approach for large-scale feature selection","volume":"235","author":"Wang","year":"2021","journal-title":"Knowl.-Based Syst."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"104210","DOI":"10.1016\/j.engappai.2021.104210","article-title":"Review of swarm intelligence-based feature selection methods","volume":"100","author":"Rostami","year":"2020","journal-title":"Eng. Appl. Artif. Intell."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"100663","DOI":"10.1016\/j.swevo.2020.100663","article-title":"A survey on swarm intelligence approaches to feature selection in data mining","volume":"54","author":"Nguyen","year":"2020","journal-title":"Swarm Evol. Comput."},{"key":"ref_8","first-page":"115895","article-title":"A framework for feature selection through boosting","volume":"187","author":"Alsahaf","year":"2022","journal-title":"Knowl.-Based Syst."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1124","DOI":"10.1126\/science.185.4157.1124","article-title":"Judgment under Uncertainty: Heuristics and Biases","volume":"184","author":"Tversky","year":"1974","journal-title":"Science"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"376","DOI":"10.1016\/j.inffus.2021.07.001","article-title":"Advances in Data Preprocessing for Biomedical Data Fusion: An Overview of the Methods, Challenges, and Prospects","volume":"76","author":"Wang","year":"2021","journal-title":"Inf. Fusion"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"114743","DOI":"10.1016\/j.eswa.2021.114743","article-title":"Towards missing electric power data imputation for energy management systems","volume":"174","author":"Wang","year":"2021","journal-title":"Expert Syst. 
Appl."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"107114","DOI":"10.1016\/j.knosys.2021.107114","article-title":"Missing data imputation for traffic congestion data based on joint matrix factorization","volume":"225","author":"Jia","year":"2021","journal-title":"Knowl.-Based Syst."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"379","DOI":"10.1002\/j.1538-7305.1948.tb01338.x","article-title":"A mathematical theory of communication","volume":"27","author":"Shannon","year":"1948","journal-title":"Bell Syst. Tech. J."},{"key":"ref_14","unstructured":"R\u00e9nyi, A. (1961, January 20\u201330). On measures of entropy and information. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"479","DOI":"10.1007\/BF01016429","article-title":"Possible generalization of Boltzmann-Gibbs statistics","volume":"52","author":"Tsallis","year":"1988","journal-title":"J. Stat. Phys."},{"key":"ref_16","unstructured":"Quinlan, J.R. (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1007\/BF00116251","article-title":"Induction of decision trees","volume":"1","author":"Quinlan","year":"1986","journal-title":"Mach. Learn."},{"key":"ref_18","first-page":"27","article-title":"Conditional likelihood maximization: A unifying framework for information theoretic feature selection","volume":"13","author":"Brown","year":"2012","journal-title":"J. Mach. Learn. Res."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"1184","DOI":"10.1109\/TSP.2011.2178406","article-title":"Survival information potential: A new criterion for adaptive system training","volume":"60","author":"Chen","year":"2012","journal-title":"IEEE Trans. 
Signal Process"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"164","DOI":"10.1016\/j.infrared.2018.04.003","article-title":"Particle swarm optimization-based local entropy weighted histogram equalization for infrared image enhancement","volume":"91","author":"Wan","year":"2018","journal-title":"Infrared Phys. Technol."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"115","DOI":"10.1016\/j.asoc.2017.04.030","article-title":"Entropic simplified swarm optimization for the task assignment problem","volume":"58","author":"Lai","year":"2017","journal-title":"Appl. Soft Comput."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"115","DOI":"10.1016\/j.engappai.2013.07.022","article-title":"Entropy based Binary Particle Swarm Optimization and classification for ear detection","volume":"27","author":"Ganesh","year":"2014","journal-title":"Eng. Appl. Artif. Intell."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Principe, J.C. (2010). Information Theoretic Learning: R\u00e9nyi\u2019s Entropy and Kernel Perspectives, Springer.","DOI":"10.1007\/978-1-4419-1570-2"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.fss.2020.10.017","article-title":"Fuzzy information entropy-based adaptive approach for hybrid feature outlier detection","volume":"421","author":"Yuan","year":"2021","journal-title":"Fuzzy Sets Syst."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"108052","DOI":"10.1016\/j.ymssp.2021.108052","article-title":"Multiscale symbolic fuzzy entropy: An entropy denoising method for weak feature extraction of rotating machinery","volume":"162","author":"Li","year":"2022","journal-title":"Mech. Syst. Signal Process."},{"key":"ref_26","unstructured":"Kumar, R., Gandotra, N. (2021). A novel pythagorean fuzzy entropy measure using MCDM application in preference of the advertising company with TOPSIS approach. Mater. 
Proc., in press."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"182","DOI":"10.1016\/j.jspi.2021.05.009","article-title":"The properties of entropy as a measure of randomness in a clinical trial","volume":"216","author":"Hoberman","year":"2022","journal-title":"J. Stat. Plan. Inference"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"141","DOI":"10.1016\/j.ins.2021.01.073","article-title":"Entropy measure for orderable sets","volume":"561","author":"Zhang","year":"2021","journal-title":"Inf. Sci."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"126068","DOI":"10.1016\/j.physa.2021.126068","article-title":"Measuring information flow among international stock markets: An approach of entropy-based networks on multi time-scales","volume":"577","author":"Kuang","year":"2021","journal-title":"Phys. A Stat. Mech. Its Appl."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Kozak, J., Kania, K., and Juszczuk, P. (2020). Permutation entropy as a measure of information gain\/loss in the different symbolic descriptions of financial data. Entropy, 22.","DOI":"10.3390\/e22030330"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"6285","DOI":"10.1016\/j.arabjc.2020.05.021","article-title":"On entropy measures of molecular graphs using topological indices","volume":"13","author":"Manzoor","year":"2020","journal-title":"Arab. J. Chem."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"103796","DOI":"10.1016\/j.rinp.2020.103796","article-title":"Entropic measures of an atom confined in modified Hulthen potential","volume":"21","author":"Kumar","year":"2021","journal-title":"Results Phys."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1016\/j.physa.2003.08.022","article-title":"Multiscale entropy analysis of human gait dynamics","volume":"330","author":"Costa","year":"2003","journal-title":"Phys. A Stat. Mech. 
Its Appl."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"2297","DOI":"10.1073\/pnas.88.6.2297","article-title":"Approximate entropy as a measure of system complexity","volume":"88","author":"Pincus","year":"1991","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"H2039","DOI":"10.1152\/ajpheart.2000.278.6.H2039","article-title":"Physiological time-series analysis using approximate entropy and sample entropy","volume":"278","author":"Richman","year":"2000","journal-title":"Am. J. Physiol. Heart Circ. Physiol."},{"key":"ref_36","first-page":"H2039","article-title":"Revisiting sample entropy analysis","volume":"278","author":"Govindan","year":"2000","journal-title":"Phys. A Stat. Mech. Its Appl."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"4058","DOI":"10.1016\/j.jfranklin.2021.02.024","article-title":"Permutation entropy based detection scheme of replay attacks in industrial cyber-physical systems","volume":"358","author":"Zhou","year":"2021","journal-title":"J. Frankl. Inst."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"474","DOI":"10.1016\/j.ymssp.2011.11.022","article-title":"Permutation entropy: A nonlinear statistical measure for status characterization of rotary machines","volume":"29","author":"Yan","year":"2012","journal-title":"Mech. Syst. Signal Process."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"87","DOI":"10.1016\/j.autcon.2017.12.036","article-title":"Analysing real world data streams with spatio-temporal correlations: Entropy vs. Pearson correlation","volume":"88","author":"Barnaghi","year":"2018","journal-title":"Autom. Constr."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"2073","DOI":"10.1111\/mec.13082","article-title":"Information entropy as a measure of genetic diversity and evolvability in colonization","volume":"24","author":"Day","year":"2015","journal-title":"Mol. 
Ecol."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Liu, X., Jiang, A., Xu, N., and Xue, J. (2016). Increment Entropy as a Measure of Complexity for Time Series. Entropy, 18.","DOI":"10.3390\/e18010022"},{"key":"ref_42","first-page":"157","article-title":"Urban Development and Complexity: Shannon Entropy as a Measure of Diversity","volume":"37","author":"Zachary","year":"2020","journal-title":"Plan. Pract. Res."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Mayer, C., Bachler, M., H\u00f6rtenhuber, M., Stocker, C., Holzinger, A., and Wassertheurer, S. (2014). Selection of entropy-measure parameters for knowledge discovery in heart rate variability data. BMC Bioinform., 15.","DOI":"10.1186\/1471-2105-15-S6-S2"},{"key":"ref_44","first-page":"1036","article-title":"Approximate Entropy as a Measure of Cognitive Fatigue: An EEG Pilot Study","volume":"20","author":"Chuckravanen","year":"2020","journal-title":"Int. J. Emerg. Trends Sci. Technol."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Coates, L., Shi, J., Rochester, L., Del Din, S., and Pantall, A. (2020). Entropy of Real-World Gait in Parkinson\u2019s Disease Determined from Wearable Sensors as a Digital Marker of Altered Ambulatory Behavior. Sensors, 20.","DOI":"10.3390\/s20092631"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1257\/jep.31.2.211","article-title":"Social media and fake news in the 2016 election","volume":"31","author":"Allcott","year":"2017","journal-title":"J. Econ. Perspect."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"eaau4586","DOI":"10.1126\/sciadv.aau4586","article-title":"Less than you think: Prevalence and predictors of fake news dissemination on Facebook","volume":"5","author":"Guess","year":"2019","journal-title":"Sci. 
Adv."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"1094","DOI":"10.1126\/science.aao2998","article-title":"The science of fake news","volume":"359","author":"Lazer","year":"2018","journal-title":"Science"},{"key":"ref_49","first-page":"7","article-title":"Preprocessing techniques for text mining","volume":"5","author":"Kannan","year":"2014","journal-title":"Int. J. Comput. Sci. Commun. Netw."},{"key":"ref_50","unstructured":"Wang, K., Thrasher, C., Viegas, E., Li, X., and Hsu, B.J.P. (2010, January 2\u20134). An overview of Microsoft Web N-gram corpus and applications. Proceedings of the NAACL HLT 2010 Demonstration Session, Los Angeles, CA, USA."},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"175","DOI":"10.1007\/s10339-019-00912-3","article-title":"Automating the process of identifying the preferred representational system in Neuro Linguistic Programming using Natural Language Processing","volume":"20","author":"Amirhosseini","year":"2019","journal-title":"Cogn. Process."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Strakov\u00e1, J., Straka, M., and Hajic, J. (2014, January 23\u201324). Open-source tools for morphology, lemmatization, POS tagging and named entity recognition. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA.","DOI":"10.3115\/v1\/P14-5003"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Kalra, V., and Agrawal, R. (2019). Challenges of text analytics in opinion mining. Extracting Knowledge from Opinion Mining, IGI Global.","DOI":"10.4018\/978-1-5225-6117-0"},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"412","DOI":"10.36689\/uhk\/hed\/2021-01-042","article-title":"The COVID-19 Pandemic and the Professional Situation on the Real Estate Market in Poland","volume":"11","author":"Koszel","year":"2021","journal-title":"Hradec Econ. 
Days"},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"83","DOI":"10.15804\/rop2020105","article-title":"Program, Strategy and Tactics of Communist Movement in Contemporary Epoche","volume":"11","author":"Wiktor","year":"2020","journal-title":"Real. Politics Estim.-Comments"},{"key":"ref_56","doi-asserted-by":"crossref","first-page":"741","DOI":"10.1016\/j.ijforecast.2018.01.003","article-title":"Predictive analysis and modelling football results using machine learning approach for English Premier League","volume":"35","author":"Baboota","year":"2019","journal-title":"Int. J. Forecast."},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"544","DOI":"10.1016\/j.knosys.2006.04.011","article-title":"Predicting football results using Bayesian nets and other machine learning techniques","volume":"19","author":"Joseph","year":"2006","journal-title":"Knowl.-Based Syst."},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Eryarsoy, E., and Delen, D. (2019, January 8\u201311). Predicting the Outcome of a Football Game: A Comparative Analysis of Single and Ensemble Analytics Methods. Proceedings of the 52nd Hawaii International Conference on System Sciences, Maui, HI, USA.","DOI":"10.24251\/HICSS.2019.136"},{"key":"ref_59","unstructured":"Schauberger, G., Groll, A., and Tutz, G. (2016). Modeling football results in the German Bundesliga using match-specific covariates. Engineering."},{"key":"ref_60","doi-asserted-by":"crossref","first-page":"460","DOI":"10.1177\/1471082X18799934","article-title":"Predicting matches in international football tournaments with random forests","volume":"18","author":"Schauberger","year":"2018","journal-title":"Stat. Model."},{"key":"ref_61","unstructured":"(2021, August 31). STS.PL. 
Available online: https:\/\/stats.sts.pl\/pl."},{"key":"ref_62","doi-asserted-by":"crossref","first-page":"1573","DOI":"10.1016\/j.procs.2021.08.161","article-title":"Heterogeneous ensembles of classifiers in predicting Bundesliga football results","volume":"192","author":"Kozak","year":"2021","journal-title":"Procedia Comput. Sci."},{"key":"ref_63","doi-asserted-by":"crossref","first-page":"e9","DOI":"10.1002\/spy2.9","article-title":"Detecting opinion spams and fake news using text classification","volume":"1","author":"Ahmed","year":"2018","journal-title":"Secur. Priv."},{"key":"ref_64","doi-asserted-by":"crossref","first-page":"2893","DOI":"10.1016\/j.procs.2021.09.060","article-title":"Rapid detection of fake news based on machine learning methods","volume":"192","author":"Probierz","year":"2021","journal-title":"Procedia Comput. Sci."},{"key":"ref_65","unstructured":"Hall, M.A. (1998). Correlation-Based Feature Subset Selection for Machine Learning. [Ph.D. Thesis, University of Waikato]."}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/23\/12\/1621\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T07:38:29Z","timestamp":1760168309000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/23\/12\/1621"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,12,1]]},"references-count":65,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2021,12]]}},"alternative-id":["e23121621"],"URL":"https:\/\/doi.org\/10.3390\/e23121621","relation":{},"ISSN":["1099-4300"],"issn-type":[{"type":"electronic","value":"1099-4300"}],"subject":[],"published":{"date-parts":[[2021,12,1]]}}}