{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T15:23:43Z","timestamp":1767626623494,"version":"build-2065373602"},"reference-count":25,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2018,4,20]],"date-time":"2018-04-20T00:00:00Z","timestamp":1524182400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"The authors acknowledge the financial support of the Brazilian financial agency S\u00e3o Paulo Research Foundation (FAPESP) - grant #2013\/03452-0."}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Typically, textual information is available as unstructured data, which require processing so that data mining algorithms can handle such data; this processing is known as the pre-processing step in the overall text mining process. This paper aims at analyzing the strong impact that the pre-processing step has on most mining tasks. Therefore, we propose a methodology to vary distinct combinations of pre-processing steps and to analyze which pre-processing combination allows high precision. In order to show different combinations of pre-processing methods, experiments were performed by comparing some combinations such as stemming, term weighting, term elimination based on low frequency cut and stop words elimination. These combinations were applied in text and opinion mining tasks, from which correct classification rates were computed to highlight the strong impact of the pre-processing combinations. Additionally, we provide graphical representations from each pre-processing combination to show how visual approaches are useful to show the processing effects on document similarities and group formation (i.e., cohesion and separation).<\/jats:p>","DOI":"10.3390\/info9040100","type":"journal-article","created":{"date-parts":[[2018,4,23]],"date-time":"2018-04-23T04:29:17Z","timestamp":1524457757000},"page":"100","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":32,"title":["Analysis of Document Pre-Processing Effects in Text and Opinion Mining"],"prefix":"10.3390","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9493-145X","authenticated-orcid":false,"given":"Danilo Medeiros","family":"Eler","sequence":"first","affiliation":[{"name":"Departamento de Matematica e Computa\u00e7\u00e3o, Sao Paulo State University\u2014UNESP, Presidente Prudente 19060-900, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5773-8490","authenticated-orcid":false,"given":"Denilson","family":"Grosa","sequence":"additional","affiliation":[{"name":"Departamento de Matematica e Computa\u00e7\u00e3o, Sao Paulo State University\u2014UNESP, Presidente Prudente 19060-900, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7300-7535","authenticated-orcid":false,"given":"Ives","family":"Pola","sequence":"additional","affiliation":[{"name":"Departamento de Inform\u00e1tica, University of Technology\u2014UTFPR, Pato Branco 85503-390, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1248-528X","authenticated-orcid":false,"given":"Rog\u00e9rio","family":"Garcia","sequence":"additional","affiliation":[{"name":"Departamento de Matematica e Computa\u00e7\u00e3o, Sao Paulo State University\u2014UNESP, Presidente Prudente 19060-900, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ronaldo","family":"Correia","sequence":"additional","affiliation":[{"name":"Departamento de Matematica e Computa\u00e7\u00e3o, Sao Paulo State University\u2014UNESP, Presidente Prudente 19060-900, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jaqueline","family":"Teixeira","sequence":"additional","affiliation":[{"name":"Departamento de Matematica e Computa\u00e7\u00e3o, Sao Paulo State University\u2014UNESP, Presidente Prudente 19060-900, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2018,4,20]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Hu, Y., Milios, E.E., and Blustein, J. (2012, January 26\u201330). Enhancing Semi-supervised Document Clustering with Feature Supervision. Proceedings of the 27th Annual ACM Symposium on Applied Computing, Trento, Italy.","DOI":"10.1145\/2245276.2245457"},{"key":"ref_2","unstructured":"Nogueira, B.M., Moura, M.F., Conrado, M.S., Rossi, R.G., Marcacini, R.M., and Rezende, S.O. (2008, January 26\u201330). Winning Some of the Document Preprocessing Challenges in a Text Mining Process. Proceedings of the Anais do IV Workshop em Algoritmos e Aplica\u00e7\u00f5es de Minera\u00e7\u00e3o de Dados\u2014WAAMD, XXIII Simp\u00f3sio Brasileiro de Banco de Dados\u2014SBBD, Campinas, Sao Paulo, Brazil."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Chandrasekar, P., and Qian, K. (2016). The Impact of Data Preprocessing on the Performance of a Naive Bayes Classifier, IEEE Computer Society.","DOI":"10.1109\/COMPSAC.2016.205"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Tugizimana, F., Steenkamp, P., Piater, L., and Dubery, I. (2016). Conversation on Data Mining Strategies in LC-MS Untargeted Metabolomics: Pre-Processing and Pre-Treatment Steps. Metabolites, 6.","DOI":"10.3390\/metabo6040040"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Lee, J.L., and Yi, J.-S. (2017). Predicting Project\u2019s Uncertainty Risk in the Bidding Process by Integrating Unstructured Text Data and Structured Numerical Data Using Text Mining. Appl. Sci., 7.","DOI":"10.3390\/app7111141"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Roh, T., Jeong, Y., and Yoon, B. (2017). Developing a Methodology of Structuring and Layering Technological Information in Patent Documents through Natural Language Processing. Sustainability, 9.","DOI":"10.3390\/su9112117"},{"key":"ref_7","first-page":"3","article-title":"About relationship between business text patterns and financial performance in corporate data","volume":"4","author":"Lee","year":"2018","journal-title":"J. Open Innov. Technol. Mark. Complex."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"613","DOI":"10.1145\/361219.361220","article-title":"A vector space model for automatic indexing","volume":"18","author":"Salton","year":"1975","journal-title":"Commun. ACM"},{"key":"ref_9","unstructured":"Porter, M.F. (1997). An Algorithm for Suffix Stripping, Morgan Kaufmann Publishers Inc."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"351","DOI":"10.1108\/eb026562","article-title":"On the specification of term values in automatic indexing","volume":"29","author":"Salton","year":"1973","journal-title":"J. Doc."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"159","DOI":"10.1147\/rd.22.0159","article-title":"The automatic creation of literature abstracts","volume":"2","author":"Luhn","year":"1958","journal-title":"IBM J. Res. Dev."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"218","DOI":"10.1057\/palgrave.ivs.9500054","article-title":"On improved projection techniques to support visual exploration of multidimensional datasets","volume":"2","author":"Tejada","year":"2003","journal-title":"Inf. Vis."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"564","DOI":"10.1109\/TVCG.2007.70443","article-title":"Least Square Projection: A fast high precision multidimensional projection technique and its application to document mapping","volume":"14","author":"Paulovich","year":"2008","journal-title":"IEEE Trans. Vis. Comput. Graph."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Eler, D.M., Paulovich, F.V., de Oliveira, M.C.F., and Minghim, R. (2008, January 9\u201311). Coordinated and Multiple Views for Visualizing Text Collections. Proceedings of the 12th International Conference Information Visualisation, London, UK.","DOI":"10.1109\/IV.2008.39"},{"key":"ref_15","unstructured":"Eler, D.M., Pola, I.R.V., Garcia, R.E., and Teixeira, J.B.M. (2017). Visualizing the Document Pre-processing Effects in Text Mining Process. Advances in Intelligent Systems and Computing, Proceedings of the 14th International Conference on Information Technology: New Generations (ITNG 2017), Las Vegas, NV, USA, 10\u201312 April 2017, Springer International Publishing."},{"key":"ref_16","unstructured":"Tan, P.N., Steinbach, M., and Kumar, V. (2005). Introduction to Data Mining, Addison-Wesley Longman Publishing Co., Inc.. [1st ed.]."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Liu, B. (2012). Sentiment Analysis and Opinion Mining, Morgan and Claypool Publishers.","DOI":"10.1007\/978-3-031-02145-9"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"58","DOI":"10.5747\/ce.2017.v09.n1.e184","article-title":"Feature Space Unidimensional Projections for Scatterplots","volume":"9","author":"Eler","year":"2017","journal-title":"Colloq. Exactarum"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"923","DOI":"10.1007\/s00371-009-0368-7","article-title":"Visual analysis of image collections","volume":"25","author":"Eler","year":"2009","journal-title":"Vis. Comput."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"1091","DOI":"10.1111\/j.1467-8659.2011.01958.x","article-title":"Piecewise Laplacian-based Projection for Interactive Data Exploration and Organization","volume":"30","author":"Paulovich","year":"2011","journal-title":"Comput. Graph. Forum"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Bodo, L., de Oliveira, H.C., Breve, F.A., and Eler, D.M. (2016, January 10\u201313). Performance Indicators Analysis in Software Processes Using Semi-supervised Learning with Information Visualization. Proceedings of the 13th International Conference on Information Technology, New Generations (ITNG 2016), Las Vegas, NV, USA.","DOI":"10.1007\/978-3-319-32467-8_49"},{"key":"ref_22","unstructured":"Esuli, A., and Sebastiani, F. (2006, January 22\u201328). SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining. Proceedings of the 5th Conference on Language Resources and Evaluation, Genoa, Italy."},{"key":"ref_23","unstructured":"Cambria, E., Speer, R., Havasi, C., and Hussain, A. (2010). SenticNet: A Publicly Available Semantic Resource for Opinion Mining. AAAI Fall Symposium: Commonsense Knowledge, AAAI Press. AAAI Technical Report."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Eler, D.M., and Garcia, R.E. (2013, January 16\u201318). Using Otsu\u2019s Threshold Selection Method for Eliminating Terms in Vector Space Model Computation. Proceedings of the International Conference on Information Visualization, London, UK.","DOI":"10.1109\/IV.2013.29"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1016\/0306-4573(88)90021-0","article-title":"Term-weighting approaches in automatic text retrieval","volume":"24","author":"Salton","year":"1988","journal-title":"Inf. Process. Manag."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/9\/4\/100\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T15:01:28Z","timestamp":1760194888000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/9\/4\/100"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,4,20]]},"references-count":25,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2018,4]]}},"alternative-id":["info9040100"],"URL":"https:\/\/doi.org\/10.3390\/info9040100","relation":{},"ISSN":["2078-2489"],"issn-type":[{"type":"electronic","value":"2078-2489"}],"subject":[],"published":{"date-parts":[[2018,4,20]]}}}