{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,21]],"date-time":"2026-03-21T06:30:10Z","timestamp":1774074610191,"version":"3.50.1"},"reference-count":39,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2016,4,18]],"date-time":"2016-04-18T00:00:00Z","timestamp":1460937600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>Preprocessing is one of the main components in a conventional document categorization (DC) framework. This paper aims to highlight the effect of preprocessing tasks on the efficiency of the Arabic DC system. In this study, three classification techniques are used, namely, naive Bayes (NB), k-nearest neighbor (KNN), and support vector machine (SVM). Experimental analysis on Arabic datasets reveals that preprocessing techniques have a significant impact on the classification accuracy, especially with complicated morphological structure of the Arabic language. Choosing appropriate combinations of preprocessing tasks provides significant improvement on the accuracy of document categorization depending on the feature size and classification techniques. Findings of this study show that the SVM technique has outperformed the KNN and NB techniques. 
The SVM technique achieved 96.74% micro-F1 value by using the combination of normalization and stemming as preprocessing tasks.<\/jats:p>","DOI":"10.3390\/a9020027","type":"journal-article","created":{"date-parts":[[2016,4,18]],"date-time":"2016-04-18T10:37:17Z","timestamp":1460975837000},"page":"27","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":50,"title":["The Effect of Preprocessing on Arabic Document Categorization"],"prefix":"10.3390","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0328-0613","authenticated-orcid":false,"given":"Abdullah","family":"Ayedh","sequence":"first","affiliation":[{"name":"School of Information Science and Engineering, Central South University, Changsha 410000, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Guanzheng","family":"TAN","sequence":"additional","affiliation":[{"name":"School of Information Science and Engineering, Central South University, Changsha 410000, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Khaled","family":"Alwesabi","sequence":"additional","affiliation":[{"name":"School of Information Science and Engineering, Central South University, Changsha 410000, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hamdi","family":"Rajeh","sequence":"additional","affiliation":[{"name":"College of Computer Science and Electrical Engineering, Hunan University, Changsha 410000, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2016,4,18]]},"reference":[{"key":"ref_1","unstructured":"Al-Kabi, M., Al-Shawakfa, E., and Alsmadi, I. (2013). The Effect of Stemming on Arabic Text Classification: An Empirical Study. Inf. Retr. Methods Multidiscip. Appl."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Joachims, T. (1998). 
Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Springer.","DOI":"10.1007\/BFb0026683"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Nehar, A., Ziadi, D., Cherroun, H., and Guellouma, Y. (2012). An efficient stemming for arabic text classification. Innov. Inf. Technol.","DOI":"10.1109\/INNOVATIONS.2012.6207760"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"199","DOI":"10.1007\/s10044-005-0256-3","article-title":"A comparative study on text representation schemes in text categorization","volume":"8","author":"Song","year":"2005","journal-title":"Pattern Anal. Appl."},{"key":"ref_5","first-page":"354","article-title":"Influence of word normalization on text classification","volume":"4","author":"Toman","year":"2006","journal-title":"Proc. InSciT"},{"key":"ref_6","first-page":"430","article-title":"The Influence of preprocessing parameters on text categorization","volume":"1","author":"Rehurek","year":"2007","journal-title":"Int. J. Appl. Sci. Eng. Technol."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"104","DOI":"10.1016\/j.ipm.2013.08.006","article-title":"The impact of preprocessing on text classification","volume":"50","author":"Uysal","year":"2014","journal-title":"Inf. Proc. Manag."},{"key":"ref_8","unstructured":"M\u00e9ndez, J.R., Iglesias, E.L., Fdez-Riverola, F., Diaz, F., and Corchado, J.M. (2005). Current Topics in Artificial Intelligence, Springer."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Chirawichitchai, N., Sa-nguansat, P., and Meesad, P. (2010, January 24\u201325). Developing an Effective Thai Document Categorization Framework Base on Term Relevance Frequency Weighting. Proceedings of the 2010 8th International Conference on ICT, Bangkok, Thailand.","DOI":"10.1109\/ICTKE.2010.5692907"},{"key":"ref_10","unstructured":"Moh\u2019d Mesleh, A. (2008). 
Advances in Computer and Information Sciences and Engineering, Springer."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"430","DOI":"10.3844\/jcssp.2007.430.435","article-title":"Chi square feature extraction based SVMs Arabic language text categorization system","volume":"3","year":"2007","journal-title":"J. Comput. Sci."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"68","DOI":"10.4018\/jitwe.2011040106","article-title":"An experimental study for the effect of stop words elimination for arabic text classification algorithms","volume":"6","author":"Olayah","year":"2011","journal-title":"Int. J. Inf. Technol. Web Eng."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Al-Shammari, E.T., and Lin, J. (2008, January 26\u201330). Towards an Error-Free Arabic Stemming. Proceedings of the 2nd ACM Workshop on Improving Non English Web Searching, Napa Valley, CA, USA.","DOI":"10.1145\/1460027.1460030"},{"key":"ref_14","unstructured":"Kanan, T., and Fox, E.A. (2016). Automated Arabic Text Classification with P-Stemmer, Machine Learning, and a Tailored News Article Taxonomy. J. Assoc. Inf. Sci. Technol."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"2347","DOI":"10.1002\/asi.21173","article-title":"Feature reduction techniques for Arabic text categorization","volume":"60","author":"Duwairi","year":"2009","journal-title":"J. Am. Soc. Inf. Sci. Technol."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1007\/s10579-013-9221-8","article-title":"Comparative evaluation of text classification techniques using a large diverse Arabic dataset","volume":"47","author":"Khorsheed","year":"2013","journal-title":"Lang. Resour. Eval."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"219","DOI":"10.14445\/22312803\/IJCTT-V7P109","article-title":"Vector space models to classify arabic text","volume":"7","author":"Ababneh","year":"2014","journal-title":"Int. J. Comput. 
Trends Technol."},{"key":"ref_18","first-page":"127","article-title":"A Hybrid Method N-Grams-TFIDF with radial basis for indexing and classification of Arabic documents","volume":"8","author":"Zaki","year":"2014","journal-title":"Int. J. Softw. Eng. Its Appl."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Thabtah, F., Gharaibeh, O., and Al-Zubaidy, R. (2012). Arabic text mining using rule based classification. J. Inf. Knowl. Manag., 11.","DOI":"10.1142\/S0219649212500062"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"125","DOI":"10.2498\/cit.1001770","article-title":"Arabic Text Classification framework based on latent dirichlet allocation","volume":"20","author":"Zrigui","year":"2012","journal-title":"J. Comput. Inf. Technol."},{"key":"ref_21","unstructured":"Khoja, S. (2001, January 2\u20137). APT: Arabic Part-of-Speech Tagger. Proceedings of the Student Workshop at NAACL, Pittsburgh, PA, USA."},{"key":"ref_22","first-page":"125","article-title":"Arabic Text Categorization","volume":"4","author":"Duwairi","year":"2007","journal-title":"Int. Arab J. Inf. Technol."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Nwesri, A.F., Tahaghoghi, S.M., and Scholer, F. (2006, January 22\u201323). Capturing Out-of-Vocabulary Words in Arabic text. Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), Sydney, Australia.","DOI":"10.3115\/1610075.1610113"},{"key":"ref_24","unstructured":"Khoja, S., and Garside, R. (1999). Computing Department, Lancaster University. Available online: http:\/\/www.comp.lancs.ac.uk\/computing\/users\/khoja\/stemmer.ps."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Kanaan, G., Al-Shalabi, R., Ababneh, M., and Al-Nobani, A. (2008, January 16\u201318). Building an Effective Rule-Based Light Stemmer for Arabic Language to Improve Search Effectiveness. 
Proceedings of the 2008 International Conference on Innovations in Information Technology, Al Ain, Arab Emirates.","DOI":"10.1109\/INNOVATIONS.2008.4781687"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Aljlayl, M., and Frieder, O. (2002, January 4\u20139). On Arabic Search: Improving the Retrieval Effectiveness via a Light Stemming Approach. Proceedings of the Eleventh International Conference on Information and Knowledge Management, McLean, VA, USA.","DOI":"10.1145\/584792.584848"},{"key":"ref_27","unstructured":"Larkey, L.S., Ballesteros, L., and Connell, M.E. (2007). Arabic Computational Morphology, Springer."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1016\/0306-4573(88)90021-0","article-title":"Term-weighting approaches in automatic text retrieval","volume":"24","author":"Salton","year":"1988","journal-title":"Inf. Proc. Manag."},{"key":"ref_29","first-page":"1289","article-title":"Extensive empirical study of feature selection metrics for text classification","volume":"3","author":"Forman","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_30","first-page":"69","article-title":"Text Feature Selection using Particle Swarm Optimization Algorithm","volume":"7","author":"Zahran","year":"2009","journal-title":"World Appl. Sci. J."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"6826","DOI":"10.1016\/j.eswa.2008.08.006","article-title":"Feature selection with a measure of deviations from Poisson in text categorization","volume":"36","author":"Ogura","year":"2009","journal-title":"Expert Syst. Appl."},{"key":"ref_32","unstructured":"Thabtah, F., Eljinini, M., Zamzeer, M., and Hadi, W. (2009, January 4\u20136). Na\u00efve Bayesian Based on Chi Square to Categorize Arabic Data. 
Proceedings of the 11th International Business Information Management Association Conference (IBIMA) Conference on Innovation and Knowledge Management in Twin Track Economies, Cairo, Egypt."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/505282.505283","article-title":"Machine learning in automated text categorization","volume":"34","author":"Sebastiani","year":"2002","journal-title":"ACM Comput. Surv."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"El Kourdi, M., Bensaid, A., and Rachidi, T.-E. (2004, January 28). Automatic Arabic document categorization based on the Na\u00efve Bayes algorithm. Proceedings of the Workshop on Computational Approaches to Arabic Script-Based Languages, Geneva, Switzerland.","DOI":"10.3115\/1621804.1621819"},{"key":"ref_35","first-page":"118","article-title":"Associative classification to categorize Arabic data sets","volume":"1","year":"2010","journal-title":"Int. J. Acm Jordan"},{"key":"ref_36","first-page":"1","article-title":"An intelligent system for Arabic text categorization","volume":"6","author":"Syiam","year":"2006","journal-title":"Int. J. Intell. Comput. Inf. Sci."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"600","DOI":"10.3844\/jcssp.2008.600.605","article-title":"Arabic Text Classification Using K-NN and Naive Bayes","volume":"4","author":"Bawaneh","year":"2008","journal-title":"J. Comput. Sci."},{"key":"ref_38","unstructured":"Alaa, E. (2008). A comparative study on arabic text classification. Egypt. Comput. Sci. J., 2."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"106","DOI":"10.1016\/j.aei.2007.12.001","article-title":"Performance of KNN and SVM classifiers on full word Arabic articles","volume":"22","author":"Hmeidi","year":"2008","journal-title":"Adv. Eng. 
Inform."}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/9\/2\/27\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T19:22:25Z","timestamp":1760210545000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/9\/2\/27"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,4,18]]},"references-count":39,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2016,6]]}},"alternative-id":["a9020027"],"URL":"https:\/\/doi.org\/10.3390\/a9020027","relation":{},"ISSN":["1999-4893"],"issn-type":[{"value":"1999-4893","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,4,18]]}}}