{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,17]],"date-time":"2025-10-17T14:13:33Z","timestamp":1760710413274,"version":"3.41.0"},"reference-count":48,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2020,11,12]],"date-time":"2020-11-12T00:00:00Z","timestamp":1605139200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Deanship of Scientific Research at King Saud University","award":["RGP-264"],"award-info":[{"award-number":["RGP-264"]}]},{"DOI":"10.13039\/100014717","name":"National Outstanding Youth Science Fund Project of National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["U1736105"],"award-info":[{"award-number":["U1736105"]}],"id":[{"id":"10.13039\/100014717","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2020,11,30]]},"abstract":"<jats:p>In this article, first a comprehensive study of the impact of term weighting schemes on the topic modeling performance (i.e., LDA and DMM) on Arabic long and short texts is presented. We investigate six term weighting methods including Word count method (standard topic models), TFIDF, PMI, BDC, CLPB, and CEW. Moreover, we propose a novel combination term weighting scheme, namely, CmTLB. We utilize the mTFIDF that takes into account the missing terms and the number of the documents in which the term appears when calculating the term weight. For further robust term weight, we combine mTFIDF with two weighting methods. We evaluate CmTLB against the studied weighting schemes by the quality of the learned topics (topic visualization and topic coherence), classification, and clustering tasks. 
We applied weighting schemes to Latent Dirichlet allocation (LDA) and Dirichlet multinomial mixture (DMM) on eight Arabic long and short document datasets, respectively. The experimental results show that appropriate weighting schemes can effectively improve topic modeling performance on Arabic texts. More importantly, our proposed CmTLB significantly outperforms the other weighting schemes. Secondly, we investigate whether the Arabic stemming process can improve topic modeling performance. We study the three approaches of Arabic stemming including root-based, stem-based, and statistical approaches. We also train topic models with weighting schemes on documents after applying four stemmers related to different stemming approaches. The results show that applying the stemming process not only reduces the dimensionality of the term-document matrix, leading to a faster estimation process, but also enhances topic modeling performance on both short and long Arabic documents. Moreover, the Farasa stemmer achieves the highest performance in most cases, since it prevents the ambiguity that may arise from the blind removal of affixes, as in root-based or stem-based stemmers.<\/jats:p>","DOI":"10.1145\/3405843","type":"journal-article","created":{"date-parts":[[2020,11,24]],"date-time":"2020-11-24T13:01:34Z","timestamp":1606222894000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["The Impact of Weighting Schemes and Stemming Process on Topic Modeling of Arabic Long and Short Texts"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2320-1692","authenticated-orcid":false,"given":"Tinghuai","family":"Ma","sequence":"first","affiliation":[{"name":"Nanjing University of Information Science and Technology, China"}]},{"given":"Raeed","family":"Al-Sabri","sequence":"additional","affiliation":[{"name":"Nanjing University of Information Science and 
Technology, China"}]},{"given":"Lejun","family":"Zhang","sequence":"additional","affiliation":[{"name":"Yangzhou University"}]},{"given":"Bockarie","family":"Marah","sequence":"additional","affiliation":[{"name":"Nanjing University of Information Science and Technology, China"}]},{"given":"Najla","family":"Al-Nabhan","sequence":"additional","affiliation":[{"name":"King Saud University, Saudi Arabia"}]}],"member":"320","published-online":{"date-parts":[[2020,11,12]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1080\/0952813X.2016.1212100"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N16-3003"},{"volume-title":"Jihad El Sana, and Walid Abusalah","year":"2014","author":"Abuaiadah Diab","key":"e_1_2_1_3_1"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jksuci.2014.04.001"},{"key":"e_1_2_1_5_1","doi-asserted-by":"crossref","unstructured":"M. Alhawarat and M. Hegazi. 2018. Revisiting K-means and topic modeling a comparison study to cluster Arabic documents. IEEE Access (2018). DOI:https:\/\/doi.org\/10.1109\/ACCESS.2018.2852648","DOI":"10.1109\/ACCESS.2018.2852648"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jclepro.2019.02.063"},{"volume-title":"Proceedings of the 10th International Conference on Web and Social Media (ICWSM\u201916)","year":"2016","author":"Alsaedi Nasser","key":"e_1_2_1_7_1"},{"volume-title":"Building and benchmarking novel Arabic stemmer for document classification. 
Journal of Computational and Theoretical Nanoscience","year":"2016","author":"Ayedh Abdullah","key":"e_1_2_1_8_1"},{"volume-title":"Jordan","year":"2003","author":"Blei David M.","key":"e_1_2_1_9_1"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.asej.2017.04.007"},{"volume-title":"Proceedings of German Society for Computational Linguistics (GSCL\u201909)","year":"2009","author":"Bouma Gerlof","key":"e_1_2_1_11_1"},{"volume-title":"Improved TFIDF in big news retrieval: An empirical study. Pattern Recognition Letters","year":"2017","author":"Chen Chien Hsing","key":"e_1_2_1_12_1"},{"volume-title":"Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC\u201916)","year":"2016","author":"Darwish Kareem","key":"e_1_2_1_13_1"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.dib.2019.104076"},{"volume-title":"Eduardo F. Morales, and Jos\u00e9 Mart\u00ednez-Carranza.","year":"2015","author":"Escalante Hugo Jair","key":"e_1_2_1_15_1"},{"volume-title":"Arabic natural language processing: An overview","year":"2019","author":"Guellil Imane","key":"e_1_2_1_16_1"},{"volume-title":"Introduction to Arabic Natural Language Processing","author":"Habash Nizar Y.","key":"e_1_2_1_17_1"},{"volume-title":"NAACL 2001","year":"2001","author":"Khoja G.","key":"e_1_2_1_18_1"},{"volume-title":"Conference on Data Mining (DMIN\u201906)","year":"2006","author":"Khreisat Laila","key":"e_1_2_1_19_1"},{"key":"e_1_2_1_20_1","doi-asserted-by":"crossref","unstructured":"R. Lakshmi and S. Baskar. 2019. Novel term weighting schemes for document representation based on ranking of terms and Fuzzy logic with semantic relationship of terms. 
Expert Systems with Applications (2019). DOI:https:\/\/doi.org\/10.1016\/j.eswa.2019.07.022","DOI":"10.1016\/j.eswa.2019.07.022"},{"volume-title":"Connell","year":"2007","author":"Larkey Leah S.","key":"e_1_2_1_21_1"},{"volume-title":"Topic modeling for short texts with auxiliary word embeddings categories and subject descriptors. SIGIR","year":"2016","author":"Li Chenliang","key":"e_1_2_1_22_1"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/s13042-017-0681-9"},{"volume-title":"Filtering out the noise in short text topic modeling. Information Sciences 456 (Aug","year":"2018","author":"Li Ximing","key":"e_1_2_1_24_1"},{"volume-title":"Exploring coherent topics by topic modeling with term weighting. Information Processing and Management","year":"2018","author":"Li Ximing","key":"e_1_2_1_25_1"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33017884"},{"volume-title":"Al-Rodhaan","year":"2019","author":"Ma Tinghuai","key":"e_1_2_1_27_1"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2018.08.010"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/RoboMech.2016.7813155"},{"volume-title":"52nd Annual Meeting of the Association for Computational Linguistics (ACL\u201914)","year":"2014","author":"Monroe Will","key":"e_1_2_1_30_1"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2019.09.238"},{"key":"e_1_2_1_32_1","first-page":"45","article-title":"Arabic topic identification based on empirical studies of topic models","volume":"27","author":"Naili Marwa","year":"2017","journal-title":"ARIMA Journal"},{"volume-title":"Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC\u201914)","author":"Pasha 
Arfath","key":"e_1_2_1_33_1"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2018.10.492"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ins.2019.03.023"},{"key":"e_1_2_1_36_1","first-page":"2993272","article-title":"A self-play and sentiment-emphasized comment integration framework based on deep q-learning in a crowdsourcing scenario","volume":"2020","author":"Rong Huan","year":"2020","journal-title":"DOI:https:\/\/doi.org\/10.1109\/tkde."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.asoc.2017.04.069"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/NOORIC.2013.55"},{"key":"e_1_2_1_39_1","doi-asserted-by":"crossref","unstructured":"Hussein Soori Jan Plato\u0161 and V\u00e1clav Sn\u00e1\u0161el. 2012. Simple stemming rules for Arabic language. In Advances in Intelligent Systems and Computing. DOI:https:\/\/doi.org\/10.1007\/978-3-642-31603-6_9","DOI":"10.1007\/978-3-642-31603-6_9"},{"key":"e_1_2_1_40_1","first-page":"5","article-title":"A statistical interpretation of term specificity and its application in retrieval","volume":"60","author":"Jones Karen Sp\u00e4rck","year":"2004","journal-title":"Journal of Documentation"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ITCC.2005.90"},{"key":"e_1_2_1_42_1","first-page":"3","article-title":"Probabilistic \u2014Asurvey","volume":"9","author":"Padmaja CH V.","year":"2018","journal-title":"International Journal of Advanced Research in Computer Science"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/CCTAE.2010.5544382"},{"volume-title":"Proceedings of the International Conference on Tools with Artificial Intelligence (ICTAI). 
DOI:https:\/\/doi.org\/10","year":"2016","author":"Wang Tao","key":"e_1_2_1_44_1"},{"volume-title":"Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT\u201910), Proceedings of the main conference.","author":"Andrew","key":"e_1_2_1_45_1"},{"volume-title":"Proceedings of the 26th International Conference on Computational Linguistics (COLING\u201916)","year":"2016","author":"Yang Kai","key":"e_1_2_1_46_1"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/2623330.2623715"},{"volume-title":"IEEE Transactions on Knowledge and Data Engineering","year":"2017","author":"Zhuang Yueting","key":"e_1_2_1_48_1"}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3405843","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3405843","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:31:51Z","timestamp":1750195911000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3405843"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,11,12]]},"references-count":48,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2020,11,30]]}},"alternative-id":["10.1145\/3405843"],"URL":"https:\/\/doi.org\/10.1145\/3405843","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"type":"print","value":"2375-4699"},{"type":"electronic","value":"2375-4702"}],"subject":[],"published":{"date-parts":[[2020,11,12]]},"assertion":[{"value":"2020-01-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2020-06-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-11-12","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}