{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,12]],"date-time":"2026-05-12T07:49:02Z","timestamp":1778572142923,"version":"3.51.4"},"reference-count":38,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2020,11,5]],"date-time":"2020-11-05T00:00:00Z","timestamp":1604534400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Document clustering is to group documents according to certain semantic features. Topic model has a richer semantic structure and considerable potential for helping users to know document corpora. Unfortunately, this potential is stymied on text documents which have overlapping nature, due to their purely unsupervised nature. To solve this problem, some semi-supervised models have been proposed for English language. However, no such work is available for poor resource language Urdu. Therefore, document clustering has become a challenging task in Urdu language, which has its own morphology, syntax and semantics. In this study, we proposed a semi-supervised framework for Urdu documents clustering to deal with the Urdu morphology challenges. The proposed model is a combination of pre-processing techniques, seeded-LDA model and Gibbs sampling, we named it seeded-Urdu Latent Dirichlet Allocation (seeded-ULDA). We apply the proposed model and other methods to Urdu news datasets for categorizing. For the datasets, two conditions are considered for document clustering, one is \u201cDataset without overlapping\u201d in which all classes have distinct nature. The other is \u201cDataset with overlapping\u201d in which the categories are overlapping and the classes are connected to each other. The aim of this study is threefold: it first shows that unsupervised models (Latent Dirichlet Allocation (LDA), Non-negative matrix factorization (NMF) and K-means) are giving satisfying results on the dataset without overlapping. Second, it shows that these unsupervised models are not performing well on the dataset with overlapping, because, on this dataset, these algorithms find some topics that are neither entirely meaningful nor effective in extrinsic tasks. Third, our proposed semi-supervised model Seeded-ULDA performs well on both datasets because this model is straightforward and effective to instruct topic models to find topics of specific interest. It is shown in this paper that the semi-supervised model, Seeded-ULDA, provides significant results as compared to unsupervised algorithms.<\/jats:p>","DOI":"10.3390\/info11110518","type":"journal-article","created":{"date-parts":[[2020,11,5]],"date-time":"2020-11-05T09:04:34Z","timestamp":1604567074000},"page":"518","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling"],"prefix":"10.3390","volume":"11","author":[{"given":"Mubashar","family":"Mustafa","sequence":"first","affiliation":[{"name":"School of Computer Science and Engineering, Central South University, 410083 Changsha, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1541-1326","authenticated-orcid":false,"given":"Feng","family":"Zeng","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Central South University, 410083 Changsha, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hussain","family":"Ghulam","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Central South University, 410083 Changsha, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hafiz","family":"Muhammad Arslan","sequence":"additional","affiliation":[{"name":"School of Software Engineering, Northeastern University, 110819 Shenyang, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2020,11,5]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Hanbury, A., Rauber, A., and de Vries, A.P. (2011). Multilingual Document Clustering Using Wikipedia as External Knowledge. Multidisciplinary Information Retrieval, Springer.","DOI":"10.1007\/978-3-642-21353-3"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"651","DOI":"10.1016\/j.patrec.2009.09.011","article-title":"Data Clustering: 50 Years Beyond K-means","volume":"31","author":"Jain","year":"2010","journal-title":"Pattern Recognit. Lett."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3173044","article-title":"Mining Event-Oriented Topics in Microblog Stream with Unsupervised Multi-View Hierarchical Embedding","volume":"12","author":"Peng","year":"2018","journal-title":"ACM Trans. Knowl. Discov. Data"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Peng, M., Zhu, J., Li, X., Huang, J., Wang, H., and Zhang, Y. (2015, January 19\u201323). Central Topic Model for Event-oriented Topics Mining in Microblog Stream. Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM \u201915, Melbourne, Australia.","DOI":"10.1145\/2806416.2806561"},{"key":"ref_5","unstructured":"Ghosh, J., and Strehl, A. (2006). Similarity-Based Text Clustering: A Comparative Study. Grouping Multidimensional Data: Recent Advances in Clustering, Springer."},{"key":"ref_6","unstructured":"Liu, L., Kang, J., Yu, J., and Wang, Z. (November, January 30). A comparative study on unsupervised feature selection methods for text clustering. Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering, Wuhan, China."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Rahman, A.U., Khan, K., Khan, W., Khan, A., and Saqia, B. (2018). Unsupervised Machine Learning based Documents Clustering in Urdu. EAI Endorsed Trans. Scalable Inf. Syst., 5.","DOI":"10.4108\/eai.19-12-2018.156081"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"42740","DOI":"10.1109\/ACCESS.2018.2852648","article-title":"Revisiting K-Means and Topic Modeling, a Comparison Study to Cluster Arabic Documents","volume":"6","author":"Alhawarat","year":"2018","journal-title":"IEEE Access"},{"key":"ref_9","unstructured":"Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S., and Blei, D.M. (2009). Reading Tea Leaves: How Humans Interpret Topic Models. Neural Information Processing Systems, Curran Associates, Inc."},{"key":"ref_10","unstructured":"Jagarlamudi, J., Daum\u00e9 III, H., and Udupa, R. (2012). Incorporating Lexical Priors into Topic Models. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"945","DOI":"10.1093\/genetics\/155.2.945","article-title":"Inference of Population Structure Using Multilocus Genotype Data","volume":"155","author":"Pritchard","year":"2000","journal-title":"Genetics"},{"key":"ref_12","unstructured":"Filipe, J., and Cordeiro, J. (2009). Enhancing Text Clustering Performance Using Semantic Similarity. Enterprise Information Systems, Springer."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"77","DOI":"10.1145\/2133806.2133826","article-title":"Probabilistic Topic Models","volume":"55","author":"Blei","year":"2012","journal-title":"Commun. ACM"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Amine, A., Otmane, A.M., and Bellatreche, L. (2013). Clustering with Probabilistic Topic Models on Arabic Texts. Modeling Approaches and Algorithms for Advanced Computer Applications, Springer International Publishing.","DOI":"10.1007\/978-3-319-00560-7"},{"key":"ref_15","unstructured":"Humayoun, M. (2007). Urdu Morphology, Orthography and Lexicon Extraction, Linguistic Institute, Stanford University. CAASL-2, the Second Workshop on Computational Approaches to Arabic Script-based Languages."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/1838751.1838754","article-title":"An Information-Extraction System for Urdu\u2014A Resource-Poor Language","volume":"9","author":"Mukund","year":"2010","journal-title":"ACM Trans. Asian Lang. Inf. Process. (TALIP)"},{"key":"ref_17","first-page":"18","article-title":"Article: Automatic Text Summarization","volume":"109","author":"Patil","year":"2015","journal-title":"Int. J. Comput. Appl."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1007\/s10462-016-9482-x","article-title":"Urdu language processing: A survey","volume":"47","author":"Daud","year":"2017","journal-title":"Artif. Intell. Rev."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Shabbir, S., Javed, N., Siddiqi, I., and Khurshid, K. (2017, January 27\u201328). A comparative study on clustering techniques for Urdu ligatures in nastaliq font. Proceedings of the 13th International Conference on Emerging Technologies (ICET), Islamabad, Pakistan.","DOI":"10.1109\/ICET.2017.8281724"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"503","DOI":"10.1007\/s10586-017-0916-2","article-title":"Urdu ligature recognition using multi-level agglomerative hierarchical clustering","volume":"21","author":"Khan","year":"2018","journal-title":"Clust. Comput."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"184","DOI":"10.1007\/s10588-018-9271-y","article-title":"Ligature Categorization Based Nastaliq Urdu Recognition Using Deep Neural Networks","volume":"25","author":"Rafeeq","year":"2019","journal-title":"Comput. Math. Organ. Theory"},{"key":"ref_22","unstructured":"Khan, S.A., Anwar, W., Bajwa, U.I., and Wang, X. (2012, January 8\u201315). A Light Weight Stemmer for Urdu Language: A Scarce Resourced Language. Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing, Mumbai, India."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"105749","DOI":"10.1016\/j.dib.2020.105749","article-title":"Cursive-Text: A Comprehensive Dataset for End-to-End Urdu Text Recognition in Natural Scene Images","volume":"31","author":"Chandio","year":"2020","journal-title":"Data Brief"},{"key":"ref_24","unstructured":"Nasim, Z., and Haider, S. (2020). Cluster analysis of urdu tweets. J. King Saud Univ. Comput. Inf. Sci., in press."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"102383","DOI":"10.1016\/j.ipm.2020.102383","article-title":"Extractive Text Summarization Models for Urdu Language","volume":"57","author":"Nawaz","year":"2020","journal-title":"Inf. Process. Manag."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"113001","DOI":"10.1016\/j.eswa.2019.113001","article-title":"Website categorization: A formal approach and robustness analysis in the case of e-commerce detection","volume":"142","author":"Bruni","year":"2020","journal-title":"Expert Syst. Appl."},{"key":"ref_27","first-page":"77","article-title":"Finding Topics in Urdu: A Study of Applicability of Document Clustering in Urdu Language","volume":"23","author":"Ehsan","year":"2018","journal-title":"Pak. J. Eng. Appl. Sci."},{"key":"ref_28","unstructured":"Allahyari, M., Pouriyeh, S.A., Assefi, M., Safaei, S., Trippe, E.D., Gutierrez, J.B., and Kochut, K. (2017). A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques. arXiv."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Aggarwal, C.C., and Zhai, C. (2012). A Survey of Text Clustering Algorithms. Mining Text Data, Springer US.","DOI":"10.1007\/978-1-4614-3223-4"},{"key":"ref_30","first-page":"993","article-title":"Latent Dirichlet Allocation","volume":"3","author":"Blei","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_31","unstructured":"Paatero, P., and Tapper, U. (1992, January 17\u201321). Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Proceedings of the Fourth International Conference on Statistical Methods for the Environmental Sciences, Espoo, Finland."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"788","DOI":"10.1038\/44565","article-title":"Learning the parts of objects by non-negative matrix factorization","volume":"401","author":"Lee","year":"1999","journal-title":"Nature"},{"key":"ref_33","first-page":"279","article-title":"Orthographic Diacritics and Multilingual Computing","volume":"47","author":"Wells","year":"2001","journal-title":"Proc. Lang. Probl. Lang. Plan."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"5228","DOI":"10.1073\/pnas.0307752101","article-title":"Finding scientific topics","volume":"101","author":"Griffiths","year":"2004","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"178","DOI":"10.1007\/s10791-010-9141-9","article-title":"Investigating task performance of probabilistic topic models: An empirical study of PLSA and LDA","volume":"14","author":"Lu","year":"2011","journal-title":"Inf. Retr."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Larsen, B., and Aone, C. (1999, January 15\u201318). Fast and Effective Text Mining Using Linear-time Document Clustering. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD \u201999, San Diego, CA, USA.","DOI":"10.1145\/312129.312186"},{"key":"ref_37","unstructured":"Rijsbergen, C.J.V. (1979). Information Retrieval, Butterworth-Heinemann. [2nd ed.]."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"846","DOI":"10.1080\/01621459.1971.10482356","article-title":"Objective Criteria for the Evaluation of Clustering Methods","volume":"66","author":"Rand","year":"1971","journal-title":"J. Am. Stat. Assoc."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/11\/11\/518\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T10:29:35Z","timestamp":1760178575000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/11\/11\/518"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,11,5]]},"references-count":38,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2020,11]]}},"alternative-id":["info11110518"],"URL":"https:\/\/doi.org\/10.3390\/info11110518","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,11,5]]}}}