{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,18]],"date-time":"2026-05-18T16:44:58Z","timestamp":1779122698385,"version":"3.51.4"},"reference-count":80,"publisher":"Cambridge University Press (CUP)","issue":"4","license":[{"start":{"date-parts":[[2022,1,19]],"date-time":"2022-01-19T00:00:00Z","timestamp":1642550400000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2023,7]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>This study describes a Natural Language Processing (NLP) toolkit, as the first contribution of a larger project, for an under-resourced language\u2014Urdu. In previous studies, standard NLP toolkits have been developed for English and many other languages. There is also a dire need for standard text processing tools and methods for Urdu, despite it being widely spoken in different parts of the world with a large amount of digital text being readily available. This study presents the first version of the <jats:italic>UNLT (Urdu Natural Language Toolkit)<\/jats:italic> which contains three key text processing tools required for an Urdu NLP pipeline; word tokenizer, sentence tokenizer, and part-of-speech (POS) tagger. The UNLT word tokenizer employs a morpheme matching algorithm coupled with a state-of-the-art stochastic <jats:italic>n<\/jats:italic>-gram language model with back-off and smoothing characteristics for the space omission problem. The space insertion problem for compound words is tackled using a dictionary look-up technique. The UNLT sentence tokenizer is a combination of various machine learning, rule-based, regular-expressions, and dictionary look-up techniques. 
Finally, the UNLT POS taggers are based on Hidden Markov Model and Maximum Entropy-based stochastic techniques. In addition, we have developed large gold standard training and testing data sets to improve and evaluate the performance of new techniques for Urdu word tokenization, sentence tokenization, and POS tagging. For comparison purposes, we have compared the proposed approaches with several methods. Our proposed UNLT, the training and testing data sets, and supporting resources are all free and publicly available for academic use.<\/jats:p>","DOI":"10.1017\/s1351324921000425","type":"journal-article","created":{"date-parts":[[2022,1,19]],"date-time":"2022-01-19T06:36:26Z","timestamp":1642574186000},"page":"942-977","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":21,"title":["UNLT: Urdu Natural Language Toolkit"],"prefix":"10.1017","volume":"29","author":[{"given":"Jawad","family":"Shafi","sequence":"first","affiliation":[]},{"given":"Hafiz Rizwan","family":"Iqbal","sequence":"additional","affiliation":[]},{"given":"Rao Muhammad Adeel","family":"Nawab","sequence":"additional","affiliation":[]},{"given":"Paul","family":"Rayson","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2022,1,19]]},"reference":[{"key":"S1351324921000425_ref21","doi-asserted-by":"publisher","DOI":"10.3115\/1067807.1067821"},{"key":"S1351324921000425_ref8","unstructured":"Bhat, R.A. and Sharma, D.M. (2012). A dependency treebank of Urdu and its evaluation. In Proceedings of the Sixth Linguistic Annotation Workshop (LAW VI\u201912), Jeju, Republic of Korea, pp. 157\u2013165."},{"key":"S1351324921000425_ref45","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1007"},{"key":"S1351324921000425_ref33","unstructured":"Hardie, A . (2003). Developing a tagset for automated part-of-speech tagging in Urdu. In Archer D., Rayson P., Wilson A. and McEnery T. 
(eds), Proceedings of the Corpus Linguistics 2003 Conference. UCREL Technical Papers, Lancaster, UK, vol. 16, pp. 298\u2013307."},{"key":"S1351324921000425_ref80","doi-asserted-by":"publisher","DOI":"10.1109\/ICITBS.2015.26"},{"key":"S1351324921000425_ref36","doi-asserted-by":"publisher","DOI":"10.1109\/5254.708428"},{"key":"S1351324921000425_ref20","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pcbi.1002854"},{"key":"S1351324921000425_ref59","doi-asserted-by":"publisher","DOI":"10.1109\/IALP.2012.11"},{"key":"S1351324921000425_ref65","unstructured":"Riaz, K. (2010). Rule-based named entity recognition in Urdu. In Proceedings of the 2010 Named Entities WorkShop (NEWS\u201910), Uppsala, Sweden, pp. 126\u2013135."},{"key":"S1351324921000425_ref2","unstructured":"Ahmed, T. , Urooj, S. , Hussain, S. , Mustafa, A. , Parveen, R. , Adeeba, F. , Hautli, A. and Butt, M. (2014). The CLE Urdu POS tagset. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC\u201914), Reykjavik, Iceland, pp. 2920\u20132925."},{"key":"S1351324921000425_ref4","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-16248-0_52"},{"key":"S1351324921000425_ref13","unstructured":"Butt, J.M. (1995). The Structure of Complex Predicates in Urdu. PhD Thesis, Center for the Study of Language (CSLI), department of linguistics, Stanford University."},{"key":"S1351324921000425_ref47","unstructured":"Lehal, G.S . (2010). A word segmentation system for handling space omission problem in Urdu script. In Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), the 23rd International Conference on Computational Linguistics, Beijing, China, pp. 
43\u201350."},{"key":"S1351324921000425_ref27","first-page":"137","article-title":"Automatic stochastic tagging of natural language texts","volume":"21","author":"Evangelos","year":"1995","journal-title":"Computational Linguistics"},{"key":"S1351324921000425_ref7","unstructured":"Azimizadeh, A. , Arab, M.M. and Quchani, S.R. (2008). Persian part of speech tagger based on Hidden Markov Model. In Proceedings of the 9th International Conference on the Statistical Analysis of Textual Data (JADT\u201908), Lyon, France, pp. 121\u2013128."},{"key":"S1351324921000425_ref5","doi-asserted-by":"publisher","DOI":"10.1109\/ICMLC.2007.4370739"},{"key":"S1351324921000425_ref28","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324904003523"},{"key":"S1351324921000425_ref12","doi-asserted-by":"publisher","DOI":"10.3115\/974147.974178"},{"key":"S1351324921000425_ref17","doi-asserted-by":"publisher","DOI":"10.1037\/h0026256"},{"key":"S1351324921000425_ref43","doi-asserted-by":"publisher","DOI":"10.1186\/1472-6947-15-S2-S4"},{"key":"S1351324921000425_ref50","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-5010"},{"key":"S1351324921000425_ref57","doi-asserted-by":"publisher","DOI":"10.1109\/5.18626"},{"key":"S1351324921000425_ref40","doi-asserted-by":"publisher","DOI":"10.5121\/csit.2013.3639"},{"key":"S1351324921000425_ref46","doi-asserted-by":"publisher","DOI":"10.1002\/9781119282105.ch8"},{"key":"S1351324921000425_ref22","unstructured":"Dandapat, S. (2008). Part of speech tagging and chunking with maximum entropy model. In Proceedings of the IJCAI Workshop on Shallow Parsing for South Asian Languages (IJCAI\u201908), Hyderabad, India, pp. 
29\u201332."},{"key":"S1351324921000425_ref54","first-page":"437","article-title":"Urdu part of speech tagging using transformation based error driven learning","volume":"16","author":"Naz","year":"2012","journal-title":"World Applied Sciences Journal (WASJ)"},{"key":"S1351324921000425_ref55","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.eacl-demos.10"},{"key":"S1351324921000425_ref58","first-page":"395","article-title":"An artificial neural network approach for sentence boundary disambiguation in Urdu","volume":"12","author":"Raj","year":"2015","journal-title":"The International Arab Journal of Information Technology"},{"key":"S1351324921000425_ref52","doi-asserted-by":"publisher","DOI":"10.3115\/1690299.1690303"},{"key":"S1351324921000425_ref68","volume-title":"Unpublished MS Thesis","author":"Sajjad","year":"2007"},{"key":"S1351324921000425_ref73","unstructured":"Shafi, J. (2020). An Urdu Semantic Tagger\u2013Lexicons, Corpora, Methods, and Tools. PhD Thesis, Lancaster University, UK."},{"key":"S1351324921000425_ref76","doi-asserted-by":"publisher","DOI":"10.3115\/1034678.1034712"},{"key":"S1351324921000425_ref77","unstructured":"Vaswani, A. , Shazeer, N. , Parmar, N. , Uszkoreit, J. , Jones, L. , Gomez, A.N. , Kaiser, L. and Polosukhin, I. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762 1, 1\u20135."},{"key":"S1351324921000425_ref79","unstructured":"Wicaksono, A.F. and Purwarianti, A. (2010). HMM based part-of-speech tagger for Bahasa Indonesia. In Proceedings of the Fourth International MALINDO Workshop, Jakarta, Indonesia, pp. 1\u20137."},{"key":"S1351324921000425_ref16","unstructured":"Christer, S. (1996). Handling sparse data by successive abstraction. In Proceedings of the 16th International Conference on Computational Linguistics (COLING\u201996), Copenhagen, Denmark, pp. 
895\u2013900."},{"key":"S1351324921000425_ref51","doi-asserted-by":"publisher","DOI":"10.1145\/2786451.2786500"},{"key":"S1351324921000425_ref56","unstructured":"Platts, J.T. (1909). A Grammar of the Hindustani or Urdu Language. Crosby Lockwood and Son, London, republished in 2002 by Sang-e-Meel Publications, Lahore."},{"key":"S1351324921000425_ref62","first-page":"250","article-title":"A hybrid approach for Urdu sentence boundary disambiguation","volume":"9","author":"Rehman","year":"2012","journal-title":"The International Arab Journal of Information Technology"},{"key":"S1351324921000425_ref37","volume-title":"Nai Urdu Qawaid","author":"Javed","year":"1985"},{"key":"S1351324921000425_ref69","doi-asserted-by":"publisher","DOI":"10.3115\/1609067.1609144"},{"key":"S1351324921000425_ref66","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1044"},{"key":"S1351324921000425_ref72","volume-title":"Urdu, an Essential Grammar (Routledge Essential Grammars)","volume":"1","author":"Schmidt","year":"1999"},{"key":"S1351324921000425_ref63","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0068178"},{"key":"S1351324921000425_ref26","first-page":"67","article-title":"Maximum entropy based Bengali part of speech tagging","volume":"33","author":"Ekbal","year":"2008","journal-title":"Advances in Natural Language Processing and Applications, Research in Computing Science (RCS) Journal"},{"key":"S1351324921000425_ref75","unstructured":"Tafseer, A. (2009). Roman to Urdu transliteration using wordlist. In Proceedings of the Conference on Language and Technology (CLT\u201909), Lahore, Pakistan, pp. 1\u20138."},{"key":"S1351324921000425_ref10","doi-asserted-by":"publisher","DOI":"10.3115\/1627306.1627317"},{"key":"S1351324921000425_ref6","doi-asserted-by":"publisher","DOI":"10.3923\/itj.2007.1190.1198"},{"key":"S1351324921000425_ref14","doi-asserted-by":"publisher","DOI":"10.1006\/csla.1999.0128"},{"key":"S1351324921000425_ref38","unstructured":"Jawaid, B. , Kamran, A. 
and Bojar, O. (2014). A tagged corpus and a tagger for Urdu. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC\u201914), Reykjav\u00edk, Iceland, pp. 2938\u20132943."},{"key":"S1351324921000425_ref64","first-page":"92","article-title":"Comparison of Hindi and Urdu in computational context","volume":"1","author":"Riaz","year":"2012","journal-title":"International Journal of Computational Linguistics and Natural Language Processing (IJCLNLP)"},{"key":"S1351324921000425_ref71","doi-asserted-by":"publisher","DOI":"10.3115\/1599081.1599179"},{"key":"S1351324921000425_ref29","first-page":"2282","article-title":"Chinese word segmentation as morpheme-based lexical chunking","volume":"178","author":"Fu","year":"2008","journal-title":"Information Sciences"},{"key":"S1351324921000425_ref19","unstructured":"Cunningham, H. , Maynard, D. , Bontcheva, K. and Tablan, V. (2002). GATE: a framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, USA, pp. 168\u2013175."},{"key":"S1351324921000425_ref3","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2006-333"},{"key":"S1351324921000425_ref60","unstructured":"Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP\u201996), New Jersey, USA, vol. 1, pp. 133\u2013142."},{"key":"S1351324921000425_ref25","unstructured":"Durrani, N. and Hussain, S. (2010). Urdu word segmentation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, Los Angeles, California, USA, pp. 528\u2013536."},{"key":"S1351324921000425_ref30","doi-asserted-by":"publisher","DOI":"10.4324\/9781315841366-13"},{"key":"S1351324921000425_ref61","unstructured":"Rehman, Z. , Anwar, W. and Bajwa, U.I. (2011). 
Challenges in Urdu text tokenization and sentence boundary disambiguation. In Proceedings of the 2nd Workshop on South and Southeast Asian Natural Language Processing (WSSANLP\u201911), Chiang Mai, Thailand, pp. 40\u201345."},{"key":"S1351324921000425_ref70","unstructured":"Schmid, H. (1994b). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing (NeMLaP), Manchester, UK, vol. 12, pp. 44\u201349."},{"key":"S1351324921000425_ref31","unstructured":"Gim\u00e9nez, J. and Marquez, L. (2004). SVMTool: a general POS tagger generator based on Support Vector Machines. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC\u201904), Lisbon, Portugal, pp. 43\u201346."},{"key":"S1351324921000425_ref32","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9781139013734.015"},{"key":"S1351324921000425_ref39","doi-asserted-by":"crossref","DOI":"10.1093\/oso\/9780198503682.001.0001","volume-title":"The Theory of Probability","author":"Jeffreys","year":"1998"},{"key":"S1351324921000425_ref44","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-2012"},{"key":"S1351324921000425_ref1","doi-asserted-by":"publisher","DOI":"10.2316\/P.2012.771-009"},{"key":"S1351324921000425_ref34","unstructured":"Hardie, A. (2004). The Computational Analysis of Morphosyntactic Categories in Urdu. PhD Thesis, Lancaster University, UK."},{"key":"S1351324921000425_ref15","unstructured":"Christensen, H. (2014). HC corpora. Available at http:\/\/www.corpora.heliohost.org\/ (accessed 5 March 2017)."},{"key":"S1351324921000425_ref74","doi-asserted-by":"publisher","DOI":"10.1007\/s10579-016-9367-2"},{"key":"S1351324921000425_ref35","unstructured":"Hautli, A. and Sulger, S. (2011). Extracting and classifying Urdu multiword expressions. In Proceedings of the ACL-HLT Student Session (ACL-HLT\u201911), Portland, OR, USA, pp. 
24\u201329."},{"key":"S1351324921000425_ref11","unstructured":"B\u00f6gel, T. , Butt, M. , Hautli, A. and Sulger, S. (2007). Developing a finite-state morphological analyzer for Urdu and Hindi. In Proceedings of the Finite-State Methods and Natural Language Processing: 6th International Workshop (FSMNLP\u201907), Potsdam, Germany, pp. 86\u201396."},{"key":"S1351324921000425_ref49","volume-title":"Foundations of Statistical Natural Language Processing","volume":"999","author":"Manning","year":"1999"},{"key":"S1351324921000425_ref41","volume-title":"Speech and Language Processing","volume":"3","author":"Jurafsky","year":"2014"},{"key":"S1351324921000425_ref42","unstructured":"Khan, S.A. , Anwar, W. , Bajwa, U.I. and Wang, X. (2012). A light weight stemmer for Urdu language: a scarce resourced language. In Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing (SANLP-COLING\u201912), Mumbai, India, pp. 69\u201378."},{"key":"S1351324921000425_ref53","doi-asserted-by":"publisher","DOI":"10.1145\/1838751.1838754"},{"key":"S1351324921000425_ref24","unstructured":"Dietzel, A. and Maynard, D. (2015). Climate change: a chance for political re-engagement? In Proceedings of the Political Studies Association 65th Annual International Conference (PSA\u201915), Sheffield, UK, pp. 1\u201319."},{"key":"S1351324921000425_ref48","unstructured":"Malik, A. (2009). A hybrid model for Urdu Hindi translation. In Proceedings of the Named Entities WorkShop (NEWS\u201909), Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP\u201909), Singapore, pp. 
177\u2013185."},{"key":"S1351324921000425_ref67","first-page":"1","article-title":"A word sense disambiguation corpus for Urdu","volume":"1","author":"Saeed","year":"2018","journal-title":"Language Resources and Evaluation (LRE)"},{"key":"S1351324921000425_ref18","first-page":"1","volume":"1","author":"Conneau","year":"2020","journal-title":"Unsupervised cross-lingual representation learning at scale"},{"key":"S1351324921000425_ref9","volume-title":"Natural Language Processing with Python","author":"Bird","year":"2009"},{"key":"S1351324921000425_ref23","doi-asserted-by":"publisher","DOI":"10.1007\/s10462-016-9482-x"},{"key":"S1351324921000425_ref78","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.1967.1054010"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324921000425","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,15]],"date-time":"2023-11-15T23:43:35Z","timestamp":1700091815000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324921000425\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,1,19]]},"references-count":80,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,7]]}},"alternative-id":["S1351324921000425"],"URL":"https:\/\/doi.org\/10.1017\/s1351324921000425","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,1,19]]},"assertion":[{"value":"\u00a9 The Author(s), 2022. 
Published by Cambridge University Press","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}},{"value":"This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https:\/\/creativecommons.org\/licenses\/by\/4.0\/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.","name":"license","label":"License","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}},{"value":"This content has been made available to all.","name":"free","label":"Free to read"}]}}