{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,10]],"date-time":"2026-01-10T18:53:56Z","timestamp":1768071236758,"version":"3.49.0"},"reference-count":49,"publisher":"Association for Computing Machinery (ACM)","issue":"1","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2026,1,31]]},"abstract":"<jats:p>Recent advances in machine learning and deep learning have demonstrated the applicability and utility of cross-lingual, transfer learning methods in low and zero-resource scenarios. We explore the applicability of transfer learning methods from pre-trained models in zero-shot and few-shot scenarios for part-of-speech tagging. We report the results of an ablation study to understand the impact of training data size in low-resource languages on the system\u2019s performance. Since building or augmenting datasets for low-resource languages is tricky, costly and a lot of time not feasible, the study provides valuable insights into the expected relative data requirements for both the high-resource language (the source language for transfer) and the low-resource language and the kind of performance boost one could expect when one is planning to use transfer learning for low-resource languages. The study is conducted with Hindi as the high-resource language and the three related languages\u2014Magahi, Bhojpuri, and Braj\u2014as extremely low-resource languages. Overall, the study addresses four broad research questions: (a) How much data in the low-resource as well as high-resource language is \u201csufficient\u201d for attaining optimum performance in a downstream task like part-of-speech annotation, and is there any specific advantage for low-resource language if we use multilingual data during fine-tuning? 
(b) Do different multilingual pre-trained models, specifically multilingual-BERT, multilingual-DistilBERT, XLM-RoBERTa, and MuRIL, offer any significant advantage in terms of dataset requirements for attaining an optimum performance in Indian languages? (c) In the case of multiple closely-related low-resource languages, does distributing the dataset across multiple languages result in a performance comparable to that of a system trained on a single language? (d) What is the impact of the typological similarity of the languages on the dataset requirement for successful transfer learning?<\/jats:p>","DOI":"10.1145\/3783981","type":"journal-article","created":{"date-parts":[[2025,12,12]],"date-time":"2025-12-12T10:56:12Z","timestamp":1765536972000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["How Much Data in Low-resource Indian Languages is \"Sufficient' for Transfer Learning: A Comparative Study for POS Annotation"],"prefix":"10.1145","volume":"25","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2080-9910","authenticated-orcid":false,"given":"Mohit","family":"Raj","sequence":"first","affiliation":[{"name":"Linguistics, Dr Bhim Rao Ambedkar University","place":["Agra, India"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5151-2546","authenticated-orcid":false,"given":"Ritesh","family":"Kumar","sequence":"additional","affiliation":[{"name":"UnReaL-TecE LLP","place":["Agra, India"]},{"name":"Council for Diversity and Innovation","place":["Agra, India"]}]}],"member":"320","published-online":{"date-parts":[[2026,1,10]]},"reference":[{"issue":"11","key":"e_1_3_3_2_2","article-title":"A framework for learning predictive structures from multiple tasks and unlabeled data.","volume":"6","author":"Ando Rie Kubota","year":"2005","unstructured":"Rie Kubota Ando, Tong Zhang, and Peter Bartlett. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. 
Journal of Machine Learning Research 6, 11 (2005).","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_3_3_2","doi-asserted-by":"crossref","first-page":"659","DOI":"10.1007\/978-94-024-0881-2_24","volume-title":"Handbook of Linguistic Annotation","author":"Bhat Riyaz Ahmad","year":"2017","unstructured":"Riyaz Ahmad Bhat, Rajesh Bhatt, Annahita Farudi, Prescott Klassen, Bhuvana Narasimhan, Martha Palmer, Owen Rambow, Dipti Misra Sharma, Ashwini Vaidya, Sri Ramagurumurthy Vishnu, et\u00a0al. 2017. The Hindi\/Urdu treebank project. In Handbook of Linguistic Annotation. Springer, 659\u2013697."},{"key":"e_1_3_3_4_2","volume-title":"The Origin and Development of the Bengali Language; Part 2","author":"Chatterji Suniti Kumar","year":"1926","unstructured":"Suniti Kumar Chatterji. 1926. The Origin and Development of the Bengali Language; Part 2. Calcutta University Press."},{"key":"e_1_3_3_5_2","first-page":"4543","volume-title":"Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC\u201916)","author":"Cieri Christopher","year":"2016","unstructured":"Christopher Cieri, Mike Maxwell, Stephanie Strassel, and Jennifer Tracey. 2016. Selection criteria for low resource language programs. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC\u201916). 4543\u20134549."},{"key":"e_1_3_3_6_2","first-page":"2493","article-title":"Natural language processing (almost) from scratch","volume":"12","author":"Collobert Ronan","year":"2011","unstructured":"Ronan Collobert, Jason Weston, L\u00e9on Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. 
Journal of Machine Learning Research 12, ARTICLE (2011), 2493\u20132537.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_3_7_2","unstructured":"Alexis Conneau Kartikay Khandelwal Naman Goyal Vishrav Chaudhary Guillaume Wenzek Francisco Guzm\u00e1n Edouard Grave Myle Ott Luke Zettlemoyer and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. arxiv:1911.02116 [cs.CL]. Retrieved from https:\/\/arxiv.org\/abs\/1911.02116"},{"key":"e_1_3_3_8_2","first-page":"95","article-title":"Sarnami: A living language","author":"Damsteegt Theo","year":"1988","unstructured":"Theo Damsteegt. 1988. Sarnami: A living language. In Language Transplanted: The Development of Overseas Hindi . 95\u2013119.","journal-title":"Language Transplanted: The Development of Overseas Hindi"},{"key":"e_1_3_3_9_2","first-page":"1","volume-title":"Proceedings of the 1st International Conference on Human Language Technology Research","author":"David Yarowsky","year":"2001","unstructured":"Yarowsky David, Ngai Grace, Wicentowski Richard, et\u00a0al. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the 1st International Conference on Human Language Technology Research. 1\u20138."},{"key":"e_1_3_3_10_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https:\/\/arxiv.org\/abs\/1810.04805"},{"key":"e_1_3_3_11_2","volume-title":"The East Indian Speech Community In Guyana: A Sociolinguistic Study With Special Reference To Koine Formation.","author":"Gambhir Surendra Kumar","year":"1981","unstructured":"Surendra Kumar Gambhir. 1981. The East Indian Speech Community In Guyana: A Sociolinguistic Study With Special Reference To Koine Formation.Ph. D. 
Dissertation."},{"key":"e_1_3_3_12_2","volume-title":"Linguistic Survey of India, Vol-III","author":"Grierson George Abraham","year":"1903","unstructured":"George Abraham Grierson. 1903. Linguistic Survey of India, Vol-III. Office of the Superintendent, Government of PRI, Calcutta."},{"key":"e_1_3_3_13_2","volume-title":"1928, Linguistic Survey of India","author":"Grierson George Abraham","year":"1903","unstructured":"George Abraham Grierson and Sten Konow. 1903. 1928, Linguistic Survey of India."},{"key":"e_1_3_3_14_2","volume-title":"Proceedings of the Universal Dependencies Workshop 2019","author":"Heinecke Johannes","year":"2019","unstructured":"Johannes Heinecke. 2019. ConlluEditor: A fully graphical editor for Universal Dependencies treebank files. Retrieved from https:\/\/syntaxfest.github.io\/syntaxfest19\/proceedings\/papers\/paper_55.pdf. In Proceedings of the Universal Dependencies Workshop 2019. Retrieved from https:\/\/github.com\/Orange-OpenSource\/conllueditor\/"},{"key":"e_1_3_3_15_2","doi-asserted-by":"crossref","DOI":"10.4324\/9780203945315","volume-title":"The Indo-Aryan Languages","author":"Jain Danesh","year":"2007","unstructured":"Danesh Jain and George Cardona. 2007. The Indo-Aryan Languages. Routledge."},{"issue":"3","key":"e_1_3_3_16_2","doi-asserted-by":"crossref","first-page":"215","DOI":"10.1163\/000000076790079708","article-title":"The position of the Bih\u0101r\u012b dialects in Indo-Aryan","volume":"18","author":"Jeffers Robert J.","year":"1976","unstructured":"Robert J. Jeffers. 1976. The position of the Bih\u0101r\u012b dialects in Indo-Aryan. 
Indo-Iranian Journal 18, 3\/4 (1976), 215\u2013225.","journal-title":"Indo-Iranian Journal"},{"key":"e_1_3_3_17_2","first-page":"182","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers)","author":"Johnson Andrew","year":"2019","unstructured":"Andrew Johnson, Penny Karanasou, Judith Gaspers, and Dietrich Klakow. 2019. Cross-lingual transfer learning for Japanese named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers). 182\u2013189."},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.560"},{"key":"e_1_3_3_19_2","volume-title":"Problems of Reconstruction in Indo-Aryan","author":"Katre Sumitra Mangesh","year":"1968","unstructured":"Sumitra Mangesh Katre. 1968. Problems of Reconstruction in Indo-Aryan. Indian Institute of Advanced Study, Simla."},{"key":"e_1_3_3_20_2","unstructured":"Simran Khanuja Diksha Bansal Sarvesh Mehtani Savya Khosla Atreyee Dey Balaji Gopalan Dilip Kumar Margam Pooja Aggarwal Rajiv Teja Nagipogu Shachi Dave et\u00a0al. 2021. Muril: Multilingual representations for Indian languages. arXiv:2103.10730. Retrieved from https:\/\/arxiv.org\/abs\/2103.10730"},{"key":"e_1_3_3_21_2","unstructured":"Ritesh Kumar and Girish Nath Jha. 2010. Magahi verb analyser and generator."},{"key":"e_1_3_3_22_2","first-page":"491","volume-title":"Proceedings of the Language and Technology Conference","author":"Kumar Ritesh","year":"2011","unstructured":"Ritesh Kumar, Bornini Lahiri, and Deepak Alok. 2011. Developing LRs for non-scheduled Indian languages. In Proceedings of the Language and Technology Conference. 
Springer, 491\u2013501."},{"key":"e_1_3_3_23_2","first-page":"105","volume-title":"Proceedings of the 10th Workshop on Asian Language Resources","author":"Kumar Ritesh","year":"2012","unstructured":"Ritesh Kumar, Bornini Lahiri, and Deepak Alok. 2012. Developing a POS tagger for Magahi: A comparative study. In Proceedings of the 10th Workshop on Asian Language Resources. 105\u2013114."},{"key":"e_1_3_3_24_2","doi-asserted-by":"crossref","first-page":"491","DOI":"10.1007\/978-3-319-08958-4_40","volume-title":"Human Language Technology Challenges for Computer Science and Linguistics: 5th Language and Technology Conference, LTC 2011, Revised Selected Papers 5","author":"Kumar Ritesh","year":"2014","unstructured":"Ritesh Kumar, Bornini Lahiri, and Deepak Alok. 2014. Developing LRs for non-scheduled Indian languages: A case of Magahi. In Human Language Technology Challenges for Computer Science and Linguistics: 5th Language and Technology Conference, LTC 2011, Revised Selected Papers 5. Springer, 491\u2013501."},{"key":"e_1_3_3_25_2","unstructured":"Ritesh Kumar Bornini Lahiri Deepak Alok Atul Kr Ojha Mayank Jain Abdul Basit and Yogesh Dawer. 2018. Automatic identification of closely-related Indian languages: Resources and experiments. arXiv:1803.09405. Retrieved from https:\/\/arxiv.org\/abs\/1803.09405"},{"key":"e_1_3_3_26_2","doi-asserted-by":"crossref","unstructured":"Ritesh Kumar Siddharth Singh Shyam Ratan Mohit Raj Sonal Sinha Vivek Seshadri Kalika Bali Atul Kr Ojha et\u00a0al. 2022. Annotated speech corpus for low resource Indian languages: Awadhi Bhojpuri Braj and Magahi. arXiv:2206.12931. Retrieved from https:\/\/arxiv.org\/abs\/2206.12931","DOI":"10.21437\/S4SG.2022-1"},{"key":"e_1_3_3_27_2","unstructured":"Chia-Hsuan Lee and Hung-Yi Lee. 2019. Cross-lingual transfer learning for question answering. arXiv:1907.06042. 
Retrieved from https:\/\/arxiv.org\/abs\/1907.06042"},{"key":"e_1_3_3_28_2","unstructured":"Alexandre Magueresse Vincent Carles and Evan Heetderks. 2020. Low-resource languages: A review of past work and future challenges. arXiv:2006.07264. Retrieved from https:\/\/arxiv.org\/abs\/2006.07264"},{"key":"e_1_3_3_29_2","volume-title":"Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018)","author":"Dawer Nandini Chauhan Mayank Jain, Yogesh","year":"2018","unstructured":"Nandini Chauhan Mayank Jain, Yogesh Dawer and Anjali Gupta. 2018. Developing resources for a less resourced language: Braj bhasha. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018) (Miyazaki, Japan, 7-12). Girish Nath Jha, Kalika Bali, Sobha L, and Atul Kr. Ojha (Eds.), European Language Resources Association (ELRA), Paris, France."},{"key":"e_1_3_3_30_2","volume-title":"Trinidad Bhojpuri: A Morphological Study.","author":"Mohan Peggy Ramesar","year":"1978","unstructured":"Peggy Ramesar Mohan. 1978. Trinidad Bhojpuri: A Morphological Study.Ph. D. Dissertation."},{"key":"e_1_3_3_31_2","unstructured":"Rajesh Kumar Mundotiya Shantanu Kumar Umesh Chandra Chaudhary Supriya Chauhan Swasti Mishra Praveen Gatla Anil Kumar Singh et\u00a0al. 2020. Development of a dataset and a deep learning baseline named entity recognizer for three low resource languages: Bhojpuri Maithili and Magahi. arXiv:2009.06451. Retrieved from https:\/\/arxiv.org\/abs\/2009.06451"},{"key":"e_1_3_3_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458250"},{"key":"e_1_3_3_33_2","volume-title":"Research Paper Presented in 4th International Endangered and Lesser-known Languages Conference (ELKL-4)","author":"Ojha A. K.","year":"2016","unstructured":"A. K. Ojha. 2016. Developing a machine readable multilingual dictionary for Bhojpuri-Hindi-English. 
In Research Paper Presented in 4th International Endangered and Lesser-known Languages Conference (ELKL-4)."},{"key":"e_1_3_3_34_2","unstructured":"Atul Kr Ojha. 2019. English-Bhojpuri SMT System: Insights from the Karaka model. arXiv:1905.02239. Retrieved from https:\/\/arxiv.org\/abs\/1905.02239"},{"key":"e_1_3_3_35_2","first-page":"524","volume-title":"Proceedings of the 7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics","author":"Ojha Atul Ku","year":"2015","unstructured":"Atul Ku Ojha, Pitambar Behera, Srishti Singh, and Girish N. Jha. 2015. Training & evaluation of POS taggers in Indo-Aryan languages: A case of Hindi, Odia and Bhojpuri. In Proceedings of the 7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics. 524\u2013529."},{"key":"e_1_3_3_36_2","first-page":"33","volume-title":"Proceedings of the WILDRE5\u20135th Workshop on Indian language Data: Resources and Evaluation","author":"Ojha Atul Kr","year":"2020","unstructured":"Atul Kr Ojha and Daniel Zeman. 2020. Universal dependency treebanks for low-resource Indian languages: The case of Bhojpuri. In Proceedings of the WILDRE5\u20135th Workshop on Indian language Data: Resources and Evaluation. 33\u201338."},{"key":"e_1_3_3_37_2","doi-asserted-by":"crossref","unstructured":"Nanyun Peng and Mark Dredze. 2016. Improving named entity recognition for Chinese social media with word segmentation representation learning. arXiv:1603.00786. Retrieved from https:\/\/arxiv.org\/abs\/1603.00786","DOI":"10.18653\/v1\/P16-2025"},{"key":"e_1_3_3_38_2","unstructured":"Mohit Raj Shyam Ratan Deepak Alok Ritesh Kumar and Atul Kr Ojha. 2022. Developing universal dependency treebanks for Magahi and Braj. arXiv:2204.12633. Retrieved from https:\/\/arxiv.org\/abs\/2204.12633"},{"key":"e_1_3_3_39_2","unstructured":"Priya Rani Atul Kr Ojha and Girish Nath Jha. 2018. 
Automatic language identification system for Hindi and Magahi. arXiv:1804.05095. Retrieved from https:\/\/arxiv.org\/abs\/1804.05095"},{"key":"e_1_3_3_40_2","volume-title":"Some Morphological and Syntactic Features of Mauritian Bhojpuri","author":"Ranjan Rakesh","year":"1997","unstructured":"Rakesh Ranjan. 1997. Some Morphological and Syntactic Features of Mauritian Bhojpuri. Ph. D. Dissertation. University of Delhi, Department of Linguistics."},{"key":"e_1_3_3_41_2","unstructured":"Victor Sanh Lysandre Debut Julien Chaumond and Thomas Wolf. 2019. DistilBERT a distilled version of BERT: Smaller faster cheaper and lighter. arXiv:1910.01108. Retrieved from https:\/\/arxiv.org\/abs\/1910.01108"},{"key":"e_1_3_3_42_2","volume-title":"Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages","author":"Singh Anil Kumar","year":"2008","unstructured":"Anil Kumar Singh. 2008. Natural language processing for less privileged languages: Where do we come from? Where are we going?. In Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages."},{"key":"e_1_3_3_43_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00521-022-07337-8"},{"key":"e_1_3_3_44_2","volume-title":"Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA)","author":"Sinha Shagun","year":"2018","unstructured":"Shagun Sinha and Girish Nath Jha. 2018. Issues in conversational Sanskrit to Bhojpuri MT. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA)."},{"key":"e_1_3_3_45_2","volume-title":"The Origin and Development of Bhojpuri","author":"Tiwari Udai Narain","year":"1960","unstructured":"Udai Narain Tiwari. 1960. The Origin and Development of Bhojpuri. 
The Asiatic Society."},{"key":"e_1_3_3_46_2","article-title":"Opportunities and challenges in working with low-resource languages","author":"Tsvetkov Yulia","year":"2017","unstructured":"Yulia Tsvetkov. 2017. Opportunities and challenges in working with low-resource languages. Slides Part-1 (2017).","journal-title":"Slides Part-1"},{"key":"e_1_3_3_47_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00165"},{"key":"e_1_3_3_48_2","first-page":"64","volume-title":"Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages","author":"Yadav Saumitra","year":"2019","unstructured":"Saumitra Yadav, Vandan Mujadia, and Manish Shrivastava. 2019. A3-108 machine translation system for LoResMT 2019. In Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages. 64\u201367."},{"key":"e_1_3_3_49_2","first-page":"390","volume-title":"Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)","author":"Zirikly Ayah","year":"2015","unstructured":"Ayah Zirikly and Masato Hagiwara. 2015. Cross-lingual transfer of named entity recognizers without parallel corpora. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 390\u2013396."},{"key":"e_1_3_3_50_2","doi-asserted-by":"crossref","unstructured":"Barret Zoph Deniz Yuret Jonathan May and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. arXiv:1604.02201. 
Retrieved from https:\/\/arxiv.org\/abs\/1604.02201","DOI":"10.18653\/v1\/D16-1163"}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3783981","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,10]],"date-time":"2026-01-10T15:56:50Z","timestamp":1768060610000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3783981"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,10]]},"references-count":49,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,1,31]]}},"alternative-id":["10.1145\/3783981"],"URL":"https:\/\/doi.org\/10.1145\/3783981","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"value":"2375-4699","type":"print"},{"value":"2375-4702","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,10]]},"assertion":[{"value":"2023-06-14","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-23","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-01-10","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}