{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,7]],"date-time":"2026-03-07T18:36:28Z","timestamp":1772908588362,"version":"3.50.1"},"reference-count":128,"publisher":"MIT Press","issue":"2","license":[{"start":{"date-parts":[[2024,1,19]],"date-time":"2024-01-19T00:00:00Z","timestamp":1705622400000},"content-version":"vor","delay-in-days":383,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,6,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>While most transliteration research is focused on single tokens such as named entities\u2014for example, transliteration of a city name from the Gujarati script to the Latin script as \u201cAhmedabad\u201d (the most populous city in the Indian state of Gujarat)\u2014the informal romanization prevalent in South Asia and elsewhere often requires transliteration of full sentences. The lack of large parallel text collections of full sentence (as opposed to single word) transliterations necessitates incorporation of contextual information into transliteration via non-parallel resources, such as via mono-script text collections. In this article, we present a number of methods for improving transliteration in context for such a use scenario. Some of these methods in fact improve performance without making use of sentential context, allowing for better quantification of the degree to which contextual information in particular is responsible for system improvements. Our final systems, which ultimately rely upon ensembles including large pretrained language models fine-tuned on simulated parallel data, yield substantial improvements over the best previously reported results for full sentence transliteration from Latin to native script on all 12 languages in the Dakshina dataset (Roark et al. 
2020), with an overall 3.3% absolute (18.6% relative) mean word-error rate reduction.<\/jats:p>","DOI":"10.1162\/coli_a_00510","type":"journal-article","created":{"date-parts":[[2024,1,19]],"date-time":"2024-01-19T19:51:07Z","timestamp":1705693867000},"page":"475-534","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":6,"title":["Context-aware Transliteration of Romanized South Asian Languages"],"prefix":"10.1162","volume":"50","author":[{"given":"Christo","family":"Kirov","sequence":"first","affiliation":[{"name":"Google Research. ckirov@google.com"}]},{"given":"Cibu","family":"Johny","sequence":"additional","affiliation":[{"name":"Google Research. cibu@google.com"}]},{"given":"Anna","family":"Katanova","sequence":"additional","affiliation":[{"name":"Google Research. akatanova@google.com"}]},{"given":"Alexander","family":"Gutkin","sequence":"additional","affiliation":[{"name":"Google Research. agutkin@google.com"}]},{"given":"Brian","family":"Roark","sequence":"additional","affiliation":[{"name":"Google Research. 
roark@google.com"}]}],"member":"281","published-online":{"date-parts":[[2023,6,1]]},"reference":[{"key":"2024070220223715800_bib1","doi-asserted-by":"publisher","first-page":"14466","DOI":"10.18653\/v1\/2023.acl-long.809","article-title":"Script normalization for unconventional writing of under-resourced languages in bilingual communities","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Ahmadi","year":"2023"},{"key":"2024070220223715800_bib2","doi-asserted-by":"publisher","first-page":"30","DOI":"10.3115\/v1\/W14-1604","article-title":"Automatic transliteration of romanized dialectal Arabic","volume-title":"Proceedings of the Eighteenth Conference on Computational Natural Language Learning","author":"Al-Badrashiny","year":"2014"},{"key":"2024070220223715800_bib3","doi-asserted-by":"publisher","first-page":"40","DOI":"10.3115\/1075096.1075102","article-title":"Generalized algorithms for constructing statistical language models","volume-title":"Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics","author":"Allauzen","year":"2003"},{"key":"2024070220223715800_bib4","doi-asserted-by":"publisher","first-page":"11","DOI":"10.1007\/978-3-540-76336-9_3","article-title":"OpenFst: A general and efficient weighted finite-state transducer library","volume-title":"Proceedings of 12th International Conference on Implementation and Application of Automata (CIAA)","author":"Allauzen","year":"2007"},{"key":"2024070220223715800_bib5","doi-asserted-by":"publisher","first-page":"2461","DOI":"10.18653\/v1\/2020.findings-emnlp.223","article-title":"On Romanization for model transfer between scripts in neural machine translation","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2020","author":"Amrhein","year":"2020"},{"key":"2024070220223715800_bib6","volume-title":"A Reference Grammar of the Tamil 
Language","author":"Andronov","year":"2004"},{"key":"2024070220223715800_bib7","article-title":"Neural machine translation by jointly learning to align and translate","author":"Bahdanau","year":"2014","journal-title":"arXiv preprint arXiv:1409.0473"},{"issue":"6","key":"2024070220223715800_bib8","doi-asserted-by":"publisher","first-page":"1554","DOI":"10.1214\/aoms\/1177699147","article-title":"Statistical inference for probabilistic functions of finite state Markov chains","volume":"37","author":"Baum","year":"1966","journal-title":"The Annals of Mathematical Statistics"},{"key":"2024070220223715800_bib9","doi-asserted-by":"publisher","first-page":"105","DOI":"10.21437\/ICSLP.2002-78","article-title":"Investigations on joint-multigram models for grapheme-to-phoneme conversion","volume-title":"Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP 2002)","author":"Bisani","year":"2002"},{"issue":"5","key":"2024070220223715800_bib10","doi-asserted-by":"publisher","first-page":"434","DOI":"10.1016\/j.specom.2008.01.002","article-title":"Joint-sequence models for grapheme-to-phoneme conversion","volume":"50","author":"Bisani","year":"2008","journal-title":"Speech Communication"},{"issue":"1","key":"2024070220223715800_bib11","doi-asserted-by":"publisher","first-page":"45","DOI":"10.1075\/wll.2.1.03bri","article-title":"A matter of typology: Alphasyllabaries and abugidas","volume":"2","author":"Bright","year":"1999","journal-title":"Written Language & Literacy"},{"key":"2024070220223715800_bib12","unstructured":"Celisse, Alain. 2008. Model Selection via Cross-validation in Density Estimation, Regression, and Change-points Detection. Ph.D. 
thesis, Facult\u00e9 des Sciences d\u2019Orsay, Universit\u00e9 Paris Sud XI, Paris, France."},{"key":"2024070220223715800_bib13","doi-asserted-by":"publisher","first-page":"2486","DOI":"10.1109\/ICASSP.2018.8462678","article-title":"Convolutional sequence to sequence model with non-sequential greedy decoding for grapheme to phoneme conversion","volume-title":"Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Chae","year":"2018"},{"key":"2024070220223715800_bib14","doi-asserted-by":"publisher","first-page":"232","DOI":"10.3115\/980451.980883","article-title":"Proper name translation in cross-language information retrieval","volume-title":"COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics","author":"Chen","year":"1998"},{"key":"2024070220223715800_bib15","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1008","article-title":"The best of both worlds: Combining recent advances in neural machine translation","author":"Chen","year":"2018","journal-title":"arXiv preprint arXiv:1804.09849"},{"key":"2024070220223715800_bib16","doi-asserted-by":"publisher","first-page":"2033","DOI":"10.21437\/Eurospeech.2003-584","article-title":"Conditional and joint models for grapheme-to-phoneme conversion","volume-title":"Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech 2003)","author":"Chen","year":"2003"},{"issue":"1","key":"2024070220223715800_bib17","doi-asserted-by":"publisher","first-page":"62","DOI":"10.1086\/706549","article-title":"From transcript to \u201ctrans-script\u201d: Romanized Santali across semiotic media","volume":"8","author":"Choksi","year":"2020","journal-title":"Signs and Society"},{"key":"2024070220223715800_bib18","doi-asserted-by":"publisher","first-page":"20","DOI":"10.3115\/1622153.1622156","article-title":"A diachronic approach for schwa deletion in Indo Aryan languages","volume-title":"Proceedings of the 7th 
Meeting of the ACL Special Interest Group in Computational Phonology: Current Themes in Computational Phonology and Morphology","author":"Choudhury","year":"2004"},{"key":"2024070220223715800_bib19","doi-asserted-by":"publisher","first-page":"8440","DOI":"10.18653\/v1\/2020.acl-main.747","article-title":"Unsupervised cross-lingual representation learning at scale","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Conneau","year":"2020"},{"issue":"3","key":"2024070220223715800_bib20","doi-asserted-by":"publisher","first-page":"171","DOI":"10.1145\/363958.363994","article-title":"A technique for computer detection and correction of spelling errors","volume":"7","author":"Damerau","year":"1964","journal-title":"Communications of the ACM"},{"key":"2024070220223715800_bib21","doi-asserted-by":"publisher","first-page":"8239","DOI":"10.1109\/ICASSP40776.2020.9053443","article-title":"Language-agnostic multilingual modeling","volume-title":"ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Datta","year":"2020"},{"key":"2024070220223715800_bib22","first-page":"6662","article-title":"Criteria for useful automatic Romanization in South Asian languages","volume-title":"Proceedings of the Thirteenth Language Resources and Evaluation Conference","author":"Demirsahin","year":"2022"},{"key":"2024070220223715800_bib23","doi-asserted-by":"publisher","first-page":"399","DOI":"10.18653\/v1\/P16-1038","article-title":"Grapheme-to-phoneme models for (almost) any language","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Deri","year":"2016"},{"key":"2024070220223715800_bib24","doi-asserted-by":"publisher","first-page":"8584","DOI":"10.18653\/v1\/2021.emnlp-main.675","article-title":"Role of language relatedness in multilingual fine-tuning of language models: A case study in 
Indo-Aryan languages","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"Dhamecha","year":"2021"},{"key":"2024070220223715800_bib25","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2210.12273","article-title":"Graphemic normalization of the Perso-Arabic script","author":"Doctor","year":"2022","journal-title":"arXiv preprint arXiv:2210.12273"},{"key":"2024070220223715800_bib26","doi-asserted-by":"publisher","first-page":"12402","DOI":"10.18653\/v1\/2023.acl-long.693","article-title":"Towards leaving no Indic language behind: Building monolingual corpora, benchmark and models for Indic languages","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Doddapaneni","year":"2023"},{"key":"2024070220223715800_bib27","doi-asserted-by":"publisher","first-page":"489","DOI":"10.18653\/v1\/D18-1045","article-title":"Understanding back-translation at scale","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Edunov","year":"2018"},{"key":"2024070220223715800_bib28","doi-asserted-by":"publisher","first-page":"1","DOI":"10.3115\/v1\/W14-3901","article-title":"Foreign words and the automatic processing of Arabic social media text written in Roman script","volume-title":"Proceedings of the First Workshop on Computational Approaches to Code Switching","author":"Eskander","year":"2014"},{"key":"2024070220223715800_bib29","first-page":"48","article-title":"Transliteration using a phrase-based statistical machine translation system to re-score the output of a joint multigram model","volume-title":"Proceedings of the 2010 Named Entities Workshop","author":"Finch","year":"2010"},{"key":"2024070220223715800_bib30","first-page":"6","article-title":"Bi-directional conversion between graphemes and phonemes using a joint n-gram model","volume-title":"Proceedings of the 4th ISCA 
Tutorial and Research Workshop (ITRW) on Speech Synthesis","author":"Galescu","year":"2001"},{"key":"2024070220223715800_bib31","first-page":"368","article-title":"\u201cye word kis lang ka hai bhai?\u201d Testing the limits of word level language identification","volume-title":"Proceedings of the 11th International Conference on Natural Language Processing","author":"Gella","year":"2014"},{"key":"2024070220223715800_bib32","first-page":"94","article-title":"Use of transformer-based models for word-level transliteration of the Book of the Dean of Lismore","volume-title":"Proceedings of the 4th Celtic Language Technology Workshop within LREC2022","author":"Gow-Smith","year":"2022"},{"key":"2024070220223715800_bib33","doi-asserted-by":"publisher","first-page":"227","DOI":"10.1016\/B978-012373591-1\/50012-7","article-title":"Text entry in South and Southeast Asian scripts","volume-title":"Text Entry Systems: Mobility, Accessibility, Universality","author":"Gupta","year":"2007"},{"key":"2024070220223715800_bib34","doi-asserted-by":"publisher","first-page":"381","DOI":"10.18653\/v1\/2022.wanlp-1.36","article-title":"Beyond Arabic: Software for Perso-Arabic script manipulation","volume-title":"Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)","author":"Gutkin","year":"2022"},{"key":"2024070220223715800_bib35","first-page":"6450","article-title":"Extensions to Brahmic script processing within the Nisaba library: New scripts, languages and utilities","volume-title":"Proceedings of the Thirteenth Language Resources and Evaluation Conference","author":"Gutkin","year":"2022"},{"key":"2024070220223715800_bib36","doi-asserted-by":"publisher","first-page":"10","DOI":"10.18653\/v1\/W17-4002","article-title":"Transliterated mobile keyboard input via weighted finite-state transducers","volume-title":"Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP 
2017)","author":"Hellsten","year":"2017"},{"issue":"8","key":"2024070220223715800_bib37","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Computation"},{"key":"2024070220223715800_bib38","first-page":"75","article-title":"Processing informal, romanized Pakistani text messages","volume-title":"Proceedings of the Second Workshop on Language in Social Media","author":"Irvine","year":"2012"},{"key":"2024070220223715800_bib39","article-title":"ISO 15919: Transliteration of Devanagari and related Indic scripts into Latin characters","author":"ISO","year":"2001"},{"key":"2024070220223715800_bib40","article-title":"ISO 639-1: Codes for the representation of names of languages\u2014part 1: Alpha-2 code","author":"ISO","year":"2002"},{"key":"2024070220223715800_bib41","doi-asserted-by":"publisher","first-page":"874","DOI":"10.18653\/v1\/2021.eacl-main.74","article-title":"Leveraging passage retrieval with generative models for open domain question answering","volume-title":"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume","author":"Izacard","year":"2021"},{"key":"2024070220223715800_bib42","volume-title":"Statistical Methods for Speech Recognition","author":"Jelinek","year":"1998"},{"issue":"3","key":"2024070220223715800_bib43","doi-asserted-by":"publisher","first-page":"250","DOI":"10.1109\/TIT.1975.1055384","article-title":"Design of a linguistic statistical decoder for the recognition of continuous speech","volume":"21","author":"Jelinek","year":"1975","journal-title":"IEEE Transactions on Information Theory"},{"key":"2024070220223715800_bib44","doi-asserted-by":"publisher","first-page":"1123","DOI":"10.21437\/Interspeech.2019-1951","article-title":"Direct speech-to-speech translation with a sequence-to-sequence model","volume-title":"Proceedings of 
Interspeech 2019","author":"Jia","year":"2019"},{"key":"2024070220223715800_bib45","first-page":"697","article-title":"Integrating joint n-gram features into a discriminative training framework","volume-title":"Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics","author":"Jiampojamarn","year":"2010"},{"key":"2024070220223715800_bib46","doi-asserted-by":"publisher","first-page":"264","DOI":"10.21437\/SLTU.2018-55","article-title":"Brahmic schwa-deletion with neural classifiers: Experiments with Bengali","volume-title":"Proceedings of the 6th International Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU)","author":"Johny","year":"2018"},{"key":"2024070220223715800_bib47","doi-asserted-by":"publisher","first-page":"14","DOI":"10.18653\/v1\/2021.eacl-demos.3","article-title":"Finite-state script normalization and processing utilities: The Nisaba Brahmic library","volume-title":"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations","author":"Johny","year":"2021"},{"issue":"3","key":"2024070220223715800_bib48","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/1922649.1922654","article-title":"Machine transliteration survey","volume":"43","author":"Karimi","year":"2011","journal-title":"ACM Computing Surveys"},{"key":"2024070220223715800_bib49","first-page":"4247","article-title":"Cross-lingual named entity list search via transliteration","volume-title":"Proceedings of the Twelfth Language Resources and Evaluation Conference","author":"Khakhmovich","year":"2020"},{"key":"2024070220223715800_bib50","doi-asserted-by":"publisher","first-page":"1529","DOI":"10.21437\/Interspeech.2021-2062","article-title":"Low resource ASR: The surprising effectiveness of high resource transliteration","volume-title":"Proceedings of Interspeech 
2021","author":"Khare","year":"2021"},{"key":"2024070220223715800_bib51","doi-asserted-by":"publisher","first-page":"74","DOI":"10.18653\/v1\/W18-2709","article-title":"On the impact of various types of noise on neural machine translation","volume-title":"Proceedings of the 2nd Workshop on Neural Machine Translation and Generation","author":"Khayrallah","year":"2018"},{"key":"2024070220223715800_bib52","article-title":"Adam: A method for stochastic optimization","author":"Kingma","year":"2014","journal-title":"arXiv preprint arXiv:1412.6980"},{"key":"2024070220223715800_bib53","doi-asserted-by":"publisher","first-page":"181","DOI":"10.1109\/ICASSP.1995.479394","article-title":"Improved backing-off for m-gram language modeling","volume-title":"Proceedings of 1995 International Conference on Acoustics, Speech, and Signal Processing (ICASSP \u201995)","author":"Kneser","year":"1995"},{"issue":"4","key":"2024070220223715800_bib54","first-page":"599","article-title":"Machine transliteration","volume":"24","author":"Knight","year":"1998","journal-title":"Computational Linguistics"},{"key":"2024070220223715800_bib55","doi-asserted-by":"publisher","first-page":"50","DOI":"10.1162\/tacl_a_00447","article-title":"Quality at a glance: An audit of web-crawled multilingual datasets","volume":"10","author":"Kreutzer","year":"2022","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024070220223715800_bib56","doi-asserted-by":"publisher","first-page":"66","DOI":"10.18653\/v1\/D18-2012","article-title":"SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations","author":"Kudo","year":"2018"},{"key":"2024070220223715800_bib57","doi-asserted-by":"publisher","first-page":"16","DOI":"10.18653\/v1\/2020.wnut-1.3","article-title":"Noisy text data: Achilles\u2019 heel of 
BERT","volume-title":"Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)","author":"Kumar","year":"2020"},{"key":"2024070220223715800_bib58","doi-asserted-by":"publisher","first-page":"217","DOI":"10.18653\/v1\/E17-2035","article-title":"Morphological analysis of the Dravidian language family","volume-title":"Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers","author":"Kumar","year":"2017"},{"key":"2024070220223715800_bib59","doi-asserted-by":"publisher","first-page":"3469","DOI":"10.18653\/v1\/2021.eacl-main.303","article-title":"A large-scale evaluation of neural machine transliteration for Indic languages","volume-title":"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume","author":"Kunchukuttan","year":"2021"},{"key":"2024070220223715800_bib60","doi-asserted-by":"publisher","first-page":"303","DOI":"10.1162\/tacl_a_00022","article-title":"Leveraging orthographic similarity for multilingual neural transliteration","volume":"6","author":"Kunchukuttan","year":"2018","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024070220223715800_bib61","doi-asserted-by":"publisher","first-page":"81","DOI":"10.3115\/v1\/N15-3017","article-title":"Brahmi-net: A transliteration and script conversion system for languages of the Indian subcontinent","volume-title":"Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations","author":"Kunchukuttan","year":"2015"},{"key":"2024070220223715800_bib62","doi-asserted-by":"publisher","first-page":"79","DOI":"10.18653\/v1\/W18-2411","article-title":"A deep learning based approach to transliteration","volume-title":"Proceedings of the Seventh Named Entities 
Workshop","author":"Kundu","year":"2018"},{"key":"2024070220223715800_bib63","first-page":"282","article-title":"Conditional random fields: Probabilistic models for segmenting and labeling sequence data","volume-title":"Proceedings of the 18th International Conference on Machine Learning (ICML)","author":"Lafferty","year":"2001"},{"key":"2024070220223715800_bib64","doi-asserted-by":"publisher","first-page":"58","DOI":"10.18653\/v1\/2022.findings-acl.6","article-title":"Pre-trained multilingual sequence-to-sequence models: A hope for low-resource language translation?","volume-title":"Findings of the Association for Computational Linguistics: ACL 2022","author":"Lee","year":"2022"},{"key":"2024070220223715800_bib65","first-page":"633","article-title":"Conversion between scripts of Punjabi: Beyond simple transliteration","volume-title":"Proceedings of COLING 2012: Posters","author":"Lehal","year":"2012"},{"key":"2024070220223715800_bib66","first-page":"232","article-title":"Sangam: A Perso-Arabic to Indic script machine transliteration model","volume-title":"Proceedings of the 11th International Conference on Natural Language Processing","author":"Lehal","year":"2014"},{"key":"2024070220223715800_bib67","volume-title":"A Grammar of Modern Tamil","author":"Lehmann","year":"1993"},{"issue":"8","key":"2024070220223715800_bib68","first-page":"707","article-title":"Binary codes capable of correcting deletions, insertions, and reversals","volume":"10","author":"Levenshtein","year":"1966","journal-title":"Soviet Physics\u2014Doklady"},{"key":"2024070220223715800_bib69","doi-asserted-by":"publisher","first-page":"159","DOI":"10.3115\/1218955.1218976","article-title":"A joint source-channel model for machine transliteration","volume-title":"Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics 
(ACL-04)","author":"Li","year":"2004"},{"key":"2024070220223715800_bib70","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2301.10472","article-title":"XLM-V: Overcoming the vocabulary bottleneck in multilingual masked language models","author":"Liang","year":"2023","journal-title":"arXiv preprint arXiv:2301.10472"},{"key":"2024070220223715800_bib71","doi-asserted-by":"publisher","first-page":"1412","DOI":"10.18653\/v1\/D15-1166","article-title":"Effective approaches to attention-based neural machine translation","volume-title":"Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing","author":"Luong","year":"2015"},{"key":"2024070220223715800_bib72","doi-asserted-by":"publisher","first-page":"816","DOI":"10.18653\/v1\/2023.acl-short.71","article-title":"Bhasa-abhijnaanam: Native-script and romanized language identification for 22 Indic languages","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Madhani","year":"2023"},{"key":"2024070220223715800_bib73","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.findings-emnlp.4","article-title":"Aksharantar: Towards building open transliteration tools for the next billion users","author":"Madhani","year":"2022","journal-title":"arXiv preprint arXiv:2205.03018"},{"key":"2024070220223715800_bib74","article-title":"Converting romanized Persian to the Arabic writing systems","volume-title":"Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC\u201908)","author":"Maleki","year":"2008"},{"key":"2024070220223715800_bib75","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/TAI.2022.3205567","article-title":"DReD\u2014A descriptive relation dataset for expanding relation extraction","volume-title":"IEEE Transactions on Artificial 
Intelligence","author":"Markewich","year":"2022"},{"key":"2024070220223715800_bib76","first-page":"630","article-title":"Design challenges in named entity transliteration","volume-title":"Proceedings of the 27th International Conference on Computational Linguistics","author":"Merhav","year":"2018"},{"key":"2024070220223715800_bib77","first-page":"195","article-title":"Romanagari an alternative for modern media writings","volume":"75","author":"Mhaiskar","year":"2015","journal-title":"Bulletin of the Deccan College Post-Graduate and Research Institute"},{"key":"2024070220223715800_bib78","doi-asserted-by":"publisher","first-page":"80","DOI":"10.1007\/s10278-022-00692-x","article-title":"Application of deep learning in generating structured radiology reports: A transformer-based technique","volume":"36","author":"Moezzi","year":"2023","journal-title":"Journal of Digital Imaging"},{"issue":"3","key":"2024070220223715800_bib79","first-page":"321","article-title":"Semiring frameworks and algorithms for shortest-distance problems","volume":"7","author":"Mohri","year":"2002","journal-title":"Journal of Automata, Languages and Combinatorics"},{"key":"2024070220223715800_bib80","doi-asserted-by":"publisher","first-page":"670","DOI":"10.18653\/v1\/2023.findings-eacl.50","article-title":"Does transliteration help multilingual language modeling?","volume-title":"Findings of the Association for Computational Linguistics: EACL 2023","author":"Moosa","year":"2023"},{"key":"2024070220223715800_bib81","doi-asserted-by":"publisher","first-page":"1558","DOI":"10.18653\/v1\/2021.emnlp-main.117","article-title":"Evaluating the robustness of neural language models to input perturbations","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"Moradi","year":"2021"},{"key":"2024070220223715800_bib82","first-page":"79","article-title":"Effective architectures for low resource multilingual named entity 
transliteration","volume-title":"Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages","author":"Moran","year":"2020"},{"key":"2024070220223715800_bib83","doi-asserted-by":"publisher","first-page":"51","DOI":"10.18653\/v1\/N16-2008","article-title":"Developing language technology tools and resources for a resource-poor language: Sindhi","volume-title":"Proceedings of the NAACL Student Research Workshop","author":"Motlani","year":"2016"},{"key":"2024070220223715800_bib84","doi-asserted-by":"publisher","first-page":"448","DOI":"10.18653\/v1\/2021.naacl-main.38","article-title":"When being unseen from mBERT is just the beginning: Handling new languages with multilingual language models","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Muller","year":"2021"},{"key":"2024070220223715800_bib85","doi-asserted-by":"publisher","first-page":"189","DOI":"10.18653\/v1\/2020.sigmorphon-1.22","article-title":"Transliteration for cross-lingual morphological inflection","volume-title":"Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology","author":"Murikinati","year":"2020"},{"issue":"1","key":"2024070220223715800_bib86","doi-asserted-by":"publisher","first-page":"68","DOI":"10.1080\/19472498.2017.1411049","article-title":"Writing Punjabi across borders","volume":"9","author":"Murphy","year":"2018","journal-title":"South Asian History and Culture"},{"key":"2024070220223715800_bib87","doi-asserted-by":"publisher","first-page":"628","DOI":"10.18653\/v1\/2022.acl-long.47","article-title":"AraT5: Text-to-text transformers for Arabic language generation","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long 
Papers)","author":"Nagoudi","year":"2022"},{"key":"2024070220223715800_bib88","doi-asserted-by":"publisher","first-page":"84","DOI":"10.18653\/v1\/W18-2412","article-title":"Comparison of assorted models for transliteration","volume-title":"Proceedings of the Seventh Named Entities Workshop","author":"Najafi","year":"2018"},{"key":"2024070220223715800_bib89","doi-asserted-by":"publisher","first-page":"833","DOI":"10.1109\/ICASSP.1987.1169844","article-title":"A data-driven organization of the dynamic programming beam search for continuous speech recognition","volume-title":"Proceedings of the IEEE 1987 International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Ney","year":"1987"},{"key":"2024070220223715800_bib90","doi-asserted-by":"publisher","first-page":"72","DOI":"10.18653\/v1\/W15-3911","article-title":"Multiple system combination for transliteration","volume-title":"Proceedings of the Fifth Named Entity Workshop","author":"Nicolai","year":"2015"},{"key":"2024070220223715800_bib91","doi-asserted-by":"publisher","first-page":"33","DOI":"10.18653\/v1\/2023.cawl-1.5","article-title":"Distinguishing romanized Hindi from romanized Urdu","volume-title":"Proceedings of the Workshop on Computation and Written Language (CAWL 2023)","author":"Nielsen","year":"2023"},{"key":"2024070220223715800_bib92","doi-asserted-by":"publisher","first-page":"495","DOI":"10.1007\/978-3-540-88690-7_37","article-title":"A linear time histogram metric for improved SIFT matching","volume-title":"Computer Vision\u2013ECCV 2008","author":"Pele","year":"2008"},{"key":"2024070220223715800_bib93","doi-asserted-by":"publisher","first-page":"460","DOI":"10.1109\/ICCV.2009.5459199","article-title":"Fast and robust earth mover\u2019s distances","volume-title":"2009 IEEE 12th International Conference on Computer 
Vision","author":"Pele","year":"2009"},{"issue":"2","key":"2024070220223715800_bib94","doi-asserted-by":"publisher","first-page":"257","DOI":"10.1109\/5.18626","article-title":"A tutorial on hidden Markov models and selected applications in speech recognition","volume":"77","author":"Rabiner","year":"1989","journal-title":"Proceedings of the IEEE"},{"issue":"1","key":"2024070220223715800_bib95","first-page":"5485","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"Journal of Machine Learning Research"},{"key":"2024070220223715800_bib96","doi-asserted-by":"publisher","first-page":"26","DOI":"10.18653\/v1\/W19-1403","article-title":"Joint approach to deromanization of code-mixed texts","volume-title":"Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects","author":"Riyadh","year":"2019"},{"key":"2024070220223715800_bib97","first-page":"61","article-title":"The OpenGrm open-source finite-state grammar software libraries","volume-title":"Proceedings of the ACL 2012 System Demonstrations","author":"Roark","year":"2012"},{"key":"2024070220223715800_bib98","first-page":"2413","article-title":"Processing South Asian languages written in the Latin script: The Dakshina dataset","volume-title":"Proceedings of the Twelfth Language Resources and Evaluation Conference","author":"Roark","year":"2020"},{"key":"2024070220223715800_bib99","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2203.17189","article-title":"Scaling up models and data with t5x and seqio","author":"Roberts","year":"2022","journal-title":"arXiv preprint arXiv:2203.17189"},{"key":"2024070220223715800_bib100","doi-asserted-by":"publisher","first-page":"10215","DOI":"10.18653\/v1\/2021.emnlp-main.802","article-title":"XTREME-R: Towards more challenging and nuanced multilingual evaluation","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in 
Natural Language Processing","author":"Ruder","year":"2021"},{"key":"2024070220223715800_bib101","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-62703-646-7","volume-title":"Multiple Sequence Alignment Methods","author":"Russell","year":"2014"},{"key":"2024070220223715800_bib102","first-page":"373","article-title":"Brahmi and Kharoshthi","volume-title":"The World\u2019s Writing Systems","author":"Salomon","year":"1996"},{"key":"2024070220223715800_bib103","unstructured":"Samaranayake, V. K., S. T. Nandasara, J. B. Disanayaka, A. R. Weerasinghe, and H. Wijayawardhana. 2003. An introduction to UNICODE for Sinhala characters. Technical Report UCSC 03\/01, University of Colombo, School of Computing, Colombo, Sri Lanka."},{"issue":"191","key":"2024070220223715800_bib104","doi-asserted-by":"publisher","first-page":"45","DOI":"10.1515\/IJSL.2008.024","article-title":"The Ausbau issue in the Dravidian languages: The case of Tamil and the problem of purism","volume":"2008","author":"Schiffman","year":"2008","journal-title":"International Journal of the Sociology of Language"},{"key":"2024070220223715800_bib105","doi-asserted-by":"publisher","first-page":"266","DOI":"10.18653\/v1\/2023.acl-srw.37","article-title":"Data selection for fine-tuning large language models using transferred Shapley values","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)","author":"Schoch","year":"2023"},{"key":"2024070220223715800_bib106","doi-asserted-by":"publisher","first-page":"5149","DOI":"10.1109\/ICASSP.2012.6289079","article-title":"Japanese and Korean voice search","volume-title":"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Schuster","year":"2012"},{"key":"2024070220223715800_bib107","doi-asserted-by":"publisher","first-page":"86","DOI":"10.18653\/v1\/P16-1009","article-title":"Improving neural machine translation models with monolingual 
data","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Sennrich","year":"2016"},{"issue":"9","key":"2024070220223715800_bib108","doi-asserted-by":"publisher","first-page":"229","DOI":"10.14569\/IJACSA.2019.0100929","article-title":"Identification of issues and challenges in romanized Sindhi text","volume":"10","author":"Sodhar","year":"2019","journal-title":"International Journal of Advanced Computer Science and Applications (IJACSA)"},{"key":"2024070220223715800_bib109","first-page":"36","article-title":"Partial traceback in continuous speech recognition","volume-title":"Proceedings of the IEEE 1980 International Conference on Cybernetics and Society (ICCS)","author":"Spohrer","year":"1980"},{"key":"2024070220223715800_bib110","doi-asserted-by":"publisher","first-page":"725","DOI":"10.4324\/9780203214961-36","article-title":"Tamil and the Dravidian languages","volume-title":"The World\u2019s Major Languages","author":"Steever","year":"1987"},{"key":"2024070220223715800_bib111","doi-asserted-by":"publisher","DOI":"10.4324\/9781315722580","volume-title":"The Dravidian Languages","author":"Steever","year":"2019","edition":"2nd edition"},{"key":"2024070220223715800_bib112","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511816338","volume-title":"Text-to-Speech Synthesis","author":"Taylor","year":"2009"},{"key":"2024070220223715800_bib113","first-page":"461","article-title":"South and Central Asia - I","volume-title":"The Unicode Standard (Version 15.0.0)","author":"Unicode Consortium","year":"2022"},{"key":"2024070220223715800_bib114","unstructured":"United Nations. 2007. Technical reference manual for the standardization of geographical names. Technical Report ST\/ESA\/STAT\/SER.M\/87, United Nations, Department of Economic and Social Affairs, Statistics Division, New York. United Nations Group of Experts on Geographical Names. 
URL: https:\/\/unstats.un.org\/unsd\/geoinfo\/ungegn\/docs\/pubs\/UNGEGN%20tech%20ref%20manual_m87_combined.pdf."},{"key":"2024070220223715800_bib115","first-page":"5998","article-title":"Attention is all you need","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani","year":"2017"},{"key":"2024070220223715800_bib116","doi-asserted-by":"publisher","first-page":"57","DOI":"10.3115\/1119384.1119392","article-title":"Transliteration of proper names in cross-lingual information retrieval","volume-title":"Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-language Named Entity Recognition","author":"Virga","year":"2003"},{"key":"2024070220223715800_bib117","first-page":"219","article-title":"Part-of-speech tagging","volume-title":"The Oxford Handbook of Computational Linguistics","author":"Voutilainen","year":"2003"},{"key":"2024070220223715800_bib118","doi-asserted-by":"publisher","first-page":"316","DOI":"10.18653\/v1\/K19-1030","article-title":"Improving pre-trained multilingual model with vocabulary expansion","volume-title":"Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)","author":"Wang","year":"2019"},{"key":"2024070220223715800_bib119","volume-title":"The Conversion of Scripts: Its Nature, History, and Utilization","author":"Wellisch","year":"1978"},{"key":"2024070220223715800_bib120","first-page":"20","article-title":"Implementation of Internet domain names in Sinhala","volume-title":"Proceedings of International Symposium on Country Domain Governance (CDG)","author":"Wijayawardhana","year":"2008"},{"key":"2024070220223715800_bib121","first-page":"354","article-title":"String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage","volume-title":"Proceedings of the Section on Survey Research of American Statistical Association 
(ASA)","author":"Winkler","year":"1990"},{"issue":"4","key":"2024070220223715800_bib122","doi-asserted-by":"publisher","first-page":"1085","DOI":"10.1109\/18.87000","article-title":"The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression","volume":"37","author":"Witten","year":"1991","journal-title":"IEEE Transactions on Information Theory"},{"key":"2024070220223715800_bib123","doi-asserted-by":"publisher","first-page":"108","DOI":"10.18653\/v1\/W19-3114","article-title":"Latin script keyboards for South Asian languages with finite-state normalization","volume-title":"Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing","author":"Wolf-Sonkin","year":"2019"},{"key":"2024070220223715800_bib124","doi-asserted-by":"publisher","first-page":"Article 101283","DOI":"10.1016\/j.csl.2021.101283","article-title":"Improving low-resource machine transliteration by using 3-way transfer learning","volume":"72","author":"Wu","year":"2022","journal-title":"Computer Speech & Language"},{"key":"2024070220223715800_bib125","doi-asserted-by":"publisher","first-page":"291","DOI":"10.1162\/tacl_a_00461","article-title":"ByT5: Towards a token-free future with pre-trained byte-to-byte models","volume":"10","author":"Xue","year":"2022","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024070220223715800_bib126","doi-asserted-by":"publisher","first-page":"483","DOI":"10.18653\/v1\/2021.naacl-main.41","article-title":"mT5: A massively multilingual pre-trained text-to-text transformer","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Xue","year":"2021"},{"key":"2024070220223715800_bib127","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4471-5779-3","volume-title":"Automatic Speech Recognition: A Deep Learning 
Approach","author":"Yu","year":"2015"},{"issue":"2","key":"2024070220223715800_bib128","doi-asserted-by":"publisher","first-page":"293","DOI":"10.1162\/coli_a_00349","article-title":"Neural models of text normalization for speech applications","volume":"45","author":"Zhang","year":"2019","journal-title":"Computational Linguistics"}],"container-title":["Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/coli\/article-pdf\/50\/2\/475\/2456386\/coli_a_00510.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/coli\/article-pdf\/50\/2\/475\/2456386\/coli_a_00510.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,2]],"date-time":"2024-07-02T20:23:24Z","timestamp":1719951804000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/coli\/article\/50\/2\/475\/119145\/Context-aware-Transliteration-of-Romanized-South"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023]]},"references-count":128,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2023,6,1]]},"published-print":{"date-parts":[[2023,6,1]]}},"URL":"https:\/\/doi.org\/10.1162\/coli_a_00510","relation":{},"ISSN":["0891-2017","1530-9312"],"issn-type":[{"value":"0891-2017","type":"print"},{"value":"1530-9312","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2023]]},"published":{"date-parts":[[2023]]}}}