{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T07:30:10Z","timestamp":1750231810241,"version":"3.41.0"},"reference-count":28,"publisher":"MIT Press","issue":"2","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computational Linguistics"],"published-print":{"date-parts":[[2017,6]]},"abstract":"<jats:p>We present a generative model that efficiently mines transliteration pairs in a consistent fashion in three different settings: unsupervised, semi-supervised, and supervised transliteration mining. The model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs (i.e., noise). The model is trained on noisy unlabeled data using the EM algorithm. During training the transliteration sub-model learns to generate transliteration pairs and the fixed non-transliteration model generates the noise pairs. After training, the unlabeled data is disambiguated based on the posterior probabilities of the two sub-models. We evaluate our transliteration mining system on data from a transliteration mining shared task and on parallel corpora. For three out of four language pairs, our system outperforms all semi-supervised and supervised systems that participated in the NEWS 2010 shared task. 
On word pairs extracted from parallel corpora with fewer than 2% transliteration pairs, our system achieves up to 86.7% F-measure with 77.9% precision and 97.8% recall.<\/jats:p>","DOI":"10.1162\/coli_a_00286","type":"journal-article","created":{"date-parts":[[2017,3,28]],"date-time":"2017-03-28T19:43:24Z","timestamp":1490730204000},"page":"349-375","source":"Crossref","is-referenced-by-count":8,"title":["Statistical Models for Unsupervised, Semi-Supervised, and Supervised Transliteration Mining"],"prefix":"10.1162","volume":"43","author":[{"given":"Hassan","family":"Sajjad","sequence":"first","affiliation":[{"name":"Qatar Computing Research Institute"}]},{"given":"Helmut","family":"Schmid","sequence":"additional","affiliation":[{"name":"Ludwig Maximilian University of Munich"}]},{"given":"Alexander","family":"Fraser","sequence":"additional","affiliation":[{"name":"Ludwig Maximilian University of Munich"}]},{"given":"Hinrich","family":"Sch\u00fctze","sequence":"additional","affiliation":[{"name":"Ludwig Maximilian University of Munich"}]}],"member":"281","reference":[{"key":"bib1","unstructured":"Aransa, Walid, Holger Schwenk and Loic Barrault. 2012. Semi-supervised transliteration mining from parallel and comparable corpora. In Proceedings of the 9th International Workshop on Spoken Language Translation, pages 185\u2013192, Hong Kong."},{"key":"bib2","doi-asserted-by":"publisher","DOI":"10.1214\/aoms\/1177699147"},{"key":"bib3","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2008.01.002"},{"key":"bib4","unstructured":"Darwish, Kareem. 2010. Transliteration mining with phonetic conflation and iterative training. In Proceedings of the 2010 Named Entities Workshop, pages 53\u201356, Uppsala."},{"key":"bib5","doi-asserted-by":"publisher","DOI":"10.1111\/j.2517-6161.1977.tb01600.x"},{"key":"bib6","unstructured":"Durrani, Nadir and Philipp Koehn. 2014. Improving machine translation via triangulation and transliteration. 
In Proceedings of the 17th Annual Conference of the European Association for Machine Translation, EAMT'14, pages 71\u201378, Dubrovnik."},{"key":"bib7","unstructured":"Durrani, Nadir, Hassan Sajjad, Alexander Fraser, and Helmut Schmid. 2010. Hindi-to-Urdu machine translation through transliteration. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 465\u2013474, Uppsala."},{"key":"bib8","doi-asserted-by":"crossref","unstructured":"Durrani, Nadir, Hassan Sajjad, Hieu Hoang and Philipp Koehn. 2014. Integrating an unsupervised transliteration model into statistical machine translation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers, pages 148\u2013153, Gothenburg.","DOI":"10.3115\/v1\/E14-4029"},{"key":"bib9","unstructured":"Eisele, Andreas and Yu Chen. 2010. MultiUN: A multilingual corpus from United Nation documents. In Proceedings of the Seventh Conference on International Language Resources and Evaluation, pages 2868\u20132872, Valletta."},{"key":"bib10","unstructured":"El-Kahki, Ali, Kareem Darwish, Ahmed Saad El Din, Mohamed Abd El-Wahab, Ahmed Hefny, and Waleed Ammar. 2011. Improved transliteration mining using graph reinforcement. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1384\u20131393, Edinburgh."},{"key":"bib11","unstructured":"Gale, William A. and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1):177\u2013184."},{"key":"bib12","unstructured":"Huang, Fei. 2005. Multilingual Named Entity Extraction and Translation from Text and Speech. Ph.D. thesis, Language Technology Institute, Carnegie Mellon University."},{"key":"bib13","unstructured":"Jiampojamarn, Sittichai, Kenneth Dwyer, Shane Bergsma, Aditya Bhargava, Qing Dou, Mi-Young Kim, and Grzegorz Kondrak. 2010. 
Transliteration generation and mining with limited training resources. In Proceedings of the 2010 Named Entities Workshop, pages 39\u201347, Uppsala."},{"key":"bib14","doi-asserted-by":"crossref","unstructured":"Koehn, Philipp, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference, pages 48\u201354, Edmonton.","DOI":"10.3115\/1073445.1073462"},{"key":"bib15","unstructured":"Kumaran, A., Mitesh M. Khapra, and Haizhou Li. 2010. Whitepaper of NEWS 2010 shared task on transliteration mining. In Proceedings of the 2010 Named Entities Workshop, pages 19\u201326, Uppsala."},{"key":"bib17","doi-asserted-by":"crossref","unstructured":"Li, Haizhou, Zhang Min, and Su Jian. 2004. A joint source-channel model for machine transliteration. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 159\u2013166, Barcelona.","DOI":"10.3115\/1218955.1218976"},{"key":"bib18","doi-asserted-by":"publisher","DOI":"10.3115\/1654449.1654460"},{"key":"bib19","unstructured":"Nabende, Peter. 2010. Mining transliterations from Wikipedia using Pair HMMs. In Proceedings of the 2010 Named Entities Workshop, pages 76\u201380, Uppsala."},{"key":"bib20","doi-asserted-by":"crossref","unstructured":"Noeman, Sara and Amgad Madkour. 2010. Language independent transliteration mining system using finite state automata framework. In Proceedings of the 2010 Named Entities Workshop, pages 112\u2013115, Uppsala.","DOI":"10.3115\/1699705.1699734"},{"key":"bib21","doi-asserted-by":"publisher","DOI":"10.1162\/089120103321337421"},{"key":"bib22","unstructured":"Sajjad, Hassan, Nadir Durrani, Helmut Schmid, and Alexander Fraser. 2011. Comparing two techniques for learning transliteration models using a parallel corpus. 
In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 129\u2013137, Chiang Mai."},{"key":"bib23","unstructured":"Sajjad, Hassan, Alexander Fraser, and Helmut Schmid. 2011. An algorithm for unsupervised transliteration mining with an application to word alignment. In Proceedings of the 49th Annual Conference of the Association for Computational Linguistics, pages 430\u2013439, Portland, OR."},{"key":"bib24","unstructured":"Sajjad, Hassan, Alexander Fraser, and Helmut Schmid. 2012. A statistical model for unsupervised and semi-supervised transliteration mining. In Proceedings of the 50th Annual Conference of the Association for Computational Linguistics, pages 469\u2013477, Jeju Island."},{"key":"bib25","unstructured":"Sajjad, Hassan, Francisco Guzm\u00e1n, Preslav Nakov, Ahmed Abdelali, Kenton Murray, Fahad Al Obaidli, and Stephan Vogel. 2013a. QCRI at IWSLT 2013: Experiments in Arabic-English and English-Arabic spoken language translation. In Proceedings of the 10th International Workshop on Spoken Language Translation (IWSLT-13), Heidelberg."},{"key":"bib26","unstructured":"Sajjad, Hassan, Svetlana Smekalova, Nadir Durrani, Alexander Fraser, and Helmut Schmid. 2013b. QCRI-MES submission at WMT13: Using transliteration mining to improve statistical machine translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 219\u2013224, Sofia."},{"key":"bib27","unstructured":"Sherif, Tarek and Grzegorz Kondrak. 2007. Bootstrapping a stochastic transducer for Arabic-English transliteration extraction. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 864\u2013871, Prague."},{"key":"bib28","doi-asserted-by":"crossref","unstructured":"Tao, Tao, Su-Yoon Yoon, Andrew Fister, Richard Sproat, and ChengXiang Zhai. 2006. Unsupervised named entity transliteration using temporal and phonetic correlation. 
In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 250\u2013257, Sydney.","DOI":"10.3115\/1610075.1610112"},{"key":"bib29","doi-asserted-by":"publisher","DOI":"10.1109\/18.87000"}],"container-title":["Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mitpressjournals.org\/doi\/pdf\/10.1162\/COLI_a_00286","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:02:18Z","timestamp":1750186938000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/coli\/article\/43\/2\/349-375\/1568"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,6]]},"references-count":28,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2017,6]]}},"alternative-id":["10.1162\/COLI_a_00286"],"URL":"https:\/\/doi.org\/10.1162\/coli_a_00286","relation":{},"ISSN":["0891-2017","1530-9312"],"issn-type":[{"type":"print","value":"0891-2017"},{"type":"electronic","value":"1530-9312"}],"subject":[],"published":{"date-parts":[[2017,6]]}}}