{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,8,19]],"date-time":"2023-08-19T04:19:07Z","timestamp":1692418747653},"reference-count":30,"publisher":"Cambridge University Press (CUP)","issue":"4","license":[{"start":{"date-parts":[[2016,6,15]],"date-time":"2016-06-15T00:00:00Z","timestamp":1465948800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2016,7]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Mining parallel data from comparable corpora is a promising approach for overcoming the data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs. We present a novel phrase alignment approach that is designed to only align parallel sections bypassing non-parallel sections of the sentence. We compare the proposed approach with two other alignment methods: (1) the standard phrase extraction algorithm, which relies on the Viterbi path of the word alignment, (2) a binary classifier to detect parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the accuracy of these approaches using a manually aligned data set, and show that the proposed approach outperforms the other two approaches. Finally, we demonstrate the effectiveness of the extracted phrase pairs by using them in Arabic\u2013English and Urdu\u2013English translation systems, which resulted in improvements up to 1.2 Bleu over the baseline. 
The main contributions of this paper are two-fold: (1) novel phrase alignment algorithms to extract parallel phrase pairs from comparable sentences, (2) evaluating the utility of the extracted phrases by using them directly in the MT decoder.<\/jats:p>","DOI":"10.1017\/s1351324916000139","type":"journal-article","created":{"date-parts":[[2016,6,15]],"date-time":"2016-06-15T18:25:18Z","timestamp":1466015118000},"page":"549-573","source":"Crossref","is-referenced-by-count":8,"title":["Extracting parallel phrases from comparable data for machine translation"],"prefix":"10.1017","volume":"22","author":[{"given":"SANJIKA","family":"HEWAVITHARANA","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"STEPHAN","family":"VOGEL","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"56","published-online":{"date-parts":[[2016,6,15]]},"reference":[{"key":"S1351324916000139_ref003","first-page":"263","article-title":"The mathematics of statistical machine translation: parameter estimation","volume":"19","author":"Brown","year":"1993","journal-title":"Computational Linguistics"},{"key":"S1351324916000139_ref021","doi-asserted-by":"publisher","DOI":"10.1162\/089120103322711578"},{"key":"S1351324916000139_ref009","unstructured":"Hewavitharana S. and Vogel S. 2011. Extracting parallel phrases from comparable data. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, Portland, Oregon, pp. 61\u20138."},{"key":"S1351324916000139_ref015","doi-asserted-by":"crossref","unstructured":"Munteanu D. S. and Marcu D. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, pp. 
81\u20138.","DOI":"10.3115\/1220175.1220186"},{"key":"S1351324916000139_ref023","doi-asserted-by":"crossref","unstructured":"Tillmann C. and Hewavitharana S. 2011. An efficient unified alignment algorithm for bilingual data. In Proceedings of Interspeech 2011, Florence, Italy, August.","DOI":"10.1017\/S135132491100026X"},{"key":"S1351324916000139_ref006","doi-asserted-by":"crossref","unstructured":"Fung P. and Yee L. Y. 1998. An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, Montreal, Canada, pp. 414\u201320.","DOI":"10.3115\/980845.980916"},{"key":"S1351324916000139_ref005","unstructured":"Fung P. and Cheung P. 2004. Mining very non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, pp. 57\u201363."},{"key":"S1351324916000139_ref013","unstructured":"Kumano T. , Tanaka H. and Tokunaga T. 2007. Extracting phrasal alignments from comparable corpora by using joint probability SMT model. In Proceedings of the International Conference on Theoretical and Methodological Issues in Machine Translation, Sk\u00f6vde, Sweden, September."},{"key":"S1351324916000139_ref012","doi-asserted-by":"crossref","unstructured":"Koehn P. , Hoang H. , Birch A. , Callison-Burch C. , Federico M. , Bertoldi N. , Cowan B. , Shen W. , Moran C. , Zens R. , Dyer C. , Bojar O. , Constantin A. , and Herbst E. 2007. Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic, June.","DOI":"10.3115\/1557769.1557821"},{"key":"S1351324916000139_ref014","doi-asserted-by":"publisher","DOI":"10.1162\/089120105775299168"},{"key":"S1351324916000139_ref001","unstructured":"Banerjee S. and Lavie A. 2005. 
METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization, Ann Arbor, Michigan, USA, June, pp. 65\u201372."},{"key":"S1351324916000139_ref027","doi-asserted-by":"crossref","unstructured":"Vogel S. 2003. SMT decoder dissected: word reordering. In Proceedings of IEEE International Conference on Natural Language Processing and Knowledge Engineering, Beijing, China, October, pp. 561\u201366.","DOI":"10.1109\/NLPKE.2003.1275968"},{"key":"S1351324916000139_ref019","doi-asserted-by":"crossref","unstructured":"Rapp R. 1995. Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, Cambridge, Massachusetts, pp. 320\u201322.","DOI":"10.3115\/981658.981709"},{"key":"S1351324916000139_ref018","unstructured":"Quirk C. , Udupa R. U. and Menezes A. 2007. Generative models of noisy translations with applications to parallel fragment extraction. In Proceedings of the Machine Translation Summit XI, Copenhagen, Denmark, pp. 377\u201384."},{"key":"S1351324916000139_ref008","unstructured":"Gupta R. , Pal S. and Bandyopadhyay S. 2013. Improving MT system using extracted parallel fragments of text from comparable corpora. In Proceedings of the 6th Workshop on Building and Using Comparable Corpora, Sofia, Bulgaria, August."},{"key":"S1351324916000139_ref024","doi-asserted-by":"publisher","DOI":"10.1017\/S135132491100026X"},{"key":"S1351324916000139_ref011","unstructured":"Kikui G. , Sumita E. , Takezawa T. and Yamamoto S. 2003. Creating corpora for speech-to-speech translation. In Proceedings of EUROSPEECH, Geneva, pp. 381\u201384."},{"key":"S1351324916000139_ref002","doi-asserted-by":"publisher","DOI":"10.1007\/s10590-011-9089-6"},{"key":"S1351324916000139_ref026","doi-asserted-by":"crossref","unstructured":"Utiyama M. and Isahara H. 2003. 
Reliable measures for aligning Japanese-English news articles and sentences. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, pp. 72\u20139.","DOI":"10.3115\/1075096.1075106"},{"key":"S1351324916000139_ref016","doi-asserted-by":"crossref","unstructured":"Och F. J. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, pp. 160\u201367.","DOI":"10.3115\/1075096.1075117"},{"key":"S1351324916000139_ref025","doi-asserted-by":"crossref","unstructured":"Tillmann C. and Xu J.-M. 2009. A simple sentence-level extraction algorithm for comparable data. In Companion Vol. of NAACL HLT 09, Boulder, CA, June.","DOI":"10.3115\/1620853.1620881"},{"key":"S1351324916000139_ref004","first-page":"551","article-title":"Online passive-aggressive algorithms","volume":"7","author":"Crammer","year":"2006","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324916000139_ref007","unstructured":"Gupta M. , Hewavitharana S. and Vogel S. 2011. Extending a probabilistic phrase alignment approach for SMT. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), San Francisco, CA, December."},{"key":"S1351324916000139_ref028","unstructured":"Vogel S. 2005. PESA: phrase pair extraction as sentence splitting. In Proceedings of the Machine Translation Summit X, Phuket, Thailand, September."},{"key":"S1351324916000139_ref017","unstructured":"Papineni K. , Roukos S. , Ward T. and Zhu W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, July, pp. 311\u201318."},{"key":"S1351324916000139_ref029","unstructured":"Zhao B. and Vogel S. 2002a. Adaptive parallel sentence mining from web bilingual news collection. 
In Proceedings of the IEEE International Conference on Data Mining, Maebashi City, Japan, pp. 745\u201348."},{"key":"S1351324916000139_ref022","unstructured":"Snover M. , Dorr B. , Schwartz R. , Micciulla L. , and Makhoul J. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, Cambridge, MA."},{"key":"S1351324916000139_ref020","doi-asserted-by":"crossref","unstructured":"Rapp R. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, Maryland, USA, pp. 519\u201326.","DOI":"10.3115\/1034678.1034756"},{"key":"S1351324916000139_ref010","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-20128-8_10"},{"key":"S1351324916000139_ref030","doi-asserted-by":"crossref","unstructured":"Zhao B. and Vogel S. 2002b. Full-text story alignment models for Chinese-English bilingual news corpora. 
In Proceedings of the ICSLP '02, Denver, CO, September.","DOI":"10.21437\/ICSLP.2002-181"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324916000139","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,8,18]],"date-time":"2023-08-18T22:21:47Z","timestamp":1692397307000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324916000139\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,6,15]]},"references-count":30,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2016,7]]}},"alternative-id":["S1351324916000139"],"URL":"https:\/\/doi.org\/10.1017\/s1351324916000139","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,6,15]]}}}