{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,1,1]],"date-time":"2023-01-01T07:53:39Z","timestamp":1672559619191},"reference-count":46,"publisher":"Cambridge University Press (CUP)","issue":"4","license":[{"start":{"date-parts":[[2016,6,15]],"date-time":"2016-06-15T00:00:00Z","timestamp":1465948800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2016,7]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Comparable corpora are key translation resources for both languages and domains with limited linguistic resources. The existing approaches for building comparable corpora are mostly based on ranking candidate documents in the target language for each source document using a cross-lingual retrieval model. These approaches also exploit other evidence of document similarity, such as proper names and publication dates, to build more reliable alignments. However, the importance of each evidence in the scores of candidate target documents is determined heuristically. In this paper, we employ a learning to rank method for ranking candidate target documents with respect to each source document. The ranking model is constructed by defining each evidence for similarity of bilingual documents as a feature whose weight is learned automatically. Learning feature weights can significantly improve the quality of alignments, because the reliability of features depends on the characteristics of both source and target languages of a comparable corpus. We also propose a method to generate appropriate training data for the task of building comparable corpora. 
We employed the proposed learning-based approach to build a multi-domain English\u2013Persian comparable corpus which covers twelve different domains obtained from Open Directory Project. Experimental results show that the created alignments have high degrees of comparability. Comparison with existing approaches for building comparable corpora shows that our learning-based approach improves both quality and coverage of alignments.<\/jats:p>","DOI":"10.1017\/s1351324916000164","type":"journal-article","created":{"date-parts":[[2016,6,15]],"date-time":"2016-06-15T18:25:18Z","timestamp":1466015118000},"page":"627-653","source":"Crossref","is-referenced-by-count":3,"title":["Building a multi-domain comparable corpus using a learning to rank method"],"prefix":"10.1017","volume":"22","author":[{"given":"RAZIEH","family":"RAHIMI","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"AZADEH","family":"SHAKERY","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"JAVID","family":"DADASHKARIMI","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"MOZHDEH","family":"ARIANNEZHAD","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"MOSTAFA","family":"DEHGHANI","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"HOSSEIN NASR","family":"ESFAHANI","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"56","published-online":{"date-parts":[[2016,6,15]]},"reference":[{"key":"S1351324916000164_ref008","doi-asserted-by":"crossref","unstructured":"Braschler M. and Sch\u00e4uble P. 1998. Multilingual information retrieval based on document alignment techniques. 
In Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries, ECDL'98, London, UK: Springer-Verlag, pp. 183\u2013197.","DOI":"10.1007\/3-540-49653-X_12"},{"key":"S1351324916000164_ref033","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2015.08.001"},{"key":"S1351324916000164_ref025","doi-asserted-by":"crossref","unstructured":"Munteanu D. S. and Marcu D. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, Stroudsburg, PA, USA. Association for Computational Linguistics, pp. 81\u201388.","DOI":"10.3115\/1220175.1220186"},{"key":"S1351324916000164_ref024","doi-asserted-by":"publisher","DOI":"10.1162\/089120105775299168"},{"key":"S1351324916000164_ref017","doi-asserted-by":"crossref","unstructured":"Gaussier E. , Renders J.-M. , Matveeva I. , Goutte C. , and D\u00e9jean H. 2004. A geometric view on bilingual lexicon extraction from comparable corpora. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ACL'04, Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 527\u2013534.","DOI":"10.3115\/1218955.1219022"},{"key":"S1351324916000164_ref041","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-008-9058-8"},{"key":"S1351324916000164_ref016","doi-asserted-by":"crossref","unstructured":"Garera N. , Callison-Burch C. , and Yarowsky D. 2009. Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, CoNLL'09, Stroudsburg, PA, USA. Association for Computational Linguistics, pp. 129\u2013137.","DOI":"10.3115\/1596374.1596397"},{"key":"S1351324916000164_ref038","unstructured":"Smith J. R. , Quirk C. and Toutanova K. 2010. 
Extracting parallel sentences from comparable corpora using document level alignment. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT'10, Stroudsburg, PA, USA. Association for Computational Linguistics, pp. 403\u2013411."},{"key":"S1351324916000164_ref032","doi-asserted-by":"crossref","unstructured":"Rahimi R. and Shakery A. 2013. A language modeling approach for extracting translation knowledge from comparable corpora. In Proceedings of the 35th European conference on Advances in Information Retrieval, ECIR'13, Berlin, Heidelberg. Springer-Verlag, pp. 606\u2013617.","DOI":"10.1007\/978-3-642-36973-5_51"},{"key":"S1351324916000164_ref027","doi-asserted-by":"publisher","DOI":"10.1162\/089120103321337421"},{"key":"S1351324916000164_ref045","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2004.06.009"},{"key":"S1351324916000164_ref043","doi-asserted-by":"crossref","unstructured":"Ture F. , Elsayed T. and Lin J. 2011. No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'11, New York, NY, USA, ACM, pp. 943\u2013952.","DOI":"10.1145\/2009916.2010042"},{"key":"S1351324916000164_ref022","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-14267-3"},{"key":"S1351324916000164_ref023","doi-asserted-by":"crossref","unstructured":"McNamee P. , Mayfield J. , and Nicholas C. 2009. Translation corpus source and size in bilingual retrieval. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, NAACL-Short'09 Stroudsburg, PA, USA. Association for Computational Linguistics, pp. 
25\u201328.","DOI":"10.3115\/1620853.1620862"},{"key":"S1351324916000164_ref021","doi-asserted-by":"crossref","unstructured":"Joachims T. 2002. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'02, New York, NY, USA. ACM, pp. 133\u2013142.","DOI":"10.1145\/775047.775067"},{"key":"S1351324916000164_ref018","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2013.10.002"},{"key":"S1351324916000164_ref007","unstructured":"Azarbonyad H. , Shakery A. and Faili H. 2012. Using learning to rank approach for parallel corpora based cross-language information retrieval. In Proceedings of 20th European Conference on Artificial Intelligence (ECAI), Montpellier, France, pp. 79\u201384."},{"key":"S1351324916000164_ref046","doi-asserted-by":"crossref","unstructured":"Zhai C. and Lafferty J. 2001. Model-based feedback in the language modeling approach to information retrieval. In Proceedings of the 10th International Conference on Information and Knowledge Management, CIKM'01, New York, NY, USA, ACM, pp. 403\u2013410.","DOI":"10.1145\/502585.502654"},{"key":"S1351324916000164_ref026","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-031-02138-1","volume-title":"Cross-Language Information Retrieval","author":"Nie","year":"2010"},{"key":"S1351324916000164_ref005","volume-title":"The Ninth International Conference on Language Resources and Evaluation (LREC'14)","author":"Aker","year":"2014"},{"key":"S1351324916000164_ref014","doi-asserted-by":"crossref","unstructured":"Finkel J. R. , Grenager T. and Manning C. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL'05, Stroudsburg, PA, USA. Association for Computational Linguistics, pp. 363\u2013370.","DOI":"10.3115\/1219840.1219885"},{"key":"S1351324916000164_ref037","unstructured":"Skadi\u0146a I. 
, Aker A. , Mastropavlos N. , Su F. , Tufi\u015f D. , Verlic M. , Vasi\u013cjevs A. , Babych B. , Clough P. , Gaizauskas R. , Glaros N. , Paramita M. L. , and Pinnis M. (2012). Collecting and using comparable corpora for statistical machine translation. In N. C. C. Chair , K. Choukri , T. Declerck , M. U. Doan , B. Maegaard , J. Mariani , A. Moreno , J. Odijk , and S. Piperidis (eds.), Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey: European Language Resources Association (ELRA)."},{"key":"S1351324916000164_ref029","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-20128-8_5"},{"key":"S1351324916000164_ref034","doi-asserted-by":"crossref","unstructured":"Rahimi Z. and Shakery A. 2011. Topic based creation of a Persian-English comparable corpus. In Proceedings of the 7th Asia Conference on Information Retrieval Technology, AIRS'11, Berlin, Heidelberg, Springer-Verlag, pp. 458\u2013469.","DOI":"10.1007\/978-3-642-25631-8_41"},{"key":"S1351324916000164_ref009","doi-asserted-by":"publisher","DOI":"10.1007\/BF00994018"},{"key":"S1351324916000164_ref030","unstructured":"Pilevar M. T. , Faili H. and Pilevar A. H. 2011. TEP: Tehran English-Persian parallel corpus. In Proceedings of the 12th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part II, CICLing'11, Berlin, Heidelberg, Springer-Verlag, pp. 68\u201379."},{"key":"S1351324916000164_ref006","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2009.05.002"},{"key":"S1351324916000164_ref013","unstructured":"Ferro N. and Peters C. 2009. Clef 2009 ad hoc track overview: TEL and Persian tasks. In Proceedings of the 10th Cross-Language Evaluation Forum Conference on Multilingual Information Access Evaluation: Text Retrieval Experiments, CLEF'09, Berlin, Heidelberg, Springer-Verlag, pp. 13\u201335."},{"key":"S1351324916000164_ref001","unstructured":"Abdul-Rauf S. , and Schwenk H. 2009. 
On the use of comparable corpora to improve SMT performance. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL'09, Stroudsburg, PA, USA, Association for Computational Linguistics, pp. 16\u201323."},{"key":"S1351324916000164_ref003","unstructured":"Aker A. , Kanoulas E. and Gaizauskas R. 2012. A light way to collect comparable corpora from the web. In N. C. C. Chair , K. Choukri , T. Declerck , M. U. Doan , B. Maegaard , J. Mariani , A. Moreno , J. Odijk , and S. Piperidis (eds.), Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA)."},{"key":"S1351324916000164_ref019","doi-asserted-by":"crossref","unstructured":"Hashemi H. B. , Shakery A. and Faili H. 2010. Creating a Persian-English comparable corpus. In Proceedings of the 2010 International Conference on Multilingual and Multimodal Information Access Evaluation: Cross-Language Evaluation Forum, CLEF'10, Berlin, Heidelberg, Springer-Verlag, pp. 27\u201339.","DOI":"10.1007\/978-3-642-15998-5_5"},{"key":"S1351324916000164_ref010","unstructured":"Dadashkarimi J. , Shakery A. and Heshaam F. 2014. A probabilistic translation method for dictionary-based cross-lingual information retrieval in agglutinative languages. In Proceedings of the 3th Conference on Computational Linguistic, CLConference'14, Tehran, Iran."},{"key":"S1351324916000164_ref031","unstructured":"Pomik\u00e1lek J. 2011. Removing boilerplate and duplicate content from web corpora. PhD thesis, Masaryk university, Faculty of informatics, Brno, Czech Republic."},{"key":"S1351324916000164_ref004","unstructured":"Aker A. , Paramita M. and Gaizauskas R. 2013. Extracting bilingual terminologies from comparable corpora. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 
Sofia, Bulgaria: Association for Computational Linguistics, pp. 402\u2013411."},{"key":"S1351324916000164_ref036","doi-asserted-by":"crossref","unstructured":"Sheridan P. and Ballerini J. P. 1996. Experiments in multilingual information retrieval using the spider system. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'96, New York, NY, USA. ACM, pp. 58\u201365.","DOI":"10.1145\/243199.243213"},{"key":"S1351324916000164_ref012","doi-asserted-by":"publisher","DOI":"10.1145\/1961209.1961210"},{"key":"S1351324916000164_ref002","doi-asserted-by":"crossref","unstructured":"Agirre E. , Di Nunzio G. M. , Ferro N. , Mandl T. , and Peters C. 2009. Clef 2008: ad hoc track overview. In Proceedings of the 9th Cross-language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access, CLEF'08, Berlin, Heidelberg, Springer-Verlag, pp. 15\u201337.","DOI":"10.1007\/978-3-642-04447-2_2"},{"key":"S1351324916000164_ref040","doi-asserted-by":"publisher","DOI":"10.1145\/1198296.1198300"},{"key":"S1351324916000164_ref028","doi-asserted-by":"crossref","unstructured":"Pal S. , Pakray P. and Naskar K. S. 2014. Automatic Building and Using Parallel Resources for SMT from Comparable Corpora. In Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra) at EACL. Association for Computational Linguistics, pp. 48\u201357.","DOI":"10.3115\/v1\/W14-1009"},{"key":"S1351324916000164_ref035","doi-asserted-by":"publisher","DOI":"10.1016\/j.sbspro.2013.10.620"},{"key":"S1351324916000164_ref044","unstructured":"Vuli\u0107 I. and Moens M.-F. 2012. Detecting highly confident word translations from comparable corpora without any prior knowledge. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL'12, Stroudsburg, PA, USA. Association for Computational Linguistics, pp. 
449\u2013459."},{"key":"S1351324916000164_ref042","doi-asserted-by":"crossref","unstructured":"Tao T. and Zhai C. 2005. Mining comparable bilingual text corpora for cross-language information integration. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD'05, New York, NY, USA: ACM, pp. 691\u2013696.","DOI":"10.1145\/1081870.1081958"},{"key":"S1351324916000164_ref020","unstructured":"Huang D. , Zhao L. , Li L. and Yu H. 2010. Mining large-scale comparable corpora from Chinese-English news collections. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, Beijing, China. COLING'10, Stroudsburg, PA, USA. Association for Computational Linguistics, pp. 472\u2013480."},{"key":"S1351324916000164_ref039","unstructured":"Str\u00f6tgen J. , Gertz M. and Junghans C. 2011. An event-centric model for multilingual document similarity. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'11, New York, NY, USA, ACM, pp. 953\u2013962."},{"key":"S1351324916000164_ref015","doi-asserted-by":"crossref","unstructured":"Fung P. and Cheung P. 2004. Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. In Proceedings of the 20th International Conference on Computational Linguistics, COLING'04, Stroudsburg, PA, USA. Association for Computational Linguistics.","DOI":"10.3115\/1220355.1220506"},{"key":"S1351324916000164_ref011","doi-asserted-by":"crossref","unstructured":"Darwish K. and Oard D. W. 2003. Probabilistic structured query methods. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'03, New York, NY, USA. ACM, pp. 
338\u2013344.","DOI":"10.1145\/860435.860497"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324916000164","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,7,1]],"date-time":"2022-07-01T19:42:40Z","timestamp":1656704560000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324916000164\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,6,15]]},"references-count":46,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2016,7]]}},"alternative-id":["S1351324916000164"],"URL":"https:\/\/doi.org\/10.1017\/s1351324916000164","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,6,15]]}}}