{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:13:16Z","timestamp":1750306396128,"version":"3.41.0"},"reference-count":51,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2015,12,11]],"date-time":"2015-12-11T00:00:00Z","timestamp":1449792000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"JSPS Grant-in-Aid for JSPS Fellows"},{"name":"Japan Society for the Promotion of Science (JSPS) Research Fellow"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2016,2]]},"abstract":"<jats:p>Parallel corpora are crucial for statistical machine translation (SMT); however, they are quite scarce for most language pairs and domains. As comparable corpora are far more available, many studies have been conducted to extract either parallel sentences or fragments from them for SMT. In this article, we propose an integrated system to extract both parallel sentences and fragments from comparable corpora. We first apply parallel sentence extraction to identify parallel sentences from comparable sentences. We then extract parallel fragments from the comparable sentences. Parallel sentence extraction is based on a parallel sentence candidate filter and classifier for parallel sentence identification. We improve it by proposing a novel filtering strategy and three novel feature sets for classification. Previous studies have found it difficult to accurately extract parallel fragments from comparable sentences. We propose an accurate parallel fragment extraction method that uses an alignment model to locate the parallel fragment candidates and an accurate lexicon-based filter to identify the truly parallel fragments. 
A case study on the Chinese--Japanese Wikipedia indicates that our proposed methods outperform previously proposed methods, and the parallel data extracted by our system significantly improves SMT performance.<\/jats:p>","DOI":"10.1145\/2833089","type":"journal-article","created":{"date-parts":[[2015,12,14]],"date-time":"2015-12-14T14:19:41Z","timestamp":1450102781000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora"],"prefix":"10.1145","volume":"15","author":[{"given":"Chenhui","family":"Chu","sequence":"first","affiliation":[{"name":"Japan Science and Technology Agency, Honcho, Kawaguchi-shi, Saitama, Japan"}]},{"given":"Toshiaki","family":"Nakazawa","sequence":"additional","affiliation":[{"name":"Japan Science and Technology Agency, Honcho, Kawaguchi-shi, Saitama, Japan"}]},{"given":"Sadao","family":"Kurohashi","sequence":"additional","affiliation":[{"name":"Kyoto University, Sakyo-ku, Kyoto, Japan"}]}],"member":"320","published-online":{"date-parts":[[2015,12,11]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10590-011-9114-9"},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the Workshop on NEW TEXT Wikis and Blogs and Other Dynamic Text Sources. 62--69","author":"Adafre Sisay Fissaha","year":"2006","unstructured":"Sisay Fissaha Adafre and Maarten de Rijke . 2006 . Finding similar sentences across multiple languages in Wikipedia . In Proceedings of the Workshop on NEW TEXT Wikis and Blogs and Other Dynamic Text Sources. 62--69 . Sisay Fissaha Adafre and Maarten de Rijke. 2006. Finding similar sentences across multiple languages in Wikipedia. In Proceedings of the Workshop on NEW TEXT Wikis and Blogs and Other Dynamic Text Sources. 
62--69."},{"key":"e_1_2_1_3_1","first-page":"13","volume-title":"Proceedings of the 6th International Joint Conference on Natural Language Processing. 286--292","author":"Afli Haithem","year":"2013","unstructured":"Haithem Afli , Lo\u00efc Barrault , and Holger Schwenk . 2013 . Multimodal comparable corpora as resources for extracting parallel data: Parallel phrases extraction . In Proceedings of the 6th International Joint Conference on Natural Language Processing. 286--292 . http:\/\/www.aclweb.org\/anthology\/I 13 - 1033 . Haithem Afli, Lo\u00efc Barrault, and Holger Schwenk. 2013. Multimodal comparable corpora as resources for extracting parallel data: Parallel phrases extraction. In Proceedings of the 6th International Joint Conference on Natural Language Processing. 286--292. http:\/\/www.aclweb.org\/anthology\/I13-1033."},{"key":"e_1_2_1_4_1","first-page":"12","volume-title":"Proceedings of COLING 2012: Posters. 23--32","author":"Aker Ahmet","year":"2012","unstructured":"Ahmet Aker , Yang Feng , and Robert Gaizauskas . 2012 . Automatic bilingual phrase extraction from comparable corpora . In Proceedings of COLING 2012: Posters. 23--32 . http:\/\/www.aclweb.org\/anthology\/C 12 - 2003 . Ahmet Aker, Yang Feng, and Robert Gaizauskas. 2012. Automatic bilingual phrase extraction from comparable corpora. In Proceedings of COLING 2012: Posters. 23--32. http:\/\/www.aclweb.org\/anthology\/C12-2003."},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC\u201914)","author":"Aker Ahmet","year":"2014","unstructured":"Ahmet Aker , Monica Paramita , Marcis Pinnis , and Robert Gaizauskas . 2014 . Bilingual dictionaries for all EU languages . In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC\u201914) . 26--31. http:\/\/www.lrec-conf.org\/proceedings\/lrec2014\/pdf\/803_Paper.pdf. Ahmet Aker, Monica Paramita, Marcis Pinnis, and Robert Gaizauskas. 2014. 
Bilingual dictionaries for all EU languages. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC\u201914). 26--31. http:\/\/www.lrec-conf.org\/proceedings\/lrec2014\/pdf\/803_Paper.pdf."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1963192.1963199"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.5555\/972470.972474"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1961189.1961199"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2523057.2523059"},{"key":"e_1_2_1_10_1","first-page":"13","volume-title":"Proceedings of the 6th International Joint Conference on Natural Language Processing. 1144--1150","author":"Chu Chenhui","year":"2013","unstructured":"Chenhui Chu , Toshiaki Nakazawa , and Sadao Kurohashi . 2013 b. Accurate parallel fragment extraction from quasi--comparable corpora using alignment model and translation lexicon . In Proceedings of the 6th International Joint Conference on Natural Language Processing. 1144--1150 . http:\/\/www.aclweb.org\/anthology\/I 13 - 1163 . Chenhui Chu, Toshiaki Nakazawa, and Sadao Kurohashi. 2013b. Accurate parallel fragment extraction from quasi--comparable corpora using alignment model and translation lexicon. In Proceedings of the 6th International Joint Conference on Natural Language Processing. 1144--1150. http:\/\/www.aclweb.org\/anthology\/I13-1163."},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the 9th Conference on International Language Resources and Evaluation (LREC\u201914)","author":"Chu Chenhui","year":"2014","unstructured":"Chenhui Chu , Toshiaki Nakazawa , and Sadao Kurohashi . 2014 . Constructing a Chinese--Japanese parallel corpus from Wikipedia . In Proceedings of the 9th Conference on International Language Resources and Evaluation (LREC\u201914) . 642--647. Chenhui Chu, Toshiaki Nakazawa, and Sadao Kurohashi. 2014. Constructing a Chinese--Japanese parallel corpus from Wikipedia. 
In Proceedings of the 9th Conference on International Language Resources and Evaluation (LREC\u201914). 642--647."},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT\u201910)","author":"Diep Do Thi Ngoc","year":"2010","unstructured":"Thi Ngoc Diep Do , Laurent Besacier , and Eric Castelli . 2010 . A fully unsupervised approach for mining parallel data from comparable corpora . In Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT\u201910) . Thi Ngoc Diep Do, Laurent Besacier, and Eric Castelli. 2010. A fully unsupervised approach for mining parallel data from comparable corpora. In Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT\u201910)."},{"key":"e_1_2_1_13_1","first-page":"13","volume-title":"Proceedings of the 6th International Joint Conference on Natural Language Processing. 972--976","author":"Fu Xiaoyin","year":"2013","unstructured":"Xiaoyin Fu , Wei Wei , Shixiang Lu , Zhenbiao Chen , and Bo Xu . 2013 . Phrase-based parallel fragments extraction from comparable corpora . In Proceedings of the 6th International Joint Conference on Natural Language Processing. 972--976 . http:\/\/www.aclweb.org\/anthology\/I 13 - 1129 . Xiaoyin Fu, Wei Wei, Shixiang Lu, Zhenbiao Chen, and Bo Xu. 2013. Phrase-based parallel fragments extraction from comparable corpora. In Proceedings of the 6th International Joint Conference on Natural Language Processing. 972--976. http:\/\/www.aclweb.org\/anthology\/I13-1129."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.3115\/1220355.1220506"},{"key":"e_1_2_1_15_1","volume-title":"Proceedings of the 3rd Workshop on Building and Using Comparable Corpora (BUCC\u201910) and Language Resource and Evaluation Conference (LREC\u201910)","author":"Fung Pascale","year":"2010","unstructured":"Pascale Fung , Emmanuel Prochasson , and Simon Shi . 
2010 . Trillions of comparable documents . In Proceedings of the 3rd Workshop on Building and Using Comparable Corpora (BUCC\u201910) and Language Resource and Evaluation Conference (LREC\u201910) . 26--34. Pascale Fung, Emmanuel Prochasson, and Simon Shi. 2010. Trillions of comparable documents. In Proceedings of the 3rd Workshop on Building and Using Comparable Corpora (BUCC\u201910) and Language Resource and Evaluation Conference (LREC\u201910). 26--34."},{"key":"e_1_2_1_16_1","first-page":"11","volume-title":"Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web. 44--51","author":"Gahbiche-Braham Souhir","year":"2011","unstructured":"Souhir Gahbiche-Braham , H\u00e9l\u00e8ne Bonneau-Maynard , and Fran\u00e7ois Yvon . 2011 . Two ways to use a noisy parallel news corpus for improving statistical machine translation . In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web. 44--51 . http:\/\/www.aclweb.org\/anthology\/W 11 - 1207 . Souhir Gahbiche-Braham, H\u00e9l\u00e8ne Bonneau-Maynard, and Fran\u00e7ois Yvon. 2011. Two ways to use a noisy parallel news corpus for improving statistical machine translation. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web. 44--51. http:\/\/www.aclweb.org\/anthology\/W11-1207."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/11562214_59"},{"key":"e_1_2_1_18_1","first-page":"13","volume-title":"Proceedings of the 6th Workshop on Building and Using Comparable Corpora. 69--76","author":"Gupta Rajdeep","year":"2013","unstructured":"Rajdeep Gupta , Santanu Pal , and Sivaji Bandyopadhyay . 2013 . Improving MT system using extracted parallel fragments of text from comparable corpora . In Proceedings of the 6th Workshop on Building and Using Comparable Corpora. 69--76 . http:\/\/www.aclweb.org\/anthology\/W 13 - 2509 . 
Rajdeep Gupta, Santanu Pal, and Sivaji Bandyopadhyay. 2013. Improving MT system using extracted parallel fragments of text from comparable corpora. In Proceedings of the 6th Workshop on Building and Using Comparable Corpora. 69--76. http:\/\/www.aclweb.org\/anthology\/W13-2509."},{"key":"e_1_2_1_19_1","first-page":"11","volume-title":"Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web. 61--68","author":"Hewavitharana Sanjika","year":"2011","unstructured":"Sanjika Hewavitharana and Stephan Vogel . 2011 . Extracting parallel phrases from comparable data . In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web. 61--68 . http:\/\/www.aclweb.org\/anthology\/W 11 - 1209 . Sanjika Hewavitharana and Stephan Vogel. 2011. Extracting parallel phrases from comparable data. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web. 61--68. http:\/\/www.aclweb.org\/anthology\/W11-1209."},{"key":"e_1_2_1_20_1","first-page":"10","volume-title":"Proceedings of the 23rd International Conference on Computational Linguistics (Coling\u201910)","author":"Hong Gumwon","year":"2010","unstructured":"Gumwon Hong , Chi-Ho Li , Ming Zhou , and Hae-Chang Rim . 2010 . An empirical study on Web mining of parallel data . In Proceedings of the 23rd International Conference on Computational Linguistics (Coling\u201910) . 474--482. http:\/\/www.aclweb.org\/anthology\/C 10 - 1054 . Gumwon Hong, Chi-Ho Li, Ming Zhou, and Hae-Chang Rim. 2010. An empirical study on Web mining of parallel data. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling\u201910). 474--482. http:\/\/www.aclweb.org\/anthology\/C10-1054."},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the MT Summit.","author":"Ishisaka Tatsuya","year":"2009","unstructured":"Tatsuya Ishisaka , Masao Utiyama , Eiichiro Sumita , and Kazuhide Yamamoto . 
2009 . Development of a Japanese--English software manual parallel corpus . In Proceedings of the MT Summit. Tatsuya Ishisaka, Masao Utiyama, Eiichiro Sumita, and Kazuhide Yamamoto. 2009. Development of a Japanese--English software manual parallel corpus. In Proceedings of the MT Summit."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.5555\/1690219.1690268"},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP\u201904)","author":"Koehn Philipp","year":"2004","unstructured":"Philipp Koehn . 2004 . Statistical significance tests for machine translation evaluation . In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP\u201904) .388--395. Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP\u201904).388--395."},{"volume-title":"Statistical Machine Translation","author":"Koehn Philipp","key":"e_1_2_1_24_1","unstructured":"Philipp Koehn . 2010. Statistical Machine Translation . Cambridge University Press , New York, NY . Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/1557769.1557821"},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the International Workshop on Sharable Natural Language. 22--28","author":"Kurohashi Sadao","year":"1994","unstructured":"Sadao Kurohashi , Toshihisa Nakamura , Yuji Matsumoto , and Makoto Nagao . 1994 . Improvements of Japanese morphological analyzer JUMAN . In Proceedings of the International Workshop on Sharable Natural Language. 22--28 . Sadao Kurohashi, Toshihisa Nakamura, Yuji Matsumoto, and Makoto Nagao. 1994. Improvements of Japanese morphological analyzer JUMAN. In Proceedings of the International Workshop on Sharable Natural Language. 
22--28."},{"key":"e_1_2_1_27_1","first-page":"13","volume-title":"Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 176--186","author":"Ling Wang","year":"2013","unstructured":"Wang Ling , Guang Xiang , Chris Dyer , Alan Black , and Isabel Trancoso . 2013 . Microblogs as parallel corpora . In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 176--186 . http:\/\/www.aclweb.org\/anthology\/P 13 - 1018 . Wang Ling, Guang Xiang, Chris Dyer, Alan Black, and Isabel Trancoso. 2013. Microblogs as parallel corpora. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 176--186. http:\/\/www.aclweb.org\/anthology\/P13-1018."},{"volume-title":"Proceedings of the 3rd Workshop on Building and Using Comparable Corpora (LREC\u201910)","author":"Lu Bin","key":"e_1_2_1_28_1","unstructured":"Bin Lu , Tao Jiang , Kapo Chow , and Benjamin K. Tsou . 2010. Building a large English--Chinese parallel corpus from comparable patents and its experimental application to SMT . In Proceedings of the 3rd Workshop on Building and Using Comparable Corpora (LREC\u201910) . 42--49. Bin Lu, Tao Jiang, Kapo Chow, and Benjamin K. Tsou. 2010. Building a large English--Chinese parallel corpus from comparable patents and its experimental application to SMT. In Proceedings of the 3rd Workshop on Building and Using Comparable Corpora (LREC\u201910). 
42--49."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.3115\/992730.992806"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1162\/089120105775299168"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.3115\/1220175.1220186"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/312624.312656"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.3115\/1075096.1075117"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1162\/089120103321337421"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_2_1_36_1","volume-title":"Proceedings of MT Summit XI.","author":"Quirk Chris","year":"2007","unstructured":"Chris Quirk , Raghavendra U. Udupa , and Arul Menezes . 2007 . Generative models of noisy translations with applications to parallel fragment extraction . In Proceedings of MT Summit XI. Chris Quirk, Raghavendra U. Udupa, and Arul Menezes. 2007. Generative models of noisy translations with applications to parallel fragment extraction. In Proceedings of MT Summit XI."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1162\/089120103322711578"},{"key":"e_1_2_1_38_1","first-page":"12","volume-title":"Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 538--542","author":"Riesa Jason","year":"2012","unstructured":"Jason Riesa and Daniel Marcu . 2012 . Automatic parallel fragment extraction from noisy data . In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 538--542 . http:\/\/www.aclweb.org\/anthology\/N 12 - 1061 . Jason Riesa and Daniel Marcu. 2012. Automatic parallel fragment extraction from noisy data. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 
538--542. http:\/\/www.aclweb.org\/anthology\/N12-1061."},{"key":"e_1_2_1_39_1","first-page":"10","volume-title":"Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. 403--411","author":"Smith Jason R.","year":"2010","unstructured":"Jason R. Smith , Chris Quirk , and Kristina Toutanova . 2010 . Extracting parallel sentences from comparable corpora using document level alignment . In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. 403--411 . http:\/\/www.aclweb.org\/anthology\/N 10 - 1063 . Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting parallel sentences from comparable corpora using document level alignment. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. 403--411. http:\/\/www.aclweb.org\/anthology\/N10-1063."},{"key":"e_1_2_1_40_1","volume-title":"Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing\u201913)","author":"\u015etef\u01cenescu Dan","year":"2013","unstructured":"Dan \u015etef\u01cenescu and Radu Ion . 2013 . Parallel-Wiki: A collection of parallel sentences extracted from Wikipedia . In Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing\u201913) . 117--128. Dan \u015etef\u01cenescu and Radu Ion. 2013. Parallel-Wiki: A collection of parallel sentences extracted from Wikipedia. In Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing\u201913). 
117--128."},{"key":"e_1_2_1_41_1","volume-title":"Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT\u201912)","author":"\u015etef\u01cenescu Dan","year":"2012","unstructured":"Dan \u015etef\u01cenescu , Radu Ion , and Sabine Hunsicker . 2012 . Hybrid parallel sentence mining from comparable corpora . In Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT\u201912) . Trento, Italy, 137--144. Dan \u015etef\u01cenescu, Radu Ion, and Sabine Hunsicker. 2012. Hybrid parallel sentence mining from comparable corpora. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT\u201912). Trento, Italy, 137--144."},{"key":"e_1_2_1_42_1","unstructured":"Chew Lim Tan and Makoto Nagao. 1995. Automatic alignment of Japanese--Chinese bilingual texts. IEICE Transactions on Information and Systems E78-D 1 68--76.  Chew Lim Tan and Makoto Nagao. 1995. Automatic alignment of Japanese--Chinese bilingual texts. IEICE Transactions on Information and Systems E78-D 1 68--76."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.5555\/1667583.1667653"},{"key":"e_1_2_1_44_1","first-page":"10","volume-title":"Proceedings of the 23rd International Conference on Computational Linguistics (Coling\u201910)","author":"Uszkoreit Jakob","year":"2010","unstructured":"Jakob Uszkoreit , Jay Ponte , Ashok Popat , and Moshe Dubiner . 2010 . Large scale parallel document mining for machine translation . In Proceedings of the 23rd International Conference on Computational Linguistics (Coling\u201910) . 1101--1109. http:\/\/www.aclweb.org\/anthology\/C 10 - 1124 . Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling\u201910). 1101--1109. 
http:\/\/www.aclweb.org\/anthology\/C10-1124."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.3115\/1075096.1075106"},{"key":"e_1_2_1_46_1","volume-title":"Proceedings of MT Summit XI. 475--482","author":"Utiyama Masao","year":"2007","unstructured":"Masao Utiyama and Hitoshi Isahara . 2007 . A Japanese--English patent parallel corpus . In Proceedings of MT Summit XI. 475--482 . Masao Utiyama and Hitoshi Isahara. 2007. A Japanese--English patent parallel corpus. In Proceedings of MT Summit XI. 475--482."},{"key":"e_1_2_1_47_1","first-page":"12","volume-title":"Proceedings of the 24th International Conference on Computational Linguistics (COLING\u201912)","author":"Vuli\u0107 Ivan","year":"2012","unstructured":"Ivan Vuli\u0107 and Marie-Francine Moens . 2012 . Sub-corpora sampling with an application to bilingual lexicon extraction . In Proceedings of the 24th International Conference on Computational Linguistics (COLING\u201912) . 2721--2738. http:\/\/www.aclweb.org\/anthology\/C 12 - 1166 . Ivan Vuli\u0107 and Marie-Francine Moens. 2012. Sub-corpora sampling with an application to bilingual lexicon extraction. In Proceedings of the 24th International Conference on Computational Linguistics (COLING\u201912). 2721--2738. http:\/\/www.aclweb.org\/anthology\/C12-1166."},{"key":"e_1_2_1_48_1","volume-title":"Proceedings of the 2nd International Conference on Language Resources and Evaluation.","author":"Xia Fei","year":"2000","unstructured":"Fei Xia , Martha Palmer , Nianwen Xue , Mary Ellen Okurowski , John Kovarik , Fu Dong Chiou , and Shizhe Huang . 2000 . Developing guidelines and ensuring consistency for Chinese text annotation . In Proceedings of the 2nd International Conference on Language Resources and Evaluation. Fei Xia, Martha Palmer, Nianwen Xue, Mary Ellen Okurowski, John Kovarik, Fu Dong Chiou, and Shizhe Huang. 2000. Developing guidelines and ensuring consistency for Chinese text annotation. 
In Proceedings of the 2nd International Conference on Language Resources and Evaluation."},{"key":"e_1_2_1_49_1","first-page":"13","volume-title":"Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1425--1434","author":"Zhang Jiajun","year":"2013","unstructured":"Jiajun Zhang and Chengqing Zong . 2013 . Learning a phrase-based translation model from monolingual data with application to domain adaptation . In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1425--1434 . http:\/\/www.aclweb.org\/anthology\/P 13 - 1140 . Jiajun Zhang and Chengqing Zong. 2013. Learning a phrase-based translation model from monolingual data with application to domain adaptation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1425--1434. http:\/\/www.aclweb.org\/anthology\/P13-1140."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1007\/11735106_37"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.5555\/844380.844785"}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2833089","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2833089","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T05:07:21Z","timestamp":1750223241000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2833089"}},"subtitle":["A Case Study on Chinese--Japanese 
Wikipedia"],"short-title":[],"issued":{"date-parts":[[2015,12,11]]},"references-count":51,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2016,2]]}},"alternative-id":["10.1145\/2833089"],"URL":"https:\/\/doi.org\/10.1145\/2833089","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"type":"print","value":"2375-4699"},{"type":"electronic","value":"2375-4702"}],"subject":[],"published":{"date-parts":[[2015,12,11]]},"assertion":[{"value":"2014-09-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-03-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-12-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}