{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,28]],"date-time":"2025-10-28T00:30:18Z","timestamp":1761611418649,"version":"3.41.0"},"reference-count":39,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2017,2,28]],"date-time":"2017-02-28T00:00:00Z","timestamp":1488240000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"WDAqua","award":["64279"],"award-info":[{"award-number":["64279"]}]},{"name":"COST Action IC1302"},{"name":"ERC under ALEXANDRIA","award":["H2020-MSCA-ITN-2014"],"award-info":[{"award-number":["H2020-MSCA-ITN-2014"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Web"],"published-print":{"date-parts":[[2017,2,28]]},"abstract":"<jats:p>In this article, we address the problem of text passage alignment across interlingual article pairs in Wikipedia. We develop methods that enable the identification and interlinking of text passages written in different languages and containing overlapping information. Interlingual text passage alignment can enable Wikipedia editors and readers to better understand language-specific context of entities, provide valuable insights in cultural differences, and build a basis for qualitative analysis of the articles. An important challenge in this context is the tradeoff between the granularity of the extracted text passages and the precision of the alignment. Whereas short text passages can result in more precise alignment, longer text passages can facilitate a better overview of the differences in an article pair. To better understand these aspects from the user perspective, we conduct a user study at the example of the German, Russian, and English Wikipedia and collect a user-annotated benchmark. Then we propose MultiWiki, a method that adopts an integrated approach to the text passage alignment using semantic similarity measures and greedy algorithms and achieves precise results with respect to the user-defined alignment. The MultiWiki demonstration is publicly available and currently supports four language pairs.<\/jats:p>","DOI":"10.1145\/3004296","type":"journal-article","created":{"date-parts":[[2017,4,5]],"date-time":"2017-04-05T12:46:38Z","timestamp":1491396398000},"page":"1-30","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":17,"title":["MultiWiki"],"prefix":"10.1145","volume":"11","author":[{"given":"Simon","family":"Gottschalk","sequence":"first","affiliation":[{"name":"L3S Research Center, Hannover, Germany"}]},{"given":"Elena","family":"Demidova","sequence":"additional","affiliation":[{"name":"University of Southampton, UK and L3S Research Center, Hannover, Germany"}]}],"member":"320","published-online":{"date-parts":[[2017,4,4]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL\u201906)","author":"Adafre Sisay Fissaha","year":"2006","unstructured":"Sisay Fissaha Adafre and Maarten De Rijke . 2006 . Finding similar sentences across multiple languages in wikipedia . In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL\u201906) . 62--69. Sisay Fissaha Adafre and Maarten De Rijke. 2006. Finding similar sentences across multiple languages in wikipedia. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL\u201906). 62--69."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1498759.1498813"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/S16-1081"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISDA.2010.5687287"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2207676.2208553"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2809786"},{"key":"e_1_2_1_7_1","volume-title":"Paul Clough, and Paolo Rosso.","author":"Barr\u00f3n-Cede\u00f1o Alberto","year":"2014","unstructured":"Alberto Barr\u00f3n-Cede\u00f1o , Monica Lestari Paramita , Paul Clough, and Paolo Rosso. 2014 . A comparison of approaches for measuring cross-lingual similarity of wikipedia articles. In Advances in Information Retrieval . Alberto Barr\u00f3n-Cede\u00f1o, Monica Lestari Paramita, Paul Clough, and Paolo Rosso. 2014. A comparison of approaches for measuring cross-lingual similarity of wikipedia articles. In Advances in Information Retrieval."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNN.2009.2023394"},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP\u201913)","author":"Chu Chenhui","year":"2013","unstructured":"Chenhui Chu , Toshiaki Nakazawa , and Sadao Kurohashi . 2013 . Accurate parallel fragment extraction from quasi-comparable corpora using alignment model and translation lexicon . In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP\u201913) , 1144--1150. Chenhui Chu, Toshiaki Nakazawa, and Sadao Kurohashi. 2013. Accurate parallel fragment extraction from quasi-comparable corpora using alignment model and translation lexicon. In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP\u201913), 1144--1150."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390334.1390446"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2506182.2506198"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2442076.2442077"},{"volume-title":"HLT-NAACL","author":"Faruqui Manaal","key":"e_1_2_1_13_1","unstructured":"Manaal Faruqui and Shankar Kumar . 2015. Multilingual open relation extraction using cross-lingual projection .. In HLT-NAACL , Rada Mihalcea, Joyce Yue Chai, and Anoop Sarkar (Eds.). The Association for Computational Linguistics , Stroudsburg, PA , 1351--1356. Manaal Faruqui and Shankar Kumar. 2015. Multilingual open relation extraction using cross-lingual projection.. In HLT-NAACL, Rada Mihalcea, Joyce Yue Chai, and Anoop Sarkar (Eds.). The Association for Computational Linguistics, Stroudsburg, PA, 1351--1356."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.5555\/1572433.1572438"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2911451.2911472"},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the Workshop on Indian Language and Data: Resources and Evaluation (WILDRE\u201912)","author":"Gupta Ankush","year":"2012","unstructured":"Ankush Gupta and Kiran Pala . 2012 . A generic and robust algorithm for paragraph alignment and its impact on sentence alignment in parallel corpora . In Proceedings of the Workshop on Indian Language and Data: Resources and Evaluation (WILDRE\u201912) . 18--27. Ankush Gupta and Kiran Pala. 2012. A generic and robust algorithm for paragraph alignment and its impact on sentence alignment in parallel corpora. In Proceedings of the Workshop on Indian Language and Data: Resources and Evaluation (WILDRE\u201912). 18--27."},{"key":"e_1_2_1_17_1","unstructured":"Kilem L Gwet. 2014. Handbook of inter-rater reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters. Advanced Analytics LLC.  Kilem L Gwet. 2014. Handbook of inter-rater reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters. Advanced Analytics LLC."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2615569.2615684"},{"key":"e_1_2_1_19_1","first-page":"1","article-title":"TextTiling: Segmenting text into multi-paragraph subtopic passages","volume":"23","author":"Hearst Marti A.","year":"1997","unstructured":"Marti A. Hearst . 1997 . TextTiling: Segmenting text into multi-paragraph subtopic passages . Comput. Linguist. 23 , 1 (Mar. 1997), 33--64. Marti A. Hearst. 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Comput. Linguist. 23, 1 (Mar. 1997), 33--64.","journal-title":"Comput. Linguist."},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the 10th Machine Translation Summit (MT Summit\u201905)","author":"Koehn Philipp","year":"2005","unstructured":"Philipp Koehn . 2005 . Europarl: A parallel corpus for statistical machine translation . In Proceedings of the 10th Machine Translation Summit (MT Summit\u201905) . AAMT, 79--86. Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the 10th Machine Translation Summit (MT Summit\u201905). AAMT, 79--86."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.2307\/2529310"},{"volume-title":"Introduction to Information Retrieval","author":"Manning Christopher D.","key":"e_1_2_1_22_1","unstructured":"Christopher D. Manning , Prabhakar Raghavan , and Hinrich Sch\u00fctze . 2008. Introduction to Information Retrieval . Cambridge University Press , New York, NY . Christopher D. Manning, Prabhakar Raghavan, and Hinrich Sch\u00fctze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2462932.2462960"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCEA.2010.203"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00179"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/1935826.1935887"},{"volume-title":"Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC\u201912)","author":"Paramita Monica Lestari","key":"e_1_2_1_27_1","unstructured":"Monica Lestari Paramita , Paul D. Clough , Ahmet Aker , and Robert J. Gaizauskas . 2012. Correlation between similarity measures for inter-language linked wikipedia articles . In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC\u201912) , 790--797. Monica Lestari Paramita, Paul D. Clough, Ahmet Aker, and Robert J. Gaizauskas. 2012. Correlation between similarity measures for inter-language linked wikipedia articles. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC\u201912), 790--797."},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL\u201913)","author":"Pilehvar Mohammad Taher","year":"2013","unstructured":"Mohammad Taher Pilehvar , David Jurgens , and Roberto Navigli . 2013 . Align, disambiguate and walk: A unified approach for measuring semantic similarity . In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL\u201913) , 1341--1351. Mohammad Taher Pilehvar, David Jurgens, and Roberto Navigli. 2013. Align, disambiguate and walk: A unified approach for measuring semantic similarity. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL\u201913), 1341--1351."},{"key":"e_1_2_1_29_1","volume-title":"Proceedings of the 23rd International Conference on Computational Linguistics (COLING\u201910)","author":"Potthast Martin","year":"2010","unstructured":"Martin Potthast , Benno Stein , Alberto Barr\u00f3n-Cede\u00f1o , and Paolo Rosso . 2010 . An evaluation framework for plagiarism detection . In Proceedings of the 23rd International Conference on Computational Linguistics (COLING\u201910) . Association for Computational Linguistics, Stroudsburg, PA, 997--1005. Martin Potthast, Benno Stein, Alberto Barr\u00f3n-Cede\u00f1o, and Paolo Rosso. 2010. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING\u201910). Association for Computational Linguistics, Stroudsburg, PA, 997--1005."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-25631-8_52"},{"volume-title":"Digital Methods","author":"Rogers Richard","key":"e_1_2_1_31_1","unstructured":"Richard Rogers . 2013. Digital Methods . The MIT Press , Chapter Wikipedia as Cultural Reference. Richard Rogers. 2013. Digital Methods. The MIT Press, Chapter Wikipedia as Cultural Reference."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-24027-5_42"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390334.1390432"},{"key":"e_1_2_1_34_1","volume-title":"Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT\u201910)","author":"Smith Jason R.","year":"2010","unstructured":"Jason R. Smith , Chris Quirk , and Kristina Toutanova . 2010 . Extracting parallel sentences from comparable corpora using document level alignment . In Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT\u201910) . Association for Computational Linguistics, Stroudsburg, PA, 403--411. Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting parallel sentences from comparable corpora using document level alignment. In Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT\u201910). Association for Computational Linguistics, Stroudsburg, PA, 403--411."},{"key":"e_1_2_1_35_1","volume-title":"Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC\u201906)","author":"Steinberger Ralf","year":"2006","unstructured":"Ralf Steinberger , Bruno Pouliquen , Anna Widiger , Camelia Ignat , Toma Erjavec , and Dan Tufi . 2006 . The JRC-acquis: A multilingual aligned parallel corpus with 20+ languages . In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC\u201906) . 2142--2147. Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Toma Erjavec, and Dan Tufi. 2006. The JRC-acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC\u201906). 2142--2147."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10579-012-9179-y"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/2766462.2767752"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2872427.2883077"},{"volume-title":"Global Wikipedia: International and Cross-cultural Issues in Online Collaboration","author":"Yasseri Taha","key":"e_1_2_1_39_1","unstructured":"Taha Yasseri , Anselm Spoerri , Mark Graham , and Janos Kertesz . 2014. The most controversial topics in wikipedia: A multilingual and geographical analysis . In Global Wikipedia: International and Cross-cultural Issues in Online Collaboration . Scarecrow Press . Taha Yasseri, Anselm Spoerri, Mark Graham, and Janos Kertesz. 2014. The most controversial topics in wikipedia: A multilingual and geographical analysis. In Global Wikipedia: International and Cross-cultural Issues in Online Collaboration. Scarecrow Press."}],"container-title":["ACM Transactions on the Web"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3004296","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3004296","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T19:05:12Z","timestamp":1750273512000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3004296"}},"subtitle":["Interlingual Text Passage Alignment in Wikipedia"],"short-title":[],"issued":{"date-parts":[[2017,2,28]]},"references-count":39,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2017,2,28]]}},"alternative-id":["10.1145\/3004296"],"URL":"https:\/\/doi.org\/10.1145\/3004296","relation":{},"ISSN":["1559-1131","1559-114X"],"issn-type":[{"type":"print","value":"1559-1131"},{"type":"electronic","value":"1559-114X"}],"subject":[],"published":{"date-parts":[[2017,2,28]]},"assertion":[{"value":"2016-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-04-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}