{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,8,19]],"date-time":"2023-08-19T23:23:31Z","timestamp":1692487411064},"reference-count":47,"publisher":"Cambridge University Press (CUP)","issue":"6","license":[{"start":{"date-parts":[[2018,10,22]],"date-time":"2018-10-22T00:00:00Z","timestamp":1540166400000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2018,11]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Research on topic segmentation has recently focused on segmenting documents by taking advantage of documents covering the same topics. In order to properly evaluate such approaches, a dataset of related documents is needed. However, existing datasets are limited in the number of related documents per domain. In addition, most of the available datasets do not consider documents from different media sources (PowerPoints, videos, etc.), which pose specific challenges to segmentation. We fill this gap with the <jats:bold>MU<\/jats:bold>ltimedia <jats:bold>SE<\/jats:bold>gmentation <jats:bold>D<\/jats:bold>ataset (MUSED), a collection of documents manually segmented, from different media sources, in seven different domains, with an average of twenty related documents per domain. In this paper, we describe the process of building MUSED. A multi-annotator study is carried out to determine if it is possible to observe agreement among human judges and characterize their disagreement patterns. In addition, we use MUSED to compare the state-of-the-art topic segmentation techniques, including the ones that take advantage of related documents. Moreover, we study the impact of having documents from different media sources in the dataset. 
To the best of our knowledge, MUSED is the first dataset that allows a straightforward evaluation of both single- and multiple-documents topic segmentation techniques, as well as to study how these behave in the presence of documents from different media sources. Results show that some techniques are, indeed, sensitive to different media sources, and also that current multi-document segmentation models do not outperform previous models, pointing to a research line that needs to be boosted.<\/jats:p>","DOI":"10.1017\/s1351324918000359","type":"journal-article","created":{"date-parts":[[2018,10,22]],"date-time":"2018-10-22T08:52:46Z","timestamp":1540198366000},"page":"921-946","source":"Crossref","is-referenced-by-count":2,"title":["MUSED: A multimedia multi-document dataset for topic segmentation"],"prefix":"10.1017","volume":"24","author":[{"given":"PEDRO","family":"MOTA","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"MAXINE","family":"ESKENAZI","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"LU\u00cdSA","family":"COHEUR","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"56","published-online":{"date-parts":[[2018,10,22]]},"reference":[{"key":"S1351324918000359_ref041","doi-asserted-by":"crossref","first-page":"899","DOI":"10.1145\/2187836.2187957","volume-title":"Proceedings of the International Conference on World Wide Web","author":"Shahaf","year":"2012"},{"key":"S1351324918000359_ref023","first-page":"284","volume-title":"Proceedings of the Conference on Empirical Methods in Natural Language Processing","author":"Kazantseva","year":"2011"},{"key":"S1351324918000359_ref034","first-page":"295","volume-title":"Proceedings of the International Conference on Application of Natural Language to Information Systems","author":"Prince","year":"2007"},{"key":"S1351324918000359_ref025","volume-title":"Content 
Analysis: An Introduction to its Methodology","author":"Krippendorff","year":"2004"},{"key":"S1351324918000359_ref008","first-page":"334","volume-title":"Proceedings of the Conference on Empirical Methods in Natural Language Processing","author":"Eisenstein","year":"2008"},{"key":"S1351324918000359_ref004","first-page":"26","volume-title":"Proceedings of the North American Chapter of the Association for Computational Linguistics","author":"Choi","year":"2000"},{"key":"S1351324918000359_ref030","first-page":"5","volume-title":"Proceedings of the International Conference on Computer Supported Education","author":"Noh","year":"2010"},{"key":"S1351324918000359_ref029","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-013-5417-9"},{"key":"S1351324918000359_ref026","first-page":"25","volume-title":"Proceedings of the International Conference on Computational Linguistics","author":"Malioutov","year":"2006"},{"key":"S1351324918000359_ref016","first-page":"33","article-title":"Texttiling: segmenting text into multi-paragraph subtopic passages","volume":"23","author":"Hearst","year":"1997","journal-title":"Computational Linguistics"},{"key":"S1351324918000359_ref011","volume-title":"The Brown Corpus: A Standard Corpus of Present-Day Edited American English","author":"Francis","year":"1979"},{"key":"S1351324918000359_ref010","first-page":"1702","volume-title":"Proceedings of the Annual Meeting of the Association for Computational Linguistics","author":"Fournier","year":"2013"},{"key":"S1351324918000359_ref033","doi-asserted-by":"publisher","DOI":"10.1162\/089120102317341756"},{"key":"S1351324918000359_ref001","unstructured":"Alemi A. , and Ginsparg P. 2015. Text segmentation based on semantic word embeddings. 
ArXiv e-prints, 1503.05543."},{"key":"S1351324918000359_ref042","doi-asserted-by":"publisher","DOI":"10.1037\/0033-2909.86.2.420"},{"key":"S1351324918000359_ref038","first-page":"209","volume-title":"Proceedings of the Association for Computational Linguistics International Conference on Multimedia","author":"Shah","year":"2014"},{"key":"S1351324918000359_ref045","volume-title":"Clinical Methods: The History, Physical, and Laboratory Examinations","author":"Walker","year":"1990"},{"key":"S1351324918000359_ref037","doi-asserted-by":"publisher","DOI":"10.1086\/266577"},{"key":"S1351324918000359_ref024","first-page":"211","volume-title":"Proceedings of the Human Language Technologies North American Chapter of the Association for Computational Linguistics","author":"Kazantseva","year":"2012"},{"key":"S1351324918000359_ref006","first-page":"190","volume-title":"Proceedings of the Human Language Technologies North American Chapter of the Association for Computational Linguistics","author":"Du","year":"2013"},{"key":"S1351324918000359_ref015","volume-title":"Cohesion in English","author":"Halliday","year":"1976"},{"key":"S1351324918000359_ref005","doi-asserted-by":"publisher","DOI":"10.1177\/001316446002000104"},{"key":"S1351324918000359_ref009","first-page":"353","volume-title":"Proceedings of the Human Language Technologies North American Chapter of the Association for Computational Linguistics","author":"Eisenstein","year":"2009"},{"key":"S1351324918000359_ref003","first-page":"543","volume-title":"Proceedings of the International Joint Conference on Natural Language Processing","author":"Bougouin","year":"2013"},{"key":"S1351324918000359_ref044","first-page":"499","volume-title":"Proceedings of the Annual Meeting on Association for Computational 
Linguistics","author":"Utiyama","year":"2001"},{"key":"S1351324918000359_ref040","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-61807-4"},{"key":"S1351324918000359_ref018","doi-asserted-by":"publisher","DOI":"10.1198\/1061860043001"},{"key":"S1351324918000359_ref046","unstructured":"Ward N. G. , Werner S. D. , Novick D. G. , Shriberg E. E. , Oertel C. , and Kawahara T. 2013. The similar segments in social speech task. In Working Notes Proceedings of the MediaEval Workshop, Barcelona, Spain."},{"key":"S1351324918000359_ref039","first-page":"217","volume-title":"Proceedings of the International Symposium on Multimedia","author":"Shah","year":"2015"},{"key":"S1351324918000359_ref036","first-page":"37","volume-title":"Proceedings of the Association for Computational Linguistics Student Research Workshop","author":"Riedl","year":"2012"},{"key":"S1351324918000359_ref035","first-page":"17","volume-title":"Proceedings of the International Conference on Computational Linguistics","author":"Purver","year":"2006"},{"key":"S1351324918000359_ref043","first-page":"199","volume-title":"Proceedings of Association for Computational Linguistics Special Interest Group on Information Retrieval","author":"Sun","year":"2007"},{"key":"S1351324918000359_ref012","doi-asserted-by":"publisher","DOI":"10.1126\/science.1136800"},{"key":"S1351324918000359_ref021","volume-title":"Discrete Multivariate Distributions","author":"Johnson","year":"1997"},{"key":"S1351324918000359_ref031","first-page":"103","article-title":"Discourse segmentation by human and automated means","volume":"23","author":"Passonneau","year":"1997","journal-title":"Computational Linguistics"},{"key":"S1351324918000359_ref047","doi-asserted-by":"publisher","DOI":"10.1016\/j.csl.2010.07.006"},{"key":"S1351324918000359_ref017","first-page":"273","volume-title":"Proceedings of the European Chapter of the Association for Computational 
Linguistics","author":"Hsueh","year":"2006"},{"key":"S1351324918000359_ref013","first-page":"562","volume-title":"Proceedings of the Annual Meeting on Association for Computational Linguistics","author":"Galley","year":"2003"},{"key":"S1351324918000359_ref020","first-page":"364","volume-title":"Proceedings of the International Conference on Acoustics, Speech, and Signal Processing Workshop","author":"Janin","year":"2004"},{"key":"S1351324918000359_ref027","first-page":"1119","volume-title":"Proceedings of the Association for Computational Linguistics International Conference on Information and Knowledge Management","author":"Minwoo","year":"2010"},{"key":"S1351324918000359_ref022","doi-asserted-by":"publisher","DOI":"10.1613\/jair.3940"},{"key":"S1351324918000359_ref007","first-page":"2232","volume-title":"Proceedings of the Association for the Advancement of Artificial Intelligence Conference","author":"Du","year":"2015"},{"key":"S1351324918000359_ref032","first-page":"1532","volume-title":"Proceedings of the Conference on Empirical Methods in Natural Language Processing","author":"Pennington","year":"2014"},{"key":"S1351324918000359_ref028","first-page":"443","volume-title":"Proceedings of the International Workshop on Semantic Multimedia","author":"Mota","year":"2016"},{"key":"S1351324918000359_ref019","first-page":"203","volume-title":"Proceedings of the International Conference on Research and Development in Information Retrieval","author":"Jameel","year":"2013"},{"key":"S1351324918000359_ref014","first-page":"362","volume-title":"Proceedings of the 3rd Linguistic Annotation Workshop","author":"Haghighi","year":"2009"},{"key":"S1351324918000359_ref002","first-page":"1","volume-title":"Proceedings of the International Conference on Technology Enhanced Education","author":"Balagopalan","year":"2012"}],"container-title":["Natural Language 
Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324918000359","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2019,4,13]],"date-time":"2019-04-13T14:55:11Z","timestamp":1555167311000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324918000359\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,10,22]]},"references-count":47,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2018,11]]}},"alternative-id":["S1351324918000359"],"URL":"https:\/\/doi.org\/10.1017\/s1351324918000359","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,10,22]]}}}