{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,6]],"date-time":"2025-11-06T20:07:04Z","timestamp":1762459624838,"version":"3.41.0"},"reference-count":74,"publisher":"MIT Press","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computational Linguistics"],"published-print":{"date-parts":[[2018,9]]},"abstract":"<jats:p>Depending on downstream applications, it is advisable to extend the notion of tokenization from low-level character-based token boundary detection to identification of meaningful and useful language units. This entails both identifying units composed of several single words that form a multiword expression (MWE), as well as splitting single-word compounds into their meaningful parts. In this article, we introduce unsupervised and knowledge-free methods for these two tasks. The main novelty of our research is based on the fact that methods are primarily based on distributional similarity, of which we use two flavors: a sparse count-based and a dense neural-based distributional semantic model. First, we introduce DRUID, which is a method for detecting MWEs. The evaluation on MWE-annotated data sets in two languages and newly extracted evaluation data sets for 32 languages shows that DRUID compares favorably over previous methods not utilizing distributional information. Second, we present SECOS, an algorithm for decompounding close compounds. In an evaluation of four dedicated decompounding data sets across four languages and on data sets extracted from Wiktionary for 14 languages, we demonstrate the superiority of our approach over unsupervised baselines, sometimes even matching the performance of previous language-specific and supervised methods. In a final experiment, we show how both decompounding and MWE information can be used in information retrieval.
Here, we obtain the best results when combining word information with MWEs and the compound parts in a bag-of-words retrieval set-up. Overall, our methodology paves the way to automatic detection of lexical units beyond standard tokenization techniques without language-specific preprocessing steps such as POS tagging.<\/jats:p>","DOI":"10.1162\/coli_a_00325","type":"journal-article","created":{"date-parts":[[2018,6,7]],"date-time":"2018-06-07T18:53:10Z","timestamp":1528397590000},"page":"483-524","source":"Crossref","is-referenced-by-count":3,"title":["Using Semantics for Granularities of Tokenization"],"prefix":"10.1162","volume":"44","author":[{"given":"Martin","family":"Riedl","sequence":"first","affiliation":[{"name":"University of Stuttgart, Institut f\u00fcr maschinelle Sprachverarbeitung."}]},{"given":"Chris","family":"Biemann","sequence":"additional","affiliation":[{"name":"University of Hamburg, Language Technology Group."}]}],"member":"281","reference":[{"key":"bib1","first-page":"2233","volume-title":"Proceedings of the Fourth International Conference on Language Resources and Evaluation","author":"Abeill\u00e9 Anne","year":"2004"},{"key":"bib2","first-page":"101","volume-title":"Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World","author":"Acosta Otavio Costa","year":"2011"},{"key":"bib3","first-page":"129","volume-title":"Proceedings of the Workshop on Phonetics and Phonology in ASR","author":"Adda-Decker Martine","year":"2000"},{"key":"bib4","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-006-0884-2"},{"key":"bib5","doi-asserted-by":"publisher","DOI":"10.3115\/1557690.1557763"},{"key":"bib6","unstructured":"Anastasiou, Dimitra. 2010. Idiom Treatment Experiments in Machine Translation. Ph.D. 
thesis, Universit\u00e4t des Saarlandes, Saarbr\u00fccken, Germany."},{"key":"bib7","first-page":"1760","volume-title":"Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008","author":"Biemann Chris","year":"2008"},{"key":"bib8","doi-asserted-by":"publisher","DOI":"10.15398\/jlm.v1i1.60"},{"key":"bib9","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-76336-9_29"},{"key":"bib10","first-page":"674","volume-title":"Proceedings of the Eighth International Conference on Language Resources and Evaluation","author":"Bouamor Dhouha","year":"2012"},{"key":"bib11","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-15998-5_13"},{"key":"bib12","first-page":"20","volume-title":"Proceedings of the 1st Deep Machine Translation Workshop","author":"Daiber Joachim","year":"2015"},{"key":"bib13","doi-asserted-by":"publisher","DOI":"10.3115\/991886.991975"},{"key":"bib14","unstructured":"Evert, Stefan. 2005. The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. 
thesis, Institut f\u00fcr maschinelle Sprachverarbeitung, University of Stuttgart, Germany."},{"key":"bib15","first-page":"3","volume-title":"Proceedings of the LREC 2008 Workshop Towards a Shared Task for Multiword Expressions","author":"Evert Stefan","year":"2008"},{"key":"bib16","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-49653-X_35"},{"key":"bib17","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4615-2710-7"},{"volume-title":"Methods in Structural Linguistics","year":"1951","author":"Harris Zellig Sabbetai","key":"bib18"},{"key":"bib19","doi-asserted-by":"publisher","DOI":"10.2495\/DATA060021"},{"key":"bib20","first-page":"420","volume-title":"Proceedings of the International Conference on Recent Advances in Natural Language Processing 2011","author":"Henrich Verena","year":"2011"},{"key":"bib21","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-1006"},{"key":"bib22","doi-asserted-by":"publisher","DOI":"10.3115\/981823.981857"},{"key":"bib23","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-78135-6_11"},{"key":"bib24","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324900000048"},{"key":"bib25","first-page":"55","volume-title":"Inquiries into Words, Constraints and Contexts","author":"Kaplan Ronald M.","year":"2005"},{"key":"bib26","doi-asserted-by":"publisher","DOI":"10.3115\/1613692.1613696"},{"key":"bib27","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btg1023"},{"key":"bib28","doi-asserted-by":"publisher","DOI":"10.3115\/1067807.1067833"},{"key":"bib29","unstructured":"Korkontzelos, Ioannis. 2010. Unsupervised Learning of Multiword Expressions. Ph.D. 
thesis, University of York, UK."},{"key":"bib30","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-2050"},{"key":"bib31","first-page":"64","volume-title":"Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics","author":"Lin Dekang","year":"1997"},{"key":"bib32","doi-asserted-by":"publisher","DOI":"10.3115\/980432.980696"},{"key":"bib33","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10888-9_6"},{"key":"bib34","first-page":"1395","volume-title":"Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies","author":"Macherey Klaus","year":"2011"},{"volume-title":"Foundations of Statistical Natural Language Processing","year":"1999","author":"Manning Christopher D.","key":"bib35"},{"key":"bib36","unstructured":"Marek, Torsten. 2006. Analysis of German compounds using weighted finite state transducers. 
Bachelor thesis, Universit\u00e4t T\u00fcbingen, Germany."},{"key":"bib37","doi-asserted-by":"publisher","DOI":"10.3115\/1708141.1708143"},{"key":"bib38","first-page":"1310","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Mikolov Tomas","year":"2013"},{"key":"bib39","first-page":"2082","volume-title":"Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers","author":"Milde Benjamin","year":"2016"},{"key":"bib40","first-page":"262","volume-title":"Evaluation of Cross-Language Information Retrieval Systems, Second Workshop of the Cross-Language Evaluation Forum","author":"Monz Christof","year":"2001"},{"key":"bib41","doi-asserted-by":"publisher","DOI":"10.1145\/321479.321481"},{"key":"bib42","doi-asserted-by":"publisher","DOI":"10.3115\/1118771.1118778"},{"key":"bib43","doi-asserted-by":"publisher","DOI":"10.1075\/term.9.2.04nak"},{"key":"bib44","first-page":"225","volume-title":"Proceedings of the European Conference on Speech Communication and Technology","author":"Ordelman Roeland","year":"2003"},{"key":"bib45","doi-asserted-by":"publisher","DOI":"10.1007\/s10579-009-9101-4"},{"key":"bib46","unstructured":"Ramisch, Carlos. 2012. A Generic and Open Framework for Multiword Expressions Treatment: From Acquisition to Applications. Ph.D. thesis, Universidade Federal Do Rio Grande do Sul, Brazil."},{"key":"bib47","first-page":"1","volume-title":"Proceedings of the Student Research Workshop of the 50th Meeting of the Association for Computational Linguistics","author":"Ramisch Carlos","year":"2012"},{"key":"bib48","first-page":"68","volume-title":"Proceedings of the Fifth Slovenian and First International Language Technologies Conference","author":"Richter Matthias","year":"2006"},{"key":"bib49","unstructured":"Riedl, Martin. 2016. Unsupervised Methods for Learning and Using Semantics of Natural Language. Ph.D. 
thesis, Technische Universit\u00e4t Darmstadt, Germany."},{"key":"bib50","doi-asserted-by":"crossref","first-page":"884","DOI":"10.18653\/v1\/D13-1089","volume-title":"Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing","author":"Riedl Martin","year":"2013"},{"key":"bib51","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1290"},{"key":"bib52","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N16-1075"},{"key":"bib53","first-page":"264","volume-title":"Proceedings of the 12th International Conference on Computational Semantics","author":"Riedl Martin","year":"2017"},{"key":"bib54","doi-asserted-by":"publisher","DOI":"10.3115\/1118984.1118988"},{"key":"bib55","first-page":"1","volume-title":"Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics","author":"Sag Ivan Andrew","year":"2001"},{"key":"bib56","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/E14-1050"},{"key":"bib57","series-title":"Synthesis Lectures on Human Language Technologies","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-031-02152-7","volume-title":"Web Corpus Construction","author":"Sch\u00e4fer Roland","year":"2013"},{"key":"bib58","first-page":"239","volume-title":"Proceedings of the 5th International Workshop on Finite-State Methods and Natural Language Processing","author":"Schiller Anne","year":"2005"},{"key":"bib59","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1036"},{"key":"bib60","first-page":"100","volume-title":"Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing","author":"Schone Patrick","year":"2001"},{"key":"bib61","first-page":"3249","volume-title":"Proceedings of the Eight International Conference on Language Resources and Evaluation","author":"Seddah Djam\u00e9","year":"2012"},{"key":"bib62","doi-asserted-by":"crossref","first-page":"146","DOI":"10.18653\/v1\/W13-4917","volume-title":"Proceedings of the Fourth Workshop on Statistical 
Parsing of Morphologically-Rich Languages","author":"Seddah Djam\u00e9","year":"2013"},{"key":"bib63","first-page":"630","volume-title":"Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers","author":"Shapiro Naomi T.","year":"2016"},{"key":"bib64","unstructured":"Shapiro, Naomi T., Joshua Falk, Kati Kiiskinen, and Arto Anttila. 2017. FinnSyll 2.0.0: A Finnish syllabifier. Technical report, Stanford University, https:\/\/pypi.python.org\/pypi\/FinnSyll."},{"key":"bib65","doi-asserted-by":"publisher","DOI":"10.4301\/S1807-17752012000200002"},{"key":"bib66","series-title":"Taalkommissie van die Suid-Afrikaanse Akademie vir Wetenskap en Kuns","volume-title":"Taalkommissiekorpus 1.1","author":"Taalkommissie","year":"2011"},{"key":"bib67","doi-asserted-by":"publisher","DOI":"10.3115\/1708141.1708149"},{"key":"bib68","unstructured":"Trim, Craig. 2013. The art of tokenization. Technical Report, IBM Developer Works. https:\/\/www.ibm.com\/developerworks\/community\/blogs\/nlp\/entry\/tokenization?lang=en_us."},{"key":"bib69","doi-asserted-by":"publisher","DOI":"10.1145\/1067268.1067272"},{"key":"bib70","doi-asserted-by":"publisher","DOI":"10.3115\/992424.992434"},{"key":"bib71","first-page":"809","volume-title":"Annual AMIA Symposium Proceedings","author":"Wermter Joachim","year":"2005"},{"key":"bib72","series-title":"NODALIDA 2005","first-page":"210","volume-title":"Proceedings of the 15th Nordic Conference of Computational Linguistics","author":"Witschel Hans Friedrich","year":"2005"},{"key":"bib73","first-page":"1056","volume-title":"Proceedings of the 9th International Conference on Language Resources and Evaluation","author":"van Zaanen Menno","year":"2014"},{"key":"bib74","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N16-1078"}],"container-title":["Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mitpressjournals.org\/doi\/pdf\/10.1162\/coli_a_00325","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,5]],"date-time":"2025-07-05T01:12:55Z","timestamp":1751677975000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/coli\/article\/44\/3\/483-524\/1599"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,9]]},"references-count":74,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2018,9]]}},"alternative-id":["10.1162\/coli_a_00325"],"URL":"https:\/\/doi.org\/10.1162\/coli_a_00325","relation":{},"ISSN":["0891-2017","1530-9312"],"issn-type":[{"type":"print","value":"0891-2017"},{"type":"electronic","value":"1530-9312"}],"subject":[],"published":{"date-parts":[[2018,9]]}}}