{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T02:29:28Z","timestamp":1777429768532,"version":"3.51.4"},"reference-count":57,"publisher":"MIT Press","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computational Linguistics"],"published-print":{"date-parts":[[2016,3]]},"abstract":"<jats:p>This article presents a comparative study of a subfield of morphology learning referred to as minimally supervised morphological segmentation. In morphological segmentation, word forms are segmented into morphs, the surface forms of morphemes. In the minimally supervised data-driven learning setting, segmentation models are learned from a small number of manually annotated word forms and a large set of unannotated word forms. In addition to providing a literature survey on published methods, we present an in-depth empirical comparison on three diverse model families, including a detailed error analysis. Based on the literature survey, we conclude that the existing methodology contains substantial work on generative morph lexicon-based approaches and methods based on discriminative boundary detection. 
As for which approach has been more successful, both the previous work and the empirical evaluation presented here strongly imply that the current state of the art is yielded by the discriminative boundary detection methodology.<\/jats:p>","DOI":"10.1162\/coli_a_00243","type":"journal-article","created":{"date-parts":[[2016,2,23]],"date-time":"2016-02-23T19:58:22Z","timestamp":1456257502000},"page":"91-120","source":"Crossref","is-referenced-by-count":9,"title":["A Comparative Study of Minimally Supervised Morphological Segmentation"],"prefix":"10.1162","volume":"42","author":[{"given":"Teemu","family":"Ruokolainen","sequence":"first","affiliation":[{"name":"Aalto University"}]},{"given":"Oskar","family":"Kohonen","sequence":"additional","affiliation":[{"name":"Aalto University"}]},{"given":"Kairit","family":"Sirts","sequence":"additional","affiliation":[{"name":"Tallinn University of Technology"}]},{"given":"Stig-Arne","family":"Gr\u00f6nroos","sequence":"additional","affiliation":[{"name":"Aalto University"}]},{"given":"Mikko","family":"Kurimo","sequence":"additional","affiliation":[{"name":"Aalto University"}]},{"given":"Sami","family":"Virpioja","sequence":"additional","affiliation":[{"name":"Aalto University"}]}],"member":"281","reference":[{"key":"R1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1007541817488"},{"key":"R2","unstructured":"Chrupala, Grzegorz, Georgiana Dinu, and Josef van Genabith. 2008. Learning morphology with Morfette. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), pages 2362\u20132367, Marrakech."},{"key":"R3","doi-asserted-by":"crossref","unstructured":"Collins, Michael. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. 
In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), volume 10, pages 1\u20138, Philadelphia, PA.","DOI":"10.3115\/1118693.1118694"},{"key":"R4","unstructured":"\u00c7\u00f6ltekin, \u00c7a\u011fr\u0131. 2010. Improving successor variety for morphological segmentation. LOT Occasional Series, 16:13\u201328."},{"key":"R6","unstructured":"Cozman, Fabio Gagliardi, Ira Cohen, and Marcelo Cesar Cirelo. 2003. Semi-supervised learning of mixture models. In Proceedings of the 20th International Conference on Machine Learning (ICML-2003), pages 99\u2013106, Washington, DC."},{"key":"R7","doi-asserted-by":"publisher","DOI":"10.1145\/1322391.1322394"},{"key":"R8","doi-asserted-by":"publisher","DOI":"10.3115\/1118647.1118650"},{"key":"R10","doi-asserted-by":"publisher","DOI":"10.1145\/1187415.1187418"},{"key":"R11","doi-asserted-by":"crossref","unstructured":"de Gispert, Adri\u00e0, Sami Virpioja, Mikko Kurimo, and William Byrne. 2009. Minimum Bayes risk combination of translation hypotheses from alternative morphological decompositions. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2009), pages 73\u201376, Boulder, CO.","DOI":"10.3115\/1620853.1620876"},{"key":"R12","doi-asserted-by":"publisher","DOI":"10.2478\/pralin-2013-0017"},{"key":"R13","unstructured":"Fox-Roberts, Patrick and Edward Rosten. 2014. Unbiased generative semi-supervised learning. Journal of Machine Learning Research, 15(1):367\u2013443."},{"key":"R14","unstructured":"Goldwater, Sharon. 2006. Nonparametric Bayesian Models of Lexical Acquisition. Ph.D. thesis, Brown University."},{"key":"R15","unstructured":"Green, Spence and John DeNero. 2012. A class-based agreement model for generating accurately inflected translations. 
In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), pages 146\u2013155, Jeju Island."},{"key":"R16","unstructured":"Gr\u00f6nroos, Stig-Arne, Sami Virpioja, Peter Smit, and Mikko Kurimo. 2014. Morfessor FlatCat: An HMM-based method for unsupervised and semi-supervised learning of morphology. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 1177\u20131185, Dublin."},{"key":"R17","doi-asserted-by":"publisher","DOI":"10.1162\/COLI_a_00050"},{"key":"R18","doi-asserted-by":"publisher","DOI":"10.2307\/411036"},{"key":"R19","doi-asserted-by":"publisher","DOI":"10.1016\/j.csl.2005.07.002"},{"key":"R20","doi-asserted-by":"crossref","unstructured":"Jiao, Feng, Shaojun Wang, Chi-Hoon Lee, Russell Greiner, and Dale Schuurmans. 2006. Semi-supervised conditional random fields for improved sequence segmentation and labeling. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (COLING\/ACL 2006), pages 209\u2013216, Sydney.","DOI":"10.3115\/1220175.1220202"},{"key":"R21","doi-asserted-by":"publisher","DOI":"10.3115\/1073483.1073498"},{"key":"R22","doi-asserted-by":"publisher","DOI":"10.3115\/1626324.1626328"},{"key":"R23","unstructured":"Johnson, Mark and Katherine Demuth. 2010. Unsupervised phonemic Chinese word segmentation using Adaptor Grammars. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 528\u2013536, Beijing."},{"key":"R24","doi-asserted-by":"publisher","DOI":"10.3115\/1620754.1620800"},{"key":"R25","doi-asserted-by":"crossref","unstructured":"Johnson, Mark, Thomas L. Griffiths, and Sharon Goldwater. 2006. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. 
In Advances in Neural Information Processing Systems, pages 641\u2013648, Vancouver.","DOI":"10.7551\/mitpress\/7503.003.0085"},{"key":"R26","unstructured":"Johnson, Mark, Thomas L. Griffiths, and Sharon Goldwater. 2007. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Proceedings of Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2007), pages 139\u2013146, Rochester, NY."},{"key":"R27","doi-asserted-by":"crossref","unstructured":"Kaalep, Heiki-Jaan, Kadri Muischnek, Kristel Uiboaed, and Kaarel Veskis. 2010. The Estonian reference corpus: Its composition and morphology-aware user interface. In Proceedings of the 2010 Conference on Human Language Technologies \u2013 The Baltic Perspective: Proceedings of the Fourth International Conference Baltic (HLT 2010), pages 143\u2013146, Riga.","DOI":"10.3233\/978-1-60750-641-6-143"},{"key":"R28","unstructured":"K\u0131l\u0131\u00e7, \u00d6zkan and Cem Boz\u015fahin. 2012. Semi-supervised morpheme segmentation without morphological analysis. In Proceedings of the LREC 2012 Workshop on Language Resources and Technologies for Turkic Languages, pages 52\u201356, Istanbul."},{"key":"R29","unstructured":"Kohonen, Oskar, Sami Virpioja, and Krista Lagus. 2010. Semi-supervised learning of concatenative morphology. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology (SIGMORPHON 2010), pages 78\u201386, Uppsala."},{"key":"R30","unstructured":"Kurimo, Mikko, Sami Virpioja, and Ville Turunen. 2010. Overview and results of Morpho Challenge 2010. In Proceedings of the Morpho Challenge 2010 Workshop, pages 7\u201324, Espoo."},{"key":"R31","doi-asserted-by":"crossref","unstructured":"Kurimo, Mikko, Sami Virpioja, Ville Turunen, Graeme W. Blackwood, and William Byrne. 2009. Overview and results of Morpho Challenge 2009. 
In Working Notes for the CLEF 2009 Workshop, pages 578\u2013597, Corfu.","DOI":"10.1007\/978-3-642-15754-7_71"},{"key":"R32","unstructured":"Lafferty, John, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML 2001), pages 282\u2013289, Williamstown, MA."},{"key":"R33","unstructured":"Lee, Yoong Keok, Aria Haghighi, and Regina Barzilay. 2011. Modeling syntactic context improves morphological segmentation. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning (CoNLL 2011), pages 1\u20139, Portland, OR."},{"key":"R34","unstructured":"Lignos, Constantine. 2010. Learning from unseen data. In Proceedings of the Morpho Challenge 2010 Workshop, pages 35\u201338, Helsinki."},{"key":"R35","unstructured":"Luong, Minh-Thang, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the 17th Conference on Computational Natural Language Learning (CoNLL 2013), pages 29\u201337, Sofia."},{"key":"R36","unstructured":"Mann, G. and A. McCallum. 2008. Generalized expectation criteria for semi-supervised learning of conditional random fields. In Proceedings of the 46th Annual Meeting of Association for Computational Linguistics: Human Language Technologies (ACL HLT 2008), pages 870\u2013878, Columbus, OH."},{"key":"R37","doi-asserted-by":"crossref","unstructured":"Monson, Christian, Kristy Hollingshead, and Brian Roark. 2010. Simulating morphological analyzers with stochastic taggers for confidence estimation. 
In Multilingual Information Access Evaluation I - Text Retrieval Experiments, volume 6241 of Lecture Notes in Computer Science.","DOI":"10.1007\/978-3-642-15754-7_78"},{"key":"R38","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1095"},{"key":"R39","doi-asserted-by":"crossref","unstructured":"Neuvel, Sylvain and Sean A. Fulop. 2002. Unsupervised learning of morphology without morphemes. In Proceedings of the 6th Workshop of the ACL Special Interest Group in Computational Phonology (SIGPHON 2002), pages 31\u201340, Philadelphia, PA.","DOI":"10.3115\/1118647.1118651"},{"key":"R40","doi-asserted-by":"publisher","DOI":"10.1023\/A:1007692713085"},{"key":"R42","doi-asserted-by":"publisher","DOI":"10.3115\/1620754.1620785"},{"key":"R43","unstructured":"Qiu, Siyu, Qing Cui, Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Co-learning of word representations and morpheme representations. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 141\u2013150, Dublin."},{"key":"R45","unstructured":"Ruokolainen, Teemu, Oskar Kohonen, Sami Virpioja, and Mikko Kurimo. 2013. Supervised morphological segmentation in a low-resource learning setting using conditional random fields. In Proceedings of the 17th Conference on Computational Natural Language Learning (CoNLL 2013), pages 29\u201337, Sofia."},{"key":"R46","doi-asserted-by":"crossref","unstructured":"Ruokolainen, Teemu, Oskar Kohonen, Sami Virpioja, and Mikko Kurimo. 2014. Painless semi-supervised morphological segmentation using conditional random fields. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014), pages 84\u201389. 
Gothenburg.","DOI":"10.3115\/v1\/E14-4017"},{"key":"R47","doi-asserted-by":"publisher","DOI":"10.3115\/1073336.1073360"},{"key":"R48","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00225"},{"key":"R49","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/E14-2006"},{"key":"R50","unstructured":"Spiegler, Sebastian and Peter A. Flach. 2010. Enhanced word decomposition by calibrating the decision threshold of probabilistic models and using a model ensemble. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), pages 375\u2013383, Uppsala."},{"key":"R51","unstructured":"Stallard, David, Jacob Devlin, Michael Kayser, Yoong Keok Lee, and Regina Barzilay. 2012. Unsupervised morphology rivals supervised morphology for Arabic MT. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), pages 322\u2013327, Jeju Island."},{"key":"R52","unstructured":"Sun, Weiwei and Jia Xu. 2011. Enhancing Chinese word segmentation using unlabeled data. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 970\u2013979, Edinburgh."},{"key":"R53","unstructured":"Turian, Joseph, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), pages 384\u2013394, Uppsala."},{"key":"R54","doi-asserted-by":"publisher","DOI":"10.1145\/2036916.2036917"},{"key":"R55","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-15754-7_73"},{"key":"R56","unstructured":"Virpioja, Sami, Oskar Kohonen, and Krista Lagus. 2011. Evaluating the effect of word frequencies in a probabilistic generative model of morphology. 
In Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011), pages 230\u2013237, Riga."},{"key":"R58","unstructured":"Virpioja, Sami, Ville Turunen, Sebastian Spiegler, Oskar Kohonen, and Mikko Kurimo. 2011. Empirical comparison of evaluation methods for unsupervised learning of morphology. Traitement Automatique des Langues, 52(2):45\u201390."},{"key":"R59","unstructured":"Wang, Yang, Gholamreza Haffari, Shaojun Wang, and Greg Mori. 2009. A rate distortion approach for semi-supervised conditional random fields. In Advances in Neural Information Processing Systems (NIPS), pages 2008\u20132016, Vancouver."},{"key":"R60","unstructured":"Wang, Yiou, Jun'ichi Kazama, Yoshimasa Tsuruoka, Wenliang Chen, Yujie Zhang, and Kentaro Torisawa. 2011. Improving Chinese word segmentation and POS tagging with semi-supervised methods using large auto-analyzed data. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 309\u2013317, Chiang Mai."},{"key":"R61","doi-asserted-by":"crossref","unstructured":"Yarowsky, David and Richard Wicentowski. 2000. Minimally supervised morphological analysis by multimodal alignment. 
In Proceedings of the 38th Meeting of the Association for Computational Linguistics (ACL 2000), pages 207\u2013216, Hong Kong.","DOI":"10.3115\/1075218.1075245"},{"key":"R62","doi-asserted-by":"publisher","DOI":"10.2200\/S00196ED1V01Y200906AIM006"}],"container-title":["Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mitpressjournals.org\/doi\/pdf\/10.1162\/COLI_a_00243","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,1]],"date-time":"2025-06-01T13:49:51Z","timestamp":1748785791000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/coli\/article\/42\/1\/91-120\/1521"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,3]]},"references-count":57,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2016,3]]}},"alternative-id":["10.1162\/COLI_a_00243"],"URL":"https:\/\/doi.org\/10.1162\/coli_a_00243","relation":{},"ISSN":["0891-2017","1530-9312"],"issn-type":[{"value":"0891-2017","type":"print"},{"value":"1530-9312","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,3]]}}}