{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,2]],"date-time":"2026-04-02T15:00:15Z","timestamp":1775142015010,"version":"3.50.1"},"reference-count":62,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2025,4,28]],"date-time":"2025-04-28T00:00:00Z","timestamp":1745798400000},"content-version":"vor","delay-in-days":117,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,4,17]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Large language models segment many words into multiple tokens, and companies that make those models claim that meaningful subword tokens are essential. To investigate whether subword tokens bear meaning, we segmented tens of thousands of words from each of 41 languages according to three generations of GPT tokenizers. We found that words sharing tokens are more semantically similar than expected by chance or expected from length alone, that tokens capture morphological information even when they don\u2019t look like morphemes, and that tokens capture more information than is explained by morphology. In languages that use a script other than the Latin alphabet, GPT-4 tokens are uninformative, but GPT-4o has improved this situation. These results suggest that comparing tokens to morphemes overlooks the wider variety of semantic information available in word form and that standard tokenization methods successfully capture much of that information.<\/jats:p>","DOI":"10.1162\/tacl_a_00747","type":"journal-article","created":{"date-parts":[[2025,4,28]],"date-time":"2025-04-28T19:04:57Z","timestamp":1745867097000},"page":"408-423","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":1,"title":["How Much Semantic Information is Available in Large Language Model Tokens?"],"prefix":"10.1162","volume":"13","author":[{"given":"David A.","family":"Haslett","sequence":"first","affiliation":[{"name":"The Hong Kong University of Science and Technology, Hong Kong, China. haslett@ust.hk"}]},{"given":"Zhenguang G.","family":"Cai","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong, China. zhenguangcai@cuhk.edu.hk"}]}],"member":"281","published-online":{"date-parts":[[2025,4,17]]},"reference":[{"key":"2025042815045260500_bib1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.258","article-title":"Mega: Multilingual evaluation of generative ai","author":"Ahuja","year":"2023","journal-title":"arXiv preprint arXiv:2303.12528"},{"key":"2025042815045260500_bib2","article-title":"Polyglot: Distributed word representations for multilingual nlp","author":"Al-Rfou","year":"2013","journal-title":"arXiv preprint arXiv:1307.1662"},{"issue":"1","key":"2025042815045260500_bib3","doi-asserted-by":"publisher","first-page":"41","DOI":"10.1006\/jmla.1998.2607","article-title":"Frequency effects and the representational status of regular inflections","volume":"40","author":"Alegre","year":"1999","journal-title":"Journal of Memory and Language"},{"key":"2025042815045260500_bib4","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.sigmorphon-1.4","article-title":"Different tokenization schemes lead to comparable performance in spanish number agreement","author":"Arnett","year":"2024","journal-title":"arXiv preprint arXiv:2403.13754"},{"key":"2025042815045260500_bib5","volume-title":"Word Frequency Distributions","author":"Baayen","year":"2012"},{"issue":"1","key":"2025042815045260500_bib6","doi-asserted-by":"publisher","first-page":"4895891","DOI":"10.1155\/2019\/4895891","article-title":"The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de) composition but in linear discriminative learning","volume":"2019","author":"Baayen","year":"2019","journal-title":"Complexity"},{"key":"2025042815045260500_bib7","article-title":"Evaluating subword tokenization: Alien subword composition and oov generalization challenge","author":"Batsuren","year":"2024","journal-title":"arXiv preprint arXiv:2404.13292"},{"key":"2025042815045260500_bib8","volume-title":"Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit","author":"Bird","year":"2009"},{"key":"2025042815045260500_bib9","doi-asserted-by":"publisher","first-page":"135","DOI":"10.1162\/tacl_a_00051","article-title":"Enriching word vectors with subword information","volume":"5","author":"Bojanowski","year":"2017","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2025042815045260500_bib10","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.414","article-title":"Byte pair encoding is suboptimal for language model pretraining","author":"Bostrom","year":"2020","journal-title":"arXiv preprint arXiv:2004.03720"},{"issue":"4","key":"2025042815045260500_bib11","doi-asserted-by":"publisher","first-page":"977","DOI":"10.3758\/BRM.41.4.977","article-title":"Moving beyond ku\u010dera and francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for american english","volume":"41","author":"Brysbaert","year":"2009","journal-title":"Behavior Research Methods"},{"key":"2025042815045260500_bib12","article-title":"Sparks of artificial general intelligence: Early experiments with gpt-4","author":"Bubeck","year":"2023","journal-title":"arXiv preprint arXiv: 2303.12712"},{"key":"2025042815045260500_bib13","doi-asserted-by":"publisher","DOI":"10.31234\/osf.io\/b45ys","article-title":"Meaning modulations and stability in large language models: An analysis of bert embeddings for psycholinguistic research","author":"Cassani","year":"2023","journal-title":"osf.io\/preprints\/psyarxiv\/b45ys"},{"issue":"3","key":"2025042815045260500_bib14","doi-asserted-by":"publisher","first-page":"375","DOI":"10.1017\/S1351324920000145","article-title":"Emerging trends: Subwords, seriously?","volume":"26","author":"Church","year":"2020","journal-title":"Natural Language Engineering"},{"key":"2025042815045260500_bib15","doi-asserted-by":"publisher","DOI":"10.3115\/1118647.1118650","article-title":"Unsupervised discovery of morphemes","author":"Creutz","year":"2002","journal-title":"arXiv preprint cs\/0205057"},{"key":"2025042815045260500_bib16","doi-asserted-by":"publisher","DOI":"10.1515\/ling.1985.23.5.723","article-title":"The suffixing preference: A processing explanation","author":"Cutler","year":"1985","journal-title":"Linguistics"},{"issue":"8","key":"2025042815045260500_bib17","doi-asserted-by":"publisher","first-page":"2149","DOI":"10.1111\/cogs.12453","article-title":"Wordform similarity increases with semantic similarity: An analysis of 100 languages","volume":"41","author":"Dautriche","year":"2017","journal-title":"Cognitive Science"},{"key":"2025042815045260500_bib18","article-title":"Bert: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin","year":"2018","journal-title":"arXiv preprint arXiv:1810.04805"},{"issue":"10","key":"2025042815045260500_bib19","doi-asserted-by":"publisher","first-page":"603","DOI":"10.1016\/j.tics.2015.07.013","article-title":"Arbitrariness, iconicity, and systematicity in language","volume":"19","author":"Dingemanse","year":"2015","journal-title":"Trends in Cognitive Sciences"},{"key":"2025042815045260500_bib20","doi-asserted-by":"publisher","first-page":"17","DOI":"10.3765\/sp.9.17","article-title":"What do you know about an alligator when you know the company it keeps?","volume":"9","author":"Erk","year":"2016","journal-title":"Semantics and Pragmatics"},{"issue":"4","key":"2025042815045260500_bib21","doi-asserted-by":"publisher","first-page":"631","DOI":"10.1162\/coli_a_00013","article-title":"An asymptotic model for the english hapax\/vocabulary ratio","volume":"36","author":"Fan","year":"2010","journal-title":"Computational Linguistics"},{"key":"2025042815045260500_bib22","article-title":"A synopsis of linguistic theory 1930\u20131955","author":"Firth","year":"1957","journal-title":"Studies in Linguistic Analysis, Special Volume\/Blackwell"},{"issue":"2","key":"2025042815045260500_bib23","first-page":"23","article-title":"A new algorithm for data compression","volume":"12","author":"Gage","year":"1994","journal-title":"C Users Journal"},{"key":"2025042815045260500_bib24","article-title":"Learning word vectors for 157 languages","author":"Grave","year":"2018","journal-title":"arXiv preprint arXiv:1802.06893"},{"issue":"4","key":"2025042815045260500_bib25","doi-asserted-by":"publisher","first-page":"930","DOI":"10.3758\/s13428-014-0529-0","article-title":"Lsafun-An R package for computations based on latent semantic analysis","volume":"47","author":"G\u00fcnther","year":"2015","journal-title":"Behavior Research Methods"},{"key":"2025042815045260500_bib26","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.bionlp-1.32","article-title":"Biomedical language models are robust to sub-optimal tokenization","author":"Gutierrez","year":"2023","journal-title":"arXiv preprint arXiv:2306.17649"},{"issue":"2\u20133","key":"2025042815045260500_bib27","doi-asserted-by":"publisher","first-page":"146","DOI":"10.1080\/00437956.1954.11659520","article-title":"Distributional structure","volume":"10","author":"Harris","year":"1954","journal-title":"WORD"},{"issue":"8","key":"2025042815045260500_bib28","doi-asserted-by":"publisher","first-page":"2359","DOI":"10.1037\/xge0001409","article-title":"Similar-sounding words flesh out fuzzy meanings.","volume":"152","author":"Haslett","year":"2023","journal-title":"Journal of Experimental Psychology: General"},{"issue":"2","key":"2025042815045260500_bib29","doi-asserted-by":"publisher","first-page":"627","DOI":"10.3758\/s13423-023-02395-y","article-title":"Systematic mappings of sound to meaning: A theoretical review","volume":"31","author":"Haslett","year":"2024","journal-title":"Psychonomic Bulletin & Review"},{"key":"2025042815045260500_bib30","article-title":"Measuring massive multitask language understanding","author":"Hendrycks","year":"2020","journal-title":"arXiv preprint arXiv:2009.03300"},{"issue":"4","key":"2025042815045260500_bib31","doi-asserted-by":"publisher","first-page":"665","DOI":"10.1162\/COLI_a_00237","article-title":"Simlex-999: Evaluating semantic models with (genuine) similarity estimation","volume":"41","author":"Hill","year":"2015","journal-title":"Computational Linguistics"},{"key":"2025042815045260500_bib32","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.279","article-title":"Superbizarre is not superb: Derivational morphology improves bert\u2019s interpretation of complex words","author":"Hofmann","year":"2021","journal-title":"arXiv preprint arXiv:2101.00403"},{"key":"2025042815045260500_bib33","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-short.43","article-title":"An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers","volume-title":"Association for Computational Linguistics","author":"Hofmann","year":"2022"},{"key":"2025042815045260500_bib34","article-title":"Morphpiece: Moving away from statistical language representation","author":"Jabbar","year":"2023","journal-title":"arXiv preprint arXiv:2307.07262"},{"key":"2025042815045260500_bib35","article-title":"From tokens to words: On the inner lexicon of llms","author":"Kaplan","year":"2024","journal-title":"arXiv preprint arXiv:2410.05864"},{"key":"2025042815045260500_bib36","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.naacl-main.179","article-title":"What do tokens know about their characters and how do they know it?","author":"Kaushal","year":"2022","journal-title":"arXiv preprint arXiv:2206.02608"},{"key":"2025042815045260500_bib37","first-page":"78","article-title":"Semi-supervised learning of concatenative morphology","volume-title":"Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology","author":"Kohonen","year":"2010"},{"key":"2025042815045260500_bib38","doi-asserted-by":"publisher","DOI":"10.31234\/osf.io\/2bazx","article-title":"Mouse-mole-vole: The inconspicuous benefit of phonology during retrieval from semantic memory","volume-title":"Proceedings of the Annual Meeting of the Cognitive Science Society","author":"Kumar","year":"2022"},{"key":"2025042815045260500_bib39","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.findings-acl.78","article-title":"Bpe vs. morphological segmentation: A case study on machine translation of four polysynthetic languages","author":"Mager","year":"2022","journal-title":"arXiv preprint arXiv:2203.08954"},{"issue":"8","key":"2025042815045260500_bib40","doi-asserted-by":"publisher","first-page":"1571","DOI":"10.1080\/17470218.2014.959709","article-title":"Semantic transparency in free stems: The effect of orthography-semantics consistency on word recognition","volume":"68","author":"Marelli","year":"2015","journal-title":"Quarterly Journal of Experimental Psychology"},{"key":"2025042815045260500_bib41","article-title":"Between words and characters: A brief history of open-vocabulary modeling and tokenization in nlp","author":"Mielke","year":"2021","journal-title":"arXiv preprint arXiv:2112.10508"},{"key":"2025042815045260500_bib42","article-title":"Distributed representations of words and phrases and their compositionality","volume":"26","author":"Mikolov","year":"2013","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025042815045260500_bib43","first-page":"984","article-title":"Word embedding-based antonym detection using thesauri and distributional information","volume-title":"Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Ono","year":"2015"},{"key":"2025042815045260500_bib44","article-title":"Gpt-4 technical report","author":"OpenAI","year":"2023","journal-title":"arXiv preprint arXiv:2303.08774"},{"key":"2025042815045260500_bib45","unstructured":"OpenAI. 2024. Hello gpt-4o. https:\/\/openai.com\/index\/hello-gpt-4o."},{"key":"2025042815045260500_bib46","first-page":"27730","article-title":"Training language models to follow instructions with human feedback","volume":"35","author":"Ouyang","year":"2022","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025042815045260500_bib47","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.aacl-main.17","article-title":"An empirical study of tokenization strategies for various korean nlp tasks","author":"Park","year":"2020","journal-title":"arXiv preprint arXiv:2010.02534"},{"key":"2025042815045260500_bib48","first-page":"2825","article-title":"Scikit-learn: Machine learning in python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"Journal of Machine Learning Research"},{"key":"2025042815045260500_bib49","article-title":"Language model tokenizers introduce unfairness between languages","volume":"36","author":"Petrov","year":"2024","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025042815045260500_bib50","article-title":"Meaning without reference in large language models","author":"Piantadosi","year":"2022","journal-title":"arXiv preprint arXiv:2208.02957"},{"key":"2025042815045260500_bib51","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1493","article-title":"How multilingual is multilingual bert","author":"Pires","year":"2019","journal-title":"arXiv preprint arXiv:1906.01502"},{"issue":"8","key":"2025042815045260500_bib52","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford","year":"2019","journal-title":"OpenAI blog"},{"issue":"8","key":"2025042815045260500_bib53","doi-asserted-by":"publisher","first-page":"2890","DOI":"10.1111\/cogs.12690","article-title":"Modeling the structure and dynamics of semantic processing","volume":"42","author":"Rotaru","year":"2018","journal-title":"Cognitive Science"},{"key":"2025042815045260500_bib54","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1099","article-title":"The effects of data size and frequency range on distributional semantic models","author":"Sahlgren","year":"2016","journal-title":"arXiv preprint arXiv:1609.08293"},{"key":"2025042815045260500_bib55","doi-asserted-by":"publisher","first-page":"8766","DOI":"10.1609\/aaai.v34i05.6403","article-title":"Rare words: A major problem for contextualized embeddings and how to fix it by attentive mimicking","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Schick","year":"2020"},{"key":"2025042815045260500_bib56","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1162","article-title":"Neural machine translation of rare words with subword units","author":"Sennrich","year":"2015","journal-title":"arXiv preprint arXiv:1508.07909"},{"key":"2025042815045260500_bib57","article-title":"rspeer\/wordfreq: v3. 0","author":"Speer","year":"2022","journal-title":"Version v3. 0.2. Sept"},{"issue":"7","key":"2025042815045260500_bib58","doi-asserted-by":"publisher","first-page":"1317","DOI":"10.1111\/j.1551-6709.2009.01065.x","article-title":"Relationships between language structure and language learning: The suffixing preference and grammatical categorization","volume":"33","author":"St. Clair","year":"2009","journal-title":"Cognitive Science"},{"issue":"4","key":"2025042815045260500_bib59","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3578707","article-title":"Impact of tokenization on language models: An analysis for turkish","volume":"22","author":"Toraman","year":"2023","journal-title":"ACM Transactions on Asian and Low-Resource Language Information Processing"},{"key":"2025042815045260500_bib60","article-title":"Morfessor 2.0: Python implementation and extensions for morfessor baseline","author":"Virpioja","year":"2013"},{"issue":"4","key":"2025042815045260500_bib61","doi-asserted-by":"publisher","first-page":"847","DOI":"10.1162\/coli_a_00391","article-title":"Multi-SimLex: A large-scale evaluation of multilingual and crosslingual lexical semantic similarity","volume":"46","author":"Vuli\u0107","year":"2020","journal-title":"Computational Linguistics"},{"key":"2025042815045260500_bib62","article-title":"Google\u2019s neural machine translation system: Bridging the gap between human and machine translation","author":"Yonghui","year":"2016","journal-title":"arXiv preprint arXiv:1609.08144"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00747\/2514611\/tacl_a_00747.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00747\/2514611\/tacl_a_00747.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,4,28]],"date-time":"2025-04-28T19:05:02Z","timestamp":1745867102000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00747\/128941\/How-Much-Semantic-Information-is-Available-in"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025]]},"references-count":62,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00747","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025]]},"published":{"date-parts":[[2025]]}}}