{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,15]],"date-time":"2026-07-15T07:01:54Z","timestamp":1784098914594,"version":"3.55.0"},"reference-count":76,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2022,2,11]],"date-time":"2022-02-11T00:00:00Z","timestamp":1644537600000},"content-version":"vor","delay-in-days":41,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,1,31]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model\u2019s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences\u2014without explicit tokenization or vocabulary\u2014and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. 
Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.<\/jats:p>","DOI":"10.1162\/tacl_a_00448","type":"journal-article","created":{"date-parts":[[2022,2,11]],"date-time":"2022-02-11T14:06:21Z","timestamp":1644588381000},"page":"73-91","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":58,"title":["<scp>Canine<\/scp>: Pre-training an Efficient Tokenization-Free Encoder for Language Representation"],"prefix":"10.1162","volume":"10","author":[{"given":"Jonathan H.","family":"Clark","sequence":"first","affiliation":[{"name":"Google Research, USA. jhclark@google.com"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Dan","family":"Garrette","sequence":"additional","affiliation":[{"name":"Google Research, USA. dhgarrette@google.com"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Iulia","family":"Turc","sequence":"additional","affiliation":[{"name":"Google Research, USA. iuliaturc@google.com"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"John","family":"Wieting","sequence":"additional","affiliation":[{"name":"Google Research, USA. 
jwieting@google.com"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"281","published-online":{"date-parts":[[2022,1,31]]},"reference":[{"key":"2022033118510389500_bib1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00416","article-title":"MasakhaNER: Named entity recognition for african languages","author":"Adelani","year":"2021","journal-title":"TACL"},{"key":"2022033118510389500_bib2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.19","article-title":"ETC: Encoding long and structured inputs in transformers","volume-title":"Proceedings of EMNLP","author":"Ainslie","year":"2020"},{"key":"2022033118510389500_bib3","article-title":"Contextual string embeddings for sequence labeling","volume-title":"Proceedings of COLING","author":"Akbik","year":"2018"},{"key":"2022033118510389500_bib4","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33013159","article-title":"Character-level language modeling with deeper self-attention","volume-title":"Proceedings of AAAI","author":"Al-Rfou","year":"2019"},{"key":"2022033118510389500_bib5","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1347","article-title":"Part-of-speech tagging for code-switched, transliterated texts without explicit language identification","volume-title":"Proceedings of EMNLP","author":"Ball","year":"2018"},{"key":"2022033118510389500_bib6","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1156","article-title":"Better character language modeling through morphology","volume-title":"Proceedings of ACL","author":"Blevins","year":"2019"},{"key":"2022033118510389500_bib7","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00051","article-title":"Enriching word vectors with subword information","author":"Bojanowski","year":"2017","journal-title":"TACL"},{"key":"2022033118510389500_bib8","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.414","article-title":"Byte pair encoding is suboptimal for language model 
pretraining","volume-title":"Findings of the Association for Computational Linguistics: EMNLP","author":"Bostrom","year":"2020"},{"key":"2022033118510389500_bib9","article-title":"Adaptor Grammars for learning non-concatenative morphology","volume-title":"Proceedings of EMNLP","author":"Botha","year":"2013"},{"key":"2022033118510389500_bib10","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.coling-main.609","article-title":"CharacterBERT: Reconciling ELMo and BERT for word-level open-vocabulary representations from characters","volume-title":"Proceedings of COLING","author":"Boukkouri","year":"2020"},{"key":"2022033118510389500_bib11","article-title":"Language models are few-shot learners","volume-title":"Proceedings of NeurIPS","author":"Brown","year":"2020"},{"key":"2022033118510389500_bib12","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00036","article-title":"Neural lattice language models","author":"Buckman","year":"2018","journal-title":"TACL"},{"key":"2022033118510389500_bib13","article-title":"Bridging the gap for tokenizer-free language models","author":"Choe","year":"2019","journal-title":"arXiv preprint arXiv:1908.10322"},{"key":"2022033118510389500_bib14","article-title":"Rethinking attention with performers","volume-title":"Proceedings of ICLR","author":"Choromanski","year":"2021"},{"key":"2022033118510389500_bib15","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.367","article-title":"Improving multilingual models with language-clustered vocabularies","volume-title":"Proceedings of EMNLP","author":"Chung","year":"2020"},{"key":"2022033118510389500_bib16","article-title":"Hierarchical multiscale recurrent neural networks","volume-title":"Proceedings of ICLR","author":"Chung","year":"2017"},{"key":"2022033118510389500_bib17","doi-asserted-by":"crossref","DOI":"10.1162\/tacl_a_00317","article-title":"TyDi QA: A benchmark for information-seeking question answering in typologically diverse 
languages","author":"Clark","year":"2020","journal-title":"TACL"},{"key":"2022033118510389500_bib18","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.747","article-title":"Unsupervised cross-lingual representation learning at scale","volume-title":"Proceedings of ACL","author":"Conneau","year":"2020"},{"key":"2022033118510389500_bib19","article-title":"Funnel-Transformer: Filtering out sequential redundancy for efficient language processing","volume-title":"Proceedings of NeurIPS","author":"Dai","year":"2020"},{"key":"2022033118510389500_bib20","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of NAACL","author":"Devlin","year":"2019"},{"key":"2022033118510389500_bib21","article-title":"Learning a part-of-speech tagger from two hours of annotation","volume-title":"Proceedings of NAACL","author":"Garrette","year":"2013"},{"key":"2022033118510389500_bib22","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00032","article-title":"Language modeling for morphologically rich languages: Character-aware modeling for word-level prediction","author":"Gerz","year":"2018","journal-title":"TACL"},{"key":"2022033118510389500_bib23","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N16-1155","article-title":"Multilingual language processing from bytes","volume-title":"Proceedings of NAACL","author":"Gillick","year":"2016"},{"key":"2022033118510389500_bib24","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1143","article-title":"Training hybrid language models by marginalizing over segmentations","volume-title":"Proceedings of ACL","author":"Grave","year":"2019"},{"key":"2022033118510389500_bib25","article-title":"Generating sequences with recurrent neural networks","author":"Graves","year":"2013","journal-title":"arXiv preprint arXiv:1308.0850"},{"key":"2022033118510389500_bib26","article-title":"Byte-level machine reading across morphologically varied 
languages","volume-title":"Proceedings of AAAI","author":"Hewlett","year":"2018"},{"key":"2022033118510389500_bib27","doi-asserted-by":"publisher","DOI":"10.1145\/2505515.2505665","article-title":"Learning deep structured semantic models for web search using clickthrough data","volume-title":"Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM)","author":"Huang","year":"2013"},{"key":"2022033118510389500_bib28","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7953252","article-title":"Character-level language modeling with hierarchical recurrent neural networks","volume-title":"Proceedings of ICASSP","author":"Hwang","year":"2017"},{"key":"2022033118510389500_bib29","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1506","article-title":"PRADO: Projection attention networks for document classification on-device","volume-title":"Proceedings of EMNLP","author":"Kaliamoorthi","year":"2019"},{"key":"2022033118510389500_bib30","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1137","article-title":"Learning to create and reuse words in open-vocabulary neural language modeling","volume-title":"Proceedings of ACL","author":"Kawakami","year":"2017"},{"key":"2022033118510389500_bib31","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/P19-1645","article-title":"Learning to discover, ground and use words with segmental neural language models","volume-title":"Proceedings of ACL","author":"Kawakami","year":"2019"},{"key":"2022033118510389500_bib32","doi-asserted-by":"crossref","DOI":"10.1609\/aaai.v30i1.10362","article-title":"Character-aware neural language models","volume-title":"Proceedings of AAAI","author":"Kim","year":"2016"},{"key":"2022033118510389500_bib33","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/P18-1007","article-title":"Subword regularization: Improving neural network translation models with multiple subword candidates","volume-title":"Proceedings of 
ACL","author":"Kudo","year":"2018"},{"key":"2022033118510389500_bib34","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-2012","article-title":"SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing","volume-title":"Proceedings of EMNLP: System Demonstrations","author":"Kudo","year":"2018"},{"key":"2022033118510389500_bib35","article-title":"Cross-lingual language model pretraining","volume-title":"Proceedings of NeurIPS","author":"Lample","year":"2019"},{"key":"2022033118510389500_bib36","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00067","article-title":"Fully character-level neural machine translation without explicit segmentation","author":"Lee","year":"2017","journal-title":"TACL"},{"key":"2022033118510389500_bib37","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1100","article-title":"Achieving open vocabulary neural machine translation with hybrid word-character models","volume-title":"Proceedings of ACL","author":"Luong","year":"2016"},{"key":"2022033118510389500_bib38","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1130","article-title":"Using morphological knowledge in open-vocabulary neural language models","volume-title":"Proceedings of NAACL","author":"Matthews","year":"2019"},{"key":"2022033118510389500_bib39","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33016843","article-title":"Spell once, summon anywhere: A two-level open-vocabulary language model","volume-title":"Proceedings of AAAI","author":"Mielke","year":"2019"},{"key":"2022033118510389500_bib40","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.371","article-title":"Character-level representations improve drs-based semantic parsing even in the age of BERT","volume-title":"Proceedings of EMNLP","author":"Noord","year":"2020"},{"key":"2022033118510389500_bib41","article-title":"TweetMotif: Exploratory search and topic summarization for twitter introduction and 
description","volume-title":"Proceedings of the International AAAI Conference on Web and Social Media","author":"O\u2019Connor","year":"2010"},{"key":"2022033118510389500_bib42","article-title":"Random feature attention","volume-title":"Proceedings of ICLR","author":"Peng","year":"2021"},{"key":"2022033118510389500_bib43","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1202","article-title":"Deep contextualized word representations","volume-title":"Proceedings of NAACL","author":"Peters","year":"2018"},{"key":"2022033118510389500_bib44","article-title":"English intermediate-task training improves zero-shot cross-lingual transfer too","volume-title":"Proceedings of AACL","author":"Phang","year":"2020"},{"key":"2022033118510389500_bib45","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/D17-1010","article-title":"Mimicking word embeddings using subword RNNs","volume-title":"Proceedings of EMNLP","author":"Pinter","year":"2017"},{"key":"2022033118510389500_bib46","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-4811","article-title":"Character eyes: Seeing language through character-level taggers","volume-title":"Proceedings of BlackboxNLP","author":"Pinter","year":"2019"},{"key":"2022033118510389500_bib47","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1493","article-title":"How multilingual is Multilingual BERT?","volume-title":"Proceedings of ACL","author":"Pires","year":"2019"},{"key":"2022033118510389500_bib48","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.170","article-title":"BPE-Dropout: Simple and effective subword regularization","volume-title":"Proceedings of ACL","author":"Provilkov","year":"2020"},{"key":"2022033118510389500_bib49","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1561","article-title":"Combating adversarial misspellings with robust word recognition","volume-title":"Proceedings of ACL","author":"Pruthi","year":"2019"},{"key":"2022033118510389500_bib50","article-title":"Learning to generate 
reviews and discovering sentiment","author":"Radford","year":"2017","journal-title":"arXiv preprint arXiv:1704.01444"},{"key":"2022033118510389500_bib51","unstructured":"Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Technical Report. https:\/\/www.semanticscholar.org\/paper\/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu\/9405cc0d6169988371b2755e573cc28650d14dfe"},{"key":"2022033118510389500_bib52","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","author":"Raffel","year":"2020","journal-title":"JMLR"},{"key":"2022033118510389500_bib53","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1162","article-title":"Neural machine translation of rare words with subword units","volume-title":"Proceedings of ACL","author":"Sennrich","year":"2016"},{"key":"2022033118510389500_bib54","article-title":"Adv-BERT: BERT is not robust on misspellings! 
Generating nature adversarial samples on BERT","author":"Sun","year":"2020","journal-title":"arXiv preprint arXiv:2003.04985"},{"key":"2022033118510389500_bib55","article-title":"Generating text with recurrent neural networks","volume-title":"Proceedings of ICML","author":"Sutskever","year":"2011"},{"key":"2022033118510389500_bib56","article-title":"Hash embeddings for efficient word representations","volume-title":"Proceedings of NeurIPS","author":"Svenstrup","year":"2017"},{"key":"2022033118510389500_bib57","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611972986.4","article-title":"Bloom maps","volume-title":"Proceedings of the Workshop on Analytic Algorithmics and Combinatorics (ANALCO)","author":"Talbot","year":"2008"},{"key":"2022033118510389500_bib58","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1452","article-title":"BERT rediscovers the classical NLP pipeline","volume-title":"Proceedings of ACL","author":"Tenney","year":"2019"},{"key":"2022033118510389500_bib59","doi-asserted-by":"publisher","DOI":"10.3115\/1118853.1118877","article-title":"Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition","volume-title":"Proceedings of CoNLL","author":"Tjong Kim Sang","year":"2002"},{"key":"2022033118510389500_bib60","doi-asserted-by":"publisher","DOI":"10.3115\/1119176.1119195","article-title":"Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition","volume-title":"Proceedings of NAACL","author":"Tjong Kim Sang","year":"2003"},{"key":"2022033118510389500_bib61","article-title":"Multiscale sequence modeling with a learned dictionary","author":"Merri\u00ebnboer","year":"2017","journal-title":"arXiv preprint arXiv:1707.00762"},{"key":"2022033118510389500_bib62","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/D18-1278","article-title":"What do character-level models learn about morphology? 
The case of dependency parsing","volume-title":"Proceedings of EMNLP","author":"Vania","year":"2018"},{"key":"2022033118510389500_bib63","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1184","article-title":"From characters to words to in between: Do we capture morphology?","volume-title":"Proceedings of ACL","author":"Vania","year":"2017"},{"key":"2022033118510389500_bib64","article-title":"Attention is all you need","volume-title":"Proceedings of NeurIPS","author":"Vaswani","year":"2017"},{"key":"2022033118510389500_bib65","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.40","article-title":"Multi-view subword regularization","volume-title":"Proceedings of NAACL","author":"Wang","year":"2021"},{"key":"2022033118510389500_bib66","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.240","article-title":"Extending multilingual BERT to low-resource languages","volume-title":"Findings of EMNLP","author":"Wang","year":"2020"},{"key":"2022033118510389500_bib67","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1157","article-title":"Charagram: Embedding words and sentences via character n-grams","volume-title":"Proceedings of EMNLP","author":"Wieting","year":"2016"},{"key":"2022033118510389500_bib68","article-title":"Google\u2019s neural machine translation system: Bridging the gap between human and machine translation","author":"Yonghui","year":"2016","journal-title":"arXiv preprint arXiv:1609.08144"},{"key":"2022033118510389500_bib69","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/D18-1034","article-title":"Neural cross-lingual named entity recognition with minimal resources","volume-title":"Proceedings of EMNLP","author":"Xie","year":"2018"},{"key":"2022033118510389500_bib70","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2021.naacl-main.41","article-title":"mT5: A massively multilingual pre-trained text-to-text transformer","volume-title":"Proceedings of 
NAACL","author":"Xue","year":"2021"},{"key":"2022033118510389500_bib71","article-title":"Large batch optimization for deep learning: Training BERT in 76 minutes","volume-title":"Proceedings of ICLR","author":"You","year":"2020"},{"key":"2022033118510389500_bib72","article-title":"On the strength of character language models for multilingual named entity recognition","volume-title":"Proceedings of EMNLP","author":"Xiaodong","year":"2018"},{"key":"2022033118510389500_bib73","article-title":"Big Bird: Transformers for longer sequences","volume-title":"Proceedings of NeurIPS","author":"Zaheer","year":"2020"},{"key":"2022033118510389500_bib74","article-title":"Which encoding is the best for text classification in Chinese, English, Japanese and Korean?","author":"Zhang","year":"2017","journal-title":"arXiv preprint arXiv:1708.02657v2"},{"key":"2022033118510389500_bib75","article-title":"Character-level convolutional networks for text classification","volume-title":"Proceedings of NeurIPS","author":"Zhang","year":"2015"},{"key":"2022033118510389500_bib76","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-acl.37","article-title":"AMBERT: A pre-trained language model with multi-grained tokenization","volume-title":"Findings of ACL","author":"Zhang","year":"2021"}],"container-title":["Transactions of the Association for Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00448\/1985933\/tacl_a_00448.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00448\/1985933\/tacl_a_00448.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,26]],"date-time":"2023-01-26T17:14:32Z","timestamp":1674753272000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00448\/109284\/Canine-Pre-training-an-Efficient-Tokenization-Free"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022]]},"references-count":76,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00448","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022]]},"published":{"date-parts":[[2022]]}}}