{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,27]],"date-time":"2026-01-27T23:13:56Z","timestamp":1769555636711,"version":"3.49.0"},"reference-count":7,"publisher":"Cambridge University Press (CUP)","issue":"3","license":[{"start":{"date-parts":[[2020,4,7]],"date-time":"2020-04-07T00:00:00Z","timestamp":1586217600000},"content-version":"unspecified","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2020,5]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Subwords have become very popular, but the BERT<jats:sup>a<\/jats:sup> and ERNIE<jats:sup>b<\/jats:sup>  tokenizers often produce surprising results. Byte pair encoding (BPE) trains a dictionary with a simple information theoretic criterion that sidesteps the need for special treatment of unknown words. BPE is more about training (populating a dictionary of word pieces) than inference (parsing an unknown word into word pieces). The parse at inference time can be ambiguous. Which parse should we use? For example, \u201celectroneutral\u201d can be parsed as electron-eu-tral or electro-neutral, and \u201cbidirectional\u201d can be parsed as bid-ire-ction-al and bi-directional. BERT and ERNIE tend to favor the parse with more word pieces. We propose minimizing the number of word pieces. To justify our proposal, a number of criteria will be considered: sound, meaning, etc. 
The prefix, bi-, has the desired vowel (unlike bid) and the desired meaning (bi is Latin for two, unlike bid, which is Germanic for offer).<\/jats:p>","DOI":"10.1017\/s1351324920000145","type":"journal-article","created":{"date-parts":[[2020,4,7]],"date-time":"2020-04-07T08:07:41Z","timestamp":1586246861000},"page":"375-382","source":"Crossref","is-referenced-by-count":7,"title":["Emerging trends: Subwords, seriously?"],"prefix":"10.1017","volume":"26","author":[{"given":"Kenneth Ward","family":"Church","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"56","published-online":{"date-parts":[[2020,4,7]]},"reference":[{"key":"S1351324920000145_ref7","unstructured":"Wu, Y. , Schuster, M. , Chen, Z. , Le, Q.V. , Norouzi, M. , Macherey, W. , Krikun, M. , Cao, Y. , Gao, Q. , Macherey, K. , Klingner, J. , Shah, A. , Johnson, M. , Liu, X. , Kaiser, \u0141. , Gouws, S. , Kato, Y. , Kudo, T. , Kazawa, H. , Stevens, K. , Kurian, G. , Patil, N. , Wang, W. , Young, C. , Smith, J. , Riesa, J. , Rudnick, A. , Vinyals, O. , Corrado, G. , Hughes, M. and Dean, J. (2016). Google\u2019s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144."},{"key":"S1351324920000145_ref2","unstructured":"Devlin, J. , Chang, M.-W. , Lee, K. and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171\u20134186."},{"key":"S1351324920000145_ref1","unstructured":"Coker, C.H. , Church, K.W. and Liberman, M.Y. (1991). Morphology and rhyming: Two powerful alternatives to letter-to-sound rules for speech synthesis. In The ESCA Workshop on Speech Synthesis, pp. 83\u201386."},{"key":"S1351324920000145_ref5","unstructured":"Sun, Y. , Wang, S. , Li, Y. , Feng, S. , Tian, H. , Wu, H. and Wang, H. (2019). Ernie 2.0: A continual pre-training framework for language understanding. 
arXiv preprint, arXiv:1907.12412."},{"key":"S1351324920000145_ref3","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2012.6289079"},{"key":"S1351324920000145_ref4","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1162"},{"key":"S1351324920000145_ref6","unstructured":"Wang, A. , Singh, A. , Michael, J. , Hill, F. , Levy, O. and Bowman, S.R. (2018). Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint, arXiv:1804.07461."}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324920000145","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,4,7]],"date-time":"2020-04-07T08:08:29Z","timestamp":1586246909000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324920000145\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,4,7]]},"references-count":7,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2020,5]]}},"alternative-id":["S1351324920000145"],"URL":"https:\/\/doi.org\/10.1017\/s1351324920000145","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,4,7]]}}}