{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,16]],"date-time":"2026-05-16T15:56:35Z","timestamp":1778946995518,"version":"3.51.4"},"reference-count":89,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2024,5,9]],"date-time":"2024-05-09T00:00:00Z","timestamp":1715212800000},"content-version":"vor","delay-in-days":129,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,5,7]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal languages. Exploring such questions can help clarify the power of transformers relative to other models of computation, their fundamental capabilities and limits, and the impact of architectural choices. Work in this subarea has made considerable progress in recent years. Here, we undertake a comprehensive survey of this work, documenting the diverse assumptions that underlie different results and providing a unified framework for harmonizing seemingly contradictory findings.<\/jats:p>","DOI":"10.1162\/tacl_a_00663","type":"journal-article","created":{"date-parts":[[2024,5,9]],"date-time":"2024-05-09T16:03:53Z","timestamp":1715270633000},"page":"543-561","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":18,"title":["What Formal Languages Can Transformers Express? A Survey"],"prefix":"10.1162","volume":"12","author":[{"given":"Lena","family":"Strobl","sequence":"first","affiliation":[{"name":"Ume\u00e5 University, Sweden. lena.strobl@umu.se"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"William","family":"Merrill","sequence":"additional","affiliation":[{"name":"New York University, USA. willm@nyu.edu"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Gail","family":"Weiss","sequence":"additional","affiliation":[{"name":"EPFL, Switzerland. gail.weiss@epfl.ch"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"David","family":"Chiang","sequence":"additional","affiliation":[{"name":"University of Notre Dame, USA. dchiang@nd.edu"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dana","family":"Angluin","sequence":"additional","affiliation":[{"name":"Yale University, USA. dana.angluin@yale.edu"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"281","published-online":{"date-parts":[[2024,5,7]]},"reference":[{"key":"2024050916034516400_bib1","article-title":"A survey of neural networks and formal languages","author":"Ackerman","year":"2020","journal-title":"arXiv preprint arXiv:2006.01338"},{"key":"2024050916034516400_bib2","article-title":"Physics of language models: Part 1, context-free grammar","author":"Allen-Zhu","year":"2023","journal-title":"arXiv preprint arXiv:2305.13673"},{"issue":"7","key":"2024050916034516400_bib3","doi-asserted-by":"publisher","DOI":"10.4086\/cjtcs.1999.007","article-title":"The permanent requires large uniform threshold circuits","volume":"1999","author":"Allender","year":"1999","journal-title":"Chicago Journal of Theoretical Computer Science"},{"key":"2024050916034516400_bib4","article-title":"Masked hard-attention transformers and Boolean RASP recognize exactly the star-free languages","author":"Angluin","year":"2023","journal-title":"arXiv preprint arXiv:2310.13897"},{"key":"2024050916034516400_bib5","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511804090","volume-title":"Computational Complexity: A Modern Approach","author":"Arora","year":"2009"},{"key":"2024050916034516400_bib6","article-title":"Layer normalization","volume-title":"NIPS 2016 Deep Learning Symposium","author":"Ba","year":"2016"},{"key":"2024050916034516400_bib7","article-title":"Neural machine translation by jointly learning to align and translate","volume-title":"Proceedings of the Third International Conference on Learning Representations (ICLR)","author":"Bahdanau","year":"2015"},{"key":"2024050916034516400_bib8","article-title":"Logical languages accepted by transformer encoders with hard attention","volume-title":"Proceedings of the Twelfth International Conference on Learning Representations (ICLR)","author":"Barcel\u00f3","year":"2024"},{"issue":"1","key":"2024050916034516400_bib9","doi-asserted-by":"publisher","first-page":"150","DOI":"10.1016\/0022-0000(89)90037-8","article-title":"Bounded-width polynomial-size branching programs recognize exactly those languages in NC1","volume":"38","author":"Barrington","year":"1989","journal-title":"Journal of Computer and System Sciences"},{"issue":"3","key":"2024050916034516400_bib10","doi-asserted-by":"publisher","first-page":"478","DOI":"10.1016\/0022-0000(92)90014-A","article-title":"Regular languages in NC1","volume":"44","author":"Barrington","year":"1992","journal-title":"Journal of Computer and System Sciences"},{"issue":"2","key":"2024050916034516400_bib11","doi-asserted-by":"publisher","first-page":"101","DOI":"10.1016\/j.jcss.2004.07.004","article-title":"First-order expressibility of languages with neutral letters or: The Crane Beach conjecture","volume":"70","author":"Mix Barrington","year":"2005","journal-title":"Journal of Computer and System Sciences"},{"issue":"3","key":"2024050916034516400_bib12","doi-asserted-by":"publisher","first-page":"274","DOI":"10.1016\/0022-0000(90)90022-D","article-title":"On uniformity within NC1","volume":"41","author":"Mix Barrington","year":"1990","journal-title":"Journal of Computer and System Sciences"},{"key":"2024050916034516400_bib13","doi-asserted-by":"publisher","first-page":"176","DOI":"10.1109\/SCT.1994.315806","article-title":"Time, hardware, and uniformity","volume-title":"Proceedings of the IEEE 9th Annual Conference on Structure in Complexity Theory","author":"Barrington","year":"1994"},{"issue":"7","key":"2024050916034516400_bib14","doi-asserted-by":"publisher","first-page":"1155","DOI":"10.1016\/0893-6080(96)00130-X","article-title":"On the circuit complexity of sigmoid feedforward neural networks","volume":"9","author":"Beiu","year":"1996","journal-title":"Neural Networks"},{"key":"2024050916034516400_bib15","doi-asserted-by":"publisher","first-page":"7096","DOI":"10.18653\/v1\/2020.emnlp-main.576","article-title":"On the ability and limitations of Transformers to recognize formal languages","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Bhattamishra","year":"2020"},{"key":"2024050916034516400_bib16","doi-asserted-by":"publisher","first-page":"455","DOI":"10.18653\/v1\/2020.conll-1.37","article-title":"On the computational power of Transformers and its implications in sequence modeling","volume-title":"Proceedings of the 24th Conference on Computational Natural Language Learning (CoNLL)","author":"Bhattamishra","year":"2020"},{"key":"2024050916034516400_bib17","doi-asserted-by":"publisher","first-page":"5767","DOI":"10.18653\/v1\/2023.acl-long.317","article-title":"Simplicity bias in Transformers and their ability to learn sparse Boolean functions","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)","author":"Bhattamishra","year":"2023"},{"key":"2024050916034516400_bib18","first-page":"1877","article-title":"Language models are few-shot learners","volume-title":"Advances in Neural Information Processing Systems 33 (NeurIPS)","author":"Brown","year":"2020"},{"key":"2024050916034516400_bib19","doi-asserted-by":"publisher","first-page":"123","DOI":"10.1145\/28395.28409","article-title":"The Boolean formula value problem is in ALOGTIME","volume-title":"Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing (STOC)","author":"Buss","year":"1987"},{"issue":"2","key":"2024050916034516400_bib20","doi-asserted-by":"publisher","first-page":"423","DOI":"10.1137\/0213028","article-title":"Constant depth reducibility","volume":"13","author":"Chandra","year":"1984","journal-title":"SIAM Journal of Computing"},{"key":"2024050916034516400_bib21","doi-asserted-by":"publisher","first-page":"7654","DOI":"10.18653\/v1\/2022.acl-long.527","article-title":"Overcoming a theoretical limitation of self-attention","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL)","author":"Chiang","year":"2022"},{"key":"2024050916034516400_bib22","first-page":"5544","article-title":"Tighter bounds on the expressivity of transformer encoders","volume-title":"Proceedings of the 40th International Conference on Machine Learning (ICML)","author":"Chiang","year":"2023"},{"key":"2024050916034516400_bib23","doi-asserted-by":"publisher","first-page":"118","DOI":"10.1016\/S0049-237X(08)72023-8","article-title":"The algebraic theory of context-free languages","volume-title":"Computer Programming and Formal Systems","author":"Chomsky","year":"1963"},{"issue":"3","key":"2024050916034516400_bib24","doi-asserted-by":"publisher","first-page":"385","DOI":"10.1016\/0196-6774(87)90018-6","article-title":"Problems complete for deterministic logarithmic space","volume":"8","author":"Cook","year":"1987","journal-title":"Journal of Algorithms"},{"issue":"4","key":"2024050916034516400_bib25","doi-asserted-by":"publisher","first-page":"303","DOI":"10.1007\/BF02551274","article-title":"Approximation by superpositions of a sigmoidal function","volume":"2","author":"Cybenko","year":"1989","journal-title":"Mathematics of Control, Signals, and Systems"},{"key":"2024050916034516400_bib26","article-title":"Neural networks and the Chomsky hierarchy","volume-title":"Proceedings of the Eleventh International Conference on Learning Representations (ICLR)","author":"Del\u00e9tang","year":"2023"},{"key":"2024050916034516400_bib27","doi-asserted-by":"publisher","first-page":"4171","DOI":"10.18653\/v1\/N19-1423","article-title":"BERT: Pre-training of deep bidirectional Transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT)","author":"Devlin","year":"2019"},{"key":"2024050916034516400_bib28","doi-asserted-by":"publisher","first-page":"4301","DOI":"10.18653\/v1\/2020.findings-emnlp.384","article-title":"How can self-attention networks recognize Dyck-n languages?","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2020","author":"Ebrahimi","year":"2020"},{"key":"2024050916034516400_bib29","article-title":"Towards revealing the mystery behind Chain of Thought: A theoretical perspective","volume-title":"Advances in Neural Information Processing Systems 36 (NeurIPS)","author":"Feng","year":"2023"},{"key":"2024050916034516400_bib30","doi-asserted-by":"publisher","first-page":"265","DOI":"10.1007\/BF01694011","article-title":"Counter machines and counter languages","volume":"2","author":"Fischer","year":"1968","journal-title":"Mathematical Systems Theory"},{"key":"2024050916034516400_bib31","article-title":"Learning Transformer programs","volume-title":"Advances in Neural Information Processing Systems 36 (NeurIPS)","author":"Friedman","year":"2023"},{"key":"2024050916034516400_bib32","doi-asserted-by":"publisher","first-page":"13","DOI":"10.1007\/BF01744431","article-title":"Parity, circuits, and the polynomial-time hierarchy","volume":"17","author":"Furst","year":"1984","journal-title":"Mathematical Systems Theory"},{"key":"2024050916034516400_bib33","doi-asserted-by":"crossref","unstructured":"Raymond\n              Greenlaw\n            , H.James Hoover, and Walter L.Ruzzo. 1995. Limits to Parallel Computation: P-Completeness Theory. Oxford University Press. Preliminary version of Appendix A available as Technical Report TR91-11, University of Alberta, Department of Computing Science. 10.1093\/oso\/9780195085914.001.0001","DOI":"10.1093\/oso\/9780195085914.001.0001"},{"key":"2024050916034516400_bib34","doi-asserted-by":"publisher","first-page":"156","DOI":"10.1162\/tacl_a_00306","article-title":"Theoretical limitations of self-attention in neural sequence models","volume":"8","author":"Hahn","year":"2020","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024050916034516400_bib35","doi-asserted-by":"publisher","first-page":"800","DOI":"10.1162\/tacl_a_00490","article-title":"Formal language recognition by hard attention Transformers: Perspectives from circuit complexity","volume":"10","author":"Hao","year":"2022","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024050916034516400_bib36","article-title":"Gaussian error linear units (GELUs)","author":"Hendrycks","year":"2016","journal-title":"arXiv preprint arXiv:1606.08415"},{"key":"2024050916034516400_bib37","doi-asserted-by":"publisher","first-page":"104","DOI":"10.1007\/3-540-48224-5_9","article-title":"Division is in uniform TC0","volume-title":"Automata, Languages and Programming (ICALP)","author":"Hesse","year":"2001"},{"key":"2024050916034516400_bib38","doi-asserted-by":"publisher","first-page":"1978","DOI":"10.18653\/v1\/2020.emnlp-main.156","article-title":"RNNs can generate bounded hierarchical languages with optimal memory","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Hewitt","year":"2020"},{"issue":"5","key":"2024050916034516400_bib39","doi-asserted-by":"publisher","first-page":"359","DOI":"10.1016\/0893-6080(89)90020-8","article-title":"Multilayer feedforward networks are universal approximators","volume":"2","author":"Hornik","year":"1989","journal-title":"Neural Networks"},{"key":"2024050916034516400_bib40","article-title":"The annotated Transformer","author":"Huang","year":"2022"},{"issue":"4","key":"2024050916034516400_bib41","doi-asserted-by":"publisher","first-page":"760","DOI":"10.1137\/0216051","article-title":"Languages that capture complexity classes","volume":"16","author":"Immerman","year":"1997","journal-title":"SIAM Journal on Computing"},{"key":"2024050916034516400_bib42","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4612-0539-5","volume-title":"Descriptive Complexity","author":"Immerman","year":"1999"},{"issue":"1","key":"2024050916034516400_bib43","doi-asserted-by":"publisher","first-page":"105","DOI":"10.1016\/0304-3975(76)90068-2","article-title":"Complete problems for deterministic polynomial time","volume":"3","author":"Jones","year":"1976","journal-title":"Theoretical Computer Science"},{"key":"2024050916034516400_bib44","unstructured":"Johan Anthony Willem\n              Kamp\n            \n          . 1968. Tense Logic and the Theory of Linear Order. Ph.D. thesis, University of California, Los Angeles."},{"key":"2024050916034516400_bib45","doi-asserted-by":"publisher","first-page":"3835","DOI":"10.18653\/v1\/2023.acl-long.213","article-title":"Entity tracking in language models","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Kim","year":"2023"},{"key":"2024050916034516400_bib46","doi-asserted-by":"publisher","first-page":"5147","DOI":"10.18653\/v1\/2021.naacl-main.405","article-title":"Limitations of autoregressive models and their alternatives","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT)","author":"Lin","year":"2021"},{"key":"2024050916034516400_bib47","doi-asserted-by":"publisher","first-page":"111","DOI":"10.1016\/j.aiopen.2022.10.001","article-title":"A survey of transformers","volume":"3","author":"Lin","year":"2022","journal-title":"AI Open"},{"key":"2024050916034516400_bib48","first-page":"37876","article-title":"Tracr: Compiled transformers as a laboratory for interpretability","volume-title":"Advances in Neural Information Processing Systems 36 (NeurIPS)","author":"Lindner","year":"2023"},{"key":"2024050916034516400_bib49","article-title":"Transformers learn shortcuts to automata","volume-title":"Proceedings of the Eleventh International Conference on Learning Representations (ICLR)","author":"Liu","year":"2023"},{"key":"2024050916034516400_bib50","volume-title":"Counter-Free Automata","author":"McNaughton","year":"1971"},{"key":"2024050916034516400_bib51","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/W19-3901","article-title":"Sequential neural networks as automata","volume-title":"Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges","author":"Merrill","year":"2019"},{"key":"2024050916034516400_bib52","article-title":"On the linguistic capacity of real-time counter automata","author":"Merrill","year":"2020","journal-title":"arXiv preprint arXiv:2004.06866"},{"key":"2024050916034516400_bib53","article-title":"Formal language theory meets modern NLP","author":"Merrill","year":"2021","journal-title":"arXiv preprint arXiv: 2102.10094"},{"key":"2024050916034516400_bib54","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/978-3-031-33264-7_1","article-title":"Formal languages and the NLP black box","volume-title":"Developments in Language Theory","author":"Merrill","year":"2023"},{"key":"2024050916034516400_bib55","doi-asserted-by":"publisher","first-page":"1766","DOI":"10.18653\/v1\/2021.emnlp-main.133","article-title":"Effects of parameter norm growth during transformer training: Inductive bias from gradient descent","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Merrill","year":"2021"},{"key":"2024050916034516400_bib56","doi-asserted-by":"publisher","first-page":"531","DOI":"10.1162\/tacl_a_00562","article-title":"The parallelism tradeoff: Limitations of log-precision transformers","volume":"11","author":"Merrill","year":"2023","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024050916034516400_bib57","article-title":"A logic for expressing log-precision transformers","volume-title":"Advances in Neural Information Processing Systems 36 (NeurIPS)","author":"Merrill","year":"2023"},{"key":"2024050916034516400_bib58","article-title":"The expressive power of transformers with chain of thought","volume-title":"Proceedings of the Twelfth International Conference on Learning Representations (ICLR)","author":"Merrill","year":"2024"},{"key":"2024050916034516400_bib59","doi-asserted-by":"publisher","first-page":"843","DOI":"10.1162\/tacl_a_00493","article-title":"Saturated transformers are constant-depth threshold circuits","volume":"10","author":"Merrill","year":"2022","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024050916034516400_bib60","doi-asserted-by":"publisher","first-page":"443","DOI":"10.18653\/v1\/2020.acl-main.43","article-title":"A formal hierarchy of RNN architectures","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL)","author":"Merrill","year":"2020"},{"key":"2024050916034516400_bib61","article-title":"Show your work: Scratchpads for intermediate computation with language models","volume-title":"Proceedings of the Workshop on Deep Learning for Code (DLAC)","author":"Nye","year":"2022"},{"key":"2024050916034516400_bib62","unstructured":"OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774."},{"issue":"2","key":"2024050916034516400_bib63","doi-asserted-by":"publisher","first-page":"471","DOI":"10.1162\/coli_a_00431","article-title":"On learning interpreted languages with recurrent models","volume":"48","author":"Paperno","year":"2022","journal-title":"Computational Linguistics"},{"key":"2024050916034516400_bib64","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/1836.001.0001","volume-title":"Circuit Complexity and Neural Networks","author":"Parberry","year":"1994"},{"key":"2024050916034516400_bib65","first-page":"75:1\u201375:35","article-title":"Attention is Turing-complete","volume":"22","author":"P\u00e9rez","year":"2021","journal-title":"Journal of Machine Learning Research"},{"key":"2024050916034516400_bib66","article-title":"Formal algorithms for transformers","author":"Phuong","year":"2022","journal-title":"arXiv preprint arXiv:2207.09238"},{"key":"2024050916034516400_bib67","article-title":"On the Turing completeness of modern neural network architectures","volume-title":"Proceedings of the Seventh International Conference on Learning Representations (ICLR)","author":"P\u00e9rez","year":"2019"},{"key":"2024050916034516400_bib68","article-title":"Improving language understanding by generative pre-training","author":"Radford","year":"2018"},{"issue":"4","key":"2024050916034516400_bib69","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/1391289.1391291","article-title":"Undirected connectivity in log-space","volume":"55","author":"Reingold","year":"2008","journal-title":"Journal of the ACM"},{"key":"2024050916034516400_bib70","article-title":"Representational strengths and limitations of transformers","volume-title":"Advances in Neural Information Processing Systems 36 (NeurIPS)","author":"Sanford","year":"2023"},{"issue":"2","key":"2024050916034516400_bib71","doi-asserted-by":"publisher","first-page":"331","DOI":"10.1016\/0304-3975(94)90178-3","article-title":"Analog computation via neural networks","volume":"131","author":"Siegelmann","year":"1994","journal-title":"Theoretical Computer Science"},{"issue":"1","key":"2024050916034516400_bib72","doi-asserted-by":"publisher","DOI":"10.1006\/jcss.1995.1013","article-title":"On the computational power of neural nets","volume":"50","author":"Siegelmann","year":"1995","journal-title":"Journal of Computer and System Sciences"},{"issue":"12","key":"2024050916034516400_bib73","doi-asserted-by":"publisher","first-page":"2727","DOI":"10.1162\/089976603322518731","article-title":"General-purpose computation with neural networks: A survey of complexity theoretic results","volume":"15","author":"\u0160\u00edma","year":"2003","journal-title":"Neural Computation"},{"key":"2024050916034516400_bib74","volume-title":"Introduction to the Theory of Computation","author":"Sipser","year":"2013","edition":"3rd"},{"key":"2024050916034516400_bib75","volume-title":"Discrete Neural Computation","author":"Siu","year":"1995"},{"key":"2024050916034516400_bib76","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4612-0289-9","volume-title":"Finite Automata, Formal Logic, and Circuit Complexity","author":"Straubing","year":"1994"},{"key":"2024050916034516400_bib77","article-title":"Average-hard attention transformers are constant-depth uniform threshold circuits","author":"Strobl","year":"2023","journal-title":"arXiv preprint arXiv:2308.03212"},{"issue":"1","key":"2024050916034516400_bib78","doi-asserted-by":"publisher","first-page":"62","DOI":"10.1016\/S0022-0000(75)80014-6","article-title":"On tape-bounded complexity classes and multihead finite automata","volume":"10","author":"Sudborough","year":"1975","journal-title":"Journal of Computer and System Sciences"},{"key":"2024050916034516400_bib79","doi-asserted-by":"publisher","first-page":"44","DOI":"10.18653\/v1\/W19-3905","article-title":"LSTM networks can perform dynamic counting","volume-title":"Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges","author":"Suzgun","year":"2019"},{"key":"2024050916034516400_bib80","doi-asserted-by":"publisher","first-page":"389","DOI":"10.1007\/978-3-642-59126-6_7","article-title":"Languages, automata, and logic","volume-title":"Handbook of Formal Languages: Volume 3 Beyond Words","author":"Thomas","year":"1997"},{"key":"2024050916034516400_bib81","article-title":"Attention is all you need","volume-title":"Advances in Neural Information Processing Systems 30 (NeurIPS)","author":"Vaswani","year":"2017"},{"key":"2024050916034516400_bib82","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1176","article-title":"Learning deep Transformer models for machine translation","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)","author":"Wang","year":"2019"},{"key":"2024050916034516400_bib83","article-title":"Statistically meaningful approximation: A case study on approximating Turing machines with transformers","volume-title":"Advances in Neural Information Processing Systems 35 (NeurIPS)","author":"Wei","year":"2022"},{"key":"2024050916034516400_bib84","article-title":"Chain-of-thought prompting elicits reasoning in large language models","volume-title":"Advances in Neural Information Processing Systems 35 (NeurIPS)","author":"Wei","year":"2022"},{"key":"2024050916034516400_bib85","doi-asserted-by":"publisher","first-page":"740","DOI":"10.18653\/v1\/P18-2117","article-title":"On the practical computational power of finite precision RNNs for language recognition","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL)","author":"Weiss","year":"2018"},{"key":"2024050916034516400_bib86","first-page":"11080","article-title":"Thinking like Transformers","volume-title":"Proceedings of the 38th International Conference on Machine Learning (ICML)","author":"Weiss","year":"2021"},{"key":"2024050916034516400_bib87","doi-asserted-by":"publisher","first-page":"3770","DOI":"10.18653\/v1\/2021.acl-long.292","article-title":"Self-attention networks can process bounded hierarchical languages","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP)","author":"Yao","year":"2021"},{"key":"2024050916034516400_bib88","article-title":"Are Transformers universal approximators of sequence-to-sequence functions?","volume-title":"8th International Conference on Learning Representations (ICLR)","author":"Yun","year":"2020"},{"key":"2024050916034516400_bib89","article-title":"What algorithms can Transformers learn? A study in length generalization","volume-title":"Proceedings of the Twelfth International Conference on Learning Representations (ICLR)","author":"Zhou","year":"2024"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00663\/2370911\/tacl_a_00663.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00663\/2370911\/tacl_a_00663.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,9]],"date-time":"2024-05-09T16:04:23Z","timestamp":1715270663000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00663\/120983\/What-Formal-Languages-Can-Transformers-Express-A"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"references-count":89,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00663","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024]]},"published":{"date-parts":[[2024]]}}}