{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,12]],"date-time":"2026-06-12T10:18:35Z","timestamp":1781259515610,"version":"3.54.1"},"reference-count":26,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2022,8,11]],"date-time":"2022-08-11T00:00:00Z","timestamp":1660176000000},"content-version":"vor","delay-in-days":222,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,8,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Transformers have become a standard neural network architecture for many NLP problems, motivating theoretical analysis of their power in terms of formal languages. Recent work has shown that transformers with hard attention are quite limited in power (Hahn, 2020), as they can be simulated by constant-depth AND\/OR circuits (Hao et al., 2022). However, hard attention is a strong assumption, which may complicate the relevance of these results in practice. In this work, we analyze the circuit complexity of transformers with saturated attention: a generalization of hard attention that more closely captures the attention patterns learnable in practical transformers. We first show that saturated transformers transcend the known limitations of hard-attention transformers. We then prove saturated transformers with floating-point values can be simulated by constant-depth threshold circuits, giving the class TC0 as an upper bound on the formal languages they recognize.<\/jats:p>","DOI":"10.1162\/tacl_a_00493","type":"journal-article","created":{"date-parts":[[2022,8,11]],"date-time":"2022-08-11T19:43:52Z","timestamp":1660247032000},"page":"843-856","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":10,"title":["Saturated Transformers are Constant-Depth Threshold Circuits"],"prefix":"10.1162","volume":"10","author":[{"given":"William","family":"Merrill","sequence":"first","affiliation":[{"name":"Allen Institute for AI, USA. willm@nyu.edu"},{"name":"New York University, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ashish","family":"Sabharwal","sequence":"additional","affiliation":[{"name":"Allen Institute for AI, USA. ashishs@allenai.org"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Noah A.","family":"Smith","sequence":"additional","affiliation":[{"name":"Allen Institute for AI, USA. noah@allenai.org"},{"name":"University of Washington, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"281","published-online":{"date-parts":[[2022,8,12]]},"reference":[{"key":"2022081119434484200_bib1","volume-title":"Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP","author":"Alishahi","year":"2020"},{"key":"2022081119434484200_bib2","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511804090","volume-title":"Computational Complexity: A Modern Approach","author":"Arora","year":"2009"},{"key":"2022081119434484200_bib3","article-title":"Layer normalization","author":"Ba","year":"2016","journal-title":"ArXiv"},{"key":"2022081119434484200_bib4","doi-asserted-by":"publisher","first-page":"7096","DOI":"10.18653\/v1\/2020.emnlp-main.576","article-title":"On the ability and limitations of transformers to recognize formal languages","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Bhattamishra","year":"2020"},{"key":"2022081119434484200_bib5","unstructured":"Liliana\n              Cojocaru\n            \n          . 2016. Advanced Studies on the Complexity of Formal Languages. Ph.D. thesis, University of Tampere."},{"key":"2022081119434484200_bib6","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2022081119434484200_bib7","doi-asserted-by":"publisher","DOI":"10.1016\/B978-0-444-88071-0.50006-0","volume-title":"Machine Models and Simulations, chapter 1","author":"van Emde Boas","year":"1991"},{"key":"2022081119434484200_bib8","doi-asserted-by":"publisher","first-page":"260","DOI":"10.1109\/SFCS.1981.35","article-title":"Parity, circuits, and the polynomial-time hierarchy","volume-title":"Proceedings of the 22nd Annual Symposium on Foundations of Computer Science","author":"Furst","year":"1981"},{"issue":"6","key":"2022081119434484200_bib9","doi-asserted-by":"publisher","first-page":"599","DOI":"10.1080\/00029890.1973.11993338","article-title":"A history of the prime number theorem","volume":"80","author":"Goldstein","year":"1973","journal-title":"The American Mathematical Monthly"},{"key":"2022081119434484200_bib10","doi-asserted-by":"publisher","first-page":"156","DOI":"10.1162\/tacl_a_00306","article-title":"Theoretical limitations of self-attention in neural sequence models","volume":"8","author":"Hahn","year":"2020","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2022081119434484200_bib11","doi-asserted-by":"crossref","DOI":"10.1162\/tacl_a_00490","article-title":"Formal language recognition by hard attention transformers: Perspectives from circuit complexity","author":"Hao","year":"2022"},{"key":"2022081119434484200_bib12","doi-asserted-by":"publisher","first-page":"1978","DOI":"10.18653\/v1\/2020.emnlp-main.156","article-title":"RNNs can generate bounded hierarchical languages with optimal memory","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Hewitt","year":"2020"},{"key":"2022081119434484200_bib13","doi-asserted-by":"publisher","DOI":"10.1016\/B978-0-444-88071-0.50007-2","volume-title":"A Catalog of Complexity Classes, chapter 2","author":"Johnson","year":"1991"},{"key":"2022081119434484200_bib14","article-title":"Lecture notes for topics in complexity theory","author":"Kayal","year":"2015"},{"key":"2022081119434484200_bib15","first-page":"42","article-title":"Polynomial size log depth circuits: Between NC1 and AC1.","volume":"91","author":"Mahajan","year":"2007","journal-title":"Bulletin of the EATCS"},{"key":"2022081119434484200_bib16","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/W19-3901","article-title":"Sequential neural networks as automata","volume-title":"Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges","author":"Merrill","year":"2019"},{"key":"2022081119434484200_bib17","article-title":"Formal language theory meets modern NLP","author":"Merrill","year":"2021","journal-title":"ArXiv"},{"key":"2022081119434484200_bib18","doi-asserted-by":"publisher","first-page":"1766","DOI":"10.18653\/v1\/2021.emnlp-main.133","article-title":"Effects of parameter norm growth during transformer training: Inductive bias from gradient descent","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"Merrill","year":"2021"},{"key":"2022081119434484200_bib19","doi-asserted-by":"publisher","first-page":"1203","DOI":"10.18653\/v1\/D18-1152","article-title":"Rational recurrences","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Peng","year":"2018"},{"key":"2022081119434484200_bib20","article-title":"On the Turing completeness of modern neural network architectures","volume-title":"International Conference on Learning Representations","author":"P\u00e9rez","year":"2019"},{"key":"2022081119434484200_bib21","article-title":"Attention is all you need","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani","year":"2017"},{"key":"2022081119434484200_bib22","doi-asserted-by":"publisher","first-page":"740","DOI":"10.18653\/v1\/P18-2117","article-title":"On the practical computational power of finite precision RNNs for language recognition","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Weiss","year":"2018"},{"key":"2022081119434484200_bib23","article-title":"Thinking like transformers","author":"Weiss","year":"2021","journal-title":"ArXiv"},{"key":"2022081119434484200_bib24","first-page":"619","article-title":"On ACC and threshold circuits","volume-title":"Proceedings of the 31st Annual Symposium on Foundations of Computer Science","author":"Yao","year":"1990"},{"key":"2022081119434484200_bib25","first-page":"3770","article-title":"Self-attention networks can process bounded hierarchical languages","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Yao","year":"2021"},{"key":"2022081119434484200_bib26","article-title":"Are transformers universal approximators of sequence-to-sequence functions?","volume-title":"International Conference on Learning Representations","author":"Yun","year":"2020"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00493\/2038506\/tacl_a_00493.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00493\/2038506\/tacl_a_00493.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,8,11]],"date-time":"2022-08-11T19:44:09Z","timestamp":1660247049000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00493\/112604\/Saturated-Transformers-are-Constant-Depth"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022]]},"references-count":26,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00493","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022]]},"published":{"date-parts":[[2022]]}}}