{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,10]],"date-time":"2026-06-10T07:56:19Z","timestamp":1781078179560,"version":"3.54.1"},"reference-count":27,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2023,6,15]],"date-time":"2023-06-15T00:00:00Z","timestamp":1686787200000},"content-version":"vor","delay-in-days":165,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,6,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. This provides insight on the power of transformers using known results in complexity theory. For example, if L\u2260P (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Our result intuitively emerges from the transformer architecture\u2019s high parallelizability. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will obey limitations similar to it. Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm.<\/jats:p>","DOI":"10.1162\/tacl_a_00562","type":"journal-article","created":{"date-parts":[[2023,6,15]],"date-time":"2023-06-15T16:53:56Z","timestamp":1686848036000},"page":"531-545","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":14,"title":["The Parallelism Tradeoff: Limitations of Log-Precision Transformers"],"prefix":"10.1162","volume":"11","author":[{"given":"William","family":"Merrill","sequence":"first","affiliation":[{"name":"Center for Data Science, New York University, New York, NY, USA. willm@nyu.edu"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ashish","family":"Sabharwal","sequence":"additional","affiliation":[{"name":"Allen Institute for AI, Seattle, WA, USA. ashishs@allenai.org"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"281","published-online":{"date-parts":[[2023,6,12]]},"reference":[{"key":"2023061516534933300_bib1","doi-asserted-by":"crossref","DOI":"10.4086\/cjtcs.1999.007","article-title":"The permanent requires large uniform threshold circuits","author":"Allender","year":"1999","journal-title":"Chicago Journal of Theoretical Computer Science"},{"key":"2023061516534933300_bib2","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511804090","volume-title":"Computational Complexity: A Modern Approach","author":"Arora","year":"2009"},{"key":"2023061516534933300_bib3","volume-title":"Handbook of Satisfiability: Volume 185 Frontiers in Artificial Intelligence and Applications","author":"Biere","year":"2009"},{"key":"2023061516534933300_bib4","first-page":"1877","article-title":"Language models are few-shot learners","volume-title":"Advances in Neural Information Processing Systems","author":"Brown","year":"2020"},{"key":"2023061516534933300_bib5","doi-asserted-by":"publisher","DOI":"10.1016\/B978-0-08-049944-4.50008-2","article-title":"Complexity results for planning","volume-title":"Proceedings of the International Joint Conference on Artificial Intelligence","author":"Bylander","year":"1991"},{"key":"2023061516534933300_bib6","doi-asserted-by":"publisher","first-page":"259","DOI":"10.1051\/ita:2001119","article-title":"Division in logspace-uniform nc1","volume":"35","author":"Chiu","year":"2001","journal-title":"RAIRO Theoretical Informatics and Applications"},{"key":"2023061516534933300_bib7","article-title":"Universal transformers","volume-title":"International Conference on Learning Representations","author":"Dehghani","year":"2019"},{"key":"2023061516534933300_bib8","article-title":"GPT3.int8(): 8-bit matrix multiplication for transformers at scale","volume-title":"Advances in Neural Information Processing Systems","author":"Dettmers","year":"2022"},{"key":"2023061516534933300_bib9","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Devlin","year":"2019"},{"key":"2023061516534933300_bib10","article-title":"A mathematical framework for transformer circuits","author":"Elhage","year":"2021","journal-title":"Transformer Circuits Thread"},{"key":"2023061516534933300_bib11","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2022.emnlp-main.27","article-title":"What makes instruction learning hard? An investigation and a new challenge in a synthetic environment","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Finlayson","year":"2022"},{"key":"2023061516534933300_bib12","unstructured":"Raymond Greenlaw , James M.Hoover, and Walter L.Ruzzo. 1991. A compendium of problems complete for P. Technical Report TR91-11, University of Alberta."},{"key":"2023061516534933300_bib13","doi-asserted-by":"publisher","first-page":"156","DOI":"10.1162\/tacl_a_00306","article-title":"Theoretical limitations of self-attention in neural sequence models","volume":"8","author":"Hahn","year":"2020","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2023061516534933300_bib14","doi-asserted-by":"publisher","first-page":"800","DOI":"10.1162\/tacl_a_00490","article-title":"Formal language recognition by hard attention transformers: Perspectives from circuit complexity","volume":"10","author":"Hao","year":"2022","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2023061516534933300_bib15","doi-asserted-by":"publisher","first-page":"104","DOI":"10.1007\/3-540-48224-5_9","article-title":"Division is in uniform TC0","volume-title":"International Colloquium on Automata, Languages, and Programming","author":"Hesse","year":"2001"},{"key":"2023061516534933300_bib16","volume-title":"Descriptive Complexity","author":"Immerman","year":"2012"},{"issue":"1","key":"2023061516534933300_bib17","doi-asserted-by":"publisher","first-page":"105","DOI":"10.1016\/0304-3975(76)90068-2","article-title":"Complete problems for deterministic polynomial time","volume":"3","author":"Jones","year":"1976","journal-title":"Theoretical Computer Science"},{"issue":"1","key":"2023061516534933300_bib18","doi-asserted-by":"publisher","first-page":"18","DOI":"10.1145\/990518.990519","article-title":"The circuit value problem is log space complete for P","volume":"7","author":"Ladner","year":"1975","journal-title":"ACM SIGACT News"},{"key":"2023061516534933300_bib19","doi-asserted-by":"publisher","first-page":"161","DOI":"10.1016\/0304-3975(82)90058-5","article-title":"Symmetric space-bounded computation","volume":"19","author":"Lewis","year":"1982","journal-title":"Theoretical Computer Science"},{"key":"2023061516534933300_bib20","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.133","article-title":"Effects of parameter norm growth during transformer training: Inductive bias from gradient descent","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"Merrill","year":"2021"},{"key":"2023061516534933300_bib21","article-title":"On the linguistic capacity of real-time counter automata","volume":"abs\/2004.06866","author":"Merrill","year":"2020","journal-title":"ArXiv"},{"key":"2023061516534933300_bib22","doi-asserted-by":"publisher","first-page":"843","DOI":"10.1162\/tacl_a_00493","article-title":"Saturated transformers are constant-depth threshold circuits","volume":"10","author":"Merrill","year":"2022","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2023061516534933300_bib23","article-title":"On the Turing completeness of modern neural network architectures","volume-title":"International Conference on Learning Representations","author":"P\u00e9rez","year":"2019"},{"issue":"140","key":"2023061516534933300_bib24","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"Journal of Machine Learning Research"},{"key":"2023061516534933300_bib25","doi-asserted-by":"publisher","first-page":"17:1\u201317:24","DOI":"10.1145\/1391289.1391291","article-title":"Undirected connectivity in log-space","volume":"55","author":"Reingold","year":"2008","journal-title":"Journal of the ACM"},{"key":"2023061516534933300_bib26","doi-asserted-by":"publisher","first-page":"189","DOI":"10.1016\/0304-3975(79)90044-6","article-title":"The complexity of computing the permanent","volume":"8","author":"Valiant","year":"1979","journal-title":"Theoretical Computer Science"},{"key":"2023061516534933300_bib27","article-title":"Attention is all you need","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani","year":"2017"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00562\/2131191\/tacl_a_00562.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00562\/2131191\/tacl_a_00562.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,12,15]],"date-time":"2023-12-15T04:51:40Z","timestamp":1702615900000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00562\/116413\/The-Parallelism-Tradeoff-Limitations-of-Log"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023]]},"references-count":27,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00562","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2023]]},"published":{"date-parts":[[2023]]}}}