{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,9]],"date-time":"2026-06-09T18:04:59Z","timestamp":1781028299486,"version":"3.54.1"},"publisher-location":"New York, NY, USA","reference-count":53,"publisher":"ACM","license":[{"start":{"date-parts":[[2026,6,9]],"date-time":"2026-06-09T00:00:00Z","timestamp":1780963200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/legalcode"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["CCF-2106444"],"award-info":[{"award-number":["CCF-2106444"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000893","name":"Simons Foundation","doi-asserted-by":"publisher","award":["Simons Investigator 2024"],"award-info":[{"award-number":["Simons Investigator 2024"]}],"id":[{"id":"10.13039\/100000893","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100004332","name":"JPMorgan Chase and Company","doi-asserted-by":"publisher","award":["JPMC AI PhD Fellowship"],"award-info":[{"award-number":["JPMC AI PhD Fellowship"]}],"id":[{"id":"10.13039\/100004332","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,6,9]]},"DOI":"10.1145\/3798129.3800914","type":"proceedings-article","created":{"date-parts":[[2026,6,9]],"date-time":"2026-06-09T17:53:56Z","timestamp":1781027636000},"page":"2106-2117","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Provable Long-Range Benefits of Next-Token Prediction"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-1180-3198","authenticated-orcid":false,"given":"Xinyuan","family":"Cao","sequence":"first","affiliation":[{"name":"Georgia Institute of 
Technology, Atlanta, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3779-433X","authenticated-orcid":false,"given":"Santosh S.","family":"Vempala","sequence":"additional","affiliation":[{"name":"Georgia Institute of Technology, Atlanta, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2026,6,9]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al.","author":"Achiam Josh","year":"2023","unstructured":"Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.52202\/068431-0659"},{"key":"e_1_3_2_1_3_1","volume-title":"Hareesh Bahuleyan, and Jackie Chi Kit Cheung.","author":"Arora Kushal","year":"2022","unstructured":"Kushal Arora, Layla El Asri, Hareesh Bahuleyan, and Jackie Chi Kit Cheung. 2022. Why exposure bias matters: An imitation learning perspective of error accumulation in language generation. arXiv preprint arXiv:2204.01171."},{"key":"e_1_3_2_1_4_1","unstructured":"Gregor Bachmann and Vaishnavh Nagarajan. 2024. The pitfalls of next-token prediction. arXiv preprint arXiv:2403.06963."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1214\/aos\/1024691352"},{"key":"e_1_3_2_1_6_1","volume-title":"Scheduled sampling for sequence prediction with recurrent neural networks. Advances in neural information processing systems, 28","author":"Bengio Samy","year":"2015","unstructured":"Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. 
Advances in neural information processing systems, 28 (2015)."},{"key":"e_1_3_2_1_7_1","volume-title":"Adam Tauman Kalai, and Preetum Nakkiran","author":"Jaros\u0142","year":"2023","unstructured":"Jaros\u0142aw B\u0142asiok, Parikshit Gopalan, Lunjia Hu, Adam Tauman Kalai, and Preetum Nakkiran. 2023. Loss minimization yields multicalibration for large neural networks. arXiv preprint arXiv:2304.09424."},{"key":"e_1_3_2_1_8_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877\u20131901."},{"key":"e_1_3_2_1_9_1","volume-title":"Yuanzhi Li, Scott Lundberg, et al.","author":"Bubeck S\u00e9bastien","year":"2023","unstructured":"S\u00e9bastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712."},{"key":"e_1_3_2_1_10_1","unstructured":"Xinyuan Cao and Santosh S Vempala. 2025. Provable Long-Range Benefits of Next-Token Prediction. arXiv preprint arXiv:2512.07818."},{"key":"e_1_3_2_1_11_1","volume-title":"Elliot Catt, Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, et al.","author":"Del\u00e9tang Gr\u00e9goire","year":"2022","unstructured":"Gr\u00e9goire Del\u00e9tang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, et al. 2022. Neural networks and the chomsky hierarchy. arXiv preprint arXiv:2207.02098."},{"key":"e_1_3_2_1_12_1","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et al. 2024. The llama 3 herd of models. 
arXiv e-prints arXiv\u20132407."},{"key":"e_1_3_2_1_13_1","volume-title":"Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, et al.","author":"Dziri Nouha","year":"2024","unstructured":"Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, et al. 2024. Faith and fate: Limits of transformers on compositionality. Advances in Neural Information Processing Systems, 36 (2024)."},{"key":"e_1_3_2_1_14_1","unstructured":"Tao Fang Shu Yang Kaixin Lan Derek F Wong Jinpeng Hu Lidia S Chao and Yue Zhang. 2023. Is chatgpt a highly fluent grammatical error correction system? a comprehensive evaluation. arXiv preprint arXiv:2304.01746."},{"key":"e_1_3_2_1_15_1","volume-title":"Analysis of classifiers","author":"Fawzi Alhussein","year":"2018","unstructured":"Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. 2018. Analysis of classifiers\u2019 robustness to adversarial perturbations. Machine learning, 107, 3 (2018), 481\u2013508."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1006\/inco.1995.1136"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1006\/jcss.1997.1504"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1214\/aos\/1016218223"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611977912.98"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.52202\/068431-0532"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"crossref","unstructured":"Oded Goldreich et al. 2005. Foundations of cryptography\u2013a primer. Foundations and Trends\u00ae in Theoretical Computer Science 1 1 (2005) 1\u2013116.","DOI":"10.1561\/0400000001"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3422622"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"crossref","unstructured":"Tatsunori B Hashimoto Hugh Zhang and Percy Liang. 2019. 
Unifying human and statistical evaluation for natural language generation. arXiv preprint arXiv:1904.02792.","DOI":"10.18653\/v1\/N19-1169"},{"key":"e_1_3_2_1_24_1","volume-title":"International Conference on Machine Learning. 1939\u20131948","author":"H\u00e9bert-Johnson Ursula","year":"2018","unstructured":"Ursula H\u00e9bert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. 2018. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning. 1939\u20131948."},{"key":"e_1_3_2_1_25_1","unstructured":"Dan Hendrycks Collin Burns Steven Basart Andy Zou Mantas Mazeika Dawn Song and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300."},{"key":"e_1_3_2_1_26_1","unstructured":"Dan Hendrycks Collin Burns Saurav Kadavath Akul Arora Steven Basart Eric Tang Dawn Song and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874."},{"key":"e_1_3_2_1_27_1","volume-title":"Generative adversarial imitation learning. Advances in neural information processing systems, 29","author":"Ho Jonathan","year":"2016","unstructured":"Jonathan Ho and Stefano Ermon. 2016. Generative adversarial imitation learning. Advances in neural information processing systems, 29 (2016)."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/SFCS.1995.492584"},{"key":"e_1_3_2_1_29_1","unstructured":"Samy Jelassi David Brandfonbrener Sham M Kakade and Eran Malach. 2024. Repeat after me: Transformers are better than state space models at copying. arXiv preprint arXiv:2402.01032."},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00145-003-0051-5"},{"key":"e_1_3_2_1_31_1","first-page":"2005","article-title":"Distinguishing attack on MAG","volume":"1","author":"Kunzli S","year":"2005","unstructured":"S Kunzli and Willi Meier. 2005. Distinguishing attack on MAG. 
ECRYPT Stream Cipher Project Report, 1 (2005), 2005.","journal-title":"ECRYPT Stream Cipher Project Report"},{"key":"e_1_3_2_1_32_1","volume-title":"Boosting and maximum likelihood for exponential models. Advances in neural information processing systems, 14","author":"Lebanon Guy","year":"2001","unstructured":"Guy Lebanon and John Lafferty. 2001. Boosting and maximum likelihood for exponential models. Advances in neural information processing systems, 14 (2001)."},{"key":"e_1_3_2_1_33_1","volume-title":"On the Generalization Ability of Next-Token-Prediction Pretraining. In Forty-second International Conference on Machine Learning. 267","author":"Li Zhihao","year":"2025","unstructured":"Zhihao Li, Xue Jiang, Liyuan Liu, Xuelin Zhang, Hong Chen, and Feng Zheng. 2025. On the Generalization Ability of Next-Token-Prediction Pretraining. In Forty-second International Conference on Machine Learning. 267, 34943\u201334975."},{"key":"e_1_3_2_1_34_1","unstructured":"Eran Malach. 2023. Auto-regressive next-token predictors are universal learners. arXiv preprint arXiv:2309.06979."},{"key":"e_1_3_2_1_35_1","volume-title":"Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229.","author":"Mirzadeh Iman","year":"2024","unstructured":"Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2024. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229."},{"key":"e_1_3_2_1_36_1","volume-title":"Hiteshi Sharma, Nebojsa Jojic, Hamid Palangi, Robert Ness, and Jonathan Larson.","author":"Momennejad Ida","year":"2024","unstructured":"Ida Momennejad, Hosein Hasanbeig, Felipe Vieira Frujeri, Hiteshi Sharma, Nebojsa Jojic, Hamid Palangi, Robert Ness, and Jonathan Larson. 2024. Evaluating cognitive maps and planning in large language models with CogEval. 
Advances in Neural Information Processing Systems, 36 (2024)."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0022-0000(05)80043-1"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"crossref","unstructured":"Long Ouyang Jeffrey Wu Xu Jiang Diogo Almeida Carroll Wainwright Pamela Mishkin Chong Zhang Sandhini Agarwal Katarina Slama Alex Ray et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022) 27730\u201327744.","DOI":"10.52202\/068431-2011"},{"key":"e_1_3_2_1_39_1","unstructured":"Chengwen Qi Ren Ma Bowen Li He Du Binyuan Hui Jinwang Wu Yuanjun Laili and Conghui He. 2025. Large language models meet symbolic provers for logical reasoning evaluation. arXiv preprint arXiv:2502.06563."},{"key":"e_1_3_2_1_40_1","unstructured":"Jing Qian Hong Wang Zekun Li Shiyang Li and Xifeng Yan. 2022. Limitations of language models in arithmetic and symbolic induction. arXiv preprint arXiv:2208.05051."},{"key":"e_1_3_2_1_41_1","unstructured":"Marc\u2019Aurelio Ranzato Sumit Chopra Michael Auli and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732."},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/FOCS.2008.38"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF00116037"},{"key":"e_1_3_2_1_44_1","volume-title":"A mathematical theory of communication. The Bell system technical journal, 27, 3","author":"Shannon Claude Elwood","year":"1948","unstructured":"Claude Elwood Shannon. 1948. A mathematical theory of communication. The Bell system technical journal, 27, 3 (1948), 379\u2013423."},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1002\/j.1538-7305.1951.tb01366.x"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3188745.3188954"},{"key":"e_1_3_2_1_47_1","unstructured":"Buck Shlegeris Fabien Roger Lawrence Chan and Euan McLean. 2022. 
Language models are better than humans at next-token prediction. arXiv preprint arXiv:2212.11281."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CCC.2009.41"},{"key":"e_1_3_2_1_49_1","unstructured":"Xinlong Wang Xiaosong Zhang Zhengxiong Luo Quan Sun Yufeng Cui Jinsheng Wang Fan Zhang Yueze Wang Zhen Li Qiying Yu et al. 2024. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869."},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.58337"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/SFCS.1982.45"},{"key":"e_1_3_2_1_52_1","unstructured":"Wayne Xin Zhao Kun Zhou Junyi Li Tianyi Tang Xiaolei Wang Yupeng Hou Yingqian Min Beichen Zhang Junjie Zhang Zican Dong et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223."},{"key":"e_1_3_2_1_53_1","first-page":"44502","article-title":"Felm: Benchmarking factuality evaluation of large language models","volume":"36","author":"Zhao Yiran","year":"2023","unstructured":"Yiran Zhao, Jinghan Zhang, I Chern, Siyang Gao, Pengfei Liu, Junxian He, et al. 2023. Felm: Benchmarking factuality evaluation of large language models. 
Advances in Neural Information Processing Systems, 36 (2023), 44502\u201344523.","journal-title":"Advances in Neural Information Processing Systems"}],"event":{"name":"STOC '26: 58th Annual ACM Symposium on Theory of Computing","location":"Salt Lake City UT USA","acronym":"STOC '26","sponsor":["SIGACT ACM Special Interest Group on Algorithms and Computation Theory"]},"container-title":["Proceedings of the 58th Annual ACM Symposium on Theory of Computing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3798129.3800914","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3798129.3800914","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,6,9]],"date-time":"2026-06-09T17:54:20Z","timestamp":1781027660000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3798129.3800914"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,6,9]]},"references-count":53,"alternative-id":["10.1145\/3798129.3800914","10.1145\/3798129"],"URL":"https:\/\/doi.org\/10.1145\/3798129.3800914","relation":{},"subject":[],"published":{"date-parts":[[2026,6,9]]},"assertion":[{"value":"2026-06-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}