{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,22]],"date-time":"2026-07-22T22:16:56Z","timestamp":1784758616088,"version":"3.55.0"},"reference-count":355,"publisher":"Springer Science and Business Media LLC","issue":"12","license":[{"start":{"date-parts":[[2026,5,9]],"date-time":"2026-05-09T00:00:00Z","timestamp":1778284800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,5,9]],"date-time":"2026-05-09T00:00:00Z","timestamp":1778284800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Front. Comput. Sci."],"published-print":{"date-parts":[[2026,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for understanding their development, behavior, and societal impact. 
This survey systematically reviews recent advancements in LLM techniques across four key dimensions: (1)\n                    <jats:italic>pre-training<\/jats:italic>\n                    methodologies, which establish core model capabilities through large-scale self-supervised training, architectural innovations, and data curation strategies; (2)\n                    <jats:italic>post-training<\/jats:italic>\n                    techniques, including supervised fine-tuning and reinforcement learning, which adapt foundational models to downstream tasks and enhance their alignment and safety; (3)\n                    <jats:italic>utilization<\/jats:italic>\n                    strategies, such as in-context learning, prompt engineering, and agentic reasoning, that optimize real-world deployment and enable effective interaction with external environments; and (4)\n                    <jats:italic>evaluation<\/jats:italic>\n                    methods, encompassing benchmarks for key ability dimensions such as core language capabilities, reasoning, and safety, which support comprehensive and reliable assessment of model performance. Additionally, we identify critical research issues, including those concerning theoretical foundations, efficient scaling, alignment, and agentic capability, and highlight the open challenges they present. 
By synthesizing state-of-the-art insights and emerging trends, this survey aims to provide a systematic and comprehensive framework for understanding the trajectory, current limitations, and future directions of LLM progress.\n                  <\/jats:p>","DOI":"10.1007\/s11704-026-60308-3","type":"journal-article","created":{"date-parts":[[2026,5,9]],"date-time":"2026-05-09T07:19:54Z","timestamp":1778311194000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":111,"title":["A Survey of Large Language Models"],"prefix":"10.1007","volume":"20","author":[{"given":"Wayne Xin","family":"Zhao","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Kun","family":"Zhou","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Junyi","family":"Li","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Tianyi","family":"Tang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zican","family":"Dong","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yupeng","family":"Hou","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Beichen","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yingqian","family":"Min","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Junjie","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Peiyu","family":"Liu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xiaolei","family":"Wang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","r
ole":"author"}]},{"given":"Yifan","family":"Du","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chen","family":"Yang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yushuo","family":"Chen","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zhipeng","family":"Chen","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jinhao","family":"Jiang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ruiyang","family":"Ren","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yifan","family":"Li","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xinyu","family":"Tang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zikang","family":"Liu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yiwen","family":"Hu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jian-Yun","family":"Nie","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ji-Rong","family":"Wen","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2026,5,9]]},"reference":[{"issue":"5598","key":"60308_CR1","doi-asserted-by":"publisher","first-page":"1569","DOI":"10.1126\/science.298.5598.1569","volume":"298","author":"M D Hauser","year":"2002","unstructured":"Hauser M D, Chomsky N, Fitch W T. The faculty of language: what is it, who has it, and how did it evolve? 
Science, 2002, 298(5598): 1569\u20131579","journal-title":"Science"},{"issue":"236","key":"60308_CR2","doi-asserted-by":"publisher","first-page":"433","DOI":"10.1093\/mind\/LIX.236.433","volume":"59","author":"A M Turing","year":"1950","unstructured":"Turing A M. Computing machinery and intelligence. Mind, 1950, 59(236): 433\u2013460","journal-title":"Mind"},{"key":"60308_CR3","volume-title":"Statistical Methods for Speech Recognition","author":"F Jelinek","year":"1998","unstructured":"Jelinek F. Statistical Methods for Speech Recognition. Cambridge: MIT Press, 1998"},{"key":"60308_CR4","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-02130-5","volume-title":"Statistical Language Models for Information Retrieval","author":"C Zhai","year":"2009","unstructured":"Zhai C. Statistical Language Models for Information Retrieval. Springer Nature, See link.springer.com\/book\/10.1007\/978-3-031-02130-5 website, 2009"},{"key":"60308_CR5","first-page":"1137","volume":"3","author":"Y Bengio","year":"2003","unstructured":"Bengio Y, Ducharme R, Vincent P, Janvin C. A neural probabilistic language model. The Journal of Machine Learning Research, 2003, 3: 1137\u20131155","journal-title":"The Journal of Machine Learning Research"},{"key":"60308_CR6","first-page":"1045","volume-title":"Proceedings of the 11th Annual Conference of the International Speech Communication Association","author":"T Mikolov","year":"2010","unstructured":"Mikolov T, Karafi\u00e1t M, Burget L, Cernock\u00fd J, Khudanpur S. Recurrent neural network based language model. In: Proceedings of the 11th Annual Conference of the International Speech Communication Association. 2010, 1045\u20131048"},{"key":"60308_CR7","first-page":"2493","volume":"12","author":"R Collobert","year":"2011","unstructured":"Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P P. Natural language processing (almost) from scratch. 
The Journal of Machine Learning Research, 2011, 12: 2493\u20132537","journal-title":"The Journal of Machine Learning Research"},{"key":"60308_CR8","first-page":"3111","volume-title":"Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2","author":"T Mikolov","year":"2013","unstructured":"Mikolov T, Sutskever I, Chen K, Corrado G S, Dean J. Distributed representations of words and phrases and their compositionality. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. 2013, 3111\u20133119"},{"key":"60308_CR9","unstructured":"Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013"},{"key":"60308_CR10","first-page":"2227","volume-title":"Proceedings of 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)","author":"M E Peters","year":"2018","unstructured":"Peters M E, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L. Deep contextualized word representations. In: Proceedings of 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018, 2227\u20132237"},{"key":"60308_CR11","first-page":"4171","volume-title":"Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"J Devlin","year":"2019","unstructured":"Devlin J, Chang M W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 
2019, 4171\u20134186"},{"key":"60308_CR12","first-page":"6000","volume-title":"Proceedings of the 31st International Conference on Neural Information Processing Systems","author":"A Vaswani","year":"2017","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser \u0141, Polosukhin I. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 6000\u20136010"},{"key":"60308_CR13","first-page":"159","volume-title":"Proceedings of the 34th International Conference on Neural Information Processing Systems","author":"T B Brown","year":"2020","unstructured":"Brown T B, Mann B, Ryder N, Subbiah M, Kaplan J, et al. Language models are few-shot learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 159"},{"key":"60308_CR14","unstructured":"Kaplan J, McCandlish S, Henighan T, Brown T B, Chess B, Child R, Gray S, Radford A, Wu J, Amodei D. Scaling laws for neural language models. 2020, arXiv preprint arXiv: 2001.08361"},{"issue":"1","key":"60308_CR15","first-page":"240","volume":"24","author":"A Chowdhery","year":"2023","unstructured":"Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, et al. PaLM: scaling language modeling with pathways. The Journal of Machine Learning Research, 2023, 24(1): 240","journal-title":"The Journal of Machine Learning Research"},{"key":"60308_CR16","volume-title":"Improving language understanding by generative pre-training","author":"A Radford","year":"2018","unstructured":"Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. 
See cdn.openai.com\/research-covers\/language-unsupervised\/language_understanding_paper.pdf website, 2018"},{"key":"60308_CR17","first-page":"1800","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"J Wei","year":"2022","unstructured":"Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, Le Q, Chi E D, Le Q V, Zhou D. Chain-of-thought prompting elicits reasoning in large language models. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 1800"},{"key":"60308_CR18","first-page":"2176","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"J Hoffmann","year":"2022","unstructured":"Hoffmann J, Borgeaud S, Mensch A, Buchatskaya E, Cai T, Rutherford E, De Las Casas D, Hendricks L A, Welbl J, Clark A, Hennigan T, Noland E, Millican K, Van Den Driessche G, Damoc B, Guy A, Osindero S, Simonyan K, Elsen E, Vinyals O, Rae J W, Sifre L. Training compute-optimal large language models. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 2176"},{"key":"60308_CR19","unstructured":"Zhao W X, Zhou K, Li J, Tang T, Wang X, Hou Y, Min Y, Zhang B, Zhang J, Dong Z, Yang C, Chen Y, Chen Z, Jiang J, Ren R, Li Y, Tang X, Liu Z, Liu P, Nie J Y, Wen J R. A survey of large language models. 2023, arXiv preprint arXiv: 2303.18223"},{"key":"60308_CR20","doi-asserted-by":"publisher","DOI":"10.1007\/978-981-96-6259-3","volume-title":"Large language models","author":"W X Zhao","year":"2026","unstructured":"Zhao W X, Zhou K, Li J, Tang T, Wen J R. Large language models. Springer Nature, See link.springer.com\/book\/10.1007\/978-981-96-6259-3 website, 2026"},{"issue":"1","key":"60308_CR21","first-page":"140","volume":"21","author":"C Raffel","year":"2020","unstructured":"Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu P J. 
Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020, 21(1): 140","journal-title":"The Journal of Machine Learning Research"},{"key":"60308_CR22","doi-asserted-by":"publisher","first-page":"7871","DOI":"10.18653\/v1\/2020.acl-main.703","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"M Lewis","year":"2020","unstructured":"Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, Stoyanov V, Zettlemoyer L. BART: denoising sequence-to-sequence pretraining for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, 7871\u20137880"},{"key":"60308_CR23","volume-title":"Proceedings of the 11th International Conference on Learning Representations","author":"A Zeng","year":"2022","unstructured":"Zeng A, Liu X, Du Z, Wang Z, Lai H, et al. GLM-130B: an open bilingual pre-trained model. In: Proceedings of the 11th International Conference on Learning Representations. 2022"},{"key":"60308_CR24","doi-asserted-by":"publisher","first-page":"1471","DOI":"10.18653\/v1\/2023.emnlp-main.91","volume-title":"Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing","author":"Y Tay","year":"2023","unstructured":"Tay Y, Wei J, Chung H W, Tran V Q, So D R, Shakeri S, Garcia X, Zheng H S, Rao J, Chowdhery A, Zhou D, Metzler D, Petrov S, Houlsby N, Le Q, Dehghani M. Transcending scaling laws with 0.1% extra compute. In: Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 1471\u20131486"},{"key":"60308_CR25","first-page":"2011","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"L Ouyang","year":"2022","unstructured":"Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C L, et al. 
Training language models to follow instructions with human feedback. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 2011"},{"key":"60308_CR26","unstructured":"Yang A, Yang B, Zhang B, Hui B, Zheng B, et al. Qwen2.5 technical report. 2024, arXiv preprint arXiv: 2412.15115"},{"key":"60308_CR27","unstructured":"Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M A, Lacroix T, Rozi\u00e8re B, Goyal N, Hambro E, Azhar F, Rodriguez A, Joulin A, Grave E, Lample G. LLaMA: Open and efficient foundation language models. 2023, arXiv preprint arXiv: 2302.13971"},{"key":"60308_CR28","volume-title":"Introducing openai o1","author":"OpenAI","year":"2024","unstructured":"OpenAI. Introducing openai o1, See openai.com\/o1 website, 2024"},{"key":"60308_CR29","unstructured":"Guo D, Yang D, Zhang H, Song J, Wang P, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. 2025, arXiv preprint arXiv: 2501.12948"},{"key":"60308_CR30","unstructured":"Shao Z, Wang P, Zhu Q, Xu R, Song J, Bi X, Zhang H, Zhang M, Li Y K, Wu Y, Guo D. DeepSeekMath: pushing the limits of mathematical reasoning in open language models. 2024, arXiv preprint arXiv: 2402.03300"},{"key":"60308_CR31","unstructured":"Wei J, Tay Y, Bommasani R, Raffel C, Zoph B, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022"},{"key":"60308_CR32","first-page":"2425","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"R Schaeffer","year":"2023","unstructured":"Schaeffer R, Miranda B, Koyejo S. Are emergent abilities of large language models a mirage? In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 
2023, 2425"},{"key":"60308_CR33","first-page":"1735","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"A Rogers","year":"2024","unstructured":"Rogers A, Luccioni A S. Position: key claims in LLM research have a long tail of footnotes. In: Proceedings of the 41st International Conference on Machine Learning. 2024, 1735"},{"key":"60308_CR34","first-page":"970","volume-title":"Proceedings of the 38th International Conference on Neural Information Processing Systems","author":"G Penedo","year":"2024","unstructured":"Penedo G, Kydl\u00ed\u010dek H, Allal L B, Lozhkov A, Mitchell M, Raffel C A, Von Werra L, Wolf T. The FineWeb datasets: decanting the web for the finest text data at scale. In: Proceedings of the 38th International Conference on Neural Information Processing Systems. 2024, 970"},{"key":"60308_CR35","unstructured":"Gao L, Biderman S, Black S, Golding L, Hoppe T, Foster C, Phang J, He H, Thite A, Nabeshima N, Presser S, Leahy C. The pile: An 800GB dataset of diverse text for language modeling. 2021, arXiv preprint arXiv: 2101.00027"},{"key":"60308_CR36","doi-asserted-by":"publisher","first-page":"66","DOI":"10.1109\/JCDL57899.2023.00020","volume-title":"Proceedings of 2023 ACM\/IEEE Joint Conference on Digital Libraries (JCDL)","author":"T Saier","year":"2023","unstructured":"Saier T, Krause J, F\u00e4rber M. unarXive 2022: all arXiv publications pre-processed for NLP, including structured full-text and citation network. In: Proceedings of 2023 ACM\/IEEE Joint Conference on Digital Libraries (JCDL). 2023, 66\u201370"},{"key":"60308_CR37","unstructured":"Chen M, Tworek J, Jun H, Yuan Q, De Oliveira Pinto H P, et al. Evaluating large language models trained on code. 2021, arXiv preprint arXiv: 2107.03374"},{"key":"60308_CR38","unstructured":"Austin J, Odena A, Nye M, Bosma M, Michalewski H, Dohan D, Jiang E, Cai C, Terry M, Le Q, Sutton C. Program synthesis with large language models. 
2021, arXiv preprint arXiv: 2108.07732"},{"key":"60308_CR39","unstructured":"Grattafiori A, Dubey A, Jauhri A, Pandey A, Kadian A, et al. The llama 3 herd of models. 2024, arXiv preprint arXiv: 2407.21783"},{"key":"60308_CR40","unstructured":"Liu A, Feng B, Xue B, Wang B, Wu B, et al. DeepSeek-v3 technical report. 2024, arXiv preprint arXiv: 2412.19437"},{"key":"60308_CR41","first-page":"5374","volume-title":"Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Y Hu","year":"2025","unstructured":"Hu Y, Song H, Chen J, Deng J, Wang J, Zhou K, Zhu Y, Jiang J, Dong Z, Lu Y, Miao X, Zhao W X, Wen J R. YuLan-Mini: Pushing the limits of open data-efficient language model. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025, 5374\u20135400"},{"key":"60308_CR42","doi-asserted-by":"publisher","first-page":"1715","DOI":"10.18653\/v1\/P16-1162","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"R Sennrich","year":"2016","unstructured":"Sennrich R, Haddow B, Birch A. Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016, 1715\u20131725"},{"key":"60308_CR43","unstructured":"Singh A K, Strouse D. Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs. 2024, arXiv preprint arXiv: 2402.14903"},{"key":"60308_CR44","unstructured":"Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, et al. Gpt-4 technical report. 2023, arXiv preprint arXiv: 2303.08774"},{"key":"60308_CR45","unstructured":"Team G, Riviere M, Pathak S, Sessa P G, Hardin C, et al. Gemma 2: improving open language models at a practical size. 
2024, arXiv preprint arXiv: 2408.00118"},{"key":"60308_CR46","first-page":"3059","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"S M Xie","year":"2023","unstructured":"Xie S M, Pham H, Dong X, Du N, Liu H, Lu Y, Liang P, Le Q V, Ma T, Yu A W. DoReMi: optimizing data mixtures speeds up language model pretraining. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 3059"},{"key":"60308_CR47","first-page":"1482","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"S M Xie","year":"2023","unstructured":"Xie S M, Santurkar S, Ma T, Liang P. Data selection for language models via importance resampling. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 1482"},{"key":"60308_CR48","volume-title":"Proceedings of the 13th International Conference on Learning Representations","author":"J Ye","year":"2025","unstructured":"Ye J, Liu P, Sun T, Zhan J, Zhou Y, Qiu X. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. In: Proceedings of the 13th International Conference on Learning Representations. 2025"},{"key":"60308_CR49","unstructured":"Wang J, Tian C, Chen K, Liu Z, Mao J, Zhao W X, Zhang Z, Zhou J. MergeMix: optimizing mid-training data mixtures via learnable model merging. 2026, arXiv preprint arXiv: 2601.17858"},{"key":"60308_CR50","volume-title":"Proceedings of the 13th International Conference on Learning Representations","author":"J T Wang","year":"2025","unstructured":"Wang J T, Mittal P, Song D, Jia R. Data Shapley in one training run. In: Proceedings of the 13th International Conference on Learning Representations. 2025"},{"key":"60308_CR51","unstructured":"Wang S, Ouyang X, Xu T, Hu Y Z, Liu J, Chen G, Zhang T, Zheng J, Yang K, Ren X, Liu D, Zhang L. 
OPUS: towards efficient and principled data selection in large language model pre-training in every iteration. 2025, arXiv preprint arXiv: 2602.05400"},{"key":"60308_CR52","unstructured":"Hu S, Tu Y, Han X, He C, Cui G, et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. 2024, arXiv preprint arXiv: 2404.06395"},{"key":"60308_CR53","first-page":"14125","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"Y Fu","year":"2024","unstructured":"Fu Y, Panda R, Niu X, Yue X, Hajishirzi H, Kim Y, Peng H. Data engineering for scaling language models to 128k context. In: Proceedings of the 41st International Conference on Machine Learning. 2024, 14125\u201314134"},{"key":"60308_CR54","doi-asserted-by":"publisher","first-page":"7376","DOI":"10.18653\/v1\/2025.acl-long.366","volume-title":"Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"T Gao","year":"2025","unstructured":"Gao T, Wettig A, Yen H, Chen D. How to train long-context language models (effectively). In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025, 7376\u20137399"},{"key":"60308_CR55","doi-asserted-by":"publisher","first-page":"5709","DOI":"10.18653\/v1\/2024.findings-emnlp.327","volume-title":"Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024","author":"X Liu","year":"2024","unstructured":"Liu X, Lv K, Guo Q, Yan H, He C, Qiu X, Lin D. LongWanjuan: Towards systematic measurement for long text quality. In: Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024. 2024, 5709\u20135725"},{"key":"60308_CR56","volume-title":"Proceedings of Conference Paper at ICLR 2026","author":"H Deng","year":"2026","unstructured":"Deng H, Lin Y, Lin Z, Liu X, Sun Y, Ma Y A, Gong Y. 
Beyond length: Quantifying long-range information for long-context LLM pretraining data. In: Proceedings of Conference Paper at ICLR 2026. 2026"},{"key":"60308_CR57","volume-title":"Qwen3: think deeper, act faster","author":"Team Q","year":"2025","unstructured":"Team Q. Qwen3: think deeper, act faster. See qwen.ai\/blog?id=qwen3 website, 2025"},{"key":"60308_CR58","first-page":"2191","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"N Muennighoff","year":"2023","unstructured":"Muennighoff N, Rush A M, Barak B, Scao T L, Piktus A, Tazi N, Pyysalo S, Wolf T, Raffel C. Scaling data-constrained language models. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 2191"},{"key":"60308_CR59","volume-title":"Proceedings of the 13th International Conference on Learning Representations","author":"E Dohmatob","year":"2025","unstructured":"Dohmatob E, Feng Y, Subramonian A, Kempe J. Strong model collapse. In: Proceedings of the 13th International Conference on Learning Representations. 2025"},{"key":"60308_CR60","volume-title":"Google AI","author":"D Silver","year":"2025","unstructured":"Silver D, Sutton R S. Welcome to the era of experience. Google AI, See \/\/storage.googleapis.com\/deepmind-media\/Era-of-Experience%20\/The%20Era%20of%20Experience%20Paper.pdf website, 2025"},{"key":"60308_CR61","unstructured":"Rae J W, Borgeaud S, Cai T, Millican K, Hoffmann J, et al. Scaling language models: methods, analysis & insights from training gopher. 2021, arXiv preprint arXiv: 2112.11446"},{"key":"60308_CR62","unstructured":"Hernandez D, Brown T B, Conerly T, DasSarma N, Drain D, El-Showk S, Elhage N, Hatfield-Dodds Z, Henighan T, Johnston S, Mann B, Olah C, Olsson C, Amodei D, Joseph N, Kaplan J, McCandlish S. Scaling laws and interpretability of learning from repeated data. 
2022, arXiv preprint arXiv: 2205.10487"},{"issue":"3","key":"60308_CR63","doi-asserted-by":"publisher","first-page":"1097","DOI":"10.1162\/coli_a_00524","volume":"50","author":"I O Gallegos","year":"2024","unstructured":"Gallegos I O, Rossi R A, Barrow J, Tanjim M M, Kim S, Dernoncourt F, Yu T, Zhang R Y, Ahmed N K. Bias and fairness in large language models: a survey. Computational Linguistics, 2024, 50(3): 1097\u20131179","journal-title":"Computational Linguistics"},{"issue":"4","key":"60308_CR64","doi-asserted-by":"publisher","first-page":"79","DOI":"10.1145\/3628602","volume":"18","author":"M Bozdag","year":"2024","unstructured":"Bozdag M, Sevim N, Ko\u00e7 A. Measuring and mitigating gender bias in legal contextualized language models. ACM Transactions on Knowledge Discovery from Data, 2024, 18(4): 79","journal-title":"ACM Transactions on Knowledge Discovery from Data"},{"key":"60308_CR65","doi-asserted-by":"publisher","first-page":"4895","DOI":"10.18653\/v1\/2023.emnlp-main.298","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"J Ainslie","year":"2023","unstructured":"Ainslie J, Lee-Thorp J, De Jong M, Zemlyanskiy Y, Lebron F, Sanghai S. GQA: training generalized multi-query transformer models from multi-head checkpoints. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 4895\u20134901"},{"key":"60308_CR66","unstructured":"Shazeer N. Fast transformer decoding: one write-head is all you need. 2019, arXiv preprint arXiv: 1911.02150"},{"key":"60308_CR67","unstructured":"Liu A, Feng B, Wang B, Wang B, Liu B, et al. DeepSeek-v2: a strong, economical, and efficient mixture-of-experts language model. 2024, arXiv preprint arXiv: 2405.04434"},{"key":"60308_CR68","unstructured":"Beltagy I, Peters M E, Cohan A. Longformer: the long-document transformer. 
2020, arXiv preprint arXiv: 2004.05150"},{"key":"60308_CR69","unstructured":"Lu E, Jiang Z, Liu J, Du Y, Jiang T, et al. MoBA: mixture of block attention for long-context LLMs. 2025, arXiv preprint arXiv: 2502.13189"},{"key":"60308_CR70","doi-asserted-by":"publisher","first-page":"23078","DOI":"10.18653\/v1\/2025.acl-long.1126","volume-title":"Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"J Yuan","year":"2025","unstructured":"Yuan J, Gao H, Dai D, Luo J, Zhao L, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025, 23078\u201323097"},{"key":"60308_CR71","first-page":"5156","volume-title":"Proceedings of the 37th International Conference on Machine Learning","author":"A Katharopoulos","year":"2020","unstructured":"Katharopoulos A, Vyas A, Pappas N, Fleuret F. Transformers are RNNs: fast autoregressive transformers with linear attention. In: Proceedings of the 37th International Conference on Machine Learning. 2020, 5156\u20135165"},{"key":"60308_CR72","unstructured":"Gu A, Dao T. Mamba: linear-time sequence modeling with selective state spaces. 2023, arXiv preprint arXiv: 2312.00752"},{"key":"60308_CR73","first-page":"9355","volume-title":"Proceedings of the 38th International Conference on Machine Learning","author":"I Schlag","year":"2021","unstructured":"Schlag I, Irie K, Schmidhuber J. Linear transformers are secretly fast weight programmers. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 9355\u20139366"},{"key":"60308_CR74","unstructured":"Su J, Lu Y, Pan S, Murtadha A, Wen B, Liu Y. RoFormer: enhanced transformer with rotary position embedding. 
2021, arXiv preprint arXiv: 2104.09864"},{"key":"60308_CR75","volume-title":"Proceedings of the 10th International Conference on Learning Representations","author":"O Press","year":"2022","unstructured":"Press O, Smith N A, Lewis M. Train short, test long: attention with linear biases enables input length extrapolation. In: Proceedings of the 10th International Conference on Learning Representations. 2022"},{"key":"60308_CR76","unstructured":"Golovneva O, Wang T, Weston J, Sukhbaatar S. Contextual position encoding: learning to count what\u2019s important. 2024, arXiv preprint arXiv: 2405.18719"},{"key":"60308_CR77","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"G Xiao","year":"2024","unstructured":"Xiao G, Tian Y, Chen B, Han S, Lewis M. Efficient streaming language models with attention sinks. In: Proceedings of the 12th International Conference on Learning Representations. 2024"},{"key":"60308_CR78","unstructured":"Liu A, Mei A, Lin B, Xue B, Wang B, et al. DeepSeek-v3.2: Pushing the frontier of open large language models. 2025, arXiv preprint arXiv: 2512.02556"},{"key":"60308_CR79","unstructured":"Gao Y, Wei J, Zhang Q, Cheng Y, Chen S, Tang Z, Jiang Z, Song Y, Zhang H, Zhao L, Yang B, Wang G, Cao S, Luo F. HySparse: a hybrid sparse attention architecture with oracle token selection and KV cache sharing. 2026, arXiv preprint arXiv: 2602.03560"},{"key":"60308_CR80","unstructured":"Zhao W, Zhou Z, Su Z, Xiao C, Li Y, Li Y, Zhang Y, Zhao W, Li Z, Huang Y, Sun A, Han X, Liu Z. InfLLM-V2: dense-sparse switchable attention for seamless short-to-long adaptation. 2025, arXiv preprint arXiv: 2509.24663"},{"key":"60308_CR81","first-page":"2758","volume-title":"Proceedings of the 38th International Conference on Neural Information Processing Systems","author":"W Brandon","year":"2024","unstructured":"Brandon W, Mishra M, Nrusimha A, Panda R, Ragan-Kelley J. 
Reducing transformer key-value cache size with cross-layer attention. In: Proceedings of the 38th International Conference on Neural Information Processing Systems. 2024, 2758"},{"key":"60308_CR82","first-page":"5531","volume-title":"Findings of the Association for Computational Linguistics: NAACL 2025","author":"Z M K Zuhri","year":"2025","unstructured":"Zuhri Z M K, Adilazuarda M F, Purwarianti A, Aji A F. MLKV: multi-layer key-value heads for memory efficient transformer decoding. In: Findings of the Association for Computational Linguistics: NAACL 2025. 2025, 5531\u20135540"},{"key":"60308_CR83","unstructured":"Zhang Y, Liu Y, Yuan H, Qin Z, Yuan Y, Gu Q, Yao A C C. Tensor product attention is all you need. 2025, arXiv preprint arXiv: 2501.06425"},{"key":"60308_CR84","doi-asserted-by":"publisher","first-page":"25114","DOI":"10.18653\/v1\/2025.findings-acl.1288","volume-title":"Findings of the Association for Computational Linguistics: ACL 2025","author":"J Hu","year":"2025","unstructured":"Hu J, Li H, Zhang Y, Wang Z, Zhou S, Zhang X, Shum H Y. Multi-matrix factorization attention. In: Findings of the Association for Computational Linguistics: ACL 2025. 2025, 25114\u201325126"},{"key":"60308_CR85","unstructured":"Sun Y, Dong L, Huang S, Ma S, Xia Y, Xue J, Wang J, Wei F. Retentive network: a successor to transformer for large language models. 2023, arXiv preprint arXiv: 2307.08621"},{"key":"60308_CR86","first-page":"41517","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"Z Qin","year":"2024","unstructured":"Qin Z, Sun W, Li D, Shen X, Sun W, Zhong Y. Various lengths, constant speed: Efficient language modeling with lightning attention. In: Proceedings of the 41st International Conference on Machine Learning. 2024, 41517\u201341535"},{"key":"60308_CR87","unstructured":"Qin Z, Sun W, Li D, Shen X, Sun W, Zhong Y. Lightning attention-2: a free lunch for handling unlimited sequence lengths in large language models. 
2024, arXiv preprint arXiv: 2401.04658"},{"key":"60308_CR88","doi-asserted-by":"publisher","first-page":"14048","DOI":"10.18653\/v1\/2023.findings-emnlp.936","volume-title":"Proceedings of Findings of the Association for Computational Linguistics: EMNLP 2023","author":"B Peng","year":"2023","unstructured":"Peng B, Alcaide E, Anthony Q, Albalak A, Arcadinho S, Biderman S, Cao H, Cheng X, Chung M, Derczynski L, Du X, Grella M, GV K K, He X, Hou H, Kazienko P, Kocon J, Kong J, Koptyra B, Lau H, Lin J, Mantri K S I, Mom F, Saito A, Song G, Tang X, Wind J S, Wozniak S, Zhang Z, Zhou Q, Zhu J, Zhu R J. RWKV: reinventing RNNs for the transformer era. In: Proceedings of Findings of the Association for Computational Linguistics: EMNLP 2023. 2023, 14048\u201314077"},{"key":"60308_CR89","first-page":"10041","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"T Dao","year":"2024","unstructured":"Dao T, Gu A. Transformers are SSMs: generalized models and efficient algorithms through structured state space duality. In: Proceedings of the 41st International Conference on Machine Learning. 2024, 10041\u201310071"},{"key":"60308_CR90","first-page":"3668","volume-title":"Proceedings of the 38th International Conference on Neural Information Processing Systems","author":"S Yang","year":"2024","unstructured":"Yang S, Wang B, Zhang Y, Shen Y, Kim Y. Parallelizing linear transformers with the delta rule over sequence length. In: Proceedings of the 38th International Conference on Neural Information Processing Systems. 2024, 3668"},{"key":"60308_CR91","volume-title":"Proceedings of the 13th International Conference on Learning Representations","author":"S Yang","year":"2025","unstructured":"Yang S, Kautz J, Hatamizadeh A. Gated delta networks: Improving mamba2 with delta rule. In: Proceedings of the 13th International Conference on Learning Representations. 2025"},{"key":"60308_CR92","unstructured":"Team K, Zhang Y, Lin Z, Yao X, Hu J, et al. 
Kimi linear: an expressive, efficient attention architecture. 2025, arXiv preprint arXiv: 2510.26692"},{"key":"60308_CR93","unstructured":"Peng B, Zhang R, Goldstein D, Alcaide E, Du X, Hou H, Lin J, Liu J, Lu J, Merrill W, Song G, Tan K, Utpala S, Wilce N, Wind J S, Wu T, Wuttke D, Zhou-Zheng C. RWKV-7 \u201cgoose\u201d with expressive dynamic state evolution. 2025, arXiv preprint arXiv: 2503.14456"},{"key":"60308_CR94","unstructured":"Li A, Gong B, Yang B, Shan B, Liu C, et al. Minimax-01: scaling foundation models with lightning attention. 2025, arXiv preprint arXiv: 2501.08313"},{"key":"60308_CR95","volume-title":"Qwen3-Next: towards ultimate training & inference efficiency","author":"Team Q","year":"2025","unstructured":"Team Q. Qwen3-Next: towards ultimate training & inference efficiency. See qwen.ai\/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd website, 2025"},{"key":"60308_CR96","first-page":"57503","volume-title":"Proceedings of the 42nd International Conference on Machine Learning","author":"Y Sun","year":"2025","unstructured":"Sun Y, Li X, Dalal K, Xu J, Vikram A, et al. Learning to (learn at test time): RNNs with expressive hidden states. In: Proceedings of the 42nd International Conference on Machine Learning. 2025, 57503\u201357522"},{"key":"60308_CR97","unstructured":"Behrouz A, Zhong P, Mirrokni V. Titans: Learning to memorize at test time. 2024, arXiv preprint arXiv: 2501.00663"},{"key":"60308_CR98","unstructured":"Zhang T, Bi S, Hong Y, Zhang K, Luan F, Yang S, Sunkavalli K, Freeman W T, Tan H. Test-time training done right. 2025, arXiv preprint arXiv: 2505.23884"},{"key":"60308_CR99","unstructured":"Qiu Z, Wang Z, Zheng B, Huang Z, Wen K, Yang S, Men R, Yu L, Huang F, Huang S, Liu D, Zhou J, Lin J. Gated attention for large language models: non-linearity, sparsity, and attention-sink-free. 2025, arXiv preprint arXiv: 2505.06708"},{"key":"60308_CR100","unstructured":"Agarwal S, Ahmad L, Ai J, Altman S, Applebaum A, et al. 
gpt-oss-120b & gpt-oss-20b model card. 2025, arXiv preprint arXiv: 2508.10925"},{"key":"60308_CR101","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"B Peng","year":"2024","unstructured":"Peng B, Quesnelle J, Fan H, Shippole E. YaRN: efficient context window extension of large language models. In: Proceedings of the 12th International Conference on Learning Representations. 2024"},{"key":"60308_CR102","unstructured":"Ye T, Dong L, Xia Y, Sun Y, Zhu Y, Huang G, Wei F. Differential transformer. 2024, arXiv preprint arXiv: 2410.05258"},{"key":"60308_CR103","unstructured":"Chen S, Wong S, Chen L, Tian Y. Extending context window of large language models via positional interpolation. 2023, arXiv preprint arXiv: 2306.15595"},{"key":"60308_CR104","first-page":"22099","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"H Jin","year":"2024","unstructured":"Jin H, Han X, Yang J, Jiang Z, Liu Z, Chang C Y, Chen H, Hu X. LLM maybe longLM: SelfExtend LLM context window without tuning. In: Proceedings of the 41st International Conference on Machine Learning. 2024, 22099\u201322114"},{"key":"60308_CR105","first-page":"4643","volume-title":"Proceedings of 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)","author":"W Xiong","year":"2024","unstructured":"Xiong W, Liu J, Molybog I, Zhang H, Bhargava P, et al. Effective long-context scaling of foundation models. In: Proceedings of 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024, 4643\u20134663"},{"key":"60308_CR106","first-page":"11091","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"Y Ding","year":"2024","unstructured":"Ding Y, Zhang L L, Zhang C, Xu Y, Shang N, Xu J, Yang F, Yang M. 
LongRoPE: Extending LLM context window beyond 2 million tokens. In: Proceedings of the 41st International Conference on Machine Learning. 2024, 11091\u201311104"},{"key":"60308_CR107","first-page":"610","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"T C Chi","year":"2022","unstructured":"Chi T C, Fan T H, Ramadge P J, Rudnicky A I. KERPLE: kernelized relative positional embedding for length extrapolation. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 610"},{"key":"60308_CR108","unstructured":"Scao T L, Fan A, Akiki C, Pavlick E, Ili\u0107 S, et al. BLOOM: a 176B-parameter open-access multilingual language model. 2022, arXiv preprint arXiv: 2211.05100"},{"key":"60308_CR109","doi-asserted-by":"publisher","first-page":"13522","DOI":"10.18653\/v1\/2023.acl-long.756","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"T C Chi","year":"2023","unstructured":"Chi T C, Fan T H, Rudnicky A, Ramadge P J. Dissecting transformer length extrapolation via the lens of receptive field analysis. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023, 13522\u201313537"},{"key":"60308_CR110","first-page":"80155","volume-title":"Proceedings of the 42nd International Conference on Machine Learning","author":"J Zhu","year":"2025","unstructured":"Zhu J, Wang P, Cai R, Lee J D, Li P, Wang Z. Rethinking addressing in language models via contextualized equivariant positional encoding. In: Proceedings of the 42nd International Conference on Machine Learning. 2025, 80155\u201380186"},{"key":"60308_CR111","unstructured":"Jiang A Q, Sablayrolles A, Roux A, Mensch A, et al. Mixtral of experts. 
2024, arXiv preprint arXiv: 2401.04088"},{"key":"60308_CR112","doi-asserted-by":"publisher","first-page":"12883","DOI":"10.18653\/v1\/2024.acl-long.696","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Q Huang","year":"2024","unstructured":"Huang Q, An Z, Zhuang N, Tao M, Zhang C, Jin Y, Xu K, Xu K, Chen L, Huang S, Feng Y. Harder task needs more experts: dynamic routing in MoE models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024, 12883\u201312895"},{"key":"60308_CR113","unstructured":"Wang L, Gao H, Zhao C, Sun X, Dai D. Auxiliary-loss-free load balancing strategy for mixture-of-experts. 2024, arXiv preprint arXiv: 2408.15664"},{"key":"60308_CR114","first-page":"18332","volume-title":"Proceedings of International Conference on Machine Learning","author":"S Rajbhandari","year":"2022","unstructured":"Rajbhandari S, Li C, Yao Z, Zhang M, Aminabadi R Y, Awan A A, Rasley J, He Y. DeepSpeed-MoE: advancing mixture-of-experts inference and training to power next-generation AI scale. In: Proceedings of International Conference on Machine Learning. 2022, 18332\u201318346"},{"key":"60308_CR115","doi-asserted-by":"crossref","unstructured":"Cai W, Jiang J, Wang F, Tang J, Kim S, Huang J. A survey on mixture of experts. 2024, arXiv preprint arXiv: 2407.06204v2","DOI":"10.36227\/techrxiv.172055626.64129172\/v1"},{"key":"60308_CR116","doi-asserted-by":"publisher","first-page":"5005","DOI":"10.18653\/v1\/2025.acl-long.249","volume-title":"Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Z Qiu","year":"2025","unstructured":"Qiu Z, Huang Z, Zheng B, Wen K, Wang Z, Men R, Titov I, Liu D, Zhou J, Lin J. Demons in the detail: on implementing load balancing loss for training specialized mixture-of-expert models. 
In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025, 5005\u20135018"},{"key":"60308_CR117","volume-title":"Muon: an optimizer for hidden layers in neural networks","author":"K Jordan","year":"2024","unstructured":"Jordan K, Jin Y, Boza V, You J, Cesista F, Newhouse L, Bernstein J. Muon: an optimizer for hidden layers in neural networks. See \/kellerjordan.github.io\/posts\/muon\/ website, 2024"},{"key":"60308_CR118","unstructured":"Xie T, Luo H, Tang H, Hu Y, Liu J K, Ren Q, Wang Y, Zhao W X, Yan R, Su B, Luo C, Guo B. Controlled LLM training on spectral sphere. 2026, arXiv preprint arXiv: 2601.08393"},{"key":"60308_CR119","doi-asserted-by":"publisher","first-page":"770","DOI":"10.1109\/CVPR.2016.90","volume-title":"Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","author":"K He","year":"2016","unstructured":"He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, 770\u2013778"},{"key":"60308_CR120","volume-title":"Proceedings of the 7th International Conference on Learning Representations","author":"A Baevski","year":"2019","unstructured":"Baevski A, Auli M. Adaptive input representations for neural language modeling. In: Proceedings of the 7th International Conference on Learning Representations. 2019"},{"key":"60308_CR121","unstructured":"Takase S, Kiyono S, Kobayashi S, Suzuki J. Spike no more: stabilizing the pre-training of large language models. 2023, arXiv preprint arXiv: 2312.16903"},{"key":"60308_CR122","first-page":"1516","volume-title":"Proceedings of the 35th International Conference on Neural Information Processing Systems","author":"M Ding","year":"2021","unstructured":"Ding M, Yang Z, Hong W, Zheng W, Zhou C, Yin D, Lin J, Zou X, Shao Z, Yang H, Tang J. CogView: mastering text-to-image generation via transformers. 
In: Proceedings of the 35th International Conference on Neural Information Processing Systems. 2021, 1516"},{"issue":"10","key":"60308_CR123","doi-asserted-by":"publisher","first-page":"6761","DOI":"10.1109\/TPAMI.2024.3386927","volume":"46","author":"H Wang","year":"2024","unstructured":"Wang H, Ma S, Dong L, Huang S, Zhang D, Wei F. DeepNet: scaling transformers to 1,000 layers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(10): 6761\u20136774","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"60308_CR124","volume-title":"Proceedings of the 13th International Conference on Learning Representations","author":"D Zhu","year":"2025","unstructured":"Zhu D, Huang H, Huang Z, Zeng Y, Mao Y, Wu B, Min Q, Zhou X. Hyper-connections. In: Proceedings of the 13th International Conference on Learning Representations. 2025"},{"key":"60308_CR125","unstructured":"Xie Z, Wei Y, Cao H, Zhao C, Deng C, Li J, Dai D, Gao H, Chang J, Yu K, Zhao L, Zhou S, Xu Z, Zhang Z, Zeng W, Hu S, Wang Y, Yuan J, Wang L, Liang W. mHC: manifold-constrained hyperconnections. 2025, arXiv preprint arXiv: 2512.24880"},{"key":"60308_CR126","volume-title":"Proceedings of the 13th International Conference on Learning Representations","author":"Z Huang","year":"2025","unstructured":"Huang Z, Min Q, Huang H, Zeng Y, Zhu D, Guo R, Zhou X. Ultra-sparse memory network. In: Proceedings of the 13th International Conference on Learning Representations. 2025"},{"key":"60308_CR127","volume-title":"Proceedings of the 42nd International Conference on Machine Learning","author":"V Berges","year":"2025","unstructured":"Berges V, Oguz B, Haziza D, Yih W T, Zettlemoyer L, Ghosh G. Memory layers at scale. In: Proceedings of the 42nd International Conference on Machine Learning. 
2025"},{"key":"60308_CR128","first-page":"767","volume-title":"Proceedings of the 33rd International Conference on Neural Information Processing Systems","author":"G Lample","year":"2019","unstructured":"Lample G, Sablayrolles A, Ranzato M, Denoyer L, J\u00e9gou H. Large memory layers with product keys. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 767"},{"key":"60308_CR129","volume-title":"Gemma 3n","author":"Team G","year":"2025","unstructured":"Team G. Gemma 3n. See deepmind.google\/models\/gemma\/gemma-3n\/ website, 2025"},{"key":"60308_CR130","unstructured":"Cheng X, Zeng W, Dai D, Chen Q, Wang B, Xie Z, Huang K, Yu X, Hao Z, Li Y, Zhang H, Zhang H, Zhao D, Liang W. Conditional memory via scalable lookup: a new axis of sparsity for large language models. 2026, arXiv preprint arXiv: 2601.07372"},{"key":"60308_CR131","unstructured":"Dehghani M, Gouws S, Vinyals O, Uszkoreit J, Kaiser \u0141. Universal transformers. 2018, arXiv preprint arXiv: 1807.03819"},{"key":"60308_CR132","unstructured":"Zhu R J, Wang Z, Hua K, Zhang T, Li Z, Que H, Wei B, Wen Z, Yin F, Xing H, Li L, Shi J, Ma K, Li S, Kergan T, Smith A, Qu X, Hui M, Wu B, Min Q, Huang H, Zhou X, Ye W, Liu J, Yang J, Shi Y, Lin C, Zhao E, Cai T, Zhang G, Huang W, Bengio Y, Eshraghian J. Scaling latent reasoning via looped language models. 2025, arXiv preprint arXiv: 2510.25741"},{"key":"60308_CR133","unstructured":"Wang G, Li J, Sun Y, Chen X, Liu C, Wu Y, Lu M, Song S, Yadkori Y A. Hierarchical reasoning model. 2025, arXiv preprint arXiv: 2506.21734"},{"key":"60308_CR134","unstructured":"Nie S, Zhu F, You Z, Zhang X, Ou J, Hu J, Zhou J, Lin Y, Wen J R, Li C. Large language diffusion models. 
2025, arXiv preprint arXiv: 2502.09992"},{"key":"60308_CR135","volume-title":"Proceedings of the 13th International Conference on Learning Representations","author":"M Arriola","year":"2025","unstructured":"Arriola M, Gokaslan A, Chiu J T, Yang Z, Qi Z, Han J, Sahoo S S, Kuleshov V. Block diffusion: Interpolating between autoregressive and diffusion language models. In: Proceedings of the 13th International Conference on Learning Representations. 2025"},{"key":"60308_CR136","unstructured":"Kamath A, Ferret J, Pathak S, Vieillard N, Merhej R, et al. Gemma 3 technical report. 2025, arXiv preprint arXiv: 2503.19786"},{"key":"60308_CR137","volume-title":"Step 3.5 flash: fast enough to think. reliable enough to act","author":"StepFun","year":"2026","unstructured":"StepFun. Step 3.5 flash: fast enough to think. reliable enough to act. See static.stepfun.com\/blog\/step-3.5-flash\/ website, 2026"},{"key":"60308_CR138","unstructured":"Zeng A, Lv X, Zheng Q, Hou Z, Chen B, et al. GLM-4.5: agentic, reasoning, and coding (ARC) foundation models. 2025, arXiv preprint arXiv: 2508.06471"},{"key":"60308_CR139","unstructured":"Bai Y, Bao Y, Chen G, Chen J, Chen N, et al. Kimi K2: open agentic intelligence. 2025, arXiv preprint arXiv: 2507.20534"},{"key":"60308_CR140","first-page":"8469","volume-title":"Proceedings of the 40th International Conference on Machine Learning","author":"D Driess","year":"2023","unstructured":"Driess D, Xia F, Sajjadi M S, Lynch C, Chowdhery A, et al. PaLM-E: an embodied multimodal language model. In: Proceedings of the 40th International Conference on Machine Learning. 2023, 8469\u20138488"},{"key":"60308_CR141","unstructured":"Li H, Zheng W, Hu J, Wang Q, Zhang H, Wang Z, Xu Y, Zhou S, Zhang X, Jiang D. Predictable scale: Part I\u2013optimal hyperparameter scaling law in large language model pretraining. 
2025, arXiv preprint arXiv: 2503.04715"},{"key":"60308_CR142","unstructured":"Yang G, Hu E J, Babuschkin I, Sidor S, Liu X, Farhi D, Ryder N, Pachocki J, Chen W, Gao J. Tensor programs V: tuning large neural networks via zero-shot hyperparameter transfer. 2022, arXiv preprint arXiv: 2203.03466"},{"key":"60308_CR143","unstructured":"Yang G, Simon J B, Bernstein J. A spectral condition for feature learning. 2023, arXiv preprint arXiv: 2310.17813"},{"key":"60308_CR144","first-page":"4603","volume-title":"Proceedings of the 35th International Conference on Machine Learning","author":"N Shazeer","year":"2018","unstructured":"Shazeer N, Stern M. Adafactor: adaptive learning rates with sublinear memory cost. In: Proceedings of the 35th International Conference on Machine Learning. 2018, 4603\u20134611"},{"key":"60308_CR145","volume-title":"Proceedings of the 13th International Conference on Learning Representations","author":"Y Zhang","year":"2025","unstructured":"Zhang Y, Chen C, Li Z, Ding T, Wu C, Kingma D P, Ye Y, Luo Z, Sun R. Adam-mini: Use fewer learning rates to gain more. In: Proceedings of the 13th International Conference on Learning Representations. 2025"},{"key":"60308_CR146","first-page":"1837","volume-title":"Proceedings of the 35th International Conference on Machine Learning","author":"V Gupta","year":"2018","unstructured":"Gupta V, Koren T, Singer Y. Shampoo: preconditioned stochastic tensor optimization. In: Proceedings of the 35th International Conference on Machine Learning. 2018, 1837\u20131845"},{"key":"60308_CR147","volume-title":"Proceedings of the 13th International Conference on Learning Representations","author":"N Vyas","year":"2025","unstructured":"Vyas N, Morwani D, Zhao R, Shapira I, Brandfonbrener D, Janson L, Kakade S M. SOAP: improving and stabilizing shampoo using Adam for language modeling. In: Proceedings of the 13th International Conference on Learning Representations. 
2025"},{"key":"60308_CR148","unstructured":"Liu J, Su J, Yao X, Jiang Z, Lai G, et al. Muon is scalable for LLM training. 2025, arXiv preprint arXiv: 2502.16982"},{"key":"60308_CR149","unstructured":"Shoeybi M, Patwary M, Puri R, LeGresley P, Casper J, Catanzaro B. Megatron-LM: training multi-billion parameter language models using model parallelism. 2019, arXiv preprint arXiv: 1909.08053"},{"key":"60308_CR150","volume-title":"Proceedings of the 9th International Conference on Learning Representations","author":"D Lepikhin","year":"2021","unstructured":"Lepikhin D, Lee H, Xu Y, Chen D, Firat O, Huang Y, Krikun M, Shazeer N, Chen Z. GShard: scaling giant models with conditional computation and automatic sharding. In: Proceedings of the 9th International Conference on Learning Representations. 2021"},{"key":"60308_CR151","unstructured":"Liu H, Zaharia M, Abbeel P. Ring attention with blockwise transformers for near-infinite context. 2023, arXiv preprint arXiv: 2310.01889"},{"key":"60308_CR152","doi-asserted-by":"publisher","first-page":"121","DOI":"10.1145\/3662158.3662806","volume-title":"Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing","author":"S A Jacobs","year":"2024","unstructured":"Jacobs S A, Tanaka M, Zhang C, Zhang M, Aminadabi R Y, Song S L, Rajbhandari S, He Y. DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence transformer models. In: Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing. 2024, 121\u2013130"},{"issue":"1","key":"60308_CR153","doi-asserted-by":"publisher","first-page":"35","DOI":"10.1109\/MCAS.2024.3476008","volume":"25","author":"C Guo","year":"2025","unstructured":"Guo C, Cheng F, Du Z, Kiessling J, Ku J, Li S, Li Z, Ma M, Molom-Ochir T, Morris B, Shan H, Sun J, Wang Y, Wei C, Wu X, Wu Y, Yang H F, Zhang J, Zheng Q, Zhou G, Li H, Chen Y. A survey: collaborative hardware and software design in the era of large language models. 
IEEE Circuits and Systems Magazine, 2025, 25(1): 35\u201357","journal-title":"IEEE Circuits and Systems Magazine"},{"key":"60308_CR154","first-page":"20","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","author":"S Rajbhandari","year":"2020","unstructured":"Rajbhandari S, Rasley J, Ruwase O, He Y. ZeRO: memory optimizations toward training trillion parameter models. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2020, 20"},{"key":"60308_CR155","first-page":"551","volume-title":"Proceedings of 2021 USENIX Annual Technical Conference","author":"J Ren","year":"2021","unstructured":"Ren J, Rajbhandari S, Aminabadi R Y, Ruwase O, Yang S, Zhang M, Li D, He Y. ZeRO-Offload: democratizing billion-scale model training. In: Proceedings of 2021 USENIX Annual Technical Conference. 2021, 551\u2013564"},{"key":"60308_CR156","first-page":"59","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","author":"S Rajbhandari","year":"2021","unstructured":"Rajbhandari S, Ruwase O, Rasley J, Smith S, He Y. ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2021, 59"},{"key":"60308_CR157","volume-title":"NVIDIA\/nccl","author":"NVIDIA","year":"2026","unstructured":"NVIDIA. NVIDIA\/nccl, See github.com\/NVIDIA\/nccl website, 2026"},{"issue":"1","key":"60308_CR158","first-page":"120","volume":"23","author":"W Fedus","year":"2022","unstructured":"Fedus W, Zoph B, Shazeer N. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. 
The Journal of Machine Learning Research, 2022, 23(1): 120","journal-title":"The Journal of Machine Learning Research"},{"key":"60308_CR159","volume-title":"Proceedings of the 8th Conference on Machine Learning and Systems","author":"S Zhang","year":"2025","unstructured":"Zhang S, Zheng N, Lin H, Jiang Z, Bao W, Jiang C, Hou Q, Cui W, Zheng S, Chang L W, Chen Q, Liu X. COMET: fine-grained computation-communication overlapping for mixture-of-experts. In: Proceedings of the 8th Conference on Machine Learning and Systems. 2025"},{"key":"60308_CR160","volume-title":"Improving network performance of HPC systems using NVIDIA magnum IO NVSHMEM and GPUDirect Async","author":"NVIDIA","year":"2022","unstructured":"NVIDIA. Improving network performance of HPC systems using NVIDIA magnum IO NVSHMEM and GPUDirect Async. See developer.nvidia.com\/blog\/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async\/ website, 2022"},{"key":"60308_CR161","volume-title":"DeepEP: an efficient expert-parallel communication library","author":"C Zhao","year":"2025","unstructured":"Zhao C, Zhou S, Zhang L, Deng C, Xu Z, Liu Y, Yu K, Li J, Zhao L. DeepEP: an efficient expert-parallel communication library. See github.com\/deepseek-ai\/DeepEP website, 2025"},{"key":"60308_CR162","first-page":"16","volume-title":"Proceedings of ACM SIGGRAPH 2008 Classes","author":"J Nickolls","year":"2008","unstructured":"Nickolls J, Buck I, Garland M, Skadron K. Scalable parallel programming with CUDA. In: Proceedings of ACM SIGGRAPH 2008 Classes. 2008, 16"},{"key":"60308_CR163","doi-asserted-by":"publisher","first-page":"10","DOI":"10.1145\/3315508.3329973","volume-title":"Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages","author":"P Tillet","year":"2019","unstructured":"Tillet P, Kung H T, Cox D. Triton: an intermediate language and compiler for tiled neural network computations. 
In: Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 2019, 10\u201319"},{"key":"60308_CR164","unstructured":"Wang L, Cheng Y, Shi Y, Tang Z, Mo Z, Xie W, Ma L, Xia Y, Xue J, Yang F, Yang Z. TileLang: a composable tiled programming model for AI systems. 2025, arXiv preprint arXiv: 2504.17577"},{"key":"60308_CR165","volume-title":"CUTLASS","author":"V Thakkar","year":"2023","unstructured":"Thakkar V, Ramani P, Cecka C, Shivam A, Lu H, Yan E, Kosaian J, Hoemmen M, Wu H, Kerr A, Nicely M, Merrill D, Blasig D, Atluri A, Qiao F, Majcher P, Springer P, Hohnerbach M, Wang J, Gupta M. CUTLASS, See github.com\/NVIDIA\/cutlass website, 2023"},{"key":"60308_CR166","first-page":"1189","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"T Dao","year":"2022","unstructured":"Dao T, Fu D Y, Ermon S, Rudra A, R\u00e9 C. FLASHATTENTION: fast and memory-efficient exact attention with IO-awareness. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 1189"},{"key":"60308_CR167","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"T Dao","year":"2024","unstructured":"Dao T. FlashAttention-2: faster attention with better parallelism and work partitioning. In: Proceedings of the 12th International Conference on Learning Representations. 2024"},{"key":"60308_CR168","first-page":"2193","volume-title":"Proceedings of the 38th International Conference on Neural Information Processing Systems","author":"J Shah","year":"2024","unstructured":"Shah J, Bikshandi G, Zhang Y, Thakkar V, Ramani P, Dao T. FlashAttention-3: fast and accurate attention with asynchrony and low-precision. In: Proceedings of the 38th International Conference on Neural Information Processing Systems. 
2024, 2193"},{"key":"60308_CR169","volume-title":"Deepseek-ai\/DeepGEMM: DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling","author":"Deepseek-Ai","year":"2025","unstructured":"Deepseek-Ai. Deepseek-ai\/DeepGEMM: DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling. See github.com\/deepseek-ai\/DeepGEMM website, 2025"},{"key":"60308_CR170","volume-title":"NVIDIA\/TransformerEngine: A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in","author":"NVIDIA","year":"2023","unstructured":"NVIDIA. NVIDIA\/TransformerEngine: A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference. See github.com\/NVIDIA\/TransformerEngine website, 2023"},{"key":"60308_CR171","volume-title":"Fla: a triton-based library for hardware efficient implementations of linear attention mechanism","author":"S Yang","year":"2024","unstructured":"Yang S, Zhang Y. Fla: a triton-based library for hardware efficient implementations of linear attention mechanism. See github.com\/fla-org\/flash-linear-attention website, 2024"},{"key":"60308_CR172","volume-title":"Proceedings of the ICML 2025 Workshop on Championing Opensource Development in Machine Learning (CODEML\u2019 25)","author":"P L Hsu","year":"2025","unstructured":"Hsu P L, Dai Y, Kothapalli V, Song Q, Tang S, Zhu S, Shimizu S, Sahni S, Ning H, Chen Y, Wang Z. Liger-kernel: efficient triton kernels for LLM training. In: Proceedings of the ICML 2025 Workshop on Championing Opensource Development in Machine Learning (CODEML\u2019 25). 
2025"},{"key":"60308_CR173","volume-title":"FlashMLA: efficient multi-head latent attention kernels","author":"J Li","year":"2025","unstructured":"Li J, Liu S. FlashMLA: efficient multi-head latent attention kernels. See github.com\/deepseek-ai\/FlashMLA website, 2025"},{"key":"60308_CR174","unstructured":"Micikevicius P, Narang S, Alben J, Diamos G F, Elsen E, Garcia D, Ginsburg B, Houston M, Kuchaiev O, Venkatesh G, Wu H. Mixed precision training. 2017, arXiv preprint arXiv: 1710.03740"},{"key":"60308_CR175","unstructured":"Rouhani B D, Zhao R, More A, Hall M, Khodamoradi A, et al. Microscaling data formats for deep learning. 2023, arXiv preprint arXiv: 2310.10537"},{"key":"60308_CR176","first-page":"278","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"A Lewkowycz","year":"2022","unstructured":"Lewkowycz A, Andreassen A, Dohan D, Dyer E, Michalewski H, Ramasesh V, Slone A, Anil C, Schlag I, Gutman-Solo T, Wu Y, Neyshabur B, Gur-Ari G, Misra V. Solving quantitative reasoning problems with language models. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 278"},{"key":"60308_CR177","first-page":"11030","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"H Ding","year":"2024","unstructured":"Ding H, Wang Z, Paolini G, Kumar V, Deoras A, Roth D, Soatto S. Fewer truncations improve language modeling. In: Proceedings of the 41st International Conference on Machine Learning. 2024, 11030\u201311048"},{"key":"60308_CR178","unstructured":"Yang A, Zhang B, Hui B, Gao B, Yu B, Li C, Liu D, Tu J, Zhou J, Lin J, Lu K, Xue M, Lin R, Liu T, Ren X, Zhang Z. Qwen2.5-math technical report: toward mathematical expert model via self-improvement. 2024, arXiv preprint arXiv: 2409.12122"},{"key":"60308_CR179","unstructured":"Tian C, Wang J, Zhao Q, Chen K, Liu J, Liu Z, Mao J, Zhao W X, Zhang Z, Zhou J. 
WSM: decay-free learning rate schedule via checkpoint merging for LLM pre-training. 2025, arXiv preprint arXiv: 2507.17634"},{"key":"60308_CR180","unstructured":"Xiao B, Xia B, Yang B, Gao B, Shen B, et al. Mimo-v2-flash technical report. 2026, arXiv preprint arXiv: 2601.02780"},{"key":"60308_CR181","volume-title":"Proceedings of the 10th International Conference on Learning Representations","author":"J Wei","year":"2022","unstructured":"Wei J, Bosma M, Zhao V Y, Guu K, Yu A W, Lester B, Du N, Dai A M, Le Q V. Finetuned language models are zero-shot learners. In: Proceedings of the 10th International Conference on Learning Representations. 2022"},{"key":"60308_CR182","doi-asserted-by":"publisher","first-page":"13484","DOI":"10.18653\/v1\/2023.acl-long.754","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Y Wang","year":"2023","unstructured":"Wang Y, Kordi Y, Mishra S, Liu A, Smith N A, Khashabi D, Hajishirzi H. Self-instruct: Aligning language models with self-generated instructions. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023, 13484\u201313508"},{"key":"60308_CR183","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"C Xu","year":"2024","unstructured":"Xu C, Sun Q, Zheng K, Geng X, Zhao P, Feng J, Tao C, Jiang D. WizardLM: empowering large pre-trained language models to follow complex instructions. In: Proceedings of the 12th International Conference on Learning Representations. 2024"},{"key":"60308_CR184","doi-asserted-by":"publisher","first-page":"3029","DOI":"10.18653\/v1\/2023.emnlp-main.183","volume-title":"Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing","author":"N Ding","year":"2023","unstructured":"Ding N, Chen Y, Xu B, Qin Y, Hu S, Liu Z, Sun M, Zhou B. 
Enhancing chat language models by scaling high-quality instructional conversations. In: Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 3029\u20133051"},{"key":"60308_CR185","first-page":"115","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"Z Sun","year":"2023","unstructured":"Sun Z, Shen Y, Zhou Q, Zhang H, Chen Z, Cox D, Yang Y, Gan C. Principle-driven self-alignment of language models from scratch with minimal human supervision. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 115"},{"key":"60308_CR186","unstructured":"Huang Z, Zou H, Li X, Liu Y, Zheng Y, Chern E, Xia S, Qin Y, Yuan W, Liu P. O1 replication journey\u2013part 2: Surpassing O1-preview through simple distillation, big progress or bitter lesson? 2024, arXiv preprint arXiv: 2411.16489"},{"key":"60308_CR187","unstructured":"Song H, Jiang J, Min Y, Chen J, Chen Z, Zhao W X, Fang L, Wen J R. R1-searcher: Incentivizing the search capability in LLMs via reinforcement learning. 2025, arXiv preprint arXiv: 2503.05592"},{"key":"60308_CR188","unstructured":"Bai T, Bai Y, Bao Y, Cai S H, Cao Y, et al. Kimi K2.5: Visual agentic intelligence. 2026, arXiv preprint arXiv: 2602.02276"},{"key":"60308_CR189","unstructured":"Cao Y, Kang Y, Wang C, Sun L. Instruction mining: Instruction data selection for tuning large language models. 2023, arXiv preprint arXiv: 2307.06290"},{"key":"60308_CR190","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"L Chen","year":"2024","unstructured":"Chen L, Li S, Yan J, Wang H, Gunaratna K, Yadav V, Tang Z, Srinivasan V, Zhou T, Huang H, Jin H. AlpaGasus: Training a better alpaca with fewer data. In: Proceedings of the 12th International Conference on Learning Representations. 
2024"},{"key":"60308_CR191","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"K Lu","year":"2024","unstructured":"Lu K, Yuan H, Yuan Z, Lin R, Lin J, Tan C, Zhou C, Zhou J. #InsTag: Instruction tagging for analyzing supervised fine-tuning of large language models. In: Proceedings of the 12th International Conference on Learning Representations. 2024"},{"key":"60308_CR192","first-page":"54104","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"M Xia","year":"2024","unstructured":"Xia M, Malladi S, Gururangan S, Arora S, Chen D. LESS: Selecting influential data for targeted instruction tuning. In: Proceedings of the 41st International Conference on Machine Learning. 2024, 54104\u201354132"},{"key":"60308_CR193","unstructured":"Lambert N, Morrison J, Pyatkin V, Huang S, Ivison H, et al. Tulu 3: Pushing frontiers in open language model post-training. 2024, arXiv preprint arXiv: 2411.15124"},{"key":"60308_CR194","unstructured":"Liu A, Zhou B, Xu C, Zhou C, Zhang C, et al. Hunyuan-TurboS: advancing large language models through mamba-transformer synergy and adaptive chain-of-thought. 2025, arXiv preprint arXiv: 2505.15431"},{"key":"60308_CR195","first-page":"2400","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"C Zhou","year":"2023","unstructured":"Zhou C, Liu P, Xu P, Iyer S, Sun J, et al. LIMA: Less is more for alignment. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 2400"},{"issue":"8081","key":"60308_CR196","doi-asserted-by":"publisher","first-page":"633","DOI":"10.1038\/s41586-025-09422-z","volume":"645","author":"D Guo","year":"2025","unstructured":"Guo D, Yang D, Zhang H, Song J, Wang P, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. 
Nature, 2025, 645(8081): 633\u2013638","journal-title":"Nature"},{"key":"60308_CR197","unstructured":"Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, et al. Towards expert-level medical question answering with large language models. 2023, arXiv preprint arXiv: 2305.09617"},{"key":"60308_CR198","volume-title":"Proceedings of the 10th International Conference on Learning Representations","author":"E J Hu","year":"2022","unstructured":"Hu E J, Shen Y, Wallis P, Zhu Z A, Li Y, Wang S, Wang L, Chen W. LoRA: Low-rank adaptation of large language models. In: Proceedings of the 10th International Conference on Learning Representations. 2022"},{"key":"60308_CR199","unstructured":"Shao Z, Luo Y, Lu C, Ren Z, Hu J, Ye T, Gou Z, Ma S, Zhang X. DeepSeekMath-V2: Towards self-verifiable mathematical reasoning. 2025, arXiv preprint arXiv: 2511.22570"},{"key":"60308_CR200","unstructured":"Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. Proximal policy optimization algorithms. 2017, arXiv preprint arXiv: 1707.06347"},{"key":"60308_CR201","unstructured":"Yuan Z, Yuan H, Li C, Dong G, Lu K, Tan C, Zhou C, Zhou J. Scaling relationship on learning mathematical reasoning with large language models. 2023, arXiv preprint arXiv: 2308.01825"},{"key":"60308_CR202","first-page":"2338","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"R Rafailov","year":"2023","unstructured":"Rafailov R, Sharma A, Mitchell E, Ermon S, Manning C D, Finn C. Direct preference optimization: your language model is secretly a reward model. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 2338"},{"key":"60308_CR203","unstructured":"Yu Q, Zhang Z, Zhu R, Yuan Y, Zuo X, et al. DAPO: An open-source LLM reinforcement learning system at scale. 2025, arXiv preprint arXiv: 2503.14476"},{"key":"60308_CR204","unstructured":"Liu Z, Chen C, Li W, Qi P, Pang T, Du C, Lee W S, Lin M. 
Understanding R1-zero-like training: A critical perspective. 2025, arXiv preprint arXiv: 2503.20783"},{"key":"60308_CR205","unstructured":"Zheng C, Liu S, Li M, Chen X H, Yu B, Gao C, Dang K, Liu Y, Men R, Yang A, Zhou J, Lin J. Group sequence policy optimization. 2025, arXiv preprint arXiv: 2507.18071"},{"issue":"7587","key":"60308_CR206","doi-asserted-by":"publisher","first-page":"484","DOI":"10.1038\/nature16961","volume":"529","author":"D Silver","year":"2016","unstructured":"Silver D, Huang A, Maddison C J, Guez A, Sifre L, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016, 529(7587): 484\u2013489","journal-title":"Nature"},{"key":"60308_CR207","unstructured":"Glaese A, McAleese N, Tr\u0119bacz M, Aslanides J, Firoiu V, et al. Improving alignment of dialogue agents via targeted human judgements. 2022, arXiv preprint arXiv: 2209.14375"},{"key":"60308_CR208","volume-title":"The landscape of agentic reinforcement learning for LLMs: a survey","author":"G Zhang","year":"2025","unstructured":"Zhang G, Geng H, Yu X, Yin Z, Zhang Z, Tan Z, Zhou H, Li Z, Xue X, Li Y, et al. The landscape of agentic reinforcement learning for LLMs: a survey. CoRR, 2025"},{"key":"60308_CR209","unstructured":"Gunjal A, Wang A, Lau E, Nath V, He Y, Liu B, Hendryx S. Rubrics as rewards: Reinforcement learning beyond verifiable domains. 2025, arXiv preprint arXiv: 2507.17746"},{"key":"60308_CR210","unstructured":"Hou Z, Niu Y, Du Z, Zhang X, Liu X, Zeng A, Zheng Q, Huang M, Wang H, Tang J, Dong Y. ChatGLM-RLHF: Practices of aligning large language models with human feedback. 2024, arXiv preprint arXiv: 2404.00934"},{"key":"60308_CR211","unstructured":"Chen X, Li G, Wang Z, Jin B, Qian C, Wang Y, Wang H, Zhang Y, Zhang D, Zhang T, Tong H, Ji H. RM-R1: Reward modeling as reasoning. 2025, arXiv preprint arXiv: 2505.02387"},{"key":"60308_CR212","unstructured":"Liu Z, Wang P, Xu R, Ma S, Ruan C, Li P, Liu Y, Wu Y. 
Inference-time scaling for generalist reward modeling. 2025, arXiv preprint arXiv: 2504.02495"},{"key":"60308_CR213","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"H Lightman","year":"2024","unstructured":"Lightman H, Kosaraju V, Burda Y, Edwards H, Baker B, Lee T, Leike J, Schulman J, Sutskever I, Cobbe K. Let\u2019s verify step by step. In: Proceedings of the 12th International Conference on Learning Representations. 2024"},{"key":"60308_CR214","first-page":"1057","volume-title":"Proceedings of the 13th International Conference on Neural Information Processing Systems","author":"R S Sutton","year":"1999","unstructured":"Sutton R S, McAllester D, Singh S, Mansour Y. Policy gradient methods for reinforcement learning with function approximation. In: Proceedings of the 13th International Conference on Neural Information Processing Systems. 1999, 1057\u20131063"},{"key":"60308_CR215","first-page":"55204","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"H Xu","year":"2024","unstructured":"Xu H, Sharaf A, Chen Y, Tan W, Shen L, Van Durme B, Murray K, Kim Y J. Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation. In: Proceedings of the 41st International Conference on Machine Learning. 2024, 55204\u201355224"},{"key":"60308_CR216","first-page":"4447","volume-title":"Proceedings of International Conference on Artificial Intelligence and Statistics","author":"M G Azar","year":"2024","unstructured":"Azar M G, Guo Z D, Piot B, Munos R, Rowland M, Valko M, Calandriello D. A general theoretical paradigm to understand learning from human preferences. In: Proceedings of International Conference on Artificial Intelligence and Statistics. 2024, 4447\u20134455"},{"key":"60308_CR217","unstructured":"Ethayarajh K, Xu W, Muennighoff N, Jurafsky D, Kiela D. KTO: Model alignment as prospect theoretic optimization. 
2024, arXiv preprint arXiv: 2402.01306"},{"key":"60308_CR218","unstructured":"Bai Y, Kadavath S, Kundu S, Askell A, Kernion J, et al. Constitutional AI: harmlessness from AI feedback. 2022, arXiv preprint arXiv: 2212.08073"},{"key":"60308_CR219","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"J Dai","year":"2024","unstructured":"Dai J, Pan X, Sun R, Ji J, Xu X, Liu M, Wang Y, Yang Y. Safe RLHF: Safe reinforcement learning from human feedback. In: Proceedings of the 12th International Conference on Learning Representations. 2024"},{"key":"60308_CR220","unstructured":"Yue Y, Chen Z, Lu R, Zhao A, Wang Z, Yue Y, Song S, Huang G. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? 2025, arXiv preprint arXiv: 2504.13837"},{"key":"60308_CR221","volume-title":"Reward hacking in reinforcement learning","author":"L Weng","year":"2024","unstructured":"Weng L. Reward hacking in reinforcement learning. See lilianweng.github.io\/posts\/2024-11-28-reward-hacking\/ website, 2024"},{"key":"60308_CR222","first-page":"7935","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"L Chen","year":"2024","unstructured":"Chen L, Zhu C, Soselia D, Chen J, Zhou T, Goldstein T, Huang H, Shoeybi M, Catanzaro B. ODIN: Disentangled reward mitigates hacking in RLHF. In: Proceedings of the 41st International Conference on Machine Learning. 2024, 7935\u20137952"},{"key":"60308_CR223","first-page":"4270","volume-title":"Proceedings of the 38th International Conference on Neural Information Processing Systems","author":"Y Miao","year":"2024","unstructured":"Miao Y, Zhang S, Ding L, Bao R, Zhang L, Tao D. InfoRM: Mitigating reward hacking in RLHF via information-theoretic reward modeling. In: Proceedings of the 38th International Conference on Neural Information Processing Systems. 
2024, 4270"},{"key":"60308_CR224","first-page":"51","volume-title":"Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","author":"M F Bin Tarek","year":"2025","unstructured":"Bin Tarek M F, Beheshti R. Reward hacking mitigation using verifiable composite rewards. In: Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. 2025, 51"},{"key":"60308_CR225","unstructured":"Chen Z, Min Y, Zhang B, Chen J, Jiang J, Cheng D, Zhao W X, Liu Z, Miao X, Lu Y, Fang L, Wang Z, Wen J R. An empirical study on eliciting and improving R1-like reasoning models. 2025, arXiv preprint arXiv: 2503.04548"},{"key":"60308_CR226","unstructured":"Deng J, Chen J, Chen Z, Cheng D, Bai F, Zhang B, Min Y, Gao Y, Zhao W X, Wen J R. From trial-and-error to improvement: A systematic analysis of LLM exploration mechanisms in RLVR. 2025, arXiv preprint arXiv: 2508.07534"},{"key":"60308_CR227","unstructured":"Kimi T, Du A, Gao B, Xing B, Jiang C, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. 2025, arXiv preprint arXiv: 2501.12599"},{"key":"60308_CR228","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"R Agarwal","year":"2024","unstructured":"Agarwal R, Vieillard N, Zhou Y, Stanczyk P, Garea S R, Geist M, Bachem O. On-policy distillation of language models: Learning from self-generated mistakes. In: Proceedings of the 12th International Conference on Learning Representations. 2024"},{"key":"60308_CR229","unstructured":"Touvron H, Martin L, Stone K, Albert P, Almahairi A, et al. Llama 2: open foundation and fine-tuned chat models. 2023, arXiv preprint arXiv: 2307.09288"},{"issue":"9","key":"60308_CR230","doi-asserted-by":"publisher","first-page":"195","DOI":"10.1145\/3560815","volume":"55","author":"P Liu","year":"2023","unstructured":"Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G. 
Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 2023, 55(9): 195","journal-title":"ACM Computing Surveys"},{"key":"60308_CR231","doi-asserted-by":"publisher","first-page":"2609","DOI":"10.18653\/v1\/2023.acl-long.147","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"L Wang","year":"2023","unstructured":"Wang L, Xu W, Lan Y, Hu Z, Lan Y, Lee R K, Lim E. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023, 2609\u20132634"},{"key":"60308_CR232","doi-asserted-by":"publisher","first-page":"1401","DOI":"10.18653\/v1\/2023.acl-long.78","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"I Levy","year":"2023","unstructured":"Levy I, Bogin B, Berant J. Diverse demonstrations improve in-context compositional generalization. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023, 1401\u20131422"},{"key":"60308_CR233","unstructured":"Kim H J, Cho H, Kim J, Kim T, Yoo K M, Lee S. Self-generated in-context learning: Leveraging auto-regressive language models as a demonstration generator. 2022, arXiv preprint arXiv: 2206.08082"},{"key":"60308_CR234","first-page":"12697","volume-title":"Proceedings of the 38th International Conference on Machine Learning","author":"Z Zhao","year":"2021","unstructured":"Zhao Z, Wallace E, Feng S, Klein D, Singh S. Calibrate before use: Improving few-shot performance of language models. In: Proceedings of the 38th International Conference on Machine Learning. 
2021, 12697\u201312706"},{"key":"60308_CR235","doi-asserted-by":"publisher","first-page":"8086","DOI":"10.18653\/v1\/2022.acl-long.556","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Y Lu","year":"2022","unstructured":"Lu Y, Bartolo M, Moore A, Riedel S, Stenetorp P. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022, 8086\u20138098"},{"key":"60308_CR236","doi-asserted-by":"publisher","first-page":"1423","DOI":"10.18653\/v1\/2023.acl-long.79","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Z Wu","year":"2023","unstructured":"Wu Z, Wang Y, Ye J, Kong L. Self-adaptive in-context learning: an information compression perspective for in-context example selection and ordering. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023, 1423\u20131436"},{"key":"60308_CR237","first-page":"523","volume-title":"Proceedings of the 38th International Conference on Neural Information Processing Systems","author":"K C Wibisono","year":"2024","unstructured":"Wibisono K C, Wang Y. From unstructured data to in-context learning: Exploring what tasks can be learned and when. In: Proceedings of the 38th International Conference on Neural Information Processing Systems. 2024, 523"},{"key":"60308_CR238","volume-title":"Proceedings of the 10th International Conference on Learning Representations","author":"S M Xie","year":"2022","unstructured":"Xie S M, Raghunathan A, Liang P, Ma T. An explanation of in-context learning as implicit Bayesian inference. In: Proceedings of the 10th International Conference on Learning Representations. 
2022"},{"key":"60308_CR239","doi-asserted-by":"publisher","first-page":"8298","DOI":"10.18653\/v1\/2023.findings-acl.527","volume-title":"Proceedings of Findings of the Association for Computational Linguistics: ACL 2023","author":"J Pan","year":"2023","unstructured":"Pan J, Gao T, Chen H, Chen D. What in-context learning \u201clearns\u201d in-context: Disentangling task recognition and task learning. In: Proceedings of Findings of the Association for Computational Linguistics: ACL 2023. 2023, 8298\u20138319"},{"key":"60308_CR240","unstructured":"Dherin B, Munn M, Mazzawi H, Wunder M, Gonzalvo J. Learning without training: The implicit dynamics of in-context learning. 2025, arXiv preprint arXiv: 2507.16003"},{"key":"60308_CR241","first-page":"1613","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"T Kojima","year":"2022","unstructured":"Kojima T, Gu S S, Reid M, Matsuo Y, Iwasawa Y. Large language models are zero-shot reasoners. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 1613"},{"key":"60308_CR242","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"C Yang","year":"2024","unstructured":"Yang C, Wang X, Lu Y, Liu H, Le Q V, Zhou D, Chen X. Large language models as optimizers. In: Proceedings of the 12th International Conference on Learning Representations. 2024"},{"key":"60308_CR243","first-page":"3107","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"B Prystawski","year":"2023","unstructured":"Prystawski B, Li M Y, Goodman N D. Why think step by step? reasoning emerges from the locality of experience. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 3107"},{"key":"60308_CR244","unstructured":"Zhao C, Tan Z, Ma P, Li D, Jiang B, Wang Y, Yang Y, Liu H. 
Is chain-of-thought reasoning of LLMs a mirage? a data distribution lens. 2025, arXiv preprint arXiv: 2508.01191"},{"key":"60308_CR245","unstructured":"Dutta S, Singh J, Chakrabarti S, Chakraborty T. How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning. Transactions on Machine Learning Research, 2024"},{"key":"60308_CR246","unstructured":"Yang H, Zhao Q, Li L. Chain-of-thought in large language models: Decoding, projection, and activation. 2024, arXiv preprint arXiv: 2412.03944"},{"key":"60308_CR247","first-page":"4222","volume-title":"Proceedings of 2020 Conference on Empirical Methods in Natural Language Processing","author":"T Shin","year":"2020","unstructured":"Shin T, Razeghi Y, Logan IV R L, Wallace E, Singh S. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In: Proceedings of 2020 Conference on Empirical Methods in Natural Language Processing. 2020, 4222\u20134235"},{"key":"60308_CR248","doi-asserted-by":"publisher","first-page":"3369","DOI":"10.18653\/v1\/2022.emnlp-main.222","volume-title":"Proceedings of 2022 Conference on Empirical Methods in Natural Language Processing","author":"M Deng","year":"2022","unstructured":"Deng M, Wang J, Hsieh C P, Wang Y, Guo H, Shu T, Song M, Xing E P, Hu Z. RLPrompt: Optimizing discrete text prompts with reinforcement learning. In: Proceedings of 2022 Conference on Empirical Methods in Natural Language Processing. 2022, 3369\u20133391"},{"key":"60308_CR249","doi-asserted-by":"publisher","first-page":"8162","DOI":"10.18653\/v1\/2022.emnlp-main.559","volume-title":"Proceedings of 2022 Conference on Empirical Methods in Natural Language Processing","author":"H Xu","year":"2022","unstructured":"Xu H, Chen Y, Du Y, Shao N, Wang Y, Li H, Yang Z. GPS: genetic prompt search for efficient few-shot learning. In: Proceedings of 2022 Conference on Empirical Methods in Natural Language Processing. 
2022, 8162\u20138171"},{"key":"60308_CR250","volume-title":"Proceedings of the 11th International Conference on Learning Representations","author":"Y Zhou","year":"2023","unstructured":"Zhou Y, Muresanu A I, Han Z, Paster K, Pitis S, Chan H, Ba J. Large language models are human-level prompt engineers. In: Proceedings of the 11th International Conference on Learning Representations. 2023"},{"key":"60308_CR251","volume-title":"Proceedings of the 11th International Conference on Learning Representations","author":"X Wang","year":"2023","unstructured":"Wang X, Wei J, Schuurmans D, Le Q V, Chi E H, Narang S, Chowdhery A, Zhou D. Self-consistency improves chain of thought reasoning in language models. In: Proceedings of the 11th International Conference on Learning Representations. 2023"},{"key":"60308_CR252","doi-asserted-by":"publisher","first-page":"5942","DOI":"10.18653\/v1\/2023.emnlp-main.364","volume-title":"Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing","author":"O Yoran","year":"2023","unstructured":"Yoran O, Wolfson T, Bogin B, Katz U, Deutch D, Berant J. Answering questions by meta-reasoning over multiple chains of thought. In: Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 5942\u20135966"},{"key":"60308_CR253","doi-asserted-by":"publisher","first-page":"5315","DOI":"10.18653\/v1\/2023.acl-long.291","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Y Li","year":"2023","unstructured":"Li Y, Lin Z, Zhang S, Fu Q, Chen B, Lou J G, Chen W. Making language models better reasoners with step-aware verifier. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 
2023, 5315\u20135333"},{"key":"60308_CR254","first-page":"517","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"S Yao","year":"2023","unstructured":"Yao S, Yu D, Zhao J, Shafran I, Griffiths T L, Cao Y, Narasimhan K. Tree of thoughts: Deliberate problem solving with large language models. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 517"},{"key":"60308_CR255","first-page":"17682","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"M Besta","year":"2024","unstructured":"Besta M, Blach N, Kubicek A, Gerstenberger R, Podstawski M, Gianinazzi L, Gajda J, Lehmann T, Niewiadomski H, Nyczyk P, Hoefler T. Graph of thoughts: Solving elaborate problems with large language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2024, 17682\u201317690"},{"key":"60308_CR256","first-page":"62138","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"A Zhou","year":"2024","unstructured":"Zhou A, Yan K, Shlapentokh-Rothman M, Wang H, Wang Y X. Language agent tree search unifies reasoning, acting, and planning in language models. In: Proceedings of the 41st International Conference on Machine Learning. 2024, 62138\u201362160"},{"key":"60308_CR257","volume-title":"Proceedings of the 42nd International Conference on Machine Learning","author":"X Guan","year":"2025","unstructured":"Guan X, Zhang L L, Liu Y, Shang N, Sun Y, Zhu Y, Yang F, Yang M. rStar-Math: Small LLMs can master math reasoning with self-evolved deep thinking. In: Proceedings of the 42nd International Conference on Machine Learning. 2025"},{"key":"60308_CR258","unstructured":"Jiang J, Chen Z, Min Y, Chen J, Cheng X, Wang J, Tang Y, Sun H, Deng J, Zhao W X, Liu Z, Yan D, Xie J, Wang Z, Wen J R. Enhancing LLM reasoning with reward-guided tree search. 
2024, arXiv preprint arXiv: 2411.11694"},{"key":"60308_CR259","volume-title":"Proceedings of the 42nd International Conference on Machine Learning","author":"X Chen","year":"2025","unstructured":"Chen X, Xu J, Liang T, He Z, Pang J, Yu D, Song L, Liu Q, Zhou M, Zhang Z, Wang R, Tu Z, Mi H, Yu D. Do NOT think that much for 2+3=? On the overthinking of long reasoning models. In: Proceedings of the 42nd International Conference on Machine Learning. 2025"},{"key":"60308_CR260","unstructured":"Gao Y, Xiong Y, Gao X, Jia K, Pan J, Bi Y, Dai Y, Sun J, Wang H. Retrieval-augmented generation for large language models: A survey. 2023, arXiv preprint arXiv: 2312.10997"},{"key":"60308_CR261","doi-asserted-by":"publisher","first-page":"157","DOI":"10.1162\/tacl_a_00638","volume":"12","author":"N F Liu","year":"2024","unstructured":"Liu N F, Lin K, Hewitt J, Paranjape A, Bevilacqua M, Petroni F, Liang P. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 2024, 12: 157\u2013173","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"60308_CR262","doi-asserted-by":"publisher","first-page":"7969","DOI":"10.18653\/v1\/2023.emnlp-main.495","volume-title":"Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing","author":"Z Jiang","year":"2023","unstructured":"Jiang Z, Xu F F, Gao L, Sun Z, Liu Q, Dwivedi-Yu J, Yang Y, Callan J, Neubig G. Active retrieval augmented generation. In: Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 7969\u20137992"},{"key":"60308_CR263","doi-asserted-by":"publisher","first-page":"15159","DOI":"10.18653\/v1\/2024.emnlp-main.845","volume-title":"Proceedings of 2024 Conference on Empirical Methods in Natural Language Processing","author":"T Chen","year":"2024","unstructured":"Chen T, Wang H, Chen S, Yu W, Ma K, Zhao X, Zhang H, Yu D. 
Dense X retrieval: What retrieval granularity should we use? In: Proceedings of 2024 Conference on Empirical Methods in Natural Language Processing. 2024, 15159\u201315177"},{"key":"60308_CR264","doi-asserted-by":"publisher","first-page":"5303","DOI":"10.18653\/v1\/2023.emnlp-main.322","volume-title":"Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing","author":"X Ma","year":"2023","unstructured":"Ma X, Gong Y, He P, Zhao H, Duan N. Query rewriting in retrieval-augmented large language models. In: Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 5303\u20135315"},{"key":"60308_CR265","doi-asserted-by":"publisher","first-page":"10014","DOI":"10.18653\/v1\/2023.acl-long.557","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"H Trivedi","year":"2023","unstructured":"Trivedi H, Balasubramanian N, Khot T, Sabharwal A. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023, 10014\u201310037"},{"key":"60308_CR266","unstructured":"Yao S, Shinn N, Razavi P, Narasimhan K. \u03c4-bench: A benchmark for tool-agent-user interaction in real-world domains. 2024, arXiv preprint arXiv: 2406.12045"},{"key":"60308_CR267","first-page":"1650","volume-title":"Proceedings of the 38th International Conference on Neural Information Processing Systems","author":"T Xie","year":"2024","unstructured":"Xie T, Zhang D, Chen J, Li X, Zhao S, et al. OSWORLD: Benchmarking multimodal agents for open-ended tasks in real computer environments. In: Proceedings of the 38th International Conference on Neural Information Processing Systems. 
2024, 1650"},{"issue":"6","key":"60308_CR268","doi-asserted-by":"publisher","first-page":"186345","DOI":"10.1007\/s11704-024-40231-1","volume":"18","author":"L Wang","year":"2024","unstructured":"Wang L, Ma C, Feng X, Zhang Z, Yang H, Zhang J, Chen Z, Tang J, Chen X, Lin Y, Zhao W X, Wei Z, Wen J. A survey on large language model based autonomous agents. Frontiers of Computer Science, 2024, 18(6): 186345","journal-title":"Frontiers of Computer Science"},{"key":"60308_CR269","unstructured":"Huang X, Liu W, Chen X, Wang X, Wang H, Lian D, Wang Y, Tang R, Chen E. Understanding the planning of LLM agents: A survey. 2024, arXiv preprint arXiv: 2402.02716"},{"key":"60308_CR270","volume-title":"Proceedings of the 11th International Conference on Learning Representations","author":"S Yao","year":"2023","unstructured":"Yao S, Zhao J, Yu D, Du N, Shafran I, Narasimhan K R, Cao Y. ReAct: Synergizing reasoning and acting in language models. In: Proceedings of the 11th International Conference on Learning Representations. 2023"},{"key":"60308_CR271","first-page":"377","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"N Shinn","year":"2023","unstructured":"Shinn N, Cassano F, Gopinath A, Labash B, Narasimhan K, Yao S. Reflexion: Language agents with verbal reinforcement learning. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 377"},{"key":"60308_CR272","unstructured":"Sun W, Lu M, Ling Z, Liu K, Yao X, Yang Y, Chen J. Scaling long-horizon LLM agent via context-folding. 2025, arXiv preprint arXiv: 2510.11967"},{"key":"60308_CR273","unstructured":"Chen G, Qiao Z, Chen X, Yu D, Xu H, Zhao W X, Song R, Yin W, Yin H, Zhang L, Li K, Liao M, Jiang Y, Xie P, Huang F, Zhou J. IterResearch: Rethinking long-horizon agents with interaction scaling. 2026, arXiv preprint arXiv: 2511.07327v2"},{"key":"60308_CR274","unstructured":"Xu W, Liang Z, Mei K, Gao H, Tan J, Zhang Y. 
A-MEM: Agentic memory for LLM agents. 2025, arXiv preprint arXiv: 2502.12110"},{"key":"60308_CR275","unstructured":"Zhou Z, Qu A, Wu Z, Kim S, Prakash A, Rus D, Zhao J, Low B K H, Liang P P. MEM1: Learning to synergize memory and reasoning for efficient long-horizon agents. 2025, arXiv preprint arXiv: 2506.15841"},{"key":"60308_CR276","first-page":"2993","volume-title":"Proceedings of ECAI 2025 - 28th European Conference on Artificial Intelligence","author":"P Chhikara","year":"2025","unstructured":"Chhikara P, Khant D, Aryan S, Singh T, Yadav D. Mem0: Building production-ready AI agents with scalable long-term memory. In: Proceedings of ECAI 2025 - 28th European Conference on Artificial Intelligence. 2025, 2993\u20133000"},{"key":"60308_CR277","first-page":"2019","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"A Madaan","year":"2023","unstructured":"Madaan A, Tandon N, Gupta P, Hallinan S, Gao L, Wiegreffe S, Alon U, Dziri N, Prabhumoye S, Yang Y, Welleck S, Majumder B P, Gupta S, Yazdanbakhsh A, Clark P. SELF-REFINE: Iterative refinement with self-feedback. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 2019"},{"key":"60308_CR278","first-page":"1126","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"E Zelikman","year":"2022","unstructured":"Zelikman E, Wu Y, Mu J, Goodman N D. Star: Self-taught reasoner bootstrapping reasoning with reasoning. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 1126"},{"key":"60308_CR279","unstructured":"Zhang K, Chen X, Liu B, Xue T, Liao Z, et al. Agent learning via early experience. 2025, arXiv preprint arXiv: 2510.08558"},{"key":"60308_CR280","unstructured":"Lu F, Zhong Z, Liu S, Fu C W, Jia J. ARPO: End-to-end policy optimization for GUI agents with experience replay. 
2025, arXiv preprint arXiv: 2505.16282"},{"key":"60308_CR281","unstructured":"Wu R, Wang X, Mei J, Cai P, Fu D, Yang C, Wen L, Yang X, Shen Y, Wang Y, Shi B. EvolveR: Self-evolving LLM agents through an experience-driven lifecycle. 2025, arXiv preprint arXiv: 2510.16079"},{"key":"60308_CR282","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"S Hong","year":"2024","unstructured":"Hong S, Zhuge M, Chen J, Zheng X, Cheng Y, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In: Proceedings of the 12th International Conference on Learning Representations. 2024"},{"key":"60308_CR283","doi-asserted-by":"publisher","first-page":"599","DOI":"10.18653\/v1\/2024.findings-acl.33","volume-title":"Proceedings of Findings of the Association for Computational Linguistics: ACL 2024","author":"X Tang","year":"2024","unstructured":"Tang X, Zou A, Zhang Z, Li Z, Zhao Y, Zhang X, Cohan A, Gerstein M. MedAgents: Large language models as collaborators for zero-shot medical reasoning. In: Proceedings of Findings of the Association for Computational Linguistics: ACL 2024. 2024, 599\u2013621"},{"key":"60308_CR284","unstructured":"Lu R, Hou Z, Wang Z, Zhang H, Liu X, Li Y, Feng S, Tang J, Dong Y. DeepDive: Advancing deep search agents with knowledge graphs and multi-turn RL. 2025, arXiv preprint arXiv: 2509.10446"},{"key":"60308_CR285","unstructured":"Zhu C, Xu B, Du M, Wang S, Wang X, Mao Z, Zhang Y. FS-Researcher: Test-time scaling for long-horizon research tasks with file-system-based agents. 2026, arXiv preprint arXiv: 2602.01566"},{"key":"60308_CR286","unstructured":"Song H, Huang L, Sun S, Jiang J, Le R, Cheng D, Chen G, Hu Y, Chen Z, Jia Y, Zhao W X, Song Y, Zhang T, Wen J R. SWE-Master: Unleashing the potential of software engineering agents via post-training. 
2026, arXiv preprint arXiv: 2602.03411"},{"key":"60308_CR287","first-page":"441","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"T Dettmers","year":"2023","unstructured":"Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: efficient finetuning of quantized LLMs. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 441"},{"key":"60308_CR288","first-page":"244","volume-title":"Proceedings of the 24th China National Conference on Chinese Computational Linguistics","author":"Y Chen","year":"2025","unstructured":"Chen Y, Tang T, Xiang E, Li L, Zhao W X, Wang J, Chai Y, Wen J R. Towards coarse-to-fine evaluation of inference efficiency for large language models. In: Proceedings of the 24th China National Conference on Chinese Computational Linguistics. 2025, 244\u2013264"},{"key":"60308_CR289","doi-asserted-by":"publisher","first-page":"611","DOI":"10.1145\/3600006.3613165","volume-title":"Proceedings of the 29th Symposium on Operating Systems Principles","author":"W Kwon","year":"2023","unstructured":"Kwon W, Li Z, Zhuang S, Sheng Y, Zheng L, et al. Efficient memory management for large language model serving with PagedAttention. In: Proceedings of the 29th Symposium on Operating Systems Principles. 2023, 611\u2013626"},{"key":"60308_CR290","unstructured":"Holmes C, Tanaka M, Wyatt M, Awan A A, Rasley J, Rajbhandari S, Aminabadi R Y, Qin H, Bakhtiari A, Kurilenko L, He Y. DeepSpeed-FastGen: High-throughput text generation for LLMs via MII and DeepSpeed-inference. 2024, arXiv preprint arXiv: 2401.08671"},{"key":"60308_CR291","first-page":"795","volume-title":"Proceedings of the 40th International Conference on Machine Learning","author":"Y Leviathan","year":"2023","unstructured":"Leviathan Y, Kalman M, Matias Y. Fast inference from transformers via speculative decoding. In: Proceedings of the 40th International Conference on Machine Learning. 
2023, 795"},{"key":"60308_CR292","first-page":"203","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"T Cai","year":"2024","unstructured":"Cai T, Li Y, Geng Z, Peng H, Lee J D, Chen D, Dao T. MEDUSA: Simple LLM inference acceleration framework with multiple decoding heads. In: Proceedings of the 41st International Conference on Machine Learning. 2024, 203"},{"key":"60308_CR293","unstructured":"Miao X, Oliaro G, Zhang Z, Cheng X, Wang Z, Zhang Z, Wong R Y Y, Zhu A, Yang L, Shi X, Shi C, Chen Z, Arfeen D, Abhyankar R, Jia Z. Specinfer: Accelerating generative LLM serving with speculative inference and token tree verification. 2023, arXiv preprint arXiv: 2305.09781"},{"key":"60308_CR294","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"M Yue","year":"2024","unstructured":"Yue M, Zhao J, Zhang M, Du L, Yao Z. Large language model cascades with mixture of thought representations for cost-efficient reasoning. In: Proceedings of the 12th International Conference on Learning Representations. 2024"},{"key":"60308_CR295","unstructured":"Raposo D, Ritter S, Richards B A, Lillicrap T P, Humphreys P C, Santoro A. Mixture-of-depths: Dynamically allocating compute in transformer-based language models. 2024, arXiv preprint arXiv: 2404.02258"},{"key":"60308_CR296","first-page":"1506","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"Z Zhang","year":"2023","unstructured":"Zhang Z, Sheng Y, Zhou T, Chen T, Zheng L, et al. H2O: heavy-hitter oracle for efficient generative inference of large language models. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 
2023, 1506"},{"key":"60308_CR297","first-page":"32332","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"Z Liu","year":"2024","unstructured":"Liu Z, Yuan J, Jin H, Zhong S, Xu Z, Braverman V, Chen B, Hu X. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. In: Proceedings of the 41st International Conference on Machine Learning. 2024, 32332\u201332344"},{"key":"60308_CR298","doi-asserted-by":"publisher","first-page":"11175","DOI":"10.18653\/v1\/2024.acl-long.602","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"H Wu","year":"2024","unstructured":"Wu H, Tu K. Layer-condensed KV cache for efficient inference of large language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024, 11175\u201311188"},{"key":"60308_CR299","doi-asserted-by":"publisher","first-page":"33313","DOI":"10.18653\/v1\/2025.acl-long.1597","volume-title":"Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"T Ji","year":"2025","unstructured":"Ji T, Guo B, Wu Y, Guo Q, Shen L, Chen Z, Qiu X, Zhang Q, Gui T. Towards economical inference: Enabling deepseek\u2019s multi-head latent attention in any transformer-based LLMs. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025, 33313\u201333328"},{"key":"60308_CR300","unstructured":"Zhang Y, Hu Y, Zhao R, Lui J C S, Chen H. Unifying KV cache compression for large language models with leanKV. 2024, arXiv preprint arXiv: 2412.03131"},{"key":"60308_CR301","first-page":"2000","volume-title":"Proceedings of the 38th International Conference on Neural Information Processing Systems","author":"L Zheng","year":"2024","unstructured":"Zheng L, Yin L, Xie Z, Sun C, Huang J, et al. 
SGLang: Efficient execution of structured language model programs. In: Proceedings of the 38th International Conference on Neural Information Processing Systems. 2024, 2000"},{"key":"60308_CR302","doi-asserted-by":"publisher","first-page":"107856","DOI":"10.1016\/j.neunet.2025.107856","volume":"192","author":"R Gong","year":"2025","unstructured":"Gong R, Ding Y, Wang Z, Lv C, Zheng X, et al. A survey of low-bit large language models: Basics, systems, and algorithms. Neural Networks, 2025, 192: 107856","journal-title":"Neural Networks"},{"key":"60308_CR303","doi-asserted-by":"publisher","first-page":"5174","DOI":"10.63317\/5my4wtsi85on","volume-title":"Proceedings of 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)","author":"P Liu","year":"2024","unstructured":"Liu P, Liu Z, Gao Z F, Gao D, Zhao W X, Li Y, Ding B, Wen J R. Do emergent abilities exist in quantized large language models: an empirical study. In: Proceedings of 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024, 5174\u20135190"},{"key":"60308_CR304","unstructured":"Frantar E, Ashkboos S, Hoefler T, Alistarh D. GPTQ: Accurate post-training quantization for generative pre-trained transformers. 2022, arXiv preprint arXiv: 2210.17323"},{"issue":"4","key":"60308_CR305","doi-asserted-by":"publisher","first-page":"12","DOI":"10.1145\/3714983.3714987","volume":"28","author":"J Lin","year":"2024","unstructured":"Lin J, Tang J, Tang H, Yang S, Dang X, Han S. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. 
GetMobile: Mobile Computing and Communications, 2024, 28(4): 12\u201317","journal-title":"GetMobile: Mobile Computing and Communications"},{"key":"60308_CR306","first-page":"107","volume-title":"Proceedings of the 61st ACM\/IEEE Design Automation Conference","author":"Z Guan","year":"2024","unstructured":"Guan Z, Huang H, Su Y, Huang H, Wong N, Yu H. APTQ: attention-aware post-training mixed-precision quantization for large language models. In: Proceedings of the 61st ACM\/IEEE Design Automation Conference. 2024, 107"},{"key":"60308_CR307","unstructured":"Dettmers T, Lewis M, Belkada Y, Zettlemoyer L. LLM.int8(): 8-bit matrix multiplication for transformers at scale. 2022, arXiv preprint arXiv: 2208.07339"},{"key":"60308_CR308","first-page":"3180","volume-title":"Proceedings of the 38th International Conference on Neural Information Processing Systems","author":"S Ashkboos","year":"2024","unstructured":"Ashkboos S, Mohtashami A, Croci M L, Li B, Cameron P, Jaggi M, Alistarh D, Hoefler T, Hensman J. QuaRot: Outlier-free 4-bit inference in rotated LLMs. In: Proceedings of the 38th International Conference on Neural Information Processing Systems. 2024, 3180"},{"key":"60308_CR309","volume-title":"Proceedings of the 13th International Conference on Learning Representations","author":"Z Liu","year":"2025","unstructured":"Liu Z, Zhao C, Fedorov I, Soran B, Choudhary D, Krishnamoorthi R, Chandra V, Tian Y, Blankevoort T. SpinQuant: LLM quantization with learned rotations. In: Proceedings of the 13th International Conference on Learning Representations. 2025"},{"key":"60308_CR310","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"Y Gu","year":"2024","unstructured":"Gu Y, Dong L, Wei F, Huang M. MiniLLM: knowledge distillation of large language models. In: Proceedings of the 12th International Conference on Learning Representations. 
2024"},{"key":"60308_CR311","doi-asserted-by":"publisher","first-page":"8003","DOI":"10.18653\/v1\/2023.findings-acl.507","volume-title":"Proceedings of Findings of the Association for Computational Linguistics: ACL 2023","author":"C Y Hsieh","year":"2023","unstructured":"Hsieh C Y, Li C L, Yeh C K, Nakhost H, Fujii Y, Ratner A, Krishna R, Lee C Y, Pfister T. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. In: Proceedings of Findings of the Association for Computational Linguistics: ACL 2023. 2023, 8003\u20138017"},{"key":"60308_CR312","first-page":"10323","volume-title":"Proceedings of the 40th International Conference on Machine Learning","author":"E Frantar","year":"2023","unstructured":"Frantar E, Alistarh D. SparseGPT: Massive language models can be accurately pruned in one-shot. In: Proceedings of the 40th International Conference on Machine Learning. 2023, 10323\u201310337"},{"key":"60308_CR313","unstructured":"Dong Z, Peng H, Liu P, Zhao W X, Wu D, Xiao F, Wang Z. Domain-specific pruning of large mixture-of-experts models with few-shot demonstrations. 2025, arXiv preprint arXiv: 2504.06792"},{"key":"60308_CR314","unstructured":"Zhang T, Hariri M, Zhong S, Chaudhary V, Sui Y, Hu X, Shrivastava A. 70% size, 100% accuracy: Lossless LLM compression for efficient GPU inference via dynamic-length float. 2025, arXiv preprint arXiv: 2504.11651"},{"key":"60308_CR315","doi-asserted-by":"publisher","first-page":"4791","DOI":"10.18653\/v1\/P19-1472","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"R Zellers","year":"2019","unstructured":"Zellers R, Holtzman A, Bisk Y, Farhadi A, Choi Y. HellaSwag: Can a machine really finish your sentence? In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 
2019, 4791\u20134800"},{"key":"60308_CR316","unstructured":"Clark P, Cowhey I, Etzioni O, Khot T, Sabharwal A, Schoenick C, Tafjord O. Think you have solved question answering? Try arc, the AI2 reasoning challenge. 2018, arXiv preprint arXiv: 1803.05457"},{"key":"60308_CR317","volume-title":"NeurIPS Datasets and Benchmarks","author":"D Hendrycks","year":"2021","unstructured":"Hendrycks D, Burns C, Kadavath S, Arora A, Basart S, Tang E, Song D, Steinhardt J. Measuring mathematical problem solving with the MATH dataset. In: NeurIPS Datasets and Benchmarks. 2021"},{"key":"60308_CR318","unstructured":"Cobbe K, Kosaraju V, Bavarian M, Chen M, Jun H, Kaiser L, Plappert M, Tworek J, Hilton J, Nakano R, Hesse C, Schulman J. Training verifiers to solve math word problems. 2021, arXiv preprint arXiv: 2110.14168"},{"key":"60308_CR319","unstructured":"Zhou J, Lu T, Mishra S, Brahma S, Basu S, Luan Y, Zhou D, Hou L. Instruction-following evaluation for large language models. 2023, arXiv preprint arXiv: 2311.07911"},{"key":"60308_CR320","doi-asserted-by":"publisher","first-page":"3214","DOI":"10.18653\/v1\/2022.acl-long.229","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"S Lin","year":"2022","unstructured":"Lin S, Hilton J, Evans O. TruthfulQA: Measuring how models mimic human falsehoods. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022, 3214\u20133252"},{"key":"60308_CR321","doi-asserted-by":"publisher","first-page":"1953","DOI":"10.18653\/v1\/2020.emnlp-main.154","volume-title":"Proceedings of 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"N Nangia","year":"2020","unstructured":"Nangia N, Vania C, Bhalerao R, Bowman S R. CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. 
In: Proceedings of 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020, 1953\u20131967"},{"key":"60308_CR322","doi-asserted-by":"publisher","first-page":"3356","DOI":"10.18653\/v1\/2020.findings-emnlp.301","volume-title":"Proceedings of Findings of the Association for Computational Linguistics: EMNLP 2020","author":"S Gehman","year":"2020","unstructured":"Gehman S, Gururangan S, Sap M, Choi Y, Smith N A. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In: Proceedings of Findings of the Association for Computational Linguistics: EMNLP 2020. 2020, 3356\u20133369"},{"key":"60308_CR323","doi-asserted-by":"publisher","first-page":"220","DOI":"10.1162\/tacl_a_00737","volume":"13","author":"R Vashurin","year":"2025","unstructured":"Vashurin R, Fadeeva E, Vazhentsev A, Rvanova L, Vasilev D, et al. Benchmarking uncertainty quantification methods for large language models with LM-polygraph. Transactions of the Association for Computational Linguistics, 2025, 13: 220\u2013248","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"60308_CR324","doi-asserted-by":"publisher","first-page":"4629","DOI":"10.18653\/v1\/2024.findings-acl.275","volume-title":"Proceedings of Findings of the Association for Computational Linguistics: ACL 2024","author":"A Salinas","year":"2024","unstructured":"Salinas A, Morstatter F. The butterfly effect of altering prompts: How small changes and jailbreaks affect large language model performance. In: Proceedings of Findings of the Association for Computational Linguistics: ACL 2024. 2024, 4629\u20134651"},{"key":"60308_CR325","volume-title":"Proceedings of the 9th International Conference on Learning Representations","author":"D Hendrycks","year":"2021","unstructured":"Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, Steinhardt J. Measuring massive multitask language understanding. 
In: Proceedings of the 9th International Conference on Learning Representations. 2021"},{"key":"60308_CR326","unstructured":"Srivastava A, Rastogi A, Rao A, Shoeb A A M, Abid A, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023"},{"key":"60308_CR327","doi-asserted-by":"crossref","unstructured":"Liang P, Bommasani R, Lee T, Tsipras D, Soylu D, et al. Holistic evaluation of language models. Transactions on Machine Learning Research, 2023","DOI":"10.1111\/nyas.15007"},{"key":"60308_CR328","first-page":"8359","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"W Chiang","year":"2024","unstructured":"Chiang W, Zheng L, Sheng Y, Angelopoulos A N, Li T, et al. Chatbot Arena: An open platform for evaluating LLMs by human preference. In: Proceedings of the 41st International Conference on Machine Learning. 2024, 8359\u20138388"},{"key":"60308_CR329","unstructured":"Sun H, Min Y, Chen Z, Zhao W X, Fang L, Liu Z, Wang Z, Wen J R. Challenging the boundaries of reasoning: An olympiad-level math benchmark for large language models. 2025, arXiv preprint arXiv: 2503.21380"},{"key":"60308_CR330","unstructured":"Rein D, Hou B L, Stickland A C, Petty J, Pang R Y, Dirani J, Michael J, Bowman S R. GPQA: A graduate-level google-proof Q&A benchmark. 2023, arXiv preprint arXiv: 2311.12022"},{"key":"60308_CR331","doi-asserted-by":"crossref","unstructured":"Phan L, Gatti A, Han Z, Li N, Hu J, et al. Humanity\u2019s last exam. 2025, arXiv preprint arXiv: 2501.14249","DOI":"10.70777\/si.v2i1.13973"},{"key":"60308_CR332","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"C E Jimenez","year":"2024","unstructured":"Jimenez C E, Yang J, Wettig A, Yao S, Pei K, Press O, Narasimhan K R. SWE-bench: Can language models resolve real-world GitHub issues? 
In: Proceedings of the 12th International Conference on Learning Representations. 2024"},{"key":"60308_CR333","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"G Mialon","year":"2024","unstructured":"Mialon G, Fourrier C, Wolf T, LeCun Y, Scialom T. GAIA: a benchmark for general AI assistants. In: Proceedings of the 12th International Conference on Learning Representations. 2024"},{"key":"60308_CR334","unstructured":"Luo Z, Shen Z, Yang W, Zhao Z, Jwalapuram P, Saha A, Sahoo D, Savarese S, Xiong C, Li J. MCP-universe: Benchmarking large language models with real-world model context protocol servers. 2025, arXiv preprint arXiv: 2508.14704"},{"key":"60308_CR335","first-page":"2749","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"Y Huang","year":"2023","unstructured":"Huang Y, Bai Y, Zhu Z, Zhang J, Zhang J, Su T, Liu J, Lv C, Lei J, Fu Y, Sun M, He J. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 2749"},{"key":"60308_CR336","doi-asserted-by":"publisher","first-page":"13003","DOI":"10.18653\/v1\/2023.findings-acl.824","volume-title":"Proceedings of Findings of the Association for Computational Linguistics: ACL 2023","author":"M Suzgun","year":"2023","unstructured":"Suzgun M, Scales N, Sch\u00e4rli N, Gehrmann S, Tay Y, Chung H W, Chowdhery A, Le Q, Chi E H, Zhou D, Wei J. Challenging BIG-bench tasks and whether chain-of-thought can solve them. In: Proceedings of Findings of the Association for Computational Linguistics: ACL 2023. 
2023, 13003\u201313051"},{"key":"60308_CR337","doi-asserted-by":"publisher","first-page":"9440","DOI":"10.18653\/v1\/2024.acl-long.511","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"P Wang","year":"2024","unstructured":"Wang P, Li L, Chen L, Cai Z, Zhu D, Lin B, Cao Y, Kong L, Liu Q, Liu T, Sui Z. Large language models are not fair evaluators. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024, 9440\u20139450"},{"key":"60308_CR338","first-page":"2020","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"L Zheng","year":"2023","unstructured":"Zheng L, Chiang W L, Sheng Y, Zhuang S, Wu Z, Zhuang Y, Lin Z, Li Z, Li D, Xing E P, Zhang H, Gonzalez J E, Stoica I. Judging LLM-as-a-judge with MT-bench and chatbot arena. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 2020"},{"key":"60308_CR339","volume-title":"AlpacaEval: An automatic evaluator of instruction-following models","author":"X Li","year":"2023","unstructured":"Li X, Zhang T, Dubois Y, Taori R, Gulrajani I, Guestrin C, Liang P, Hashimoto T B. AlpacaEval: An automatic evaluator of instruction-following models. See github.com\/tatsu-lab\/alpaca_eval website, 2023"},{"key":"60308_CR340","volume-title":"Proceedings of the 42nd International Conference on Machine Learning","author":"M Zhuge","year":"2025","unstructured":"Zhuge M, Zhao C, Ashley D R, Wang W, Khizbullin D, et al. Agent-as-a-judge: Evaluate agents with agents. In: Proceedings of the 42nd International Conference on Machine Learning. 2025"},{"key":"60308_CR341","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"S Golchin","year":"2024","unstructured":"Golchin S, Surdeanu M. 
Time travel in LLMs: Tracing data contamination in large language models. In: Proceedings of the 12th International Conference on Learning Representations. 2024"},{"key":"60308_CR342","volume-title":"Proceedings of the 13th International Conference on Learning Representations","author":"C White","year":"2025","unstructured":"White C, Dooley S, Roberts M, Pal A, Feuer B, et al. LiveBench: A challenging, contamination-limited LLM benchmark. In: Proceedings of the 13th International Conference on Learning Representations. 2025"},{"key":"60308_CR343","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"C Zheng","year":"2024","unstructured":"Zheng C, Zhou H, Meng F, Zhou J, Huang M. Large language models are not robust multiple choice selectors. In: Proceedings of the 12th International Conference on Learning Representations. 2024"},{"key":"60308_CR344","first-page":"4603","volume-title":"Proceedings of ECAI 2025 - 28th European Conference on Artificial Intelligence","author":"R Lunardi","year":"2025","unstructured":"Lunardi R, Della Mea V, Mizzaro S, Roitero K. On robustness and reliability of benchmark-based evaluation of LLMs. In: Proceedings of ECAI 2025 - 28th European Conference on Artificial Intelligence. 2025, 4603\u20134610"},{"key":"60308_CR345","unstructured":"Rahman M, Khatoonabadi S, Shihab E. Beyond synthetic benchmarks: Evaluating LLM performance on real-world class-level code generation. 2025, arXiv preprint arXiv: 2510.26130"},{"key":"60308_CR346","unstructured":"Chollet F, Knoop M, Kamradt G, Landers B. ARC prize 2024: technical report. 2024, arXiv preprint arXiv: 2412.04604"},{"key":"60308_CR347","volume-title":"OpenAI blog","author":"A Radford","year":"2019","unstructured":"Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. 
OpenAI blog, See cdn.openai.com\/better-language-models\/language_models_are_unsupervised_multitask_learners.pdf website, 2019"},{"key":"60308_CR348","volume-title":"Blog","author":"R Sutton","year":"2019","unstructured":"Sutton R. The bitter lesson. In: Blog. See www.cs.utexas.edu\/~eunsol\/courses\/data\/bitter_lesson.pdf website, 2019"},{"key":"60308_CR349","doi-asserted-by":"publisher","first-page":"6449","DOI":"10.18653\/v1\/2023.emnlp-main.397","volume-title":"Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing","author":"J Li","year":"2023","unstructured":"Li J, Cheng X, Zhao X, Nie J Y, Wen J R. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In: Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing. 2023, 6449\u20136464"},{"key":"60308_CR350","first-page":"4971","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"C Burns","year":"2024","unstructured":"Burns C, Izmailov P, Kirchner J H, Baker B, Gao L, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. In: Proceedings of the 41st International Conference on Machine Learning. 2024, 4971\u20135012"},{"key":"60308_CR351","unstructured":"Zhou K, Zhu Y, Chen Z, Chen W, Zhao W X, Chen X, Lin Y, Wen J R, Han J. Don\u2019t make your LLM an evaluation benchmark cheater. 2023, arXiv preprint arXiv: 2311.01964"},{"key":"60308_CR352","first-page":"2","volume-title":"Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology","author":"J S Park","year":"2023","unstructured":"Park J S, O\u2019Brien J C, Cai C J, Morris M R, Liang P, Bernstein M S. Generative agents: Interactive simulacra of human behavior. In: Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. 2023, 2"},{"key":"60308_CR353","unstructured":"Cheng D, Huang S, Gu Y, Song H, Chen G, Dong L, Zhao W X, Wen J R, Wei F. 
LLM-in-sandbox elicits general agentic intelligence. 2026, arXiv preprint arXiv: 2601.16206"},{"key":"60308_CR354","first-page":"795","volume-title":"Proceedings of Machine Learning and Systems 2022","author":"C J Wu","year":"2022","unstructured":"Wu C J, Raghavendra R, Gupta U, Acun B, Ardalani N, et al. Sustainable AI: Environmental implications, challenges and opportunities. In: Proceedings of Machine Learning and Systems 2022. 2022, 795\u2013813"},{"issue":"3","key":"60308_CR355","doi-asserted-by":"publisher","first-page":"257","DOI":"10.1080\/0960085X.2022.2026621","volume":"31","author":"P Mikalef","year":"2022","unstructured":"Mikalef P, Conboy K, Lundstr\u00f6m J E, Popovi\u010d A. Thinking responsibly about responsible AI and \u2018the dark side\u2019 of AI. European Journal of Information Systems, 2022, 31(3): 257\u2013268","journal-title":"European Journal of Information Systems"}],"container-title":["Frontiers of Computer Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11704-026-60308-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11704-026-60308-3","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11704-026-60308-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,13]],"date-time":"2026-05-13T16:32:22Z","timestamp":1778689942000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11704-026-60308-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,5,9]]},"references-count":355,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2026,12]]}},"alternative-id":["60308"],"URL":"https:\/\/doi.org\/10.1007\/s11704-026-60308-3","relation":{}
,"ISSN":["2095-2228","2095-2236"],"issn-type":[{"value":"2095-2228","type":"print"},{"value":"2095-2236","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,5,9]]},"assertion":[{"value":"14 February 2026","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 March 2026","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 May 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare that they have no competing interests or financial conflicts to disclose.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"2012627"}}