{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T15:35:59Z","timestamp":1772724959724,"version":"3.50.1"},"reference-count":23,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2025,3,3]],"date-time":"2025-03-03T00:00:00Z","timestamp":1740960000000},"content-version":"vor","delay-in-days":61,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,2,28]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Autoregressive language models demonstrate excellent performance in various scenarios. However, the inference efficiency is limited by its one-step-one-word generation mode, which has become a pressing problem recently as the models become increasingly larger. Speculative decoding employs a \u201cdraft and then verify\u201d mechanism to allow multiple tokens to be generated in one step, realizing lossless acceleration. Existing methods mainly adopt fixed heuristic draft structures, which do not adapt to different situations to maximize the acceptance length during verification. To alleviate this dilemma, we propose OPT-Tree, an algorithm to construct adaptive and scalable draft trees, which can be applied to any autoregressive draft model. It searches the optimal tree structure that maximizes the mathematical expectation of the acceptance length in each decoding step. Experimental results reveal that OPT-Tree outperforms the existing draft structures and achieves a speed-up ratio of up to 3.2 compared with autoregressive decoding. If the draft model is powerful enough and the node budget is sufficient, it can generate more than ten tokens in a single step. 
Our code is available at https:\/\/github.com\/Jikai0Wang\/OPT-Tree.<\/jats:p>","DOI":"10.1162\/tacl_a_00735","type":"journal-article","created":{"date-parts":[[2025,3,3]],"date-time":"2025-03-03T15:46:42Z","timestamp":1741016802000},"page":"188-199","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":5,"title":["OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure"],"prefix":"10.1162","volume":"13","author":[{"given":"Jikai","family":"Wang","sequence":"first","affiliation":[{"name":"Soochow University, China. risus254@gmail.com"}]},{"given":"Yi","family":"Su","sequence":"additional","affiliation":[{"name":"Soochow University, China. yisunlp@outlook.com"}]},{"given":"Juntao","family":"Li","sequence":"additional","affiliation":[{"name":"Soochow University, China. ljt@suda.edu.cn"}]},{"given":"Qingrong","family":"Xia","sequence":"additional","affiliation":[{"name":"Huawei Cloud, China. xiaqingrong@huawei.com"}]},{"given":"Zi","family":"Ye","sequence":"additional","affiliation":[{"name":"Huawei Cloud, China. yezi3@huawei.com"}]},{"given":"Xinyu","family":"Duan","sequence":"additional","affiliation":[{"name":"Huawei Cloud, China. duanxinyu@huawei.com"}]},{"given":"Zhefeng","family":"Wang","sequence":"additional","affiliation":[{"name":"Huawei Cloud, China. wangzhefeng@huawei.com"}]},{"given":"Min","family":"Zhang","sequence":"additional","affiliation":[{"name":"Soochow University, China. 
zhangminmt@hotmail.com"}]}],"member":"281","published-online":{"date-parts":[[2025,2,28]]},"reference":[{"key":"2025051914252706600_bib1","doi-asserted-by":"publisher","first-page":"95","DOI":"10.18653\/v1\/2022.bigscience-1.9","article-title":"GPT-NeoX-20B: An open-source autoregressive language model","volume-title":"Proceedings of BigScience Episode #5 \u2013 Workshop on Challenges & Perspectives in Creating Large Language Models","author":"Black","year":"2022"},{"key":"2025051914252706600_bib2","article-title":"Medusa: Simple LLM inference acceleration framework with multiple decoding heads","author":"Cai","year":"2024","journal-title":"arXiv preprint arXiv:2401.10774v3"},{"key":"2025051914252706600_bib3","article-title":"Accelerating large language model decoding with speculative sampling","author":"Chen","year":"2023","journal-title":"arXiv preprint arXiv:2302.01318v1"},{"key":"2025051914252706600_bib4","article-title":"Sequoia: Scalable, robust, and hardware-aware speculative decoding","author":"Chen","year":"2024","journal-title":"arXiv preprint arXiv:2402.12374v2"},{"key":"2025051914252706600_bib5","article-title":"Cascade speculative drafting for even faster LLM inference","author":"Chen","year":"2023","journal-title":"arXiv preprint arXiv:2312.11462v4"},{"key":"2025051914252706600_bib6","article-title":"Training verifiers to solve math word problems","author":"Cobbe","year":"2021","journal-title":"arXiv preprint arXiv:2110.14168v2"},{"key":"2025051914252706600_bib7","article-title":"Break the sequential dependency of LLM inference using lookahead decoding","author":"Fu","year":"2024","journal-title":"arXiv preprint arXiv:2402.02057v1"},{"key":"2025051914252706600_bib8","article-title":"REST: Retrieval-based speculative decoding","author":"He","year":"2023","journal-title":"arXiv preprint arXiv:2311.08252v2"},{"key":"2025051914252706600_bib9","article-title":"Recursive speculative decoding: Accelerating LLM inference via sampling without 
replacement","volume-title":"ICLR 2024 Workshop on Large Language Model (LLM) Agents","author":"Jeon","year":"2024"},{"key":"2025051914252706600_bib10","article-title":"Mixtral of experts","author":"Jiang","year":"2024"},{"key":"2025051914252706600_bib11","first-page":"19274","article-title":"Fast inference from transformers via speculative decoding","volume-title":"International Conference on Machine Learning","author":"Leviathan","year":"2023"},{"key":"2025051914252706600_bib12","article-title":"EAGLE: Speculative sampling requires rethinking feature uncertainty","author":"Li","year":"2024","journal-title":"arXiv preprint arXiv:2401.15077v2"},{"key":"2025051914252706600_bib13","article-title":"Efficiently scaling transformer inference","volume":"5","author":"Pope","year":"2023","journal-title":"Proceedings of Machine Learning and Systems"},{"issue":"140","key":"2025051914252706600_bib14","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"Journal of Machine Learning Research"},{"key":"2025051914252706600_bib15","article-title":"Accelerating LLM inference with staged speculative decoding","author":"Spector","year":"2023","journal-title":"arXiv preprint arXiv:2308.04623v1"},{"key":"2025051914252706600_bib16","article-title":"Blockwise parallel decoding for deep autoregressive models","volume":"31","author":"Stern","year":"2018","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025051914252706600_bib17","article-title":"Llama 2: Open foundation and fine-tuned chat models","author":"Touvron","year":"2023","journal-title":"arXiv preprint arXiv:2307.09288v2"},{"key":"2025051914252706600_bib18","doi-asserted-by":"publisher","first-page":"3909","DOI":"10.18653\/v1\/2023.findings-emnlp.257","article-title":"Speculative decoding: Exploiting speculative execution for accelerating seq2seq 
generation","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2023","author":"Xia","year":"2023"},{"key":"2025051914252706600_bib19","article-title":"Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding","author":"Xia","year":"2024","journal-title":"arXiv preprint arXiv:2401.07851v3"},{"key":"2025051914252706600_bib20","article-title":"Predictive pipelined decoding: A compute-latency trade-off for exact LLM decoding","author":"Yang","year":"2023","journal-title":"arXiv preprint arXiv:2307.05908v2"},{"key":"2025051914252706600_bib21","article-title":"Draft & verify: Lossless large language model acceleration via self-speculative decoding","author":"Zhang","year":"2023","journal-title":"arXiv preprint arXiv:2309.08168v2"},{"key":"2025051914252706600_bib22","article-title":"Opt: Open pre-trained transformer language models","author":"Zhang","year":"2022","journal-title":"arXiv preprint arXiv:2205.01068v4"},{"key":"2025051914252706600_bib23","article-title":"Judging LLM-as-a-judge with MT-Bench and Chatbot Arena","volume":"36","author":"Zheng","year":"2024","journal-title":"Advances in Neural Information Processing Systems"}],"container-title":["Transactions of the Association for Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00735\/2506509\/tacl_a_00735.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00735\/2506509\/tacl_a_00735.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,19]],"date-time":"2025-05-19T18:25:39Z","timestamp":1747679139000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00735\/128189\/OPT-Tree-Speculative-Decoding-with-Adaptive-Draft"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025]]},"references-count":23,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00735","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025]]},"published":{"date-parts":[[2025]]}}}