{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T23:35:36Z","timestamp":1761176136001,"version":"build-2065373602"},"reference-count":0,"publisher":"IOS Press","isbn-type":[{"value":"9781643686318","type":"electronic"}],"license":[{"start":{"date-parts":[[2025,10,21]],"date-time":"2025-10-21T00:00:00Z","timestamp":1761004800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,10,21]]},"abstract":"<jats:p>In traditional LLM inference, the prefill phase requires exclusive access to the GPU, so the prefill and decode phases cannot execute concurrently. This makes it challenging to balance Time to First Token (TTFT) and Time Between Tokens (TBT). Existing chunked-prefill techniques decompose prefill tasks into chunks that do not require exclusive GPU access and can be batched, enabling prefill and decode to execute in parallel and improving the TTFT-TBT trade-off. However, current chunked-prefill techniques rely on manual experience to set static chunk sizes, making it difficult to optimise the TTFT-TBT trade-off under complex workloads. Consequently, it is necessary to adjust the prefill chunk size dynamically without interrupting LLM inference. This is difficult because LLM inference scenarios are real-time and concurrent, and the state information is high-dimensional. To address this, we propose DRLServe, an adaptive chunked-prefill inference technique based on deep reinforcement learning, which significantly improves the TTFT-TBT trade-off by dynamically adjusting the chunk size based on real-time load and system resource utilisation. 
Specifically, we propose a real-time adaptive prefill chunking inference framework (RAPC), which decouples inference from chunk size adjustment so that the chunk size can be adjusted without interrupting ongoing inference. RAPC implements a resource- and task-aware dual-driven chunk size decision mechanism, which obtains resource status and inference task characteristics in real time and dynamically adjusts the chunk size. Then, we present a reinforcement learning algorithm for real-time prefill chunk partitioning (TAPPO). For the first time, we model LLM inference as a Markov Decision Process (MDP). We use reinforcement learning to learn the chunk partitioning strategy, aiming to achieve adaptive decision-making of prefill chunk size under constraints of inference latency and training stability, and thereby to optimise the TTFT-TBT trade-off. Experiments show that, compared with the latest inference technology, DRLServe can reduce TTFT by 50.9%, mean TBT by 64.6%, and TBT variance by 99.0%.<\/jats:p>","DOI":"10.3233\/faia250867","type":"book-chapter","created":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T09:44:33Z","timestamp":1761126273000},"source":"Crossref","is-referenced-by-count":0,"title":["DRLServe: Adaptive Prefill Chunking with Deep Reinforcement Learning for LLM Inference"],"prefix":"10.3233","author":[{"given":"Chongxiang","family":"Sun","sequence":"first","affiliation":[{"name":"National Key Laboratory of Parallel and Distributed Computing (National University of Defense Technology), Changsha 410073, China"},{"name":"College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China"},{"name":"State Key Laboratory of Complex & Critical Software Environment"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Han","family":"Bao","sequence":"additional","affiliation":[{"name":"National Key Laboratory of Parallel and Distributed Computing (National University of Defense Technology), 
Changsha 410073, China"},{"name":"College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China"},{"name":"State Key Laboratory of Complex & Critical Software Environment"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yijie","family":"Wang","sequence":"additional","affiliation":[{"name":"National Key Laboratory of Parallel and Distributed Computing (National University of Defense Technology), Changsha 410073, China"},{"name":"College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China"},{"name":"State Key Laboratory of Complex & Critical Software Environment"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"7437","container-title":["Frontiers in Artificial Intelligence and Applications","ECAI 2025"],"original-title":[],"link":[{"URL":"https:\/\/ebooks.iospress.nl\/pdf\/doi\/10.3233\/FAIA250867","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T09:44:34Z","timestamp":1761126274000},"score":1,"resource":{"primary":{"URL":"https:\/\/ebooks.iospress.nl\/doi\/10.3233\/FAIA250867"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,21]]},"ISBN":["9781643686318"],"references-count":0,"URL":"https:\/\/doi.org\/10.3233\/faia250867","relation":{},"ISSN":["0922-6389","1879-8314"],"issn-type":[{"value":"0922-6389","type":"print"},{"value":"1879-8314","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,21]]}}}