{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T08:59:50Z","timestamp":1768294790835,"version":"3.49.0"},"reference-count":60,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGKDD Explor. Newsl."],"published-print":{"date-parts":[[2025,12,30]]},"abstract":"<jats:p>Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency, while smaller models can efficiently handle simpler tasks with fewer resources. LLM routing is a crucial paradigm that dynamically selects the most suitable large language models from a pool of candidates to process diverse inputs, ensuring optimal resource utilization while maintaining response quality. Existing routing frameworks typically model this as a locally optimal decision-making problem, selecting the presumed best-fit LLM for each query individually, which overlooks global budget constraints, resulting in ineffective resource allocation. To tackle this problem, we introduce OmniRouter, a fundamentally controllable routing framework for multi-LLM serving. Instead of making per-query greedy choices, OmniRouter models the routing task as a constrained optimization problem, assigning models that minimize total cost while ensuring the required performance level. Specifically, a hybrid retrieval-augmented predictor is designed to predict the capabilities and costs of LLMs. After obtaining the predicted cost and performance, we utilize a constrained optimizer for cost-optimal assignments that employs Lagrangian dual decomposition with adaptive multipliers. 
It iteratively converges toward the globally optimal query-model allocation, dynamically balancing latency minimization against quality thresholds while adhering to heterogeneous capacity constraints. Experiments show that OmniRouter achieves up to 6.30% improvement in response accuracy while simultaneously reducing computational costs by at least 10.15% compared to competitive router baselines. The code and the dataset are available at https:\/\/github.com\/dongyuanjushi\/OmniRouter.<\/jats:p>","DOI":"10.1145\/3787470.3787480","type":"journal-article","created":{"date-parts":[[2026,1,1]],"date-time":"2026-01-01T00:46:21Z","timestamp":1767228381000},"page":"107-116","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["OmniRouter: Budget and Performance Controllable Multi-LLM Routing"],"prefix":"10.1145","volume":"27","author":[{"given":"Kai","family":"Mei","sequence":"first","affiliation":[{"name":"Department of Computer Science, Rutgers University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wujiang","family":"Xu","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Rutgers University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Minghao","family":"Guo","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Rutgers University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shuhang","family":"Lin","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Rutgers University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yongfeng","family":"Zhang","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Rutgers University"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,12,31]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Achiam J. Adler S. Agarwal S. Ahmad L. 
Akkaya I. Aleman F. L. Almeida D. Altenschmidt J. Altman S. Anadkat S. et al. Gpt-4 technical report. arXiv preprint:2303.08774 (OpenAI Technical Report) (2023)."},{"key":"e_1_2_1_2_1","volume-title":"Hybrid llm: Cost-efficient and quality-aware query routing. arXiv preprint arXiv:2404.14618","author":"Anonymous","year":"2024","unstructured":"Anonymous. Hybrid llm: Cost-efficient and quality-aware query routing. arXiv preprint arXiv:2404.14618 (2024)."},{"key":"e_1_2_1_3_1","volume-title":"Constrained optimization and Lagrange multiplier methods","author":"Bertsekas D. P.","year":"2014","unstructured":"Bertsekas, D. P. Constrained optimization and Lagrange multiplier methods. Academic press, 2014."},{"key":"e_1_2_1_4_1","first-page":"2206","volume-title":"International Conference on Machine Learning","author":"Borgeaud S.","year":"2022","unstructured":"Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., Van Den Driessche, G. B., Lespiau, J.-B., Damoc, B., Clark, A., et al. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning (2022), PMLR, pp. 2206--2240."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.52202\/079017-2120"},{"key":"e_1_2_1_6_1","volume-title":"Enabling efficient batch serving for lmaas via generation length prediction. arXiv preprint arXiv:2406.04785","author":"Cheng K.","year":"2024","unstructured":"Cheng, K., Hu, W., Wang, Z., Du, P., Li, J., and Zhang, S. Enabling efficient batch serving for lmaas via generation length prediction. arXiv preprint arXiv:2406.04785 (2024)."},{"key":"e_1_2_1_7_1","volume-title":"Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https:\/\/vicuna.lmsys.org (accessed","author":"Chiang W.-L.","year":"2023","unstructured":"Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., et al. 
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https:\/\/vicuna.lmsys.org (accessed 14 April 2023) 2, 3 (2023), 6."},{"key":"e_1_2_1_8_1","volume-title":"Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168","author":"Cobbe K.","year":"2021","unstructured":"Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)."},{"key":"e_1_2_1_9_1","volume-title":"Cost-effective online multi-llm selection with versatile reward models. arXiv preprint arXiv:2405.16587","author":"Dai X.","year":"2024","unstructured":"Dai, X., Li, J., Liu, X., Yu, A., and Lui, J. Cost-effective online multi-llm selection with versatile reward models. arXiv preprint arXiv:2405.16587 (2024)."},{"key":"e_1_2_1_10_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin J.","year":"2018","unstructured":"Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_2_1_11_1","unstructured":"Dubey A. Jauhri A. Pandey A. Kadian A. Al-Dahle A. Letman A. Mathur A. Schelten A. Yang A. Fan A. et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (Meta AI Technical Report) (2024)."},{"key":"e_1_2_1_12_1","volume-title":"Graphrouter: A graph-based router for llm selections. arXiv preprint arXiv:2410.03834","author":"Feng T.","year":"2024","unstructured":"Feng, T., Shen, Y., and You, J. Graphrouter: A graph-based router for llm selections. 
arXiv preprint arXiv:2410.03834 (2024)."},{"key":"e_1_2_1_13_1","volume-title":"Advances in Neural Information Processing Systems","volume":"36","author":"Feng X.","year":"2023","unstructured":"Feng, X., Wu, M., Feng, Y., Yin, X., Wang, S., Zeng, Y., Zeng, X., Wang, Z., Qin, R., Hu, G., et al. Towards understanding and mitigating the training data quality of large language models. In Advances in Neural Information Processing Systems (2023), vol. 36."},{"key":"e_1_2_1_14_1","volume-title":"Efficient llm scheduling by learning to rank. arXiv preprint arXiv:2408.15792","author":"Fu Y.","year":"2024","unstructured":"Fu, Y., Zhu, S., Su, R., Qiao, A., Stoica, I., and Zhang, H. Efficient llm scheduling by learning to rank. arXiv preprint arXiv:2408.15792 (2024)."},{"key":"e_1_2_1_15_1","volume-title":"Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948","author":"Guo D.","year":"2025","unstructured":"Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)."},{"key":"e_1_2_1_16_1","volume-title":"Deepsieve: Information sieving via llm-as-a-knowledge-router. arXiv preprint arXiv:2507.22050","author":"Guo M.","year":"2025","unstructured":"Guo, M., Zeng, Q., Zhao, X., Liu, Y., Yu, W., Du, M., Chen, H., and Cheng, W. Deepsieve: Information sieving via llm-as-a-knowledge-router. arXiv preprint arXiv:2507.22050 (2025)."},{"key":"e_1_2_1_17_1","volume-title":"Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300","author":"Hendrycks D.","year":"2020","unstructured":"Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. 
arXiv preprint arXiv:2009.03300 (2020)."},{"key":"e_1_2_1_18_1","volume-title":"Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874","author":"Hendrycks D.","year":"2021","unstructured":"Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874 (2021)."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1177\/003754979406200405"},{"key":"e_1_2_1_20_1","volume-title":"Inference without interference: Disaggregate llm inference for mixed downstream workloads. arXiv preprint arXiv:2401.11181","author":"Hu C.","year":"2024","unstructured":"Hu, C., Huang, H., Xu, L., Chen, X., Xu, J., Chen, S., Feng, H., Wang, C., Wang, S., Bao, Y., et al. Inference without interference: Disaggregate llm inference for mixed downstream workloads. arXiv preprint arXiv:2401.11181 (2024)."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-emnlp.585"},{"key":"e_1_2_1_22_1","volume-title":"Disentangling logic: The role of context in large language model reasoning capabilities. arXiv preprint arXiv:2406.02787","author":"Hua W.","year":"2024","unstructured":"Hua, W., Zhu, K., Li, L., Fan, L., Lin, S., Jin, M., Xue, H., Li, Z., Wang, J., and Zhang, Y. Disentangling logic: The role of context in large language model reasoning capabilities. arXiv preprint arXiv:2406.02787 (2024)."},{"key":"e_1_2_1_23_1","volume-title":"Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186","author":"Hui B.","year":"2024","unstructured":"Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186 (2024)."},{"key":"e_1_2_1_24_1","volume-title":"Two heads are better than one: Test-time scaling of multi-agent collaborative reasoning. 
arXiv preprint arXiv:2504.09772","author":"Jin C.","year":"2025","unstructured":"Jin, C., Peng, H., Zhang, Q., Tang, Y., Metaxas, D. N., and Che, T. Two heads are better than one: Test-time scaling of multi-agent collaborative reasoning. arXiv preprint arXiv:2504.09772 (2025)."},{"key":"e_1_2_1_25_1","first-page":"18015","article-title":"S3: Increasing gpu utilization during generative inference for higher throughput","volume":"36","author":"Jin Y.","year":"2023","unstructured":"Jin, Y., Wu, C.-F., Brooks, D., and Wei, G.-Y. S3: Increasing gpu utilization during generative inference for higher throughput. Advances in Neural Information Processing Systems 36 (2023), 18015--18027.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_2_1_27_1","first-page":"9459","article-title":"Retrieval-augmented generation for knowledge-intensive nlp tasks","volume":"33","author":"Lewis P.","year":"2020","unstructured":"Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., K\u00fcttler, H., Lewis, M., Yih, W.-t., Rockt\u00e4schel, T., et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33 (2020), 9459--9474.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_28_1","volume-title":"Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579","author":"Li H.","year":"2024","unstructured":"Li, H., Dong, Q., Chen, J., Su, H., Zhou, Y., Ai, Q., Ye, Z., and Liu, Y. Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579 (2024)."},{"key":"e_1_2_1_29_1","volume-title":"Aios: Llm agent operating system. arXiv e-prints","author":"Mei K.","year":"2024","unstructured":"Mei, K., Li, Z., Xu, S., Ye, R., Ge, Y., and Zhang, Y. Aios: Llm agent operating system. 
arXiv e-prints, pp. arXiv--2403 (2024)."},{"key":"e_1_2_1_30_1","volume-title":"Codegen2: Lessons for training llms on programming and natural languages. arXiv preprint arXiv:2305.02309","author":"Nijkamp E.","year":"2023","unstructured":"Nijkamp, E., Hayashi, H., Xiong, C., Savarese, S., and Zhou, Y. Codegen2: Lessons for training llms on programming and natural languages. arXiv preprint arXiv:2305.02309 (2023)."},{"key":"e_1_2_1_31_1","volume-title":"Routellm: Learning to route llms with preference data. arXiv preprint arXiv:2406.18665","author":"Ong I.","year":"2024","unstructured":"Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J. E., Kadous, M. W., and Stoica, I. Routellm: Learning to route llms with preference data. arXiv preprint arXiv:2406.18665 (2024)."},{"key":"e_1_2_1_32_1","volume-title":"Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560","author":"Packer C.","year":"2023","unstructured":"Packer, C., Fang, V., Patil, S. G., Lin, K., Wooders, S., and Gonzalez, J. E. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560 (2023)."},{"key":"e_1_2_1_33_1","volume-title":"Adaptive llm routing under budget constraints. arXiv preprint arXiv:2508.21141","author":"Panda P.","year":"2025","unstructured":"Panda, P., Magazine, R., Devaguptapu, C., Takemori, S., and Sharma, V. Adaptive llm routing under budget constraints. arXiv preprint arXiv:2508.21141 (2025)."},{"key":"e_1_2_1_34_1","volume-title":"Selectllm: Can llms select important instructions to annotate? arXiv preprint arXiv:2401.16553","author":"Parkar R. S.","year":"2024","unstructured":"Parkar, R. S., Kim, J., Park, J. I., and Kang, D. Selectllm: Can llms select important instructions to annotate? arXiv preprint arXiv:2401.16553 (2024)."},{"key":"e_1_2_1_35_1","volume-title":"International Conference on Learning Representations","author":"Qin X.","year":"2023","unstructured":"Qin, X., Yang, Y., Li, X., Dong, S., Huang, S., Ji, H., and Li, L. 
Towards robust llm-based decision-making: A calibration and planning approach. In International Conference on Learning Representations (2023)."},{"key":"e_1_2_1_36_1","first-page":"75","volume-title":"2024 USENIX Annual Technical Conference (USENIX ATC 24)","author":"Qiu H.","year":"2024","unstructured":"Qiu, H., Mao, W., Patke, A., Cui, S., Jha, S., Wang, C., Franke, H., Kalbarczyk, Z., Ba\u015far, T., and Iyer, R. K. Power-aware deep learning model serving with \u03bc-Serve. In 2024 USENIX Annual Technical Conference (USENIX ATC 24) (2024), pp. 75--93."},{"key":"e_1_2_1_37_1","volume-title":"The 5th International Workshop on Cloud Intelligence \/ AIOps at ASPLOS 2024","volume":"5","author":"Qiu H.","year":"2024","unstructured":"Qiu, H., Mao, W., Patke, A., Cui, S., Jha, S., Wang, C., Franke, H., Kalbarczyk, Z. T., Ba\u015far, T., and Iyer, R. K. Efficient interactive llm serving with proxy model-based sequence length prediction. In The 5th International Workshop on Cloud Intelligence \/ AIOps at ASPLOS 2024 (San Diego, CA, USA, 2024), vol. 5, Association for Computing Machinery, pp. 1--7."},{"key":"e_1_2_1_38_1","volume-title":"Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022","author":"Rein D.","year":"2023","unstructured":"Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022 (2023)."},{"key":"e_1_2_1_39_1","volume-title":"Metametrics and best practices for system-level inference performance benchmarking","author":"Salaria S.","year":"2025","unstructured":"Salaria, S., Liu, Z., and Gonzalez, N. M. Metametrics and best practices for system-level inference performance benchmarking, 2025."},{"key":"e_1_2_1_40_1","volume-title":"Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. 
arXiv preprint arXiv:1910.01108","author":"Sanh V.","year":"2019","unstructured":"Sanh, V., Debut, L., Chaumond, J., and Wolf, T. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)."},{"key":"e_1_2_1_41_1","volume-title":"The Thirteenth International Conference on Learning Representations","author":"Shi Z.","year":"2025","unstructured":"Shi, Z., Mei, K., Jin, M., Su, Y., Zuo, C., Hua, W., Xu, W., Ren, Y., Liu, Z., Du, M., Deng, D., and Zhang, Y. From commands to prompts: LLM-based semantic file system for aios. In The Thirteenth International Conference on Learning Representations (2025)."},{"key":"e_1_2_1_42_1","volume-title":"Carrot: A cost aware rate optimal router. arXiv preprint arXiv:2502.03261","author":"Somerstep S.","year":"2025","unstructured":"Somerstep, S., Polo, F. M., de Oliveira, A. F. M., Mangal, P., Silva, M., Bhardwaj, O., Yurochkin, M., and Maity, S. Carrot: A cost aware rate optimal router. arXiv preprint arXiv:2502.03261 (2025)."},{"key":"e_1_2_1_43_1","volume-title":"Dynamollm: Designing llm inference clusters for performance and energy efficiency. arXiv preprint arXiv:2408.00741","author":"Stojkovic J.","year":"2024","unstructured":"Stojkovic, J., Zhang, C., Goiri, \u00cd., Torrellas, J., and Choukse, E. Dynamollm: Designing llm inference clusters for performance and energy efficiency. arXiv preprint arXiv:2408.00741 (2024)."},{"key":"e_1_2_1_44_1","volume-title":"Llumnix: Dynamic scheduling for large language model serving. arXiv preprint arXiv:2406.03243","author":"Sun B.","year":"2024","unstructured":"Sun, B., Huang, Z., Zhao, H., Xiao, W., Zhang, X., Li, Y., and Lin, W. Llumnix: Dynamic scheduling for large language model serving. arXiv preprint arXiv:2406.03243 (2024)."},{"key":"e_1_2_1_45_1","volume-title":"Gemini: a family of highly capable multimodal models. 
arXiv preprint arXiv:2312.11805","author":"Team G.","year":"2023","unstructured":"Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)."},{"key":"e_1_2_1_46_1","volume-title":"Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118","author":"Team G.","year":"2024","unstructured":"Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ram\u00e9, A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118 (2024)."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3580305.3599764"},{"key":"e_1_2_1_48_1","volume-title":"Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120","author":"Wei Y.","year":"2023","unstructured":"Wei, Y., Wang, Z., Liu, J., Ding, Y., and Zhang, L. Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120 (2023)."},{"key":"e_1_2_1_49_1","volume-title":"A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110","author":"Xu W.","year":"2025","unstructured":"Xu, W., Liang, Z., Mei, K., Gao, H., Tan, J., and Zhang, Y. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110 (2025)."},{"key":"e_1_2_1_50_1","volume-title":"iagent: Llm agent as a shield between user and recommender systems. arXiv preprint arXiv:2502.14662","author":"Xu W.","year":"2025","unstructured":"Xu, W., Shi, Y., Liang, Z., Ning, X., Mei, K., Wang, K., Zhu, X., Xu, M., and Zhang, Y. iagent: Llm agent as a shield between user and recommender systems. arXiv preprint arXiv:2502.14662 (2025)."},{"key":"e_1_2_1_51_1","volume-title":"Qwen2.5 technical report. 
arXiv preprint arXiv:2412.15115","author":"Yang A.","year":"2024","unstructured":"Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 (2024)."},{"key":"e_1_2_1_52_1","first-page":"10285","volume-title":"Advances in Neural Information Processing Systems","volume":"35","author":"Yu P.","year":"2022","unstructured":"Yu, P., Trombetta, P., Hassani, A., Bulitko, V., and White, M. Gilbo: One metric to measure them all. In Advances in Neural Information Processing Systems (2022), vol. 35, pp. 10285--10297."},{"key":"e_1_2_1_53_1","volume-title":"When ai meets finance (stockagent): Large language model-based stock trading in simulated real-world environments. arXiv preprint arXiv:2407.18957","author":"Zhang C.","year":"2024","unstructured":"Zhang, C., Liu, X., Zhang, Z., Jin, M., Li, L., Wang, Z., Hua, W., Shu, D., Zhu, S., Jin, X., et al. When ai meets finance (stockagent): Large language model-based stock trading in simulated real-world environments. arXiv preprint arXiv:2407.18957 (2024)."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3366423.3380084"},{"key":"e_1_2_1_55_1","volume-title":"Advances in Neural Information Processing Systems","volume":"36","author":"Zhang Y.","year":"2023","unstructured":"Zhang, Y., Shi, X., Feng, L., Wang, M., Yu, Y., Yu, X., Shen, H., Chen, Z., Mucci, P., Kudlur, M., et al. S-lora: Serving thousands of concurrent lora adapters. In Advances in Neural Information Processing Systems (2023), vol. 36."},{"key":"e_1_2_1_56_1","volume-title":"Capability instruction tuning: A new paradigm for dynamic llm routing. arXiv preprint arXiv:2502.17282","author":"Zhang Y.-K.","year":"2025","unstructured":"Zhang, Y.-K., Zhan, D.-C., and Ye, H.-J. Capability instruction tuning: A new paradigm for dynamic llm routing. arXiv preprint arXiv:2502.17282 (2025)."},{"key":"e_1_2_1_57_1","unstructured":"Zheng L. Yin L. Xie Z. 
Huang J. Sun C. Yu C. Cao S. Kozyrakis C. Stoica I. Gonzalez J. E. et al. Efficiently programming large language models using sglang."},{"key":"e_1_2_1_58_1","volume-title":"Response length perception and sequence scheduling: An llm-empowered llm inference pipeline","author":"Zheng Z.","unstructured":"Zheng, Z., Ren, X., Xue, F., Luo, Y., Jiang, X., and You, Y. Response length perception and sequence scheduling: An llm-empowered llm inference pipeline. vol. 36."},{"key":"e_1_2_1_59_1","volume-title":"Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931","author":"Zhu Q.","year":"2024","unstructured":"Zhu, Q., Guo, D., Shao, Z., Yang, D., Wang, P., Xu, R., Wu, Y., Li, Y., Gao, H., Ma, S., et al. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931 (2024)."},{"key":"e_1_2_1_60_1","volume-title":"Embedllm: Learning compact representations of large language models. arXiv preprint arXiv:2410.02223","author":"Zhuang R.","year":"2024","unstructured":"Zhuang, R., Wu, T., Wen, Z., Li, A., Jiao, J., and Ramchandran, K. Embedllm: Learning compact representations of large language models. 
arXiv preprint arXiv:2410.02223 (2024)."}],"container-title":["ACM SIGKDD Explorations Newsletter"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3787470.3787480","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T00:43:16Z","timestamp":1768264996000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3787470.3787480"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,30]]},"references-count":60,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,12,30]]}},"alternative-id":["10.1145\/3787470.3787480"],"URL":"https:\/\/doi.org\/10.1145\/3787470.3787480","relation":{},"ISSN":["1931-0145","1931-0153"],"issn-type":[{"value":"1931-0145","type":"print"},{"value":"1931-0153","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,30]]},"assertion":[{"value":"2025-12-31","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}