{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,19]],"date-time":"2026-05-19T07:16:54Z","timestamp":1779175014941,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":103,"publisher":"ACM","funder":[{"DOI":"10.13039\/100000001","name":"NSF (National Science Foundation)","doi-asserted-by":"publisher","award":["CNS 2146496"],"award-info":[{"award-number":["CNS 2146496"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,10,13]]},"DOI":"10.1145\/3731569.3764855","type":"proceedings-article","created":{"date-parts":[[2025,10,1]],"date-time":"2025-10-01T12:43:24Z","timestamp":1759322604000},"page":"606-622","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0265-2144","authenticated-orcid":false,"given":"Siddhant","family":"Ray","sequence":"first","affiliation":[{"name":"University of Chicago, Chicago, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6973-3259","authenticated-orcid":false,"given":"Rui","family":"Pan","sequence":"additional","affiliation":[{"name":"Princeton University, Princeton, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-1076-6549","authenticated-orcid":false,"given":"Zhuohan","family":"Gu","sequence":"additional","affiliation":[{"name":"University of Chicago, Chicago, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3964-4079","authenticated-orcid":false,"given":"Kuntai","family":"Du","sequence":"additional","affiliation":[{"name":"University of Chicago, Chicago, USA"},{"name":"TensorMesh, Inc., Foster City, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3346-5165","authenticated-orcid":false,"given":"Shaoting","family":"Feng","sequence":"additional","affiliation":[{"name":"University of Chicago, Chicago, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7479-1664","authenticated-orcid":false,"given":"Ganesh","family":"Ananthanarayanan","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7002-5033","authenticated-orcid":false,"given":"Ravi","family":"Netravali","sequence":"additional","affiliation":[{"name":"Princeton University, Princeton, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6877-1683","authenticated-orcid":false,"given":"Junchen","family":"Jiang","sequence":"additional","affiliation":[{"name":"University of Chicago, Chicago, USA"},{"name":"TensorMesh, Inc., Foster City, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,10,12]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"https:\/\/docs.llamaindex.ai\/en\/stable\/examples\/param_optimizer\/param_optimizer\/","author":"Hyperparameter Optimization","year":"2024","unstructured":"Hyperparameter Optimization for RAG. 
https:\/\/docs.llamaindex.ai\/en\/stable\/examples\/param_optimizer\/param_optimizer\/, 2024."},{"key":"e_1_3_2_1_2_1","volume-title":"Forty-first International Conference on Machine Learning.","author":"Abhyankar Reyna","unstructured":"Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, and Yiying Zhang. Infercept: Efficient intercept support for augmented large language model inference. In Forty-first International Conference on Machine Learning."},{"key":"e_1_3_2_1_3_1","first-page":"134","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Agrawal Amey","year":"2024","unstructured":"Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 117\u2013134, Santa Clara, CA, July 2024. USENIX Association."},{"key":"e_1_3_2_1_4_1","volume-title":"Cohere: Cutting-edge gen ai","author":"Cohere","year":"2023","unstructured":"Cohere AI. Cohere: Cutting-edge gen ai, 2023."},{"key":"e_1_3_2_1_5_1","volume-title":"29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"2","author":"Ansel Jason","year":"2024","unstructured":"Jason et al. Ansel. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. In 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '24). ACM, April 2024."},{"key":"e_1_3_2_1_6_1","first-page":"46","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts)","author":"Asai Akari","year":"2023","unstructured":"Akari Asai, Sewon Min, Zexuan Zhong, and Danqi Chen. Retrieval-based language models and applications. 
In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts), pages 41\u201346, 2023."},{"key":"e_1_3_2_1_7_1","unstructured":"Angels Balaguer Vinamra Benara Renato Luiz de Freitas Cunha Roberto de M. Estev\u00e3o Filho Todd Hendry Daniel Holstein Jennifer Marsman Nick Mecklenburg Sara Malvar Leonardo O. Nunes Rafael Padilha Morris Sharp Bruno Silva Swati Sharma Vijay Aski and Ranveer Chandra. Rag vs fine-tuning: Pipelines tradeoffs and a case study on agriculture 2024."},{"key":"e_1_3_2_1_8_1","unstructured":"Harrison Chase. LangChain October 2022."},{"key":"e_1_3_2_1_9_1","volume-title":"Do large language models need a content delivery network?","author":"Cheng Yihua","year":"2024","unstructured":"Yihua Cheng, Kuntai Du, Jiayi Yao, and Junchen Jiang. Do large language models need a content delivery network?, 2024."},{"key":"e_1_3_2_1_10_1","first-page":"729","volume-title":"Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '24","author":"Cuconasu Florin","year":"2024","unstructured":"Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. The power of noise: Redefining retrieval for rag systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '24, page 719\u2013729, New York, NY, USA, 2024. Association for Computing Machinery."},{"key":"e_1_3_2_1_11_1","first-page":"623","volume-title":"Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles","author":"Dai Yinwei","year":"2024","unstructured":"Yinwei Dai, Rui Pan, Anand Iyer, Kai Li, and Ravi Netravali. Apparate: Rethinking early exits to tame latency-throughput tensions in ml serving. 
In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 607\u2013623, 2024."},{"key":"e_1_3_2_1_12_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019."},{"key":"e_1_3_2_1_13_1","volume-title":"Hybrid llm: Cost-efficient and quality-aware query routing","author":"Ding Dujian","year":"2024","unstructured":"Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, and Ahmed Hassan Awadallah. Hybrid llm: Cost-efficient and quality-aware query routing, 2024."},{"key":"e_1_3_2_1_14_1","volume-title":"Get more with less: Synthesizing recurrence with kv cache compression for efficient llm inference","author":"Dong Harry","year":"2024","unstructured":"Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, and Beidi Chen. Get more with less: Synthesizing recurrence with kv cache compression for efficient llm inference, 2024."},{"key":"e_1_3_2_1_15_1","first-page":"6417","volume-title":"Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '24","author":"dos Santos Junior Jos\u00e9 Cassio","year":"2024","unstructured":"Jos\u00e9 Cassio dos Santos Junior, Rachel Hu, Richard Song, and Yunfei Bai. Domain-driven llm development: Insights into rag and fine-tuning practices. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '24, page 6416\u20136417, New York, NY, USA, 2024. 
Association for Computing Machinery."},{"key":"e_1_3_2_1_16_1","volume-title":"The faiss library","author":"Douze Matthijs","year":"2024","unstructured":"Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazar\u00e9, Maria Lomeli, Lucas Hosseini, and Herv\u00e9 J\u00e9gou. The faiss library. 2024."},{"key":"e_1_3_2_1_17_1","volume-title":"From local to global: A graph rag approach to query-focused summarization","author":"Edge Darren","year":"2024","unstructured":"Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization, 2024."},{"key":"e_1_3_2_1_18_1","volume-title":"Text and code embeddings by contrastive pre-training","author":"Arvind Neelakantan","year":"2022","unstructured":"Arvind Neelakantan et al. Text and code embeddings by contrastive pre-training, 2022."},{"key":"e_1_3_2_1_19_1","volume-title":"Mteb: Massive text embedding benchmark","author":"Face Hugging","year":"2024","unstructured":"Hugging Face. Mteb: Massive text embedding benchmark, 2024."},{"key":"e_1_3_2_1_20_1","first-page":"6501","volume-title":"Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '24","author":"Fan Wenqi","year":"2024","unstructured":"Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '24, page 6491\u20136501, New York, NY, USA, 2024. Association for Computing Machinery."},{"key":"e_1_3_2_1_21_1","volume-title":"Autorag-hp: Automatic online hyper-parameter tuning for retrieval-augmented generation. 
arXiv preprint arXiv:2406.19251","author":"Fu Jia","year":"2024","unstructured":"Jia Fu, Xiaoting Qin, Fangkai Yang, Lu Wang, Jue Zhang, Qingwei Lin, Yubo Chen, Dongmei Zhang, Saravan Rajmohan, and Qi Zhang. Autorag-hp: Automatic online hyper-parameter tuning for retrieval-augmented generation. arXiv preprint arXiv:2406.19251, 2024."},{"key":"e_1_3_2_1_22_1","first-page":"126","volume-title":"2024 USENIX Annual Technical Conference (USENIX ATC 24)","author":"Gao Bin","year":"2024","unstructured":"Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. Cost-efficient large language model serving for multi-turn conversations with cachedattention. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 111\u2013126, 2024."},{"key":"e_1_3_2_1_23_1","first-page":"2376","volume-title":"Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22","author":"Gao Luyu","unstructured":"Luyu Gao and Jamie Callan. Long document re-ranking with modular re-ranker. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22, page 2371\u20132376. ACM, July 2022."},{"key":"e_1_3_2_1_24_1","volume-title":"AWS","author":"Genevay Aude","year":"2024","unstructured":"Aude Genevay. From rag to fabric: Lessons learned from building real-world rags at genaiic \u2013 part 1. Technical report, AWS, 2024."},{"key":"e_1_3_2_1_25_1","first-page":"6595","volume-title":"Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)","author":"Geng Jiahui","year":"2024","unstructured":"Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. A survey of confidence estimation and calibration in large language models. 
In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6577\u20136595, Mexico City, Mexico, June 2024. Association for Computational Linguistics."},{"key":"e_1_3_2_1_26_1","first-page":"325","article-title":"Modular attention reuse for low-latency inference","volume":"6","author":"Gim In","year":"2024","unstructured":"In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems, 6:325\u2013338, 2024.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_27_1","volume-title":"Reagan: Node-as-agent-reasoning graph agentic network","author":"Guo Minghao","year":"2025","unstructured":"Minghao Guo, Xi Zhu, Jingyuan Huang, Kai Mei, and Yongfeng Zhang. Reagan: Node-as-agent-reasoning graph agentic network, 2025."},{"key":"e_1_3_2_1_28_1","volume-title":"Lightrag: Simple and fast retrieval-augmented generation","author":"Guo Zirui","year":"2024","unstructured":"Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. Lightrag: Simple and fast retrieval-augmented generation, 2024."},{"key":"e_1_3_2_1_29_1","volume-title":"Ruler: What's the real context size of your long-context language models?","author":"Hsieh Cheng-Ping","year":"2024","unstructured":"Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 
Ruler: What's the real context size of your long-context language models?, 2024."},{"key":"e_1_3_2_1_30_1","volume-title":"Memserve: Context caching for disaggregated llm serving with elastic memory pool","author":"Hu Cunchen","year":"2024","unstructured":"Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, and Yizhou Shan. Memserve: Context caching for disaggregated llm serving with elastic memory pool, 2024."},{"key":"e_1_3_2_1_31_1","volume-title":"Epic: Efficient position-independent context caching for serving large language models","author":"Hu Junhao","year":"2024","unstructured":"Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, and Tao Xie. Epic: Efficient position-independent context caching for serving large language models, 2024."},{"key":"e_1_3_2_1_32_1","volume-title":"Routerbench: A benchmark for multi-llm routing system","author":"Hu Qitian Jason","year":"2024","unstructured":"Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. Routerbench: A benchmark for multi-llm routing system, 2024."},{"key":"e_1_3_2_1_33_1","volume-title":"Mrag-bench: Vision-centric evaluation for retrieval-augmented multimodal models","author":"Hu Wenbo","year":"2025","unstructured":"Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou, Mohsen Fayyaz, Pan Lu, Kai-Wei Chang, and Nanyun Peng. Mrag-bench: Vision-centric evaluation for retrieval-augmented multimodal models, 2025."},{"key":"e_1_3_2_1_34_1","first-page":"1012","volume-title":"Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, KDD '25","author":"Huang Yiqian","year":"2025","unstructured":"Yiqian Huang, Shiqi Zhang, and Xiaokui Xiao. Ket-rag: A cost-efficient multi-granular indexing framework for graph-rag. 
In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, KDD '25, page 1003\u20131012, New York, NY, USA, 2025. Association for Computing Machinery."},{"key":"e_1_3_2_1_35_1","first-page":"7050","volume-title":"Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)","author":"Jeong Soyeong","year":"2024","unstructured":"Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong Park. Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7036\u20137050, Mexico City, Mexico, June 2024. Association for Computational Linguistics."},{"key":"e_1_3_2_1_36_1","volume-title":"Rago: Systematic performance optimization for retrieval-augmented generation serving","author":"Jiang Wenqi","year":"2025","unstructured":"Wenqi Jiang, Suvinay Subramanian, Cat Graves, Gustavo Alonso, Amir Yazdanbakhsh, and Vidushi Dadu. Rago: Systematic performance optimization for retrieval-augmented generation serving, 2025."},{"key":"e_1_3_2_1_37_1","volume-title":"Piperag: Fast retrieval-augmented generation via algorithm-system co-design. arXiv preprint arXiv:2403.05676","author":"Jiang Wenqi","year":"2024","unstructured":"Wenqi Jiang, Shuai Zhang, Boran Han, Jie Wang, Bernie Wang, and Tim Kraska. Piperag: Fast retrieval-augmented generation via algorithm-system co-design. arXiv preprint arXiv:2403.05676, 2024."},{"key":"e_1_3_2_1_38_1","volume-title":"Neo: Saving gpu memory crisis with cpu offloading for online llm inference","author":"Jiang Xuanlin","year":"2024","unstructured":"Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, and Minlan Yu. 
Neo: Saving gpu memory crisis with cpu offloading for online llm inference, 2024."},{"key":"e_1_3_2_1_39_1","volume-title":"Active retrieval augmented generation","author":"Jiang Zhengbao","year":"2023","unstructured":"Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation, 2023."},{"key":"e_1_3_2_1_40_1","volume-title":"Ragcache: Efficient knowledge caching for retrieval-augmented generation","author":"Jin Chao","year":"2024","unstructured":"Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, and Xin Jin. Ragcache: Efficient knowledge caching for retrieval-augmented generation, 2024."},{"key":"e_1_3_2_1_41_1","volume-title":"Compute or load kv cache? why not both?","author":"Jin Shuowei","year":"2024","unstructured":"Shuowei Jin, Xueshen Liu, Qingzhao Zhang, and Z. Morley Mao. Compute or load kv cache? why not both?, 2024."},{"key":"e_1_3_2_1_42_1","volume-title":"Autorag: Automated framework for optimization of retrieval augmented generation pipeline","author":"Kim Dongkyu","year":"2024","unstructured":"Dongkyu Kim, Byoungwook Kim, Donggeon Han, and Matou\u0161 Eibich. Autorag: Automated framework for optimization of retrieval augmented generation pipeline, 2024."},{"key":"e_1_3_2_1_43_1","volume-title":"The effect of scheduling and preemption on the efficiency of llm inference serving","author":"Kim Kyoungmin","year":"2024","unstructured":"Kyoungmin Kim, Kijae Hong, Caglar Gulcehre, and Anastasia Ailamaki. The effect of scheduling and preemption on the efficiency of llm inference serving, 2024."},{"key":"e_1_3_2_1_44_1","volume-title":"Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles","author":"Kwon Woosuk","year":"2023","unstructured":"Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 
Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023."},{"key":"e_1_3_2_1_45_1","volume-title":"Long context rag performance of large language models","author":"Leng Quinn","year":"2024","unstructured":"Quinn Leng, Jacob Portes, Sam Havens, Matei Zaharia, and Michael Carbin. Long context rag performance of large language models, 2024."},{"key":"e_1_3_2_1_46_1","unstructured":"Yangning Li Weizhi Zhang Yuyao Yang Wei-Chieh Huang Yaozu Wu Junyu Luo Yuanchen Bei Henry Peng Zou Xiao Luo Yusheng Zhao Chunkit Chan Yankai Chen Zhongfen Deng Yinghui Li Hai-Tao Zheng Dongyuan Li Renhe Jiang Ming Zhang Yangqiu Song and Philip S. Yu. Towards agentic rag with deep reasoning: A survey of rag-reasoning systems in llms 2025."},{"key":"e_1_3_2_1_47_1","volume-title":"Retrieval augmented generation or long-context llms? a comprehensive study and hybrid approach","author":"Li Zhuowan","year":"2024","unstructured":"Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky. Retrieval augmented generation or long-context llms? a comprehensive study and hybrid approach, 2024."},{"key":"e_1_3_2_1_48_1","volume-title":"Parrot: Efficient serving of llm-based applications with semantic variable","author":"Lin Chaofan","year":"2024","unstructured":"Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. Parrot: Efficient serving of llm-based applications with semantic variable, 2024."},{"key":"e_1_3_2_1_49_1","volume-title":"Telerag: Efficient retrieval-augmented generation inference with lookahead retrieval","author":"Lin Chien-Yu","year":"2025","unstructured":"Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, Yile Gu, Rulin Shao, Zihao Ye, Kan Zhu, Stephanie Wang, Arvind Krishnamurthy, Rohan Kadekodi, Luis Ceze, and Baris Kasikci. 
Telerag: Efficient retrieval-augmented generation inference with lookahead retrieval, 2025."},{"key":"e_1_3_2_1_50_1","volume-title":"Minicache: Kv cache compression in depth dimension for large language models","author":"Liu Akide","year":"2024","unstructured":"Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. Minicache: Kv cache compression in depth dimension for large language models, 2024."},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"crossref","first-page":"157","DOI":"10.1162\/tacl_a_00638","article-title":"Lost in the middle: How language models use long contexts","volume":"12","author":"Liu Nelson F","year":"2024","unstructured":"Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157\u2013173, 2024.","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"e_1_3_2_1_52_1","volume-title":"Hm-rag: Hierarchical multi-agent multimodal retrieval augmented generation","author":"Liu Pei","year":"2025","unstructured":"Pei Liu, Xin Liu, Ruoyu Yao, Junming Liu, Siyuan Meng, Ding Wang, and Jun Ma. Hm-rag: Hierarchical multi-agent multimodal retrieval augmented generation, 2025."},{"key":"e_1_3_2_1_53_1","volume-title":"Droidspeak: Enhancing cross-llm communication","author":"Liu Yuhan","year":"2024","unstructured":"Yuhan Liu, Esha Choukse, Shan Lu, Junchen Jiang, and Madan Musuvathi. Droidspeak: Enhancing cross-llm communication, 2024."},{"key":"e_1_3_2_1_54_1","first-page":"56","volume-title":"Proceedings of the ACM SIGCOMM 2024 Conference, ACM SIGCOMM '24","author":"Liu Yuhan","year":"2024","unstructured":"Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 
Cachegen: Kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, ACM SIGCOMM '24, page 38\u201356, New York, NY, USA, 2024. Association for Computing Machinery."},{"key":"e_1_3_2_1_55_1","volume-title":"Docugami knowledge graph retrieval augmented generation (kg-rag) datasets","year":"2024","unstructured":"LlamaHub. Docugami knowledge graph retrieval augmented generation (kg-rag) datasets, 2024."},{"key":"e_1_3_2_1_56_1","volume-title":"Query rewriting for retrieval-augmented large language models","author":"Ma Xinbei","year":"2023","unstructured":"Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Query rewriting for retrieval-augmented large language models, 2023."},{"key":"e_1_3_2_1_57_1","volume-title":"Zero-shot listwise document reranking with a large language model","author":"Ma Xueguang","year":"2023","unstructured":"Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin. Zero-shot listwise document reranking with a large language model, 2023."},{"key":"e_1_3_2_1_58_1","first-page":"735","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2024","author":"Mao Kelong","year":"2024","unstructured":"Kelong Mao, Zheng Liu, Hongjin Qian, Fengran Mo, Chenlong Deng, and Zhicheng Dou. RAG-studio: Towards in-domain adaptation of retrieval augmented generation through self-alignment. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 725\u2013735, Miami, Florida, USA, November 2024. Association for Computational Linguistics."},{"key":"e_1_3_2_1_59_1","volume-title":"Hiba Eltigani, Rukhshan Haroon, Swaminathan Lamelas, and Fahad Dogar. Llmproxy: Reducing cost to access large language models","author":"Martin Noah","year":"2024","unstructured":"Noah Martin, Abdullah Bin Faisal, Hiba Eltigani, Rukhshan Haroon, Swaminathan Lamelas, and Fahad Dogar. 
Llmproxy: Reducing cost to access large language models, 2024."},{"key":"e_1_3_2_1_60_1","volume-title":"NVIDIA","author":"Merritt Rick","year":"2024","unstructured":"Rick Merritt. What is retrieval-augmented generation, aka rag? Technical report, NVIDIA, 2024."},{"key":"e_1_3_2_1_61_1","volume-title":"Arshad Rafiq Shaikh, and Majid Yazdani. Routoo: Learning to route to large language models effectively","author":"Mohammadshahi Alireza","year":"2024","unstructured":"Alireza Mohammadshahi, Arshad Rafiq Shaikh, and Majid Yazdani. Routoo: Learning to route to large language models effectively, 2024."},{"key":"e_1_3_2_1_62_1","volume-title":"Meta knowledge for retrieval augmented large language models","author":"Mombaerts Laurent","year":"2024","unstructured":"Laurent Mombaerts, Terry Ding, Adi Banerjee, Florian Felice, Jonathan Taws, and Tarik Borogovac. Meta knowledge for retrieval augmented large language models, 2024."},{"key":"e_1_3_2_1_63_1","volume-title":"Chanh Le, Hong An Phan, Shruti Raghavan, and Christopher Nguyen. Enhancing q&a with domain-specific fine-tuning and iterative reasoning: A comparative study","author":"Nguyen Zooey","year":"2024","unstructured":"Zooey Nguyen, Anthony Annunziata, Vinh Luong, Sang Dinh, Quynh Le, Anh Hai Ha, Chanh Le, Hong An Phan, Shruti Raghavan, and Christopher Nguyen. Enhancing q&a with domain-specific fine-tuning and iterative reasoning: A comparative study, 2024."},{"key":"e_1_3_2_1_64_1","volume-title":"Routellm: Learning to route llms with preference data","author":"Ong Isaac","year":"2024","unstructured":"Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data, 2024."},{"key":"e_1_3_2_1_65_1","volume-title":"Routellm: Learning to route llms with preference data","author":"Ong Isaac","year":"2025","unstructured":"Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. 
Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data, 2025."},{"key":"e_1_3_2_1_66_1","volume-title":"Openai api","author":"AI.","year":"2023","unstructured":"OpenAI. Openai api, 2023."},{"key":"e_1_3_2_1_67_1","volume-title":"Submitted to Tsinghua University Course: Advanced Machine Learning","author":"Ouyang Yang","year":"2024","unstructured":"Yang Ouyang, Tong Yu, and Wenchu Wang. Context-aware chatbot extension leveraging HTML data and retrieval-augmented generation (RAG). In Submitted to Tsinghua University Course: Advanced Machine Learning, 2024. under review."},{"key":"e_1_3_2_1_68_1","volume-title":"Marconi: Prefix caching for the era of hybrid llms. arXiv preprint arXiv:2411.19379","author":"Pan Rui","year":"2024","unstructured":"Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, and Ravi Netravali. Marconi: Prefix caching for the era of hybrid llms. arXiv preprint arXiv:2411.19379, 2024."},{"key":"e_1_3_2_1_69_1","volume-title":"Memorag: Moving towards next-gen rag via memory-inspired knowledge discovery","author":"Qian Hongjin","year":"2024","unstructured":"Hongjin Qian, Peitian Zhang, Zheng Liu, Kelong Mao, and Zhicheng Dou. Memorag: Moving towards next-gen rag via memory-inspired knowledge discovery, 2024."},{"key":"e_1_3_2_1_70_1","volume-title":"Mooncake: A kvcache-centric disaggregated architecture for llm serving","author":"Qin Ruoyu","year":"2024","unstructured":"Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: A kvcache-centric disaggregated architecture for llm serving, 2024."},{"key":"e_1_3_2_1_71_1","first-page":"2392","volume-title":"Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing","author":"Rajpurkar Pranav","year":"2016","unstructured":"Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. 
In Jian Su, Kevin Duh, and Xavier Carreras, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383\u20132392, Austin, Texas, November 2016. Association for Computational Linguistics."},{"key":"e_1_3_2_1_72_1","volume-title":"Sentence-bert: Sentence embeddings using siamese bert-networks","author":"Reimers Nils","year":"2019","unstructured":"Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks, 2019."},{"key":"e_1_3_2_1_73_1","unstructured":"Grand View Research. Retrieval augmented generation market size share and trend analysis report by function (document retrieval recommendation engines) by application (content generation) by deployment (cloud on-premises) by end use by region and segment forecasts 2025 - 2030 2024."},{"key":"e_1_3_2_1_74_1","volume-title":"Beyond text: Optimizing rag with multimodal inputs for industrial applications","author":"Riedler Monica","year":"2024","unstructured":"Monica Riedler and Stefan Langer. Beyond text: Optimizing rag with multimodal inputs for industrial applications, 2024."},{"key":"e_1_3_2_1_75_1","volume-title":"Ragchecker: A fine-grained framework for diagnosing retrieval-augmented generation","author":"Ru Dongyu","year":"2024","unstructured":"Dongyu Ru, Lin Qiu, Xiangkun Hu, Tianhang Zhang, Peng Shi, Shuaichen Chang, Cheng Jiayang, Cunxiang Wang, Shichao Sun, Huanyu Li, Zizhao Zhang, Binjie Wang, Jiarong Jiang, Tong He, Zhiguo Wang, Pengfei Liu, Yue Zhang, and Zheng Zhang. Ragchecker: A fine-grained framework for diagnosing retrieval-augmented generation, 2024."},{"key":"e_1_3_2_1_76_1","volume-title":"Fast inference for augmented large language models","author":"Shahout Rana","year":"2024","unstructured":"Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, and Michael Mitzenmacher. 
Fast inference for augmented large language models, 2024."},{"key":"e_1_3_2_1_77_1","first-page":"91299","volume-title":"Advances in Neural Information Processing Systems","author":"Shao Rulin","unstructured":"Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, and Pang Wei Koh. Scaling retrieval-based language models with a trillion-token datastore. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 91260\u201391299. Curran Associates, Inc., 2024."},{"key":"e_1_3_2_1_78_1","volume-title":"A methodology for evaluating rag systems: A case study on configuration dependency validation","author":"Simon Sebastian","year":"2024","unstructured":"Sebastian Simon, Alina Mailach, Johannes Dorn, and Norbert Siegmund. A methodology for evaluating rag systems: A case study on configuration dependency validation, 2024."},{"key":"e_1_3_2_1_79_1","volume-title":"Agentic retrieval-augmented generation: A survey on agentic rag","author":"Singh Aditi","year":"2025","unstructured":"Aditi Singh, Abul Ehtesham, Saket Kumar, and Tala Talaei Khoei. Agentic retrieval-augmented generation: A survey on agentic rag, 2025."},{"key":"e_1_3_2_1_80_1","first-page":"606","volume-title":"Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles","author":"Song Yixin","year":"2024","unstructured":"Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. Powerinfer: Fast large language model serving with a consumer-grade gpu. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 590\u2013606, 2024."},{"key":"e_1_3_2_1_81_1","volume-title":"Preble: Efficient distributed prompt scheduling for llm serving","author":"Srivatsa Vikranth","year":"2024","unstructured":"Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, and Yiying Zhang. Preble: Efficient distributed prompt scheduling for llm serving. 
2024."},{"key":"e_1_3_2_1_82_1","volume-title":"Teola: Towards end-to-end optimization of llm-based applications","author":"Tan Xin","year":"2024","unstructured":"Xin Tan, Yimin Jiang, Yitao Yang, and Hong Xu. Teola: Towards end-to-end optimization of llm-based applications, 2024."},{"key":"e_1_3_2_1_83_1","volume-title":"Mba-rag: a bandit approach for adaptive retrieval-augmented generation through question complexity","author":"Tang Xiaqiang","year":"2024","unstructured":"Xiaqiang Tang, Qiang Gao, Jian Li, Nan Du, Qi Li, and Sihong Xie. Mba-rag: a bandit approach for adaptive retrieval-augmented generation through question complexity, 2024."},{"key":"e_1_3_2_1_84_1","doi-asserted-by":"crossref","first-page":"539","DOI":"10.1162\/tacl_a_00475","article-title":"Multihop questions via single-hop question composition","volume":"10","author":"Trivedi Harsh","year":"2022","unstructured":"Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539\u2013554, 2022.","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"e_1_3_2_1_85_1","first-page":"17736","volume-title":"Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing","author":"Wang Xiaohua","year":"2024","unstructured":"Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, and Xuanjing Huang. Searching for best practices in retrieval-augmented generation. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17716\u201317736, Miami, Florida, USA, November 2024. 
Association for Computational Linguistics."},{"key":"e_1_3_2_1_86_1","volume-title":"Model tells you where to merge: Adaptive kv cache merging for llms on long-context tasks","author":"Wang Zheng","year":"2024","unstructured":"Zheng Wang, Boxiao Jin, Zhongzhi Yu, and Minjia Zhang. Model tells you where to merge: Adaptive kv cache merging for llms on long-context tasks, 2024."},{"key":"e_1_3_2_1_87_1","volume-title":"Association for Computational Linguistics","author":"Wolf Thomas","year":"2020","unstructured":"Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-Art Natural Language Processing. pages 38\u201345. Association for Computational Linguistics, October 2020."},{"key":"e_1_3_2_1_88_1","first-page":"654","volume-title":"Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, SOSP '24","author":"Wu Bingyang","year":"2024","unstructured":"Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, SOSP '24, pages 640\u2013654, New York, NY, USA, 2024. Association for Computing Machinery."},{"key":"e_1_3_2_1_89_1","volume-title":"Weknow-rag: An adaptive approach for retrieval-augmented generation integrating web search and knowledge graphs","author":"Xie Weijian","year":"2024","unstructured":"Weijian Xie, Xuefeng Liang, Yuhui Liu, Kaihua Ni, Hong Cheng, and Zetian Hu. 
Weknow-rag: An adaptive approach for retrieval-augmented generation integrating web search and knowledge graphs, 2024."},{"key":"e_1_3_2_1_90_1","volume-title":"The Twelfth International Conference on Learning Representations","author":"Xiong Miao","year":"2024","unstructured":"Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs. In The Twelfth International Conference on Learning Representations, 2024."},{"key":"e_1_3_2_1_91_1","volume-title":"Cacheblend: Fast large language model serving for rag with cached knowledge fusion","author":"Yao Jiayi","year":"2024","unstructured":"Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. Cacheblend: Fast large language model serving for rag with cached knowledge fusion, 2024."},{"key":"e_1_3_2_1_92_1","volume-title":"Stateful large language model serving with pensieve. arXiv preprint arXiv:2312.05516","author":"Yu Lingfan","year":"2023","unstructured":"Lingfan Yu and Jinyang Li. Stateful large language model serving with pensieve. arXiv preprint arXiv:2312.05516, 2023."},{"key":"e_1_3_2_1_93_1","volume-title":"Pqcache: Product quantization-based kvcache for long context llm inference","author":"Zhang Hailin","year":"2024","unstructured":"Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, and Bin Cui. Pqcache: Product quantization-based kvcache for long context llm inference, 2024."},{"key":"e_1_3_2_1_94_1","first-page":"345","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Zhang Qizheng","year":"2024","unstructured":"Qizheng Zhang, Ali Imran, Enkeleda Bardhi, Tushar Swamy, Nathan Zhang, Muhammad Shahbaz, and Kunle Olukotun. Caravan: Practical online learning of In-Network ML models with labeling agents. 
In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 325\u2013345, Santa Clara, CA, July 2024. USENIX Association."},{"key":"e_1_3_2_1_95_1","volume-title":"Raft: Adapting language model to domain specific rag","author":"Zhang Tianjun","year":"2024","unstructured":"Tianjun Zhang, Shishir G. Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, and Joseph E. Gonzalez. Raft: Adapting language model to domain specific rag, 2024."},{"key":"e_1_3_2_1_96_1","volume-title":"Forty-first International Conference on Machine Learning.","author":"Zhang Zhihao","unstructured":"Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, and Zhihao Jia. Accelerating iterative retrieval-augmented language model serving with speculation. In Forty-first International Conference on Machine Learning."},{"key":"e_1_3_2_1_97_1","first-page":"294","volume-title":"Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)","author":"Zhao Yiyun","year":"2024","unstructured":"Yiyun Zhao, Prateek Singh, Hanoz Bhathena, Bernardo Ramos, Aviral Joshi, Swaroop Gadiyaram, and Saket Sharma. Optimizing LLM based retrieval augmented generation pipelines in the financial domain. In Yi Yang, Aida Davani, Avi Sil, and Anoop Kumar, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pages 279\u2013294, Mexico City, Mexico, June 2024. Association for Computational Linguistics."},{"key":"e_1_3_2_1_98_1","volume-title":"The Thirty-eighth Annual Conference on Neural Information Processing Systems","author":"Zheng Lianmin","year":"2024","unstructured":"Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. 
Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024."},{"key":"e_1_3_2_1_99_1","volume-title":"Dragomir Radev. QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization. In North American Association for Computational Linguistics (NAACL)","author":"Zhong Ming","year":"2021","unstructured":"Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, and Dragomir Radev. QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization. In North American Association for Computational Linguistics (NAACL), 2021."},{"key":"e_1_3_2_1_100_1","first-page":"210","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Zhong Yinmin","year":"2024","unstructured":"Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193\u2013210, Santa Clara, CA, July 2024. USENIX Association."},{"key":"e_1_3_2_1_101_1","volume-title":"In-depth analysis of graph-based rag in a unified framework","author":"Zhou Yingli","year":"2025","unstructured":"Yingli Zhou, Yaodong Su, Youran Sun, Shu Wang, Taotao Wang, Runyuan He, Yongwei Zhang, Sicong Liang, Xilin Liu, Yuchi Ma, and Yixiang Fang. In-depth analysis of graph-based rag in a unified framework, 2025."},{"key":"e_1_3_2_1_102_1","volume-title":"et al. Nanoflow: Towards optimal large language model serving throughput. arXiv preprint arXiv:2408.12757","author":"Zhu Kan","year":"2024","unstructured":"Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, et al. 
Nanoflow: Towards optimal large language model serving throughput. arXiv preprint arXiv:2408.12757, 2024."},{"key":"e_1_3_2_1_103_1","volume-title":"Rageval: Scenario specific rag evaluation dataset generation framework","author":"Zhu Kunlun","year":"2024","unstructured":"Kunlun Zhu, Yifan Luo, Dingling Xu, Ruobing Wang, Shi Yu, Shuo Wang, Yukun Yan, Zhenghao Liu, Xu Han, Zhiyuan Liu, and Maosong Sun. Rageval: Scenario specific rag evaluation dataset generation framework, 2024."}],"event":{"name":"SOSP '25: ACM SIGOPS 31st Symposium on Operating Systems Principles","location":"Lotte Hotel World Seoul Republic of Korea","acronym":"SOSP '25","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems","USENIX"]},"container-title":["Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles"],"original-title":[],"deposited":{"date-parts":[[2025,10,1]],"date-time":"2025-10-01T12:50:23Z","timestamp":1759323023000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3731569.3764855"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,12]]},"references-count":103,"alternative-id":["10.1145\/3731569.3764855","10.1145\/3731569"],"URL":"https:\/\/doi.org\/10.1145\/3731569.3764855","relation":{},"subject":[],"published":{"date-parts":[[2025,10,12]]},"assertion":[{"value":"2025-10-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}