{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,2]],"date-time":"2026-04-02T18:56:43Z","timestamp":1775156203117,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":57,"publisher":"ACM","funder":[{"name":"Swiss National Science Foundation","award":["10001796"],"award-info":[{"award-number":["10001796"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,12,15]]},"DOI":"10.1145\/3721462.3770776","type":"proceedings-article","created":{"date-parts":[[2025,12,8]],"date-time":"2025-12-08T19:56:49Z","timestamp":1765223809000},"page":"340-353","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Leveraging Approximate Caching for Faster Retrieval-Augmented Generation"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3484-1452","authenticated-orcid":false,"given":"Shai","family":"Bergman","sequence":"first","affiliation":[{"name":"Huawei, Zurich, Switzerland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8187-724X","authenticated-orcid":false,"given":"Anne-Marie","family":"Kermarrec","sequence":"additional","affiliation":[{"name":"EPFL, Lausanne, Switzerland"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-2229-235X","authenticated-orcid":false,"given":"Diana","family":"Petrescu","sequence":"additional","affiliation":[{"name":"EPFL, Lausanne, Switzerland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7826-1599","authenticated-orcid":false,"given":"Rafael","family":"Pires","sequence":"additional","affiliation":[{"name":"EPFL, Lausanne, Switzerland"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-6844-4695","authenticated-orcid":false,"given":"Mathis","family":"Randl","sequence":"additional","affiliation":[{"name":"EPFL, Lausanne, 
Switzerland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4157-4847","authenticated-orcid":false,"given":"Martijn","family":"de Vos","sequence":"additional","affiliation":[{"name":"EPFL, Lausanne, Switzerland"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-6184-7279","authenticated-orcid":false,"given":"Ji","family":"Zhang","sequence":"additional","affiliation":[{"name":"Huawei, Zurich, Switzerland"}]}],"member":"320","published-online":{"date-parts":[[2025,12,14]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00667"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/88636.87848"},{"key":"e_1_3_2_1_3_1","volume-title":"Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval","author":"Baeza-Yates Ricardo","year":"2007","unstructured":"Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vassilis Plachouras, and Fabrizio Silvestri. 2007. The impact of caching on search engines. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (Amsterdam, The Netherlands) (SIGIR '07). 10.1145\/1277741.1277775"},{"key":"e_1_3_2_1_4_1","volume-title":"International conference on machine learning (ICML '22)","author":"Borgeaud Sebastian","year":"2022","unstructured":"Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International conference on machine learning (ICML '22). PMLR. https:\/\/proceedings.mlr.press\/v162\/borgeaud22a\/borgeaud22a.pdf"},{"key":"e_1_3_2_1_5_1","unstructured":"Brian J Chan Chao-Ting Chen Jui-Hung Cheng and Hen-Hsen Huang. 2024. Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks. (2024). 
arXiv:2412.15605"},{"key":"e_1_3_2_1_6_1","volume-title":"Proceedings of the Thiry-Fourth Annual ACM Symposium on Theory of Computing (Montreal","author":"Charikar Moses S.","year":"2002","unstructured":"Moses S. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thiry-Fourth Annual ACM Symposium on Theory of Computing (Montreal, Quebec, Canada) (STOC '02). 10.1145\/509907.509965"},{"key":"e_1_3_2_1_7_1","volume-title":"Spann: Highly-efficient billion-scale approximate nearest neighborhood search. Advances in Neural Information Processing Systems 34","author":"Chen Qi","year":"2021","unstructured":"Qi Chen, Bing Zhao, Haidong Wang, Mingqin Li, Chuanjie Liu, Zengzhong Li, Mao Yang, and Jingdong Wang. 2021. Spann: Highly-efficient billion-scale approximate nearest neighborhood search. Advances in Neural Information Processing Systems 34 (2021). https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/SPANN_finalversion1.pdf"},{"key":"e_1_3_2_1_8_1","volume-title":"Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems","author":"Chierichetti Flavio","year":"2009","unstructured":"Flavio Chierichetti, Ravi Kumar, and Sergei Vassilvitskii. 2009. Similarity caching. In Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (Providence, Rhode Island, USA) (PODS '09). 10.1145\/1559795.1559815"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-981-97-9255-9_26"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"crossref","unstructured":"Matthijs Douze Alexandr Guzhva Chengqi Deng Jeff Johnson Gergely Szilvasy Pierre-Emmanuel Mazar\u00e9 Maria Lomeli Lucas Hosseini and Herv\u00e9 J\u00e9gou. 2024. The faiss library. (2024). 
arXiv:2401.08281","DOI":"10.1109\/TBDATA.2025.3618474"},{"key":"e_1_3_2_1_11_1","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et al. 2024. The llama 3 herd of models. (2024). arXiv:2407.21783"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","unstructured":"Fabrizio Falchi Claudio Lucchese Salvatore Orlando Raffaele Perego and Fausto Rabitti. 2008. A metric cache for similarity search (LSDS-IR '08). 10.1145\/1458469.1458473","DOI":"10.1145\/1458469.1458473"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2010.12.006"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3578519"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2311.04934"},{"key":"e_1_3_2_1_16_1","unstructured":"Google. 2025. AI Overviews in Search. https:\/\/search.google\/ways-to-search\/ai-overviews Accessed: 2025-09-18."},{"key":"e_1_3_2_1_17_1","unstructured":"Dan Hendrycks Collin Burns Steven Basart Andy Zou Mantas Mazeika Dawn Song and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. (2020). arXiv:2009.03300"},{"key":"e_1_3_2_1_18_1","volume-title":"EPIC: Efficient Position-Independent Caching for Serving Large Language Models.","author":"Hu Junhao","year":"2024","unstructured":"Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, and Tao Xie. 2024. EPIC: Efficient Position-Independent Caching for Serving Large Language Models. (2024). arXiv:2410.15332"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3703155"},{"key":"e_1_3_2_1_20_1","volume-title":"Ravishankar Krishnawamy, and Rohan Kadekodi.","author":"Subramanya Suhas Jayaram","year":"2019","unstructured":"Suhas Jayaram Subramanya, Fnu Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnawamy, and Rohan Kadekodi. 2019. 
Diskann: Fast accurate billion-point nearest neighbor search on a single node. Advances in Neural Information Processing Systems 32 (2019). https:\/\/papers.nips.cc\/paper_files\/paper\/2019\/file\/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf"},{"key":"e_1_3_2_1_21_1","first-page":"123","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2023","author":"Ji Ziwei","year":"2023","unstructured":"Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. 2023. Towards mitigating LLM hallucination via self reflection. In Findings of the Association for Computational Linguistics: EMNLP 2023. arXiv:2310.06271 10.18653\/v1\/2023.findings-emnlp.123"},{"key":"e_1_3_2_1_22_1","volume-title":"Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA '25)","author":"Jiang Wenqi","year":"2025","unstructured":"Wenqi Jiang, Suvinay Subramanian, Cat Graves, Gustavo Alonso, Amir Yazdanbakhsh, and Vidushi Dadu. 2025. RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving. In Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA '25). Association for Computing Machinery, New York, NY, USA, 974\u2013989. 10.1145\/3695053.3731093"},{"key":"e_1_3_2_1_23_1","volume-title":"Piperag: Fast retrieval-augmented generation via algorithm-system codesign.","author":"Jiang Wenqi","year":"2024","unstructured":"Wenqi Jiang, Shuai Zhang, Boran Han, Jie Wang, Bernie Wang, and Tim Kraska. 2024. Piperag: Fast retrieval-augmented generation via algorithm-system codesign. (2024). arXiv:2403.05676"},{"key":"e_1_3_2_1_24_1","unstructured":"Chao Jin Zili Zhang Xuanlin Jiang Fangyue Liu Xin Liu Xuanzhe Liu and Xin Jin. 2024. RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation. (2024). 
arXiv:2404.12457"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btad651"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2010.57"},{"key":"e_1_3_2_1_27_1","unstructured":"Jared Kaplan Sam McCandlish Tom Henighan Tom B Brown Benjamin Chess Rewon Child Scott Gray Alec Radford Jeffrey Wu and Dario Amodei. 2020. Scaling laws for neural language models. (2020). arXiv:2001.08361"},{"key":"e_1_3_2_1_28_1","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 10","author":"Karpukhin Vladimir","year":"2020","unstructured":"Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 10.18653\/v1\/2020.emnlp-main.550"},{"key":"e_1_3_2_1_29_1","volume-title":"The 2024 ACM Conference on Fairness, Accountability, and Transparency. arXiv:2405","author":"Lee Yoonjoo","year":"2024","unstructured":"Yoonjoo Lee, Kihoon Son, Tae Soo Kim, Jisu Kim, John Joon Young Chung, Eytan Adar, and Juho Kim. 2024. One vs. Many: Comprehending Accurate Information from Multiple Erroneous and Inconsistent AI Generations. In The 2024 ACM Conference on Fairness, Accountability, and Transparency. arXiv:2405.05581 10.1145\/3630106.3662681"},{"key":"e_1_3_2_1_30_1","unstructured":"Patrick Lewis Ethan Perez Aleksandra Piktus Fabio Petroni Vladimir Karpukhin Naman Goyal Heinrich K\u00fcttler Mike Lewis Wen-tau Yih Tim Rockt\u00e4schel et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33 (2020). 
arXiv:2005.11401 https:\/\/proceedings.neurips.cc\/paper\/2020\/file\/6b493230205f780e1bc26945df7481e5-Paper.pdf"},{"key":"e_1_3_2_1_31_1","unstructured":"Songshuo Lu Hua Wang Yutian Rong Zhi Chen and Yaohua Tang. 2024. TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text. (2024). arXiv:2410.07590"},{"key":"e_1_3_2_1_32_1","volume-title":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 3318\u20133322","author":"Macdonald Craig","year":"2021","unstructured":"Craig Macdonald and Nicola Tonellotto. 2021. On approximate nearest neighbour selection for multi-stage dense retrieval. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 3318\u20133322. 10.1145\/3459637.3482156"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2889473"},{"key":"e_1_3_2_1_34_1","article-title":"Using the Turning Research Into Practice (TRIP) database: how do clinicians really search","volume":"95","author":"Meats Emma","year":"2007","unstructured":"Emma Meats, Jon Brassey, Carl Heneghan, and Paul Glasziou. 2007. Using the Turning Research Into Practice (TRIP) database: how do clinicians really search? Journal of the Medical Library Association 95, 2 (2007). pubmed:17443248 https:\/\/pmc.ncbi.nlm.nih.gov\/articles\/PMC1852632\/","journal-title":"Journal of the Medical Library Association"},{"key":"e_1_3_2_1_35_1","unstructured":"Microsoft. 2024. Copilot Search. https:\/\/www.microsoft.com\/en-us\/bing\/copilot-search Accessed: 2025-09-18."},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2024.3489620"},{"key":"e_1_3_2_1_37_1","unstructured":"OpenAI. 2024. ChatGPT Search. 
https:\/\/openai.com\/chatgpt\/search Accessed: 2025-09-18."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-024-00864-x"},{"key":"e_1_3_2_1_39_1","volume-title":"Proceedings of the 18th International Conference on World Wide Web","author":"Pandey Sandeep","year":"2009","unstructured":"Sandeep Pandey, Andrei Broder, Flavio Chierichetti, Vanja Josifovski, Ravi Kumar, and Sergei Vassilvitskii. 2009. Nearest-neighbor caching for content-match applications. In Proceedings of the 18th International Conference on World Wide Web (Madrid, Spain) (WWW '09). 10.1145\/1526709.1526769"},{"key":"e_1_3_2_1_40_1","volume-title":"Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"1","author":"Quinn Derrick","year":"2025","unstructured":"Derrick Quinn, Mohammad Nouri, Neel Patel, John Salihu, Alireza Salemi, Sukhan Lee, Hamed Zamani, and Mohammad Alian. 2025. Accelerating Retrieval-Augmented Generation. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (Rotterdam, Netherlands) (ASPLOS '25). 10.1145\/3669940.3707264"},{"key":"e_1_3_2_1_41_1","volume-title":"Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"1","author":"Quinn Derrick","year":"2025","unstructured":"Derrick Quinn, Mohammad Nouri, Neel Patel, John Salihu, Alireza Salemi, Sukhan Lee, Hamed Zamani, and Mohammad Alian. 2025. Accelerating Retrieval-Augmented Generation. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (Rotterdam, Netherlands) (ASPLOS '25). Association for Computing Machinery, New York, NY, USA, 15\u201332. 
10.1145\/3669940.3707264"},{"key":"e_1_3_2_1_42_1","volume-title":"METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation.","author":"Ray Siddhant","year":"2024","unstructured":"Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Ganesh Ananthanarayanan, Ravi Netravali, and Junchen Jiang. 2024. METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation. (2024). arXiv:2412.10543"},{"key":"e_1_3_2_1_43_1","unstructured":"Sajal Regmi and Chetan Phakami Pun. 2024. GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching. (2024). arXiv:2411.05276"},{"key":"e_1_3_2_1_44_1","volume-title":"Proceedings of the 17th ACM International Conference on Web Search and Data Mining. 1180\u20131181","author":"Roychowdhury Sohini","year":"2024","unstructured":"Sohini Roychowdhury. 2024. Journey of hallucination-minimized generative ai solutions for financial decision makers. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining. 1180\u20131181. 10.1145\/3616855.3635737"},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNET.2022.3187044"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1017\/nlp.2024.53"},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSAC.2018.2844983"},{"key":"e_1_3_2_1_48_1","unstructured":"Michael Shen Muhammad Umar Kiwan Maeng G Edward Suh and Udit Gupta. 2024. Towards Understanding Systems Trade-offs in Retrieval-Augmented Generation Model Inference. (2024). arXiv:2412.11854"},{"key":"e_1_3_2_1_49_1","unstructured":"Jonathon Shlens. 2014. A Tutorial on Principal Component Analysis. 
arXiv:1404.1100"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1002\/1097-4571(2000)9999:9999"},{"key":"e_1_3_2_1_51_1","volume-title":"Preble: Efficient distributed prompt scheduling for llm serving.","author":"Srivatsa Vikranth","year":"2024","unstructured":"Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, and Yiying Zhang. 2024. Preble: Efficient distributed prompt scheduling for llm serving. (2024). arXiv:2407.00023"},{"key":"e_1_3_2_1_52_1","volume-title":"Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. 10","author":"Teevan Jaime","year":"2007","unstructured":"Jaime Teevan, Eytan Adar, Rosie Jones, and Michael AS Potts. 2007. Information re-retrieval: Repeat queries in Yahoo's logs. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. 10.1145\/1277741.1277770"},{"key":"e_1_3_2_1_53_1","article-title":"Visualizing data using t-SNE","volume":"9","author":"der Maaten Laurens Van","year":"2008","unstructured":"Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008). https:\/\/www.jmlr.org\/papers\/volume9\/vandermaaten08a\/vandermaaten08a.pdf","journal-title":"Journal of machine learning research"},{"key":"e_1_3_2_1_54_1","volume-title":"Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang, et al.","author":"Wang Zilong","year":"2025","unstructured":"Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang, et al. 2025. Speculative RAG: Enhancing retrieval augmented generation through drafting. (2025). arXiv:2407.08223 https:\/\/openreview.net\/pdf?id=xgQfWbV6Ey"},{"key":"e_1_3_2_1_55_1","volume-title":"Benchmarking Retrieval-Augmented Generation for Medicine. 
In Findings of the Association for Computational Linguistics ACL 2024","author":"Xiong Guangzhi","year":"2024","unstructured":"Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. 2024. Benchmarking Retrieval-Augmented Generation for Medicine. In Findings of the Association for Computational Linguistics ACL 2024. 10.18653\/v1\/2024.findings-acl.372"},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-024-07930-y"},{"key":"e_1_3_2_1_57_1","unstructured":"Yun Zhu Jia-Chen Gu Caitlin Sikora Ho Ko Yinxiao Liu Chu-Cheng Lin Lei Shu Liangchen Luo Lei Meng Bang Liu et al. 2024. Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection. (2024). arXiv:2405.16178"}],"event":{"name":"MIDDLEWARE '25: 26th International Middleware Conference","location":"Vanderbilt University Nashville TN USA","acronym":"MIDDLEWARE '25","sponsor":["IFIP","Usenix"]},"container-title":["Proceedings of the 26th International Middleware Conference"],"original-title":[],"deposited":{"date-parts":[[2025,12,8]],"date-time":"2025-12-08T20:00:44Z","timestamp":1765224044000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3721462.3770776"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,14]]},"references-count":57,"alternative-id":["10.1145\/3721462.3770776","10.1145\/3721462"],"URL":"https:\/\/doi.org\/10.1145\/3721462.3770776","relation":{},"subject":[],"published":{"date-parts":[[2025,12,14]]},"assertion":[{"value":"2025-12-14","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}