{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T05:17:03Z","timestamp":1775539023447,"version":"3.50.1"},"reference-count":104,"publisher":"Association for Computing Machinery (ACM)","issue":"6","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62461146205, U2241212"],"award-info":[{"award-number":["62461146205, U2241212"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Distinguished Youth Foundation of Liaoning Province","award":["2024021148-JH3\/501"],"award-info":[{"award-number":["2024021148-JH3\/501"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2025,12,4]]},"abstract":"<jats:p>Graph-based Retrieval-Augmented Generation (GraphRAG) has emerged as a promising paradigm for enhancing LLM reliability by enabling multi-hop reasoning over graph-structured knowledge. However, existing LLMs struggle to efficiently process graph-structured inputs, as traditional attention mechanisms are sequence-based and introduce significant redundancy when serializing graphs into prompt sequences, leading to excessive computation and memory overhead. To address this, we introduce dependency attention, a novel graph-aware attention mechanism that restricts attention computation to token pairs with structural dependencies in the retrieved subgraph. Unlike standard self-attention that computes fully connected interactions, dependency attention prunes irrelevant token pairs and reuses computations along shared relational paths, substantially reducing inference overhead. Building on this idea, we develop DepCache, a KV cache management framework tailored for dependency attention. DepCache enables efficient KV cache reuse through (i) a graph-based KV cache reuse strategy that aligns KV caches across varying prompt contexts, enabling efficient cross-request reuse in GraphRAG, and (ii) a locality-aware replacement policy that leverages spatial and temporal access patterns to improve KV cache hit rate. Evaluations across diverse models and datasets show that DepCache improves LLM inference throughput by 1.5\u00d7-5.0\u00d7 and reduces time-to-first-token latency by up to 3.2\u00d7, without compromising generation accuracy.<\/jats:p>","DOI":"10.1145\/3769778","type":"journal-article","created":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T04:32:13Z","timestamp":1764995533000},"page":"1-29","source":"Crossref","is-referenced-by-count":0,"title":["DepCache: A KV Cache Management Framework for GraphRAG with Dependency Attention"],"prefix":"10.1145","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-6502-7696","authenticated-orcid":false,"given":"Hao","family":"Yuan","sequence":"first","affiliation":[{"name":"Northeastern University, Shenyang, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-0746-8222","authenticated-orcid":false,"given":"Xin","family":"Ai","sequence":"additional","affiliation":[{"name":"Northeastern University, Shenyang, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4847-6070","authenticated-orcid":false,"given":"Qiange","family":"Wang","sequence":"additional","affiliation":[{"name":"Northeastern University, Shenyang, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-0047-2576","authenticated-orcid":false,"given":"Peizheng","family":"Li","sequence":"additional","affiliation":[{"name":"Northeastern University, Shenyang, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-6651-5787","authenticated-orcid":false,"given":"Jiayang","family":"Yu","sequence":"additional","affiliation":[{"name":"Northeastern University, Shenyang, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-5518-9978","authenticated-orcid":false,"given":"Chaoyi","family":"Chen","sequence":"additional","affiliation":[{"name":"Northeastern University, Shenyang, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-3756-9273","authenticated-orcid":false,"given":"Xinbo","family":"Yang","sequence":"additional","affiliation":[{"name":"Northeastern University, Shenyang, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9871-0304","authenticated-orcid":false,"given":"Yanfeng","family":"Zhang","sequence":"additional","affiliation":[{"name":"Northeastern University, Shenyang, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-2317-9561","authenticated-orcid":false,"given":"Zhenbo","family":"Fu","sequence":"additional","affiliation":[{"name":"Northeastern University, Shenyang, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6659-1785","authenticated-orcid":false,"given":"Yingyou","family":"Wen","sequence":"additional","affiliation":[{"name":"Neusoft AI Magic Technology Research, Shenyang, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3171-8889","authenticated-orcid":false,"given":"Ge","family":"Yu","sequence":"additional","affiliation":[{"name":"Northeastern University, Shenyang, China"}]}],"member":"320","published-online":{"date-parts":[[2025,12,5]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al.","author":"Achiam Josh","year":"2023","unstructured":"Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al., 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)."},{"key":"e_1_2_1_2_1","volume-title":"Tong Yu, and Shiv Saini.","author":"Agarwal Shubham","year":"2025","unstructured":"Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, and Shiv Saini. 2025. Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation. arXiv preprint arXiv:2502.15734 (2025)."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.14778\/3705829.3705837"},{"key":"e_1_2_1_4_1","unstructured":"Jinze Bai Shuai Bai Yunfei Chu Zeyu Cui Kai Dang Xiaodong Deng Yang Fan Wenbin Ge Yu Han Fei Huang et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)."},{"key":"e_1_2_1_5_1","volume-title":"Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150","author":"Beltagy Iz","year":"2020","unstructured":"Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150 (2020)."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i16.29720"},{"key":"e_1_2_1_7_1","volume-title":"Scaling In-Context Demonstrations with Structured Attention. arXiv preprint arXiv:2307.02690","author":"Cai Tianle","year":"2023","unstructured":"Tianle Cai, Kaixuan Huang, Jason D Lee, and Mengdi Wang. 2023. Scaling In-Context Demonstrations with Structured Attention. arXiv preprint arXiv:2307.02690 (2023)."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i16.29728"},{"key":"e_1_2_1_9_1","unstructured":"Ludmila Cherkasova. 1998. Improving WWW proxies performance with greedy-dual-size-frequency caching policy. Hewlett-Packard Laboratories Palo Alto CA USA."},{"key":"e_1_2_1_10_1","volume-title":"Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509","author":"Child Rewon","year":"2019","unstructured":"Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 (2019)."},{"key":"e_1_2_1_11_1","volume-title":"Complex logical reasoning over knowledge graphs using large language models. arXiv preprint arXiv:2305.01157","author":"Choudhary Nurendra","year":"2023","unstructured":"Nurendra Choudhary and Chandan K Reddy. 2023. Complex logical reasoning over knowledge graphs using large language models. arXiv preprint arXiv:2305.01157 (2023)."},{"key":"e_1_2_1_12_1","unstructured":"DeepSpeed. 2020. https:\/\/github.com\/deepspeedai\/DeepSpeed."},{"key":"e_1_2_1_13_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3698801"},{"key":"e_1_2_1_15_1","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)."},{"key":"e_1_2_1_16_1","volume-title":"A Generalization of Transformer Networks to Graphs. arXiv preprint arXiv:2012.09699","author":"Dwivedi Vijay Prakash","year":"2020","unstructured":"Vijay Prakash Dwivedi and Xavier Bresson. 2020. A Generalization of Transformer Networks to Graphs. arXiv preprint arXiv:2012.09699 (2020)."},{"key":"e_1_2_1_17_1","volume-title":"From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130","author":"Edge Darren","year":"2024","unstructured":"Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. 2024. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130 (2024)."},{"key":"e_1_2_1_18_1","unstructured":"FalkorDB. 2023. https:\/\/github.com\/FalkorDB\/FalkorDB."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3637528.3671470"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.5555\/3691992.3691999"},{"key":"e_1_2_1_21_1","volume-title":"Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997","author":"Gao Yunfan","year":"2023","unstructured":"Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023)."},{"key":"e_1_2_1_22_1","first-page":"325","article-title":"Prompt cache: Modular attention reuse for low-latency inference","volume":"6","author":"Gim In","year":"2024","unstructured":"In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems, Vol. 6 (2024), 325-338.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_2_1_23_1","volume-title":"Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793","author":"Aohan Zeng Team GLM","year":"2024","unstructured":"Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al., 2024. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793 (2024)."},{"key":"e_1_2_1_24_1","unstructured":"Daya Guo Qihao Zhu Dejian Yang Zhenda Xie Kai Dong Wentao Zhang Guanting Chen Xiao Bi Yu Wu YK Li et al. 2024b. DeepSeek-Coder: When the Large Language Model Meets Programming-The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024)."},{"key":"e_1_2_1_25_1","first-page":"1315","volume-title":"Star-Transformer. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL '19)","author":"Guo Qipeng","year":"2019","unstructured":"Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. Star-Transformer. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL '19). 1315-1325."},{"key":"e_1_2_1_26_1","volume-title":"LightRAG: Simple and Fast Retrieval-Augmented Generation. arXiv preprint arXiv:2410.05779","author":"Guo Zirui","year":"2024","unstructured":"Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. 2024a. LightRAG: Simple and Fast Retrieval-Augmented Generation. arXiv preprint arXiv:2410.05779 (2024)."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.naacl-long.222"},{"key":"e_1_2_1_28_1","unstructured":"Haoyu Han Yu Wang Harry Shomer Kai Guo Jiayuan Ding Yongjia Lei Mahantesh Halappanavar Ryan A Rossi Subhabrata Mukherjee Xianfeng Tang et al. 2024b. Retrieval-augmented generation with graphs (graphrag). arXiv preprint arXiv:2501.00309 (2024)."},{"key":"e_1_2_1_29_1","first-page":"132876","article-title":"G-retriever: Retrieval-augmented generation for textual graph understanding and question answering","volume":"37","author":"He Xiaoxin","year":"2024","unstructured":"Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi. 2024. G-retriever: Retrieval-augmented generation for textual graph understanding and question answering. Advances in Neural Information Processing Systems, Vol. 37 (2024), 132876-132907.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_30_1","volume-title":"Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al.","author":"Jiang Albert Q","year":"2023","unstructured":"Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al., 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023)."},{"key":"e_1_2_1_31_1","volume-title":"Piperag: Fast retrieval-augmented generation via algorithm-system co-design. arXiv preprint arXiv:2403.05676","author":"Jiang Wenqi","year":"2024","unstructured":"Wenqi Jiang, Shuai Zhang, Boran Han, Jie Wang, Bernie Wang, and Tim Kraska. 2024. Piperag: Fast retrieval-augmented generation via algorithm-system co-design. arXiv preprint arXiv:2403.05676 (2024)."},{"key":"e_1_2_1_32_1","volume-title":"RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation. arXiv preprint arXiv:2404.12457","author":"Jin Chao","year":"2024","unstructured":"Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, and Xin Jin. 2024. RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation. arXiv preprint arXiv:2404.12457 (2024)."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132747.3132764"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1147"},{"key":"e_1_2_1_35_1","volume-title":"Shape: Shifted absolute position embedding for transformers. arXiv preprint arXiv:2109.05644","author":"Kiyono Shun","year":"2021","unstructured":"Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, and Kentaro Inui. 2021. Shape: Shifted absolute position embedding for transformers. arXiv preprint arXiv:2109.05644 (2021)."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.5555\/3495724.3496517"},{"key":"e_1_2_1_38_1","first-page":"387","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20)","author":"Li Jialin","year":"2020","unstructured":"Jialin Li, Jacob Nelson, Ellis Michael, Xin Jin, and Dan RK Ports. 2020. Pegasus: Tolerating skewed workloads in distributed storage with In-Network coherence directories. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20). 387-406."},{"key":"e_1_2_1_39_1","doi-asserted-by":"crossref","unstructured":"Peizheng Li Chaoyi Chen Hao Yuan Zhenbo Fu Hang Shen Xinbo Yang Qiange Wang Xin Ai Yanfeng Zhang Yingyou Wen and Ge Yu. 2025. NeutronRAG: Towards Understanding the Effectiveness of RAG from a Data Retrieval Perspective (SIGMOD '25). 163-166.","DOI":"10.1145\/3722212.3725119"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.14778\/3489496.3489512"},{"key":"e_1_2_1_41_1","volume-title":"KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation. arXiv preprint arXiv:2409.13731","author":"Liang Lei","year":"2024","unstructured":"Lei Liang, Mengshu Sun, Zhengke Gui, Zhongshu Zhu, Zhouyu Jiang, Ling Zhong, Yuan Qu, Peilong Zhao, Zhongpu Bo, Jin Yang, et al., 2024. KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation. arXiv preprint arXiv:2409.13731 (2024)."},{"key":"e_1_2_1_42_1","volume-title":"Rouge: A package for automatic evaluation of summaries","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Association for Computational Linguistics (ACL '04). 74-81."},{"key":"e_1_2_1_43_1","volume-title":"Dual Reasoning: A GNN-LLM Collaborative Framework for Knowledge Graph Question Answering. In The Second Conference on Parsimony and Learning (Proceedings Track) (CPAL '25)","author":"Liu Guangyi","year":"2025","unstructured":"Guangyi Liu, Yongqi Zhang, Yong Li, and Quanming Yao. 2025. Dual Reasoning: A GNN-LLM Collaborative Framework for Knowledge Graph Question Answering. In The Second Conference on Parsimony and Learning (Proceedings Track) (CPAL '25)."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3651890.3672274"},{"key":"e_1_2_1_45_1","first-page":"52342","article-title":"Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time","volume":"36","author":"Liu Zichang","year":"2023","unstructured":"Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. 2023. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems, Vol. 36 (2023), 52342-52364.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_46_1","first-page":"7067","volume-title":"International Conference on Machine Learning (PMLR '21)","author":"Liutkus Antoine","year":"2021","unstructured":"Antoine Liutkus, Ond\u0159ej Cifka, Shih-Lun Wu, Umut Simsekli, Yi-Hsuan Yang, and Gael Richard. 2021. Relative positional encoding for transformers with linear complexity. In International Conference on Machine Learning (PMLR '21). 7067-7079."},{"key":"e_1_2_1_47_1","unstructured":"Meta Llama. 2024. https:\/\/huggingface.co\/meta-llama\/Meta-Llama-3-8B-Instruct."},{"key":"e_1_2_1_48_1","unstructured":"Llama3-70B. 2024. https:\/\/huggingface.co\/meta-llama\/Meta-Llama-3-70B."},{"key":"e_1_2_1_49_1","unstructured":"Enzhe Lu Zhejun Jiang Jingyuan Liu Yulun Du Tao Jiang Chao Hong Shaowei Liu Weiran He Enming Yuan Yuzhi Wang et al. 2025. MoBA: Mixture of Block Attention for Long-Context LLMs. arXiv preprint arXiv:2502.13189 (2025)."},{"key":"e_1_2_1_50_1","volume-title":"TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text. arXiv preprint arXiv:2410.07590","author":"Lu Songshuo","year":"2024","unstructured":"Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, and Yaohua Tang. 2024. TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text. arXiv preprint arXiv:2410.07590 (2024)."},{"key":"e_1_2_1_51_1","volume-title":"Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning. In The Twelfth International Conference on Learning Representations (ICLR '24)","author":"Yuan-Fang Li LINHAO LUO","year":"2024","unstructured":"LINHAO LUO, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. 2024. Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning. In The Twelfth International Conference on Learning Representations (ICLR '24)."},{"key":"e_1_2_1_52_1","volume-title":"The Thirteenth International Conference on Learning Representations (ICLR '25)","author":"Ma Shengjie","year":"2025","unstructured":"Shengjie Ma, Chengjin Xu, Xuhui Jiang, Muzhi Li, Huaren Qu, Cehao Yang, Jiaxin Mao, and Jian Guo. 2025. Think-on-Graph 2.0: Deep and Faithful Large Language Model Reasoning with Knowledge-guided Retrieval Augmented Generation. In The Thirteenth International Conference on Learning Representations (ICLR '25)."},{"key":"e_1_2_1_53_1","volume-title":"Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation. In International Conference on Machine Learning (ICML '24)","author":"Merth Thomas","year":"2024","unstructured":"Thomas Merth, Qichen Fu, Mohammad Rastegari, and Mahyar Najibi. 2024. Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation. In International Conference on Machine Learning (ICML '24)."},{"key":"e_1_2_1_54_1","volume-title":"Large language models: A survey. arXiv","author":"Minaee Shervin","year":"2024","unstructured":"Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. 2024. Large language models: A survey. arXiv 2024. arXiv preprint arXiv:2402.06196 (2024)."},{"key":"e_1_2_1_55_1","unstructured":"OpenAI. 2024. https:\/\/openai.com\/blog\/chatgpt."},{"key":"e_1_2_1_56_1","volume-title":"Graph retrieval-augmented generation: A survey. arXiv preprint arXiv:2408.08921","author":"Peng Boci","year":"2024","unstructured":"Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, and Siliang Tang. 2024. Graph retrieval-augmented generation: A survey. arXiv preprint arXiv:2408.08921 (2024)."},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_2_1_58_1","volume-title":"Anh Tuan Luu, Guy Wolf, and Dominique Beaini.","author":"Ramp\u00e1sek Ladislav","year":"2022","unstructured":"Ladislav Ramp\u00e1sek, Michael Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. 2022. Recipe for a General, Powerful, Scalable Graph Transformer. In Advances in Neural Information Processing Systems 35 (NeurIPS '22)."},{"key":"e_1_2_1_59_1","unstructured":"Machel Reid Nikolay Savinov Denis Teplyashin Dmitry Lepikhin Timothy Lillicrap Jean-baptiste Alayrac Radu Soricut Angeliki Lazaridou Orhan Firat Julian Schrittwieser et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)."},{"key":"e_1_2_1_60_1","volume-title":"Kg-rag: Bridging the gap between knowledge and creativity. arXiv preprint arXiv:2405.12035","author":"Sanmartin Diego","year":"2024","unstructured":"Diego Sanmartin. 2024. Kg-rag: Bridging the gap between knowledge and creativity. arXiv preprint arXiv:2405.12035 (2024)."},{"key":"e_1_2_1_61_1","volume-title":"HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction. arXiv preprint arXiv:2408.04948","author":"Sarmah Bhaskarjit","year":"2024","unstructured":"Bhaskarjit Sarmah, Benika Hall, Rohan Rao, Sunil Patel, Stefano Pasquali, and Dhagash Mehta. 2024. HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction. arXiv preprint arXiv:2408.04948 (2024)."},{"key":"e_1_2_1_62_1","volume-title":"Graph transformers: A survey. arXiv preprint arXiv:2407.09777","author":"Shehzad Ahsan","year":"2024","unstructured":"Ahsan Shehzad, Feng Xia, Shagufta Abid, Ciyuan Peng, Shuo Yu, Dongyu Zhang, and Karin Verspoor. 2024. Graph transformers: A survey. arXiv preprint arXiv:2407.09777 (2024)."},{"key":"e_1_2_1_63_1","volume-title":"Exphormer: Sparse Transformers for Graphs. In International Conference on Machine Learning, ICML 2023","volume":"31632","author":"Shirzad Hamed","year":"2023","unstructured":"Hamed Shirzad, Ameya Velingker, Balaji Venkatachalam, Danica J. Sutherland, and Ali Kemal Sinop. 2023. Exphormer: Sparse Transformers for Graphs. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (PMLR '23, Vol. 202). 31613-31632."},{"key":"e_1_2_1_64_1","volume-title":"Preble: Efficient Distributed Prompt Scheduling for LLM Serving. In The Thirteenth International Conference on Learning Representations (ICLR '25)","author":"Srivatsa Vikranth","year":"2025","unstructured":"Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, and Yiying Zhang. 2025. Preble: Efficient Distributed Prompt Scheduling for LLM Serving. In The Thirteenth International Conference on Learning Representations (ICLR '25)."},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2023.127063"},{"key":"e_1_2_1_66_1","volume-title":"Block-Attention for Efficient RAG. arXiv preprint arXiv:2409.15355","author":"Sun East","year":"2024","unstructured":"East Sun, Yan Wang, and Lan Tian. 2024a. Block-Attention for Efficient RAG. arXiv preprint arXiv:2409.15355 (2024)."},{"key":"e_1_2_1_67_1","volume-title":"Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. In The Twelfth International Conference on Learning Representations (ICLR '24)","author":"Sun Jiashuo","year":"2024","unstructured":"Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel Ni, Heung-Yeung Shum, and Jian Guo. 2024b. Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. In The Twelfth International Conference on Learning Representations (ICLR '24)."},{"key":"e_1_2_1_68_1","unstructured":"Yixuan Tang and Yi Yang. 2024. MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries. arXiv:2401.15391 [cs.CL]"},{"key":"e_1_2_1_69_1","unstructured":"Gemini Team Rohan Anil Sebastian Borgeaud Yonghui Wu Jean-Baptiste Alayrac Jiahui Yu Radu Soricut Johan Schalkwyk Andrew M Dai Anja Hauth et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)."},{"key":"e_1_2_1_70_1","unstructured":"Qwen Team. 2024. https:\/\/huggingface.co\/Qwen\/Qwen2.5-32B-Instruct."},{"key":"e_1_2_1_71_1","volume-title":"Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al., 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)."},{"key":"e_1_2_1_72_1","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)."},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.557"},{"key":"e_1_2_1_74_1","first-page":"5998","volume-title":"Proceedings of Advances in Neural Information Processing Systems (NeurIPS '17)","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS '17). 5998-6008."},{"key":"e_1_2_1_75_1","volume-title":"Encoding word order in complex embeddings. arXiv preprint arXiv:1912.12333","author":"Wang Benyou","year":"2019","unstructured":"Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. 2019. Encoding word order in complex embeddings. arXiv preprint arXiv:1912.12333 (2019)."},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626733"},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3526134"},{"key":"e_1_2_1_78_1","volume-title":"The Thirteenth International Conference on Learning Representations (ICLR '25)","author":"Wang Song","year":"2025","unstructured":"Song Wang, Junhong Lin, Xiaojie Guo, Julian Shun, Jundong Li, and Yada Zhu. 2025. Reasoning of Large Language Models over Knowledge Graphs with Super-Relations. In The Thirteenth International Conference on Learning Representations (ICLR '25)."},{"key":"e_1_2_1_79_1","volume-title":"ICCBR 2024, Merida, Mexico, July 1-4, 2024, Proceedings (ICCBR '24","volume":"460","author":"Wiratunga Nirmalie","year":"2024","unstructured":"Nirmalie Wiratunga, Ramitha Abeyratne, Lasal Jayawardena, Kyle Martin, Stewart Massie, Ikechukwu Nkisi-Orji, Ruvan Weerasinghe, Anne Liret, and Bruno Fleisch. 2024. CBR-RAG: Case-Based Reasoning for Retrieval Augmented Generation in LLMs for Legal Question Answering. In Case-Based Reasoning Research and Development - 32nd International Conference, ICCBR 2024, Merida, Mexico, July 1-4, 2024, Proceedings (ICCBR '24, Vol. 14775). 445-460."},{"key":"e_1_2_1_80_1","volume-title":"Efficient Streaming Language Models with Attention Sinks. In The Twelfth International Conference on Learning Representations (ICLR '24)","author":"Xiao Guangxuan","year":"2024","unstructured":"Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient Streaming Language Models with Attention Sinks. In The Twelfth International Conference on Learning Representations (ICLR '24)."},{"key":"e_1_2_1_81_1","volume-title":"C-pack: packaged resources to advance general Chinese embedding","author":"Xiao Shitao","year":"2023","unstructured":"Shitao Xiao, Zheng Liu, Peitian Zhang, and N Muennighof. 2023. C-pack: packaged resources to advance general Chinese embedding. 2023. arXiv preprint arXiv:2309.07597 (2023)."},{"key":"e_1_2_1_82_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-short.68"},{"key":"e_1_2_1_83_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3661370"},{"key":"e_1_2_1_84_1","volume-title":"CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion. arXiv preprint arXiv:2405.16444","author":"Yao Jiayi","year":"2024","unstructured":"Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2024. CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion. arXiv preprint arXiv:2405.16444 (2024)."},{"key":"e_1_2_1_85_1","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS '23)","author":"Yao Shunyu","year":"2023","unstructured":"Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: deliberate problem solving with large language models. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS '23)."},{"key":"e_1_2_1_86_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.623"},{"key":"e_1_2_1_87_1","volume-title":"Flashinfer: Efficient and customizable attention engine for llm inference serving. arXiv preprint arXiv:2501.01005","author":"Ye Zihao","year":"2025","unstructured":"Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al., 2025. Flashinfer: Efficient and customizable attention engine for llm inference serving. arXiv preprint arXiv:2501.01005 (2025)."},{"key":"e_1_2_1_88_1","volume-title":"Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652","author":"Young Alex","year":"2024","unstructured":"Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al., 2024. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652 (2024)."},{"key":"e_1_2_1_89_1","volume-title":"Liang Wang, Tingyang Xu, Wenbing Huang, Deli Zhao, Hong Cheng, and Yu Rong.","author":"Yuan Chaohao","year":"2025","unstructured":"Chaohao Yuan, Kangfei Zhao, Ercan Engin Kuruoglu, Liang Wang, Tingyang Xu, Wenbing Huang, Deli Zhao, Hong Cheng, and Yu Rong. 2025b. A survey of graph transformers: Architectures, theories and applications. arXiv preprint arXiv:2502.16533 (2025)."},{"key":"e_1_2_1_90_1","doi-asserted-by":"publisher","DOI":"10.14778\/3648160.3648167"},{"key":"e_1_2_1_91_1","doi-asserted-by":"crossref","unstructured":"Jingyang Yuan Huazuo Gao Damai Dai Junyu Luo Liang Zhao Zhengyan Zhang Zhenda Xie YX Wei Lean Wang Zhiping Xiao et al. 2025a. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. arXiv preprint arXiv:2502.11089 (2025).","DOI":"10.18653\/v1\/2025.acl-long.1126"},{"key":"e_1_2_1_92_1","volume-title":"Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al.","author":"Zaheer Manzil","year":"2020","unstructured":"Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al., 2020. Big bird: Transformers for longer sequences. Advances in neural information processing systems, Vol. 33 (2020), 17283-17297."},{"key":"e_1_2_1_93_1","doi-asserted-by":"publisher","DOI":"10.1145\/3604237.3626866"},{"key":"e_1_2_1_94_1","doi-asserted-by":"publisher","DOI":"10.1145\/3725338"},{"key":"e_1_2_1_95_1","first-page":"1","volume-title":"TorchGT: A Holistic System for Large-Scale Graph Transformer Training. In SC24: International Conference for High Performance Computing, Networking, Storage and Analysis (SC '24)","author":"Zhang Meng","year":"2024","unstructured":"Meng Zhang, Jie Sun, Qinghao Hu, Peng Sun, Zeke Wang, Yonggang Wen, and Tianwei Zhang. 2024. TorchGT: A Holistic System for Large-Scale Graph Transformer Training. In SC24: International Conference for High Performance Computing, Networking, Storage and Analysis (SC '24). 1-17."},{"key":"e_1_2_1_96_1","volume-title":"A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models. arXiv preprint arXiv:2501.13958","author":"Zhang Qinggang","year":"2025","unstructured":"Qinggang Zhang, Shengyuan Chen, Yuanchen Bei, Zheng Yuan, Huachi Zhou, Zijin Hong, Junnan Dong, Hao Chen, Yi Chang, and Xiao Huang. 2025a. A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models. arXiv preprint arXiv:2501.13958 (2025)."},{"key":"e_1_2_1_97_1","volume-title":"ERNIE: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129","author":"Zhang Zhengyan","year":"2019","unstructured":"Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129 (2019)."},{"key":"e_1_2_1_98_1","first-page":"34661","article-title":"H2o: Heavy-hitter oracle for efficient generative inference of large language models","volume":"36","author":"Zhang Zhenyu","year":"2023","unstructured":"Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher R\u00e9, Clark Barrett, et al., 2023a. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, Vol. 36 (2023), 34661-34710.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_99_1","volume-title":"Retrieval-augmented generation for ai-generated content: A survey. arXiv preprint arXiv:2402.19473","author":"Zhao Penghao","year":"2024","unstructured":"Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, and Bin Cui. 2024. Retrieval-augmented generation for ai-generated content: A survey. arXiv preprint arXiv:2402.19473 (2024)."},{"key":"e_1_2_1_100_1","unstructured":"Wayne Xin Zhao Kun Zhou Junyi Li Tianyi Tang Xiaolei Wang Yupeng Hou Yingqian Min Beichen Zhang Junjie Zhang Zican Dong et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023)."},{"key":"e_1_2_1_101_1","first-page":"62557","article-title":"Sglang: Efficient execution of structured language model programs","volume":"37","author":"Zheng Lianmin","year":"2024","unstructured":"Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al., 2024b. Sglang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems, Vol. 37 (2024), 62557-62583.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_102_1","volume-title":"BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching. arXiv preprint arXiv:2412.03594","author":"Zheng Zhen","year":"2024","unstructured":"Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, and Gang Peng. 2024a. BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching. arXiv preprint arXiv:2412.03594 (2024)."},{"key":"e_1_2_1_103_1","unstructured":"Zixuan Zhou Xuefei Ning Ke Hong Tianyu Fu Jiaming Xu Shiyao Li Yuming Lou Luning Wang Zhihang Yuan Xiuhong Li et al. 2024. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294 (2024)."},{"key":"e_1_2_1_104_1","volume-title":"Rageval: Scenario specific rag evaluation dataset generation framework. arXiv preprint arXiv:2408.01262","author":"Zhu Kunlun","year":"2024","unstructured":"Kunlun Zhu, Yifan Luo, Dingling Xu, Ruobing Wang, Shi Yu, Shuo Wang, Yukun Yan, Zhenghao Liu, Xu Han, Zhiyuan Liu, et al., 2024. Rageval: Scenario specific rag evaluation dataset generation framework. arXiv preprint arXiv:2408.01262 (2024)."}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3769778","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T04:37:31Z","timestamp":1775536651000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3769778"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,4]]},"references-count":104,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2025,12,4]]}},"alternative-id":["10.1145\/3769778"],"URL":"https:\/\/doi.org\/10.1145\/3769778","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,4]]}}}