{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T00:39:46Z","timestamp":1769560786775,"version":"3.49.0"},"reference-count":54,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2026,1,27]],"date-time":"2026-01-27T00:00:00Z","timestamp":1769472000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Center for Equitable Artificial Intelligence and Machine Learning Systems"},{"name":"Safety and Mobility Advancements Regional Transportation and Economics Research (SMARTER) Center"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>Large Language Model (LLM) agents depend heavily on multiple external tools such as APIs, databases and computational services to perform complex tasks. However, these tool executions create latency and introduce costs, particularly when agents handle similar queries or workflows. Most current caching methods focus on LLM prompt\u2013response pairs or execution plans and overlook redundancies at the tool level. To address this, we designed a multi-level caching architecture that captures redundancy at both the workflow and tool level. The proposed system integrates four key components: (1) hierarchical caching that operates at both the workflow and tool level to capture coarse and fine-grained redundancies; (2) dependency-aware invalidation using graph-based techniques to maintain consistency when write operations affect cached reads across execution contexts; (3) category-specific time-to-live (TTL) policies tailored to different data types, e.g., weather APIs, user location, database queries and filesystem and computational tasks; and (4) session isolation to ensure multi-tenant cache safety through automatic session scoping. We evaluated the system using synthetic data with 2.25 million queries across ten configurations in fifteen runs. In addition, we conducted four targeted evaluations\u2014write intensity robustness from 4 to 30% writes, personalized memory effects under isolated vs. shared cache modes, workflow-level caching comparison and workload sensitivity across five access distributions\u2014on an additional 2.565 million queries, bringing the total experimental scope to 4.815 million executed queries. The architecture achieved 76.5% caching efficiency, reducing query processing time by 13.3\u00d7 and lowering estimated costs by 73.3% compared to a no-cache baseline. Multi-tenant testing with fifteen concurrent tenants confirmed robust session isolation and 74.1% efficiency under concurrent workloads. Our evaluation used controlled synthetic workloads following Zipfian distributions, which are commonly used in caching research. While absolute hit rates vary by deployment domain, the architectural principles of hierarchical caching, dependency tracking and session isolation remain broadly applicable.<\/jats:p>","DOI":"10.3390\/make8020030","type":"journal-article","created":{"date-parts":[[2026,1,27]],"date-time":"2026-01-27T11:35:42Z","timestamp":1769513742000},"page":"30","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Hierarchical Caching for Agentic Workflows: A Multi-Level Architecture to Reduce Tool Execution Overhead"],"prefix":"10.3390","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-0897-3116","authenticated-orcid":false,"given":"Farhana","family":"Begum","sequence":"first","affiliation":[{"name":"Electrical and Computer Engineering Department, Morgan State University, Baltimore, MD 21251, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Craig","family":"Scott","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering Department, Morgan State University, Baltimore, MD 21251, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kofi","family":"Nyarko","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering Department, Morgan State University, Baltimore, MD 21251, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8052-6931","authenticated-orcid":false,"given":"Mansoureh","family":"Jeihani","sequence":"additional","affiliation":[{"name":"Transportation Engineering Department, Morgan State University, Baltimore, MD 21251, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3318-2851","authenticated-orcid":false,"given":"Fahmi","family":"Khalifa","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering Department, Morgan State University, Baltimore, MD 21251, USA"},{"name":"Electronics and Communications Engineering Department, Mansoura University, Mansoura 35516, Egypt"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2026,1,27]]},"reference":[{"key":"ref_1","unstructured":"Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2023, January 1\u20135). ReAct: Synergizing reasoning and acting in language models. Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda."},{"key":"ref_2","unstructured":"Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., and Zhang, S. (2023). AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv."},{"key":"ref_3","unstructured":"LangChain (2026, January 10). LangChain: Building Applications with LLMs Through Composability. Available online: https:\/\/www.blog.langchain.com\/author\/langchain\/."},{"key":"ref_4","unstructured":"Langfuse (2025, December 18). AI Agent Observability with Langfuse. Langfuse Blog. Available online: https:\/\/langfuse.com\/blog\/2024-07-ai-agent-observability-with-langfuse."},{"key":"ref_5","unstructured":"De Backer, K. (2025, December 18). Common Solutions to Latency Issues in LLM Applications. Medium. Available online: https:\/\/medium.com\/@mancity.kevindb\/common-solutions-to-latency-issues-in-llm-applications-d58b8cf4be17."},{"key":"ref_6","unstructured":"Datadog (2025, December 18). Monitor Your OpenAI LLM Spend with Cost Insights from Datadog. Datadog Blog. Available online: https:\/\/www.datadoghq.com\/blog\/monitor-openai-cost-datadog-cloud-cost-management-llm-observability\/."},{"key":"ref_7","unstructured":"Li, H., Li, Y., Tian, A., Tang, T., Xu, Z., Chen, X., Hu, N., Dong, W., Li, Q., and Chen, L. (2025). A survey on large language model acceleration based on KV cache management. arXiv."},{"key":"ref_8","unstructured":"Frantar, E., and Alistarh, D. (2023, January 23\u201329). SparseGPT: Massive language models can be accurately pruned in one-shot. Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA."},{"key":"ref_9","unstructured":"Leviathan, Y., Kalman, M., and Matias, Y. (2023, January 23\u201329). Fast inference from transformers via speculative decoding. Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Wang, L., Xu, W., Lan, Y., Hu, Z., Lan, Y., Lee, R.K.W., and Lim, E.P. (2023, January 9\u201314). Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, ON, Canada.","DOI":"10.18653\/v1\/2023.acl-long.147"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Bang, F., and Feng, D. (2023, January 6). GPTCache: An open-source semantic cache for LLM applications enabling faster answers and cost savings. Proceedings of the Workshop for Natural Language Processing Open Source Software (NLP-OSS), Singapore.","DOI":"10.18653\/v1\/2023.nlposs-1.24"},{"key":"ref_12","unstructured":"Regmi, S., and Pun, G. (2024). GPT semantic cache: Reducing LLM costs and latency via semantic embedding caching. arXiv."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Gill, W., Elidrisi, M., Kalapatapu, P., Ahmed, A., Anwar, A., and Gulzar, M.A. (2025). MeanCache: User-centric semantic cache for large language model-based web services. arXiv.","DOI":"10.1109\/IPDPS64566.2025.00117"},{"key":"ref_14","unstructured":"Anthropic (2025, December 18). Prompt Caching with Claude. Anthropic Documentation. Available online: https:\/\/docs.anthropic.com\/en\/docs\/build-with-claude\/prompt-caching."},{"key":"ref_15","unstructured":"Google (2025, December 18). Gemini Context Caching. Google AI for Developers. Available online: https:\/\/ai.google.dev\/gemini-api\/docs\/caching?lang=python."},{"key":"ref_16","unstructured":"Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., and Askell, A. (2020, January 6\u201312). Language models are few-shot learners. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J.E., Zhang, H., and Stoica, I. (2023, January 23\u201326). Efficient memory management for large language model serving with Paged Attention. Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP), Koblenz, Germany.","DOI":"10.1145\/3600006.3613165"},{"key":"ref_18","unstructured":"Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., R\u00e9, C., and Barrett, C. (2023, January 10\u201316). H2O: Heavy-hitter oracle for efficient generative inference of large language models. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA."},{"key":"ref_19","unstructured":"Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. (2024, January 7\u201311). Efficient streaming language models with attention sinks. Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria."},{"key":"ref_20","unstructured":"Gim, I., Chen, G., Lee, S.-S., Sarda, N., Khandelwal, A., and Zhong, L. (2024, January 13\u201316). Prompt Cache: Modular Attention Reuse for Low-Latency Inference. Proceedings of the Conference on Machine Learning and Systems (MLSys), Santa Clara, CA, USA."},{"key":"ref_21","unstructured":"Xu, D., Yin, W., Jin, X., Zhang, Y., Wei, S., Xu, M., and Liu, X. (2023). LLMCad: Fast and Scalable On-device Large Language Model Inference. arXiv."},{"key":"ref_22","unstructured":"Xie, Y., Li, Z., Zhang, H., Chen, X., Li, Q., and Chen, L. (2024). LLMCache: Efficient Semantic Caching for Large Language Model Inference. arXiv."},{"key":"ref_23","unstructured":"Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q.V., and Zhou, D. (December, January 28). Chain-of-thought prompting elicits reasoning in large language models. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA."},{"key":"ref_24","unstructured":"Huang, Z., Li, Y., and Zhang, H. (2024, January 21\u201327). Executable code actions elicit better LLM agents. Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria."},{"key":"ref_25","unstructured":"Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2023, January 10\u201316). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA."},{"key":"ref_26","unstructured":"Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q.V., Zhou, D., and Chen, X. (2023, January 1\u20135). Large Language Models as Optimizers. Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda."},{"key":"ref_27","unstructured":"Altinel, R., Boncz, P., and Zukowski, M. (2007, January 23\u201327). Cooperative scans: Dynamic bandwidth sharing in a DBMS. Proceedings of the International Conference on Very Large Data Bases (VLDB), Vienna, Austria."},{"key":"ref_28","unstructured":"Cao, P., and Irani, S. (1997, January 9). Cost-aware WWW proxy caching algorithms. Proceedings of the USENIX Symposium on Internet Technologies and Systems (USITS), Monterey, CA, USA."},{"key":"ref_29","unstructured":"Zhang, Y., Zhu, Y., Yu, C., Zhou, K., Zheng, S., Chen, C., Tian, Y., Yang, F., Shao, J., and Liao, X. (2024, January 10\u201312). MuCache: Framework-Agnostic Caching for Microservices. Proceedings of the USENIX Annual Technical Conference (ATC), Santa Clara, CA, USA."},{"key":"ref_30","unstructured":"Otaki, R., Chang, J.H., Benello, C., Elmore, A.J., and Graefe, G. (2025, January 19\u201322). Resource-Adaptive Query Execution with Paged Memory Management. Proceedings of the Conference on Innovative Data Systems Research (CIDR), Amsterdam, The Netherlands."},{"key":"ref_31","first-page":"645","article-title":"Semantic Query Optimization for Query Plans of Heterogeneous Multidatabase Systems","volume":"11","author":"Hsu","year":"1999","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_32","unstructured":"Li, Z., Chang, Y., Yu, G., and Le, X. (2025). HiPlan: Hierarchical Planning for LLM-Based Agents with Adaptive Global-Local Guidance. arXiv."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"O\u2019Neil, E.J., O\u2019Neil, P.E., and Weikum, G. (1993, January 25\u201328). The LRU-K page replacement algorithm for database disk buffering. Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, DC, USA.","DOI":"10.1145\/170035.170081"},{"key":"ref_34","unstructured":"Megiddo, N., and Modha, D.S. (2003, January 31). ARC: A self-tuning, low overhead replacement cache. Proceedings of the USENIX Conference on File and Storage Technologies (FAST), San Francisco, CA, USA."},{"key":"ref_35","unstructured":"Hennessy, J.L., and Patterson, D.A. (2017). Computer Architecture: A Quantitative Approach, Morgan Kaufmann. [6th ed.]."},{"key":"ref_36","unstructured":"Rabinovich, M., and Spatscheck, O. (2002). Web Caching and Replication, Addison-Wesley."},{"key":"ref_37","unstructured":"Sethumurugan, S., Vuppala, J.S.V.R., Krishnakumar, S., and Murugan, T.B. (2021, January 14\u201319). RLR: A Reinforcement Learning Based Cache Replacement Policy. Proceedings of the International Symposium on Computer Architecture (ISCA), Virtual."},{"key":"ref_38","unstructured":"Dehghan, M., Jiang, B., Vuppala, J.S.V.R., and Murugan, T.B. (2020, January 8\u201312). On the Complexity of Traffic Traces and Implications. Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Boston, MA, USA."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Stojkovic, J., Alverti, C., Andrade, A., Iliakopoulou, N.M., Franke, H., Xu, T., and Torrellas, J. (2025, January 1\u20135). Concord: Rethinking Distributed Coherence for Software Caches in Serverless Environments. Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), Las Vegas, NV, USA.","DOI":"10.1109\/HPCA61900.2025.00043"},{"key":"ref_40","unstructured":"Zhang, H., Zuo, D., Yan, Y., Liang, Z., and Wang, H. (2025). SAM: A Stability-Aware Cache Manager for Multi-Tenant Embedded Databases. arXiv."},{"key":"ref_41","unstructured":"Berger, D.S., Sitaraman, R.K., and Harchol-Balter, M. (2017, January 27\u201329). AdaptSize: Orchestrating the hot object memory cache in a content delivery network. Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI), Boston, MA, USA."},{"key":"ref_42","unstructured":"Yang, J., Yue, Y., and Rashmi, K.V. (2020, January 4\u20136). A large scale analysis of hundreds of in-memory cache clusters at Twitter. Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Online."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Kasture, H., and Sanchez, D. (2014, January 1\u20135). Ubik: Efficient cache sharing with strict QoS for latency-critical workloads. Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Salt Lake City, UT, USA.","DOI":"10.1145\/2541940.2541944"},{"key":"ref_44","unstructured":"Huang, Q., Laddad, P., Veeraraghavan, K., Faleiro, J.M., Abadi, D.J., and Ren, X. (2022, January 11\u201313). Cache Made Consistent: Meta\u2019s Cache Invalidation Solution. Proceedings of the USENIX Annual Technical Conference (ATC), Carlsbad, CA, USA."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Dallot, J., Fesharaki, A.J., Pacut, M., and Schmid, S. (2024). Dependency-Aware Online Caching. arXiv.","DOI":"10.1109\/INFOCOM52122.2024.10621422"},{"key":"ref_46","unstructured":"OpenWeatherMap (2025, December 18). Pricing. Available online: https:\/\/openweathermap.org\/price."},{"key":"ref_47","unstructured":"Amazon Web Services (2025, December 18). Amazon RDS Proxy Pricing. Available online: https:\/\/aws.amazon.com\/rds\/proxy\/pricing\/."},{"key":"ref_48","unstructured":"Amazon Web Services (2025, December 18). AWS Lambda Pricing. Available online: https:\/\/aws.amazon.com\/lambda\/pricing\/."},{"key":"ref_49","unstructured":"RapidAPI (2025, December 18). API Marketplace Pricing. Available online: https:\/\/rapidapi.com\/backend_box\/api\/usage-and-billing\/pricing."},{"key":"ref_50","unstructured":"Zhang, Q., Wornow, M., and Olukotun, K. (2025). Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching. arXiv."},{"key":"ref_51","unstructured":"Li, G., Wu, R., and Tan, H. (2025). A Plan Reuse Mechanism for LLM-Driven Agent. arXiv."},{"key":"ref_52","unstructured":"Chu, K., Lin, Z., Xiang, D., Shen, Z., Su, J., Chu, C., Yang, Y., Zhang, W., Wu, W., and Zhang, W. (2025). SafeKV: Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Serving. arXiv."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Volos, S., Fournet, C., Hofmann, J., and K\u00f6pf, B. (2024, January 14\u201318). Principled Microarchitectural Isolation on Cloud CPUs. Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), Salt Lake City, UT, USA.","DOI":"10.1145\/3658644.3690183"},{"key":"ref_54","unstructured":"Song, Z., Chen, K., Sarda, N., Alt\u0131nb\u00fcken, D., Brevdo, E., Coleman, J., Ju, X., Jurczyk, P., Schooler, R., and Gummadi, R. (2023, January 17\u201319). HALP: Heuristic Aided Learned Preference Eviction Policy for YouTube CDN. Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), Boston, MA, USA."}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/8\/2\/30\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,27]],"date-time":"2026-01-27T11:48:22Z","timestamp":1769514502000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/8\/2\/30"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,27]]},"references-count":54,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2026,2]]}},"alternative-id":["make8020030"],"URL":"https:\/\/doi.org\/10.3390\/make8020030","relation":{},"ISSN":["2504-4990"],"issn-type":[{"value":"2504-4990","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,27]]}}}