{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,20]],"date-time":"2025-12-20T08:39:54Z","timestamp":1766219994164,"version":"3.48.0"},"publisher-location":"New York, NY, USA","reference-count":26,"publisher":"ACM","funder":[{"name":"Hong Kong Research Grants Council","award":["GRF 16214123"],"award-info":[{"award-number":["GRF 16214123"]}]},{"name":"AI Chip Center for Emerging Smart Systems (ACCESS)","award":["N.A."],"award-info":[{"award-number":["N.A."]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,9,8]]},"DOI":"10.1145\/3754598.3754671","type":"proceedings-article","created":{"date-parts":[[2025,12,20]],"date-time":"2025-12-20T08:34:32Z","timestamp":1766219672000},"page":"460-469","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-7037-7418","authenticated-orcid":false,"given":"Zhongchun","family":"Zhou","sequence":"first","affiliation":[{"name":"The Hong Kong University of Science and Technology, Hong Kong, Hong Kong"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9547-9653","authenticated-orcid":false,"given":"Chengtao","family":"Lai","sequence":"additional","affiliation":[{"name":"The Hong Kong University of Science and Technology, Hong Kong, Hong Kong"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7622-6714","authenticated-orcid":false,"given":"Wei","family":"Zhang","sequence":"additional","affiliation":[{"name":"The Hong Kong University of Science and Technology, Hong Kong, Hong Kong"}]}],"member":"320","published-online":{"date-parts":[[2025,12,20]]},"reference":[{"key":"e_1_3_3_1_2_2","doi-asserted-by":"crossref","unstructured":"Joshua Ainslie James Lee-Thorp Michiel de Jong Yury Zemlyanskiy Federico Lebr\u00f3n and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arxiv:https:\/\/arXiv.org\/abs\/2305.13245\u00a0[cs.CL] https:\/\/arxiv.org\/abs\/2305.13245","DOI":"10.18653\/v1\/2023.emnlp-main.298"},{"key":"e_1_3_3_1_3_2","doi-asserted-by":"publisher","unstructured":"Aritra Bagchi Dharamjeet Ohm Rishabh Manan Suri and Preeti\u00a0Ranjan Panda. 2024. POEM: Performance Optimization and Endurance Management for Non-volatile Caches. ACM Trans. Des. Autom. Electron. Syst. 29 5 Article 79 (Sept. 2024) 36\u00a0pages. 10.1145\/3653452","DOI":"10.1145\/3653452"},{"key":"e_1_3_3_1_4_2","doi-asserted-by":"publisher","unstructured":"Aritra Bagchi Dinesh Joshi and Preeti\u00a0Ranjan Panda. 2024. COBRRA: COntention-aware cache Bypass with Request-Response Arbitration. ACM Trans. Embed. Comput. Syst. 23 1 Article 12 (jan 2024) 30\u00a0pages. 10.1145\/3632748","DOI":"10.1145\/3632748"},{"key":"e_1_3_3_1_5_2","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman et\u00a0al. 2024. The Llama 3 Herd of Models. arxiv:https:\/\/arXiv.org\/abs\/2407.21783\u00a0[cs.AI] https:\/\/arxiv.org\/abs\/2407.21783"},{"key":"e_1_3_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3330345.3330390"},{"key":"e_1_3_3_1_7_2","volume-title":"Intel Accelerator Engines","year":"2023","unstructured":"Intel. 2023. Intel Accelerator Engines. https:\/\/www.intel.com\/content\/www\/us\/en\/products\/docs\/accelerator-engines\/overview.html"},{"key":"e_1_3_3_1_8_2","unstructured":"Intel. 2024. Intel\u00ae 64 and IA-32 Architectures Optimization Reference Manual Volume 1. https:\/\/www.intel.com\/content\/www\/us\/en\/content-details\/671488\/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html"},{"key":"e_1_3_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579990.3580017"},{"key":"e_1_3_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835938"},{"key":"e_1_3_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575747"},{"key":"e_1_3_3_1_12_2","series-title":"(PACT \u201913)","first-page":"157","volume-title":"Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques","author":"Kay\u0131ran Onur","year":"2013","unstructured":"Onur Kay\u0131ran, Adwait Jog, Mahmut\u00a0Taylan Kandemir, and Chita\u00a0Ranjan Das. 2013. Neither more nor less: optimizing thread-level parallelism for GPGPUs. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (Edinburgh, Scotland, UK) (PACT \u201913). IEEE Press, 157\u2013166."},{"key":"e_1_3_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00047"},{"key":"e_1_3_3_1_14_2","unstructured":"Gwang\u00a0Bok Kim Jong\u00a0Myon Kim and Cheol\u00a0Hong Kim. 2019. Mshr-aware dynamic warp scheduler for high performance GPUS. KIPS Transactions on Computer and Communication Systems 8 5 (2019) 111\u2013118."},{"key":"e_1_3_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3650200.3656592"},{"key":"e_1_3_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835937"},{"key":"e_1_3_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00071"},{"key":"e_1_3_3_1_18_2","doi-asserted-by":"crossref","unstructured":"Haocong Luo Yahya\u00a0Can Tu\u011frul F\u00a0Nisa Bostanc\u0131 Ataberk Olgun A\u00a0Giray Ya\u011fl\u0131k\u00e7\u0131 and Onur Mutlu. 2023. Ramulator 2.0: A modern modular and extensible dram simulator. IEEE Computer Architecture Letters 23 1 (2023) 112\u2013116.","DOI":"10.1109\/LCA.2023.3333759"},{"key":"e_1_3_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/2717764.2717783"},{"key":"e_1_3_3_1_20_2","doi-asserted-by":"publisher","unstructured":"Subhankar Pal Swagath Venkataramani Viji Srinivasan and Kailash Gopalakrishnan. 2022. OnSRAM: Efficient Inter-Node On-Chip Scratchpad Management in Deep Learning Accelerators. ACM Trans. Embed. Comput. Syst. 21 6 Article 86 (oct 2022) 29\u00a0pages. 10.1145\/3530909","DOI":"10.1145\/3530909"},{"key":"e_1_3_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2019.00042"},{"key":"e_1_3_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640422"},{"key":"e_1_3_3_1_23_2","volume-title":"Snapdragon X Elite","year":"2023","unstructured":"Qualcomm. 2023. Snapdragon X Elite. https:\/\/www.qualcomm.com\/snapdragon\/laptops"},{"key":"e_1_3_3_1_24_2","doi-asserted-by":"crossref","unstructured":"Jay Shah Ganesh Bikshandi Ying Zhang Vijay Thakkar Pradeep Ramani and Tri Dao. 2025. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems 37 (2025) 68658\u201368685.","DOI":"10.52202\/079017-2193"},{"key":"e_1_3_3_1_25_2","unstructured":"Gemma Team Morgane Riviere Shreya Pathak Pier\u00a0Giuseppe Sessa Cassidy Hardin Surya Bhupatiraju et\u00a0al. 2024. Gemma 2: Improving Open Language Models at a Practical Size. arxiv:https:\/\/arXiv.org\/abs\/2408.00118\u00a0[cs.CL] https:\/\/arxiv.org\/abs\/2408.00118"},{"key":"e_1_3_3_1_26_2","doi-asserted-by":"publisher","unstructured":"Sakshi Tiwari Shreshth Tuli Isaar Ahmad Ayushi Agarwal Preeti\u00a0Ranjan Panda and Sreenivas Subramoney. 2019. REAL: REquest Arbitration in Last Level Caches. ACM Trans. Embed. Comput. Syst. 18 6 Article 115 (nov 2019) 24\u00a0pages. 10.1145\/3362100","DOI":"10.1145\/3362100"},{"key":"e_1_3_3_1_27_2","doi-asserted-by":"crossref","unstructured":"Sam Xi Yuan Yao Kshitij Bhardwaj Paul Whatmough Gu-Yeon Wei and David Brooks. 2020. SMAUG: End-to-end full-stack simulation infrastructure for deep learning workloads. ACM Transactions on Architecture and Code Optimization (TACO) 17 4 (2020) 1\u201326.","DOI":"10.1145\/3424669"}],"event":{"name":"ICPP '25: 54th International Conference on Parallel Processing","location":"San Diego CA USA","acronym":"ICPP '25"},"container-title":["Proceedings of the 54th International Conference on Parallel Processing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3754598.3754671","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,20]],"date-time":"2025-12-20T08:37:40Z","timestamp":1766219860000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3754598.3754671"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,8]]},"references-count":26,"alternative-id":["10.1145\/3754598.3754671","10.1145\/3754598"],"URL":"https:\/\/doi.org\/10.1145\/3754598.3754671","relation":{},"subject":[],"published":{"date-parts":[[2025,9,8]]},"assertion":[{"value":"2025-12-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}