{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T13:40:11Z","timestamp":1755870011172,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":54,"publisher":"ACM","license":[{"start":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T00:00:00Z","timestamp":1755820800000},"content-version":"vor","delay-in-days":75,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"NSF (National Science Foundation)","doi-asserted-by":"publisher","award":["2442271"],"award-info":[{"award-number":["2442271"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,6,8]]},"DOI":"10.1145\/3721145.3725779","type":"proceedings-article","created":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T12:57:17Z","timestamp":1755867437000},"page":"13-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["BitWeaver: Read-Time Truncation in Memory"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-0754-9518","authenticated-orcid":false,"given":"Garrett","family":"Gagnon","sequence":"first","affiliation":[{"name":"Samsung Semiconductor US, San Jose, California, USA and Rensselaer Polytechnic Institute, Troy, New York, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7431-3064","authenticated-orcid":false,"given":"Srikanth","family":"Malla","sequence":"additional","affiliation":[{"name":"Samsung Semiconductor US, San Jose, California, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-9536-1894","authenticated-orcid":false,"given":"Yangwook","family":"Kang","sequence":"additional","affiliation":[{"name":"Samsung Semiconductor US, San Jose, California, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0792-8146","authenticated-orcid":false,"given":"Liu","family":"Liu","sequence":"additional","affiliation":[{"name":"Rensselaer Polytechnic Institute, Troy, New York, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,8,22]]},"reference":[{"key":"e_1_3_3_1_2_2","unstructured":"Muhammad Adnan Akhil Arunkumar Gaurav Jain Prashant Nair Ilya Soloveychik and Purushotham Kamath. 2024. Keyformer: Kv cache reduction through key tokens selection for efficient generative inference. Proceedings of Machine Learning and Systems 6 (2024) 114\u2013127."},{"key":"e_1_3_3_1_3_2","unstructured":"Amey Agrawal Nitin Kedia Jayashree Mohan Ashish Panwar Nipun Kwatra Bhargav Gulavani Ramachandran Ramjee and Alexey Tumanov. 2024. Vidur: A Large-Scale Simulation Framework For LLM Inference. Proceedings of Machine Learning and Systems 6 (2024) 351\u2013366."},{"key":"e_1_3_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1245"},{"key":"e_1_3_3_1_5_2","unstructured":"Anthropic. 2023. Claude 2. https:\/\/www.anthropic.com\/news\/claude-2. [Accessed 15-01-2025]."},{"key":"e_1_3_3_1_6_2","unstructured":"Anthropic. 2024. Introducing the next generation of Claude. https:\/\/www.anthropic.com\/news\/claude-3-family. [Accessed 15-01-2025]."},{"key":"e_1_3_3_1_7_2","volume-title":"High Bandwidth Memory DRAM (HBM1, HBM2)","author":"Association JEDEC Solid State\u00a0Technology","year":"2021","unstructured":"JEDEC Solid State\u00a0Technology Association. 2021. High Bandwidth Memory DRAM (HBM1, HBM2). Technical Report. 
JEDEC."},{"key":"e_1_3_3_1_8_2","volume-title":"High Bandwidth Memory DRAM (HBM3)","author":"Association JEDEC Solid State\u00a0Technology","year":"2023","unstructured":"JEDEC Solid State\u00a0Technology Association. 2023. High Bandwidth Memory DRAM (HBM3). Technical Report. JEDEC."},{"key":"e_1_3_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6239"},{"key":"e_1_3_3_1_10_2","unstructured":"Yelysei Bondarenko Markus Nagel and Tijmen Blankevoort. 2023. Quantizable transformers: Removing outliers by helping attention heads do nothing. Advances in Neural Information Processing Systems 36 (2023) 75067\u201375096."},{"key":"e_1_3_3_1_11_2","unstructured":"Wei-Lin Chiang Lianmin Zheng Ying Sheng Anastasios\u00a0Nikolas Angelopoulos Tianle Li Dacheng Li Hao Zhang Banghua Zhu Michael Jordan Joseph\u00a0E. Gonzalez and Ion Stoica. 2024. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arxiv:https:\/\/arXiv.org\/abs\/2403.04132\u00a0[cs.AI]"},{"key":"e_1_3_3_1_12_2","doi-asserted-by":"crossref","unstructured":"Jack Choquette. 2023. Nvidia hopper h100 gpu: Scaling performance. IEEE Micro 43 3 (2023) 9\u201317.","DOI":"10.1109\/MM.2023.3256796"},{"key":"e_1_3_3_1_13_2","unstructured":"Jincheng Dai Zhuowei Huang Haiyun Jiang Chen Chen Deng Cai Wei Bi and Shuming Shi. 2024. Sequence can Secretly Tell You What to Discard. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2404.15949 (2024)."},{"key":"e_1_3_3_1_14_2","unstructured":"Steve Dai Rangha Venkatesan Mark Ren Brian Zimmer William Dally and Brucek Khailany. 2021. Vs-quant: Per-vector scaled quantization for accurate low-precision neural network inference. Proceedings of Machine Learning and Systems 3 (2021) 873\u2013884."},{"key":"e_1_3_3_1_15_2","unstructured":"Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. 
arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2307.08691 (2023)."},{"key":"e_1_3_3_1_16_2","unstructured":"Tri Dao Dan Fu Stefano Ermon Atri Rudra and Christopher R\u00e9. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems 35 (2022) 16344\u201316359."},{"key":"e_1_3_3_1_17_2","unstructured":"Harry Dong Xinyu Yang Zhenyu Zhang Zhangyang Wang Yuejie Chi and Beidi Chen. 2024. Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2402.09398 (2024)."},{"key":"e_1_3_3_1_18_2","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et\u00a0al. 2024. The llama 3 herd of models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2407.21783 (2024)."},{"key":"e_1_3_3_1_19_2","unstructured":"Cl\u00e9mentine Fourrier Nathan Habib Alina Lozovskaya Konrad Szafer and Thomas Wolf. 2024. Open LLM Leaderboard v2. https:\/\/huggingface.co\/spaces\/open-llm-leaderboard\/open_llm_leaderboard."},{"key":"e_1_3_3_1_20_2","series-title":"Proceedings of Machine Learning Research","first-page":"3929","volume-title":"Proceedings of the 37th International Conference on Machine Learning","volume":"119","author":"Guu Kelvin","year":"2020","unstructured":"Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval Augmented Language Model Pre-Training. In Proceedings of the 37th International Conference on Machine Learning(Proceedings of Machine Learning Research, Vol.\u00a0119), Hal\u00a0Daum\u00e9 III and Aarti Singh (Eds.). PMLR, 3929\u20133938. https:\/\/proceedings.mlr.press\/v119\/guu20a.html"},{"key":"e_1_3_3_1_21_2","unstructured":"Karl\u00a0Moritz Hermann Tomas Kocisky Edward Grefenstette Lasse Espeholt Will Kay Mustafa Suleyman and Phil Blunsom. 2015. Teaching machines to read and comprehend. 
Advances in neural information processing systems 28 (2015)."},{"key":"e_1_3_3_1_22_2","doi-asserted-by":"crossref","unstructured":"Kai Huang Bowen Li Dongliang Xiong Haitian Jiang Xiaowen Jiang Xiaolang Yan Luc Claesen Dehong Liu Junjian Chen and Zhili Liu. 2023. Structured dynamic precision for deep neural networks quantization. ACM Transactions on Design Automation of Electronic Systems 28 1 (2023) 1\u201324.","DOI":"10.1145\/3549535"},{"key":"e_1_3_3_1_23_2","unstructured":"Albert\u00a0Q Jiang Alexandre Sablayrolles Arthur Mensch Chris Bamford Devendra\u00a0Singh Chaplot Diego de\u00a0las Casas Florian Bressand Gianna Lengyel Guillaume Lample Lucile Saulnier et\u00a0al. 2023. Mistral 7B. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2310.06825 (2023)."},{"key":"e_1_3_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/IMW.2017.7939084"},{"key":"e_1_3_3_1_25_2","unstructured":"Hao Kang Qingru Zhang Souvik Kundu Geonhwa Jeong Zaoxing Liu Tushar Krishna and Tuo Zhao. 2024. Gear: An efficient kv cache compression recipe for near-lossless generative inference of llm. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2403.05527 (2024)."},{"key":"e_1_3_3_1_26_2","doi-asserted-by":"crossref","unstructured":"Sakaguchi Keisuke Le\u00a0Bras Ronan Bhagavatula Chandra and Choi Yejin. 2019. WinoGrande: An Adversarial Winograd Schema Challenge at Scale.","DOI":"10.1609\/aaai.v34i05.6399"},{"key":"e_1_3_3_1_27_2","first-page":"155","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Lee Wonbeom","year":"2024","unstructured":"Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 
155\u2013172."},{"key":"e_1_3_3_1_28_2","unstructured":"Patrick Lewis Ethan Perez Aleksandra Piktus Fabio Petroni Vladimir Karpukhin Naman Goyal Heinrich K\u00fcttler Mike Lewis Wen-tau Yih Tim Rockt\u00e4schel Sebastian Riedel and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33 (2020) 9459\u20139474."},{"key":"e_1_3_3_1_29_2","unstructured":"Ji Lin Jiaming Tang Haotian Tang Shang Yang Wei-Ming Chen Wei-Chen Wang Guangxuan Xiao Xingyu Dang Chuang Gan and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. Proceedings of Machine Learning and Systems 6 (2024) 87\u2013100."},{"key":"e_1_3_3_1_30_2","unstructured":"Zichang Liu Aditya Desai Fangshuo Liao Weitao Wang Victor Xie Zhaozhuo Xu Anastasios Kyrillidis and Anshumali Shrivastava. 2024. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_3_3_1_31_2","unstructured":"Zirui Liu Jiayi Yuan Hongye Jin Shaochen Zhong Zhaozhuo Xu Vladimir Braverman Beidi Chen and Xia Hu. 2024. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2402.02750 (2024)."},{"key":"e_1_3_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/HOTCHIPS.2015.7477461"},{"key":"e_1_3_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1260"},{"key":"e_1_3_3_1_34_2","doi-asserted-by":"crossref","unstructured":"Shashi Narayan Shay\u00a0B. Cohen and Mirella Lapata. 2018. Don\u2019t Give Me the Details Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. ArXiv abs\/1808.08745 (2018).","DOI":"10.18653\/v1\/D18-1206"},{"key":"e_1_3_3_1_35_2","unstructured":"Matteo Pagliardini Daniele Paliotta Martin Jaggi and Fran\u00e7ois Fleuret. 2023. 
Fast attention over long sequences with dynamic sparse flash attention. Advances in Neural Information Processing Systems 36 (2023) 59808\u201359831."},{"key":"e_1_3_3_1_36_2","doi-asserted-by":"publisher","unstructured":"Denis Paperno Germ\u00e1n Kruszewski Angeliki Lazaridou Quan\u00a0Ngoc Pham Raffaella Bernardi Sandro Pezzelle Marco Baroni Gemma Boleda and Raquel Fern\u00e1ndez. 2016. The LAMBADA dataset. 10.5281\/zenodo.2630551","DOI":"10.5281\/zenodo.2630551"},{"key":"e_1_3_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3649329.3655953"},{"key":"e_1_3_3_1_38_2","doi-asserted-by":"crossref","unstructured":"Myeong-Jae Park Jinhyung Lee Kyungjun Cho Jihwan Park Junil Moon Sung-Hak Lee Tae-Kyun Kim Sanghoon Oh Seokwoo Choi Yongsuk Choi Ho\u00a0Sung Cho Taesik Yun Young\u00a0Jun Koo Jae-Seung Lee Byung-Kuk Yoon Young-Jun Park Sangmuk Oh Chang\u00a0Kwon Lee Seong-Hee Lee Hyun-Woo Kim Yucheon Ju Seung-Kyun Lim Kyo\u00a0Yun Lee Sang-Hoon Lee Woo\u00a0Sung We Seungchan Kim Seung\u00a0Min Yang Keonho Lee In-Keun Kim Younghyun Jeon Jae-Hyung Park Jong\u00a0Chan Yun Seonyeol Kim Dong-Yeol Lee Su-Hyun Oh Jung-Hyun Shin Yeonho Lee Jieun Jang and Joohwan Cho. 2022. A 192-Gb 12-high 896-GB\/s HBM3 DRAM with a TSV auto-calibration scheme and machine-learning-based layout optimization. IEEE Journal of Solid-State Circuits 58 1 (2022) 256\u2013269.","DOI":"10.1109\/JSSC.2022.3193354"},{"key":"e_1_3_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503222.3507738"},{"key":"e_1_3_3_1_40_2","unstructured":"Luka Ribar Ivan Chelombiev Luke Hudlass-Galley Charlie Blake Carlo Luschi and Douglas Orr. 2023. Sparq attention: Bandwidth-efficient llm inference. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2312.04985 (2023)."},{"key":"e_1_3_3_1_41_2","volume-title":"2011 AAAI spring symposium series","author":"Roemmele Melissa","year":"2011","unstructured":"Melissa Roemmele, Cosmin\u00a0Adrian Bejan, and Andrew\u00a0S Gordon. 2011. 
Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI spring symposium series."},{"key":"e_1_3_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589351"},{"key":"e_1_3_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/2370816.2370864"},{"key":"e_1_3_3_1_44_2","unstructured":"Jay Shah Ganesh Bikshandi Ying Zhang Vijay Thakkar Pradeep Ramani and Tri Dao. 2024. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2407.08608 (2024)."},{"key":"e_1_3_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/MSE.2007.44"},{"key":"e_1_3_3_1_46_2","unstructured":"A Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems (2017)."},{"key":"e_1_3_3_1_47_2","doi-asserted-by":"crossref","unstructured":"Alex Wang Amanpreet Singh Julian Michael Felix Hill Omer Levy and Samuel\u00a0R. Bowman. 2019. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In the Proceedings of ICLR.","DOI":"10.18653\/v1\/W18-5446"},{"key":"e_1_3_3_1_48_2","unstructured":"Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https:\/\/github.com\/kingoflolz\/mesh-transformer-jax."},{"key":"e_1_3_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00018"},{"key":"e_1_3_3_1_50_2","doi-asserted-by":"crossref","unstructured":"Samuel Williams Andrew Waterman and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52 4 (2009) 65\u201376.","DOI":"10.1145\/1498765.1498785"},{"key":"e_1_3_3_1_51_2","doi-asserted-by":"crossref","unstructured":"Rui Xie Asad\u00a0Ul Haq Linsen Ma Krystal Sun Sanchari Sen Swagath Venkataramani Liu Liu and Tong Zhang. 2024. SmartQuant: CXL-Based AI Model Store in Support of Runtime Configurable Weight Quantization. 
IEEE Computer Architecture Letters (2024).","DOI":"10.1109\/LCA.2024.3452699"},{"key":"e_1_3_3_1_52_2","unstructured":"June\u00a0Yong Yang Byeongwook Kim Jeongin Bae Beomseok Kwon Gunho Park Eunho Yang Se\u00a0Jung Kwon and Dongsoo Lee. 2024. No token left behind: Reliable kv cache compression via importance-aware mixed precision quantization. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2402.18096 (2024)."},{"key":"e_1_3_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1472"},{"key":"e_1_3_3_1_54_2","unstructured":"Zhenyu Zhang Shiwei Liu Runjin Chen Bhavya Kailkhura Beidi Chen and Atlas Wang. 2024. Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache. Proceedings of Machine Learning and Systems 6 (2024) 381\u2013394."},{"key":"e_1_3_3_1_55_2","unstructured":"Zhenyu Zhang Ying Sheng Tianyi Zhou Tianlong Chen Lianmin Zheng Ruisi Cai Zhao Song Yuandong Tian Christopher R\u00e9 Zhangyang Wang and Clark Barrett. 2024. H2o: Heavy-hitter oracle for efficient generative inference of large language models. 
Advances in Neural Information Processing Systems 36 (2024)."}],"event":{"name":"ICS '25: 2025 International Conference on Supercomputing","location":"Salt Lake City USA","acronym":"ICS '25","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture"]},"container-title":["Proceedings of the 39th ACM International Conference on Supercomputing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3721145.3725779","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3721145.3725779","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T13:02:11Z","timestamp":1755867731000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3721145.3725779"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,8]]},"references-count":54,"alternative-id":["10.1145\/3721145.3725779","10.1145\/3721145"],"URL":"https:\/\/doi.org\/10.1145\/3721145.3725779","relation":{},"subject":[],"published":{"date-parts":[[2025,6,8]]},"assertion":[{"value":"2025-08-22","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}