{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T16:21:00Z","timestamp":1772727660586,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":85,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,6,21]]},"DOI":"10.1145\/3695053.3731057","type":"proceedings-article","created":{"date-parts":[[2025,6,20]],"date-time":"2025-06-20T16:46:17Z","timestamp":1750437977000},"page":"514-528","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-6435-263X","authenticated-orcid":false,"given":"Zhiwen","family":"Mo","sequence":"first","affiliation":[{"name":"Imperial College London, London, United Kingdom and Microsoft Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-2313-5348","authenticated-orcid":false,"given":"Lei","family":"Wang","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China and Microsoft Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-0830-044X","authenticated-orcid":false,"given":"Jianyu","family":"Wei","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China and Microsoft Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-0023-2367","authenticated-orcid":false,"given":"Zhichen","family":"Zeng","sequence":"additional","affiliation":[{"name":"University of Washington, Seattle, USA and Microsoft Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-2001-3763","authenticated-orcid":false,"given":"Shijie","family":"Cao","sequence":"additional","affiliation":[{"name":"Microsoft Research, 
Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-9524-5476","authenticated-orcid":false,"given":"Lingxiao","family":"Ma","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8417-5796","authenticated-orcid":false,"given":"Naifeng","family":"Jing","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9107-013X","authenticated-orcid":false,"given":"Ting","family":"Cao","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4495-1997","authenticated-orcid":false,"given":"Jilong","family":"Xue","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0378-060X","authenticated-orcid":false,"given":"Fan","family":"Yang","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-6455-3898","authenticated-orcid":false,"given":"Mao","family":"Yang","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,6,20]]},"reference":[{"key":"e_1_3_3_1_2_2","unstructured":"Josh Achiam Steven Adler Sandhini Agarwal Lama Ahmad Ilge Akkaya Florencia\u00a0Leoni Aleman Diogo Almeida Janko Altenschmidt Sam Altman Shyamal Anadkat et\u00a0al. 2023. Gpt-4 technical report. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2303.08774 (2023)."},{"key":"e_1_3_3_1_3_2","unstructured":"Yonatan Bisk Rowan Zellers Ronan\u00a0Le Bras Jianfeng Gao and Yejin Choi. 2019. PIQA: Reasoning about Physical Commonsense in Natural Language. 
arxiv:https:\/\/arXiv.org\/abs\/1911.11641\u00a0[cs.CL]"},{"key":"e_1_3_3_1_4_2","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared\u00a0D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et\u00a0al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877\u20131901."},{"key":"e_1_3_3_1_5_2","unstructured":"Jerry Chee Yaohui Cai Volodymyr Kuleshov and Christopher\u00a0M De\u00a0Sa. 2024. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_3_3_1_6_2","unstructured":"Tianqi Chen Thierry Moreau Ziheng Jiang Haichen Shen Eddie\u00a0Q Yan Leyuan Wang Yuwei Hu Luis Ceze Carlos Guestrin and Arvind Krishnamurthy. 2018. TVM: end-to-end optimization stack for deep learning. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1802.04799 11 20 (2018)."},{"key":"e_1_3_3_1_7_2","doi-asserted-by":"crossref","unstructured":"Jack Choquette Wishwesh Gandhi Olivier Giroux Nick Stam and Ronny Krashinsky. 2021. Nvidia a100 tensor core gpu: Performance and innovation. IEEE Micro 41 2 (2021) 29\u201335.","DOI":"10.1109\/MM.2021.3061394"},{"key":"e_1_3_3_1_8_2","unstructured":"Christopher Clark Kenton Lee Ming-Wei Chang Tom Kwiatkowski Michael Collins and Kristina Toutanova. 2019. BoolQ: Exploring the Surprising Difficulty of Natural Yes\/No Questions. arxiv:https:\/\/arXiv.org\/abs\/1905.10044\u00a0[cs.CL]"},{"key":"e_1_3_3_1_9_2","volume-title":"NVIDIA Blackwell Architecture Technical Brief","author":"Corporation NVIDIA","year":"2025","unstructured":"NVIDIA Corporation. 2025. NVIDIA Blackwell Architecture Technical Brief. Technical Report. NVIDIA Corporation. https:\/\/resources.nvidia.com\/en-us-blackwell-architecture?ncid=no-ncid"},{"key":"e_1_3_3_1_10_2","unstructured":"NVIDIA Corporation. 2025. Parallel Thread Execution ISA Version 8.8. 
https:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/. Accessed: 2025-05-02."},{"key":"e_1_3_3_1_11_2","unstructured":"Tim Dettmers Mike Lewis Younes Belkada and Luke Zettlemoyer. 2022. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems 35 (2022) 30318\u201330332."},{"key":"e_1_3_3_1_12_2","unstructured":"Tim Dettmers Artidoro Pagnoni Ari Holtzman and Luke Zettlemoyer. 2024. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_3_3_1_13_2","first-page":"7750","volume-title":"International Conference on Machine Learning","author":"Dettmers Tim","year":"2023","unstructured":"Tim Dettmers and Luke Zettlemoyer. 2023. The case for 4-bit precision: k-bit inference scaling laws. In International Conference on Machine Learning. PMLR, 7750\u20137774."},{"key":"e_1_3_3_1_14_2","unstructured":"Yiran Ding Li\u00a0Lyna Zhang Chengruidong Zhang Yuanyuan Xu Ning Shang Jiahang Xu Fan Yang and Mao Yang. 2024. Longrope: Extending llm context window beyond 2 million tokens. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2402.13753 (2024)."},{"key":"e_1_3_3_1_15_2","unstructured":"Dayou Du Yijia Zhang Shijie Cao Jiaqi Guo Ting Cao Xiaowen Chu and Ningyi Xu. 2024. BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2402.10631 (2024)."},{"key":"e_1_3_3_1_16_2","unstructured":"Elias Frantar Saleh Ashkboos Torsten Hoefler and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2210.17323 (2022)."},{"key":"e_1_3_3_1_17_2","unstructured":"ggml org. 2025. llama.cpp: Port of LLaMA models to C\/C++. 
https:\/\/github.com\/ggml-org\/llama.cpp."},{"key":"e_1_3_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3613424.3614312"},{"key":"e_1_3_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589038"},{"key":"e_1_3_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO56248.2022.00095"},{"key":"e_1_3_3_1_21_2","unstructured":"Dan Hendrycks Collin Burns Steven Basart Andy Zou Mantas Mazeika Dawn Song and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. arxiv:https:\/\/arXiv.org\/abs\/2009.03300\u00a0[cs.CY]"},{"key":"e_1_3_3_1_22_2","unstructured":"Jordan Hoffmann Sebastian Borgeaud Arthur Mensch Elena Buchatskaya Trevor Cai Eliza Rutherford Diego de\u00a0Las Casas Lisa\u00a0Anne Hendricks Johannes Welbl Aidan Clark et\u00a0al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2203.15556 (2022)."},{"key":"e_1_3_3_1_23_2","unstructured":"Coleman Hooper Sehoon Kim Hiva Mohammadzadeh Michael\u00a0W Mahoney Yakun\u00a0Sophia Shao Kurt Keutzer and Amir Gholami. 2024. Kvquant: Towards 10 million context length llm inference with kv cache quantization. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2401.18079 (2024)."},{"key":"e_1_3_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3613424.3623775"},{"key":"e_1_3_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA57654.2024.00063"},{"key":"e_1_3_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA57654.2024.00064"},{"key":"e_1_3_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00099"},{"key":"e_1_3_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783722"},{"key":"e_1_3_3_1_29_2","unstructured":"Jared Kaplan Sam McCandlish Tom Henighan Tom\u00a0B Brown Benjamin Chess Rewon Child Scott Gray Alec Radford Jeffrey Wu and Dario Amodei. 2020. Scaling laws for neural language models. 
arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2001.08361 (2020)."},{"key":"e_1_3_3_1_30_2","unstructured":"Ayush Kaushal Tejas Vaidhya Arnab\u00a0Kumar Mondal Tejas Pandey Aaryan Bhagat and Irina Rish. 2024. Spectra: Surprising effectiveness of pretraining ternary language models at scale. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2407.12327 (2024)."},{"key":"e_1_3_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00047"},{"key":"e_1_3_3_1_32_2","unstructured":"Sehoon Kim Coleman Hooper Amir Gholami Zhen Dong Xiuyu Li Sheng Shen Michael\u00a0W Mahoney and Kurt Keutzer. 2023. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2306.07629 (2023)."},{"key":"e_1_3_3_1_33_2","unstructured":"Tanishq Kumar Zachary Ankner Benjamin\u00a0F Spector Blake Bordelon Niklas Muennighoff Mansheej Paul Cengiz Pehlevan Christopher R\u00e9 and Aditi Raghunathan. 2024. Scaling laws for precision. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2411.04330 (2024)."},{"key":"e_1_3_3_1_34_2","unstructured":"Andrey Kuzmin Mart Van\u00a0Baalen Yuwei Ren Markus Nagel Jorn Peters and Tijmen Blankevoort. 2022. Fp8 quantization: The power of the exponent. Advances in Neural Information Processing Systems 35 (2022) 14651\u201314662."},{"key":"e_1_3_3_1_35_2","doi-asserted-by":"crossref","unstructured":"Hyoukjun Kwon Prasanth Chatarasi Vivek Sarkar Tushar Krishna Michael Pellauer and Angshuman Parashar. 2020. Maestro: A data-centric approach to understand reuse performance and hardware cost of dnn mappings. IEEE micro 40 3 (2020) 20\u201329.","DOI":"10.1109\/MM.2020.2985963"},{"key":"e_1_3_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640356"},{"key":"e_1_3_3_1_37_2","unstructured":"Teven Le\u00a0Scao Angela Fan Christopher Akiki Ellie Pavlick Suzana Ili\u0107 Daniel Hesslow Roman Castagn\u00e9 Alexandra\u00a0Sasha Luccioni Fran\u00e7ois Yvon Matthias Gall\u00e9 et\u00a0al. 2023. 
Bloom: A 176b-parameter open-access multilingual language model. (2023)."},{"key":"e_1_3_3_1_38_2","unstructured":"Changhun Lee Jungyu Jin Taesu Kim Hyungjun Kim and Eunhyeok Park. 2023. Owq: Lessons learned from activation outliers for weight quantization in large language models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2306.02272 (2023)."},{"key":"e_1_3_3_1_39_2","doi-asserted-by":"publisher","unstructured":"Jinmook Lee Changhyeon Kim Sanghoon Kang Dongjoo Shin Sangyeob Kim and Hoi-Jun Yoo. 2019. UNPU: An Energy-Efficient Deep Neural Network Accelerator With Fully Variable Weight Bit Precision. IEEE Journal of Solid-State Circuits 54 1 (2019) 173\u2013185. 10.1109\/JSSC.2018.2865489","DOI":"10.1109\/JSSC.2018.2865489"},{"key":"e_1_3_3_1_40_2","unstructured":"Ji Lin Jiaming Tang Haotian Tang Shang Yang Xingyu Dang and Song Han. 2023. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2306.00978 (2023)."},{"key":"e_1_3_3_1_41_2","unstructured":"Jing Liu Ruihao Gong Xiuying Wei Zhiwei Dong Jianfei Cai and Bohan Zhuang. 2024. QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models."},{"key":"e_1_3_3_1_42_2","unstructured":"Zirui Liu Jiayi Yuan Hongye Jin Shaochen Zhong Zhaozhuo Xu Vladimir Braverman Beidi Chen and Xia Hu. 2024. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2402.02750 (2024)."},{"key":"e_1_3_3_1_43_2","unstructured":"Zechun Liu Changsheng Zhao Hanxian Huang Sijia Chen Jing Zhang Jiawei Zhao Scott Roy Lisa Jin Yunyang Xiong Yangyang Shi et\u00a0al. 2025. ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization. 
arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2502.02631 (2025)."},{"key":"e_1_3_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3613424.3614249"},{"key":"e_1_3_3_1_45_2","unstructured":"Shuming Ma Hongyu Wang Lingxiao Ma Lei Wang Wenhui Wang Shaohan Huang Li Dong Ruiping Wang Jilong Xue and Furu Wei. 2024. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2402.17764 (2024)."},{"key":"e_1_3_3_1_46_2","unstructured":"Saeed Maleki. 2023. Look-Up mAI GeMM: Increasing AI GeMMs Performance by Nearly 2.5 x via msGeMM. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2310.06178 (2023)."},{"key":"e_1_3_3_1_47_2","unstructured":"Stephen Merity Caiming Xiong James Bradbury and Richard Socher. 2016. Pointer Sentinel Mixture Models. arxiv:https:\/\/arXiv.org\/abs\/1609.07843\u00a0[cs.CL]"},{"key":"e_1_3_3_1_48_2","unstructured":"Paulius Micikevicius Dusan Stosic Neil Burgess Marius Cornea Pradeep Dubey Richard Grisenthwaite Sangwon Ha Alexander Heinecke Patrick Judd John Kamalu et\u00a0al. 2022. Fp8 formats for deep learning. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2209.05433 (2022)."},{"key":"e_1_3_3_1_49_2","doi-asserted-by":"crossref","unstructured":"Todor Mihaylov Peter Clark Tushar Khot and Ashish Sabharwal. 2018. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. arxiv:https:\/\/arXiv.org\/abs\/1809.02789\u00a0[cs.CL]","DOI":"10.18653\/v1\/D18-1260"},{"key":"e_1_3_3_1_50_2","unstructured":"Pranav Nair Puranjay Datta Jeff Dean Prateek Jain and Aditya Kusupati. 2025. Matryoshka Quantization. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2502.06786 (2025)."},{"key":"e_1_3_3_1_51_2","unstructured":"NVIDIA. 2025. CUTLASS: CUDA Templates for Linear Algebra Subroutines. https:\/\/github.com\/NVIDIA\/cutlass."},{"key":"e_1_3_3_1_52_2","unstructured":"NVIDIA. 2025. TensorRT-LLM: High-Performance Inference for Large Language Models. 
https:\/\/github.com\/NVIDIA\/TensorRT-LLM. Accessed: 2025-05-02."},{"key":"e_1_3_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2019.00042"},{"key":"e_1_3_3_1_54_2","unstructured":"Gunho Park Baeseong Park Minsub Kim Sungjae Lee Jeonghoon Kim Beomseok Kwon Se\u00a0Jung Kwon Byeongwook Kim Youngjoo Lee and Dongsoo Lee. 2023. LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2206.09557 (2023)."},{"key":"e_1_3_3_1_55_2","unstructured":"Pratyush Patel Esha Choukse Chaojie Zhang Aashaka Shah \u00cd\u00f1igo Goiri Saeed Maleki and Ricardo Bianchini. 2023. Splitwise: Efficient generative llm inference using phase splitting. Power 400 700W (2023) 1\u201375."},{"key":"e_1_3_3_1_56_2","doi-asserted-by":"crossref","unstructured":"David Patterson Joseph Gonzalez Urs H\u00f6lzle Quoc Le Chen Liang Lluis-Miquel Munguia Daniel Rothchild David\u00a0R So Maud Texier and Jeff Dean. 2022. The carbon footprint of machine learning training will plateau then shrink. Computer 55 7 (2022) 18\u201328.","DOI":"10.1109\/MC.2022.3148714"},{"key":"e_1_3_3_1_57_2","unstructured":"Bowen Peng Jeffrey Quesnelle Honglu Fan and Enrico Shippole. 2023. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2309.00071 (2023)."},{"key":"e_1_3_3_1_58_2","unstructured":"Bita\u00a0Darvish Rouhani Ritchie Zhao Ankit More Mathew Hall Alireza Khodamoradi Summer Deng Dhruv Choudhary Marius Cornea Eric Dellinger Kristof Denolf et\u00a0al. 2023. Microscaling data formats for deep learning. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2310.10537 (2023)."},{"key":"e_1_3_3_1_59_2","doi-asserted-by":"crossref","unstructured":"Sungju Ryu Hyungjun Kim Wooseok Yi Eunhwan Kim Yulhwa Kim Taesu Kim and Jae-Joon Kim. 2022. BitBlade: Energy-efficient variable bit-precision hardware accelerator for quantized neural networks. 
IEEE Journal of Solid-State Circuits 57 6 (2022) 1924\u20131935.","DOI":"10.1109\/JSSC.2022.3141050"},{"key":"e_1_3_3_1_60_2","unstructured":"Keisuke Sakaguchi Ronan\u00a0Le Bras Chandra Bhagavatula and Yejin Choi. 2019. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. arxiv:https:\/\/arXiv.org\/abs\/1907.10641\u00a0[cs.CL]"},{"key":"e_1_3_3_1_61_2","unstructured":"Jay Shah Ganesh Bikshandi Ying Zhang Vijay Thakkar Pradeep Ramani and Tri Dao. 2024. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2407.08608 (2024)."},{"key":"e_1_3_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA57654.2024.00062"},{"key":"e_1_3_3_1_63_2","first-page":"701","volume-title":"17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)","author":"Shi Yining","year":"2023","unstructured":"Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou. 2023. Welder: Scheduling Deep Learning Memory Access via Tile-graph. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). 701\u2013718."},{"key":"e_1_3_3_1_64_2","volume-title":"Design Compiler User Guide","author":"Inc. Synopsys","year":"2018","unstructured":"Synopsys Inc. 2018. Design Compiler User Guide."},{"key":"e_1_3_3_1_65_2","unstructured":"Gemini Team Rohan Anil Sebastian Borgeaud Yonghui Wu Jean-Baptiste Alayrac Jiahui Yu Radu Soricut Johan Schalkwyk Andrew\u00a0M Dai Anja Hauth et\u00a0al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2312.11805 (2023)."},{"key":"e_1_3_3_1_66_2","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et\u00a0al. 2023. Llama 2: Open foundation and fine-tuned chat models. 
arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2307.09288 (2023)."},{"key":"e_1_3_3_1_67_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan\u00a0N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00077"},{"key":"e_1_3_3_1_69_2","unstructured":"Hongyu Wang Shuming Ma Li Dong Shaohan Huang Huaijie Wang Lingxiao Ma Fan Yang Ruiping Wang Yi Wu and Furu Wei. 2023. Bitnet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2310.11453 (2023)."},{"key":"e_1_3_3_1_70_2","first-page":"307","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Wang Lei","year":"2024","unstructured":"Lei Wang, Lingxiao Ma, Shijie Cao, Quanlu Zhang, Jilong Xue, Yining Shi, Ningxin Zheng, Ziming Miao, Fan Yang, Ting Cao, et\u00a0al. 2024. Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 307\u2013323."},{"key":"e_1_3_3_1_71_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00088"},{"key":"e_1_3_3_1_72_2","unstructured":"Jianyu Wei Shijie Cao Ting Cao Lingxiao Ma Lei Wang Yanyong Zhang and Mao Yang. 2024. T-mac: Cpu renaissance via table lookup for low-bit llm deployment on edge. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2407.00088 (2024)."},{"key":"e_1_3_3_1_73_2","unstructured":"Haocheng Xi Changhao Li Jianfei Chen and Jun Zhu. 2023. Training transformers with 4-bit integers. 
Advances in Neural Information Processing Systems 36 (2023) 49146\u201349168."},{"key":"e_1_3_3_1_74_2","first-page":"38087","volume-title":"International Conference on Machine Learning","author":"Xiao Guangxuan","year":"2023","unstructured":"Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning. PMLR, 38087\u201338099."},{"key":"e_1_3_3_1_75_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00079"},{"key":"e_1_3_3_1_76_2","unstructured":"Zhewei Yao Reza Yazdani\u00a0Aminabadi Minjia Zhang Xiaoxia Wu Conglong Li and Yuxiong He. 2022. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. Advances in Neural Information Processing Systems 35 (2022) 27168\u201327183."},{"key":"e_1_3_3_1_77_2","unstructured":"Alex Young Bei Chen Chao Li Chengen Huang Ge Zhang Guanwei Zhang Heng Li Jiangcheng Zhu Jianqun Chen Jing Chang et\u00a0al. 2024. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2403.04652 (2024)."},{"key":"e_1_3_3_1_78_2","doi-asserted-by":"publisher","DOI":"10.1109\/micro50266.2020.00071"},{"key":"e_1_3_3_1_79_2","doi-asserted-by":"publisher","DOI":"10.1145\/3470496.3527438"},{"key":"e_1_3_3_1_80_2","doi-asserted-by":"crossref","unstructured":"Rowan Zellers Ari Holtzman Yonatan Bisk Ali Farhadi and Yejin Choi. 2019. HellaSwag: Can a Machine Really Finish Your Sentence? arxiv:https:\/\/arXiv.org\/abs\/1905.07830\u00a0[cs.CL]","DOI":"10.18653\/v1\/P19-1472"},{"key":"e_1_3_3_1_81_2","unstructured":"Susan Zhang Stephen Roller Naman Goyal Mikel Artetxe Moya Chen Shuohui Chen Christopher Dewan Mona Diab Xian Li Xi\u00a0Victoria Lin et\u00a0al. 2022. Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2205.01068 (2022)."},{"key":"e_1_3_3_1_82_2","unstructured":"Yijia Zhang Lingran Zhao Shijie Cao Wenqiang Wang Ting Cao Fan Yang Mao Yang Shanghang Zhang and Ningyi Xu. 2023. Integer or floating point? new outlooks for low-bit quantization on large language models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2305.12356 (2023)."},{"key":"e_1_3_3_1_83_2","first-page":"863","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Zheng Lianmin","year":"2020","unstructured":"Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody\u00a0Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph\u00a0E. Gonzalez, and Ion Stoica. 2020. Ansor: Generating High-Performance Tensor Programs for Deep Learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 863\u2013879. https:\/\/www.usenix.org\/conference\/osdi20\/presentation\/zheng"},{"key":"e_1_3_3_1_84_2","doi-asserted-by":"publisher","DOI":"10.1145\/3613424.3623792"},{"key":"e_1_3_3_1_85_2","first-page":"233","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Zhu Hongyu","year":"2022","unstructured":"Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, et\u00a0al. 2022. ROLLER: Fast and efficient tensor compilation for deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 
233\u2013248."},{"key":"e_1_3_3_1_86_2","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358269"}],"event":{"name":"ISCA '25: Proceedings of the 52nd Annual International Symposium on Computer Architecture","location":"Tokyo Japan","acronym":"SIGARCH '25","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture"]},"container-title":["Proceedings of the 52nd Annual International Symposium on Computer Architecture"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3695053.3731057","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,21]],"date-time":"2025-06-21T11:05:19Z","timestamp":1750503919000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3695053.3731057"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,20]]},"references-count":85,"alternative-id":["10.1145\/3695053.3731057","10.1145\/3695053"],"URL":"https:\/\/doi.org\/10.1145\/3695053.3731057","relation":{},"subject":[],"published":{"date-parts":[[2025,6,20]]},"assertion":[{"value":"2025-06-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}