{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T12:18:49Z","timestamp":1773317929556,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":36,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,11,16]]},"DOI":"10.1145\/3712285.3759852","type":"proceedings-article","created":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T16:05:39Z","timestamp":1762963539000},"page":"1619-1630","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-4392-4712","authenticated-orcid":false,"given":"Huanqi","family":"Hu","sequence":"first","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-0357-913X","authenticated-orcid":false,"given":"Bowen","family":"Xiao","sequence":"additional","affiliation":[{"name":"ByteDance Seed, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4060-9438","authenticated-orcid":false,"given":"Shixuan","family":"Sun","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-5728-6182","authenticated-orcid":false,"given":"Jianian","family":"Yin","sequence":"additional","affiliation":[{"name":"ByteDance Seed, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1272-632X","authenticated-orcid":false,"given":"Zhexi","family":"Zhang","sequence":"additional","affiliation":[{"name":"ByteDance Seed, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-0504-9559","authenticated-orcid":false,"given":"Xiang","family":"Luo","sequence":"additional","affiliation":[{"name":"ByteDance Seed, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-9356-6034","authenticated-orcid":false,"given":"Chengquan","family":"Jiang","sequence":"additional","affiliation":[{"name":"ByteDance Seed, Seattle, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-6915-4032","authenticated-orcid":false,"given":"Weiqi","family":"Xu","sequence":"additional","affiliation":[{"name":"ByteDance Seed, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-2201-4765","authenticated-orcid":false,"given":"Xiaoying","family":"Jia","sequence":"additional","affiliation":[{"name":"ByteDance Seed, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-8346-3323","authenticated-orcid":false,"given":"Xin","family":"Liu","sequence":"additional","affiliation":[{"name":"ByteDance Seed, Seattle, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0034-2302","authenticated-orcid":false,"given":"Minyi","family":"Guo","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]}],"member":"320","published-online":{"date-parts":[[2025,11,15]]},"reference":[{"key":"e_1_3_3_3_2_2","unstructured":"Saleh Ashkboos Amirkeivan Mohtashami Maximilian\u00a0L Croci Bo Li Pashmina Cameron Martin Jaggi Dan Alistarh Torsten Hoefler and James Hensman. 2024. Quarot: Outlier-free 4-bit inference in rotated llms. 
{"key":"e_1_3_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6239"},{"key":"e_1_3_3_3_4_2","unstructured":"Yelysei Bondarenko, Riccardo Del\u00a0Chiaro, and Markus Nagel. 2024. Low-Rank Quantization-Aware Training for LLMs. arXiv preprint arXiv:2406.06385 (2024)."},{"key":"e_1_3_3_3_5_2","unstructured":"Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. 2024. EfficientQAT: Efficient quantization-aware training for large language models. arXiv preprint arXiv:2407.11062 (2024)."},{"key":"e_1_3_3_3_6_2","unstructured":"Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457 (2018)."},{"key":"e_1_3_3_3_7_2","unstructured":"Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691 (2023)."},{"key":"e_1_3_3_3_8_2","doi-asserted-by":"crossref","unstructured":"Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems 35 (2022) 30318\u201330332.","DOI":"10.52202\/068431-2198"},{"key":"e_1_3_3_3_9_2","unstructured":"Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 (2022)."},{"key":"e_1_3_3_3_10_2","unstructured":"Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et\u00a0al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)."},{"key":"e_1_3_3_3_11_2","unstructured":"Albert\u00a0Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra\u00a0Singh Chaplot, Diego de\u00a0las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L\u00e9lio\u00a0Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven\u00a0Le Scao, Thibaut Lavril, Thomas Wang, Timoth\u00e9e Lacroix, and William\u00a0El Sayed. 2023. Mistral 7B. arXiv:2310.06825\u00a0[cs.CL] https:\/\/arxiv.org\/abs\/2310.06825"},{"key":"e_1_3_3_3_12_2","unstructured":"Albert\u00a0Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra\u00a0Singh Chaplot, Diego de\u00a0las Casas, Emma\u00a0Bou Hanna, Florian Bressand, et\u00a0al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)."},{"key":"e_1_3_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_3_3_14_2","doi-asserted-by":"crossref","unstructured":"Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. 2025. DuQuant: Distributing outliers via dual transformation makes stronger quantized LLMs. Advances in Neural Information Processing Systems 37 (2025) 87766\u201387800.","DOI":"10.52202\/079017-2786"},{"key":"e_1_3_3_3_15_2","unstructured":"Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. Proceedings of Machine Learning and Systems 6 (2024) 87\u2013100."},
{"key":"e_1_3_3_3_16_2","unstructured":"Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. 2024. QServe: W4A8KV4 quantization and system co-design for efficient LLM serving. arXiv preprint arXiv:2405.04532 (2024)."},{"key":"e_1_3_3_3_17_2","doi-asserted-by":"crossref","unstructured":"Lian Liu, Haimeng Ren, Long Cheng, Zhaohui Xu, Yudong Pan, Mengdi Wang, Xiaowei Li, Yinhe Han, and Ying Wang. 2024. COMET: Towards Practical W4A4KV4 LLMs Serving. arXiv preprint arXiv:2410.12168 (2024).","DOI":"10.1145\/3676641.3716252"},{"key":"e_1_3_3_3_18_2","unstructured":"Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. 2023. LLM-QAT: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888 (2023)."},{"key":"e_1_3_3_3_19_2","unstructured":"Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. 2024. SpinQuant: LLM quantization with learned rotations. arXiv preprint arXiv:2405.16406 (2024)."},{"key":"e_1_3_3_3_20_2","unstructured":"Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843 (2016)."},{"key":"e_1_3_3_3_21_2","volume-title":"TensorRT-LLM: A TensorRT Toolbox for Optimized Large Language Model Inference","year":"2023","unstructured":"NVIDIA. 2023. TensorRT-LLM: A TensorRT Toolbox for Optimized Large Language Model Inference. https:\/\/github.com\/NVIDIA\/TensorRT-LLM"},{"key":"e_1_3_3_3_22_2","doi-asserted-by":"crossref","unstructured":"Keisuke Sakaguchi, Ronan\u00a0Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. WinoGrande: An adversarial Winograd schema challenge at scale. Commun. ACM 64, 9 (2021), 99\u2013106.","DOI":"10.1145\/3474381"},{"key":"e_1_3_3_3_23_2","doi-asserted-by":"crossref","unstructured":"Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems 37 (2024) 68658\u201368685.","DOI":"10.52202\/079017-2193"},{"key":"e_1_3_3_3_24_2","unstructured":"Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. 2023. OmniQuant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137 (2023)."},{"key":"e_1_3_3_3_25_2","unstructured":"Xuan Shen, Zhenglun Kong, Changdi Yang, Zhaoyang Han, Lei Lu, Peiyan Dong, Cheng Lyu, Chih-hsiang Li, Xuehang Guo, Zhihao Shu, et\u00a0al. 2024. EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge. arXiv preprint arXiv:2402.10787 (2024)."},{"key":"e_1_3_3_3_26_2","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et\u00a0al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)."},
{"key":"e_1_3_3_3_27_2","unstructured":"Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et\u00a0al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)."},{"key":"e_1_3_3_3_28_2","unstructured":"Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. 2023. BitNet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453 (2023)."},{"key":"e_1_3_3_3_29_2","unstructured":"Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and Xianglong Liu. 2023. Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling. arXiv preprint arXiv:2304.09145 (2023)."},{"key":"e_1_3_3_3_30_2","first-page":"38087","volume-title":"International Conference on Machine Learning","author":"Xiao Guangxuan","year":"2023","unstructured":"Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning. PMLR, 38087\u201338099."},{"key":"e_1_3_3_3_31_2","unstructured":"Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, and Wanxiang Che. 2024. OneBit: Towards extremely low-bit large language models. arXiv preprint arXiv:2402.11295 (2024)."},{"key":"e_1_3_3_3_32_2","unstructured":"Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Guoyin Wang, Heng Li, Jiangcheng Zhu, Jianqun Chen, et\u00a0al. 2024. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652 (2024)."},{"key":"e_1_3_3_3_33_2","first-page":"521","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo\u00a0Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521\u2013538."},{"key":"e_1_3_3_3_34_2","doi-asserted-by":"crossref","unstructured":"Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830 (2019).","DOI":"10.18653\/v1\/P19-1472"},{"key":"e_1_3_3_3_35_2","unstructured":"Ying Zhang, Peng Zhang, Mincong Huang, Jingyang Xiang, Yujie Wang, Chao Wang, Yineng Zhang, Lei Yu, Chuan Liu, and Wei Lin. 2024. QQQ: Quality Quattuor-Bit Quantization for Large Language Models. arXiv preprint arXiv:2406.09904 (2024)."},{"key":"e_1_3_3_3_36_2","unstructured":"Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. 2024. Atom: Low-bit quantization for efficient and accurate LLM serving. Proceedings of Machine Learning and Systems 6 (2024) 196\u2013209."},{"key":"e_1_3_3_3_37_2","unstructured":"Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. arXiv preprint arXiv:2401.09670 (2024)."}
],"event":{"name":"SC '25: The International Conference for High Performance Computing, Networking, Storage and Analysis","location":"St. Louis, MO, USA","acronym":"SC '25","sponsor":["SIGHPC ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing"]},"container-title":["Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3712285.3759852","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T18:31:47Z","timestamp":1773253907000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3712285.3759852"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,15]]},"references-count":36,"alternative-id":["10.1145\/3712285.3759852","10.1145\/3712285"],"URL":"https:\/\/doi.org\/10.1145\/3712285.3759852","relation":{},"subject":[],"published":{"date-parts":[[2025,11,15]]},"assertion":[{"value":"2025-11-15","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}