{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,29]],"date-time":"2026-01-29T03:57:23Z","timestamp":1769659043040,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":43,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,1,28]]},"DOI":"10.1145\/3774934.3786423","type":"proceedings-article","created":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T15:25:57Z","timestamp":1769613957000},"page":"288-300","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["High-Throughput Non-uniformly Quantized 3-bit LLM Inference"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3392-8388","authenticated-orcid":false,"given":"YuAng","family":"Chen","sequence":"first","affiliation":[{"name":"Chinese University of Hong Kong, Hong Kong, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9770-6522","authenticated-orcid":false,"given":"Wenqi","family":"Zeng","sequence":"additional","affiliation":[{"name":"Hong Kong University of Science and Technology, Hong Kong, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9738-827X","authenticated-orcid":false,"given":"Jeffrey Xu","family":"Yu","sequence":"additional","affiliation":[{"name":"Hong Kong University of Science and Technology (Guangzhou), Hong Kong, China"}]}],"member":"320","published-online":{"date-parts":[[2026,1,28]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"AutoGPTQ Contributors. 2023. AutoGPTQ: An easy-to-use LLM quantization package with user-friendly APIs based on GPTQ algorithm. https:\/\/github.com\/AutoGPTQ\/AutoGPTQ Accessed: 2025-01-15"},{"key":"e_1_3_2_1_2_1","unstructured":"Arnav Chavan Raghav Magazine Shubham Kushwaha M\u00e9rouane Debbah and Deepak Gupta. 2024. Faster and lighter llms: A survey on current challenges and way forward. arXiv preprint arXiv:2402.01799."},{"key":"e_1_3_2_1_3_1","first-page":"4396","article-title":"Quip: 2-bit quantization of large language models with guarantees","volume":"36","author":"Chee Jerry","year":"2023","unstructured":"Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher M De Sa. 2023. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems, 36 (2023), 4396\u20134429.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3572848.3577500"},{"key":"e_1_3_2_1_5_1","volume-title":"Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https:\/\/vicuna.lmsys.org Accessed","author":"Chiang Wei-Lin","year":"2023","unstructured":"Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https:\/\/vicuna.lmsys.org Accessed: 14 April 2023"},{"key":"e_1_3_2_1_6_1","unstructured":"Tim Dettmers. 2023. BitsandBytes. https:\/\/github.com\/bitsandbytes-foundation\/bitsandbytes Accessed: 2025-05-26"},{"key":"e_1_3_2_1_7_1","volume-title":"Advances in Neural Information Processing Systems 35 (NeurIPS","author":"Dettmers Tim","year":"2022","unstructured":"Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. 
In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.). https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2022\/hash\/8c4a7160935517e91cfe296b0bb1be8a-Abstract-Conference.html"},{"key":"e_1_3_2_1_8_1","volume-title":"12th International Conference on Learning Representations.","author":"Dettmers Tim","year":"2024","unstructured":"Tim Dettmers, Ruslan A Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan-Adrian Alistarh. 2024. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. In 12th International Conference on Learning Representations."},{"key":"e_1_3_2_1_9_1","volume-title":"Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research","volume":"10344","author":"Frantar Elias","year":"2023","unstructured":"Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202). PMLR, 10325\u201310344."},{"key":"e_1_3_2_1_10_1","volume-title":"GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. In International Conference on Learning Representations (ICLR). https:\/\/openreview.net\/forum?id=tcbBPnfwxS","author":"Frantar Elias","year":"2023","unstructured":"Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. In International Conference on Learning Representations (ICLR). https:\/\/openreview.net\/forum?id=tcbBPnfwxS"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3710848.3710871"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.5555\/3433701.3433723"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.12608602"},{"key":"e_1_3_2_1_14_1","volume-title":"AI and memory wall","author":"Gholami Amir","unstructured":"Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W Mahoney, and Kurt Keutzer. 2024. AI and memory wall. IEEE Micro."},{"key":"e_1_3_2_1_15_1","unstructured":"A. Griffin. 2024. ChatGPT creators OpenAI are generating 100 billion words per day CEO says. https:\/\/www.independent.co.uk\/tech\/chatgpt-openai-words-sam-altman-b2494900.html Accessed: 2025-08-30"},{"key":"e_1_3_2_1_16_1","volume-title":"Forty-first International Conference on Machine Learning.","author":"Guo Jinyang","year":"2024","unstructured":"Jinyang Guo, Jianyu Wu, Zining Wang, Jiaheng Liu, Ge Yang, Yifu Ding, Ruihao Gong, Haotong Qin, and Xianglong Liu. 2024. Compressing large language models by joint sparsification and quantization. In Forty-first International Conference on Machine Learning."},{"key":"e_1_3_2_1_17_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Han Song","year":"2016","unstructured":"Song Han, Huizi Mao, and William J Dally. 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00286"},{"key":"e_1_3_2_1_19_1","volume-title":"Squeezellm: Dense-and-sparse quantization. 
arXiv preprint arXiv:2306.07629.","author":"Kim Sehoon","year":"2023","unstructured":"Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W Mahoney, and Kurt Keutzer. 2023. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629."},{"key":"e_1_3_2_1_20_1","unstructured":"Ronny Krashinsky Olivier Giroux Stephen Jones Nick Stam and Sridhar Ramaswamy. 2020. NVIDIA Ampere Architecture In-Depth. https:\/\/developer.nvidia.com\/blog\/nvidia-ampere-architecture-in-depth Accessed: 2024-01-15"},{"key":"e_1_3_2_1_21_1","volume-title":"Fine-Tuning, and Inference Techniques, 7","author":"Kurtic Eldar","year":"2025","unstructured":"Eldar Kurtic, Denis Kuznedelev, Elias Frantar, Michael Goinv, Shubhra Pandit, Abhinav Agarwalla, Tuan Nguyen, Alexandre Marques, Mark Kurtz, and Dan Alistarh. 2025. Sparse fine-tuning for inference acceleration of large language models. Enhancing LLM Performance: Efficacy, Fine-Tuning, and Inference Techniques, 7 (2025), 83."},{"key":"e_1_3_2_1_22_1","unstructured":"Yann LeCun John S Denker and Sara A Solla. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems 2."},{"key":"e_1_3_2_1_23_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Li Hao","year":"2017","unstructured":"Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2017. Pruning filters for efficient convnets. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_2_1_24_1","volume-title":"AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv preprint arXiv:2306.00978.","author":"Lin Ji","year":"2023","unstructured":"Ji Lin, Ruicheng Tang, Haotian Tang, Shang Yang, Jiaming Zhang, and Guangxuan Cui. 2023. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv preprint arXiv:2306.00978."},{"key":"e_1_3_2_1_25_1","volume-title":"Olivier Giroux and Nick Stam","author":"Luke Durant Mark Harris","year":"2017","unstructured":"Mark Harris Luke Durant, Olivier Giroux and Nick Stam. 2017. Inside Volta: The World\u2019s Most Advanced Data Center GPU. https:\/\/www.nvidia.com\/en-us\/data-center\/volta-gpu-architecture\/ Accessed: 2024-05-15"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01152"},{"key":"e_1_3_2_1_27_1","volume-title":"Yelysei Wu, STOYAN GKERESTEDJIAN, and Tijmen Blankevoort.","author":"Nagel Markus","year":"2021","unstructured":"Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Wu, STOYAN GKERESTEDJIAN, and Tijmen Blankevoort. 2021. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295."},{"key":"e_1_3_2_1_28_1","volume-title":"International conference on machine learning. 7197\u20137206","author":"Nagel Markus","year":"2020","unstructured":"Markus Nagel, Mart Van Baalen, Tijmen Blankevoort, and Max Welling. 2020. Up or down? adaptive rounding for post-training quantization. In International conference on machine learning. 7197\u20137206."},{"key":"e_1_3_2_1_29_1","unstructured":"NVIDIA. 2023. L40S GPU for AI and Graphics Performance. https:\/\/www.nvidia.com\/en-us\/data-center\/l40s\/ \/ Accessed: 2025-05-15"},{"key":"e_1_3_2_1_30_1","unstructured":"NVIDIA. 2023. Nsight Systems. https:\/\/developer.nvidia.com\/nsight-systems Accessed: 2025-05-15"},{"key":"e_1_3_2_1_31_1","unstructured":"NVIDIA Corporation. 2025. Efficient GEMM in CUDA. 
https:\/\/docs.nvidia.com\/cutlass\/media\/docs\/cpp\/efficient_gemm.html Accessed: 2025-08-29"},{"key":"e_1_3_2_1_32_1","unstructured":"NVIDIA Developer Blog. 2023. Mastering LLM Techniques: Inference Optimization. https:\/\/developer.nvidia.com\/blog\/mastering-llm-techniques-inference-optimization\/ Accessed: 2025-05-13"},{"key":"e_1_3_2_1_33_1","volume-title":"Different-Sized LLMs. In International Conference on Machine Learning. 39682\u201339701","author":"Park Yeonhong","year":"2024","unstructured":"Yeonhong Park, Jake Hyun, Sanglyul Cho, Bonggeun Sim, and Jae W Lee. 2024. Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs. In International Conference on Machine Learning. 39682\u201339701."},{"key":"e_1_3_2_1_34_1","unstructured":"Alec Radford Jeffrey Wu Rewon Child David Luan Dario Amodei Ilya Sutskever et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1 8 (2019) 9."},{"key":"e_1_3_2_1_35_1","volume-title":"Wanda: A Simple and Scalable Pruning Method for Large Language Models. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research","volume":"32892","author":"Sun Mingjie","unstructured":"Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. 2023. Wanda: A Simple and Scalable Pruning Method for Large Language Models. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202). PMLR, 32873\u201332892."},{"key":"e_1_3_2_1_36_1","volume-title":"Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971."},{"key":"e_1_3_2_1_37_1","volume-title":"\u0141 ukasz Kaiser, and Illia Polosukhin","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141 ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30 (2017)."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"e_1_3_2_1_39_1","unstructured":"Haojun Xia Zhen Zheng Xiaoxia Wu Shiyang Chen Zhewei Yao Stephen Youn Arash Bakhtiari Michael Wyatt Donglin Zhuang Zhongzhu Zhou et al. 2024. Fp6-llm: Efficiently serving large language models through fp6-centric algorithm-system co-design. arXiv preprint arXiv:2401.14112."},{"key":"e_1_3_2_1_40_1","volume-title":"Proceedings of the 40th International Conference on Machine Learning.","author":"Xiao Guangxuan","year":"2023","unstructured":"Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. In Proceedings of the 40th International Conference on Machine Learning."},{"key":"e_1_3_2_1_41_1","first-page":"27168","article-title":"ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers","volume":"35","author":"Yao Zhewei","year":"2022","unstructured":"Zhewei Yao, Zhen Dong, Zhan Zheng, Amir Gholami, Jiachen Yu, Eric Tan, Kurt Keutzer, and Michael W Mahoney. 2022. 
ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. In Advances in Neural Information Processing Systems. 35, 27168\u201327183.","journal-title":"Advances in Neural Information Processing Systems."},{"key":"e_1_3_2_1_42_1","unstructured":"Zhewei Yao Xiaoxia Wu Cheng Li Stephen Youn and Yuxiong He. 2023. Zeroquant-v2: Exploring post-training quantization in llms from comprehensive study to low rank compensation. arXiv preprint arXiv:2303.08302."},{"key":"e_1_3_2_1_43_1","volume-title":"Xi Victoria Lin, et al","author":"Zhang Susan","year":"2022","unstructured":"Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068."}],"event":{"name":"PPoPP '26: 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","location":"Sydney NSW Australia","acronym":"PPoPP '26","sponsor":["SIGHPC ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing","SIGPLAN ACM Special Interest Group on Programming Languages"]},"container-title":["Proceedings of the 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3774934.3786423","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T15:29:49Z","timestamp":1769614189000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3774934.3786423"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,28]]},"references-count":43,"alternative-id":["10.1145\/3774934.3786423","10.1145\/3774934"],"URL":"https:\/\/doi.org\/10.1145\/3774934.3786423","relation":{},"subject":[],"published":{"date-parts":[[2026,1,28]]},"assertion":[{"value":"2026-01-28","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
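The record above is a Crossref "work" message; as a minimal sketch of how its structure can be read, the following Python snippet loads the JSON and prints the bibliographic fields visible in it (title, authors, DOI, venue, pages, and reference count). The file name work.json and the script itself are illustrative assumptions, not part of the record; the same message shape is what Crossref's public REST works endpoint (https://api.crossref.org/works/<DOI>) returns, so the record could equally be fetched over HTTP instead of read from disk.

# Minimal sketch (assumption: the record above is saved locally as work.json).
import json

with open("work.json", encoding="utf-8") as fh:
    record = json.load(fh)

msg = record["message"]                        # payload of a message-type "work"
title = msg["title"][0]                        # titles are stored as a list
authors = ", ".join(f'{a["given"]} {a["family"]}' for a in msg["author"])
venue = msg["container-title"][0]
pages = msg.get("page", "n/a")
n_refs = msg["references-count"]               # 43 for this record

print(title)
print("  by", authors)
print(f'  DOI: {msg["DOI"]}, {venue}, pp. {pages}, {n_refs} references')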