{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T01:49:59Z","timestamp":1773193799021,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":104,"publisher":"ACM","funder":[{"name":"CoCoSys, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,6,21]]},"DOI":"10.1145\/3695053.3730989","type":"proceedings-article","created":{"date-parts":[[2025,6,20]],"date-time":"2025-06-20T16:43:11Z","timestamp":1750437791000},"page":"1193-1209","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-4763-3321","authenticated-orcid":false,"given":"Akshat","family":"Ramachandran","sequence":"first","affiliation":[{"name":"Georgia Institute of Technology, Atlanta, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3533-9405","authenticated-orcid":false,"given":"Souvik","family":"Kundu","sequence":"additional","affiliation":[{"name":"Intel Labs, San Diego, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5738-6942","authenticated-orcid":false,"given":"Tushar","family":"Krishna","sequence":"additional","affiliation":[{"name":"Georgia Institute of Technology, Atlanta, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,6,20]]},"reference":[{"key":"e_1_3_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC55821.2022.9926299"},{"key":"e_1_3_3_2_3_2","unstructured":"Marah Abdin Sam\u00a0Ade Jacobs Ammar\u00a0Ahmad Awan Jyoti Aneja Ahmed Awadallah Hany Awadalla Nguyen Bach Amit Bahree Arash Bakhtiari Harkirat Behl et\u00a0al. 2024. 
Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219 (2024)."},{"key":"e_1_3_3_2_4_2","unstructured":"Saleh Ashkboos Maximilian\u00a0L Croci Marcelo Gennari\u00a0do Nascimento Torsten Hoefler and James Hensman. 2024. Slicegpt: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024 (2024)."},{"key":"e_1_3_3_2_5_2","unstructured":"Saleh Ashkboos Amirkeivan Mohtashami Maximilian\u00a0L Croci Bo Li Martin Jaggi Dan Alistarh Torsten Hoefler and James Hensman. 2024. Quarot: Outlier-free 4-bit inference in rotated llms. arXiv preprint arXiv:2404.00456 (2024)."},{"key":"e_1_3_3_2_6_2","unstructured":"Anas Awadalla Irena Gao Josh Gardner Jack Hessel Yusuf Hanafy Wanrong Zhu Kalyani Marathe Yonatan Bitton Samir Gadre Shiori Sagawa et\u00a0al. 2023. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390 (2023)."},{"key":"e_1_3_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2009.4919648"},{"key":"e_1_3_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/2925426.2926259"},{"key":"e_1_3_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6239"},{"key":"e_1_3_3_2_10_2","unstructured":"Runjin Chen Zhenyu Zhang Junyuan Hong Souvik Kundu and Zhangyang Wang. 2024. SEAL: Steerable Reasoning Calibration of Large Language Models for Free. arXiv preprint arXiv:2504.07986v1."},{"key":"e_1_3_3_2_11_2","unstructured":"Xinlei Chen Hao Fang Tsung-Yi Lin Ramakrishna Vedantam Saurabh Gupta Piotr Doll\u00e1r and C\u00a0Lawrence Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server. 
arXiv preprint arXiv:1504.00325 (2015)."},{"key":"e_1_3_3_2_12_2","doi-asserted-by":"crossref","unstructured":"Yu-Hsin Chen Tien-Ju Yang Joel Emer and Vivienne Sze. 2019. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9 2 (2019) 292\u2013308.","DOI":"10.1109\/JETCAS.2019.2910232"},{"key":"e_1_3_3_2_13_2","unstructured":"Christopher Clark Kenton Lee Ming-Wei Chang Tom Kwiatkowski Michael Collins and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes\/no questions. arXiv preprint arXiv:1905.10044 (2019)."},{"key":"e_1_3_3_2_14_2","unstructured":"Peter Clark Isaac Cowhey Oren Etzioni Tushar Khot Ashish Sabharwal Carissa Schoenick and Oyvind Tafjord. 2018. Think you have solved question answering? try arc the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457 (2018)."},{"key":"e_1_3_3_2_15_2","unstructured":"NVIDIA Corporation. 2017. Inside Volta: The World\u2019s Most Advanced Data Center GPU. https:\/\/devblogs.nvidia.com\/inside-volta\/ Accessed: February 2025."},{"key":"e_1_3_3_2_16_2","unstructured":"Steve Dai Rangha Venkatesan Mark Ren Brian Zimmer William Dally and Brucek Khailany. 2021. Vs-quant: Per-vector scaled quantization for accurate low-precision neural network inference. Proceedings of Machine Learning and Systems 3 (2021) 873\u2013884."},{"key":"e_1_3_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589351"},{"key":"e_1_3_3_2_18_2","first-page":"30318","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"Dettmers Tim","year":"2022","unstructured":"Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. LLM. int8 () 8-bit matrix multiplication for transformers at scale. 
In Proceedings of the 36th International Conference on Neural Information Processing Systems. 30318\u201330332."},{"key":"e_1_3_3_2_19_2","unstructured":"Tim Dettmers Ruslan Svirschevski Vage Egiazarian Denis Kuznedelev Elias Frantar Saleh Ashkboos Alexander Borzunov Torsten Hoefler and Dan Alistarh. 2023. Spqr: A sparse-quantized representation for near-lossless llm weight compression. arXiv preprint arXiv:2306.03078 (2023)."},{"key":"e_1_3_3_2_20_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00038"},{"key":"e_1_3_3_2_22_2","unstructured":"Mario Drumond Tao Lin Martin Jaggi and Babak Falsafi. 2018. Training dnns with hybrid block floating point. Advances in Neural Information Processing Systems 31 (2018)."},{"key":"e_1_3_3_2_23_2","unstructured":"Chao Fang Man Shi Robin Geens Arne Symons Zhongfeng Wang and Marian Verhelst. 2024. Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format. arXiv preprint arXiv:2411.15982 (2024)."},{"key":"e_1_3_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589348"},{"key":"e_1_3_3_2_25_2","first-page":"10323","volume-title":"International Conference on Machine Learning","author":"Frantar Elias","year":"2023","unstructured":"Elias Frantar and Dan Alistarh. 2023. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning. PMLR, 10323\u201310337."},{"key":"e_1_3_3_2_26_2","unstructured":"Elias Frantar Saleh Ashkboos Torsten Hoefler and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. 
arXiv preprint arXiv:2210.17323 (2022)."},{"key":"e_1_3_3_2_27_2","unstructured":"Leo Gao Stella Biderman Sid Black Laurence Golding Travis Hoppe Charles Foster Jason Phang Horace He Anish Thite Noa Nabeshima et\u00a0al. 2020. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027 (2020)."},{"key":"e_1_3_3_2_28_2","unstructured":"Ian Goodfellow Jean Pouget-Abadie Mehdi Mirza Bing Xu David Warde-Farley Sherjil Ozair Aaron Courville and Yoshua Bengio. 2014. Generative adversarial nets. Advances in neural information processing systems 27 (2014)."},{"key":"e_1_3_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.670"},{"key":"e_1_3_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589038"},{"key":"e_1_3_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO56248.2022.00095"},{"key":"e_1_3_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00380"},{"key":"e_1_3_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICNN.1993.298572"},{"key":"e_1_3_3_2_34_2","unstructured":"Dan Hendrycks Collin Burns Steven Basart Andy Zou Mantas Mazeika Dawn Song and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020)."},{"key":"e_1_3_3_2_35_2","unstructured":"Wei Huang Xudong Ma Haotong Qin Xingyu Zheng Chengtao Lv Hong Chen Jie Luo Xiaojuan Qi Xianglong Liu and Michele Magno. 2024. How good are low-bit quantized llama3 models? an empirical study. arXiv preprint arXiv:2404.14047 (2024)."},{"key":"e_1_3_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00686"},{"key":"e_1_3_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA56546.2023.10071058"},{"key":"e_1_3_3_2_38_2","unstructured":"Geonhwa Jeong Po-An Tsai Stephen\u00a0W Keckler and Tushar Krishna. 2024. 
SDQ: Sparse Decomposed Quantization for LLM Inference. arXiv preprint arXiv:2406.13868 (2024)."},{"key":"e_1_3_3_2_39_2","unstructured":"Albert\u00a0Q Jiang Alexandre Sablayrolles Antoine Roux Arthur Mensch Blanche Savary Chris Bamford Devendra\u00a0Singh Chaplot Diego de\u00a0las Casas Emma\u00a0Bou Hanna Florian Bressand et\u00a0al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)."},{"key":"e_1_3_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_3_3_2_41_2","unstructured":"Hao Kang Qingru Zhang Souvik Kundu Geonhwa Jeong Zaoxing Liu Tushar Krishna and Tuo Zhao. 2024. Gear: An efficient kv cache compression recipe for near-lossless generative inference of llm. NeurIPS ESNLP Workshop (2024)."},{"key":"e_1_3_3_2_42_2","doi-asserted-by":"crossref","unstructured":"Ben Keller Rangharajan Venkatesan Steve Dai Stephen\u00a0G Tell Brian Zimmer Charbel Sakr William\u00a0J Dally C\u00a0Thomas Gray and Brucek Khailany. 2023. A 95.6-TOPS\/W deep learning inference accelerator with per-vector scaled 4-bit quantization in 5 nm. IEEE Journal of Solid-State Circuits 58 4 (2023) 1129\u20131141.","DOI":"10.1109\/JSSC.2023.3234893"},{"key":"e_1_3_3_2_43_2","unstructured":"Mahmoud Khairy Jain Akshay Tor Aamodt and Timothy\u00a0G Rogers. 2018. Exploring modern GPU memory system design challenges through accurate modeling. arXiv preprint arXiv:1810.07269 (2018)."},{"key":"e_1_3_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00047"},{"key":"e_1_3_3_2_45_2","doi-asserted-by":"crossref","unstructured":"Tushar Krishna Chia-Hsin\u00a0Owen Chen Woo-Cheol Kwon and Li-Shiuan Peh. 2014. Smart: Single-cycle multihop traversals over a shared network on chip. 
IEEE micro 34 3 (2014) 43\u201356.","DOI":"10.1109\/MM.2014.48"},{"key":"e_1_3_3_2_46_2","doi-asserted-by":"crossref","unstructured":"Souvik Kundu Anahita Bhiwandiwalla Sungduk Yu Phillip Howard Tiep Le Sharath\u00a0Nittur Sridhar David Cobbley Hao Kang and Vasudev Lal. 2025. LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression. NAACL (2025).","DOI":"10.18653\/v1\/2025.findings-naacl.84"},{"key":"e_1_3_3_2_47_2","unstructured":"Andrey Kuzmin Markus Nagel Mart Van\u00a0Baalen Arash Behboodi and Tijmen Blankevoort. 2024. Pruning vs quantization: which is better? Advances in neural information processing systems 36 (2024)."},{"key":"e_1_3_3_2_48_2","unstructured":"Yann LeCun John Denker and Sara Solla. 1989. Optimal brain damage. Advances in neural information processing systems 2 (1989)."},{"key":"e_1_3_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i12.29237"},{"key":"e_1_3_3_2_50_2","doi-asserted-by":"crossref","unstructured":"Jingwen Leng Tayler Hetherington Ahmed ElTantawy Syed Gilani Nam\u00a0Sung Kim Tor\u00a0M Aamodt and Vijay\u00a0Janapa Reddi. 2013. GPUWattch: Enabling energy optimizations in GPGPUs. ACM SIGARCH computer architecture news 41 3 (2013) 487\u2013498.","DOI":"10.1145\/2508148.2485964"},{"key":"e_1_3_3_2_51_2","unstructured":"Muyang Li Ji Lin Chenlin Meng Stefano Ermon Song Han and Jun-Yan Zhu. 2022. Efficient spatially sparse inference for conditional gans and diffusion models. Advances in neural information processing systems 35 (2022) 28858\u201328873."},{"key":"e_1_3_3_2_52_2","unstructured":"Yuhang Li Ruihao Gong Xu Tan Yang Yang Peng Hu Qi Zhang Fengwei Yu Wei Wang and Shi Gu. 2021. Brecq: Pushing the limit of post-training quantization by block reconstruction. arXiv preprint arXiv:2102.05426 (2021)."},{"key":"e_1_3_3_2_53_2","unstructured":"Yinglong Li Xiaoyu Liu Jiacheng Li Ruikang Xu Yinda Chen and Zhiwei Xiong. 2025. 
QMamba: Post-Training Quantization for Vision State Space Models. arXiv preprint arXiv:2501.13624 (2025)."},{"key":"e_1_3_3_2_54_2","unstructured":"Ji Lin Jiaming Tang Haotian Tang Shang Yang Wei-Ming Chen Wei-Chen Wang Guangxuan Xiao Xingyu Dang Chuang Gan and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. Proceedings of Machine Learning and Systems 6 (2024) 87\u2013100."},{"key":"e_1_3_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02520"},{"key":"e_1_3_3_2_56_2","unstructured":"Yujun Lin Haotian Tang Shang Yang Zhekai Zhang Guangxuan Xiao Chuang Gan and Song Han. 2024. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving. arXiv preprint arXiv:2405.04532 (2024)."},{"key":"e_1_3_3_2_57_2","unstructured":"Haotian Liu Chunyuan Li Qingyang Wu and Yong\u00a0Jae Lee. 2024. Visual instruction tuning. Advances in neural information processing systems 36 (2024)."},{"key":"e_1_3_3_2_58_2","doi-asserted-by":"crossref","unstructured":"Yuhao Liu Shubham Rai Salim Ullah and Akash Kumar. 2023. High Flexibility Designs of Quantized Runtime Reconfigurable Multi-Precision Multipliers. IEEE Embedded Systems Letters (2023).","DOI":"10.1109\/LES.2023.3298736"},{"key":"e_1_3_3_2_59_2","volume-title":"The Thirty-eighth Annual Conference on Neural Information Processing Systems","author":"Liu Yue","year":"2024","unstructured":"Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. 2024. VMamba: Visual State Space Model. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https:\/\/openreview.net\/forum?id=ZgtLQQR1K7"},{"key":"e_1_3_3_2_60_2","unstructured":"Zirui Liu Jiayi Yuan Hongye Jin Shaochen Zhong Zhaozhuo Xu Vladimir Braverman Beidi Chen and Xia Hu. 2024. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. 
arXiv preprint arXiv:2402.02750 (2024)."},{"key":"e_1_3_3_2_61_2","unstructured":"Zechun Liu Changsheng Zhao Igor Fedorov Bilge Soran Dhruv Choudhary Raghuraman Krishnamoorthi Vikas Chandra Yuandong Tian and Tijmen Blankevoort. 2024. SpinQuant\u2013LLM quantization with learned rotations. arXiv preprint arXiv:2405.16406 (2024)."},{"key":"e_1_3_3_2_62_2","unstructured":"Xudong Lu Aojun Zhou Yuhui Xu Renrui Zhang Peng Gao and Hongsheng Li. 2024. SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models. arXiv preprint arXiv:2405.16057 (2024)."},{"key":"e_1_3_3_2_63_2","unstructured":"Stephen Merity Caiming Xiong James Bradbury and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843 (2016)."},{"key":"e_1_3_3_2_64_2","unstructured":"AI Meta. 2024. Introducing meta llama 3: The most capable openly available llm to date. Meta AI (2024)."},{"key":"e_1_3_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.1145\/3582016.3582069"},{"key":"e_1_3_3_2_66_2","doi-asserted-by":"crossref","unstructured":"Naveen Muralimanohar Rajeev Balasubramonian and Norman\u00a0P Jouppi. 2009. CACTI 6.0: A tool to model large caches. HP laboratories 27 (2009) 28.","DOI":"10.1109\/MM.2008.2"},{"key":"e_1_3_3_2_67_2","unstructured":"NVIDIA. 2024. TensorRT-LLM: High-Performance LLM Inference Library. https:\/\/github.com\/NVIDIA\/TensorRT-LLM Accessed: 2025-02-19."},{"key":"e_1_3_3_2_68_2","unstructured":"NVIDIA. 2025. NVIDIA A100 Tensor Core GPU. https:\/\/www.nvidia.com\/en-us\/data-center\/a100\/ Accessed: 2025-02-19."},{"key":"e_1_3_3_2_69_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00063"},{"key":"e_1_3_3_2_70_2","unstructured":"Adam Paszke Sam Gross Francisco Massa Adam Lerer James Bradbury Gregory Chanan Trevor Killeen Zeming Lin Natalia Gimelshein Luca Antiga et\u00a0al. 2019. 
Pytorch: An imperative style high-performance deep learning library. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_3_3_2_71_2","unstructured":"Open\u00a0Compute Project. 2024. OCP Microscaling Formats MX V1.0 Spec. https:\/\/www.opencompute.org\/documents\/ocp-microscaling-formats-mx-v1-0-spec-final-pdf#page=10.23 Accessed: 2024-07-13."},{"key":"e_1_3_3_2_72_2","doi-asserted-by":"crossref","unstructured":"Friedrich Pukelsheim. 1994. The three sigma rule. The American Statistician 48 2 (1994) 88\u201391.","DOI":"10.1080\/00031305.1994.10476030"},{"key":"e_1_3_3_2_73_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00015"},{"key":"e_1_3_3_2_74_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2019.00016"},{"key":"e_1_3_3_2_75_2","doi-asserted-by":"crossref","unstructured":"Akshat Ramachandran Souvik Kundu and Tushar Krishna. 2024. CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-Training Quantization of ViTs. ECCV (2024).","DOI":"10.1007\/978-3-031-72855-6_18"},{"key":"e_1_3_3_2_76_2","unstructured":"Akshat Ramachandran Souvik Kundu Arnab Raha Shamik Kundu Deepak\u00a0K. Mathaikutty and Tushar Krishna. 2025. Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator. arXiv:2504.14365\u00a0[cs.LG] https:\/\/arxiv.org\/abs\/2504.14365"},{"key":"e_1_3_3_2_77_2","unstructured":"Akshat Ramachandran Mingyu Lee Huan Xu Souvik Kundu and Tushar Krishna. 2025. OuroMamba: A Data-Free Quantization Framework for Vision Mamba Models. arXiv preprint arXiv:2503.10959 (2025)."},{"key":"e_1_3_3_2_78_2","doi-asserted-by":"crossref","unstructured":"Akshat Ramachandran Zishen Wan Geonhwa Jeong John Gustafson and Tushar Krishna. 2024. Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference. 
arXiv preprint arXiv:2403.05465 (2024).","DOI":"10.1145\/3649329.3656544"},{"key":"e_1_3_3_2_79_2","unstructured":"Bita\u00a0Darvish Rouhani Ritchie Zhao Ankit More Mathew Hall Alireza Khodamoradi Summer Deng Dhruv Choudhary Marius Cornea Eric Dellinger Kristof Denolf et\u00a0al. 2023. Microscaling data formats for deep learning. arXiv preprint arXiv:2310.10537 (2023)."},{"key":"e_1_3_3_2_80_2","doi-asserted-by":"crossref","unstructured":"Keisuke Sakaguchi Ronan\u00a0Le Bras Chandra Bhagavatula and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Commun. ACM 64 9 (2021) 99\u2013106.","DOI":"10.1145\/3474381"},{"key":"e_1_3_3_2_81_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCAS51556.2021.9401196"},{"key":"e_1_3_3_2_82_2","unstructured":"Wenqi Shao Mengzhao Chen Zhaoyang Zhang Peng Xu Lirui Zhao Zhiqian Li Kaipeng Zhang Peng Gao Yu Qiao and Ping Luo. 2023. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv:2308.13137 (2023)."},{"key":"e_1_3_3_2_83_2","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358302"},{"key":"e_1_3_3_2_84_2","doi-asserted-by":"publisher","DOI":"10.5555\/3195638.3195659"},{"key":"e_1_3_3_2_85_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00069"},{"key":"e_1_3_3_2_86_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00851"},{"key":"e_1_3_3_2_87_2","doi-asserted-by":"publisher","DOI":"10.1109\/CICC53496.2022.9772810"},{"key":"e_1_3_3_2_88_2","unstructured":"Mingjie Sun Zhuang Liu Anna Bair and J\u00a0Zico Kolter. 2023. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695 (2023)."},{"key":"e_1_3_3_2_89_2","unstructured":"Wei Sun Aojun Zhou Sander Stuijk Rob Wijnhoven Andrew\u00a0O Nelson Henk Corporaal et\u00a0al. 2021. 
DominoSearch: Find layer-wise fine-grained N: M sparse schemes from dense neural networks. Advances in neural information processing systems 34 (2021) 20721\u201320732."},{"key":"e_1_3_3_2_90_2","doi-asserted-by":"publisher","DOI":"10.1109\/DAC18072.2020.9218516"},{"key":"e_1_3_3_2_91_2","doi-asserted-by":"crossref","unstructured":"Jianming Tong Anirudh Itagi Prasanth Chatarasi and Tushar Krishna. 2024. FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching. arXiv preprint arXiv:2405.13170 (2024).","DOI":"10.1109\/ISCA59077.2024.00024"},{"key":"e_1_3_3_2_92_2","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et\u00a0al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)."},{"key":"e_1_3_3_2_93_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2018.00059"},{"key":"e_1_3_3_2_94_2","unstructured":"Naigang Wang Jungwook Choi Daniel Brand Chia-Yu Chen and Kailash Gopalakrishnan. 2018. Training deep neural networks with 8-bit floating point numbers. Advances in neural information processing systems 31 (2018)."},{"key":"e_1_3_3_2_95_2","unstructured":"Xiaoxia Wu Haojun Xia Stephen Youn Zhen Zheng Shiyang Chen Arash Bakhtiari Michael Wyatt Yuxiong He Olatunji Ruwase Leon Song et\u00a0al. 2023. Zeroquant (4+ 2): Redefining llms quantization with a new fp6-centric strategy for diverse generative tasks. arXiv preprint arXiv:2312.08583 (2023)."},{"key":"e_1_3_3_2_96_2","first-page":"38087","volume-title":"International Conference on Machine Learning","author":"Xiao Guangxuan","year":"2023","unstructured":"Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. 
In International Conference on Machine Learning. PMLR, 38087\u201338099."},{"key":"e_1_3_3_2_97_2","doi-asserted-by":"crossref","unstructured":"Jingfeng Yang Hongye Jin Ruixiang Tang Xiaotian Han Qizhang Feng Haoming Jiang Shaochen Zhong Bing Yin and Xia Hu. 2024. Harnessing the power of llms in practice: A survey on chatgpt and beyond. ACM Transactions on Knowledge Discovery from Data 18 6 (2024) 1\u201332.","DOI":"10.1145\/3649506"},{"key":"e_1_3_3_2_98_2","unstructured":"Lu Yin Ajay\u00a0K. Jaiswal Shiwei Liu Souvik Kundu and Zhangyang Wang. 2024. Junk DNA Hypothesis: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs \u201cDifficult\u201d Downstream Tasks in LLMs. Forty-first International Conference on Machine Learning."},{"key":"e_1_3_3_2_99_2","unstructured":"Lu Yin You Wu Zhenyu Zhang Cheng-Yu Hsieh Yaqing Wang Yiling Jia Mykola Pechenizkiy Yi Liang Zhangyang Wang and Shiwei Liu. 2023. Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity. arXiv preprint arXiv:2310.05175 (2023)."},{"key":"e_1_3_3_2_100_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO50266.2020.00071"},{"key":"e_1_3_3_2_101_2","doi-asserted-by":"crossref","unstructured":"Rowan Zellers Ari Holtzman Yonatan Bisk Ali Farhadi and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830 (2019).","DOI":"10.18653\/v1\/P19-1472"},{"key":"e_1_3_3_2_102_2","unstructured":"Biao Zhang Zhongtao Liu Colin Cherry and Orhan Firat. 2024. When scaling meets llm finetuning: The effect of data model and finetuning method. arXiv preprint arXiv:2402.17193 (2024)."},{"key":"e_1_3_3_2_103_2","unstructured":"Susan Zhang Stephen Roller Naman Goyal Mikel Artetxe Moya Chen Shuohui Chen Christopher Dewan Mona Diab Xian Li Xi\u00a0Victoria Lin et\u00a0al. 2022. Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:2205.01068 (2022)."},{"key":"e_1_3_3_2_104_2","unstructured":"Yilong Zhao Chien-Yu Lin Kan Zhu Zihao Ye Lequn Chen Size Zheng Luis Ceze Arvind Krishnamurthy Tianqi Chen and Baris Kasikci. 2024. Atom: Low-bit quantization for efficient and accurate llm serving. Proceedings of Machine Learning and Systems 6 (2024) 196\u2013209."},{"key":"e_1_3_3_2_105_2","unstructured":"Lianghui Zhu Bencheng Liao Qian Zhang Xinlong Wang Wenyu Liu and Xinggang Wang. 2024. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417 (2024)."}],"event":{"name":"ISCA '25: Proceedings of the 52nd Annual International Symposium on Computer Architecture","location":"Tokyo Japan","acronym":"SIGARCH '25","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture"]},"container-title":["Proceedings of the 52nd Annual International Symposium on Computer Architecture"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3695053.3730989","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,21]],"date-time":"2025-06-21T10:59:34Z","timestamp":1750503574000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3695053.3730989"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,20]]},"references-count":104,"alternative-id":["10.1145\/3695053.3730989","10.1145\/3695053"],"URL":"https:\/\/doi.org\/10.1145\/3695053.3730989","relation":{},"subject":[],"published":{"date-parts":[[2025,6,20]]},"assertion":[{"value":"2025-06-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}