{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T10:48:14Z","timestamp":1777459694411,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":21,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,4,27]]},"DOI":"10.1145\/3805621.3807608","type":"proceedings-article","created":{"date-parts":[[2026,4,28]],"date-time":"2026-04-28T13:08:45Z","timestamp":1777381725000},"page":"49-59","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8117-9422","authenticated-orcid":false,"given":"Jihao","family":"Xin","sequence":"first","affiliation":[{"name":"KAUST, Thuwal, Saudi Arabia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-1509-0051","authenticated-orcid":false,"given":"Tian","family":"Lyu","sequence":"additional","affiliation":[{"name":"KAUST, Thuwal, Saudi Arabia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1964-8168","authenticated-orcid":false,"given":"Qilong","family":"Pan","sequence":"additional","affiliation":[{"name":"KAUST, Thuwal, Saudi Arabia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8820-1629","authenticated-orcid":false,"given":"Kesen","family":"Wang","sequence":"additional","affiliation":[{"name":"HUMAIN, Riyadh, Saudi Arabia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5051-4283","authenticated-orcid":false,"given":"Marco","family":"Canini","sequence":"additional","affiliation":[{"name":"KAUST, Thuwal, Saudi Arabia"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,4,28]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Zefan Cai Yichi Zhang Bofei Gao Yuliang Liu Yucheng Li Tianyu Liu Keming Lu Wayne Xiong Yue Dong Junjie Hu and Wen Xiao. 2025. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. arXiv:2406.02069 [cs.CL] https:\/\/arxiv.org\/abs\/2406.02069"},{"key":"e_1_3_2_1_2_1","volume-title":"Palu: KV-Cache Compression with Low-Rank Projection. In The Thirteenth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=LWMS4pk2vK","author":"Chang Chi-Chih","year":"2025","unstructured":"Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S. Abdelfattah, and Kai-Chiang Wu. 2025. Palu: KV-Cache Compression with Low-Rank Projection. In The Thirteenth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=LWMS4pk2vK"},{"key":"e_1_3_2_1_3_1","volume-title":"Oh (Eds.)","volume":"35","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 16344\u201316359. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2022\/file\/67d57c32e20fd0a7a302cb81d36e40d5-Paper-Conference.pdf"},{"key":"e_1_3_2_1_4_1","volume-title":"Proceedings of the 40th International Conference on Machine Learning","author":"Frantar Elias","year":"2023","unstructured":"Elias Frantar and Dan Alistarh. 2023. SparseGPT: massive language models can be accurately pruned in one-shot. In Proceedings of the 40th International Conference on Machine Learning (Honolulu, Hawaii, USA) (ICML'23). JMLR.org, Article 414, 15 pages."},{"key":"e_1_3_2_1_5_1","volume-title":"GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323 [cs.LG] https:\/\/arxiv.org\/abs\/2210.17323","author":"Frantar Elias","year":"2023","unstructured":"Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323 [cs.LG] https:\/\/arxiv.org\/abs\/2210.17323"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_2_1_7_1","volume-title":"Proceedings of Machine Learning and Systems, P. Gibbons, G. Pekhimenko, and C. De Sa (Eds.)","volume":"6","author":"Lin Ji","year":"2024","unstructured":"Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. In Proceedings of Machine Learning and Systems, P. Gibbons, G. Pekhimenko, and C. De Sa (Eds.), Vol. 6. 87\u2013100. https:\/\/proceedings.mlsys.org\/paper_files\/paper\/2024\/file\/42a452cbafa9dd64e9ba4aa95cc1ef21-Paper-Conference.pdf"},{"key":"e_1_3_2_1_8_1","volume-title":"Levine (Eds.)","volume":"36","author":"Ma Xinyin","year":"2023","unstructured":"Xinyin Ma, Gongfan Fang, and Xinchao Wang. 2023. LLM-Pruner: On the Structural Pruning of Large Language Models. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 21702\u201321720. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2023\/file\/44956951349095f74492a5471128a7e0-Paper-Conference.pdf"},{"key":"e_1_3_2_1_9_1","unstructured":"Meta AI. 2024. Llama 3 Model Card. https:\/\/github.com\/meta-llama\/llama3."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"crossref","unstructured":"Pavlo Molchanov Arun Mallya Stephen Tyree Iuri Frosio and Jan Kautz. 2019. Importance Estimation for Neural Network Pruning. arXiv:1906.10771 [cs.LG] https:\/\/arxiv.org\/abs\/1906.10771","DOI":"10.1109\/CVPR.2019.01152"},{"key":"e_1_3_2_1_11_1","unstructured":"NVIDIA. 2024. NVIDIA TensorRT: Programmable Inference Accelerator. https:\/\/developer.nvidia.com\/tensorrt."},{"key":"e_1_3_2_1_12_1","unstructured":"NVIDIA Corporation. 2022. NVIDIA H100 Tensor Core GPU Architecture. White Paper. https:\/\/resources.nvidia.com\/en-us-hopper-architecture\/nvidia-h100-tensor-c."},{"key":"e_1_3_2_1_13_1","unstructured":"SemiAnalysis. 2024. NVIDIA Tensor Core Evolution: From Volta To Blackwell. https:\/\/newsletter.semianalysis.com\/p\/nvidia-tensor-core-evolution-from-volta-to-blackwell."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.52202\/079017-2193"},{"key":"e_1_3_2_1_15_1","volume-title":"Alvarez","author":"Shen Maying","year":"2021","unstructured":"Maying Shen, Hongxu Yin, Pavlo Molchanov, Lei Mao, Jianna Liu, and Jose M. Alvarez. 2021. HALP: Hardware-Aware Latency Pruning. arXiv:2110.10811 [cs.CV] https:\/\/arxiv.org\/abs\/2110.10811"},{"key":"e_1_3_2_1_16_1","volume-title":"The Twelfth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=PxoFut3dWW","author":"Sun Mingjie","year":"2024","unstructured":"Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. 2024. A Simple and Effective Pruning Approach for Large Language Models. In The Twelfth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=PxoFut3dWW"},{"key":"e_1_3_2_1_17_1","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"Tang Jiaming","year":"2024","unstructured":"Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. 2024. QUEST: query-aware sparsity for efficient long-context LLM inference. In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria) (ICML'24). JMLR.org, Article 1955, 11 pages."},{"key":"e_1_3_2_1_18_1","unstructured":"Xin Wang Yu Zheng Zhongwei Wan and Mi Zhang. 2025. SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression. In The Thirteenth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=LNYIUouhdt"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","unstructured":"Jinqi Xiao Chengming Zhang Yu Gong Miao Yin Yang Sui Lizhi Xiang Dingwen Tao and Bo Yuan. 2023. HALOC: hardware-aware automatic low-rank compression for compact neural networks. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence (AAAI'23\/IAAI'23\/EAAI'23). AAAI Press Article 1175 9 pages. doi:10.1609\/aaai.v37i9.26244","DOI":"10.1609\/aaai.v37i9.26244"},{"key":"e_1_3_2_1_20_1","volume-title":"ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models. arXiv:2312.05821 [cs.CL] https:\/\/arxiv.org\/abs\/2312.05821","author":"Yuan Zhihang","year":"2025","unstructured":"Zhihang Yuan, Yuzhang Shang, Yue Song, Dawei Yang, Qiang Wu, Yan Yan, and Guangyu Sun. 2025. ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models. arXiv:2312.05821 [cs.CL] https:\/\/arxiv.org\/abs\/2312.05821"},{"key":"e_1_3_2_1_21_1","volume-title":"Wang, and Beidi Chen.","author":"Zhang Zhenyu","year":"2023","unstructured":"Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher R\u00e9, Clark Barrett, Zhangyang \u201cAtlas\u201d Wang, and Beidi Chen. 2023. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 34661\u201334710. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2023\/file\/6ceefa7b15572587b78ecfcebb2827f8-Paper-Conference.pdf"}],"event":{"name":"EuroSys '26: 21st European Conference on Computer Systems","location":"Edinburgh Scotland Uk","acronym":"EuroMLSys '26","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems"]},"container-title":["Proceedings of the Sixth European Workshop on Machine Learning and Systems"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3805621.3807608","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,28]],"date-time":"2026-04-28T13:10:36Z","timestamp":1777381836000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3805621.3807608"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,4,27]]},"references-count":21,"alternative-id":["10.1145\/3805621.3807608","10.1145\/3805621"],"URL":"https:\/\/doi.org\/10.1145\/3805621.3807608","relation":{},"subject":[],"published":{"date-parts":[[2026,4,27]]},"assertion":[{"value":"2026-04-28","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}