{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,5]],"date-time":"2026-06-05T16:08:16Z","timestamp":1780675696391,"version":"3.54.1"},"reference-count":65,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2023,10]]},"abstract":"<jats:p>\n            With the fast growth of parameter size, it becomes increasingly challenging to deploy large generative models as they typically require large GPU memory consumption and massive computation. Unstructured model pruning has been a common approach to reduce both GPU memory footprint and the overall computation while retaining good model accuracy. However, the existing solutions do not provide an efficient support for handling unstructured sparsity on modern GPUs, especially on the highly-structured tensor core hardware. Therefore, we propose Flash-LLM for enabling low-cost and highly efficient large generative model inference with the sophisticated support of unstructured sparsity on high-performance but highly restrictive tensor cores. Based on our key observation that the main bottleneck of generative model inference is the several skinny matrix multiplications for which tensor cores would be significantly under-utilized due to low computational intensity, we propose a general\n            <jats:italic>Load-as-Sparse and Compute-as-Dense<\/jats:italic>\n            methodology for unstructured sparse matrix multiplication (SpMM). The basic insight is to address the significant memory bandwidth bottleneck while tolerating redundant computations that are not critical for end-to-end performance on tensor cores. 
Based on this, we design an effective software framework for tensor core based unstructured SpMM, leveraging on-chip resources for efficient sparse data extraction and computation\/memory-access overlapping. Extensive evaluations demonstrate that (1) at SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art libraries, i.e., Sputnik and SparTA, by an average of 2.9X and 1.5X, respectively. (2) At end-to-end framework level on OPT-30B\/66B\/175B models, for\n            <jats:italic>tokens per GPU-second<\/jats:italic>\n            , Flash-LLM achieves up to 3.8X and 3.6X improvement over DeepSpeed and FasterTransformer, respectively, with significantly lower inference cost.\n          <\/jats:p>","DOI":"10.14778\/3626292.3626303","type":"journal-article","created":{"date-parts":[[2023,12,11]],"date-time":"2023-12-11T23:24:55Z","timestamp":1702337095000},"page":"211-224","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":36,"title":["Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity"],"prefix":"10.14778","volume":"17","author":[{"given":"Haojun","family":"Xia","sequence":"first","affiliation":[{"name":"University of Sydney"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zhen","family":"Zheng","sequence":"additional","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yuchao","family":"Li","sequence":"additional","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Donglin","family":"Zhuang","sequence":"additional","affiliation":[{"name":"University of Sydney"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zhongzhu","family":"Zhou","sequence":"additional","affiliation":[{"name":"University of 
Sydney"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xiafei","family":"Qiu","sequence":"additional","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yong","family":"Li","sequence":"additional","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Wei","family":"Lin","sequence":"additional","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Shuaiwen Leon","family":"Song","sequence":"additional","affiliation":[{"name":"University of Sydney"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2023,10]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41404.2022.00051"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.bigscience-1.9"},{"key":"e_1_2_1_3_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell etal 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901.  Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476182"},{"key":"e_1_2_1_5_1","volume-title":"Retrieved","author":"AI.","year":"2022","unstructured":"Eleuther AI. 2022 . GPT-NeoX-20B . Retrieved October 14, 2023 from https:\/\/huggingface.co\/EleutherAI\/gpt-neox-20b EleutherAI. 2022. GPT-NeoX-20B. 
Retrieved October 14, 2023 from https:\/\/huggingface.co\/EleutherAI\/gpt-neox-20b"},{"key":"e_1_2_1_6_1","volume-title":"Retrieved","author":"Face Hugging","year":"2023","unstructured":"Hugging Face . 2023 . Model Parallelism . Retrieved October 14, 2023 from https:\/\/huggingface.co\/docs\/transformers\/v4.15.0\/parallelism Hugging Face. 2023. Model Parallelism. Retrieved October 14, 2023 from https:\/\/huggingface.co\/docs\/transformers\/v4.15.0\/parallelism"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3437801.3441593"},{"key":"e_1_2_1_8_1","volume-title":"Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv preprint arXiv:2301.00774","author":"Frantar Elias","year":"2023","unstructured":"Elias Frantar and Dan Alistarh . 2023. Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv preprint arXiv:2301.00774 ( 2023 ). Elias Frantar and Dan Alistarh. 2023. Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv preprint arXiv:2301.00774 (2023)."},{"key":"e_1_2_1_9_1","volume-title":"GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint arXiv:2210.17323","author":"Frantar Elias","year":"2022","unstructured":"Elias Frantar , Saleh Ashkboos , Torsten Hoefler , and Dan Alistarh . 2022 . GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint arXiv:2210.17323 (2022). Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint arXiv:2210.17323 (2022)."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.5555\/3433701.3433723"},{"key":"e_1_2_1_11_1","volume-title":"Retrieved","author":"Gale Trevor","year":"2020","unstructured":"Trevor Gale , Matei Zaharia , Cliff Young , and Erich Elsen . 2020 . sputnik github . 
Retrieved October 14, 2023 from https:\/\/github.com\/google-research\/sputnik\/ Trevor Gale, Matei Zaharia, Cliff Young, and Erich Elsen. 2020. sputnik github. Retrieved October 14, 2023 from https:\/\/github.com\/google-research\/sputnik\/"},{"key":"e_1_2_1_12_1","volume-title":"Divyam Madaan, Kevin Swersky, Yarin Gal, and Geoffrey E Hinton.","author":"Gomez Aidan N","year":"2019","unstructured":"Aidan N Gomez , Ivan Zhang , Siddhartha Rao Kamalakara , Divyam Madaan, Kevin Swersky, Yarin Gal, and Geoffrey E Hinton. 2019 . Learning sparse networks using targeted dropout. arXiv preprint arXiv:1905.13678 (2019). Aidan N Gomez, Ivan Zhang, Siddhartha Rao Kamalakara, Divyam Madaan, Kevin Swersky, Yarin Gal, and Geoffrey E Hinton. 2019. Learning sparse networks using targeted dropout. arXiv preprint arXiv:1905.13678 (2019)."},{"key":"e_1_2_1_13_1","volume-title":"Gpu kernels for block-sparse weights. arXiv preprint arXiv:1711.09224 3","author":"Gray Scott","year":"2017","unstructured":"Scott Gray , Alec Radford , and Diederik P Kingma . 2017. Gpu kernels for block-sparse weights. arXiv preprint arXiv:1711.09224 3 ( 2017 ), 2. Scott Gray, Alec Radford, and Diederik P Kingma. 2017. Gpu kernels for block-sparse weights. arXiv preprint arXiv:1711.09224 3 (2017), 2."},{"key":"e_1_2_1_14_1","volume-title":"Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149","author":"Han Song","year":"2015","unstructured":"Song Han , Huizi Mao , and William J Dally . 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 ( 2015 ). Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. 
arXiv preprint arXiv:1510.00149 (2015)."},{"key":"e_1_2_1_15_1","volume-title":"Learning both weights and connections for efficient neural network. Advances in neural information processing systems 28","author":"Han Song","year":"2015","unstructured":"Song Han , Jeff Pool , John Tran , and William Dally . 2015. Learning both weights and connections for efficient neural network. Advances in neural information processing systems 28 ( 2015 ). Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. Advances in neural information processing systems 28 (2015)."},{"key":"e_1_2_1_16_1","first-page":"1","article-title":"Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks","volume":"22","author":"Hoefler Torsten","year":"2021","unstructured":"Torsten Hoefler , Dan Alistarh , Tal Ben-Nun , Nikoli Dryden , and Alexandra Peste . 2021 . Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks . J. Mach. Learn. Res. 22 , 241 (2021), 1 -- 124 . Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. 2021. Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks. J. Mach. Learn. Res. 22, 241 (2021), 1--124.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3293883.3295712"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3489517.3530588"},{"key":"e_1_2_1_19_1","volume-title":"Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang , Youlong Cheng , Ankur Bapna , Orhan Firat , Dehao Chen , Mia Chen , HyoukJoong Lee , Jiquan Ngiam , Quoc V Le , Yonghui Wu , 2019 . 
Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32 (2019). Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_2_1_20_1","volume-title":"2022 USENIX Annual Technical Conference, USENIX ATC 2022","author":"Jia Xianyan","year":"2022","unstructured":"Xianyan Jia , Le Jiang , Ang Wang , Wencong Xiao , Ziji Shi , Jie Zhang , Xinyuan Li , Langshi Chen , Yong Li , Zhen Zheng , Xiaoyong Liu , and Wei Lin . 2022 . Whale: Efficient Giant Model Training over Heterogeneous GPUs . In 2022 USENIX Annual Technical Conference, USENIX ATC 2022 , Carlsbad, CA, USA , July 11-13, 2022, Jiri Schindler and Noa Zilberman (Eds.). USENIX Association, 673--688. Xianyan Jia, Le Jiang, Ang Wang, Wencong Xiao, Ziji Shi, Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, Xiaoyong Liu, and Wei Lin. 2022. Whale: Efficient Giant Model Training over Heterogeneous GPUs. In 2022 USENIX Annual Technical Conference, USENIX ATC 2022, Carlsbad, CA, USA, July 11-13, 2022, Jiri Schindler and Noa Zilberman (Eds.). USENIX Association, 673--688."},{"key":"e_1_2_1_21_1","first-page":"1","article-title":"Beyond Data and Model Parallelism for Deep Neural Networks","volume":"1","author":"Jia Zhihao","year":"2019","unstructured":"Zhihao Jia , Matei Zaharia , and Alex Aiken . 2019 . Beyond Data and Model Parallelism for Deep Neural Networks . Proceedings of Machine Learning and Systems 1 (2019), 1 -- 13 . Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond Data and Model Parallelism for Deep Neural Networks. 
Proceedings of Machine Learning and Systems 1 (2019), 1--13.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_2_1_22_1","volume-title":"Proceedings of NAACL-HLT. 4171--4186","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding . In Proceedings of NAACL-HLT. 4171--4186 . Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT. 4171--4186."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3133901"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.14778\/3342263.3342276"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/3571885.3571934"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415530"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.14778\/3551793.3551828"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2022.3195774"},{"key":"e_1_2_1_29_1","volume-title":"Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270","author":"Liu Zhuang","year":"2018","unstructured":"Zhuang Liu , Mingjie Sun , Tinghui Zhou , Gao Huang , and Trevor Darrell . 2018. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270 ( 2018 ). Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. 2018. Rethinking the value of network pruning. 
arXiv preprint arXiv:1810.05270 (2018)."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452773"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.14778\/3570690.3570697"},{"key":"e_1_2_1_32_1","volume-title":"Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius.","author":"Mishra Asit","year":"2021","unstructured":"Asit Mishra , Jorge Albericio Latorre , Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. 2021 . Accelerating Sparse Deep Neural Networks . arXiv:2104.08378 [cs.LG] Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. 2021. Accelerating Sparse Deep Neural Networks. arXiv:2104.08378 [cs.LG]"},{"key":"e_1_2_1_33_1","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. 11264--11272","author":"Molchanov Pavlo","year":"2019","unstructured":"Pavlo Molchanov , Arun Mallya , Stephen Tyree , Iuri Frosio , and Jan Kautz . 2019 . Importance estimation for neural network pruning . In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. 11264--11272 . Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. 2019. Importance estimation for neural network pruning. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. 11264--11272."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.14778\/3574245.3574258"},{"key":"e_1_2_1_35_1","volume-title":"FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement. arXiv preprint arXiv:2304.03946","author":"Nie Xiaonan","year":"2023","unstructured":"Xiaonan Nie , Xupeng Miao , Zilong Wang , Zichao Yang , Jilong Xue , Lingxiao Ma , Gang Cao , and Bin Cui . 2023. FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement. 
arXiv preprint arXiv:2304.03946 ( 2023 ). Xiaonan Nie, Xupeng Miao, Zilong Wang, Zichao Yang, Jilong Xue, Lingxiao Ma, Gang Cao, and Bin Cui. 2023. FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement. arXiv preprint arXiv:2304.03946 (2023)."},{"key":"e_1_2_1_36_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2020","unstructured":"NVIDIA. 2020 . NVIDIA A100 Tensor Core GPU Architecture . Retrieved October 14, 2023 from https:\/\/images.nvidia.com\/aem-dam\/en-zz\/Solutions\/data-center\/nvidia-ampere-architecture-whitepaper.pdf NVIDIA. 2020. NVIDIA A100 Tensor Core GPU Architecture. Retrieved October 14, 2023 from https:\/\/images.nvidia.com\/aem-dam\/en-zz\/Solutions\/data-center\/nvidia-ampere-architecture-whitepaper.pdf"},{"key":"e_1_2_1_37_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2022","unstructured":"NVIDIA. 2022 . NVIDIA Faster-Transformer . Retrieved October 14, 2023 from https:\/\/github.com\/NVIDIA\/FasterTransformer NVIDIA. 2022. NVIDIA Faster-Transformer. Retrieved October 14, 2023 from https:\/\/github.com\/NVIDIA\/FasterTransformer"},{"key":"e_1_2_1_38_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2022","unstructured":"NVIDIA. 2022 . NVIDIA H100 Tensor Core GPU Architecture . Retrieved October 14, 2023 from https:\/\/www.hpctech.co.jp\/catalog\/gtc22-whitepaper-hopper_v1.01.pdf NVIDIA. 2022. NVIDIA H100 Tensor Core GPU Architecture. Retrieved October 14, 2023 from https:\/\/www.hpctech.co.jp\/catalog\/gtc22-whitepaper-hopper_v1.01.pdf"},{"key":"e_1_2_1_39_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2023","unstructured":"NVIDIA. 2023 . cuBLAS Docs . Retrieved October 14, 2023 from https:\/\/docs.nvidia.com\/cuda\/cublas\/index.html NVIDIA. 2023. cuBLAS Docs. Retrieved October 14, 2023 from https:\/\/docs.nvidia.com\/cuda\/cublas\/index.html"},{"key":"e_1_2_1_40_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2023","unstructured":"NVIDIA. 2023 . 
cuSPARSE Library . Retrieved October 14, 2023 from https:\/\/docs.nvidia.com\/cuda\/cusparse\/index.html NVIDIA. 2023. cuSPARSE Library. Retrieved October 14, 2023 from https:\/\/docs.nvidia.com\/cuda\/cusparse\/index.html"},{"key":"e_1_2_1_41_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2023","unstructured":"NVIDIA. 2023 . cuSPARSELt Library . Retrieved October 14, 2023 from https:\/\/docs.nvidia.com\/cuda\/cusparselt\/ NVIDIA. 2023. cuSPARSELt Library. Retrieved October 14, 2023 from https:\/\/docs.nvidia.com\/cuda\/cusparselt\/"},{"key":"e_1_2_1_42_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2023","unstructured":"NVIDIA. 2023 . CUTLASS 3.2 . Retrieved October 14, 2023 from https:\/\/github.com\/NVIDIA\/cutlass NVIDIA. 2023. CUTLASS 3.2. Retrieved October 14, 2023 from https:\/\/github.com\/NVIDIA\/cutlass"},{"key":"e_1_2_1_43_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2023","unstructured":"NVIDIA. 2023 . Nsight Compute Profiling Guide . Retrieved October 14, 2023 from https:\/\/docs.nvidia.com\/nsight-compute\/ProfilingGuide\/#introduction NVIDIA. 2023. Nsight Compute Profiling Guide. Retrieved October 14, 2023 from https:\/\/docs.nvidia.com\/nsight-compute\/ProfilingGuide\/#introduction"},{"key":"e_1_2_1_44_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2023","unstructured":"NVIDIA. 2023 . Nsight System . Retrieved October 14, 2023 from https:\/\/developer.nvidia.com\/nsight-systems NVIDIA. 2023. Nsight System. Retrieved October 14, 2023 from https:\/\/developer.nvidia.com\/nsight-systems"},{"key":"e_1_2_1_45_1","unstructured":"Alec Radford Jeffrey Wu Rewon Child David Luan Dario Amodei Ilya Sutskever etal 2019. Language models are unsupervised multitask learners. OpenAI blog 1 8 (2019) 9.  Alec Radford Jeffrey Wu Rewon Child David Luan Dario Amodei Ilya Sutskever et al. 2019. Language models are unsupervised multitask learners. 
OpenAI blog 1 8 (2019) 9."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588717"},{"key":"e_1_2_1_48_1","unstructured":"Ying Sheng Lianmin Zheng Binhang Yuan Zhuohan Li Max Ryabinin Daniel Y Fu Zhiqiang Xie Beidi Chen Clark Barrett Joseph E Gonzalez etal 2023. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865 (2023).  Ying Sheng Lianmin Zheng Binhang Yuan Zhuohan Li Max Ryabinin Daniel Y Fu Zhiqiang Xie Beidi Chen Clark Barrett Joseph E Gonzalez et al. 2023. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865 (2023)."},{"key":"e_1_2_1_49_1","volume-title":"Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi , Mostofa Patwary , Raul Puri , Patrick LeGresley , Jared Casper , and Bryan Catanzaro . 2019 . Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019). Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)."},{"key":"e_1_2_1_50_1","volume-title":"Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro.","author":"Smith Shaden","year":"2022","unstructured":"Shaden Smith , Mostofa Patwary , Brandon Norick , Patrick LeGresley , Samyam Rajbhandari , Jared Casper , Zhun Liu , Shrimai Prabhumoye , George Zerveas , Vijay Korthikanti , Elton Zhang , Rewon Child , Reza Yazdani Aminabadi , Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. 
2022 . Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model . arXiv:2201.11990 [cs.CL] Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. 2022. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model. arXiv:2201.11990 [cs.CL]"},{"key":"e_1_2_1_51_1","unstructured":"Mingjie Sun Zhuang Liu Anna Bair and J. Zico Kolter. 2023. A Simple and Effective Pruning Approach for Large Language Models. arXiv:2306.11695 [cs.CL]  Mingjie Sun Zhuang Liu Anna Bair and J. Zico Kolter. 2023. A Simple and Effective Pruning Approach for Large Language Models. arXiv:2306.11695 [cs.CL]"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.14778\/3554821.3554896"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.14778\/3514061.3514067"},{"key":"e_1_2_1_54_1","volume-title":"Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008","author":"Ullrich Karen","year":"2017","unstructured":"Karen Ullrich , Edward Meeds , and Max Welling . 2017. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008 ( 2017 ). Karen Ullrich, Edward Meeds, and Max Welling. 2017. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008 (2017)."},{"key":"e_1_2_1_55_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N Gomez , \u0141ukasz Kaiser , and Illia Polosukhin . 2017. Attention is all you need. Advances in neural information processing systems 30 ( 2017 ). 
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_2_1_56_1","volume-title":"Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems 32","author":"Wang Alex","year":"2019","unstructured":"Alex Wang , Yada Pruksachatkun , Nikita Nangia , Amanpreet Singh , Julian Michael , Felix Hill , Omer Levy , and Samuel Bowman . 2019 . Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems 32 (2019). Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_2_1_57_1","unstructured":"Yuke Wang Boyuan Feng Zheng Wang and Yufei Ding. 2023. TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs. arXiv:2112.02052 [cs.LG]  Yuke Wang Boyuan Feng Zheng Wang and Yufei Ding. 2023. TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs. 
arXiv:2112.02052 [cs.LG]"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/1498765.1498785"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3582016.3582047"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/3386367.3432728"},{"key":"e_1_2_1_61_1","volume-title":"Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer.","author":"Zhang Susan","year":"2022","unstructured":"Susan Zhang , Stephen Roller , Naman Goyal , Mikel Artetxe , Moya Chen , Shuohui Chen , Christopher Dewan , Mona Diab , Xian Li , Xi Victoria Lin , Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022 . OPT : Open Pre-trained Transformer Language Models . arXiv:2205.01068 [cs.CL] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open Pre-trained Transformer Language Models. arXiv:2205.01068 [cs.CL]"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.14778\/3561261.3561265"},{"key":"e_1_2_1_63_1","volume-title":"Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng , Zhuohan Li , Hao Zhang , Yonghao Zhuang , Zhifeng Chen , Yanping Huang , Yida Wang , Yuanzhong Xu , Danyang Zhuo , Eric P. Xing , Joseph E. Gonzalez , and Ion Stoica . 2022 . Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. 
In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022 , Carlsbad, CA, USA , July 11-13, 2022, Marcos K. Aguilera and Hakim Weatherspoon (Eds.). USENIX Association, 559--578. Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. 2022. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022, Marcos K. Aguilera and Hakim Weatherspoon (Eds.). USENIX Association, 559--578."},{"key":"e_1_2_1_64_1","volume-title":"Retrieved","author":"Zheng Ningxin","year":"2022","unstructured":"Ningxin Zheng . 2022 . SparTA github . Retrieved October 14, 2023 from https:\/\/github.com\/microsoft\/SparTA\/tree\/sparta_artifact Ningxin Zheng. 2022. SparTA github. Retrieved October 14, 2023 from https:\/\/github.com\/microsoft\/SparTA\/tree\/sparta_artifact"},{"key":"e_1_2_1_65_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Zheng Ningxin","year":"2022","unstructured":"Ningxin Zheng , Bin Lin , Quanlu Zhang , Lingxiao Ma , Yuqing Yang , Fan Yang , Yang Wang , Mao Yang , and Lidong Zhou . 2022 . {SparTA}:{Deep-Learning} Model Sparsity via {Tensor-with-Sparsity-Attribute} . In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) . 213--232. Ningxin Zheng, Bin Lin, Quanlu Zhang, Lingxiao Ma, Yuqing Yang, Fan Yang, Yang Wang, Mao Yang, and Lidong Zhou. 2022. {SparTA}:{Deep-Learning} Model Sparsity via {Tensor-with-Sparsity-Attribute}. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 
213--232."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3626292.3626303","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,1,8]],"date-time":"2024-01-08T23:08:52Z","timestamp":1704755332000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3626292.3626303"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10]]},"references-count":65,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,10]]}},"alternative-id":["10.14778\/3626292.3626303"],"URL":"https:\/\/doi.org\/10.14778\/3626292.3626303","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2023,10]]},"assertion":[{"value":"2023-10-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}