{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T16:03:14Z","timestamp":1776096194522,"version":"3.50.1"},"reference-count":32,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,10,26]],"date-time":"2023-10-26T00:00:00Z","timestamp":1698278400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2021YFB0301300"],"award-info":[{"award-number":["2021YFB0301300"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Major Program of Guangdong Basic and Applied Research","award":["2019B030302002"],"award-info":[{"award-number":["2019B030302002"]}]},{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"crossref","award":["U1911401"],"award-info":[{"award-number":["U1911401"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Guangdong Province Special Support Program for Cultivating High-Level Talents","award":["2021TQ06X160"],"award-info":[{"award-number":["2021TQ06X160"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,12,31]]},"abstract":"<jats:p>Transformer models have emerged as a leading approach in the field of natural language processing (NLP) and are increasingly being deployed in production environments. Graphic processing units (GPUs) have become a popular choice for the transformer deployment and often rely on the batch processing technique to ensure high hardware performance. Nonetheless, the current practice for transformer inference encounters computational and memory redundancy due to the heavy-tailed distribution of sequence lengths in NLP scenarios, resulting in low practical performance.<\/jats:p>\n          <jats:p>In this article, we propose a unified solution for improving both computation and memory efficiency of the real-world transformer inference on GPUs. The solution eliminates the redundant computation and memory footprint across a transformer model. At first, a GPU-oriented computation approach is proposed to process the self-attention module in a fine-grained manner, eliminating its redundant computation. Next, the multi-layer perceptron module continues to use the word-accumulation approach to eliminate its redundant computation. Then, to better unify the fine-grained approach and the word-accumulation approach, it organizes the data layout of the self-attention module in block granularity. Since aforementioned approaches make the required memory size largely reduce and constantly fluctuate, we propose the chunk-based approach to enable a better balance between memory footprint and allocation\/free efficiency. 
Our experimental results show that, compared with prevailing frameworks, our unified solution reduces the average latency of the entire transformer model by 28% and of the self-attention module by 63.8%, and reduces the memory footprint of intermediate results by 7.8\u00d7.<\/jats:p>","DOI":"10.1145\/3617689","type":"journal-article","created":{"date-parts":[[2023,8,26]],"date-time":"2023-08-26T10:34:42Z","timestamp":1693046082000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4707-9492","authenticated-orcid":false,"given":"Jiangsu","family":"Du","sequence":"first","affiliation":[{"name":"School of Computer Science and Engineering, Sun Yat-sen University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1417-3012","authenticated-orcid":false,"given":"Jiazhi","family":"Jiang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Sun Yat-sen University, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-4718-2933","authenticated-orcid":false,"given":"Jiang","family":"Zheng","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Sun Yat-sen University, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-9145-3919","authenticated-orcid":false,"given":"Hongbin","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Sun Yat-sen University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5582-1031","authenticated-orcid":false,"given":"Dan","family":"Huang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Sun Yat-sen University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5315-3375","authenticated-orcid":false,"given":"Yutong","family":"Lu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Sun Yat-sen University, China"}]}],"member":"320","published-online":{"date-parts":[[2023,10,26]]},"reference":[{"key":"e_1_3_1_2_2","first-page":"265","volume-title":"Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916)","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard et\u00a0al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916). 265\u2013283."},{"key":"e_1_3_1_3_2","first-page":"1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201922)","author":"Aminabadi Reza Yazdani","year":"2022","unstructured":"Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley et\u00a0al. 2022. DeepSpeed-inference: Enabling efficient inference of transformer models at unprecedented scale. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201922). 
IEEE, 1\u201315."},{"key":"e_1_3_1_4_2","volume-title":"Hugging Face Transformer Inference Under 1 Millisecond Latency","author":"Benesty Micha\u00ebl","year":"2021","unstructured":"Micha\u00ebl Benesty. 2021. Hugging Face Transformer Inference Under 1 Millisecond Latency. Retrieved August 7, 2023 from https:\/\/towardsdatascience.com\/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c"},{"key":"e_1_3_1_5_2","first-page":"1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Chen Shiyang","year":"2021","unstructured":"Shiyang Chen, Shaoyi Huang, Santosh Pandey, Bingbing Li, Guang R. Gao, Long Zheng, Caiwen Ding, and Hang Liu. 2021. E.T.: Re-thinking self-attention for transformer models on GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1\u201318."},{"key":"e_1_3_1_6_2","first-page":"578","volume-title":"Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918)","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze et\u00a0al. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918). 578\u2013594."},{"key":"e_1_3_1_7_2","first-page":"613","volume-title":"Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201917)","author":"Crankshaw Daniel","year":"2017","unstructured":"Daniel Crankshaw, Xin Wang, Guilio Zhou, Michael J. Franklin, Joseph E. Gonzalez, and Ion Stoica. 2017. Clipper: A low-latency online prediction serving system. In Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201917). 613\u2013627."},{"key":"e_1_3_1_8_2","first-page":"16344","article-title":"Flashattention: Fast and memory-efficient exact attention with io-awareness","volume":"35","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Adv. Neural Info. Process. Syst. 35 (2022), 16344\u201316359.","journal-title":"Adv. Neural Info. Process. Syst."},{"key":"e_1_3_1_9_2","first-page":"800","article-title":"Tensorflow lite micro: Embedded machine learning for tinyml systems","volume":"3","author":"David Robert","year":"2021","unstructured":"Robert David, Jared Duke, Advait Jain, Vijay Janapa Reddi, Nat Jeffries, Jian Li, Nick Kreeger, Ian Nappier, Meghna Natraj, Tiezhen Wang et\u00a0al. 2021. Tensorflow lite micro: Embedded machine learning for tinyml systems. Proc. Mach. Learn. Syst. 3 (2021), 800\u2013811.","journal-title":"Proc. Mach. Learn. Syst."},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3524059.3532372"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3437801.3441578"},{"key":"e_1_3_1_12_2","first-page":"1","volume-title":"Proceedings of the 13th EuroSys Conference","author":"Gao Pin","year":"2018","unstructured":"Pin Gao, Lingfan Yu, Yongwei Wu, and Jinyang Li. 2018. Low latency rnn inference with cellular batching. In Proceedings of the 13th EuroSys Conference. 
1\u201315."},{"key":"e_1_3_1_13_2","first-page":"443","volume-title":"Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201920)","author":"Gujarati Arpan","year":"2020","unstructured":"Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. 2020. Serving DNNs like clockwork: Performance predictability from the bottom up. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201920). 443\u2013462."},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3545008.3545022"},{"key":"e_1_3_1_15_2","first-page":"2","volume-title":"Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT\u201919)","volume":"1","author":"Kenton Jacob Devlin, Ming-Wei Chang,","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, KentonLee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT\u201919), Vol. 1. 2."},{"key":"e_1_3_1_16_2","unstructured":"Nikita Kitaev \u0141ukasz Kaiser and Anselm Levskaya. 2020. Reformer: The efficient transformer. Retrieved from https:\/\/arXiv:2001.04451"},{"issue":"1","key":"e_1_3_1_17_2","first-page":"105","article-title":"PaddlePaddle: An open-source deep learning platform from industrial practice","volume":"1","author":"Ma Yanjun","year":"2019","unstructured":"Yanjun Ma, Dianhai Yu, Tian Wu, and Haifeng Wang. 2019. PaddlePaddle: An open-source deep learning platform from industrial practice. Front. Data Comput. 1, 1 (2019), 105\u2013115.","journal-title":"Front. Data Comput."},{"key":"e_1_3_1_18_2","first-page":"522","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201918)","author":"Markidis Stefano","year":"2018","unstructured":"Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter. 2018. Nvidia tensor core programmability, performance, and precision. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201918). IEEE, 522\u2013531."},{"key":"e_1_3_1_19_2","volume-title":"Proceedings of the GPU Technology Conference","volume":"3","author":"Micikevicius Paulius","year":"2012","unstructured":"Paulius Micikevicius. 2012. GPU performance analysis and optimization. In Proceedings of the GPU Technology Conference, Vol. 3."},{"key":"e_1_3_1_20_2","unstructured":"Sharan Narang Erich Elsen Gregory Diamos and Shubho Sengupta. 2017. Exploring sparsity in recurrent neural networks. Retrieved from https:\/\/arXiv:1704.05119"},{"key":"e_1_3_1_21_2","volume-title":"cuBLAS","year":"2023","unstructured":"NVIDIA. 2023. cuBLAS. Retrieved August 7, 2023 from https:\/\/developer.nvidia.com\/cublascub"},{"key":"e_1_3_1_22_2","volume-title":"FasterTransformer","year":"2023","unstructured":"NVIDIA. 2023. FasterTransformer. 
Retrieved August 7, 2023 from https:\/\/github.com\/NVIDIA\/FasterTransformer"},{"key":"e_1_3_1_23_2","article-title":"Pytorch: An imperative style, high-performance deep learning library","volume":"32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga et\u00a0al. 2019. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Info. Process. Syst. 32 (2019).","journal-title":"Adv. Neural Info. Process. Syst."},{"key":"e_1_3_1_24_2","volume-title":"Optimize your Inference Jobs using Dynamic Batch Inference with TorchServe on Amazon SageMaker","author":"Phi Nguyen","year":"2022","unstructured":"Nguyen Phi, Chauhan Geeta, Shojanazeri Hamid, and Kulkarni Nikhil. 2022. Optimize your Inference Jobs using Dynamic Batch Inference with TorchServe on Amazon SageMaker. Retrieved August 7, 2023 from https:\/\/aws.amazon.com\/cn\/blogs\/machine-learning\/optimize-your-inference-jobs-using-dynamic-batch-inference-with-torchserve-on-amazon-sagemaker\/"},{"issue":"8","key":"e_1_3_1_25_2","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever et\u00a0al. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.","journal-title":"OpenAI Blog"},{"key":"e_1_3_1_26_2","first-page":"10183","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Tay Yi","year":"2021","unstructured":"Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. 2021. Synthesizer: Rethinking self-attention for transformer models. In Proceedings of the International Conference on Machine Learning. PMLR, 10183\u201310192."},{"key":"e_1_3_1_27_2","volume-title":"XLA: Optimizing Compiler for Machine Learning","unstructured":"TensorFlow. [n. d.]. XLA: Optimizing Compiler for Machine Learning. Retrieved August 7, 2023 from https:\/\/www.tensorflow.org\/xla"},{"key":"e_1_3_1_28_2","volume-title":"Proceedings of the GPU Technology Conference","volume":"1","author":"Vanholder Han","year":"2016","unstructured":"Han Vanholder. 2016. Efficient inference with tensorrt. In Proceedings of the GPU Technology Conference, Vol. 1."},{"key":"e_1_3_1_29_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Adv. Neural Info. Process. Syst. 30 (2017).","journal-title":"Adv. Neural Info. Process. Syst."},{"key":"e_1_3_1_30_2","doi-asserted-by":"crossref","unstructured":"Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. Retrieved from https:\/\/arxiv.org\/abs\/1804.07461","DOI":"10.18653\/v1\/W18-5446"},{"key":"e_1_3_1_31_2","first-page":"1","volume-title":"Proceedings of the 39th International Conference on Computer-Aided Design","author":"Yang Xiaoxuan","year":"2020","unstructured":"Xiaoxuan Yang, Bonan Yan, Hai Li, and Yiran Chen. 2020. ReTransformer: ReRAM-based processing-in-memory architecture for transformer acceleration. 
In Proceedings of the 39th International Conference on Computer-Aided Design. 1\u20139."},{"key":"e_1_3_1_32_2","first-page":"521","volume-title":"Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201922)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for transformer-based generative models. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201922). 521\u2013538."},{"key":"e_1_3_1_33_2","first-page":"344","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201923)","author":"Zhai Yujia","year":"2023","unstructured":"Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, and Yibo Zhu. 2023. ByteTransformer: A high-performance transformer boosted for variable-length inputs. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201923). IEEE, 344\u2013355."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3617689","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3617689","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:36:32Z","timestamp":1750178192000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3617689"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,26]]},"references-count":32,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,12,31]]}},"alternative-id":["10.1145\/3617689"],"URL":"https:\/\/doi.org\/10.1145\/3617689","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,10,26]]},"assertion":[{"value":"2023-05-17","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-08-07","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-10-26","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}