{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,14]],"date-time":"2026-01-14T10:11:23Z","timestamp":1768385483687,"version":"3.49.0"},"reference-count":55,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T00:00:00Z","timestamp":1767830400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>As one of the most widely used high-performance kernels, General Matrix Multiplication, or GEMM, plays a pivotal role in diverse application fields. With the growing prevalence of training for Convolutional Neural Networks (CNNs) and Large Language Models (LLMs), the design and implementation of high-efficiency, low-precision GEMM on modern Neural Processing Unit (NPU) platforms are of great significance. In this work, HGEMM for Ascend NPU is presented, which enables collaborative processing of different computation types by Cube units and Vector units. The major contributions of this work are the following: (i) dual-stream pipeline scheduling is implemented, which synchronizes padding operations, matrix\u2013matrix multiplications, and element-wise instructions across hierarchical buffers and compute units; (ii) a suite of tiling strategies and a corresponding strategy selection mechanism are developed, comprehensively accounting for the impacts from M, N, and K directions; and (iii) SplitK as well as ShuffleK methods are raised to address the challenges of memory access efficiency and AI Core utilization. Extensive evaluations demonstrate that our proposed HGEMM achieves an average 3.56\u00d7 speedup over the CATLASS template-based implementation under identical Ascend NPU configurations, and an average 2.10\u00d7 speedup relative to the cuBLAS implementation on Nvidia A800 GPUs under general random workloads. It also achieves a maximum computational utilization exceeding 90% under benchmark workloads. Moreover, the proposed HGEMM not only significantly outperforms the CATLASS template-based implementation but also delivers efficiency comparable to the cuBLAS implementation in OPT-based bandwidth-limited LLM inference workloads.<\/jats:p>","DOI":"10.3390\/computers15010039","type":"journal-article","created":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T15:33:16Z","timestamp":1767886396000},"page":"39","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Efficient Low-Precision GEMM on Ascend NPU: HGEMM\u2019s Synergy of Pipeline Scheduling, Tiling, and Memory Optimization"],"prefix":"10.3390","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9872-3095","authenticated-orcid":false,"given":"Erkun","family":"Zhang","sequence":"first","affiliation":[{"name":"School of Future Technology, South China University of Technology, Guangzhou 510641, China"},{"name":"Pengcheng Laboratory, Shenzhen 518071, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2273-1504","authenticated-orcid":false,"given":"Pengxiang","family":"Xu","sequence":"additional","affiliation":[{"name":"Pengcheng Laboratory, Shenzhen 518071, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6372-7088","authenticated-orcid":false,"given":"Lu","family":"Lu","sequence":"additional","affiliation":[{"name":"Pengcheng Laboratory, Shenzhen 518071, China"},{"name":"School of Computer Science & Engineering, South China University of Technology, Guangzhou 510006, China"}]}],"member":"1968","published-online":{"date-parts":[[2026,1,8]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3503469","article-title":"Energy Efficient Boosting of GEMM Accelerators for DNN via Reuse","volume":"27","author":"Cicek","year":"2022","journal-title":"ACM Trans. Des. Autom. Electron. Syst."},{"key":"ref_2","first-page":"1","article-title":"A LAPACK Implementation of the Dynamic Mode Decomposition","volume":"50","year":"2024","journal-title":"ACM Trans. Math. Softw."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Nair, H., Vellaisamy, P., Chen, A., Finn, J., Li, A., Trivedi, M., and Shen, J.P. (2023, January 21\u201325). tuGEMM: Area-Power-Efficient Temporal Unary GEMM Architecture for Low-Precision Edge AI. Proceedings of the 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA.","DOI":"10.1109\/ISCAS46773.2023.10181357"},{"key":"ref_4","unstructured":"NVIDIA (2024, May 21). cuBLAS (v12.5). Available online: https:\/\/docs.nvidia.com\/cuda\/archive\/12.5.0."},{"key":"ref_5","unstructured":"Advanced Micro Devices, and Inc (2024, June 04). rocBLAS 4.1.2 Documentation. Available online: https:\/\/rocm.docs.amd.com\/projects\/rocBLAS\/en\/docs-6.1.2\/index.html."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Xu, R.G., Van Zee, F.G., and van de Geijn, R.A. (2023, January 21\u201323). Towards a Unified Implementation of GEMM in BLIS. Proceedings of the 37th ACM International Conference on Supercomputing, ICS \u201923, Orlando, FL, USA.","DOI":"10.1145\/3577193.3593707"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Abdelfattah, A., Haidar, A., Tomov, S., and Dongarra, J. (2017, January 13\u201316). Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs. Proceedings of the International Conference on Supercomputing, ICS \u201917, Chicago, IL, USA.","DOI":"10.1145\/3079079.3079103"},{"key":"ref_8","unstructured":"Nvidia (2025, October 29). CUDA Templates for Linear Algebra Subroutines. Available online: https:\/\/github.com\/NVIDIA\/cutlass."},{"key":"ref_9","unstructured":"Kerr, A., Merrill, D., Demouth, J., and Tran, J. (2026, January 06). CUTLASS: Fast Linear Algebra in CUDA C++. Nvidia, Available online: https:\/\/developer.nvidia.com\/blog\/cutlass-linear-algebra-cuda\/."},{"key":"ref_10","unstructured":"Huawei (2025, October 29). Catlass: CANN Templates for Linear Algebra Subroutines. Available online: https:\/\/gitcode.com\/cann\/catlass."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Chen, Y., and Lu, L. (2025). AscQLUT: A Decode-Fused INT4 GEMM Kernel for Accelerating Low-Bit Quantized Matrix Multiplication via Lookup Tables on Ascend 910B NPU. preprint.","DOI":"10.2139\/ssrn.5679592"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Ma, Z., Wang, H., Feng, G., Zhang, C., Xie, L., He, J., Chen, S., and Zhai, J. (2022, January 28\u201330). Efficiently emulating high-bitwidth computation with low-bitwidth hardware. Proceedings of the 36th ACM International Conference on Supercomputing, ICS \u201922, Virtual Event.","DOI":"10.1145\/3524059.3532377"},{"key":"ref_13","first-page":"148","article-title":"FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics","volume":"6","author":"Hong","year":"2024","journal-title":"Mach. Learn. Syst."},{"key":"ref_14","unstructured":"Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., and Bhosale, S. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv."},{"key":"ref_15","first-page":"30318","article-title":"GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale","volume":"35","author":"Dettmers","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_16","unstructured":"Xia, H., Zheng, Z., Wu, X., Chen, S., Yao, Z., Youn, S., Bakhtiari, A., Wyatt, M., Zhuang, D., and Zhou, A. (2024, January 10\u201312). Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs. Proceedings of the 2024 USENIX Annual Technical Conference (USENIX ATC 24), Santa Clara, CA, USA."},{"key":"ref_17","first-page":"49146","article-title":"Training Transformers with 4-bit Integers","volume":"36","author":"Xi","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_18","unstructured":"Wu, X., Li, C., Yazdani Aminabadi, R., Yao, Z., and He, Y. (2023, January 23\u201329). Understanding Int4 Quantization for Language Models: Latency Speedup, Composability, and Failure Cases. Proceedings of the 40th International Conference on Machine Learning, PMLR, Honolulu, HI, USA."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Kunkel, J.M., Balaji, P., and Dongarra, J. (2016). Performance, Design, and Autotuning of Batched GEMM for GPUs. High Performance Computing, Springer.","DOI":"10.1007\/978-3-319-41321-1"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"495","DOI":"10.1016\/j.procs.2017.05.138","article-title":"The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems","volume":"108","author":"Dongarra","year":"2017","journal-title":"Procedia Comput. Sci."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3431921","article-title":"A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines","volume":"47","author":"Abdelfattah","year":"2021","journal-title":"ACM Trans. Math. Softw."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3378176","article-title":"Enabling Highly Efficient Batched Matrix Multiplications on SW26010 Many-core Processor","volume":"17","author":"Jiang","year":"2020","journal-title":"ACM Trans. Archit. Code Optim."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Miji\u0107, N., and Davidovi\u0107, D. (2022, January 23\u201327). Batched matrix operations on distributed GPUs with application in theoretical physics. Proceedings of the 2022 45th Jubilee International Convention on Information, Communication and Electronic Technology (MIPRO), Opatija, Croatia.","DOI":"10.23919\/MIPRO55190.2022.9803591"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Li, X., Liang, Y., Yan, S., Jia, L., and Li, Y. (2019, January 16\u201320). A coordinated tiling and batching framework for efficient GEMM on GPUs. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, PPoPP \u201919, Washington, DC, USA.","DOI":"10.1145\/3293883.3295734"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1177\/1094342020965661","article-title":"Performance engineering for real and complex tall & skinny matrix multiplication kernels on GPUs","volume":"35","author":"Ernst","year":"2021","journal-title":"Int. J. High Perform. Comput. Appl."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"3842","DOI":"10.1109\/TCAD.2020.3012753","article-title":"NPU Thermal Management","volume":"39","author":"Amrouch","year":"2020","journal-title":"IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst."},{"key":"ref_27","unstructured":"Georgie, P. (2023). What Is an NPU? Here\u2019s Why Everyone\u2019s Suddenly Talking About Them, Digital Trends Media Group."},{"key":"ref_28","unstructured":"Kim, S., and Deka, G.C. (2021). Chapter Seven - Architecture of neural processing unit for deep neural networks. Hardware Accelerator Systems for Artificial Intelligence and Machine Learning, Elsevier."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Liao, H., Tu, J., Xia, J., Liu, H., Zhou, X., Yuan, H., and Hu, Y. (March, January 27). Ascend: A Scalable and Unified Architecture for Ubiquitous Deep Neural Network Computing: Industry Track Paper. Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Virtually.","DOI":"10.1109\/HPCA51647.2021.00071"},{"key":"ref_30","unstructured":"Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L.A., Welbl, J., and Clark, A. (2022). Training Compute-Optimal Large Language Models. arXiv."},{"key":"ref_31","unstructured":"Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (2017). Attention is All you Need. Advances in Neural Information Processing Systems, Curran Associates, Inc."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Guo, H., Guo, N., Meinel, C., and Yang, H. (November, January 30). Low-bit CUTLASS GEMM Template Auto-tuning using Neural Network. Proceedings of the 2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), Kaifeng, China.","DOI":"10.1109\/ISPA63168.2024.00057"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Xue, Y., Liu, Y., Nai, L., and Huang, J. (2023, January 17\u201321). V10: Hardware-Assisted NPU Multi-tenancy for Improved Resource Utilization and Fairness. Proceedings of the 50th Annual International Symposium on Computer Architecture, ISCA \u201923, Orlando, FL, USA.","DOI":"10.1145\/3579371.3589059"},{"key":"ref_34","unstructured":"Wang, C., Pang, W., Wu, X., Jun, G., Romero, L., Taka, E., Marculescu, D., Nowatzki, T., Vasireddy, P., and Melber, J. (2025). Can Asymmetric Tile Buffering Be Beneficial?. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Hovhannisyan, A. (2025, January 26\u201327). Optimizing DGEMM Using Vectorized Micro-Kernels and Memory-Aware Parallelization. Proceedings of the Computer Science and Information Technologies (CSIT) Workshop, CSIT 2025, London, UK.","DOI":"10.51408\/csit2025_81"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Wang, H., Xu, H., Yang, D., Zhou, X., and Cheng, D. (2025, January 16\u201321). HyTiS: Hybrid Tile Scheduling for GPU GEMM with Enhanced Wave Utilization and Cache Locality. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC \u201925, St. Louis, MO, USA.","DOI":"10.1145\/3712285.3759771"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"70","DOI":"10.1016\/j.jpdc.2021.02.013","article-title":"TSM2X: High-performance tall-and-skinny matrix\u2013matrix multiplication on GPUs","volume":"151","author":"Rivera","year":"2021","journal-title":"J. Parallel Distrib. Comput."},{"key":"ref_38","first-page":"267","article-title":"Efficient Mixed-Precision Tall-and-Skinny Matrix-Matrix Multiplication for GPUs","volume":"11","author":"Tang","year":"2021","journal-title":"Int. J. Netw. Comput."},{"key":"ref_39","unstructured":"Park, G., Park, B., Kim, M., Lee, S., Kim, J., Kwon, B., Kwon, S.J., Kim, B., Lee, Y., and Lee, D. (2024). LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models. arXiv."},{"key":"ref_40","unstructured":"Heo, G., Lee, S., Cho, J., Choi, H., Lee, S., Ham, H., Kim, G., Mahajan, D., and Park, J. (May, January 27). NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, La Jolla, CA, USA."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Hu, H., Xiao, B., Sun, S., Yin, J., Zhang, Z., Luo, X., Jiang, C., Xu, W., Jia, X., and Liu, X. (2025, January 16\u201321). LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC \u201925, St. Louis, MO, USA.","DOI":"10.1145\/3712285.3759852"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Neuwirth, S., Paul, A.K., Weinzierl, T., and Carson, E.C. (2026). Stream-K++: Adaptive GPU GEMM Kernel Selection and Scheduling for AI Using Bloom Filters. High Performance Computing, Springer.","DOI":"10.1007\/978-3-032-07612-0"},{"key":"ref_43","unstructured":"Taka, E., Roesti, A., Melber, J., Vasireddy, P., Denolf, K., and Marculescu, D. (2025). Striking the Balance: GEMM Performance Optimization Across Generations of Ryzen AI NPUs. arXiv."},{"key":"ref_44","unstructured":"Huawei (Huawei, 2024). Non-Contiguous-to-Contiguous Conversion (Vector Operators)-Basic Tuning-Operator Computation Perform, Huawei."},{"key":"ref_45","unstructured":"Huawei (Huawei, 2024). AI Core-Background Knowledge-TBE&AI CPU Operator Development-Operator development-7.0.0-CANN commercial edition-Ascend Documentation-Ascend Community, Huawei."},{"key":"ref_46","unstructured":"Huawei (Huawei, 2024). Hardware Architecture-Operator development-8.0.RC2.alpha003-CANN community edition-Ascend Documentation-Ascend Community, Huawei, (In Chinese)."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Anderson, A., Vasudevan, A., Keane, C., and Gregg, D. (2020, January 13\u201315). High-Performance Low-Memory Lowering: GEMM-based Algorithms for DNN Convolution. Proceedings of the 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Hilo, HI, USA.","DOI":"10.1109\/SBAC-PAD49847.2020.00024"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Han, Q., Hu, Y., Yu, F., Yang, H., Liu, B., Hu, P., Gong, R., Wang, Y., Wang, R., and Luan, Z. (2020, January 17\u201320). Extremely Low-bit Convolution Optimization for Quantized Neural Network on Modern Computer Architectures. Proceedings of the 49th International Conference on Parallel Processing, ICPP \u201920, Edmonton, AB, Canada.","DOI":"10.1145\/3404397.3404407"},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"13393","DOI":"10.1007\/s11227-022-04336-3","article-title":"A batched GEMM optimization framework for deep learning","volume":"78","author":"Yang","year":"2022","journal-title":"J. Supercomput."},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"511","DOI":"10.1177\/1094342010385729","article-title":"An Improved Magma Gemm For Fermi Graphics Processing Units","volume":"24","author":"Nath","year":"2010","journal-title":"Int. J. High Perform. Comput. Appl."},{"key":"ref_51","unstructured":"Huawei (2023, October 24). Atlas 300T A2 Training Card User Guide 03. Available online: https:\/\/support.huawei.com\/enterprise\/en\/doc\/EDOC1100338863\/5549b5ec\/performance?idPath=23710424|251366513|22892968|252309113|254184749."},{"key":"ref_52","unstructured":"Nvidia (2023, October 24). NVIDIA A800 40GB Active Graphics Card. Available online: https:\/\/www.nvidia.com\/en-us\/products\/workstations\/a800."},{"key":"ref_53","unstructured":"Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., and Lin, X.V. (2022). OPT: Open Pre-trained Transformer Language Models. arXiv."},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Dongarra, J., and Luszczek, P. (2025). HPL-MxP benchmark: Mixed-precision algorithms, iterative refinement, and scalable data generation. Int. J. High Perform. Comput. Appl.","DOI":"10.1177\/10943420251382476"},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Xue, W., Yang, K., Liu, Y., Fan, D., Xu, P., and Tian, Y. (2024, January 17). Unlocking High Performance with Low-Bit NPUs and CPUs for Highly Optimized HPL-MxP on Cloud Brain II. Proceedings of the SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA.","DOI":"10.1109\/SC41406.2024.00088"}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/15\/1\/39\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,14]],"date-time":"2026-01-14T05:17:25Z","timestamp":1768367845000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/15\/1\/39"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,8]]},"references-count":55,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2026,1]]}},"alternative-id":["computers15010039"],"URL":"https:\/\/doi.org\/10.3390\/computers15010039","relation":{},"ISSN":["2073-431X"],"issn-type":[{"value":"2073-431X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,8]]}}}