{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T15:33:58Z","timestamp":1772724838833,"version":"3.50.1"},"reference-count":63,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,11,20]],"date-time":"2024-11-20T00:00:00Z","timestamp":1732060800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Key Research and Development Program of China","award":["2023YFB3001503"],"award-info":[{"award-number":["2023YFB3001503"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2024,12,31]]},"abstract":"<jats:p>\n            Sparse-dense matrix multiplication (SpMM) is the performance bottleneck of many high-performance and deep-learning applications, making it attractive to design specialized SpMM hardware accelerators. Unfortunately, existing hardware solutions do not take full advantage of data reuse opportunities of the input and output matrices or suffer from irregular memory access patterns. Their strategies increase the off-chip memory traffic and bandwidth pressure, leaving much room for improvement. We present\n            <jats:sc>Mentor<\/jats:sc>\n            , a new approach to designing SpMM accelerators. Our key insight is that column-wise dataflow, while rarely exploited in prior works, can address these issues in SpMM computations.\n            <jats:sc>Mentor<\/jats:sc>\n            is a software-hardware co-design approach for leveraging column-wise dataflow to improve data reuse and regular memory accesses of SpMM. On the software level,\n            <jats:sc>Mentor<\/jats:sc>\n            incorporates a novel streaming construction scheme to preprocess the input matrix for enabling a streaming access pattern. 
On the hardware level, it employs a fully pipelined design to unlock the potential of column-wise dataflow further. The design of\n            <jats:sc>Mentor<\/jats:sc>\n            is underpinned by a carefully designed analytical model to find the tradeoff between performance and hardware resources. We have implemented an FPGA prototype of\n            <jats:sc>Mentor<\/jats:sc>\n            . Experimental results show that\n            <jats:sc>Mentor<\/jats:sc>\n            achieves speedup by geomean 2.05\u00d7 (up to 3.98\u00d7), reduces the memory traffic by geomean 2.92\u00d7 (up to 4.93\u00d7), and improves bandwidth utilization by geomean 1.38\u00d7 (up to 2.89\u00d7), compared with the state-of-the-art hardware solutions.\n          <\/jats:p>","DOI":"10.1145\/3688612","type":"journal-article","created":{"date-parts":[[2024,8,26]],"date-time":"2024-08-26T10:01:11Z","timestamp":1724666471000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-2673-8673","authenticated-orcid":false,"given":"Xiaobo","family":"Lu","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology, National University of Defense Technology, Changsha, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3542-4869","authenticated-orcid":false,"given":"Jianbin","family":"Fang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, National University of Defense Technology, Changsha, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6828-3364","authenticated-orcid":false,"given":"Lin","family":"Peng","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, National University of Defense Technology, Changsha, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0317-8192","authenticated-orcid":false,"given":"Chun","family":"Huang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, National University of Defense Technology, Changsha, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7603-4210","authenticated-orcid":false,"given":"Zidong","family":"Du","sequence":"additional","affiliation":[{"name":"Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5503-4457","authenticated-orcid":false,"given":"Yongwei","family":"Zhao","sequence":"additional","affiliation":[{"name":"Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-7858-6238","authenticated-orcid":false,"given":"Zheng","family":"Wang","sequence":"additional","affiliation":[{"name":"Northwest University, Xi'an, China"}]}],"member":"320","published-online":{"date-parts":[[2024,11,20]]},"reference":[{"key":"e_1_3_3_2_2","first-page":"75","volume-title":"SpringSim (HPS\u201915)","author":"Anzt Hartwig","year":"2015","unstructured":"Hartwig Anzt, Stanimire Tomov, and Jack J. Dongarra. 2015. Accelerating the LOBPCG method on GPUs using a blocked sparse matrix vector product. In SpringSim (HPS\u201915). 
75\u201382."},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.23919\/DATE48585.2020.9116501"},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00029"},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACT52795.2021.00016"},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3559009.3569691"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/1840845.1840883"},{"issue":"1","key":"e_1_3_3_8_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2049662.2049663","article-title":"The University of Florida sparse matrix collection","volume":"38","author":"Davis Timothy A.","year":"2011","unstructured":"Timothy A. Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011), 1\u201325.","journal-title":"ACM Transactions on Mathematical Software (TOMS)"},{"key":"e_1_3_3_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304041"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2018.00024"},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00090"},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01464"},{"key":"e_1_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.5555\/3433701.3433723"},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO50266.2020.00079"},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589054"},{"key":"e_1_3_3_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3508041"},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001163"},{"key":"e_1_3_3_18_2","unstructured":"Mikael Henaff Joan Bruna and Yann LeCun. 2015. Deep convolutional networks on graph-structured data. arXiv:1506.05163. 
Retrieved from https:\/\/arxiv.org\/abs\/1506.05163"},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00017"},{"key":"e_1_3_3_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2019.2912923"},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00076"},{"key":"e_1_3_3_22_2","unstructured":"Advanced Micro Devices Inc. 2022. UltraScale Architecture-Based FPGAs Memory IP Product Guide (PG150). Retrieved from https:\/\/docs.amd.com\/v\/u\/en-US\/pg150-ultrascale-memory-ip"},{"key":"e_1_3_3_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358286"},{"key":"e_1_3_3_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3016078.2851152"},{"key":"e_1_3_3_25_2","unstructured":"Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations."},{"key":"e_1_3_3_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00070"},{"key":"e_1_3_3_27_2","doi-asserted-by":"crossref","unstructured":"Shiqing Li Shuo Huai and Weichen Liu. 2023. An efficient Gustavson-based sparse matrix\u2013matrix multiplication accelerator on embedded FPGAs. 
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42 12 (2023) 4671\u20134680.","DOI":"10.1109\/TCAD.2023.3281719"},{"key":"e_1_3_3_28_2","doi-asserted-by":"publisher","DOI":"10.23919\/DATE56975.2023.10136958"},{"key":"e_1_3_3_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575706"},{"key":"e_1_3_3_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3466752.3480125"},{"key":"e_1_3_3_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3582016.3582069"},{"key":"e_1_3_3_32_2","unstructured":"Maxim Naumov Dheevatsa Mudigere Hao-Jun Shi Jianyu Huang Narayanan Sundaraman Jongsoo Park Xiaodong Wang Udit Gupta Carole-Jean Wu Alisson Azzolini Dmytro Dzhulgakov Andrey Mallevich Ilia Cherniavskii Yinghai Lu Raghuraman Krishnamoorthi Ansha Yu Volodymyr Kondratenko Stephanie Pereira Xianjie Chen and Misha Smelyanskiy. 2019. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091 (2019)."},{"key":"e_1_3_3_33_2","doi-asserted-by":"crossref","unstructured":"Toluwanimi O. Odemuyiwa Hadi Asghari-Moghaddam Michael Pellauer Kartik Hegde Po-An Tsai Neal C. Crago Aamer Jaleel John D. Owens Edgar Solomonik Joel S. Emer and Fletcher W. Christopher. 2023. Accelerating sparse data orchestration via dynamic reflexive tiling. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems Vol. 
3 18\u201332.","DOI":"10.1145\/3582016.3582064"},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00067"},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3140659.3080254"},{"key":"e_1_3_3_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00015"},{"key":"e_1_3_3_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503222.3507738"},{"key":"e_1_3_3_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS49936.2021.00034"},{"key":"e_1_3_3_39_2","doi-asserted-by":"crossref","unstructured":"Ahmet Erdem Sar\u0131y\u00fcce Erik Saule Kamer Kaya and Umit V. \u00c7ataly\u00fcrek. 2015. Regularizing graph centrality computations. Journal of Parallel and Distributed Computing 76 C (2015) 106\u2013119.","DOI":"10.1016\/j.jpdc.2014.07.006"},{"key":"e_1_3_3_40_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-93417-4_38"},{"key":"e_1_3_3_41_2","unstructured":"Haihao Shen Hengyu Meng Bo Dong Zhe Wang Ofir Zafrir Yi Ding Yu Luo Hanwen Chang Qun Gao Ziheng Wang Guy Boudoukh and Moshe Wasserblat. 2023. An efficient sparse inference software accelerator for transformer-based language models on CPUs. arXiv preprint arXiv:2306.16601 (2023)."},{"key":"e_1_3_3_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3490422.3502357"},{"key":"e_1_3_3_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO50266.2020.00068"},{"key":"e_1_3_3_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00062"},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304624"},{"key":"e_1_3_3_46_2","doi-asserted-by":"crossref","unstructured":"Ruiqin Tian Luanzheng Guo Jiajia Li Bin Ren and Gokcen Kestor. 2021. A high performance sparse tensor algebra compiler in MLIR. 
In 2021 IEEE\/ACM 7th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC) IEEE 27\u201338.","DOI":"10.1109\/LLVMHPC54804.2021.00009"},{"key":"e_1_3_3_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD51958.2021.9643506"},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00018"},{"key":"e_1_3_3_49_2","doi-asserted-by":"crossref","unstructured":"Jianhua Gao Weixing Ji Fangli Chang Shiyu Han Bingxin Wei Zeming Liu and Yizhuo Wang. 2023. A systematic survey of general sparse matrix-matrix multiplication. ACM Computing Surveys 55 12 (2023) 1\u201336.","DOI":"10.1145\/3571157"},{"key":"e_1_3_3_50_2","unstructured":"Wei Wen Chunpeng Wu Yandan Wang Yiran Chen and Hai Li. 2016. Learning structured sparsity in deep neural networks. Advances in Neural Information Processing Systems 29 (2016) 2074\u20132082."},{"key":"e_1_3_3_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2020.2978386"},{"key":"e_1_3_3_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/3613424.3623793"},{"key":"e_1_3_3_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2014.44"},{"key":"e_1_3_3_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00012"},{"key":"e_1_3_3_55_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-96983-1_48"},{"key":"e_1_3_3_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3307650.3322271"},{"key":"e_1_3_3_57_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33017370"},{"key":"e_1_3_3_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA53966.2022.00041"},{"key":"e_1_3_3_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA56546.2023.10071027"},{"key":"e_1_3_3_60_2","doi-asserted-by":"publisher","DOI":"10.1145\/3445814.3446702"},{"key":"e_1_3_3_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783723"},{"key":"e_1_3_3_62_2","doi-asserted-by":"publisher","DOI":"10.1186\/s40649-019-0069-y"},{"key":"e_1_3_3_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA4754
9.2020.00030"},{"key":"e_1_3_3_64_2","doi-asserted-by":"publisher","DOI":"10.1145\/2854150"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3688612","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3688612","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:10:30Z","timestamp":1750295430000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3688612"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,20]]},"references-count":63,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,12,31]]}},"alternative-id":["10.1145\/3688612"],"URL":"https:\/\/doi.org\/10.1145\/3688612","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,20]]},"assertion":[{"value":"2024-03-07","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-07-23","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}