{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:59:40Z","timestamp":1750309180836,"version":"3.41.0"},"reference-count":31,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,2,15]],"date-time":"2024-02-15T00:00:00Z","timestamp":1707955200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2024,3,31]]},"abstract":"<jats:p>An important sparse tensor computation is sparse-tensor-dense-matrix multiplication (SpTM), which is used in tensor decomposition and applications. SpTM is a multi-dimensional analog to sparse-matrix-dense-matrix multiplication (SpMM). In this article, we employ a hierarchical tensor data layout that can unfold a multidimensional tensor to derive a 2D matrix, making it possible to compute SpTM using SpMM kernel implementations for GPUs. We compare two SpMM implementations to the state-of-the-art PASTA sparse tensor contraction implementation using: (1) SpMM with hierarchical tensor data layout; and, (2) unfolding followed by an invocation of cuSPARSE\u2019s SpMM. Results show that SpMM can outperform PASTA 70.9% of the time, but none of the three approaches is best overall. 
Therefore, we use a decision tree classifier to identify the best performing sparse tensor contraction kernel based on precomputed properties of the sparse tensor.<\/jats:p>","DOI":"10.1145\/3633462","type":"journal-article","created":{"date-parts":[[2023,12,28]],"date-time":"2023-12-28T21:58:00Z","timestamp":1703800680000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Exploring Data Layout for Sparse Tensor Times Dense Matrix on GPUs"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5966-5774","authenticated-orcid":false,"given":"Khalid","family":"Ahmad","sequence":"first","affiliation":[{"name":"University of Utah, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3542-7675","authenticated-orcid":false,"given":"Cris","family":"Cecka","sequence":"additional","affiliation":[{"name":"NVIDIA Corporation, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6093-7602","authenticated-orcid":false,"given":"Michael","family":"Garland","sequence":"additional","affiliation":[{"name":"NVIDIA Corporation, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3058-7573","authenticated-orcid":false,"given":"Mary","family":"Hall","sequence":"additional","affiliation":[{"name":"University of Utah, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,2,15]]},"reference":[{"key":"e_1_3_1_2_2","first-page":"1117","volume-title":"Proceedings of the 2014 IEEE International Conference on High Performance Computing and Communications, 2014 IEEE 6th International Symposium on Cyberspace Safety and Security, 2014 IEEE 11th International Conference on Embedded Software and System (HPCC, CSS, ICESS)","author":"Abu-Sufah Walid","year":"2014","unstructured":"Walid Abu-Sufah and Khalid Ahmad. 2014. On implementing sparse matrix multi-vector multiplication on GPUs. 
In Proceedings of the 2014 IEEE International Conference on High Performance Computing and Communications, 2014 IEEE 6th International Symposium on Cyberspace Safety and Security, 2014 IEEE 11th International Conference on Embedded Software and System (HPCC, CSS, ICESS). IEEE, 1117\u20131124."},{"key":"e_1_3_1_3_2","first-page":"218","volume-title":"Proceedings of the International Workshop on Languages and Compilers for Parallel Computing","author":"Ahmad Khalid","year":"2016","unstructured":"Khalid Ahmad, Anand Venkat, and Mary Hall. 2016. Optimizing LOBPCG: Sparse matrix loop and data transformations in action. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computing. Springer, 218\u2013232."},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2014.125"},{"key":"e_1_3_1_5_2","unstructured":"Brett W. Bader, Tamara G. Kolda, et\u00a0al. 2015. Matlab tensor toolbox version 2.6. Available online. Retrieved from https:\/\/www.tensortoolbox.org\/"},{"key":"e_1_3_1_6_2","doi-asserted-by":"crossref","first-page":"359","DOI":"10.1016\/B978-0-12-385963-1.00026-5","volume-title":"Proceedings of the GPU Computing Gems Jade Edition","author":"Bell Nathan","year":"2012","unstructured":"Nathan Bell and Jared Hoberock. 2012. Thrust: A productivity-oriented library for CUDA. In Proceedings of the GPU Computing Gems Jade Edition. 
Elsevier, 359\u2013371."},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1137\/S0895479896305696"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/567806.567810"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.5555\/3433701.3433723"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3447818.3461703"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3208040.3208062"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3293883.3295712"},{"key":"e_1_3_1_13_2","first-page":"1","volume-title":"Proceedings of the SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Huang Guyue","year":"2020","unstructured":"Guyue Huang, Guohao Dai, Yu Wang, and Huazhong Yang. 2020. Ge-spmm: General-purpose sparse matrix-matrix multiplication on GPUs for graph neural networks. In Proceedings of the SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1\u201312."},{"key":"e_1_3_1_14_2","volume-title":"An Introduction to Statistical Learning","author":"James Gareth","year":"2017","unstructured":"Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2017. An Introduction to Statistical Learning. Springer."},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.5555\/AAI28928307"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1137\/07070111X"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2015.2401575"},{"key":"e_1_3_1_18_2","first-page":"193","volume-title":"Proceedings of the 2020 IEEE International Symposium on Workload Characterization (IISWC)","author":"Li Jiajia","year":"2020","unstructured":"Jiajia Li, Mahesh Lakshminarasimhan, Xiaolong Wu, Ang Li, Catherine Olschanowsky, and Kevin Barker. 2020. A sparse tensor benchmark suite for CPUs and GPUs. 
In Proceedings of the 2020 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 193\u2013204."},{"key":"e_1_3_1_19_2","first-page":"26","volume-title":"Proceedings of the 2016 6th Workshop on Irregular Applications: Architecture and Algorithms (IA3)","author":"Li Jiajia","year":"2016","unstructured":"Jiajia Li, Yuchen Ma, Chenggang Yan, and Richard Vuduc. 2016. Optimizing sparse tensor times matrix on multi-core and many-core architectures. In Proceedings of the 2016 6th Workshop on Irregular Applications: Architecture and Algorithms (IA3). IEEE, 26\u201333."},{"key":"e_1_3_1_20_2","first-page":"238","volume-title":"Proceedings of the SC18: International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Li Jiajia","year":"2018","unstructured":"Jiajia Li, Jimeng Sun, and Richard Vuduc. 2018. HiCOO: Hierarchical storage of sparse tensors. In Proceedings of the SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 238\u2013252."},{"key":"e_1_3_1_21_2","first-page":"129","volume-title":"Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation: 5th International Workshop, PMBS 2014","author":"Lo Yu Jung","year":"2015","unstructured":"Yu Jung Lo, Samuel Williams, Brian Van Straalen, Terry J. Ligocki, Matthew J. Cordery, Nicholas J. Wright, Mary W. Hall, and Leonid Oliker. 2015. Roofline model toolkit: A practical tool for architectural and program analysis. In Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation: 5th International Workshop, PMBS 2014. 
Springer, 129\u2013148."},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2018.07.018"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3016078.2851190"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3350755.3400245"},{"key":"e_1_3_1_25_2","unstructured":"NVIDIA. 2021. CUSPARSE. https:\/\/docs.nvidia.com\/cuda\/cusparse\/index.html"},{"key":"e_1_3_1_26_2","unstructured":"NVIDIA. 2021. CUTENSOR. https:\/\/docs.nvidia.com\/cuda\/cutensor\/index.html"},{"key":"e_1_3_1_27_2","first-page":"193","volume-title":"Proceedings of the 2016 IEEE 23rd International Conference on High Performance Computing (HiPC)","author":"Shi Yang","year":"2016","unstructured":"Yang Shi, Uma Naresh Niranjan, Animashree Anandkumar, and Cris Cecka. 2016. Tensor contractions with extended BLAS kernels on CPU and GPU. In Proceedings of the 2016 IEEE 23rd International Conference on High Performance Computing (HiPC). IEEE, 193\u2013202."},{"key":"e_1_3_1_28_2","unstructured":"S. Smith, J. W. Choi, J. Li, R. Vuduc, J. Park, X. Liu, and G. Karypis. 2017. The Formidable Repository of Open Sparse Tensors and Tools. http:\/\/frostt.io\/"},{"key":"e_1_3_1_29_2","first-page":"61","volume-title":"Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium","author":"Smith Shaden","year":"2015","unstructured":"Shaden Smith, Niranjay Ravindran, Nicholas D. Sidiropoulos, and George Karypis. 2015. SPLATT: Efficient and parallel sparse tensor-matrix multiplication. In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium. IEEE, 61\u201370."},{"key":"e_1_3_1_30_2","unstructured":"Edgar Solomonik and Torsten Hoefler. 2015. Sparse tensor algebra as a parallel programming model. arXiv:1512.00066. 
Retrieved from https:\/\/arxiv.org\/abs\/1512.00066"},{"key":"e_1_3_1_31_2","first-page":"672","volume-title":"Proceedings of the European Conference on Parallel Processing","author":"Yang Carl","year":"2018","unstructured":"Carl Yang, Ayd\u0131n Bulu\u00e7, and John D. Owens. 2018. Design principles for sparse matrix multiplication on the GPU. In Proceedings of the European Conference on Parallel Processing. Springer, 672\u2013687."},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1016\/0304-3975(80)90073-0"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3633462","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3633462","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:54:01Z","timestamp":1750287241000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3633462"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,15]]},"references-count":31,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,3,31]]}},"alternative-id":["10.1145\/3633462"],"URL":"https:\/\/doi.org\/10.1145\/3633462","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2024,2,15]]},"assertion":[{"value":"2023-05-20","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-11-02","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2024-02-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}