{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,7]],"date-time":"2026-01-07T23:04:28Z","timestamp":1767827068938,"version":"3.49.0"},"reference-count":27,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2015,10,26]],"date-time":"2015-10-26T00:00:00Z","timestamp":1445817600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Math. Softw."],"published-print":{"date-parts":[[2015,10,26]]},"abstract":"<jats:p>Sparse matrix--matrix multiplication (SpGEMM) is a key operation in numerous areas from information to the physical sciences. Implementing SpGEMM efficiently on throughput-oriented processors, such as the graphics processing unit (GPU), requires the programmer to expose substantial fine-grained parallelism while conserving the limited off-chip memory bandwidth. Balancing these concerns, we decompose the SpGEMM operation into three highly parallel phases: expansion, sorting, and contraction, and introduce a set of complementary bandwidth-saving performance optimizations. Our implementation is fully general and our optimization strategy adaptively processes the SpGEMM workload row-wise to substantially improve performance by decreasing the work complexity and utilizing the memory hierarchy more effectively.<\/jats:p>","DOI":"10.1145\/2699470","type":"journal-article","created":{"date-parts":[[2015,10,27]],"date-time":"2015-10-27T12:36:39Z","timestamp":1445949399000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":89,"title":["Optimizing Sparse Matrix\u2014Matrix Multiplication for the GPU"],"prefix":"10.1145","volume":"41","author":[{"given":"Steven","family":"Dalton","sequence":"first","affiliation":[{"name":"University of Illinois at Urbana--Champaign, Urbana, IL"}]},{"given":"Luke","family":"Olson","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana--Champaign, Urbana, IL"}]},{"given":"Nathan","family":"Bell","sequence":"additional","affiliation":[{"name":"Google, Mountain View, CA"}]}],"member":"320","published-online":{"date-parts":[[2015,10,26]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Lois Curfman McInnes, and Barry F. Smith","author":"Balay Satish","year":"1997","unstructured":"Satish Balay , William D. Gropp , Lois Curfman McInnes, and Barry F. Smith . 1997 . Efficient management of parallelism in object oriented numerical software libraries. In Modern Software Tools in Scientific Computing, E. Arge, A. M. Bruaset, and H. P. Langtangen (Eds.). Birkh\u00e4user Press , Berlin, 163--202. Satish Balay, William D. Gropp, Lois Curfman McInnes, and Barry F. Smith. 1997. Efficient management of parallelism in object oriented numerical software libraries. In Modern Software Tools in Scientific Computing, E. Arge, A. M. Bruaset, and H. P. Langtangen (Eds.). Birkh\u00e4user Press, Berlin, 163--202."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF02070824"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1137\/110838844"},{"key":"e_1_2_1_4_1","volume-title":"Retrieved","author":"Bell Nathan","year":"2009","unstructured":"Nathan Bell and Michael Garland . 2009 a. CUSP: Generic Parallel Algorithms for Sparse Matrix and Graph Computations . Retrieved August 28, 2015 from http:\/\/cusplibrary.github.io\/. Nathan Bell and Michael Garland. 2009a. CUSP: Generic Parallel Algorithms for Sparse Matrix and Graph Computations. Retrieved August 28, 2015 from http:\/\/cusplibrary.github.io\/."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/1654059.1654078"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2008.4536313"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342011403516"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1137\/110848244"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1693453.1693471"},{"key":"e_1_2_1_10_1","doi-asserted-by":"crossref","unstructured":"Edith Cohen. 1996. On optimizing multiplications of sparse matrices. In IPCO. 219--233.   Edith Cohen. 1996. On optimizing multiplications of sparse matrices. In IPCO. 219--233.","DOI":"10.1007\/3-540-61310-2_17"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2049662.2049663"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1839676.1839694"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2008.45"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/355791.355796"},{"key":"e_1_2_1_15_1","volume-title":"An Overview of Trilinos. Technical Report SAND2003-2927","author":"Heroux Michael","year":"2003","unstructured":"Michael Heroux , Roscoe Bartlett , Vicki Howle Robert Hoekstra , Jonathan Hu , Tamara Kolda , Richard Lehoucq , Kevin Long , Roger Pawlowski , Eric Phipps , Andrew Salinger , Heidi Thornquist , Ray Tuminaro , James Willenbring , and Alan Williams . 2003 . An Overview of Trilinos. Technical Report SAND2003-2927 . Sandia National Laboratories, Washington, DC. Michael Heroux, Roscoe Bartlett, Vicki Howle Robert Hoekstra, Jonathan Hu, Tamara Kolda, Richard Lehoucq, Kevin Long, Roger Pawlowski, Eric Phipps, Andrew Salinger, Heidi Thornquist, Ray Tuminaro, James Willenbring, and Alan Williams. 2003. An Overview of Trilinos. Technical Report SAND2003-2927. Sandia National Laboratories, Washington, DC."},{"key":"e_1_2_1_16_1","volume-title":"Thrust: A Parallel Template Library","author":"Hoberock Jared","year":"2011","unstructured":"Jared Hoberock and Nathan Bell . 2011 . Thrust: A Parallel Template Library . http:\/\/thrust.github.io\/. Jared Hoberock and Nathan Bell. 2011. Thrust: A Parallel Template Library. http:\/\/thrust.github.io\/."},{"key":"e_1_2_1_17_1","volume-title":"SC\u201911","author":"Kurzak Jakub","unstructured":"Jakub Kurzak , Stanimire Tomov , and Jack Dongarra . 2011. Autotuning GEMMs for Fermi . In SC\u201911 , Seattle, WA . Jakub Kurzak, Stanimire Tomov, and Jack Dongarra. 2011. Autotuning GEMMs for Fermi. In SC\u201911, Seattle, WA."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2145816.2145832"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1142\/S0129626411000187"},{"key":"e_1_2_1_21_1","volume-title":"Gpu gems 3","author":"Nguyen Hubert","unstructured":"Hubert Nguyen . 2007. Gpu gems 3 ( first ed.). Addison-Wesley Professional , Indianapolis, IN . Hubert Nguyen. 2007. Gpu gems 3 (first ed.). Addison-Wesley Professional, Indianapolis, IN."},{"key":"e_1_2_1_22_1","unstructured":"NVIDIA. 2010. TESLA C2050\/C2070 GPU Computing Processor Supercomputing at 1\/10th the Cost. NVIDIA Corporation. www.nvidia.com\/docs\/IO\/43395\/NV&lowbar;DS&lowbar;Tesla&lowbar;C2050&lowbar;C2070&lowbar;jul10&lowbar;lores.pdf.  NVIDIA. 2010. TESLA C2050\/C2070 GPU Computing Processor Supercomputing at 1\/10th the Cost. NVIDIA Corporation. www.nvidia.com\/docs\/IO\/43395\/NV&lowbar;DS&lowbar;Tesla&lowbar;C2050&lowbar;C2070&lowbar;jul10&lowbar;lores.pdf."},{"key":"e_1_2_1_23_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2012","unstructured":"NVIDIA. 2012 a. cuSPARSE: Users Guide . Retrieved August 28, 2015 from http:\/\/developer.nvidia.com\/cusparse. (2012). NVIDIA. 2012a. cuSPARSE: Users Guide. Retrieved August 28, 2015 from http:\/\/developer.nvidia.com\/cusparse. (2012)."},{"key":"e_1_2_1_24_1","volume-title":"NVIDIA CUDA Programming Guide","author":"NVIDIA.","year":"2015","unstructured":"NVIDIA. 2012b. NVIDIA CUDA Programming Guide . NVIDIA Corporation . Retrieved August 28, 2015 from http:\/\/www.nvidia.com\/CUDA Version 5.0. NVIDIA. 2012b. NVIDIA CUDA Programming Guide. NVIDIA Corporation. Retrieved August 28, 2015 from http:\/\/www.nvidia.com\/CUDA Version 5.0."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1016\/0196-6774(89)90005-9"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-03641-5_4"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2008.12.006"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/1498765.1498785"}],"container-title":["ACM Transactions on Mathematical Software"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2699470","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2699470","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T06:16:58Z","timestamp":1750227418000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2699470"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,10,26]]},"references-count":27,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2015,10,26]]}},"alternative-id":["10.1145\/2699470"],"URL":"https:\/\/doi.org\/10.1145\/2699470","relation":{},"ISSN":["0098-3500","1557-7295"],"issn-type":[{"value":"0098-3500","type":"print"},{"value":"1557-7295","type":"electronic"}],"subject":[],"published":{"date-parts":[[2015,10,26]]},"assertion":[{"value":"2013-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2014-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-10-26","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}