{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T15:52:46Z","timestamp":1764172366855,"version":"3.41.0"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2019,12,17]],"date-time":"2019-12-17T00:00:00Z","timestamp":1576540800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"NSF","doi-asserted-by":"publisher","award":["1704715"],"award-info":[{"award-number":["1704715"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"name":"U.S. Government"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2019,12,31]]},"abstract":"<jats:p>\n            We optimize Sparse Matrix Vector multiplication (SpMV) using a mixed precision strategy (MpSpMV) for Nvidia V100 GPUs. The approach has three benefits: (1) It reduces computation time, (2) it reduces the size of the input matrix and therefore reduces data movement, and (3) it provides an opportunity for increased parallelism. MpSpMV\u2019s decision to lower to single precision is\n            <jats:italic>data driven<\/jats:italic>\n            , based on individual nonzero values of the sparse matrix. 
On all real-valued matrices from the Sparse Matrix Collection, we obtain a maximum speedup of 2.61\u00d7 and average speedup of 1.06\u00d7 over double precision, while maintaining higher accuracy compared to single precision.\n          <\/jats:p>","DOI":"10.1145\/3371275","type":"journal-article","created":{"date-parts":[[2019,12,18]],"date-time":"2019-12-18T13:21:11Z","timestamp":1576675271000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":23,"title":["Data-driven Mixed Precision Sparse Matrix Vector Multiplication for GPUs"],"prefix":"10.1145","volume":"16","author":[{"given":"Khalid","family":"Ahmad","sequence":"first","affiliation":[{"name":"University of Utah, Salt Lake City, UT"}]},{"given":"Hari","family":"Sundar","sequence":"additional","affiliation":[{"name":"University of Utah, Salt Lake City, UT"}]},{"given":"Mary","family":"Hall","sequence":"additional","affiliation":[{"name":"University of Utah, Salt Lake City, UT"}]}],"member":"320","published-online":{"date-parts":[[2019,12,17]]},"reference":[{"volume-title":"Languages and Compilers for Parallel Computing","author":"Ahmad Khalid","key":"e_1_2_1_1_1","unstructured":"Khalid Ahmad , Anand Venkat , and Mary Hall . 2017. Optimizing LOBPCG: Sparse matrix loop and data transformations in action . In Languages and Compilers for Parallel Computing , Chen Ding, John Criswell, and Peng Wu (Eds.). Springer International Publishing , Cham , 218--232. Khalid Ahmad, Anand Venkat, and Mary Hall. 2017. Optimizing LOBPCG: Sparse matrix loop and data transformations in action. In Languages and Compilers for Parallel Computing, Chen Ding, John Criswell, and Peng Wu (Eds.). Springer International Publishing, Cham, 218--232."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2016.2630699"},{"key":"e_1_2_1_3_1","volume-title":"Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium. 
1213--1222","author":"Aktulga H. M.","year":"2014","unstructured":"H. M. Aktulga , A. Buluc , S. Williams , and C. Yang . 2014. Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations . In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium. 1213--1222 . DOI:https:\/\/doi.org\/10.1109\/IPDPS. 2014 .125 10.1109\/IPDPS.2014.125 H. M. Aktulga, A. Buluc, S. Williams, and C. Yang. 2014. Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations. In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium. 1213--1222. DOI:https:\/\/doi.org\/10.1109\/IPDPS.2014.125"},{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the International Workshop on Applied Parallel Computing. Springer, 121--130","author":"Amestoy Patrick R.","year":"2000","unstructured":"Patrick R. Amestoy , Iain S. Duff , Jean Yves L\u2019Excellent , and Jacko Koster . 2000 . MUMPS: A general purpose distributed memory sparse solver . In Proceedings of the International Workshop on Applied Parallel Computing. Springer, 121--130 . Patrick R. Amestoy, Iain S. Duff, Jean Yves L\u2019Excellent, and Jacko Koster. 2000. MUMPS: A general purpose distributed memory sparse solver. In Proceedings of the International Workshop on Applied Parallel Computing. Springer, 121--130."},{"volume-title":"An Introduction to Numerical Analysis","author":"Atkinson Kendall E.","key":"e_1_2_1_5_1","unstructured":"Kendall E. Atkinson . 2008. An Introduction to Numerical Analysis . John Wiley & Sons. Kendall E. Atkinson. 2008. An Introduction to Numerical Analysis. 
John Wiley & Sons."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2005.52"},{"key":"e_1_2_1_8_1","volume-title":"CUSP: Generic parallel algorithms for sparse matrix and graph computations","author":"Bell Nathran Steven","year":"2015","unstructured":"Nathran Steven Bell and Michael Dalton Garland . 2015 . CUSP: Generic parallel algorithms for sparse matrix and graph computations , 2015. Version 0.5.0. Retrieved March 9, 2015 from http:\/\/cusplibrary.github.io. Nathran Steven Bell and Michael Dalton Garland. 2015. CUSP: Generic parallel algorithms for sparse matrix and graph computations, 2015. Version 0.5.0. Retrieved March 9, 2015 from http:\/\/cusplibrary.github.io."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/361002.361007"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1377596.1377597"},{"key":"e_1_2_1_11_1","unstructured":"NVIDIA CUSPARSE. 2019. CUBLAS libraries.  NVIDIA CUSPARSE. 2019. CUBLAS libraries."},{"volume-title":"Department of Computer and Information Science and Engineering","author":"Davis Timothy A.","key":"e_1_2_1_12_1","unstructured":"Timothy A. Davis . 2003. Umfpack Version 4.1 User Guide . Department of Computer and Information Science and Engineering , University of Florida (2003) . Timothy A. Davis. 2003. Umfpack Version 4.1 User Guide. Department of Computer and Information Science and Engineering, University of Florida (2003)."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2049662.2049663"},{"key":"e_1_2_1_14_1","unstructured":"Manuel Le Gallo Irem Boybat Bipin Rajendran Abu Sebastian Evangelos Eleftheriou etal 2017. Mixed-precision training of deep neural networks using computational memory. arXiv preprint arXiv:1712.01192 (2017).  Manuel Le Gallo Irem Boybat Bipin Rajendran Abu Sebastian Evangelos Eleftheriou et al. 2017. Mixed-precision training of deep neural networks using computational memory. 
arXiv preprint arXiv:1712.01192 (2017)."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2013.6645508"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1080\/17445760601122076"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cpc.2012.09.022"},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the International Conference on Machine Learning. 1737--1746","author":"Gupta Suyog","year":"2015","unstructured":"Suyog Gupta , Ankur Agrawal , Kailash Gopalakrishnan , and Pritish Narayanan . 2015 . Deep learning with limited numerical precision . In Proceedings of the International Conference on Machine Learning. 1737--1746 . Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited numerical precision. In Proceedings of the International Conference on Machine Learning. 1737--1746."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14529\/jsfi160203"},{"volume-title":"The End of Error: Unum Computing","author":"Gustafson John L.","key":"e_1_2_1_20_1","unstructured":"John L. Gustafson . 2017. The End of Error: Unum Computing . Chapman & Hall\/CRC. John L. Gustafson. 2017. The End of Error: Unum Computing. Chapman & Hall\/CRC."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.14529\/jsfi170206"},{"volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201918)","author":"Haidar Azzam","key":"e_1_2_1_22_1","unstructured":"Azzam Haidar , Stanimire Tomov , Jack Dongarra , and Nicholas J. Higham . 2018. Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers . In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201918) . IEEE Press, Los Alamitos, CA. Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J. Higham. 2018. 
Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201918). IEEE Press, Los Alamitos, CA."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3148226.3148237"},{"key":"e_1_2_1_24_1","volume-title":"Bailey","author":"Hida Yozo","year":"2007","unstructured":"Yozo Hida , Xiaoye S. Li , and David H . Bailey . 2007 . Library for Double-double and Quad-double Arithmetic . Technical Report. NERSC Division, Lawrence Berkeley National Laboratory (2007). Yozo Hida, Xiaoye S. Li, and David H. Bailey. 2007. Library for Double-double and Quad-double Arithmetic. Technical Report. NERSC Division, Lawrence Berkeley National Laboratory (2007)."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2015.2401575"},{"key":"e_1_2_1_26_1","volume-title":"Doing moore with less\u2014Leapfrogging Moore\u2019s law with inexactness for supercomputing. CoRR abs\/1610.02606","author":"Leyffer Sven","year":"2016","unstructured":"Sven Leyffer , Stefan M. Wild , Mike Fagan , Marc Snir , Krishna V. Palem , Kazutomo Yoshii , and Hal Finkel . 2016. Doing moore with less\u2014Leapfrogging Moore\u2019s law with inexactness for supercomputing. CoRR abs\/1610.02606 ( 2016 ). arxiv:1610.02606 http:\/\/arxiv.org\/abs\/1610.02606. Sven Leyffer, Stefan M. Wild, Mike Fagan, Marc Snir, Krishna V. Palem, Kazutomo Yoshii, and Hal Finkel. 2016. Doing moore with less\u2014Leapfrogging Moore\u2019s law with inexactness for supercomputing. CoRR abs\/1610.02606 (2016). 
arxiv:1610.02606 http:\/\/arxiv.org\/abs\/1610.02606."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/1089014.1089017"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/567806.567808"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC.2017.8091031"},{"key":"e_1_2_1_30_1","volume-title":"Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201918)","author":"Markidis S.","year":"2018","unstructured":"S. Markidis , S. W. D. Chien , E. Laure , I. B. Peng , and J. S. Vetter . 2018. NVIDIA tensor core programmability, performance precision . In Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201918) . 522--531. DOI:https:\/\/doi.org\/10.1109\/IPDPSW. 2018 .00091 10.1109\/IPDPSW.2018.00091 S. Markidis, S. W. D. Chien, E. Laure, I. B. Peng, and J. S. Vetter. 2018. NVIDIA tensor core programmability, performance precision. In Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201918). 522--531. DOI:https:\/\/doi.org\/10.1109\/IPDPSW.2018.00091"},{"key":"e_1_2_1_31_1","unstructured":"Paulius Micikevicius Sharan Narang Jonah Alben Gregory Diamos Erich Elsen David Garcia Boris Ginsburg Michael Houston Oleksii Kuchaiev Ganesh Venkatesh etal 2017. Mixed precision training. arXiv preprint arXiv:1710.03740 (2017).  Paulius Micikevicius Sharan Narang Jonah Alben Gregory Diamos Erich Elsen David Garcia Boris Ginsburg Michael Houston Oleksii Kuchaiev Ganesh Venkatesh et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740 (2017)."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCAS.2018.8351656"},{"key":"e_1_2_1_33_1","unstructured":"Tesla NVIDIA. 2017. NVIDIA Tesla V100 GPU Architecture.  Tesla NVIDIA. 2017. 
NVIDIA Tesla V100 GPU Architecture."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2503296"},{"volume-title":"Iterative Methods for Sparse Linear Systems","author":"Saad Yousef","key":"e_1_2_1_35_1","unstructured":"Yousef Saad . 2003. Iterative Methods for Sparse Linear Systems . Vol. 82 . SIAM. Yousef Saad. 2003. Iterative Methods for Sparse Linear Systems. Vol. 82. SIAM."},{"key":"#cr-split#-e_1_2_1_36_1.1","doi-asserted-by":"crossref","unstructured":"W. A. Sufah and K. Ahmad. 2014. On implementing sparse matrix multi-vector multiplication on GPUs. In Proceedings of the 2014 IEEE International Conference on High Performance Computing and Communications 2014 IEEE 6th International Symposium on Cyberspace Safety and Security and 2014 IEEE 11th International Conference on Embedded Software and Systems (HPCC-CSS-ICESS'14). 1117--1124. DOI:https:\/\/doi.org\/10.1109\/HPCC.2014.165 10.1109\/HPCC.2014.165","DOI":"10.1109\/HPCC.2014.165"},{"key":"#cr-split#-e_1_2_1_36_1.2","doi-asserted-by":"crossref","unstructured":"W. A. Sufah and K. Ahmad. 2014. On implementing sparse matrix multi-vector multiplication on GPUs. In Proceedings of the 2014 IEEE International Conference on High Performance Computing and Communications 2014 IEEE 6th International Symposium on Cyberspace Safety and Security and 2014 IEEE 11th International Conference on Embedded Software and Systems (HPCC-CSS-ICESS'14). 1117--1124. 
DOI:https:\/\/doi.org\/10.1109\/HPCC.2014.165","DOI":"10.1109\/HPCC.2014.165"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3371275","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3371275","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:53:04Z","timestamp":1750204384000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3371275"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,12,17]]},"references-count":36,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2019,12,31]]}},"alternative-id":["10.1145\/3371275"],"URL":"https:\/\/doi.org\/10.1145\/3371275","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2019,12,17]]},"assertion":[{"value":"2019-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-12-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}