{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T09:33:29Z","timestamp":1763458409236,"version":"3.45.0"},"reference-count":57,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2016,3,28]],"date-time":"2016-03-28T00:00:00Z","timestamp":1459123200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"T\u00fcbitak","award":["110E028"],"award-info":[{"award-number":["110E028"]}]},{"name":"National Science Foundation","award":["CCF 1017077"],"award-info":[{"award-number":["CCF 1017077"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2016,4,5]]},"abstract":"<jats:p>Runtime specialization is used for optimizing programs based on partial information available only at runtime. In this paper we apply autotuning on runtime specialization of Sparse Matrix-Vector Multiplication to predict a best specialization method among several. In 91% to 96% of the predictions, either the best or the second-best method is chosen. Predictions achieve average speedups that are very close to the speedups achievable when only the best methods are used. By using an efficient code generator and a carefully designed set of matrix features, we show the runtime costs can be amortized to bring performance benefits for many real-world cases.<\/jats:p>","DOI":"10.1145\/2851500","type":"journal-article","created":{"date-parts":[[2016,3,28]],"date-time":"2016-03-28T08:53:25Z","timestamp":1459155205000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["Autotuning Runtime Specialization for Sparse Matrix-Vector Multiplication"],"prefix":"10.1145","volume":"13","author":[{"given":"Buse","family":"Yilmaz","sequence":"first","affiliation":[{"name":"Ozyegin University, Istanbul, Turkey"}]},{"given":"Bari\u015f","family":"Aktemur","sequence":"additional","affiliation":[{"name":"Ozyegin University, Istanbul, Turkey"}]},{"given":"Mar\u00edA J.","family":"Garzar\u00e1n","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign and Intel Corporation, IL, USA"}]},{"given":"Sam","family":"Kamin","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign and Google, IL, USA"}]},{"given":"Furkan","family":"Kira\u00e7","sequence":"additional","affiliation":[{"name":"Ozyegin University, Istanbul, Turkey"}]}],"member":"320","published-online":{"date-parts":[[2016,3,28]]},"reference":[{"key":"e_1_2_1_1_1","series-title":"Lecture Notes in Computer Science","volume-title":"Supercomputing","author":"Abu-Sufah W.","unstructured":"W. Abu-Sufah and A. Abdel Karim. 2013. Auto-tuning of sparse matrix-vector multiplication on graphics processors. In Supercomputing. Lecture Notes in Computer Science, Vol. 7905. Springer, Berlin, 151--164."},{"key":"e_1_2_1_2_1","unstructured":"ACML. 2013. AMD Core Math Library User Guide 6.0.6. Retrieved from http:\/\/amd-dev.wpengine.netdna-cdn.com\/wordpress\/media\/2013\/12\/acml.pdf."},{"key":"e_1_2_1_3_1","doi-asserted-by":"crossref","unstructured":"B. Aktemur Y. Kameyama O. Kiselyov and C. Shan. 2013. Shonan challenge for generative programming. In Partial Evaluation and Program Manipulation (PEPM\u201913). 147--154.","DOI":"10.1145\/2426890.2426917"},{"key":"e_1_2_1_4_1","doi-asserted-by":"crossref","unstructured":"W. Armstrong and A. P. Rendell. 2008. Reinforcement learning for automated performance tuning. In Cluster Computing. 411--420.","DOI":"10.1109\/CLUSTR.2008.4663802"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2010.04.016"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-010-0145-2"},{"key":"e_1_2_1_7_1","doi-asserted-by":"crossref","unstructured":"N. Bell and M. Garland. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In High Performance Computing Networking Storage and Analysis (SC\u201909). 18:1--18:11.","DOI":"10.1145\/1654059.1654078"},{"volume-title":"21st Annual Symp. on Parallelism in Algorithms and Architectures (SPAA\u201909)","author":"Bulu\u00e7 A.","key":"e_1_2_1_8_1","unstructured":"A. Bulu\u00e7, J. Fineman, M. Frigo, J. Gilbert, and C. Leiserson. 2009. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In 21st Annual Symp. on Parallelism in Algorithms and Architectures (SPAA\u201909). 233--244."},{"key":"e_1_2_1_9_1","doi-asserted-by":"crossref","unstructured":"A. Bulu\u00e7 S. Williams L. Oliker and J. Demmel. 2011. Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication. In IPDPS\u201911. 721--733.","DOI":"10.1109\/IPDPS.2011.73"},{"key":"e_1_2_1_10_1","volume-title":"Technical Report UCB\/EECS-2012-215. EECS Department, U. of California","author":"Byun J.","year":"2012","unstructured":"J. Byun, R. Lin, K. Yelick, and J. Demmel. 2012. Autotuning Sparse Matrix-Vector Multiplication for Multicore. Technical Report UCB\/EECS-2012-215. EECS Department, U. of California, Berkeley."},{"key":"e_1_2_1_11_1","first-page":"5","article-title":"Multi-stage programming with functors and monads: Eliminating abstraction overhead from generic code. Sci","volume":"76","author":"Carette J.","year":"2011","unstructured":"J. Carette and O. Kiselyov. 2011. Multi-stage programming with functors and monads: Eliminating abstraction overhead from generic code. Sci. Comput. Program. 76, 5 (May 2011), 349--375.","journal-title":"Comput. Program."},{"key":"e_1_2_1_12_1","doi-asserted-by":"crossref","unstructured":"J. Choi A. Singh and R. Vuduc. 2010. Model-driven autotuning of sparse matrix-vector multiply on GPUs. In Principles and Practice of Parallel Programming (PPoPP\u201910). 115--126.","DOI":"10.1145\/1693453.1693471"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.scico.2004.03.011"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2049662.2049663"},{"key":"e_1_2_1_15_1","doi-asserted-by":"crossref","unstructured":"E. D\u2019Azevedo M. Fahey and R. Mills. 2005. Vectorized sparse matrix multiply for compressed row storage format. In ICCS\u201905. 99--106.","DOI":"10.1007\/11428831_13"},{"key":"e_1_2_1_16_1","doi-asserted-by":"crossref","unstructured":"Y. Ding J. Ansel K. Veeramachaneni X. Shen U. O\u2019Reilly and S. Amarasinghe. 2015. Autotuning algorithmic choice for input sensitivity. In Prog. Language Design and Implementation (PLDI\u201915). 379--390.","DOI":"10.1145\/2737924.2737969"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.1732"},{"key":"e_1_2_1_18_1","doi-asserted-by":"crossref","unstructured":"M. Frigo. 1999. A fast fourier transform compiler. In Programming Language Design and Implementation (PLDI\u201999). 169--180.","DOI":"10.1145\/301618.301661"},{"key":"e_1_2_1_19_1","doi-asserted-by":"crossref","unstructured":"Y. Fukui H. Yoshida and S. Higono. 1989. Supercomputing of circuits simulation. In Supercomputing (SC\u201989). 81--85.","DOI":"10.1145\/76263.76272"},{"key":"e_1_2_1_20_1","doi-asserted-by":"crossref","unstructured":"P. Giorgi and B. Vialla. 2014. Generating optimized sparse matrix vector product over finite fields. In Mathematical Software (ICMS\u201914). 685--690.","DOI":"10.1007\/978-3-662-44199-2_102"},{"key":"e_1_2_1_21_1","doi-asserted-by":"crossref","unstructured":"G. Goumas K. Kourtis N. Anastopoulos V. Karakasis and N. Koziris. 2008. Understanding the performance of sparse matrix-vector multiplication. In Parallel Distributed and Network-Based Processing (PDP\u201908). 283--292.","DOI":"10.1109\/PDP.2008.41"},{"key":"e_1_2_1_22_1","doi-asserted-by":"crossref","unstructured":"D. Grewe and A. Lokhmotov. 2011. Automatically generating and tuning GPU code for sparse matrix-vector multiplication from a high-level representation. In General Purpose Processing on Graphics Processing Units (GPGPU-4). 12:1--12:8.","DOI":"10.1145\/1964179.1964196"},{"key":"e_1_2_1_23_1","unstructured":"Roger G. Grimes David R. Kincaid and David M. Young. 1978. ITPACK 2.0 User\u2019s Guide. Report CNA-150. Center for Numerical Analysis University of Texas at Austin Austin TX."},{"key":"e_1_2_1_24_1","unstructured":"W. Gropp D. Kaushik D. Keyes and B. Smith. 1999. Toward realistic performance bounds for implicit CFD codes. In Parallel CFD\u201999."},{"key":"e_1_2_1_25_1","first-page":"5","article-title":"A performance modeling and optimization analysis tool for sparse matrix-vector multiplication on GPUs","volume":"25","author":"Guo P.","year":"2014","unstructured":"P. Guo, L. Wang, and P. Chen. 2014. A performance modeling and optimization analysis tool for sparse matrix-vector multiplication on GPUs. IEEE TPDS 25, 5 (May 2014), 1112--1123.","journal-title":"IEEE TPDS"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/321556.321565"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342004041296"},{"volume-title":"pOSKI: An Extensible Autotuning Framework to Perform Optimized SpMVs on Multicore Architectures. Master\u2019s thesis","author":"Jain A.","key":"e_1_2_1_28_1","unstructured":"A. Jain. 2008. pOSKI: An Extensible Autotuning Framework to Perform Optimized SpMVs on Multicore Architectures. Master\u2019s thesis. University of California, Berkeley."},{"key":"e_1_2_1_29_1","volume-title":"Jumbo: Run-time code generation for java and its applications. In Code Generation and Optimization (CGO\u201903). 48--56.","author":"Kamin S.","year":"2003","unstructured":"S. Kamin, L. Clausen, and A. Jarvis. 2003. Jumbo: Run-time code generation for java and its applications. In Code Generation and Optimization (CGO\u201903). 48--56."},{"key":"e_1_2_1_30_1","doi-asserted-by":"crossref","unstructured":"S. Kamin M. Garzar\u00e1n B. Aktemur D. Xu B. Y\u0131lmaz and Z. Chen. 2014. Optimization by runtime specialization for sparse matrix-vector multiplication. In Generative Programming: Concepts and Experiences (GPCE\u201914). 93--102.","DOI":"10.1145\/2658761.2658773"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/1880037.1880041"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2038037.1941587"},{"key":"e_1_2_1_33_1","volume-title":"LLVM: A compilation framework for lifelong program analysis & transformation. In Code Generation and Optimization (CGO&rsquo;\u201904). 75--86.","author":"Lattner C.","year":"2004","unstructured":"C. Lattner and V. Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In Code Generation and Optimization (CGO&rsquo;\u201904). 75--86."},{"key":"e_1_2_1_34_1","doi-asserted-by":"crossref","unstructured":"P. Lee and M. Leone. 1996. Optimizing ML with run-time code generation. In Programming Language Design and Implementation (PLDI\u201996). 137--148.","DOI":"10.1145\/231379.231407"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/2499370.2462181"},{"key":"e_1_2_1_36_1","doi-asserted-by":"crossref","unstructured":"X. Liu M. Smelyanskiy E. Chow and P. Dubey. 2013. Efficient sparse matrix-vector multiplication on x86-based many-core processors. In Supercomputing (ICS\u201913). 273--282.","DOI":"10.1145\/2464996.2465013"},{"key":"e_1_2_1_37_1","unstructured":"LLVM. 2013. LLVM Web Site. Retrieved from http:\/\/llvm.cs.uiuc.edu."},{"key":"e_1_2_1_38_1","doi-asserted-by":"crossref","unstructured":"N. Mateev K. Pingali P. Stodghill and V. Kotlyar. 2000. Next-generation generic programming and its application to sparse matrix computations. In Supercomputing (ICS\u201900). 88--99.","DOI":"10.1145\/335231.335240"},{"key":"e_1_2_1_39_1","unstructured":"Matrix Market. 1997. Matrix Market Web Site. Retrieved from http:\/\/math.nist.gov\/MatrixMarket."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342004038951"},{"key":"e_1_2_1_41_1","unstructured":"MKL. 2013. Intel\u00ae Math Kernel Library. Retrieved from http:\/\/software.intel.com\/en-us\/articles\/intel-mkl."},{"volume-title":"Parallel and Distributed Processing Symp. (IPDPS\u201914)","author":"Muralidharan S.","key":"e_1_2_1_42_1","unstructured":"S. Muralidharan, M. Shantharam, M. Hall, M. Garland, and B. Catanzaro. 2014. Nitro: A framework for adaptive code variant tuning. In Parallel and Distributed Processing Symp. (IPDPS\u201914). 501--512."},{"volume-title":"Parallel & Distributed Processing Symp. Workshops (IPDPSW&rsquo;\u201914)","author":"Neelima B.","key":"e_1_2_1_43_1","unstructured":"B. Neelima, G. Ram, M. Reddy, and Prakash S. Raghavendra. 2014. Predicting an optimal sparse matrix format for SpMV computation on GPU. In Parallel & Distributed Processing Symp. Workshops (IPDPSW&rsquo;\u201914). 1427--1436."},{"key":"e_1_2_1_44_1","unstructured":"OpenMP 2009. OpenMP API for parallel programming version 3.0. Retrieved from http:\/\/openmp.org\/wp."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.5555\/1953048.2078195"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2004.840306"},{"key":"e_1_2_1_47_1","doi-asserted-by":"crossref","unstructured":"T. Rompf and M. Odersky. 2010. Lightweight modular staging: A pragmatic approach to runtime code generation and compiled DSLs. In Generative Prog. and Component Engineering (GPCE\u201910). 127--136.","DOI":"10.1145\/1868294.1868314"},{"key":"e_1_2_1_48_1","doi-asserted-by":"crossref","unstructured":"T. Rompf A. Sujeeth N. Amin K. Brown V. Jovanovic H. Lee M. Jonnalagedda K. Olukotun and M. Odersky. 2013. Optimizing data structures in high-level programs. In Principles of Programming Languages (POPL\u201913). 497--510.","DOI":"10.1145\/2429069.2429128"},{"volume-title":"Iterative Methods for Sparse Linear Systems","author":"Saad Y.","key":"e_1_2_1_49_1","unstructured":"Y. Saad. 2003. Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, PA."},{"key":"e_1_2_1_50_1","doi-asserted-by":"crossref","unstructured":"B. Su and K. Keutzer. 2012. clSpMV: A cross-platform OpenCL SpMV framework on GPUs. In Supercomputing (ICS\u201912). 353--364.","DOI":"10.1145\/2304576.2304624"},{"key":"e_1_2_1_51_1","doi-asserted-by":"crossref","unstructured":"X. Sun Y. Zhang T. Wang X. Zhang L. Yuan and L. Rao. 2011. Optimizing SpMV for diagonal sparse matrices on GPU. In Parallel Processing (ICPP\u201911). 492--501.","DOI":"10.1109\/ICPP.2011.53"},{"key":"e_1_2_1_52_1","doi-asserted-by":"crossref","unstructured":"A. Venkat M. Hall and M. Strout. 2015. Loop and data transformations for sparse matrix code. In Programming Language Design and Implementation (PLDI\u201915). 521--532.","DOI":"10.1145\/2737924.2738003"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342004041293"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/16\/1\/071"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-8191(00)00087-9"},{"key":"e_1_2_1_56_1","doi-asserted-by":"crossref","unstructured":"J. Willcock and A. Lumsdaine. 2006. Accelerating sparse matrix computations via data compression. In Supercomputing (ICS\u201906). 307--316.","DOI":"10.1145\/1183401.1183444"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2008.12.006"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2851500","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2851500","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2851500","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T09:22:53Z","timestamp":1763457773000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2851500"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,3,28]]},"references-count":57,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2016,4,5]]}},"alternative-id":["10.1145\/2851500"],"URL":"https:\/\/doi.org\/10.1145\/2851500","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2016,3,28]]},"assertion":[{"value":"2015-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-11-01","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-03-28","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}