{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,19]],"date-time":"2025-09-19T08:48:24Z","timestamp":1758271704196,"version":"3.41.0"},"reference-count":27,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2016,10,25]],"date-time":"2016-10-25T00:00:00Z","timestamp":1477353600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NSF project","award":["1216569"],"award-info":[{"award-number":["1216569"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61502450, 61432018, 61521092 and 61272136"],"award-info":[{"award-number":["61502450, 61432018, 61521092 and 61272136"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"AMD Inc. Extension of Conference Paper"},{"name":"National Key Research and Development Program of China","award":["2016YFB0200803"],"award-info":[{"award-number":["2016YFB0200803"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2016,12,28]]},"abstract":"<jats:p>Sparse Matrix-Vector multiplication (SpMV) is a key operation in engineering and scientific computing. Although the previous work has shown impressive progress in optimizing SpMV on many-core architectures, load imbalance and high memory bandwidth remain the critical performance bottlenecks. We present our novel solutions to these problems, for both GPUs and Intel MIC many-core architectures. First, we devise a new SpMV format, called Blocked Compressed Common Coordinate (BCCOO). BCCOO extends the blocked Common Coordinate (COO) by using bit flags to store the row indices to alleviate the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices for better data locality. Then, to address the load imbalance problem, we propose a highly efficient matrix-based segmented sum\/scan algorithm for SpMV, which eliminates global synchronization. At last, we introduce an autotuning framework to choose optimization parameters. Experimental results show that our proposed framework has a significant advantage over the existing SpMV libraries. In single precision, our proposed scheme outperforms clSpMV COCKTAIL format by 255% on average on AMD FirePro W8000, and outperforms CUSPARSE V7.0 by 73.7% on average and outperforms CSR5 by 53.6% on average on GeForce Titan X; in double precision, our proposed scheme outperforms CUSPARSE V7.0 by 34.0% on average and outperforms CSR5 by 16.2% on average on Tesla K20, and has equivalent performance compared with CSR5 on Intel MIC.<\/jats:p>","DOI":"10.1145\/2994148","type":"journal-article","created":{"date-parts":[[2016,10,26]],"date-time":"2016-10-26T13:20:01Z","timestamp":1477488001000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["A Cross-Platform SpMV Framework on Many-Core Architectures"],"prefix":"10.1145","volume":"13","author":[{"given":"Yunquan","family":"Zhang","sequence":"first","affiliation":[{"name":"State Key Laboratory of Computer Architecture, Institute of Computing Technologies, Chinese Academy of Sciences, Beijing, China"}]},{"given":"Shigang","family":"Li","sequence":"additional","affiliation":[{"name":"State Key Laboratory of Computer Architecture, Institute of Computing Technologies, Chinese Academy of Sciences, Beijing, China"}]},{"given":"Shengen","family":"Yan","sequence":"additional","affiliation":[{"name":"SenseTime Group Limited, Department of Information Engineering, Chinese University of Hong Kong"}]},{"given":"Huiyang","family":"Zhou","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC"}]}],"member":"320","published-online":{"date-parts":[[2016,10,25]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"RC24704 (W0812-047)","author":"Baskaran Muthu Manikandan","year":"2008","unstructured":"Muthu Manikandan Baskaran and Rajesh Bordawekar . 2008. Optimizing sparse matrix-vector multiplication on GPUs using compile-time and run-time strategies. IBM Reserach Report , RC24704 (W0812-047) ( 2008 ). Muthu Manikandan Baskaran and Rajesh Bordawekar. 2008. Optimizing sparse matrix-vector multiplication on GPUs using compile-time and run-time strategies. IBM Reserach Report, RC24704 (W0812-047) (2008)."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1654059.1654078"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/12.42122"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/882262.882364"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.73"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1837853.1693471"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/HiPC.2015.55"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2049662.2049663"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1375527.1375559"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.68"},{"key":"e_1_2_1_12_1","volume-title":"CUDPP: CUDA data parallel primitives library.","author":"Harris Mark","year":"2007","unstructured":"Mark Harris , John Owens , Shubho Sengupta , Yao Zhang , and Andrew Davidson . 2007 . CUDPP: CUDA data parallel primitives library. (2007). Mark Harris, John Owens, Shubho Sengupta, Yao Zhang, and Andrew Davidson. 2007. CUDPP: CUDA data parallel primitives library. (2007)."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1137\/130930352"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-014-5254-x"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751209"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2464996.2465013"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASAP.2015.7245713"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-11515-8_10"},{"volume-title":"CUSPARSE library. (2014)","author":"CUDA NVIDIA.","key":"e_1_2_1_20_1","unstructured":"CUDA NVIDIA. 2014. CUSPARSE library. (2014) . NVIDIA Corporation , Santa Clara , California. CUDA NVIDIA. 2014. CUSPARSE library. (2014). NVIDIA Corporation, Santa Clara, California."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.micpro.2011.05.005"},{"key":"e_1_2_1_22_1","first-page":"97","article-title":"Scan primitives for GPU computing","volume":"2007","author":"Sengupta Shubhabrata","year":"2007","unstructured":"Shubhabrata Sengupta , Mark Harris , Yao Zhang , and John D. Owens . 2007 . Scan primitives for GPU computing . In Graphics Hardware , Vol. 2007. 97 -- 106 . Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. 2007. Scan primitives for GPU computing. In Graphics Hardware, Vol. 2007. 97--106.","journal-title":"Graphics Hardware"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2010.69"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304624"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2503234"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-89740-8_1"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.1658"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2008.12.006"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2517327.2442539"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2994148","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2994148","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T03:39:50Z","timestamp":1750217990000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2994148"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,10,25]]},"references-count":27,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2016,12,28]]}},"alternative-id":["10.1145\/2994148"],"URL":"https:\/\/doi.org\/10.1145\/2994148","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2016,10,25]]},"assertion":[{"value":"2016-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-10-25","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}