{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,6,22]],"date-time":"2024-06-22T11:55:51Z","timestamp":1719057351823},"reference-count":439,"publisher":"Association for Computing Machinery (ACM)","issue":"11","funder":[{"DOI":"10.13039\/501100003246","name":"Dutch Research Council","doi-asserted-by":"crossref"},{"name":"NWA-ORC Call","award":["NWA.1160.18.316"]},{"DOI":"10.13039\/100013407","name":"Netherlands eScience Center","doi-asserted-by":"crossref","award":["027.016.G06"]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2023,11,30]]},"abstract":"In the past decade, Graphics Processing Units have played an important role in the field of high-performance computing and they still advance new fields such as IoT, autonomous vehicles, and exascale computing. It is therefore important to understand how to extract performance from these processors, something that is not trivial. This survey discusses various optimization techniques found in 450 articles published in the last 14 years. 
We analyze the optimizations from different perspectives which shows that the various optimizations are highly interrelated, explaining the need for techniques such as auto-tuning.","DOI":"10.1145\/3570638","type":"journal-article","created":{"date-parts":[[2022,11,14]],"date-time":"2022-11-14T12:35:46Z","timestamp":1668429346000},"page":"1-81","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":17,"title":["Optimization Techniques for GPU Programming"],"prefix":"10.1145","volume":"55","author":[{"ORCID":"http:\/\/orcid.org\/0000-0002-5716-1118","authenticated-orcid":false,"given":"Pieter","family":"Hijma","sequence":"first","affiliation":[{"name":"Vrije Universiteit Amsterdam, Amsterdam, The Netherlands"}]},{"ORCID":"http:\/\/orcid.org\/0000-0001-8792-6305","authenticated-orcid":false,"given":"Stijn","family":"Heldens","sequence":"additional","affiliation":[{"name":"Netherlands eScience Center, Amsterdam, The Netherlands"}]},{"ORCID":"http:\/\/orcid.org\/0000-0003-3278-0518","authenticated-orcid":false,"given":"Alessio","family":"Sclocco","sequence":"additional","affiliation":[{"name":"Netherlands eScience Center, Amsterdam, The Netherlands"}]},{"ORCID":"http:\/\/orcid.org\/0000-0002-7508-3272","authenticated-orcid":false,"given":"Ben","family":"van Werkhoven","sequence":"additional","affiliation":[{"name":"Netherlands eScience Center, Amsterdam, The Netherlands"}]},{"ORCID":"http:\/\/orcid.org\/0000-0001-9827-4461","authenticated-orcid":false,"given":"Henri E.","family":"Bal","sequence":"additional","affiliation":[{"name":"Vrije Universiteit Amsterdam, Amsterdam, The Netherlands"}]}],"member":"320","published-online":{"date-parts":[[2023,3,16]]},"reference":[{"key":"e_1_3_3_2_2","unstructured":"2018. Frontier: OLCF\u2019s Exascale Future. Retrieved July 2021 from https:\/\/www.olcf.ornl.gov\/2018\/02\/13\/frontier-olcfs-exascale-future\/."},{"key":"e_1_3_3_3_2","unstructured":"2019. U.S. 
Department of Energy and Intel to Deliver First Exascale Supercomputer Argonne National Laboratory. Retrieved July 2021 from https:\/\/www.anl.gov\/article\/us-department-of-energy-and-intel-to-deliver-first-exascale-supercomputer."},{"key":"e_1_3_3_4_2","unstructured":"2020. May We Introduce: LUMI. Retrieved July 2021 from https:\/\/www.lumi-supercomputer.eu\/may-we-introduce-lumi\/."},{"key":"e_1_3_3_5_2","article-title":"On the development of variable size batched computation for heterogeneous parallel architectures","author":"Abdelfattah A.","year":"2016","unstructured":"A. Abdelfattah , A. Haidar , S. Tomov , et\u00a0al. 2016. On the development of variable size batched computation for heterogeneous parallel architectures. IEEE International Parallel and Distributed Processing Symposium Workshops (2016).","journal-title":"IEEE International Parallel and Distributed Processing Symposium Workshops"},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2016.05.303"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079079.3079103"},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jocs.2018.01.005"},{"key":"e_1_3_3_9_2","first-page":"207","article-title":"Systematic approach in optimizing numerical memory-bound kernels on GPU","author":"Abdelfattah A.","year":"2013","unstructured":"A. Abdelfattah , D. Keyes , and H. Ltaief . 2013. Systematic approach in optimizing numerical memory-bound kernels on GPU. European Conference on Parallel Processing (2013), 207\u2013216.","journal-title":"European Conference on Parallel Processing"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/2818311"},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.3874"},{"key":"e_1_3_3_12_2","article-title":"Progressive optimization of batched LU factorization on GPUs","author":"Abdelfattah A.","year":"2019","unstructured":"A. Abdelfattah , S. Tomov , and J. Dongarra . 2019. 
Progressive optimization of batched LU factorization on GPUs. IEEE High Performance Extreme Computing Conference (2019).","journal-title":"IEEE High Performance Extreme Computing Conference"},{"key":"e_1_3_3_13_2","article-title":"High performance CUDA AES implementation: A quantitative performance analysis approach","author":"Abdelrahman A. A.","year":"2017","unstructured":"A. A. Abdelrahman , M. M. Fouad , H. Dahshan , et\u00a0al. 2017. High performance CUDA AES implementation: A quantitative performance analysis approach. In Proceedings of the Computing Conference (2017).","journal-title":"In Proceedings of the Computing Conference"},{"key":"e_1_3_3_14_2","article-title":"Acceleration of bilateral filtering algorithm for manycore and multicore architectures","author":"Agarwal D.","year":"2012","unstructured":"D. Agarwal , S. Wilf , A. Dhungel , et\u00a0al. 2012. Acceleration of bilateral filtering algorithm for manycore and multicore architectures. Proceedings of the International Conference on Parallel Processing.","journal-title":"Proceedings of the International Conference on Parallel Processing."},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3418075"},{"key":"e_1_3_3_16_2","doi-asserted-by":"publisher","DOI":"10.1587\/transfun.E100.A.1188"},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-017-1972-3"},{"key":"e_1_3_3_18_2","first-page":"1","article-title":"A review of CUDA optimization techniques and tools for structured grid computing","author":"Al-Mouhamed M. A.","year":"2019","unstructured":"M. A. Al-Mouhamed , A. H. Khan , and N. Mohammad . 2019. A review of CUDA optimization techniques and tools for structured grid computing. Computing (2019), 1\u201327.","journal-title":"Computing"},{"key":"e_1_3_3_19_2","article-title":"Exploring the parallel capabilities of GPU: Berlekamp-Massey algorithm case study","author":"Ali H.","year":"2019","unstructured":"H. Ali , G. M. Fathy , Z. Fayez , et\u00a0al. 2019. 
Exploring the parallel capabilities of GPU: Berlekamp-Massey algorithm case study. Cluster Computing (2019).","journal-title":"Cluster Computing"},{"key":"e_1_3_3_20_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.compfluid.2018.05.030"},{"key":"e_1_3_3_21_2","unstructured":"AMD . 2017. Radeon\u2019s Next-generation Vega Architecture (Whitepaper)."},{"key":"e_1_3_3_22_2","unstructured":"AMD . 2019. Introducing RDNA Architecture (Whitepaper)."},{"key":"e_1_3_3_23_2","unstructured":"AMD . 2020. Introducing CDNA Architecture (Whitepaper)."},{"key":"e_1_3_3_24_2","article-title":"Optimized password recovery for encrypted RAR on GPUs","author":"An X.","year":"2015","unstructured":"X. An , H. Jia , and Y. Zhang . 2015. Optimized password recovery for encrypted RAR on GPUs. 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems.","journal-title":"2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems."},{"key":"e_1_3_3_25_2","first-page":"133","article-title":"Taming control divergence in GPUs through control flow linearization","author":"Anantpur J.","year":"2014","unstructured":"J. Anantpur and R. Govindarajan . 2014. Taming control divergence in GPUs through control flow linearization. International Conference on Compiler Construction (2014), 133\u2013153.","journal-title":"International Conference on Compiler Construction"},{"key":"e_1_3_3_26_2","unstructured":"M. Andersch G. Palmer R. Krashinsky et\u00a0al. 2022. NVIDIA Hopper Architecture In-Depth. 
Retrieved July 2021 from https:\/\/developer.nvidia.com\/blog\/nvidia-hopper-architecture-in-depth\/."},{"key":"e_1_3_3_27_2","article-title":"Reducing vector I\/O for faster GPU sparse matrix-vector multiplication","author":"Anh P. N. Q.","year":"2015","unstructured":"P. N. Q. Anh , R. Fan , and Y. Wen . 2015. Reducing vector I\/O for faster GPU sparse matrix-vector multiplication. IEEE International Parallel and Distributed Processing Symposium (2015).","journal-title":"IEEE International Parallel and Distributed Processing Symposium"},{"key":"e_1_3_3_28_2","article-title":"Balanced hashing and efficient GPU sparse general matrix-matrix multiplication","author":"Anh P. N. Q.","year":"2016","unstructured":"P. N. Q. Anh , R. Fan , and Y. Wen . 2016. Balanced hashing and efficient GPU sparse general matrix-matrix multiplication. In Proceedings of the International Conference on Supercomputing.","journal-title":"In Proceedings of the International Conference on Supercomputing."},{"key":"e_1_3_3_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3380930"},{"key":"e_1_3_3_30_2","doi-asserted-by":"publisher","DOI":"10.1177\/1094342016646844"},{"key":"e_1_3_3_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/2834899.2834907"},{"key":"e_1_3_3_32_2","article-title":"Optimizing krylov subspace solvers on graphics processing units","author":"Anzt H.","year":"2014","unstructured":"H. Anzt , W. Sawyer , S. Tomov , et\u00a0al. 2014. Optimizing krylov subspace solvers on graphics processing units. IEEE International Parallel & Distributed Processing Symposium Workshops (2014).","journal-title":"IEEE International Parallel & Distributed Processing Symposium Workshops"},{"key":"e_1_3_3_33_2","article-title":"GPU-SFFT: A GPU based parallel algorithm for computing the sparse fast fourier transform (SFFT) of k-sparse signals","author":"Artiles O.","year":"2019","unstructured":"O. Artiles and F. Saeed . 2019. 
GPU-SFFT: A GPU based parallel algorithm for computing the sparse fast fourier transform (SFFT) of k-sparse signals. IEEE International Conference on Big Data.","journal-title":"IEEE International Conference on Big Data."},{"key":"e_1_3_3_34_2","article-title":"Memory-efficient parallel simulation of electron beam dynamics using GPUs","author":"Arumugam K.","year":"2016","unstructured":"K. Arumugam , D. Ranjan , M. Zubair , et\u00a0al. 2016. Memory-efficient parallel simulation of electron beam dynamics using GPUs. In Proceedings of the IEEE 23rd International Conference on High Performance Computing.","journal-title":"Proceedings of the IEEE 23rd International Conference on High Performance Computing."},{"key":"e_1_3_3_35_2","article-title":"An efficient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs","author":"Ashari A.","year":"2014","unstructured":"A. Ashari , N. Sedaghati , J. Eisenlohr , et\u00a0al. 2014. An efficient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs. In Proceedings of the 28th ACM International Conference on Supercomputing.","journal-title":"In Proceedings of the 28th ACM International Conference on Supercomputing."},{"key":"e_1_3_3_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.69"},{"key":"e_1_3_3_37_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2014.11.001"},{"key":"e_1_3_3_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3108139"},{"key":"e_1_3_3_39_2","article-title":"A dynamic hash table for the GPU","author":"Ashkiani S.","year":"2018","unstructured":"S. Ashkiani , M. Farach-Colton , and J. D. Owens . 2018. A dynamic hash table for the GPU. 
In Proceedings of the IEEE International Parallel and Distributed Processing Symposium.","journal-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium."},{"key":"e_1_3_3_40_2","article-title":"CUDA memory techniques for matrix multiplication on Quadro 4000","author":"Athil T.","year":"2014","unstructured":"T. Athil , R. Christian , and Y. B. Reddy . 2014. CUDA memory techniques for matrix multiplication on Quadro 4000. In Proceedings of the 11th International Conference on Information Technology: New Generations.","journal-title":"In Proceedings of the 11th International Conference on Information Technology: New Generations."},{"key":"e_1_3_3_41_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.4055"},{"key":"e_1_3_3_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/197405.197406"},{"key":"e_1_3_3_43_2","article-title":"A fast parallel selection algorithm on GPUs","author":"Bakunas-Milanowski D.","year":"2015","unstructured":"D. Bakunas-Milanowski , V. Rego , J. Sang , et\u00a0al. 2015. A fast parallel selection algorithm on GPUs. In Proceedings of the International Conference on Computational Science and Computational Intelligence.","journal-title":"In Proceedings of the International Conference on Computational Science and Computational Intelligence."},{"key":"e_1_3_3_44_2","article-title":"Accelerating discrete wavelet transforms on GPUs","author":"Barina D.","year":"2017","unstructured":"D. Barina , M. Kula , M. Matysek , et\u00a0al. 2017. Accelerating discrete wavelet transforms on GPUs. In Proceedings of the IEEE International Conference on Image Processing.","journal-title":"In Proceedings of the IEEE International Conference on Image Processing."},{"key":"e_1_3_3_45_2","article-title":"Computing strongly connected components in parallel on CUDA","author":"Barnat J.","year":"2011","unstructured":"J. Barnat , P. Bauch , L. Brim , et\u00a0al. 2011. Computing strongly connected components in parallel on CUDA. 
IEEE International Parallel & Distributed Processing Symposium (2011).","journal-title":"IEEE International Parallel & Distributed Processing Symposium"},{"key":"e_1_3_3_46_2","article-title":"CudaDMA optimizing GPU memory bandwidth via warp specialization","author":"Bauer M.","year":"2011","unstructured":"M. Bauer , H. Cook , and B. Khailany . 2011. CudaDMA optimizing GPU memory bandwidth via warp specialization. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis .","journal-title":"In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis"},{"key":"e_1_3_3_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/1654059.1654078"},{"key":"e_1_3_3_48_2","article-title":"Fast multiplication in binary fields on GPUs via register cache","author":"Ben-Sasson E.","year":"2016","unstructured":"E. Ben-Sasson , M. Hamilis , M. Silberstein , et\u00a0al. 2016. Fast multiplication in binary fields on GPUs via register cache. In Proceedings of the International Conference on Supercomputing.","journal-title":"In Proceedings of the International Conference on Supercomputing."},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2013.208"},{"key":"e_1_3_3_50_2","first-page":"328","article-title":"ECC2K-130 on NVIDIA GPUs","author":"Bernstein D. J.","year":"2010","unstructured":"D. J. Bernstein , H. Chen , C. Cheng , et\u00a0al. 2010. ECC2K-130 on NVIDIA GPUs. Progress in Cryptology - INDOCRYPT (2010), 328\u2013346.","journal-title":"Progress in Cryptology - INDOCRYPT"},{"key":"e_1_3_3_51_2","first-page":"112","article-title":"Compiler optimizations for industrial unstructured mesh CFD applications on GPUs","author":"Bertolli C.","year":"2013","unstructured":"C. Bertolli , A. Betts , N. Loriant , et\u00a0al. 2013. Compiler optimizations for industrial unstructured mesh CFD applications on GPUs. 
Languages and Compilers for Parallel Computing (2013), 112\u2013126.","journal-title":"Languages and Compilers for Parallel Computing"},{"key":"e_1_3_3_52_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.compfluid.2012.06.003"},{"key":"e_1_3_3_53_2","first-page":"279","article-title":"Performance analysis of the SHA-3 candidates on exotic multi-core architectures","author":"Bos J. W.","year":"2010","unstructured":"J. W. Bos and D. Stefan . 2010. Performance analysis of the SHA-3 candidates on exotic multi-core architectures. Cryptographic Hardware and Embedded Systems (2010), 279\u2013293.","journal-title":"Cryptographic Hardware and Embedded Systems"},{"key":"e_1_3_3_54_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cor.2011.03.014"},{"key":"e_1_3_3_55_2","article-title":"Compile-time GPU memory access optimizations","author":"Braak G. van d.","year":"2010","unstructured":"G. van d. Braak , B. Mesman , and H. Corporaal . 2010. Compile-time GPU memory access optimizations. International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (2010).","journal-title":"International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation"},{"key":"e_1_3_3_56_2","article-title":"Compiling graph applications for GPUs with GraphIt","author":"Brahmakshatriya A.","year":"2021","unstructured":"A. Brahmakshatriya , Y. Zhang , C. Hong , et\u00a0al. 2021. Compiling graph applications for GPUs with GraphIt. IEEE\/ACM International Symposium on Code Generation and Optimization.","journal-title":"IEEE\/ACM International Symposium on Code Generation and Optimization."},{"issue":"0","key":"e_1_3_3_57_2","article-title":"Graphics processing unit (GPU) programming strategies and trends in GPU computing","author":"Brodtkorb A. R.","year":"2012","unstructured":"A. R. Brodtkorb , T. R. Hagen , and M. L. S\u00e6tra . 2012. Graphics processing unit (GPU) programming strategies and trends in GPU computing. J. Parallel Distrib. 
Comput. 0 (2012).","journal-title":"J. Parallel Distrib. Comput."},{"key":"e_1_3_3_58_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.3973"},{"key":"e_1_3_3_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/MCSoC.2015.38"},{"key":"e_1_3_3_60_2","doi-asserted-by":"publisher","DOI":"10.1016\/B978-0-12-384988-5.00006-1"},{"key":"e_1_3_3_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2015.2485994"},{"key":"e_1_3_3_62_2","article-title":"Hornet: An efficient data structure for dynamic sparse graphs and matrices on GPUs","author":"Busato F.","year":"2018","unstructured":"F. Busato , O. Green , N. Bombieri , et\u00a0al. 2018. Hornet: An efficient data structure for dynamic sparse graphs and matrices on GPUs. IEEE High Performance Extreme Computing Conference (2018).","journal-title":"IEEE High Performance Extreme Computing Conference"},{"key":"e_1_3_3_63_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10707-017-0312-3"},{"key":"e_1_3_3_64_2","doi-asserted-by":"publisher","DOI":"10.1145\/1531743.1531766"},{"key":"e_1_3_3_65_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2012.01.002"},{"key":"e_1_3_3_66_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2017.12.002"},{"key":"e_1_3_3_67_2","doi-asserted-by":"publisher","DOI":"10.1002\/nme.2989"},{"key":"e_1_3_3_68_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2012.04.209"},{"key":"e_1_3_3_69_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.2931"},{"key":"e_1_3_3_70_2","article-title":"A scalable, numerically stable, high-performance tridiagonal solver using GPUs","author":"Chang L.","year":"2012","unstructured":"L. Chang , J. A. Stratton , H. Kim , et\u00a0al. 2012. A scalable, numerically stable, high-performance tridiagonal solver using GPUs. 
International Conference for High Performance Computing, Networking, Storage and Analysis (2012).","journal-title":"International Conference for High Performance Computing, Networking, Storage and Analysis"},{"key":"e_1_3_3_71_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.4187"},{"key":"e_1_3_3_72_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2013.235"},{"key":"e_1_3_3_73_2","article-title":"Dymaxion","author":"Che S.","year":"2011","unstructured":"S. Che , J. W. Sheaffer , and K. Skadron . 2011. Dymaxion. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.","journal-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis."},{"key":"e_1_3_3_74_2","doi-asserted-by":"publisher","DOI":"10.3233\/JCM-180840"},{"key":"e_1_3_3_75_2","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830818"},{"key":"e_1_3_3_76_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.image.2013.08.001"},{"key":"e_1_3_3_77_2","article-title":"moDNN: Memory optimal DNN training on GPUs","author":"Chen X.","year":"2018","unstructured":"X. Chen , D. Z. Chen , and X. S. Hu . 2018. moDNN: Memory optimal DNN training on GPUs. Design, Automation & Test in Europe Conference & Exhibition .","journal-title":"Design, Automation & Test in Europe Conference & Exhibition"},{"key":"e_1_3_3_78_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.4064"},{"key":"e_1_3_3_79_2","doi-asserted-by":"publisher","DOI":"10.1145\/1693453.1693471"},{"key":"e_1_3_3_80_2","article-title":"PATUS: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures","author":"Christen M.","year":"2011","unstructured":"M. Christen , O. Schenk , and H. Burkhart . 2011. PATUS: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. 
IEEE International Parallel & Distributed Processing Symposium (2011).","journal-title":"IEEE International Parallel & Distributed Processing Symposium"},{"key":"e_1_3_3_81_2","doi-asserted-by":"crossref","DOI":"10.1109\/ICPADS.2011.29","article-title":"Architecture-aware mapping and optimization on a 1600-core GPU","author":"Daga M.","year":"2011","unstructured":"M. Daga , T. Scogland , and W. Feng . 2011. Architecture-aware mapping and optimization on a 1600-core GPU. IEEE 17th International Conference on Parallel and Distributed Systems (2011).","journal-title":"IEEE 17th International Conference on Parallel and Distributed Systems"},{"key":"e_1_3_3_82_2","article-title":"Optimizing sparse matrix operations on GPUs using merge path","author":"Dalton S.","year":"2015","unstructured":"S. Dalton , S. Baxter , D. Merrill , et\u00a0al. 2015. Optimizing sparse matrix operations on GPUs using merge path. IEEE International Parallel and Distributed Processing Symposium (2015).","journal-title":"IEEE International Parallel and Distributed Processing Symposium"},{"key":"e_1_3_3_83_2","doi-asserted-by":"publisher","DOI":"10.1145\/2699470"},{"key":"e_1_3_3_84_2","article-title":"Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures","author":"Datta K.","year":"2008","unstructured":"K. Datta , M. Murphy , V. Volkov , et\u00a0al. 2008. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. SC - International Conference for High Performance Computing, Networking, Storage and Analysis (2008).","journal-title":"SC - International Conference for High Performance Computing, Networking, Storage and Analysis"},{"key":"e_1_3_3_85_2","article-title":"Work-efficient parallel GPU methods for single-source shortest paths","author":"Davidson A.","year":"2014","unstructured":"A. Davidson , S. Baxter , M. Garland , et\u00a0al. 2014. Work-efficient parallel GPU methods for single-source shortest paths. 
IEEE 28th International Parallel and Distributed Processing Symposium (2014).","journal-title":"IEEE 28th International Parallel and Distributed Processing Symposium"},{"key":"e_1_3_3_86_2","doi-asserted-by":"publisher","DOI":"10.1109\/PDP.2013.57"},{"key":"e_1_3_3_87_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.camwa.2013.10.002"},{"key":"e_1_3_3_88_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2012.01.003"},{"key":"e_1_3_3_89_2","article-title":"Kepler GPU vs. Xeon Phi: Performance case study with a high-order CFD application","author":"Deng L.","year":"2015","unstructured":"L. Deng , H. Bai , D. Zhao , et\u00a0al. 2015. Kepler GPU vs. Xeon Phi: Performance case study with a high-order CFD application. IEEE International Conference on Computer and Communications (2015).","journal-title":"IEEE International Conference on Computer and Communications"},{"key":"e_1_3_3_90_2","first-page":"539","article-title":"Taming irregular EDA applications on GPUs","author":"Deng Y.","year":"2009","unstructured":"Y. Deng , B. D. Wang , and S. Mu . 2009. Taming irregular EDA applications on GPUs. IEEE\/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD (2009), 539\u2013546.","journal-title":"IEEE\/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD"},{"key":"e_1_3_3_91_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.3836"},{"key":"e_1_3_3_92_2","first-page":"47","article-title":"Data reordering for minimizing threads divergence in GPU-based evaluating association rules","author":"Djenouri Y.","year":"2015","unstructured":"Y. Djenouri , A. Bendjoudi , M. Mehdi , et\u00a0al. 2015. Data reordering for minimizing threads divergence in GPU-based evaluating association rules. 
Distributed Computing and Artificial Intelligence, 12th International Conference (2015), 47\u201354.","journal-title":"Distributed Computing and Artificial Intelligence, 12th International Conference"},{"key":"e_1_3_3_93_2","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC47752.2019.9041954"},{"key":"e_1_3_3_94_2","article-title":"Optimizing option pricing algorithms and profiling power consumption on VLIW APU architecture","author":"Doerksen M.","year":"2012","unstructured":"M. Doerksen , P. Thulasiraman , and R. K. Thulasiram . 2012. Optimizing option pricing algorithms and profiling power consumption on VLIW APU architecture. IEEE 10th International Symposium on Parallel and Distributed Processing with Applications (2012).","journal-title":"IEEE 10th International Symposium on Parallel and Distributed Processing with Applications"},{"key":"e_1_3_3_95_2","first-page":"0623","article-title":"Optimizing high-performance CUDA DSP filter for ECG signals","author":"Domazet E.","year":"2016","unstructured":"E. Domazet , M. Gusev , and S. Ristov . 2016. Optimizing high-performance CUDA DSP filter for ECG signals. Proceedings of the 27th International DAAAM Symposium (2016), 0623\u20130632.","journal-title":"Proceedings of the 27th International DAAAM Symposium"},{"key":"e_1_3_3_96_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cpc.2012.10.015"},{"key":"e_1_3_3_97_2","article-title":"LU factorization of small matrices: Accelerating batched DGETRF on the GPU","author":"Dong T.","year":"2014","unstructured":"T. Dong , A. Haidar , P. Luszczek , et\u00a0al. 2014. LU factorization of small matrices: Accelerating batched DGETRF on the GPU. 
In Proceedings of the 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (2014).","journal-title":"In Proceedings of the 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst"},{"key":"e_1_3_3_98_2","article-title":"Fast scan algorithms on graphics processors","author":"Dotsenko Y.","year":"2008","unstructured":"Y. Dotsenko , N. K. Govindaraju , P. Sloan , et\u00a0al. 2008. Fast scan algorithms on graphics processors. In Proceedings of the 22nd International Conference on Supercomputing .","journal-title":"In Proceedings of the 22nd International Conference on Supercomputing"},{"key":"e_1_3_3_99_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2019.06.018"},{"key":"e_1_3_3_100_2","article-title":"Profiling and optimization of CT reconstruction on Nvidia Quadro GV100","author":"Dwivedi S.","year":"2020","unstructured":"S. Dwivedi and A. Heumann . 2020. Profiling and optimization of CT reconstruction on Nvidia Quadro GV100. IEEE High Performance Extreme Computing Conference (2020).","journal-title":"IEEE High Performance Extreme Computing Conference"},{"key":"e_1_3_3_101_2","doi-asserted-by":"publisher","DOI":"10.2528\/PIER11031607"},{"key":"e_1_3_3_102_2","article-title":"Efficient sparse matrix-vector multiplication on cache-based GPUs","author":"Eguly I. R.","year":"2012","unstructured":"I. R. Eguly and M. Giles . 2012. Efficient sparse matrix-vector multiplication on cache-based GPUs. Innovative Parallel Computing, InPar (2012).","journal-title":"Innovative Parallel Computing, InPar"},{"key":"e_1_3_3_103_2","article-title":"Optimizing image sharpening algorithm on GPU","author":"Fan M.","year":"2015","unstructured":"M. Fan , H. Jia , Y. Zhang , et\u00a0al. 2015. 
Optimizing image sharpening algorithm on GPU. 44th International Conference on Parallel Processing (2015).","journal-title":"44th International Conference on Parallel Processing"},{"key":"e_1_3_3_104_2","article-title":"Cache-friendly design for complex spatially-variable coefficient stencils on many-core architectures","author":"Fang J.","year":"2016","unstructured":"J. Fang , H. Fu , and G. Yang . 2016. Cache-friendly design for complex spatially-variable coefficient stencils on many-core architectures. IEEE 23rd International Conference on High Performance Computing (2016).","journal-title":"IEEE 23rd International Conference on High Performance Computing"},{"key":"e_1_3_3_105_2","article-title":"Optimizing complex spatially-variant coefficient stencils for seismic modeling on GPU","author":"Fang J.","year":"2015","unstructured":"J. Fang , H. Fu , H. Zhang , et\u00a0al. 2015. Optimizing complex spatially-variant coefficient stencils for seismic modeling on GPU. IEEE 21st International Conference on Parallel and Distributed Systems. (2015).","journal-title":"IEEE 21st International Conference on Parallel and Distributed Systems."},{"key":"e_1_3_3_106_2","article-title":"Accelerating cost aggregation for real-time stereo matching","author":"Fang J.","year":"2012","unstructured":"J. Fang , A. L. Varbanescu , J. Shen , et\u00a0al. 2012. Accelerating cost aggregation for real-time stereo matching. IEEE 18th International Conference on Parallel and Distributed Systems (2012).","journal-title":"IEEE 18th International Conference on Parallel and Distributed Systems"},{"key":"e_1_3_3_107_2","article-title":"An auto-tuning solution to data streams clustering in OpenCL","author":"Fang J.","year":"2011","unstructured":"J. Fang , A. L. Varbanescu , and H. Sips . 2011. An auto-tuning solution to data streams clustering in OpenCL. 
14th IEEE International Conference on Computational Science and Engineering (2011).","journal-title":"14th IEEE International Conference on Computational Science and Engineering"},{"key":"e_1_3_3_108_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2017.11.003"},{"key":"e_1_3_3_109_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPDC.2010.22"},{"key":"e_1_3_3_110_2","doi-asserted-by":"crossref","DOI":"10.1109\/ICPADS.2011.91","article-title":"Optimization of sparse matrix-vector multiplication with variant CSR on GPUs","author":"Feng X.","year":"2011","unstructured":"X. Feng , H. Jin , R. Zheng , et\u00a0al. 2011. Optimization of sparse matrix-vector multiplication with variant CSR on GPUs. IEEE 17th International Conference on Parallel and Distributed Systems (2011).","journal-title":"IEEE 17th International Conference on Parallel and Distributed Systems"},{"key":"e_1_3_3_111_2","article-title":"Implementing smith-waterman algorithm with two-dimensional cache on GPUs","author":"Feng X.","year":"2012","unstructured":"X. Feng , H. Jin , R. Zheng , et\u00a0al. 2012. Implementing smith-waterman algorithm with two-dimensional cache on GPUs. 2nd International Conference on Cloud and Green Computing (2012).","journal-title":"2nd International Conference on Cloud and Green Computing"},{"key":"e_1_3_3_112_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-015-1483-z"},{"key":"e_1_3_3_113_2","doi-asserted-by":"publisher","DOI":"10.1145\/2935746"},{"key":"e_1_3_3_114_2","article-title":"Single kernel soft synchronization technique for task arrays on CUDA-enabled GPUs, with applications","author":"Funasaka S.","year":"2017","unstructured":"S. Funasaka , K. Nakano , and Y. Ito . 2017. Single kernel soft synchronization technique for task arrays on CUDA-enabled GPUs, with applications. 
5th International Symposium on Computing and Networking (2017).","journal-title":"5th International Symposium on Computing and Networking"},{"key":"e_1_3_3_115_2","first-page":"25","article-title":"Thread block compaction for efficient SIMT control flow","author":"Fung W. W. L.","year":"2011","unstructured":"W. W. L. Fung and T. M. Aamodt . 2011. Thread block compaction for efficient SIMT control flow. IEEE 17th International Symposium on High Performance Computer Architecture (2011), 25\u201336.","journal-title":"IEEE 17th International Symposium on High Performance Computer Architecture"},{"key":"e_1_3_3_116_2","doi-asserted-by":"publisher","DOI":"10.1145\/3307681.3326606"},{"key":"e_1_3_3_117_2","doi-asserted-by":"publisher","DOI":"10.1186\/s12859-020-03697-x"},{"key":"e_1_3_3_118_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-016-0430-9"},{"key":"e_1_3_3_119_2","article-title":"DPF-ECC: Accelerating elliptic curve cryptography with floating-point computing power of GPUs","author":"Gao L.","year":"2020","unstructured":"L. Gao , F. Zheng , N. Emmart , et\u00a0al. 2020. DPF-ECC: Accelerating elliptic curve cryptography with floating-point computing power of GPUs. 
IEEE International Parallel and Distributed Processing Symposium (2020).","journal-title":"IEEE International Parallel and Distributed Processing Symposium"},{"key":"e_1_3_3_120_2","doi-asserted-by":"publisher","DOI":"10.1155\/2018\/6093054"},{"key":"e_1_3_3_121_2","doi-asserted-by":"publisher","DOI":"10.1093\/comjnl\/bxr062"},{"key":"e_1_3_3_122_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2016.01.008"},{"key":"e_1_3_3_123_2","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2503223"},{"key":"e_1_3_3_124_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00138-012-0443-3"},{"key":"e_1_3_3_125_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2012.319"},{"key":"e_1_3_3_126_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2015.2412549"},{"key":"e_1_3_3_127_2","article-title":"High performance multilevel graph partitioning on GPU","author":"Goodarzi B.","year":"2019","unstructured":"B. Goodarzi , F. Khorasani , V. Sarkar , et\u00a0al. 2019. High performance multilevel graph partitioning on GPU. International Conference on High Performance Computing & Simulation (2019).","journal-title":"International Conference on High Performance Computing & Simulation"},{"key":"e_1_3_3_128_2","article-title":"High performance discrete Fourier transforms on graphics processors","author":"Govindaraju N. K.","year":"2008","unstructured":"N. K. Govindaraju , B. Lloyd , Y. Dotsenko , et\u00a0al. 2008. High performance discrete Fourier transforms on graphics processors. SC - International Conference for High Performance Computing, Networking, Storage and Analysis (2008).","journal-title":"SC - International Conference for High Performance Computing, Networking, Storage and Analysis"},{"key":"e_1_3_3_129_2","article-title":"Clustering throughput optimization on the GPU","author":"Gowanlock M.","year":"2017","unstructured":"M. Gowanlock , C. M. Rude , D. M. Blair , et\u00a0al. 2017. Clustering throughput optimization on the GPU. 
IEEE International Parallel and Distributed Processing Symposium (2017).","journal-title":"IEEE International Parallel and Distributed Processing Symposium"},{"key":"e_1_3_3_130_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.68"},{"key":"e_1_3_3_131_2","doi-asserted-by":"publisher","DOI":"10.1177\/1094342015593156"},{"key":"e_1_3_3_132_2","doi-asserted-by":"publisher","DOI":"10.1145\/3293883.3295718"},{"key":"e_1_3_3_133_2","article-title":"A study of Persistent Threads style GPU programming for GPGPU workloads","author":"Gupta K.","year":"2012","unstructured":"K. Gupta , J. A. Stuart , and J. D. Owens . 2012. A study of Persistent Threads style GPU programming for GPGPU workloads. Innovative Parallel Computing (2012).","journal-title":"Innovative Parallel Computing"},{"key":"e_1_3_3_134_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-35740-4_29"},{"key":"e_1_3_3_135_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2012.336"},{"key":"e_1_3_3_136_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.compfluid.2012.02.013"},{"key":"e_1_3_3_137_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.advengsoft.2010.10.007"},{"key":"e_1_3_3_138_2","doi-asserted-by":"publisher","DOI":"10.1145\/3410463.3414632"},{"key":"e_1_3_3_139_2","doi-asserted-by":"publisher","DOI":"10.1145\/3038228.3038237"},{"key":"e_1_3_3_140_2","doi-asserted-by":"publisher","DOI":"10.1177\/1094342014567546"},{"key":"e_1_3_3_141_2","doi-asserted-by":"publisher","DOI":"10.1145\/1964179.1964184"},{"key":"e_1_3_3_142_2","doi-asserted-by":"publisher","DOI":"10.1145\/2458523.2458525"},{"key":"e_1_3_3_143_2","unstructured":"M. Harris . 2013. Unified Memory in CUDA 6. Retrieved June 2021 from https:\/\/developer.nvidia.com\/blog\/unified-memory-in-cuda-6\/."},{"key":"e_1_3_3_144_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.4705"},{"key":"e_1_3_3_145_2","doi-asserted-by":"publisher","DOI":"10.1145\/3372390"},{"key":"e_1_3_3_146_2","unstructured":"S. Heldens A. Sclocco and H. 
Dreuning . 2019. NLeSC\/litstudy."},{"key":"e_1_3_3_147_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00101"},{"key":"e_1_3_3_148_2","doi-asserted-by":"publisher","DOI":"10.1145\/3293883.3295707"},{"key":"e_1_3_3_149_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3123970"},{"key":"e_1_3_3_150_2","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304619"},{"key":"e_1_3_3_151_2","article-title":"A warp-synchronous implementation for multiple-length multiplication on the GPU","author":"Honda T.","year":"2015","unstructured":"T. Honda , Y. Ito , and K. Nakano . 2015. A warp-synchronous implementation for multiple-length multiplication on the GPU. 3rd International Symposium on Computing and Networking (2015).","journal-title":"3rd International Symposium on Computing and Networking"},{"key":"e_1_3_3_152_2","doi-asserted-by":"publisher","DOI":"10.1587\/transinf.2016PAP0027"},{"key":"e_1_3_3_153_2","doi-asserted-by":"publisher","DOI":"10.1145\/3192366.3192397"},{"key":"e_1_3_3_154_2","article-title":"Parallel LDPC decoding on a GPU using OpenCL and global memory for accelerators","author":"Hong J.","year":"2015","unstructured":"J. Hong and K. Chung . 2015. Parallel LDPC decoding on a GPU using OpenCL and global memory for accelerators. IEEE International Conference on Networking, Architecture and Storage (2015).","journal-title":"IEEE International Conference on Networking, Architecture and Storage"},{"key":"e_1_3_3_155_2","article-title":"Accelerating CUDA graph algorithms at maximum warp","author":"Hong S.","year":"2011","unstructured":"S. Hong , S. K. Kim , T. Oguntebi , et\u00a0al. 2011. Accelerating CUDA graph algorithms at maximum warp. 
Proceedings of the 16th ACM symposium on Principles and Practice of Parallel Programming.","journal-title":"Proceedings of the 16th ACM symposium on Principles and Practice of Parallel Programming."},{"key":"e_1_3_3_156_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11704-019-8184-3"},{"key":"e_1_3_3_157_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2015.11.001"},{"key":"e_1_3_3_158_2","doi-asserted-by":"publisher","DOI":"10.1109\/InfoSEEE.2014.6947772"},{"key":"e_1_3_3_159_2","article-title":"Performance analysis and optimization for MTTKRP of sparse tensor on CPU and GPU","author":"Hu R.","year":"2020","unstructured":"R. Hu , W. Yang , X. Zhou , et\u00a0al. 2020. Performance analysis and optimization for MTTKRP of sparse tensor on CPU and GPU. 2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC\/SmartCity\/DSS) (2020).","journal-title":"2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC\/SmartCity\/DSS)"},{"key":"e_1_3_3_160_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.3002610"},{"key":"e_1_3_3_161_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00017"},{"key":"e_1_3_3_162_2","article-title":"An empirically optimized radix sort for GPU","author":"Huang B.","year":"2009","unstructured":"B. Huang , J. Gao , and X. Li . 2009. An empirically optimized radix sort for GPU. 
IEEE International Symposium on Parallel and Distributed Processing with Applications (2009).","journal-title":"IEEE International Symposium on Parallel and Distributed Processing with Applications"},{"key":"e_1_3_3_163_2","doi-asserted-by":"publisher","DOI":"10.1145\/3372419"},{"key":"e_1_3_3_164_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2019.102589"},{"key":"e_1_3_3_165_2","article-title":"A CUDA implementation of the standard particle swarm optimization","author":"Hussain M. M.","year":"2016","unstructured":"M. M. Hussain , H. Hattori , and N. Fujimoto . 2016. A CUDA implementation of the standard particle swarm optimization. 18th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (2016).","journal-title":"18th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing"},{"key":"e_1_3_3_166_2","article-title":"Highly parallel transformation and quantization for HEVC encoder on GPUs","author":"Igarashi H.","year":"2016","unstructured":"H. Igarashi , F. Takano , and T. Moriyoshi . 2016. Highly parallel transformation and quantization for HEVC encoder on GPUs. Visual Communications and Image Processing (2016).","journal-title":"Visual Communications and Image Processing"},{"key":"e_1_3_3_167_2","doi-asserted-by":"publisher","DOI":"10.1587\/transinf.E96.D.2596"},{"key":"e_1_3_3_168_2","article-title":"Fast ellipse detection algorithm using hough transform on the GPU","author":"Ito Y.","year":"2011","unstructured":"Y. Ito , K. Ogawa , and K. Nakano . 2011. Fast ellipse detection algorithm using hough transform on the GPU. 
2nd International Conference on Networking and Computing (2011).","journal-title":"2nd International Conference on Networking and Computing"},{"key":"e_1_3_3_169_2","doi-asserted-by":"publisher","DOI":"10.1109\/RoEduNet.2011.5993693"},{"key":"e_1_3_3_170_2","doi-asserted-by":"publisher","DOI":"10.1145\/1513895.1513903"},{"key":"e_1_3_3_171_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2010.107"},{"key":"e_1_3_3_172_2","doi-asserted-by":"publisher","DOI":"10.1145\/3361870"},{"key":"e_1_3_3_173_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.3751"},{"key":"e_1_3_3_174_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-38241-3_6"},{"key":"e_1_3_3_175_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2013.58"},{"key":"e_1_3_3_176_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.3722"},{"key":"e_1_3_3_177_2","article-title":"Graph-oriented code transformation approach for register-limited stencils on GPUs","author":"Jin M.","year":"2016","unstructured":"M. Jin , H. Fu , Z. Lv , et\u00a0al. 2016. Graph-oriented code transformation approach for register-limited stencils on GPUs. 16th IEEE\/ACM International Symposium on Cluster, Cloud and Grid Computing (2016).","journal-title":"16th IEEE\/ACM International Symposium on Cluster, Cloud and Grid Computing"},{"key":"e_1_3_3_178_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-41050-6_11"},{"key":"e_1_3_3_179_2","article-title":"Increasing GPU-speedup of volume rendering for images with high complexity","author":"Jun S.","year":"2015","unstructured":"S. Jun and O. Ha . 2015. Increasing GPU-speedup of volume rendering for images with high complexity. 8th International Conference on u- and e-Service, Science and Technology (2015).","journal-title":"8th International Conference on u- and e-Service, Science and Technology"},{"key":"e_1_3_3_180_2","article-title":"WarpDrive: Massively parallel hashing on multi-GPU nodes","author":"Junger D.","year":"2018","unstructured":"D. Junger , C. 
Hundt , and B. Schmidt . 2018. WarpDrive: Massively parallel hashing on multi-GPU nodes. IEEE International Parallel and Distributed Processing Symposium (2018).","journal-title":"IEEE International Parallel and Distributed Processing Symposium"},{"key":"e_1_3_3_181_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-015-1613-7"},{"key":"e_1_3_3_182_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2018.11.012"},{"key":"e_1_3_3_183_2","doi-asserted-by":"publisher","DOI":"10.2991\/ijndc.2014.2.3.2"},{"key":"e_1_3_3_184_2","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830796"},{"key":"e_1_3_3_185_2","article-title":"Scalable SIMD-efficient graph processing on GPUs","author":"Khorasani F.","year":"2015","unstructured":"F. Khorasani , R. Gupta , and L. N. Bhuyan . 2015. Scalable SIMD-efficient graph processing on GPUs. International Conference on Parallel Architecture and Compilation (2015).","journal-title":"International Conference on Parallel Architecture and Compilation"},{"key":"e_1_3_3_186_2","article-title":"Eliminating intra-warp load imbalance in irregular nested patterns via collaborative task engagement","author":"Khorasani F.","year":"2016","unstructured":"F. Khorasani , B. Rowe , R. Gupta , et\u00a0al. 2016. Eliminating intra-warp load imbalance in irregular nested patterns via collaborative task engagement. IEEE International Parallel and Distributed Processing Symposium (2016).","journal-title":"IEEE International Parallel and Distributed Processing Symposium"},{"key":"e_1_3_3_187_2","article-title":"CuSha","author":"Khorasani F.","year":"2014","unstructured":"F. Khorasani , K. Vora , R. Gupta , et\u00a0al. 2014. CuSha. 
In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing.","journal-title":"In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing."},{"key":"e_1_3_3_188_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cpc.2011.01.025"},{"key":"e_1_3_3_189_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.4470"},{"key":"e_1_3_3_190_2","first-page":"450","article-title":"Improving ODE integration on graphics processing units by reducing thread divergence","author":"Kovac T.","year":"2019","unstructured":"T. Kovac , T. Haber , F. van Reeth , et\u00a0al. 2019. Improving ODE integration on graphics processing units by reducing thread divergence. International Conference on Computational Science (2019), 450\u2013456.","journal-title":"International Conference on Computational Science"},{"key":"e_1_3_3_191_2","first-page":"213","volume-title":"An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms","author":"Kowarschik M.","year":"2003","unstructured":"M. Kowarschik and C. Wei\u00df . 2003. An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms . Springer, Berlin, 213\u2013232."},{"key":"e_1_3_3_192_2","article-title":"Optimization techniques for OpenCL-based linear algebra routines","author":"Kozacik S.","year":"2014","unstructured":"S. Kozacik , P. Fox , J. Humphrey , et\u00a0al. 2014. Optimization techniques for OpenCL-based linear algebra routines. 
Modeling and Simulation for Defense Systems and Applications IX (2014).","journal-title":"Modeling and Simulation for Defense Systems and Applications IX"},{"key":"e_1_3_3_193_2","doi-asserted-by":"publisher","DOI":"10.1137\/130930352"},{"key":"e_1_3_3_194_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2013.08.002"},{"key":"e_1_3_3_195_2","doi-asserted-by":"publisher","DOI":"10.1145\/3404397.3404426"},{"key":"e_1_3_3_196_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11554-012-0309-y"},{"key":"e_1_3_3_197_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2015.2467331"},{"key":"e_1_3_3_198_2","doi-asserted-by":"crossref","first-page":"267","DOI":"10.1007\/978-981-33-4859-2_27","article-title":"Optimization of ray-tracing algorithm for simulation of PMD sensors","author":"Lade S.","year":"2021","unstructured":"S. Lade , P. Kulkarni , P. Saraf , et\u00a0al. 2021. Optimization of ray-tracing algorithm for simulation of PMD sensors. Machine Learning and Information Processing (2021), 267\u2013280.","journal-title":"Machine Learning and Information Processing"},{"key":"e_1_3_3_199_2","article-title":"GPU implementation of the branch and bound method for knapsack problems","author":"Lalami M. E.","year":"2012","unstructured":"M. E. Lalami and D. El-Baz . 2012. GPU implementation of the branch and bound method for knapsack problems. IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (2012).","journal-title":"IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum"},{"key":"e_1_3_3_200_2","doi-asserted-by":"publisher","DOI":"10.1038\/nature14539"},{"key":"e_1_3_3_201_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cmpb.2010.10.013"},{"key":"e_1_3_3_202_2","article-title":"Design space exploration of the turbo decoding algorithm on GPUs","author":"Lee D.","year":"2010","unstructured":"D. Lee , M. Wolf , and H. Kim . 2010. 
Design space exploration of the turbo decoding algorithm on GPUs. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems.","journal-title":"In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems."},{"key":"e_1_3_3_203_2","article-title":"Optimization of GPU-based sparse matrix multiplication for large sparse networks","author":"Lee J.","year":"2020","unstructured":"J. Lee , S. Kang , Y. Yu , et\u00a0al. 2020. Optimization of GPU-based sparse matrix multiplication for large sparse networks. IEEE 36th International Conference on Data Engineering (2020).","journal-title":"IEEE 36th International Conference on Data Engineering"},{"key":"e_1_3_3_204_2","doi-asserted-by":"publisher","DOI":"10.1145\/1816038.1816021"},{"key":"e_1_3_3_205_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.5048"},{"key":"e_1_3_3_206_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.compfluid.2012.09.013"},{"key":"e_1_3_3_207_2","doi-asserted-by":"publisher","DOI":"10.5555\/1656506.1656513"},{"key":"e_1_3_3_208_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2017.2718515"},{"key":"e_1_3_3_209_2","article-title":"High performance parallel graph coloring on GPGPUs","author":"Li P.","year":"2016","unstructured":"P. Li , X. Chen , Z. Quan , et\u00a0al. 2016. High performance parallel graph coloring on GPGPUs. IEEE International Parallel and Distributed Processing Symposium Workshops (2016).","journal-title":"IEEE International Parallel and Distributed Processing Symposium Workshops"},{"key":"e_1_3_3_210_2","first-page":"884","article-title":"A note on auto-tuning GEMM for GPUs","author":"Li Y.","year":"2009","unstructured":"Y. Li , J. Dongarra , and S. Tomov . 2009. A note on auto-tuning GEMM for GPUs. 
International Conference on Computational Science (2009), 884\u2013892.","journal-title":"International Conference on Computational Science"},{"key":"e_1_3_3_211_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.3744"},{"key":"e_1_3_3_212_2","article-title":"Parallelization and optimization of a combustion simulation application on GPU platform","author":"Li Z.","year":"2020","unstructured":"Z. Li and Y. Che . 2020. Parallelization and optimization of a combustion simulation application on GPU platform. Proceedings of the 4th International Conference on High Performance Compilation, Computing and Communications.","journal-title":"Proceedings of the 4th International Conference on High Performance Compilation, Computing and Communications."},{"key":"e_1_3_3_213_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2018.10.012"},{"key":"e_1_3_3_214_2","article-title":"Design and evaluation of a parallel k-nearest neighbor algorithm on CUDA-enabled GPU","author":"Liang S.","year":"2010","unstructured":"S. Liang , Y. Liu , C. Wang , et\u00a0al. 2010. Design and evaluation of a parallel k-nearest neighbor algorithm on CUDA-enabled GPU. IEEE 2nd Symposium on Web Society (2010).","journal-title":"IEEE 2nd Symposium on Web Society"},{"key":"e_1_3_3_215_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2017.2681072"},{"key":"e_1_3_3_216_2","article-title":"An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases","author":"Ligowski L.","year":"2009","unstructured":"L. Ligowski and W. Rudnicki . 2009. An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases. 
IEEE International Symposium on Parallel & Distributed Processing (2009).","journal-title":"IEEE International Symposium on Parallel & Distributed Processing"},{"key":"e_1_3_3_217_2","article-title":"A software technique to enhance register utilization of convolutional neural networks on GPGPUs","author":"Lin C.","year":"2017","unstructured":"C. Lin , A. Cheng , and B. Lai . 2017. A software technique to enhance register utilization of convolutional neural networks on GPGPUs. International Conference on Applied System Innovation (2017).","journal-title":"International Conference on Applied System Innovation"},{"key":"e_1_3_3_218_2","article-title":"Performance enhancement of GPU parallel computing using memory allocation optimization","author":"Lin C.","year":"2020","unstructured":"C. Lin , J. Liu , and P. Yang . 2020. Performance enhancement of GPU parallel computing using memory allocation optimization. 14th International Conference on Ubiquitous Information Management and Communication (2020).","journal-title":"14th International Conference on Ubiquitous Information Management and Communication"},{"key":"e_1_3_3_219_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2020.02.003"},{"key":"e_1_3_3_220_2","doi-asserted-by":"publisher","DOI":"10.1145\/3242089"},{"key":"e_1_3_3_221_2","doi-asserted-by":"publisher","DOI":"10.1109\/LCOMM.2014.010214.132406"},{"key":"e_1_3_3_222_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2008.31"},{"key":"e_1_3_3_223_2","doi-asserted-by":"publisher","DOI":"10.1287\/opre.9.3.383"},{"key":"e_1_3_3_224_2","article-title":"A unified optimization approach for sparse tensor operations on GPUs","author":"Liu B.","year":"2017","unstructured":"B. Liu , C. Wen , A. D. Sarwate , et\u00a0al. 2017. A unified optimization approach for sparse tensor operations on GPUs. 
IEEE International Conference on Cluster Computing (2017).","journal-title":"IEEE International Conference on Cluster Computing"},{"key":"e_1_3_3_225_2","first-page":"411","article-title":"SIMD-X: Programming and processing of graph algorithms on GPUs","author":"Liu H.","year":"2019","unstructured":"H. Liu and H. H. Huang . 2019. SIMD-X: Programming and processing of graph algorithms on GPUs. USENIX Annual Technical Conference (2019), 411\u2013428.","journal-title":"USENIX Annual Technical Conference"},{"key":"e_1_3_3_226_2","first-page":"395","article-title":"Memory capacity aware non-blocking data transfer on GPGPU","author":"Liu H.","year":"2013","unstructured":"H. Liu , H. Kuo , K. Chen , et\u00a0al. 2013. Memory capacity aware non-blocking data transfer on GPGPU. IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation (2013), 395\u2013400.","journal-title":"IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation"},{"key":"e_1_3_3_227_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378471"},{"key":"e_1_3_3_228_2","article-title":"Data layout optimization for GPGPU architectures","author":"Liu J.","year":"2013","unstructured":"J. Liu , W. Ding , O. Jang , et\u00a0al. 2013. Data layout optimization for GPGPU architectures. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.","journal-title":"Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming."},{"key":"e_1_3_3_229_2","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2017.3001256"},{"key":"e_1_3_3_230_2","article-title":"An efficient GPU general sparse matrix-matrix multiplication for irregular data","author":"Liu W.","year":"2014","unstructured":"W. Liu and B. Vinter . 2014. An efficient GPU general sparse matrix-matrix multiplication for irregular data. 
IEEE 28th International Parallel and Distributed Processing Symposium (2014).","journal-title":"IEEE 28th International Parallel and Distributed Processing Symposium"},{"key":"e_1_3_3_231_2","doi-asserted-by":"publisher","DOI":"10.1186\/1756-0500-2-73"},{"key":"e_1_3_3_232_2","doi-asserted-by":"publisher","DOI":"10.1186\/1756-0500-3-93"},{"key":"e_1_3_3_233_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2019.03.014"},{"key":"e_1_3_3_234_2","article-title":"Optimizing GPU memory transactions for convolution operations","author":"Lu G.","year":"2020","unstructured":"G. Lu , W. Zhang , and Z. Wang . 2020. Optimizing GPU memory transactions for convolution operations. IEEE International Conference on Cluster Computing (2020).","journal-title":"IEEE International Conference on Cluster Computing"},{"key":"e_1_3_3_235_2","doi-asserted-by":"publisher","DOI":"10.1587\/transinf.2016EDP7174"},{"key":"e_1_3_3_236_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.5754"},{"key":"e_1_3_3_237_2","doi-asserted-by":"publisher","DOI":"10.1145\/1837274.1837289"},{"key":"e_1_3_3_238_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-03644-6_12"},{"key":"e_1_3_3_239_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2018.07.018"},{"key":"e_1_3_3_240_2","article-title":"AdELL: An adaptive warp-balancing ELL format for efficient sparse matrix-vector multiplication on GPUs","author":"Maggioni M.","year":"2013","unstructured":"M. Maggioni and T. Berger-Wolf . 2013. AdELL: An adaptive warp-balancing ELL format for efficient sparse matrix-vector multiplication on GPUs. 
42nd International Conference on Parallel Processing (2013).","journal-title":"42nd International Conference on Parallel Processing"},{"key":"e_1_3_3_241_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2013.05.196"},{"key":"e_1_3_3_242_2","article-title":"CoAdELL: Adaptivity and compression for improving sparse matrix-vector multiplication on GPUs","author":"Maggioni M.","year":"2014","unstructured":"M. Maggioni and T. Berger-Wolf . 2014. CoAdELL: Adaptivity and compression for improving sparse matrix-vector multiplication on GPUs. IEEE International Parallel & Distributed Processing Symposium Workshops (2014).","journal-title":"IEEE International Parallel & Distributed Processing Symposium Workshops"},{"key":"e_1_3_3_243_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2016.03.011"},{"key":"e_1_3_3_244_2","first-page":"1","article-title":"A large-scale cross-architecture evaluation of thread-coarsening","author":"Magni A.","year":"2013","unstructured":"A. Magni , C. Dubach , and M. F. O\u2019Boyle . 2013. A large-scale cross-architecture evaluation of thread-coarsening. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis , 1\u201311.","journal-title":"In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis"},{"key":"e_1_3_3_245_2","doi-asserted-by":"publisher","DOI":"10.1145\/2628071.2628087"},{"key":"e_1_3_3_246_2","doi-asserted-by":"publisher","DOI":"10.15803\/ijnc.1.2_260"},{"key":"e_1_3_3_247_2","doi-asserted-by":"publisher","DOI":"10.1080\/17445760.2012.703195"},{"key":"e_1_3_3_248_2","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2105-9-S2-S10"},{"key":"e_1_3_3_249_2","article-title":"An efficient transaction-based GPU implementation of minimum spanning forest algorithm","author":"Manoochehri S.","year":"2017","unstructured":"S. Manoochehri , B. Goodarzi , and D. Goswami . 2017. 
An efficient transaction-based GPU implementation of minimum spanning forest algorithm. International Conference on High Performance Computing & Simulation (2017).","journal-title":"International Conference on High Performance Computing & Simulation"},{"key":"e_1_3_3_250_2","article-title":"Sparse matrix-matrix multiplication on modern architectures","author":"Matam K.","year":"2012","unstructured":"K. Matam , S. R. K. B. Indarapu , and K. Kothapalli . 2012. Sparse matrix-matrix multiplication on modern architectures. 19th International Conference on High Performance Computing (2012).","journal-title":"19th International Conference on High Performance Computing"},{"key":"e_1_3_3_251_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2011.04.036"},{"key":"e_1_3_3_252_2","doi-asserted-by":"publisher","DOI":"10.1145\/3368826.3377904"},{"key":"e_1_3_3_253_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cpc.2014.06.003"},{"key":"e_1_3_3_254_2","article-title":"A GPU implementation of inclusion-based points-to analysis","author":"Mendez-Lojo M.","year":"2012","unstructured":"M. Mendez-Lojo , M. Burtscher , and K. Pingali . 2012. A GPU implementation of inclusion-based points-to analysis. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.","journal-title":"Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming."},{"key":"e_1_3_3_255_2","doi-asserted-by":"publisher","DOI":"10.1145\/2370036.2145831"},{"key":"e_1_3_3_256_2","article-title":"Scalable GPU graph traversal","author":"Merrill D.","year":"2012","unstructured":"D. Merrill , M. Garland , and A. Grimshaw . 2012. Scalable GPU graph traversal. 
In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.","journal-title":"Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming."},{"key":"e_1_3_3_257_2","doi-asserted-by":"publisher","DOI":"10.1142\/S0129626411000187"},{"key":"e_1_3_3_258_2","doi-asserted-by":"publisher","DOI":"10.1145\/1513895.1513905"},{"key":"e_1_3_3_259_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2012.04.011"},{"key":"e_1_3_3_260_2","doi-asserted-by":"publisher","DOI":"10.1587\/transinf.2016EDP7178"},{"key":"e_1_3_3_261_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2019.01.011"},{"key":"e_1_3_3_262_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2019.101635"},{"key":"e_1_3_3_263_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00607-014-0434-5"},{"key":"e_1_3_3_264_2","article-title":"A memory optimization technique for software-managed scratchpad memory in GPUs","author":"Moazeni M.","year":"2009","unstructured":"M. Moazeni , A. Bui , and M. Sarrafzadeh . 2009. A memory optimization technique for software-managed scratchpad memory in GPUs. IEEE 7th Symposium on Application Specific Processors (2009).","journal-title":"IEEE 7th Symposium on Application Specific Processors"},{"key":"e_1_3_3_265_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-11515-8_10"},{"key":"e_1_3_3_266_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-012-1053-9"},{"key":"e_1_3_3_267_2","doi-asserted-by":"publisher","DOI":"10.3390\/app9050947"},{"key":"e_1_3_3_268_2","doi-asserted-by":"publisher","DOI":"10.1109\/PDP.2015.66"},{"issue":"5","key":"e_1_3_3_269_2","first-page":"211","article-title":"Optimization of sparse matrix-vector multiplication for CRS format on NVIDIA kepler architecture GPUs","volume":"7975","author":"Mukunoki D.","year":"2013","unstructured":"D. Mukunoki and D. Takahashi . 2013. 
Optimization of sparse matrix-vector multiplication for CRS format on NVIDIA kepler architecture GPUs. International Conference on Computational Science and Its Applications 7975 LNCS, PART 5 (2013), 211\u2013223.","journal-title":"International Conference on Computational Science and Its Applications"},{"key":"e_1_3_3_270_2","article-title":"Compact data structure and scalable algorithms for the sparse grid technique","author":"Murarasu A.","year":"2011","unstructured":"A. Murarasu , J. Weidendorfer , G. Buse , et\u00a0al. 2011. Compact data structure and scalable algorithms for the sparse grid technique. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming.","journal-title":"In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming."},{"key":"e_1_3_3_271_2","article-title":"Optimal loop unrolling for GPGPU programs","author":"Murthy G. S.","year":"2010","unstructured":"G. S. Murthy , M. Ravishankar , M. M. Baskaran , et\u00a0al. 2010. Optimal loop unrolling for GPGPU programs. IEEE International Symposium on Parallel & Distributed Processing (2010).","journal-title":"IEEE International Symposium on Parallel & Distributed Processing"},{"key":"e_1_3_3_272_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2015.2463813"},{"key":"e_1_3_3_273_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2016.05.304"},{"key":"e_1_3_3_274_2","doi-asserted-by":"publisher","DOI":"10.1145\/1810085.1810130"},{"key":"e_1_3_3_275_2","doi-asserted-by":"publisher","DOI":"10.1145\/2458523.2458533"},{"key":"e_1_3_3_276_2","article-title":"Data-driven versus topology-driven irregular computations on GPUs","author":"Nasre R.","year":"2013","unstructured":"R. Nasre , M. Burtscher , and K. Pingali . 2013. Data-driven versus topology-driven irregular computations on GPUs. 
IEEE 27th International Symposium on Parallel and Distributed Processing (2013).","journal-title":"IEEE 27th International Symposium on Parallel and Distributed Processing"},{"key":"e_1_3_3_277_2","article-title":"Morph algorithms on GPUs","author":"Nasre R.","year":"2013","unstructured":"R. Nasre , M. Burtscher , and K. Pingali . 2013. Morph algorithms on GPUs. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.","journal-title":"In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming."},{"key":"e_1_3_3_278_2","article-title":"Optimizing symmetric dense matrix-vector multiplication on GPUs","author":"Nath R.","year":"2011","unstructured":"R. Nath , S. Tomov , T. \u201cTim\u201d Dong , et\u00a0al. 2011. Optimizing symmetric dense matrix-vector multiplication on GPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.","journal-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis."},{"key":"e_1_3_3_279_2","doi-asserted-by":"publisher","DOI":"10.1177\/1094342010385729"},{"key":"e_1_3_3_280_2","article-title":"Predicting an optimal sparse matrix format for SpMV computation on GPU","author":"Neelima B.","year":"2014","unstructured":"B. Neelima , G. R. M. Reddy , and P. S. Raghavendra . 2014. Predicting an optimal sparse matrix format for SpMV computation on GPU. IEEE International Parallel & Distributed Processing Symposium Workshops (2014).","journal-title":"IEEE International Parallel & Distributed Processing Symposium Workshops"},{"key":"e_1_3_3_281_2","article-title":"3.5-D blocking optimization for stencil computations on modern CPUs and GPUs","author":"Nguyen A.","year":"2010","unstructured":"A. Nguyen , N. Satish , J. Chhugani , et\u00a0al. 2010. 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. 
ACM\/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (2010).","journal-title":"ACM\/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis"},{"key":"e_1_3_3_282_2","article-title":"Load-balanced sparse MTTKRP on GPUs","author":"Nisa I.","year":"2019","unstructured":"I. Nisa , J. Li , A. Sukumaran-Rajam , et\u00a0al. 2019. Load-balanced sparse MTTKRP on GPUs. IEEE International Parallel and Distributed Processing Symposium (2019).","journal-title":"IEEE International Parallel and Distributed Processing Symposium"},{"key":"e_1_3_3_283_2","article-title":"Sampled dense matrix multiplication for high-performance machine learning","author":"Nisa I.","year":"2018","unstructured":"I. Nisa , A. Sukumaran-Rajam , S. E. Kurt , et\u00a0al. 2018. Sampled dense matrix multiplication for high-performance machine learning. IEEE 25th International Conference on High Performance Computing (2018).","journal-title":"IEEE 25th International Conference on High Performance Computing"},{"key":"e_1_3_3_284_2","article-title":"Accelerating the dynamic programming for the matrix chain product on the GPU","author":"Nishida K.","year":"2011","unstructured":"K. Nishida , Y. Ito , and K. Nakano . 2011. Accelerating the dynamic programming for the matrix chain product on the GPU. Second International Conference on Networking and Computing (2011).","journal-title":"Second International Conference on Networking and Computing"},{"key":"e_1_3_3_285_2","first-page":"1","article-title":"Accelerating the dynamic programming for the optimal polygon triangulation on the GPU","author":"Nishida K.","year":"2012","unstructured":"K. Nishida , K. Nakano , and Y. Ito . 2012. Accelerating the dynamic programming for the optimal polygon triangulation on the GPU. 
Algorithms and Architectures for Parallel Processing (2012), 1\u201315.","journal-title":"Algorithms and Architectures for Parallel Processing"},{"key":"e_1_3_3_286_2","doi-asserted-by":"publisher","DOI":"10.1145\/1900008.1900035"},{"key":"e_1_3_3_287_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2014.2324587"},{"key":"e_1_3_3_288_2","doi-asserted-by":"publisher","DOI":"10.1109\/MCSoC.2015.10"},{"key":"e_1_3_3_289_2","unstructured":"NVIDIA . 2010. NVIDIA\u2019s Next Generation CUDA Compute Architecture: Fermi Whitepaper."},{"key":"e_1_3_3_290_2","unstructured":"NVIDIA . 2012. NVIDIA\u2019s Next Generation CUDA Compute Architecture: Kepler GK110\/210 Whitepaper."},{"key":"e_1_3_3_291_2","unstructured":"NVIDIA . 2014. NVIDIA GeForce GTX 980 Whitepaper."},{"key":"e_1_3_3_292_2","unstructured":"NVIDIA . 2016. NVIDIA Tesla P100 Whitepaper."},{"key":"e_1_3_3_293_2","unstructured":"NVIDIA . 2017. NVIDIA Tesla V100 Whitepaper."},{"key":"e_1_3_3_294_2","unstructured":"NVIDIA . 2018. NVIDIA Turing GPU Architecture Whitepaper."},{"key":"e_1_3_3_295_2","unstructured":"NVIDIA . 2020. CUDA C++ Programming Guide. Retrieved July 2022 from https:\/\/docs.nvidia.com\/cuda\/pdf\/CUDA_C_Programming_Guide.pdf."},{"key":"e_1_3_3_296_2","unstructured":"NVIDIA . 2020. NVIDIA A100 Tensor Core GPU Architecture."},{"key":"e_1_3_3_297_2","doi-asserted-by":"publisher","DOI":"10.1186\/s12859-015-0744-4"},{"key":"e_1_3_3_298_2","article-title":"Achieving TeraCUPS on longest common subsequence problem using GPGPUs","author":"Ozsoy A.","year":"2013","unstructured":"A. Ozsoy , A. Chauhan , and M. Swany . 2013. Achieving TeraCUPS on longest common subsequence problem using GPGPUs. International Conference on Parallel and Distributed Systems (2013).","journal-title":"International Conference on Parallel and Distributed Systems"},{"key":"e_1_3_3_299_2","article-title":"A compiler for throughput optimization of graph algorithms on GPUs","author":"Pai S.","year":"2016","unstructured":"S. 
Pai and K. Pingali . 2016. A compiler for throughput optimization of graph algorithms on GPUs. In Proceedings of the ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications. (2016).","journal-title":"Proceedings of the ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications."},{"key":"e_1_3_3_300_2","doi-asserted-by":"publisher","DOI":"10.1145\/2499368.2451160"},{"key":"e_1_3_3_301_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-011-0631-3"},{"key":"e_1_3_3_302_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-41278-3_64"},{"key":"e_1_3_3_303_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-16205-4_9"},{"key":"e_1_3_3_304_2","doi-asserted-by":"publisher","DOI":"10.1145\/2909437.2909451"},{"key":"e_1_3_3_305_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2020.02.069"},{"key":"e_1_3_3_306_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11771-017-3676-5"},{"key":"e_1_3_3_307_2","article-title":"Using dynamic parallelism for fine-grained, irregular workloads: A case study of the N-queens problem","author":"Plauth M.","year":"2015","unstructured":"M. Plauth , F. Feinbube , F. Schlegel , et\u00a0al. 2015. Using dynamic parallelism for fine-grained, irregular workloads: A case study of the N-queens problem. Third International Symposium on Computing and Networking. (2015).","journal-title":"Third International Symposium on Computing and Networking."},{"key":"e_1_3_3_308_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.newast.2007.05.004"},{"issue":"1","key":"e_1_3_3_309_2","first-page":"97","article-title":"GPU-based high performance password recovery technique for hash functions","volume":"32","author":"Qiu W.","year":"2016","unstructured":"W. Qiu , Z. Gong , Y. Guo , et\u00a0al. 2016. GPU-based high performance password recovery technique for hash functions. 
Journal of Information Science and Engineering 32, 1 (2016), 97\u2013112.","journal-title":"Journal of Information Science and Engineering"},{"key":"e_1_3_3_310_2","doi-asserted-by":"publisher","DOI":"10.1145\/2491956.2462176"},{"key":"e_1_3_3_311_2","doi-asserted-by":"crossref","DOI":"10.1145\/2967938.2967967","article-title":"Resource conscious reuse-driven tiling for GPUs","author":"Rawat P. S.","year":"2016","unstructured":"P. S. Rawat , C. Hong , M. Ravishankar , et\u00a0al. 2016. Resource conscious reuse-driven tiling for GPUs. In Proceedings of the International Conference on Parallel Architectures and Compilation.","journal-title":"Proceedings of the International Conference on Parallel Architectures and Compilation."},{"key":"e_1_3_3_312_2","doi-asserted-by":"crossref","DOI":"10.1145\/2967938.2967950","article-title":"Reduction drawing","author":"Reddy C.","year":"2016","unstructured":"C. Reddy , M. Kruse , and A. Cohen . 2016. Reduction drawing. In Proceedings of the International Conference on Parallel Architectures and Compilation.","journal-title":"Proceedings of the International Conference on Parallel Architectures and Compilation."},{"key":"e_1_3_3_313_2","article-title":"Impact of vectorization over 16-bit data-types on GPUs","author":"Reis L.","year":"2018","unstructured":"L. Reis , R. Nobre , and J. M. P. Cardoso . 2018. Impact of vectorization over 16-bit data-types on GPUs. 
In Proceedings of the 9th Workshop and 7th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms.","journal-title":"In Proceedings of the 9th Workshop and 7th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms."},{"key":"e_1_3_3_314_2","article-title":"Efficient control flow restructuring for GPUs","author":"Reissmann N.","year":"2016","unstructured":"N. Reissmann , T. L. Falch , B. A. Bjornseth , et\u00a0al. 2016. Efficient control flow restructuring for GPUs. International Conference on High Performance Computing & Simulation (2016).","journal-title":"International Conference on High Performance Computing & Simulation"},{"key":"e_1_3_3_315_2","doi-asserted-by":"publisher","DOI":"10.1145\/2884045.2884046"},{"key":"e_1_3_3_316_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2021.02.013"},{"key":"e_1_3_3_317_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.3970"},{"key":"e_1_3_3_318_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2016.62"},{"key":"e_1_3_3_319_2","doi-asserted-by":"publisher","DOI":"10.1145\/1345206.1345220"},{"key":"e_1_3_3_320_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2008.05.011"},{"key":"e_1_3_3_321_2","article-title":"Program optimization space pruning for a multithreaded gpu","author":"Ryoo S.","year":"2008","unstructured":"S. Ryoo , C. I. Rodrigues , S. S. Stone , et\u00a0al. 2008. Program optimization space pruning for a multithreaded gpu. In Proceedings of the 6th Annual IEEE\/ACM International Symposium on Code Generation and Optimization.","journal-title":"Proceedings of the 6th Annual IEEE\/ACM International Symposium on Code Generation and Optimization."},{"key":"e_1_3_3_322_2","article-title":"SAGE","author":"Samadi M.","year":"2013","unstructured":"M. Samadi , J. 
Lee , D. A. Jamshidi , et\u00a0al. 2013. SAGE. In Proceedings of the International Symposium on Microarchitecture.","journal-title":"Proceedings of the International Symposium on Microarchitecture."},{"key":"e_1_3_3_323_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2012.194"},{"key":"e_1_3_3_324_2","article-title":"Reuse and refactoring of GPU kernels to design complex applications","author":"Sarkar S.","year":"2012","unstructured":"S. Sarkar , S. Mitra , and A. Srinivasan . 2012. Reuse and refactoring of GPU kernels to design complex applications. IEEE 10th International Symposium on Parallel and Distributed Processing with Applications (2012).","journal-title":"IEEE 10th International Symposium on Parallel and Distributed Processing with Applications"},{"key":"e_1_3_3_325_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2012.2232647"},{"key":"e_1_3_3_326_2","article-title":"Designing efficient sorting algorithms for manycore GPUs","author":"Satish N.","year":"2009","unstructured":"N. Satish , M. Harris , and M. Garland . 2009. Designing efficient sorting algorithms for manycore GPUs. IEEE International Symposium on Parallel & Distributed Processing (2009).","journal-title":"IEEE International Symposium on Parallel & Distributed Processing"},{"key":"e_1_3_3_327_2","doi-asserted-by":"publisher","DOI":"10.1049\/iet-cdt.2017.0149"},{"key":"e_1_3_3_328_2","first-page":"97","article-title":"Scan primitives for GPU computing","author":"Sengupta S.","year":"2007","unstructured":"S. Sengupta , M. Harris , Y. Zhang , et\u00a0al. 2007. Scan primitives for GPU computing. Proceedings of the SIGGRAPH\/Eurographics Workshop on Graphics Hardware (2007), 97\u2013106.","journal-title":"Proceedings of the SIGGRAPH\/Eurographics Workshop on Graphics Hardware"},{"key":"e_1_3_3_329_2","article-title":"GPU-based graph traversal on compressed graphs","author":"Sha M.","year":"2019","unstructured":"M. Sha , Y. Li , and K. Tan . 2019. 
GPU-based graph traversal on compressed graphs. In Proceedings of the International Conference on Management of Data.","journal-title":"Proceedings of the International Conference on Management of Data."},{"key":"e_1_3_3_330_2","article-title":"Efficient sparse-dense matrix-matrix multiplication on GPUs using the customized sparse storage format","author":"Shi S.","year":"2020","unstructured":"S. Shi , Q. Wang , and X. Chu . 2020. Efficient sparse-dense matrix-matrix multiplication on GPUs using the customized sparse storage format. IEEE 26th International Conference on Parallel and Distributed Systems (2020).","journal-title":"IEEE 26th International Conference on Parallel and Distributed Systems"},{"key":"e_1_3_3_331_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-021-10642-4"},{"key":"e_1_3_3_332_2","article-title":"CUDA memory optimizations for large data-structures in the gravit simulator","author":"Siegel J.","year":"2009","unstructured":"J. Siegel , J. Ributzka , and X. Li . 2009. CUDA memory optimizations for large data-structures in the gravit simulator. International Conference on Parallel Processing Workshops (2009).","journal-title":"International Conference on Parallel Processing Workshops"},{"key":"e_1_3_3_333_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-015-0409-y"},{"key":"e_1_3_3_334_2","article-title":"Accelerating domain propagation: An efficient GPU-parallel algorithm over sparse matrices","author":"Sofranac B.","year":"2020","unstructured":"B. Sofranac , A. Gleixner , and S. Pokutta . 2020. Accelerating domain propagation: An efficient GPU-parallel algorithm over sparse matrices. IEEE\/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (2020).","journal-title":"IEEE\/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms"},{"key":"e_1_3_3_335_2","article-title":"A fast GPU algorithm for graph connectivity","author":"Soman J.","year":"2010","unstructured":"J. Soman , K. 
Kishore , and P. J. Narayanan . 2010. A fast GPU algorithm for graph connectivity. Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, Workshops and Phd Forum.","journal-title":"Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, Workshops and Phd Forum."},{"key":"e_1_3_3_336_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-29347-4_51"},{"key":"e_1_3_3_337_2","article-title":"Fast two dimensional convex hull on the GPU","author":"Srungarapu S.","year":"2011","unstructured":"S. Srungarapu , D. P. Reddy , K. Kothapalli , et\u00a0al. 2011. Fast two dimensional convex hull on the GPU. Proceedings of the 25th IEEE International Conference on Advanced Information Networking and Applications Workshops.","journal-title":"Proceedings of the 25th IEEE International Conference on Advanced Information Networking and Applications Workshops."},{"key":"e_1_3_3_338_2","article-title":"Optimization and architecture effects on GPU computing workload performance","author":"Stratton J. A.","year":"2012","unstructured":"J. A. Stratton , N. Anssari , C. Rodrigues , et\u00a0al. 2012. Optimization and architecture effects on GPU computing workload performance. Innovative Parallel Computing (2012).","journal-title":"Innovative Parallel Computing"},{"key":"e_1_3_3_339_2","doi-asserted-by":"publisher","DOI":"10.1177\/1094342019832958"},{"key":"e_1_3_3_340_2","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304624"},{"key":"e_1_3_3_341_2","first-page":"125","article-title":"On the GPU performance of 3D stencil computations implemented in OpenCL","author":"Su H.","year":"2013","unstructured":"H. Su , N. Wu , M. Wen , et\u00a0al. 2013. On the GPU performance of 3D stencil computations implemented in OpenCL. 
International Supercomputing Conference (2013), 125\u2013135.","journal-title":"International Supercomputing Conference"},{"key":"e_1_3_3_342_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2014.03.013"},{"key":"e_1_3_3_343_2","article-title":"Optimizing SpMV for diagonal sparse matrices on GPU","author":"Sun X.","year":"2011","unstructured":"X. Sun , Y. Zhang , T. Wang , et\u00a0al. 2011. Optimizing SpMV for diagonal sparse matrices on GPU. International Conference on Parallel Processing (2011).","journal-title":"International Conference on Parallel Processing"},{"key":"e_1_3_3_344_2","doi-asserted-by":"publisher","DOI":"10.1145\/2555243.2555266"},{"key":"e_1_3_3_345_2","doi-asserted-by":"publisher","DOI":"10.1145\/1854273.1854336"},{"key":"e_1_3_3_346_2","doi-asserted-by":"publisher","DOI":"10.1177\/1094342018816368"},{"key":"e_1_3_3_347_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-014-1102-4"},{"key":"e_1_3_3_348_2","doi-asserted-by":"publisher","DOI":"10.1002\/nme.3240"},{"key":"e_1_3_3_349_2","article-title":"Fast implementation of DGEMM on Fermi GPU","author":"Tan G.","year":"2011","unstructured":"G. Tan , L. Li , S. Triechle , et\u00a0al. 2011. Fast implementation of DGEMM on Fermi GPU. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.","journal-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis."},{"key":"e_1_3_3_350_2","doi-asserted-by":"publisher","DOI":"10.1145\/2907294.2907297"},{"key":"e_1_3_3_351_2","article-title":"An efficient skinny matrix-matrix multiplication method by folding input matrices into tensor core operations","author":"Tang H.","year":"2020","unstructured":"H. Tang , K. Komatsu , M. Sato , et\u00a0al. 2020. An efficient skinny matrix-matrix multiplication method by folding input matrices into tensor core operations. 
8th International Symposium on Computing and Networking Workshops (2020).","journal-title":"8th International Symposium on Computing and Networking Workshops"},{"key":"e_1_3_3_352_2","article-title":"Optimizing and auto-tuning iterative stencil loops for GPUs with the in-plane method","author":"Tang W. T.","year":"2013","unstructured":"W. T. Tang , W. J. Tan , R. Krishnamoorthy , et\u00a0al. 2013. Optimizing and auto-tuning iterative stencil loops for GPUs with the in-plane method. IEEE 27th International Symposium on Parallel and Distributed Processing (2013).","journal-title":"IEEE 27th International Symposium on Parallel and Distributed Processing"},{"key":"e_1_3_3_353_2","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2503234"},{"key":"e_1_3_3_354_2","article-title":"A high-throughput solver for marginalized graph kernels on GPU","author":"Tang Y.","year":"2020","unstructured":"Y. Tang , O. Selvitopi , D. T. Popovici , et\u00a0al. 2020. A high-throughput solver for marginalized graph kernels on GPU. IEEE International Parallel and Distributed Processing Symposium (2020).","journal-title":"IEEE International Parallel and Distributed Processing Symposium"},{"key":"e_1_3_3_355_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3077551"},{"key":"e_1_3_3_356_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2009.12.005"},{"key":"e_1_3_3_357_2","article-title":"Dense linear algebra solvers for multicore with GPU accelerators","author":"Tomov S.","year":"2010","unstructured":"S. Tomov , R. Nath , H. Ltaief , et\u00a0al. 2010. Dense linear algebra solvers for multicore with GPU accelerators. 
IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (2010).","journal-title":"IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum"},{"key":"e_1_3_3_358_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-017-2225-1"},{"key":"e_1_3_3_359_2","article-title":"Memory-efficient parallelization of 3D lattice Boltzmann flow solver on a GPU","author":"Tran N.","year":"2015","unstructured":"N. Tran , M. Lee , and D. H. Choi . 2015. Memory-efficient parallelization of 3D lattice Boltzmann flow solver on a GPU. IEEE 22nd International Conference on High Performance Computing (2015).","journal-title":"IEEE 22nd International Conference on High Performance Computing"},{"key":"e_1_3_3_360_2","article-title":"Performance optimization of Aho-Corasick algorithm on a GPU","author":"Tran N.","year":"2013","unstructured":"N. Tran , M. Lee , S. Hong , et\u00a0al. 2013. Performance optimization of Aho-Corasick algorithm on a GPU. 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (2013).","journal-title":"12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications"},{"key":"e_1_3_3_361_2","first-page":"29","article-title":"Task management for irregular-parallel workloads on the GPU","author":"Tzeng S.","year":"2010","unstructured":"S. Tzeng , A. Patney , and J. D. Owens . 2010. Task management for irregular-parallel workloads on the GPU. High-Performance Graphics - ACM SIGGRAPH \/ Eurographics Symposium Proc., HPG (2010), 29\u201337.","journal-title":"High-Performance Graphics - ACM SIGGRAPH \/ Eurographics Symposium Proc., HPG"},{"key":"e_1_3_3_362_2","article-title":"An efficient GPU implementation of ant colony optimization for the traveling salesman problem","author":"Uchida A.","year":"2012","unstructured":"A. Uchida , Y. Ito , and K. Nakano . 2012. 
An efficient GPU implementation of ant colony optimization for the traveling salesman problem. Third International Conference on Networking and Computing (2012).","journal-title":"Third International Conference on Networking and Computing"},{"key":"e_1_3_3_363_2","article-title":"Optimized GPU implementation of JPEG 2000 for satellite image decompression","author":"Ufuk D. U.","year":"2018","unstructured":"D. U. Ufuk , A. Temizel , and A. M. Ozbayoglu . 2018. Optimized GPU implementation of JPEG 2000 for satellite image decompression. IEEE International Conference on Computational Science and Engineering (2018).","journal-title":"IEEE International Conference on Computational Science and Engineering"},{"key":"e_1_3_3_364_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2012.2190928"},{"key":"e_1_3_3_365_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-87403-4_7"},{"key":"e_1_3_3_366_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.1658"},{"key":"e_1_3_3_367_2","article-title":"Improving the performance of the sparse matrix vector product with GPUs","author":"V\u00e1zquez F.","year":"2010","unstructured":"F. V\u00e1zquez , G. Ortega , J. J. Fern\u00e1ndez , et\u00a0al. 2010. Improving the performance of the sparse matrix vector product with GPUs. In Proceedings of the 10th IEEE International Conference on Computer and Information Technology.","journal-title":"In Proceedings of the 10th IEEE International Conference on Computer and Information Technology."},{"key":"e_1_3_3_368_2","article-title":"Image-domain gridding on graphics processors","author":"Veenboer B.","year":"2017","unstructured":"B. Veenboer , M. Petschow , and J. W. Romein . 2017. Image-domain gridding on graphics processors. 
IEEE International Parallel and Distributed Processing Symposium (2017).","journal-title":"IEEE International Parallel and Distributed Processing Symposium"},{"key":"e_1_3_3_369_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-29400-7_36"},{"key":"e_1_3_3_370_2","article-title":"Algorithm flattening: Complete branch elimination for GPU requires a paradigm shift from CPU thinking","author":"Vespa L.","year":"2015","unstructured":"L. Vespa , A. Bauman , and J. Wells . 2015. Algorithm flattening: Complete branch elimination for GPU requires a paradigm shift from CPU thinking. IEEE High Performance Extreme Computing Conference (2015).","journal-title":"IEEE High Performance Extreme Computing Conference"},{"key":"e_1_3_3_371_2","article-title":"Optimized GPU histograms for multi-modal registration","author":"Vetter C.","year":"2011","unstructured":"C. Vetter and R. Westermann . 2011. Optimized GPU histograms for multi-modal registration. IEEE International Symposium on Biomedical Imaging: From Nano to Macro (2011).","journal-title":"IEEE International Symposium on Biomedical Imaging: From Nano to Macro"},{"key":"e_1_3_3_372_2","first-page":"826","article-title":"Optimizing 3D convolutions for wavelet transforms on CPUs with SSE Units and GPUs","author":"Videau B.","year":"2013","unstructured":"B. Videau , V. Marangozova-Martin , L. Genovese , et\u00a0al. 2013. Optimizing 3D convolutions for wavelet transforms on CPUs with SSE Units and GPUs. Euro-Par Parallel Processing (2013), 826\u2013837.","journal-title":"Euro-Par Parallel Processing"},{"key":"e_1_3_3_373_2","article-title":"Double precision stencil computations on Kepler GPUs","author":"Vizitiu A.","year":"2014","unstructured":"A. Vizitiu , L. Itu , L. Lazar , et\u00a0al. 2014. Double precision stencil computations on Kepler GPUs. 
18th International Conference on System Theory, Control and Computing (2014).","journal-title":"18th International Conference on System Theory, Control and Computing"},{"key":"e_1_3_3_374_2","first-page":"16","article-title":"Better performance at lower occupancy","volume":"10","author":"Volkov V.","year":"2010","unstructured":"V. Volkov . 2010. Better performance at lower occupancy. Proceedings of the GPU Technology Conference 10 (2010), 16.","journal-title":"Proceedings of the GPU Technology Conference"},{"key":"e_1_3_3_375_2","volume-title":"Understanding Latency Hiding on GPUs","author":"Volkov V.","year":"2016","unstructured":"V. Volkov . 2016. Understanding Latency Hiding on GPUs . Ph.D. Dissertation. UC Berkeley."},{"key":"e_1_3_3_376_2","article-title":"Benchmarking GPUs to tune dense linear algebra","author":"Volkov V.","year":"2008","unstructured":"V. Volkov and J. W. Demmel . 2008. Benchmarking GPUs to tune dense linear algebra. SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC (2008).","journal-title":"SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC"},{"key":"e_1_3_3_377_2","article-title":"Evaluation of splitting-up conjugate gradient method on GPUs","author":"Wakatani A.","year":"2016","unstructured":"A. Wakatani . 2016. Evaluation of splitting-up conjugate gradient method on GPUs. 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (2016).","journal-title":"24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing"},{"key":"e_1_3_3_378_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.3629"},{"key":"e_1_3_3_379_2","article-title":"Kernel fusion: An effective method for better power efficiency on multithreaded GPU","author":"Wang G.","year":"2010","unstructured":"G. Wang , Y. Lin , and W. Yi . 2010. 
Kernel fusion: An effective method for better power efficiency on multithreaded GPU. IEEE\/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing (2010).","journal-title":"IEEE\/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing"},{"key":"e_1_3_3_380_2","article-title":"Program optimization of array-intensive SPEC2k benchmarks on multithreaded GPU using CUDA and Brook \\(+\\)","author":"Wang G.","year":"2009","unstructured":"G. Wang , T. Tang , X. Fang , et\u00a0al. 2009. Program optimization of array-intensive SPEC2k benchmarks on multithreaded GPU using CUDA and Brook \\(+\\) . 15th International Conference on Parallel and Distributed Systems (2009).","journal-title":"15th International Conference on Parallel and Distributed Systems"},{"key":"e_1_3_3_381_2","doi-asserted-by":"publisher","DOI":"10.1587\/transinf.E96.D.2319"},{"key":"e_1_3_3_382_2","article-title":"Communication optimization on GPU: A case study of sequence alignment algorithms","author":"Wang J.","year":"2017","unstructured":"J. Wang , X. Xie , and J. Cong . 2017. Communication optimization on GPU: A case study of sequence alignment algorithms. IEEE International Parallel and Distributed Processing Symposium (2017).","journal-title":"IEEE International Parallel and Distributed Processing Symposium"},{"key":"e_1_3_3_383_2","doi-asserted-by":"publisher","DOI":"10.1145\/3337821.3337839"},{"key":"e_1_3_3_384_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-69244-5_4"},{"key":"e_1_3_3_385_2","doi-asserted-by":"publisher","DOI":"10.1145\/2688500.2688538"},{"key":"e_1_3_3_386_2","article-title":"Performance optimization for CPU-GPU heterogeneous parallel system","author":"Wang Y.","year":"2016","unstructured":"Y. Wang , J. Qiao , S. Lin , et\u00a0al. 2016. Performance optimization for CPU-GPU heterogeneous parallel system. 
12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (2016).","journal-title":"12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery"},{"key":"e_1_3_3_387_2","article-title":"Optimizing sparse matrix-vector multiplication on CUDA","author":"Wang Z.","year":"2010","unstructured":"Z. Wang , X. Xu , W. Zhao , et\u00a0al. 2010. Optimizing sparse matrix-vector multiplication on CUDA. 2nd International Conference on Education Technology and Computer (2010).","journal-title":"2nd International Conference on Education Technology and Computer"},{"key":"e_1_3_3_388_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-017-2041-7"},{"key":"e_1_3_3_389_2","article-title":"Optimization of linked list prefix computations on multithreaded GPUs using CUDA","author":"Wei Z.","year":"2010","unstructured":"Z. Wei and J. JaJa . 2010. Optimization of linked list prefix computations on multithreaded GPUs using CUDA. IEEE International Symposium on Parallel & Distributed Processing (2010).","journal-title":"IEEE International Symposium on Parallel & Distributed Processing"},{"key":"e_1_3_3_390_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2018.08.004"},{"key":"e_1_3_3_391_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2013.09.003"},{"key":"e_1_3_3_392_2","volume-title":"Proceedings of the 2nd Workshop on Applications for Multi and Many Core Processors","author":"Werkhoven B. van","year":"2011","unstructured":"B. van Werkhoven , J. Maassen , and F. J. Seinstra . 2011. Optimizing convolution operations in CUDA with adaptive tiling. In Proceedings of the 2nd Workshop on Applications for Multi and Many Core Processors ."},{"key":"e_1_3_3_393_2","doi-asserted-by":"crossref","unstructured":"B. van Werkhoven J. Maassen F. J. Seinstra et\u00a0al. 2014. Performance models for CPU-GPU data transfers. 
2014 14th IEEE\/ACM International Symposium on Cluster Cloud and Grid Computing 11\u201320.","DOI":"10.1109\/CCGrid.2014.16"},{"key":"e_1_3_3_394_2","first-page":"694","article-title":"GPUexplore 2.0: Unleashing GPU explicit-state model checking","author":"Wijs A.","year":"2016","unstructured":"A. Wijs , T. Neele , and D. Bo\u0161na\u010dki . 2016. GPUexplore 2.0: Unleashing GPU explicit-state model checking. FM: Formal Methods (2016), 694\u2013701.","journal-title":"FM: Formal Methods"},{"key":"e_1_3_3_395_2","doi-asserted-by":"publisher","DOI":"10.1145\/1498765.1498785"},{"key":"e_1_3_3_396_2","doi-asserted-by":"crossref","DOI":"10.1109\/ICPADS.2011.92","article-title":"Optimizing dynamic programming on graphics processing units via adaptive thread-level parallelism","author":"Wu C.","year":"2011","unstructured":"C. Wu , J. Ke , H. Lin , et\u00a0al. 2011. Optimizing dynamic programming on graphics processing units via adaptive thread-level parallelism. IEEE 17th International Conference on Parallel and Distributed Systems (2011).","journal-title":"IEEE 17th International Conference on Parallel and Distributed Systems"},{"key":"e_1_3_3_397_2","article-title":"Optimizing dynamic programming on graphics processing units via data reuse and data prefetch with inter-block barrier synchronization","author":"Wu C.","year":"2012","unstructured":"C. Wu , K. Wei , and T. Lin . 2012. Optimizing dynamic programming on graphics processing units via data reuse and data prefetch with inter-block barrier synchronization. IEEE 18th International Conference on Parallel and Distributed Systems (2012).","journal-title":"IEEE 18th International Conference on Parallel and Distributed Systems"},{"key":"e_1_3_3_398_2","article-title":"Optimizing data warehousing applications for GPUs using kernel fusion\/fission","author":"Wu H.","year":"2012","unstructured":"H. Wu , G. Diamos , J. Wang , et\u00a0al. 2012. Optimizing data warehousing applications for GPUs using kernel fusion\/fission. 
IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (2012).","journal-title":"IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum"},{"key":"e_1_3_3_399_2","article-title":"Optimized strategies for mapping three-dimensional FFTs onto CUDA GPUs","author":"Wu J.","year":"2012","unstructured":"J. Wu and J. JaJa. 2012. Optimized strategies for mapping three-dimensional FFTs onto CUDA GPUs. Innovative Parallel Computing (2012).","journal-title":"Innovative Parallel Computing"},{"key":"e_1_3_3_400_2","article-title":"Enabling prefix sum parallelism pattern for recurrences with principled function reconstruction","author":"Xia Y.","year":"2019","unstructured":"Y. Xia, P. Jiang, and G. Agrawal. 2019. Enabling prefix sum parallelism pattern for recurrences with principled function reconstruction. In Proceedings of the 28th International Conference on Compiler Construction.","journal-title":"Proceedings of the 28th International Conference on Compiler Construction"},{"key":"e_1_3_3_401_2","article-title":"Inter-block GPU communication via fast barrier synchronization","author":"Xiao S.","year":"2010","unstructured":"S. Xiao and W. Feng. 2010. Inter-block GPU communication via fast barrier synchronization. IEEE International Symposium on Parallel & Distributed Processing (2010).","journal-title":"IEEE International Symposium on Parallel & Distributed Processing"},{"key":"e_1_3_3_402_2","article-title":"Accelerating protein sequence search in a heterogeneous computing system","author":"Xiao S.","year":"2011","unstructured":"S. Xiao, H. Lin, and W. Feng. 2011. Accelerating protein sequence search in a heterogeneous computing system.
IEEE International Parallel & Distributed Processing Symposium (2011).","journal-title":"IEEE International Parallel & Distributed Processing Symposium"},{"key":"e_1_3_3_403_2","article-title":"Generalized GPU acceleration for applications employing finite-volume methods","author":"Xu J.","year":"2016","unstructured":"J. Xu, H. Fu, L. Gan, et\u00a0al. 2016. Generalized GPU acceleration for applications employing finite-volume methods. 16th IEEE\/ACM International Symposium on Cluster, Cloud and Grid Computing (2016).","journal-title":"16th IEEE\/ACM International Symposium on Cluster, Cloud and Grid Computing"},{"key":"e_1_3_3_404_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-011-0626-0"},{"key":"e_1_3_3_405_2","article-title":"Auto-tuning GEMV on many-core GPU","author":"Xu W.","year":"2012","unstructured":"W. Xu, Z. Liu, J. Wu, et\u00a0al. 2012. Auto-tuning GEMV on many-core GPU. IEEE 18th International Conference on Parallel and Distributed Systems (2012).","journal-title":"IEEE 18th International Conference on Parallel and Distributed Systems"},{"key":"e_1_3_3_406_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.4947"},{"key":"e_1_3_3_407_2","article-title":"Optimizing algorithm of sparse linear systems on GPU","author":"Yan D.","year":"2011","unstructured":"D. Yan, H. Cao, X. Dong, et\u00a0al. 2011. Optimizing algorithm of sparse linear systems on GPU. 6th Annual Chinagrid Conference (2011).","journal-title":"6th Annual Chinagrid Conference"},{"key":"e_1_3_3_408_2","article-title":"Demystifying tensor cores to optimize half-precision matrix multiply","author":"Yan D.","year":"2020","unstructured":"D. Yan, W. Wang, and X. Chu. 2020. Demystifying tensor cores to optimize half-precision matrix multiply.
IEEE International Parallel and Distributed Processing Symposium (2020).","journal-title":"IEEE International Parallel and Distributed Processing Symposium"},{"key":"e_1_3_3_409_2","doi-asserted-by":"publisher","DOI":"10.1145\/3332466.3374520"},{"key":"e_1_3_3_410_2","doi-asserted-by":"publisher","DOI":"10.1145\/2555243.2555255"},{"key":"e_1_3_3_411_2","article-title":"StreamScan: Fast scan algorithms for GPUs without global barrier synchronization","author":"Yan S.","year":"2013","unstructured":"S. Yan, G. Long, and Y. Zhang. 2013. StreamScan: Fast scan algorithms for GPUs without global barrier synchronization. 18th ACM SIGPLAN Symp. on Principles and Pract. of Parallel Program. (2013).","journal-title":"18th ACM SIGPLAN Symp. on Principles and Pract. of Parallel Program."},{"key":"e_1_3_3_412_2","first-page":"672","article-title":"Design principles for sparse matrix multiplication on the GPU","author":"Yang C.","year":"2018","unstructured":"C. Yang, A. Bulu\u00e7, and J. D. Owens. 2018. Design principles for sparse matrix multiplication on the GPU. Euro-Par: Parallel Processing (2018), 672\u2013687.","journal-title":"Euro-Par: Parallel Processing"},{"key":"e_1_3_3_413_2","article-title":"Real-time motion estimation for 1080p videos on graphics processing units with shared memory optimization","author":"Yang S.","year":"2009","unstructured":"S. Yang, T. Lin, and S. Chien. 2009. Real-time motion estimation for 1080p videos on graphics processing units with shared memory optimization.
IEEE Workshop on Signal Processing Systems (2009).","journal-title":"IEEE Workshop on Signal Processing Systems"},{"key":"e_1_3_3_414_2","doi-asserted-by":"publisher","DOI":"10.14778\/1938545.1938548"},{"key":"e_1_3_3_415_2","doi-asserted-by":"publisher","DOI":"10.1145\/1806596.1806606"},{"key":"e_1_3_3_416_2","doi-asserted-by":"publisher","DOI":"10.1145\/2207222.2207225"},{"key":"e_1_3_3_417_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-012-0228-3"},{"key":"e_1_3_3_418_2","doi-asserted-by":"publisher","DOI":"10.1145\/2555243.2555254"},{"key":"e_1_3_3_419_2","article-title":"A case study of SWIM: Optimization of memory intensive application on GPGPU","author":"Yi W.","year":"2010","unstructured":"W. Yi, Y. Tang, G. Wang, et\u00a0al. 2010. A case study of SWIM: Optimization of memory intensive application on GPGPU. 3rd International Symposium on Parallel Architectures, Algorithms and Programming (2010).","journal-title":"3rd International Symposium on Parallel Architectures, Algorithms and Programming"},{"key":"e_1_3_3_420_2","article-title":"Histogram optimization with CUDA","author":"Yong K. K.","year":"2016","unstructured":"K. K. Yong and S. S. O. Talib. 2016. Histogram optimization with CUDA. IEEE Industrial Electronics and Applications Conference (2016).","journal-title":"IEEE Industrial Electronics and Applications Conference"},{"key":"e_1_3_3_421_2","article-title":"Automatic tuning of sparse matrix-vector multiplication for CRS format on GPUs","author":"Yoshizawa H.","year":"2012","unstructured":"H. Yoshizawa and D. Takahashi. 2012. Automatic tuning of sparse matrix-vector multiplication for CRS format on GPUs.
IEEE 15th International Conference on Computational Science and Engineering (2012).","journal-title":"IEEE 15th International Conference on Computational Science and Engineering"},{"key":"e_1_3_3_422_2","doi-asserted-by":"publisher","DOI":"10.1177\/1094342014524807"},{"key":"e_1_3_3_423_2","doi-asserted-by":"publisher","DOI":"10.1145\/3038228.3038234"},{"key":"e_1_3_3_424_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2010.2077771"},{"key":"e_1_3_3_425_2","doi-asserted-by":"publisher","DOI":"10.1587\/transinf.2016EDP7090"},{"key":"e_1_3_3_426_2","article-title":"Batched small tensor-matrix multiplications on GPUs","author":"Zhai K.","year":"2020","unstructured":"K. Zhai, T. Banerjee, A. Wijayasiri, et\u00a0al. 2020. Batched small tensor-matrix multiplications on GPUs. IEEE 27th International Conference on High Performance Computing, Data, and Analytics (2020).","journal-title":"IEEE 27th International Conference on High Performance Computing, Data, and Analytics"},{"key":"e_1_3_3_427_2","doi-asserted-by":"publisher","DOI":"10.1145\/1810085.1810104"},{"key":"e_1_3_3_428_2","article-title":"On-the-fly elimination of dynamic irregularities for GPU computing","author":"Zhang E. Z.","year":"2011","unstructured":"E. Z. Zhang, Y. Jiang, Z. Guo, et\u00a0al. 2011. On-the-fly elimination of dynamic irregularities for GPU computing. Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems.","journal-title":"Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems"},{"key":"e_1_3_3_429_2","article-title":"RegTT: Accelerating tree traversals on GPUs by exploiting regularities","author":"Zhang F.","year":"2016","unstructured":"F. Zhang, P. Di, H. Zhou, et\u00a0al. 2016. RegTT: Accelerating tree traversals on GPUs by exploiting regularities.
45th International Conference on Parallel Processing (2016).","journal-title":"45th International Conference on Parallel Processing"},{"key":"e_1_3_3_430_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3066635"},{"key":"e_1_3_3_431_2","article-title":"The optimization of parallel Smith-Waterman sequence alignment using on-chip memory of GPGPU","author":"Zhang Q.","year":"2010","unstructured":"Q. Zhang, H. An, G. Liu, et\u00a0al. 2010. The optimization of parallel Smith-Waterman sequence alignment using on-chip memory of GPGPU. IEEE 5th International Conference on Bio-Inspired Computing: Theories and Applications (2010).","journal-title":"IEEE 5th International Conference on Bio-Inspired Computing: Theories and Applications"},{"key":"e_1_3_3_432_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-24151-2_11"},{"key":"e_1_3_3_433_2","doi-asserted-by":"publisher","DOI":"10.1145\/2994148"},{"key":"e_1_3_3_434_2","article-title":"GPU accelerated high-quality video\/image super-resolution","author":"Zhao Z.","year":"2016","unstructured":"Z. Zhao, L. Song, R. Xie, et\u00a0al. 2016. GPU accelerated high-quality video\/image super-resolution. IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (2016).","journal-title":"IEEE International Symposium on Broadband Multimedia Systems and Broadcasting"},{"key":"e_1_3_3_435_2","doi-asserted-by":"publisher","DOI":"10.1109\/tpds.2013.111"},{"key":"e_1_3_3_436_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-016-1738-3"},{"key":"e_1_3_3_437_2","article-title":"WolfGraph: The edge-centric graph processing on GPU","author":"Zhu H.","year":"2019","unstructured":"H. Zhu, L. He, M. Leeke, et\u00a0al. 2019. WolfGraph: The edge-centric graph processing on GPU.
Future Generation Computer Systems (2019).","journal-title":"Future Generation Computer Systems"},{"key":"e_1_3_3_438_2","doi-asserted-by":"publisher","DOI":"10.23919\/DATE.2017.7927180"},{"key":"e_1_3_3_439_2","article-title":"Accelerating support count for association rule mining on GPUs","author":"Zois V.","year":"2016","unstructured":"V. Zois, A. Panangadan, and V. Prasanna. 2016. Accelerating support count for association rule mining on GPUs. IEEE International Parallel and Distributed Processing Symposium Workshops (2016).","journal-title":"IEEE International Parallel and Distributed Processing Symposium Workshops"},{"key":"e_1_3_3_440_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.1913"}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3570638","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,30]],"date-time":"2023-11-30T19:58:43Z","timestamp":1701374323000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3570638"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3,16]]},"references-count":439,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2023,11,30]]}},"alternative-id":["10.1145\/3570638"],"URL":"http:\/\/dx.doi.org\/10.1145\/3570638","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,3,16]]},"assertion":[{"value":"2021-07-30","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-10-17","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication
History"}},{"value":"2023-03-16","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}