{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,10]],"date-time":"2026-01-10T07:41:18Z","timestamp":1768030878506,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":28,"publisher":"ACM","license":[{"start":{"date-parts":[[2019,6,26]],"date-time":"2019-06-26T00:00:00Z","timestamp":1561507200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2019,6,26]]},"DOI":"10.1145\/3330345.3330355","type":"proceedings-article","created":{"date-parts":[[2019,6,18]],"date-time":"2019-06-18T12:14:30Z","timestamp":1560860070000},"page":"106-116","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":33,"title":["TSM2"],"prefix":"10.1145","author":[{"given":"Jieyang","family":"Chen","sequence":"first","affiliation":[{"name":"University of California"}]},{"given":"Nan","family":"Xiong","sequence":"additional","affiliation":[{"name":"University of California"}]},{"given":"Xin","family":"Liang","sequence":"additional","affiliation":[{"name":"University of California"}]},{"given":"Dingwen","family":"Tao","sequence":"additional","affiliation":[{"name":"The University of Alabama"}]},{"given":"Sihuan","family":"Li","sequence":"additional","affiliation":[{"name":"University of California"}]},{"given":"Kaiming","family":"Ouyang","sequence":"additional","affiliation":[{"name":"University of California"}]},{"given":"Kai","family":"Zhao","sequence":"additional","affiliation":[{"name":"University of California"}]},{"given":"Nathan","family":"DeBardeleben","sequence":"additional","affiliation":[{"name":"Los Alamos National Laboratory"}]},{"given":"Qiang","family":"Guan","sequence":"additional","affiliation":[{"name":"Kent State University"}]},{"given":"Zizhong","family":"Chen","sequence":"additional","affiliation":[{"name":"University of California"}]}],"member":"320","published-online":{"date-parts":[[2019,6,26]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"K-means by NVIDIA. https:\/\/github.com\/NVIDIA\/kmeans  K-means by NVIDIA. https:\/\/github.com\/NVIDIA\/kmeans"},{"key":"e_1_3_2_1_2_1","unstructured":"MAGMA:. icl.cs.utk.edu\/magma  MAGMA:. icl.cs.utk.edu\/magma"},{"key":"e_1_3_2_1_3_1","unstructured":"2018. cuBLAS Benchmark. (2018). http:\/\/developer.download.nvidia.com\/compute\/cuda\/compute-docs\/cuda-performance-report.pdf  2018. cuBLAS Benchmark. (2018). http:\/\/developer.download.nvidia.com\/compute\/cuda\/compute-docs\/cuda-performance-report.pdf"},{"key":"e_1_3_2_1_4_1","unstructured":"2018. cuda programming guide. (2018). http:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html#multiprocessor-level  2018. cuda programming guide. (2018). http:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html#multiprocessor-level"},{"key":"e_1_3_2_1_5_1","unstructured":"2018. cuDNN. (2018). https:\/\/developer.nvidia.com\/cudnn  2018. cuDNN. (2018). https:\/\/developer.nvidia.com\/cudnn"},{"key":"e_1_3_2_1_6_1","volume-title":"www.culatools.com","author":"CULA.","year":"2018","unstructured":"2018. CULA. ( 2018 ). www.culatools.com 2018. CULA. (2018). www.culatools.com"},{"key":"e_1_3_2_1_7_1","unstructured":"2018. PTX programming guide. (2018). http:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/index.html#data-movement-and-conversion-instructions-ld  2018. PTX programming guide. (2018). http:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/index.html#data-movement-and-conversion-instructions-ld"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2818311"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00071"},{"key":"e_1_3_2_1_10_1","volume-title":"Architecture and Storage (NAS), 2016 IEEE International Conference on.","author":"Chen Jieyang","unstructured":"Jieyang Chen , Sihuan Li , and Zizhong Chen . GPU-ABFT : Optimizing Algorithm-Based Fault Tolerance for Heterogeneous Systems with GPUs. In Networking , Architecture and Storage (NAS), 2016 IEEE International Conference on. Jieyang Chen, Sihuan Li, and Zizhong Chen. GPU-ABFT: Optimizing Algorithm-Based Fault Tolerance for Heterogeneous Systems with GPUs. In Networking, Architecture and Storage (NAS), 2016 IEEE International Conference on."},{"key":"e_1_3_2_1_11_1","volume-title":"Online Algorithm-Based Fault Tolerance for Cholesky Decomposition on Heterogeneous Systems with GPUs. In Parallel and Distributed Processing Symposium","author":"Chen Jieyang","year":"2016","unstructured":"Jieyang Chen , Xin Liang , and Zizhong Chen . 2016 . Online Algorithm-Based Fault Tolerance for Cholesky Decomposition on Heterogeneous Systems with GPUs. In Parallel and Distributed Processing Symposium , 2016 IEEE International. Jieyang Chen, Xin Liang, and Zizhong Chen. 2016. Online Algorithm-Based Fault Tolerance for Cholesky Decomposition on Heterogeneous Systems with GPUs. In Parallel and Distributed Processing Symposium, 2016 IEEE International."},{"key":"e_1_3_2_1_12_1","volume-title":"SC16: International Conference for. IEEE, 667--677","author":"Chen Jieyang","year":"2016","unstructured":"Jieyang Chen , Li Tan , Panruo Wu , Dingwen Tao , Hongbo Li , Xin Liang , Sihuan Li , Rong Ge , Laxmi Bhuyan , and Zizhong Chen . 2016 . GreenLA: green linear algebra software for GPU-accelerated heterogeneous computing. In High Performance Computing, Networking, Storage and Analysis , SC16: International Conference for. IEEE, 667--677 . Jieyang Chen, Li Tan, Panruo Wu, Dingwen Tao, Hongbo Li, Xin Liang, Sihuan Li, Rong Ge, Laxmi Bhuyan, and Zizhong Chen. 2016. GreenLA: green linear algebra software for GPU-accelerated heterogeneous computing. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for. IEEE, 667--677."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1014052.1014118"},{"key":"e_1_3_2_1_15_1","volume-title":"Accelerating Numerical Dense Linear Algebra Calculations with GPUs. Numerical Computations with GPUs","author":"Dongarra Jack","year":"2014","unstructured":"Jack Dongarra , Mark Gates , Azzam Haidar , Jakub Kurzak , Piotr Luszczek , Stanimire Tomov , and Ichitaro Yamazaki . 2014. Accelerating Numerical Dense Linear Algebra Calculations with GPUs. Numerical Computations with GPUs ( 2014 ). Jack Dongarra, Mark Gates, Azzam Haidar, Jakub Kurzak, Piotr Luszczek, Stanimire Tomov, and Ichitaro Yamazaki. 2014. Accelerating Numerical Dense Linear Algebra Calculations with GPUs. Numerical Computations with GPUs (2014)."},{"key":"e_1_3_2_1_16_1","volume-title":"SC16: International Conference for.","author":"Heinecke Alexander","year":"2016","unstructured":"Alexander Heinecke , Greg Henry , Maxwell Hutchinson , and Hans Pabst . 2016 . LIBXSMM: accelerating small matrix multiplications by runtime code generation. In High Performance Computing, Networking, Storage and Analysis , SC16: International Conference for. Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. 2016. LIBXSMM: accelerating small matrix multiplications by runtime code generation. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1984.1676475"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126915"},{"key":"e_1_3_2_1_19_1","unstructured":"CUDA NVIDIA. 2017. Basic Linear Algebra Subroutines (cuBLAS) library. (2017).  CUDA NVIDIA. 2017. Basic Linear Algebra Subroutines (cuBLAS) library. (2017)."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2014.09.001"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2015.108"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2907294.2907306"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2009.12.005"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2010.5470941"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2925426.2926256"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2010.5452013"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018750"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2133173.2133185"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2907294.2907315"}],"event":{"name":"ICS '19: 2019 International Conference on Supercomputing","location":"Phoenix Arizona","acronym":"ICS '19","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture"]},"container-title":["Proceedings of the ACM International Conference on Supercomputing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3330345.3330355","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3330345.3330355","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:54:05Z","timestamp":1750204445000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3330345.3330355"}},"subtitle":["optimizing tall-and-skinny matrix-matrix multiplication on GPUs"],"short-title":[],"issued":{"date-parts":[[2019,6,26]]},"references-count":28,"alternative-id":["10.1145\/3330345.3330355","10.1145\/3330345"],"URL":"https:\/\/doi.org\/10.1145\/3330345.3330355","relation":{},"subject":[],"published":{"date-parts":[[2019,6,26]]},"assertion":[{"value":"2019-06-26","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}