{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,17]],"date-time":"2026-04-17T16:16:00Z","timestamp":1776442560222,"version":"3.51.2"},"publisher-location":"New York, NY, USA","reference-count":38,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,2,17]],"date-time":"2021-02-17T00:00:00Z","timestamp":1613520000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,2,17]]},"DOI":"10.1145\/3437801.3441599","type":"proceedings-article","created":{"date-parts":[[2021,2,20]],"date-time":"2021-02-20T23:04:20Z","timestamp":1613862260000},"page":"278-291","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":32,"title":["EGEMM-TC"],"prefix":"10.1145","author":[{"given":"Boyuan","family":"Feng","sequence":"first","affiliation":[{"name":"University of California"}]},{"given":"Yuke","family":"Wang","sequence":"additional","affiliation":[{"name":"University of California"}]},{"given":"Guoyang","family":"Chen","sequence":"additional","affiliation":[{"name":"Alibaba Group US Inc."}]},{"given":"Weifeng","family":"Zhang","sequence":"additional","affiliation":[{"name":"Alibaba Group US Inc."}]},{"given":"Yuan","family":"Xie","sequence":"additional","affiliation":[{"name":"University of California"}]},{"given":"Yufei","family":"Ding","sequence":"additional","affiliation":[{"name":"University of California"}]}],"member":"320","published-online":{"date-parts":[[2021,2,17]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Martin S. Andersen Joachim Dahl and Lieven Vandenberghe. 2020. Convex Optimization Solver. https:\/\/cvxopt.org\/.  Martin S. Andersen Joachim Dahl and Lieven Vandenberghe. 2020. Convex Optimization Solver. https:\/\/cvxopt.org\/."},{"key":"e_1_3_2_1_2_1","unstructured":"angelhof. 2017. Hipeac GPUs K-means. https:\/\/github.com\/angelhof\/gpus-kmeans.git.  angelhof. 2017. Hipeac GPUs K-means. https:\/\/github.com\/angelhof\/gpus-kmeans.git."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2005.52"},{"key":"e_1_3_2_1_4_1","volume-title":"Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch\u00e9-Buc","author":"Banner Ron","unstructured":"Ron Banner , Yury Nahshan , and Daniel Soudry . 2019. Post training 4-bit quantization of convolutional networks for rapid-deployment . In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch\u00e9-Buc , E. Fox, and R. Garnett (Eds.), Vol. 32 . Curran Associates, Inc. , Vancouver , 7950--7958. Ron Banner, Yury Nahshan, and Daniel Soudry. 2019. Post training 4-bit quantization of convolutional networks for rapid-deployment. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch\u00e9-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc., Vancouver, 7950--7958."},{"key":"e_1_3_2_1_5_1","volume-title":"Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI'18)","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen , Thierry Moreau , Ziheng Jiang , Lianmin Zheng , Eddie Yan , Meghan Cowan , Haichen Shen , Leyuan Wang , Yuwei Hu , Luis Ceze , Carlos Guestrin , and Arvind Krishnamurthy . 2018 . TVM: An Automated End-to-End Optimizing Compiler for Deep Learning . In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI'18) . USENIX Association, USA, 579--594. Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI'18). USENIX Association, USA, 579--594."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3330345.3331057"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"crossref","unstructured":"T.J. Dekker. 1971\/72. A Floating-Point Technique for Extending the Available Precision. Numer. Math. 18 (1971\/72) 224--242. http:\/\/eudml.org\/doc\/132105  T.J. Dekker. 1971\/72. A Floating-Point Technique for Extending the Available Precision. Numer. Math. 18 (1971\/72) 224--242. http:\/\/eudml.org\/doc\/132105","DOI":"10.1007\/BF01397083"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.nima.2005.11.140"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2010.5654017"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3314221.3314597"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00286"},{"key":"e_1_3_2_1_12_1","unstructured":"Zhe Jia Marco Maggioni Jeffrey Smith and Daniele Paolo Scarpazza. 2019. Dissecting the NVidia Turing T4 GPU via Microbenchmarking. arXiv:1903.07486 http:\/\/arxiv.org\/abs\/1903.07486  Zhe Jia Marco Maggioni Jeffrey Smith and Daniele Paolo Scarpazza. 2019. Dissecting the NVidia Turing T4 GPU via Microbenchmarking. arXiv:1903.07486 http:\/\/arxiv.org\/abs\/1903.07486"},{"key":"e_1_3_2_1_13_1","unstructured":"Zhe Jia Marco Maggioni Benjamin Staiger and Daniele Scarpazza. 2018. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking.  Zhe Jia Marco Maggioni Benjamin Staiger and Daniele Scarpazza. 2018. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking."},{"key":"e_1_3_2_1_14_1","volume-title":"The Art of Computer Programming, Volume 2: Seminumerical Algorithms","author":"Knuth Donald E.","unstructured":"Donald E. Knuth . 1997. The Art of Computer Programming, Volume 2: Seminumerical Algorithms ( third ed.). Addison-Wesley , Boston . Donald E. Knuth. 1997. The Art of Computer Programming, Volume 2: Seminumerical Algorithms (third ed.). Addison-Wesley, Boston."},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2013.6494986"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2014.6844477"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3293883.3295734"},{"key":"e_1_3_2_1_18_1","volume-title":"Sci. China Ser. GPhys. Mech. Astron.","author":"LiLi Li Yanxia Zhang","unstructured":"Yanxia Zhang LiLi Li and YongHeng Zhao . 2008. k-Nearest Neighbors for automated classification of celestial objects . In Sci. China Ser. GPhys. Mech. Astron. , Vol. 51 . Springer , China , 916--922. Yanxia Zhang LiLi Li and YongHeng Zhao. 2008. k-Nearest Neighbors for automated classification of celestial objects. In Sci. China Ser. GPhys. Mech. Astron., Vol. 51. Springer, China, 916--922."},{"key":"e_1_3_2_1_19_1","volume-title":"2017 12th International Conference on Computer Science and Education (ICCSE). IEEE","author":"Lin K.","unstructured":"K. Lin , L. Jing , M. Wang , M. Qiu , and Z. Ji . 2017. A novel long-term air quality forecasting algorithm based on kNN and NARX . In 2017 12th International Conference on Computer Science and Education (ICCSE). IEEE , Beijing, China, 343--348. K. Lin, L. Jing, M. Wang, M. Qiu, and Z. Ji. 2017. A novel long-term air quality forecasting algorithm based on kNN and NARX. In 2017 12th International Conference on Computer Science and Education (ICCSE). IEEE, Beijing, China, 343--348."},{"key":"e_1_3_2_1_20_1","volume-title":"Performance Precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE Computer Society","author":"Markidis S.","unstructured":"S. Markidis , S. W. D. Chien , E. Laure , I. B. Peng , and J. S. Vetter . 2018. NVIDIA Tensor Core Programmability , Performance Precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE Computer Society , Vancouver, British Columbia, CANADA, 522--531. S. Markidis, S. W. D. Chien, E. Laure, I. B. Peng, and J. S. Vetter. 2018. NVIDIA Tensor Core Programmability, Performance Precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE Computer Society, Vancouver, British Columbia, CANADA, 522--531."},{"key":"e_1_3_2_1_21_1","unstructured":"NVIDIA. 2017. Programming Tensor Cores in CUDA 9. https:\/\/devblogs.nvidia.com\/programming-tensor-cores-cuda-9\/.  NVIDIA. 2017. Programming Tensor Cores in CUDA 9. https:\/\/devblogs.nvidia.com\/programming-tensor-cores-cuda-9\/."},{"key":"e_1_3_2_1_22_1","unstructured":"NVIDIA. 2017. Tensor Core Performance. https:\/\/www.nvidia.com\/en-us\/data-center\/volta-gpu-architecture\/.  NVIDIA. 2017. Tensor Core Performance. https:\/\/www.nvidia.com\/en-us\/data-center\/volta-gpu-architecture\/."},{"key":"e_1_3_2_1_23_1","unstructured":"NVIDIA. 2018. Nvidia RTX 6000. https:\/\/www.nvidia.com\/en-us\/design-visualization\/quadro\/rtx-6000\/.  NVIDIA. 2018. Nvidia RTX 6000. https:\/\/www.nvidia.com\/en-us\/design-visualization\/quadro\/rtx-6000\/."},{"key":"e_1_3_2_1_24_1","unstructured":"NVIDIA. 2018. Nvidia T4. https:\/\/www.nvidia.com\/en-us\/data-center\/tesla-t4\/.  NVIDIA. 2018. Nvidia T4. https:\/\/www.nvidia.com\/en-us\/data-center\/tesla-t4\/."},{"key":"e_1_3_2_1_25_1","unstructured":"NVIDIA. 2020. cuBLAS: CUDA Toolkit Documentation. https:\/\/docs.nvidia.com\/cuda\/cublas\/index.html.  NVIDIA. 2020. cuBLAS: CUDA Toolkit Documentation. https:\/\/docs.nvidia.com\/cuda\/cublas\/index.html."},{"key":"e_1_3_2_1_26_1","unstructured":"NVIDIA. 2020. CUDA Binary Utilities. https:\/\/docs.nvidia.com\/cuda\/cuda-binary-utilities\/index.html#instruction-set-ref.  NVIDIA. 2020. CUDA Binary Utilities. https:\/\/docs.nvidia.com\/cuda\/cuda-binary-utilities\/index.html#instruction-set-ref."},{"key":"e_1_3_2_1_27_1","unstructured":"NVIDIA. 2020. CUDA C++ Programming Guide. https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html.  NVIDIA. 2020. CUDA C++ Programming Guide. https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html."},{"key":"e_1_3_2_1_28_1","unstructured":"NVIDIA. 2020. CUDA Event. https:\/\/devblogs.nvidia.com\/how-implement-performance-metrics-cuda-cc\/.  NVIDIA. 2020. CUDA Event. https:\/\/devblogs.nvidia.com\/how-implement-performance-metrics-cuda-cc\/."},{"key":"e_1_3_2_1_29_1","unstructured":"NVIDIA. 2020. PTX and SASS Assembly Debugging. https:\/\/docs.nvidia.com\/gameworks\/content\/developertools\/desktop\/ptx_sass_assembly_debugging.htm.  NVIDIA. 2020. PTX and SASS Assembly Debugging. https:\/\/docs.nvidia.com\/gameworks\/content\/developertools\/desktop\/ptx_sass_assembly_debugging.htm."},{"key":"e_1_3_2_1_30_1","unstructured":"Institute of Electrical and Electronics Engineers. 1985. IEEE Standard for Binary Floating Point Arithmetic.  Institute of Electrical and Electronics Engineers. 1985. IEEE Standard for Binary Floating Point Arithmetic."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1038\/tpj.2010.56"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1007\/11690634_6"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/155090.155114"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/1375581.1375609"},{"key":"e_1_3_2_1_36_1","first-page":"3","article-title":"Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates","volume":"18","author":"Shewchuk Jonathan Richard","year":"1997","unstructured":"Jonathan Richard Shewchuk . 1997 . Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates . Discrete & Computational Geometry 18 , 3 (Oct. 1997), 305--363. Jonathan Richard Shewchuk. 1997. Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates. Discrete & Computational Geometry 18, 3 (Oct. 1997), 305--363.","journal-title":"Discrete & Computational Geometry"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS47924.2020.00071"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018755"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3155284.3018755"}],"event":{"name":"PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","location":"Virtual Event Republic of Korea","acronym":"PPoPP '21","sponsor":["SIGPLAN ACM Special Interest Group on Programming Languages","SIGHPC ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing"]},"container-title":["Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3437801.3441599","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3437801.3441599","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:17:25Z","timestamp":1750191445000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3437801.3441599"}},"subtitle":["accelerating scientific computing on tensor cores with extended precision"],"short-title":[],"issued":{"date-parts":[[2021,2,17]]},"references-count":38,"alternative-id":["10.1145\/3437801.3441599","10.1145\/3437801"],"URL":"https:\/\/doi.org\/10.1145\/3437801.3441599","relation":{},"subject":[],"published":{"date-parts":[[2021,2,17]]},"assertion":[{"value":"2021-02-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}