{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,24]],"date-time":"2025-08-24T01:25:54Z","timestamp":1755998754364,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":79,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,6,11]],"date-time":"2022-06-11T00:00:00Z","timestamp":1654905600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Science Foundation (NSF)","award":["CNS-2007124"],"award-info":[{"award-number":["CNS-2007124"]}]},{"name":"National Science Foundation (NSF)","award":["CNS-1940048"],"award-info":[{"award-number":["CNS-1940048"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,6,18]]},"DOI":"10.1145\/3470496.3527411","type":"proceedings-article","created":{"date-parts":[[2022,5,31]],"date-time":"2022-05-31T19:06:01Z","timestamp":1654023961000},"page":"552-566","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["SIMD\n            <sup>2<\/sup>"],"prefix":"10.1145","author":[{"given":"Yunan","family":"Zhang","sequence":"first","affiliation":[{"name":"University of California"}]},{"given":"Po-An","family":"Tsai","sequence":"additional","affiliation":[{"name":"NVIDIA Research"}]},{"given":"Hung-Wei","family":"Tseng","sequence":"additional","affiliation":[{"name":"University of California"}]}],"member":"320","published-online":{"date-parts":[[2022,6,11]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"http:\/\/www.netlib.org\/blas\/","author":"Basic Linear Algebra BLAS","year":"2004","unstructured":"BLAS ( Basic Linear Algebra Subprograms). http:\/\/www.netlib.org\/blas\/ , 2004 . BLAS (Basic Linear Algebra Subprograms). http:\/\/www.netlib.org\/blas\/, 2004."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00023"},{"key":"e_1_3_2_1_3_1","unstructured":"Arm Corporation. Introducing the Scalable Matrix Extension for the Armv9-A Architecture. https:\/\/community.arm.com\/developer\/ip-products\/processors\/b\/processors-ip-blog\/posts\/scalable-matrix-extension-armv9-a-architecture 2021.  Arm Corporation. Introducing the Scalable Matrix Extension for the Armv9-A Architecture. https:\/\/community.arm.com\/developer\/ip-products\/processors\/b\/processors-ip-blog\/posts\/scalable-matrix-extension-armv9-a-architecture 2021."},{"key":"e_1_3_2_1_4_1","volume-title":"High Performance and Low Latency Deep Learning Inference Accelerator. In 2021 IEEE Hot Chips 33 Symposium (HCS)","author":"Chatha Karam","year":"2021","unstructured":"Karam Chatha . Qualcomm\u00ae Cloud Al 100: 12TOPS\/W Scalable , High Performance and Low Latency Deep Learning Inference Accelerator. In 2021 IEEE Hot Chips 33 Symposium (HCS) , 2021 . Karam Chatha. Qualcomm\u00ae Cloud Al 100: 12TOPS\/W Scalable, High Performance and Low Latency Deep Learning Inference Accelerator. In 2021 IEEE Hot Chips 33 Symposium (HCS), 2021."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2018.022071134"},{"key":"e_1_3_2_1_6_1","volume-title":"Proceedings of the 1999 ACM\/IEEE conference on Supercomputing","author":"Corbal Jesus","year":"1999","unstructured":"Jesus Corbal , Roger Espasa , and Mateo Valero . MOM : a matrix SIMD instruction set architecture for multimedia applications . In Proceedings of the 1999 ACM\/IEEE conference on Supercomputing , 1999 . Jesus Corbal, Roger Espasa, and Mateo Valero. MOM: a matrix SIMD instruction set architecture for multimedia applications. In Proceedings of the 1999 ACM\/IEEE conference on Supercomputing, 1999."},{"key":"e_1_3_2_1_7_1","volume-title":"Introduction to Algorithms","author":"Cormen Thomas H.","year":"2009","unstructured":"Thomas H. Cormen , Charles E. Leiserson , Ronald L. Rivest , and Clifford Stein . Introduction to Algorithms , Third Edition. The MIT Press , 3 rd edition, 2009 . Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009."},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3330345.3331057"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT52795.2021.00032"},{"key":"e_1_3_2_1_10_1","volume-title":"Algorithms","author":"Erickson Jeff","year":"2019","unstructured":"Jeff Erickson . Algorithms . 2019 . Jeff Erickson. Algorithms. 2019."},{"key":"e_1_3_2_1_11_1","first-page":"278","volume-title":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '21","author":"Feng Boyuan","year":"2021","unstructured":"Boyuan Feng , Yuke Wang , Guoyang Chen , Weifeng Zhang , Yuan Xie , and Yufei Ding . Egemm-tc : Accelerating scientific computing on tensor cores with extended precision . In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '21 , pages 278 -- 291 , 2021 . Boyuan Feng, Yuke Wang, Guoyang Chen, Weifeng Zhang, Yuan Xie, and Yufei Ding. Egemm-tc: Accelerating scientific computing on tensor cores with extended precision. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '21, pages 278--291, 2021."},{"key":"e_1_3_2_1_12_1","volume-title":"Algorithm 97: Shortest path. Commun. ACM, page 345, jun","author":"Floyd Robert W.","year":"1962","unstructured":"Robert W. Floyd . Algorithm 97: Shortest path. Commun. ACM, page 345, jun 1962 . Robert W. Floyd. Algorithm 97: Shortest path. Commun. ACM, page 345, jun 1962."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO50266.2020.00079"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358291"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00050"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783759"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1016\/B978-0-12-385963-1.00007-1"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358275"},{"key":"e_1_3_2_1_19_1","unstructured":"Jared Hoberock and Nathan Bell. Thrust: A parallel template library. http:\/\/thrust.github.io\/ 2010.  Jared Hoberock and Nathan Bell. Thrust: A parallel template library. http:\/\/thrust.github.io\/ 2010."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3329785.3329932"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476177"},{"key":"e_1_3_2_1_22_1","volume-title":"Hung-Wei Tseng. TCUDB: Accelerating Database with Tensor Processors. In the 2022 ACM SIGMOD\/PODS International Conference on Management of Data, SIGMOD 2022","author":"Hu Yu-Ching","year":"2022","unstructured":"Yu-Ching Hu , Yuliang Li , and Hung-Wei Tseng. TCUDB: Accelerating Database with Tensor Processors. In the 2022 ACM SIGMOD\/PODS International Conference on Management of Data, SIGMOD 2022 , 2022 . Yu-Ching Hu, Yuliang Li, and Hung-Wei Tseng. TCUDB: Accelerating Database with Tensor Processors. In the 2022 ACM SIGMOD\/PODS International Conference on Management of Data, SIGMOD 2022, 2022."},{"key":"e_1_3_2_1_23_1","unstructured":"Intel Corporation. Intrinsics for Intel(R) Advanced Matrix Extensions (Intel(R) AMX) Instructions. https:\/\/software.intel.com\/content\/www\/us\/en\/develop\/documentation\/cpp-compiler-developer-guide-and-reference\/top\/compiler-reference\/intrinsics\/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions.html 2021.  Intel Corporation. Intrinsics for Intel(R) Advanced Matrix Extensions (Intel(R) AMX) Instructions. https:\/\/software.intel.com\/content\/www\/us\/en\/develop\/documentation\/cpp-compiler-developer-guide-and-reference\/top\/compiler-reference\/intrinsics\/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions.html 2021."},{"key":"e_1_3_2_1_24_1","volume-title":"cuASR: CUDA Algebra for Semirings. https:\/\/github.com\/hpcgarage\/cuASR","author":"Hammond Jeff","year":"2021","unstructured":"Jeff Hammond . cuASR: CUDA Algebra for Semirings. https:\/\/github.com\/hpcgarage\/cuASR , 2021 . Jeff Hammond. cuASR: CUDA Algebra for Semirings. https:\/\/github.com\/hpcgarage\/cuASR, 2021."},{"key":"e_1_3_2_1_25_1","volume-title":"https:\/\/github.com\/jiachengpan\/cudaMST","author":"Pan Jiacheng","year":"2016","unstructured":"Jiacheng Pan . CUDA MST. https:\/\/github.com\/jiachengpan\/cudaMST , 2016 . Jiacheng Pan. CUDA MST. https:\/\/github.com\/jiachengpan\/cudaMST, 2016."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3360307"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_3_2_1_28_1","first-page":"47","volume-title":"Proceedings of the 23rd ACM SIGGRAPH\/EUROGRAPHICS Symposium on Graphics Hardware","author":"Gary","year":"2008","unstructured":"Gary J. Katz and Joseph T. Kider. All-pairs shortest-paths for large graphs on the gpu . In Proceedings of the 23rd ACM SIGGRAPH\/EUROGRAPHICS Symposium on Graphics Hardware , pages 47 -- 55 , 2008 . Gary J. Katz and Joseph T. Kider. All-pairs shortest-paths for large graphs on the gpu. In Proceedings of the 23rd ACM SIGGRAPH\/EUROGRAPHICS Symposium on Graphics Hardware, pages 47--55, 2008."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC.2016.7761646"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00047"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1090\/S0002-9939-1956-0078686-7"},{"key":"e_1_3_2_1_32_1","first-page":"740","volume-title":"the 52nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO), MICRO '52","author":"Kwon Youngeun","year":"2019","unstructured":"Youngeun Kwon , Yunjae Lee , and Minsoo Rhu . TensorDIMM : A practical near-memory processing architecture for embeddings and tensor operations in deep learning . In the 52nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO), MICRO '52 , pages 740 -- 753 , 2019 . Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. TensorDIMM: A practical near-memory processing architecture for embeddings and tensor operations in deep learning. In the 52nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO), MICRO '52, pages 740--753, 2019."},{"key":"e_1_3_2_1_33_1","volume-title":"12th USENIX Workshop on Hot Topics in Storage and File Systems, HotStorage 2020","author":"Lee Sangwon","year":"2020","unstructured":"Sangwon Lee , Gyuyoung Park , and Myoungsoo Jung . TensorPRAM : Designing a scalable heterogeneous deep learning accelerator with byte-addressable prams . In 12th USENIX Workshop on Hot Topics in Storage and File Systems, HotStorage 2020 , July 13 --14 , 2020 , 2020. Sangwon Lee, Gyuyoung Park, and Myoungsoo Jung. TensorPRAM: Designing a scalable heterogeneous deep learning accelerator with byte-addressable prams. In 12th USENIX Workshop on Hot Topics in Storage and File Systems, HotStorage 2020, July 13--14, 2020, 2020."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2022.3152217"},{"key":"e_1_3_2_1_35_1","volume-title":"RM Petry, and RN Seitz. Investigation of model techniques-first annual report-6 june 1956--1 july 1957--a study of model techniques for communication systems","author":"Leyzorek M","year":"1957","unstructured":"M Leyzorek , RS Gray , AA Johnson , WC Ladew , SR Meaker Jr , RM Petry, and RN Seitz. Investigation of model techniques-first annual report-6 june 1956--1 july 1957--a study of model techniques for communication systems . Case Institute of Technology , Cleveland, Ohio , 1957 . M Leyzorek, RS Gray, AA Johnson, WC Ladew, SR Meaker Jr, RM Petry, and RN Seitz. Investigation of model techniques-first annual report-6 june 1956--1 july 1957--a study of model techniques for communication systems. Case Institute of Technology, Cleveland, Ohio, 1957."},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/Cluster48925.2021.00035"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00071"},{"key":"e_1_3_2_1_38_1","volume-title":"ECL-APSP v1.0. https:\/\/userweb.cs.txstate.edu\/~burtscher\/research\/ECL-APSP\/","author":"Liu Yiqian","year":"2021","unstructured":"Yiqian Liu and Martin Burtscher . ECL-APSP v1.0. https:\/\/userweb.cs.txstate.edu\/~burtscher\/research\/ECL-APSP\/ , 2021 . Yiqian Liu and Martin Burtscher. ECL-APSP v1.0. https:\/\/userweb.cs.txstate.edu\/~burtscher\/research\/ECL-APSP\/, 2021."},{"key":"e_1_3_2_1_39_1","first-page":"28","volume-title":"Liu and Hung-Wei Tseng. NDS: N-Dimensional Storage. In MICRO-54:  54th Annual IEEE\/ACM International Symposium on Microarchitecture, MICRO 2021","author":"Yu-Chia","year":"2021","unstructured":"Yu-Chia Liu and Hung-Wei Tseng. NDS: N-Dimensional Storage. In MICRO-54: 54th Annual IEEE\/ACM International Symposium on Microarchitecture, MICRO 2021 , pages 28 -- 45 , 2021 . Yu-Chia Liu and Hung-Wei Tseng. NDS: N-Dimensional Storage. In MICRO-54: 54th Annual IEEE\/ACM International Symposium on Microarchitecture, MICRO 2021, pages 28--45, 2021."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3092312"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC43674.2020.9286192"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISBI48211.2021.9434068"},{"key":"e_1_3_2_1_43_1","volume-title":"A multi-stage cuda kernel for floyd-warshall. ArXiv, abs\/1001.4108","author":"Lund Ben D.","year":"2010","unstructured":"Ben D. Lund and Justin W. Smith . A multi-stage cuda kernel for floyd-warshall. ArXiv, abs\/1001.4108 , 2010 . Ben D. Lund and Justin W. Smith. A multi-stage cuda kernel for floyd-warshall. ArXiv, abs\/1001.4108, 2010."},{"key":"e_1_3_2_1_44_1","volume-title":"Cuda Floyd Warshall implementation. https:\/\/github.com\/MTB90\/cuda-floyd_warshall","author":"Bojanowski Mateusz","year":"2018","unstructured":"Mateusz Bojanowski . Cuda Floyd Warshall implementation. https:\/\/github.com\/MTB90\/cuda-floyd_warshall , 2018 . Mateusz Bojanowski. Cuda Floyd Warshall implementation. https:\/\/github.com\/MTB90\/cuda-floyd_warshall, 2018."},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/1168149.1168167"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA48506.2021.9561068"},{"issue":"3","key":"e_1_3_2_1_47_1","first-page":"321","article-title":"Semiring frameworks and algorithms for shortest-distance problems. Journal of Automata","volume":"7","author":"Mohri Mehryar","year":"2002","unstructured":"Mehryar Mohri . Semiring frameworks and algorithms for shortest-distance problems. Journal of Automata , Languages and Combinatorics , 7 ( 3 ): 321 -- 350 , 2002 . Mehryar Mohri. Semiring frameworks and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321--350, 2002.","journal-title":"Languages and Combinatorics"},{"key":"e_1_3_2_1_48_1","volume-title":"Simulation of Quantum Many-Body Dynamics with Tensor Processing Units: Floquet Prethermalization. arXiv preprint arXiv:2111.08044","author":"Morningstar Alan","year":"2021","unstructured":"Alan Morningstar , Markus Hauru , Jackson Beall , Martin Ganahl , Adam G. M. Lewis , Vedika Khemani , and Guifre Vidal . Simulation of Quantum Many-Body Dynamics with Tensor Processing Units: Floquet Prethermalization. arXiv preprint arXiv:2111.08044 , 2021 . Alan Morningstar, Markus Hauru, Jackson Beall, Martin Ganahl, Adam G. M. Lewis, Vedika Khemani, and Guifre Vidal. Simulation of Quantum Many-Body Dynamics with Tensor Processing Units: Floquet Prethermalization. arXiv preprint arXiv:2111.08044, 2021."},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2018.00010"},{"key":"e_1_3_2_1_50_1","volume-title":"Daniel Sanchez. PHI: Architectural Support for Synchronization-and Bandwidth-Efficient Commutative Scatter Updates. In the 52nd Annual IEEE\/ACM international symposium on Microarchitecture (MICRO)","author":"Mukkara Anurag","year":"2019","unstructured":"Anurag Mukkara , Nathan Beckmann , and Daniel Sanchez. PHI: Architectural Support for Synchronization-and Bandwidth-Efficient Commutative Scatter Updates. In the 52nd Annual IEEE\/ACM international symposium on Microarchitecture (MICRO) , 2019 . Anurag Mukkara, Nathan Beckmann, and Daniel Sanchez. PHI: Architectural Support for Synchronization-and Bandwidth-Efficient Commutative Scatter Updates. In the 52nd Annual IEEE\/ACM international symposium on Microarchitecture (MICRO), 2019."},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS47924.2020.00043"},{"key":"e_1_3_2_1_52_1","volume-title":"https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/Data-Center\/nvidia-ampere-architecture-whitepaper.pdf","author":"Tensor NVIDIA","year":"2020","unstructured":"NVIDIA A100 Tensor Core GPU Architecture . https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/Data-Center\/nvidia-ampere-architecture-whitepaper.pdf , 2020 . NVIDIA A100 Tensor Core GPU Architecture. https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/Data-Center\/nvidia-ampere-architecture-whitepaper.pdf, 2020."},{"key":"e_1_3_2_1_53_1","volume-title":"NVIDIA T4 TENSOR CORE GPU. https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/Data-Center\/tesla-t4\/t4-tensor-core-datasheet-951643.pdf","author":"NVIDIA Corporation","year":"2019","unstructured":"NVIDIA Corporation . NVIDIA T4 TENSOR CORE GPU. https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/Data-Center\/tesla-t4\/t4-tensor-core-datasheet-951643.pdf , 2019 . NVIDIA Corporation. NVIDIA T4 TENSOR CORE GPU. https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/Data-Center\/tesla-t4\/t4-tensor-core-datasheet-951643.pdf, 2019."},{"key":"e_1_3_2_1_54_1","volume-title":"Warp Level Matrix Multiply-Accumulate Instructions. https:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/index.html#warp-level-matrix-instructions","author":"NVIDIA Corporation","year":"2021","unstructured":"NVIDIA Corporation . Warp Level Matrix Multiply-Accumulate Instructions. https:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/index.html#warp-level-matrix-instructions , 2021 . NVIDIA Corporation. Warp Level Matrix Multiply-Accumulate Instructions. https:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/index.html#warp-level-matrix-instructions, 2021."},{"key":"e_1_3_2_1_55_1","volume-title":"NVIDIA Hopper Architecture In-Depth. https:\/\/developer.nvidia.com\/blog\/nvidia-hopper-architecture-in-depth\/","author":"NVIDIA Corporation","year":"2022","unstructured":"NVIDIA Corporation . NVIDIA Hopper Architecture In-Depth. https:\/\/developer.nvidia.com\/blog\/nvidia-hopper-architecture-in-depth\/ , 2022 . NVIDIA Corporation. NVIDIA Hopper Architecture In-Depth. https:\/\/developer.nvidia.com\/blog\/nvidia-hopper-architecture-in-depth\/, 2022."},{"key":"e_1_3_2_1_56_1","volume-title":"cuBool: sparse Boolean linear algebra for NVIDIA CUDA. https:\/\/github.com\/JetBrains-Research\/cuBool","author":"Orachyov Egor","year":"2021","unstructured":"Egor Orachyov , Pavel Alimov , and Semyon Grigorev . cuBool: sparse Boolean linear algebra for NVIDIA CUDA. https:\/\/github.com\/JetBrains-Research\/cuBool , 2021 . Version 1.2.0. Egor Orachyov, Pavel Alimov, and Semyon Grigorev. cuBool: sparse Boolean linear algebra for NVIDIA CUDA. https:\/\/github.com\/JetBrains-Research\/cuBool, 2021. Version 1.2.0."},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080254"},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00015"},{"key":"e_1_3_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2019.00016"},{"key":"e_1_3_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1504\/IJCSE.2013.052115"},{"key":"e_1_3_2_1_61_1","first-page":"19","volume-title":"Benchmarking and Simulation of High Performance Computer Systems (PMBS)","author":"Salmon Justin","year":"2019","unstructured":"Justin Salmon and Simon McIntosh-Smith . Exploiting hardware-accelerated ray tracing for monte carlo particle transport with openmc. In 2019 IEEE\/ACM Performance Modeling , Benchmarking and Simulation of High Performance Computer Systems (PMBS) , pages 19 -- 29 , 2019 . Justin Salmon and Simon McIntosh-Smith. Exploiting hardware-accelerated ray tracing for monte carlo particle transport with openmc. In 2019 IEEE\/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pages 19--29, 2019."},{"key":"e_1_3_2_1_62_1","first-page":"225","volume-title":"Parallel Processing and Applied Mathematics","author":"Stanislav","year":"2012","unstructured":"Stanislav G. Sedukhin and Marcin Paprzycki. Generalizing matrix multiplication for efficient computations on modern computers . In Parallel Processing and Applied Mathematics , pages 225 -- 234 , 2012 . Stanislav G. Sedukhin and Marcin Paprzycki. Generalizing matrix multiplication for efficient computations on modern computers. In Parallel Processing and Applied Mathematics, pages 225--234, 2012."},{"key":"e_1_3_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/2312005.2312018"},{"key":"e_1_3_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00052"},{"key":"e_1_3_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO50266.2020.00068"},{"key":"e_1_3_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00062"},{"key":"e_1_3_2_1_67_1","volume-title":"Energy Efficiency Boost in the AI-Infused POWER10 Processor. In 2021 ACM\/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)","author":"Thompto Brian W","year":"2021","unstructured":"Brian W Thompto , Dung Q Nguyen , Jos\u00e9 E Moreira , Ramon Bertran , Hans Jacobson , Richard J Eickemeyer , Rahul M Rao , Michael Goulet , Marcy Byers , Christopher J Gonzalez , Energy Efficiency Boost in the AI-Infused POWER10 Processor. In 2021 ACM\/IEEE 48th Annual International Symposium on Computer Architecture (ISCA) , 2021 . Brian W Thompto, Dung Q Nguyen, Jos\u00e9 E Moreira, Ramon Bertran, Hans Jacobson, Richard J Eickemeyer, Rahul M Rao, Michael Goulet, Marcy Byers, Christopher J Gonzalez, et al. Energy Efficiency Boost in the AI-Infused POWER10 Processor. In 2021 ACM\/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021."},{"key":"e_1_3_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2014.6844458"},{"key":"e_1_3_2_1_69_1","volume-title":"kNN-CUDA. https:\/\/github.com\/vincentfpgarcia\/kNN-CUDA","author":"Vincent Garcia Michel Barlaud","year":"2018","unstructured":"Michel Barlaud Vincent Garcia , \u00c9ric Debreuve . kNN-CUDA. https:\/\/github.com\/vincentfpgarcia\/kNN-CUDA , 2018 . Michel Barlaud Vincent Garcia, \u00c9ric Debreuve. kNN-CUDA. https:\/\/github.com\/vincentfpgarcia\/kNN-CUDA, 2018."},{"issue":"4","key":"e_1_3_2_1_70_1","article-title":"First draft of a report on the edvac","volume":"15","author":"Neumann John Von","year":"1993","unstructured":"John Von Neumann . First draft of a report on the edvac . IEEE Annals of the History of Computing , 15 ( 4 ), 1993 . John Von Neumann. First draft of a report on the edvac. IEEE Annals of the History of Computing, 15(4), 1993.","journal-title":"IEEE Annals of the History of Computing"},{"key":"e_1_3_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00063"},{"key":"e_1_3_2_1_72_1","volume-title":"Hitting the memory wall: Implications of the obvious. ACM SIGARCH computer architecture news, 23(1):20--24","author":"Wulf Wm A","year":"1995","unstructured":"Wm A Wulf and Sally A McKee . Hitting the memory wall: Implications of the obvious. ACM SIGARCH computer architecture news, 23(1):20--24 , 1995 . Wm A Wulf and Sally A McKee. Hitting the memory wall: Implications of the obvious. ACM SIGARCH computer architecture news, 23(1):20--24, 1995."},{"key":"e_1_3_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1145\/3445814.3446702"},{"key":"e_1_3_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00053"},{"key":"e_1_3_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00030"},{"key":"e_1_3_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2018.00011"},{"key":"e_1_3_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358269"},{"key":"e_1_3_2_1_78_1","first-page":"76","volume-title":"Zhu. RTNN: Accelerating Neighbor Search Using Hardware Ray Tracing. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '22","author":"Yuhao","year":"2022","unstructured":"Yuhao Zhu. RTNN: Accelerating Neighbor Search Using Hardware Ray Tracing. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '22 , pages 76 -- 89 , 2022 . Yuhao Zhu. RTNN: Accelerating Neighbor Search Using Hardware Ray Tracing. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '22, pages 76--89, 2022."},{"key":"e_1_3_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358256"}],"event":{"name":"ISCA '22: The 49th Annual International Symposium on Computer Architecture","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture","IEEE CS TCAA IEEE CS technical committee on architectural acoustics"],"location":"New York New York","acronym":"ISCA '22"},"container-title":["Proceedings of the 49th Annual International Symposium on Computer Architecture"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3470496.3527411","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3470496.3527411","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:30:28Z","timestamp":1750188628000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3470496.3527411"}},"subtitle":["a generalized matrix instruction set for accelerating tensor computation beyond GEMM"],"short-title":[],"issued":{"date-parts":[[2022,6,11]]},"references-count":79,"alternative-id":["10.1145\/3470496.3527411","10.1145\/3470496"],"URL":"https:\/\/doi.org\/10.1145\/3470496.3527411","relation":{},"subject":[],"published":{"date-parts":[[2022,6,11]]},"assertion":[{"value":"2022-06-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}