{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T05:27:58Z","timestamp":1767850078341,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":87,"publisher":"ACM","license":[{"start":{"date-parts":[[2024,4,27]],"date-time":"2024-04-27T00:00:00Z","timestamp":1714176000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000001","name":"NSF (National Science Foundation)","doi-asserted-by":"publisher","award":["DGE-2137424"],"award-info":[{"award-number":["DGE-2137424"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,4,27]]},"DOI":"10.1145\/3620665.3640367","type":"proceedings-article","created":{"date-parts":[[2024,4,22]],"date-time":"2024-04-22T14:18:06Z","timestamp":1713795486000},"page":"20-36","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["A Journey of a 1,000 Kernels Begins with a Single Step: A Retrospective of Deep Learning on GPUs"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-1588-9604","authenticated-orcid":false,"given":"Michael","family":"Davies","sequence":"first","affiliation":[{"name":"University of Wisconsin-Madison, Madison, Wisconsin, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-4339-7233","authenticated-orcid":false,"given":"Ian","family":"McDougall","sequence":"additional","affiliation":[{"name":"University of Wisconsin-Madison, Madison, Wisconsin, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-0815-9090","authenticated-orcid":false,"given":"Selvaraj","family":"Anandaraj","sequence":"additional","affiliation":[{"name":"University of Wisconsin-Madison, Madison, 
Wisconsin, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-0826-7874","authenticated-orcid":false,"given":"Deep","family":"Machchhar","sequence":"additional","affiliation":[{"name":"University of Wisconsin-Madison, Madison, Wisconsin, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-7953-6786","authenticated-orcid":false,"given":"Rithik","family":"Jain","sequence":"additional","affiliation":[{"name":"University of Wisconsin-Madison, Madison, Wisconsin, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8315-2389","authenticated-orcid":false,"given":"Karthikeyan","family":"Sankaralingam","sequence":"additional","affiliation":[{"name":"University of Wisconsin-Madison, Madison, Wisconsin, United States of America"}]}],"member":"320","published-online":{"date-parts":[[2024,4,27]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Michael Andersch Greg Palmer Ronny Krashinsky Nick Stam Vishal Mehta Gonzalo Brito and Sridhar Ramaswamy. Nvidia hopper architecture in-depth. https:\/\/developer.nvidia.com\/blog\/nvidia-hopper-architecture-in-depth\/."},{"key":"e_1_3_2_1_2_1","first-page":"26","volume-title":"Benchmarking and Simulation of High Performance Computer Systems (PMBS)","author":"Anzt Hartwig","year":"2020","unstructured":"Hartwig Anzt, Yuhsiang M. Tsai, Ahmad Abdelfattah, Terry Cojean, and Jack Dongarra. Evaluating the Performance of NVIDIA's A100 Ampere GPU for Sparse and Batched Computations. In 2020 IEEE\/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pages 26--38, GA, USA, November 2020. IEEE."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i8.16826"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/MDAT.2022.3161126"},{"key":"e_1_3_2_1_5_1","volume-title":"Language models are few-shot learners. 
Advances in neural information processing systems, 33:1877--1901","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901, 2020."},{"key":"e_1_3_2_1_6_1","unstructured":"Tianqi Chen Thierry Moreau Ziheng Jiang Lianmin Zheng Eddie Yan Meghan Cowan Haichen Shen Leyuan Wang Yuwei Hu Luis Ceze Carlos Guestrin and Arvind Krishnamurthy. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. page 17."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2021.3061394"},{"key":"e_1_3_2_1_8_1","volume-title":"New York Times","author":"Clark Don","year":"2017","unstructured":"Don Clark. Why a 24-year-old chipmaker is one of tech's hot prospects. New York Times, September 2017."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2021.3098483"},{"key":"e_1_3_2_1_10_1","unstructured":"Werner Duvaud. Muzero general. https:\/\/github.com\/werner-duvaud\/muzero-general."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.7717\/peerj-cs.330"},{"key":"e_1_3_2_1_12_1","volume-title":"Will AMD's MI300 Beat NVIDIA In AI?. january","year":"2023","unstructured":"Forbes. Will AMD's MI300 Beat NVIDIA In AI?. january 2023. https:\/\/www.forbes.com\/sites\/karlfreund\/2023\/01\/09\/will-amds-mi300-beat-nvidia-in-ai\/?sh=12520262491e."},{"key":"e_1_3_2_1_13_1","unstructured":"Karl Freund. Nvidia again claims the title for the fastest ai; competitors disagree."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2019.00023"},{"key":"e_1_3_2_1_15_1","first-page":"19","volume-title":"Jure Leskovec. Inductive Representation Learning on Large Graphs. 
NEURIPS","author":"Hamilton William L","year":"2017","unstructured":"William L Hamilton, Rex Ying, and Jure Leskovec. Inductive Representation Learning on Large Graphs. NEURIPS 2017. page 19."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2018.2830661"},{"key":"e_1_3_2_1_17_1","volume-title":"Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409","author":"Hestness Joel","year":"2017","unstructured":"Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Patwary, Mostofa Ali, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017."},{"issue":"241","key":"e_1_3_2_1_18_1","first-page":"1","article-title":"Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks","volume":"22","author":"Hoefler Torsten","year":"2021","unstructured":"Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. J. Mach. Learn. Res., 22(241):1--124, 2021.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_3_2_1_19_1","volume-title":"Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. J. Mach. Learn. Res., 22(1), jan","author":"Hoefler Torsten","year":"2021","unstructured":"Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. J. Mach. Learn. Res., 22(1), jan 2021."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00050"},{"key":"e_1_3_2_1_21_1","volume-title":"et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. 
Advances in neural information processing systems, 32","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019."},{"key":"e_1_3_2_1_22_1","first-page":"711","article-title":"Data movement is all you need: A case study on optimizing transformers","volume":"3","author":"Ivanov Andrei","year":"2021","unstructured":"Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Data movement is all you need: A case study on optimizing transformers. Proceedings of Machine Learning and Systems, 3:711--732, 2021.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_23_1","volume-title":"Dissecting the nvidia turing t4 gpu via microbenchmarking. arXiv preprint arXiv:1903.07486","author":"Jia Zhe","year":"2019","unstructured":"Zhe Jia, Marco Maggioni, Jeffrey Smith, and Daniele Paolo Scarpazza. Dissecting the nvidia turing t4 gpu via microbenchmarking. arXiv preprint arXiv:1903.07486, 2019."},{"key":"e_1_3_2_1_24_1","volume-title":"April","author":"Jia Zhe","year":"2018","unstructured":"Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking, April 2018. arXiv:1804.06826 [cs]."},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00010"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3453483.3454038"},{"key":"e_1_3_2_1_27_1","unstructured":"Andrej Karpathy. mingpt. https:\/\/github.com\/karpathy\/minGPT."},{"key":"e_1_3_2_1_28_1","volume-title":"GPU Technology Conference. NVIDIA","author":"Kerr Andrew","year":"2020","unstructured":"Andrew Kerr. Developing cuda kernels to push tensor cores to the absolute limit on nvidia a100. 
In GPU Technology Conference. NVIDIA, 2020."},{"key":"e_1_3_2_1_29_1","volume-title":"NVIDIA Developer Blog","author":"Kerr Andrew","year":"2017","unstructured":"Andrew Kerr, Duane Merrill, Julien Demouth, and John Tran. Cutlass: Fast linear algebra in cuda c++. NVIDIA Developer Blog, 2017."},{"key":"e_1_3_2_1_30_1","unstructured":"Ronny Krashinsky Olivier Giroux Stephen Jones Nick Stam and Sridhar Ramaswamy. Nvidia ampere architecture in-depth. https:\/\/developer.nvidia.com\/blog\/nvidia-ampere-architecture-in-depth."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358284"},{"key":"e_1_3_2_1_32_1","volume-title":"Deep learning on fpgas: Past, present, and future. arXiv preprint arXiv:1602.04283","author":"Lacey Griffin","year":"2016","unstructured":"Griffin Lacey, Graham W Taylor, and Shawki Areibi. Deep learning on fpgas: Past, present, and future. arXiv preprint arXiv:1602.04283, 2016."},{"key":"e_1_3_2_1_33_1","volume-title":"MLIR: A compiler infrastructure for the end of moore's law. CoRR, abs\/2002.11054","author":"Lattner Chris","year":"2020","unstructured":"Chris Lattner, Jacques A. Pienaar, Mehdi Amini, Uday Bondhugula, River Riddle, Albert Cohen, Tatiana Shpeisman, Andy Davis, Nicolas Vasilache, and Oleksandr Zinenko. MLIR: A compiler infrastructure for the end of moore's law. CoRR, abs\/2002.11054, 2020."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2019.2928289"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2020.3030548"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_2_1_37_1","first-page":"522","volume-title":"Erwin Laure, Ivy Bo Peng, and Jeffrey S Vetter. Nvidia tensor core programmability, performance & precision. 
In 2018 IEEE international parallel and distributed processing symposium workshops (IPDPSW)","author":"Markidis Stefano","year":"2018","unstructured":"Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S Vetter. Nvidia tensor core programmability, performance & precision. In 2018 IEEE international parallel and distributed processing symposium workshops (IPDPSW), pages 522--531. IEEE, 2018."},{"key":"e_1_3_2_1_38_1","first-page":"444","volume-title":"Benchmarking the NVIDIA V100 GPU and Tensor Cores","author":"Martineau Matt","year":"2018","unstructured":"Matt Martineau, Patrick Atkinson, and Simon McIntosh-Smith. Benchmarking the NVIDIA V100 GPU and Tensor Cores. In Gabriele Mencagli, Dora B. Heras, Valeria Cardellini, Emiliano Casalicchio, Emmanuel Jeannot, Felix Wolf, Antonio Salis, Claudio Schifanella, Ravi Reddy Manumachu, Laura Ricci, Marco Beccuti, Laura Antonelli, Jos\u00e9 Daniel Garcia Sanchez, and Stephen L. Scott, editors, Euro-Par 2018: Parallel Processing Workshops, pages 444--455, Cham, 2019. Springer International Publishing."},{"key":"e_1_3_2_1_39_1","volume-title":"April","author":"Masters Dominic","year":"2018","unstructured":"Dominic Masters and Carlo Luschi. Revisiting Small Batch Training for Deep Neural Networks, April 2018. arXiv:1804.07612 [cs, stat]."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2016.2549523"},{"key":"e_1_3_2_1_41_1","volume-title":"International Conference on Learning Representations","author":"Micikevicius Paulius","year":"2018","unstructured":"Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. In International Conference on Learning Representations, 2018."},{"key":"e_1_3_2_1_42_1","volume-title":"Phi-2: The surprising power of small language models","year":"2023","unstructured":"Microsoft. 
Phi-2: The surprising power of small language models, 2023."},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3503250"},{"key":"e_1_3_2_1_44_1","article-title":"A Survey of Deep Learning on CPUs: Opportunities and Co-Optimizations","author":"Mittal Sparsh","year":"2021","unstructured":"Sparsh Mittal, Poonam Rajput, and Sreenivas Subramoney. A Survey of Deep Learning on CPUs: Opportunities and Co-Optimizations. IEEE Transactions on Neural Networks and Learning Systems, pages 1--21, 2021.","journal-title":"IEEE Transactions on Neural Networks and Learning Systems, pages 1--21"},{"key":"e_1_3_2_1_45_1","volume-title":"Instant neural graphics primitives with a multiresolution hash encoding. arXiv preprint arXiv:2201.05989","author":"M\u00fcller Thomas","year":"2022","unstructured":"Thomas M\u00fcller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. arXiv preprint arXiv:2201.05989, 2022."},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_2_1_47_1","unstructured":"NVIDIA. GP100 Pascal Whitepaper. https:\/\/images.nvidia.com\/content\/pdf\/tesla\/whitepaper\/pascal-architecture-whitepaper.pdf."},{"key":"e_1_3_2_1_48_1","unstructured":"NVIDIA. Kernel profiling guide. https:\/\/docs.nvidia.com\/nsight-compute\/ProfilingGuide\/index.html#metrics-structure."},{"key":"e_1_3_2_1_49_1","unstructured":"NVIDIA. Ngc | pytorch. https:\/\/catalog.ngc.nvidia.com\/orgs\/nvidia\/containers\/pytorch."},{"key":"e_1_3_2_1_50_1","unstructured":"NVIDIA. Ngc | tensorflow. https:\/\/catalog.ngc.nvidia.com\/orgs\/nvidia\/containers\/tensorflow."},{"key":"e_1_3_2_1_51_1","unstructured":"NVIDIA. NVIDIA AMPERE GA102 GPU ARCHITECTURE. https:\/\/www.nvidia.com\/content\/PDF\/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf."},{"key":"e_1_3_2_1_52_1","unstructured":"NVIDIA. NVIDIA TESLA V100 GPU ARCHITECTURE. 
https:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf."},{"key":"e_1_3_2_1_53_1","volume-title":"May","author":"NVIDIA.","year":"2020","unstructured":"NVIDIA. NVIDIA Ampere Architecture In-Depth. https:\/\/developer.nvidia.com\/blog\/nvidia-ampere-architecture-in-depth\/, May 2020."},{"key":"e_1_3_2_1_54_1","volume-title":"Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499","author":"van den Oord Aaron","year":"2016","unstructured":"Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016."},{"key":"e_1_3_2_1_55_1","volume-title":"https:\/\/openai.com\/blog\/chatgpt","author":"Introducing AI.","year":"2022","unstructured":"OpenAI. Introducing ChatGPT. https:\/\/openai.com\/blog\/chatgpt, 2022."},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3124545"},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00023"},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2019.00042"},{"key":"e_1_3_2_1_59_1","volume-title":"International Conference on Learning Representations","author":"Pfaff Tobias","year":"2021","unstructured":"Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter W. Battaglia. Learning mesh-based simulation with graph networks. In International Conference on Learning Representations, 2021."},{"key":"e_1_3_2_1_60_1","unstructured":"PyTorch. Performance Tuning Guide --- PyTorch Tutorials 1.12.1+cu102 documentation."},{"key":"e_1_3_2_1_61_1","unstructured":"PyTorch. Pytorch profiler. 
https:\/\/pytorch.org\/tutorials\/recipes\/recipes\/profiler_recipe.html."},{"key":"e_1_3_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2019.00016"},{"key":"e_1_3_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2018.10.045"},{"key":"e_1_3_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406703"},{"key":"e_1_3_2_1_65_1","unstructured":"Tiernan Ray. Chip industry is going to need a lot more software to catch Nvidia's lead in AI. https:\/\/www.zdnet.com\/article\/chip-industry-is-going-to-need-a-lot-more-software-to-catch-nvidias-lead-in-ai\/."},{"key":"e_1_3_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00045"},{"key":"e_1_3_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01407"},{"key":"e_1_3_2_1_68_1","first-page":"1","volume-title":"Jeremy Kepner. Survey of Machine Learning Accelerators. In 2020 IEEE High Performance Extreme Computing Conference (HPEC)","author":"Reuther Albert","year":"2020","unstructured":"Albert Reuther, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, and Jeremy Kepner. Survey of Machine Learning Accelerators. In 2020 IEEE High Performance Extreme Computing Conference (HPEC), pages 1--12, Waltham, MA, USA, September 2020. IEEE."},{"key":"e_1_3_2_1_69_1","volume-title":"Compiling machine learning for peak performance","author":"Sabne Amit","year":"2020","unstructured":"Amit Sabne. Xla : Compiling machine learning for peak performance, 2020."},{"key":"e_1_3_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-020-03051-4"},{"key":"e_1_3_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2018.2876312"},{"key":"e_1_3_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN55064.2022.9891914"},{"key":"e_1_3_2_1_73_1","volume-title":"Measuring the effects of data parallelism on neural network training. CoRR, abs\/1811.03600","author":"Shallue Christopher J.","year":"2018","unstructured":"Christopher J.
Shallue, Jaehoon Lee, Joseph M. Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. CoRR, abs\/1811.03600, 2018."},{"key":"e_1_3_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"e_1_3_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2019.00027"},{"key":"e_1_3_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2022.3217824"},{"key":"e_1_3_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3468044.3468053"},{"key":"e_1_3_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2003.1223637"},{"key":"e_1_3_2_1_79_1","unstructured":"TensorFlow. Tensorflow profiler. https:\/\/www.tensorflow.org\/versions\/r1.15\/api_docs\/python\/tf\/profiler."},{"key":"e_1_3_2_1_80_1","volume-title":"Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023."},{"key":"e_1_3_2_1_81_1","volume-title":"Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730","author":"Vasilache Nicolas","year":"2018","unstructured":"Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. 
arXiv preprint arXiv:1802.04730, 2018."},{"key":"e_1_3_2_1_82_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS48437.2020.00013"},{"key":"e_1_3_2_1_83_1","first-page":"37","volume-title":"OSDI","author":"Wang Haojie","year":"2021","unstructured":"Haojie Wang, Jidong Zhai, Mingyu Gao, Zixuan Ma, Shizhi Tang, Liyan Zheng, Yuanzhi Li, Kaiyuan Rong, Yuanyong Chen, and Zhihao Jia. Pet: Optimizing tensor programs with partially equivalent transformations and automated corrections. In OSDI, pages 37--54, 2021."},{"key":"e_1_3_2_1_84_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2010.5452013"},{"key":"e_1_3_2_1_85_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS47924.2020.00071"},{"key":"e_1_3_2_1_86_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00012"},{"key":"e_1_3_2_1_87_1","first-page":"874","volume-title":"ISCA","author":"Zheng Size","year":"2022","unstructured":"Size Zheng, Renze Chen, Anjiang Wei, Yicheng Jin, Qin Han, Liqiang Lu, Bingyang Wu, Xiuhong Li, Shengen Yan, and Yun Liang. Amos: enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction. 
In ISCA, pages 874--887, 2022."}],"event":{"name":"ASPLOS '24: 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2","location":"La Jolla CA USA","acronym":"ASPLOS '24","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture","SIGOPS ACM Special Interest Group on Operating Systems","SIGPLAN ACM Special Interest Group on Programming Languages","SIGBED ACM Special Interest Group on Embedded Systems"]},"container-title":["Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3620665.3640367","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/abs\/10.1145\/3620665.3640367","content-type":"text\/html","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3620665.3640367","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3620665.3640367","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:03:41Z","timestamp":1750291421000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3620665.3640367"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,4,27]]},"references-count":87,"alternative-id":["10.1145\/3620665.3640367","10.1145\/3620665"],"URL":"https:\/\/doi.org\/10.1145\/3620665.3640367","relation":{},"subject":[],"published":{"date-parts":[[2024,4,27]]},"assertion":[{"value":"2024-04-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}