{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,23]],"date-time":"2025-08-23T00:08:58Z","timestamp":1755907738173,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":87,"publisher":"ACM","license":[{"start":{"date-parts":[[2024,10,13]],"date-time":"2024-10-13T00:00:00Z","timestamp":1728777600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,10,14]]},"DOI":"10.1145\/3656019.3676950","type":"proceedings-article","created":{"date-parts":[[2024,10,11]],"date-time":"2024-10-11T10:34:08Z","timestamp":1728642848000},"page":"284-296","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["BOOM: Use your Desktop to Accurately Predict the Performance of Large Deep Neural Networks"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-2612-6634","authenticated-orcid":false,"given":"Qidong","family":"Su","sequence":"first","affiliation":[{"name":"University of Toronto, Canada; Vector Institute, Canada and CentML, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-9581-9088","authenticated-orcid":false,"given":"Jiacheng","family":"Yang","sequence":"additional","affiliation":[{"name":"University of Toronto, Canada and Vector Institute, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3839-0919","authenticated-orcid":false,"given":"Gennady","family":"Pekhimenko","sequence":"additional","affiliation":[{"name":"University of Toronto, Canada; Vector Institute, Canada and CentML, Canada"}]}],"member":"320","published-online":{"date-parts":[[2024,10,13]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2023. Amazon EC2 Instance Types. https:\/\/aws.amazon.com\/ec2\/instance-types\/."},{"key":"e_1_3_2_1_2_1","unstructured":"2023. AWS Inferentia. https:\/\/aws.amazon.com\/machine-learning\/inferentia\/."},{"key":"e_1_3_2_1_3_1","unstructured":"2023. AWS Trainium. https:\/\/aws.amazon.com\/machine-learning\/trainium\/."},{"key":"e_1_3_2_1_4_1","unstructured":"2023. cuBLAS. https:\/\/docs.nvidia.com\/cuda\/cublas\/."},{"key":"e_1_3_2_1_5_1","unstructured":"2023. NVIDIA Nsight Compute. https:\/\/developer.nvidia.com\/nsight-compute."},{"key":"e_1_3_2_1_6_1","unstructured":"2023. NVIDIA Nsight Systems. https:\/\/developer.nvidia.com\/nsight-systems."},{"key":"e_1_3_2_1_7_1","unstructured":"2023. PyTorch Memory Management. https:\/\/pytorch.org\/docs\/stable\/notes\/cuda.html#memory-management."},{"key":"e_1_3_2_1_8_1","unstructured":"2023. SambaNova DataScale. https:\/\/sambanova.ai\/products\/datascale\/."},{"key":"e_1_3_2_1_9_1","unstructured":"2024. vast.ai. https:\/\/vast.ai\/."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3306346.3322967"},{"key":"e_1_3_2_1_11_1","volume-title":"Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction. arXiv preprint arXiv:2210.10246","author":"Andoorveedu Muralidhar","year":"2022","unstructured":"Muralidhar Andoorveedu, Zhanda Zhu, Bojian Zheng, and Gennady Pekhimenko. 2022. Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction. arXiv preprint arXiv:2210.10246 (2022)."},{"key":"e_1_3_2_1_12_1","first-page":"181","article-title":"A deep learning based cost model for automatic code optimization","volume":"3","author":"Baghdadi Riyadh","year":"2021","unstructured":"Riyadh Baghdadi, Massinissa Merouani, Mohamed-Hicham Leghettas, Kamel Abdous, Taha Arbaoui, Karima Benatchba, 2021. A deep learning based cost model for automatic code optimization. Proceedings of Machine Learning and Systems 3 (2021), 181\u2013193.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_13_1","volume-title":"Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150","author":"Beltagy Iz","year":"2020","unstructured":"Iz Beltagy, Matthew\u00a0E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020)."},{"key":"e_1_3_2_1_14_1","volume-title":"Language models are few-shot learners. Advances in neural information processing systems 33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared\u00a0D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877\u20131901."},{"key":"e_1_3_2_1_15_1","volume-title":"Pixelated butterfly: Simple and efficient sparse training for neural network models. arXiv preprint arXiv:2112.00029","author":"Chen Beidi","year":"2021","unstructured":"Beidi Chen, Tri Dao, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, and Christopher Re. 2021. Pixelated butterfly: Simple and efficient sparse training for neural network models. arXiv preprint arXiv:2112.00029 (2021)."},{"key":"e_1_3_2_1_16_1","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, 2018. { TVM} : An automated { End-to-End} optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578\u2013594."},{"key":"e_1_3_2_1_17_1","volume-title":"Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174","author":"Chen Tianqi","year":"2016","unstructured":"Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016)."},{"key":"e_1_3_2_1_18_1","volume-title":"Learning to optimize tensor programs. Advances in Neural Information Processing Systems 31","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to optimize tensor programs. Advances in Neural Information Processing Systems 31 (2018)."},{"key":"e_1_3_2_1_19_1","volume-title":"cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759","author":"Chetlur Sharan","year":"2014","unstructured":"Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014)."},{"key":"e_1_3_2_1_20_1","volume-title":"Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509","author":"Child Rewon","year":"2019","unstructured":"Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 (2019)."},{"key":"e_1_3_2_1_21_1","volume-title":"models. visited June 28","author":"Contributors Torch","year":"2021","unstructured":"Torch Contributors. 2021. Torchvision. models. visited June 28 (2021)."},{"key":"e_1_3_2_1_22_1","volume-title":"6th International Conference on Learning Representations, ICLR","author":"Das Dipankar","year":"2018","unstructured":"Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj\u00a0D. Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, Alexander Heinecke, Pradeep Dubey, Jes\u00fas Corbal, Nikita Shustrov, Roman Dubtsov, Evarist Fomenko, and Vadim\u00a0O. Pirogov. 2018. Mixed Precision Training of Convolutional Neural Networks using Integer Operations. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. https:\/\/openreview.net\/forum?id=H135uzZ0-"},{"key":"e_1_3_2_1_23_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3054824"},{"key":"e_1_3_2_1_25_1","volume-title":"An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929","author":"Dosovitskiy Alexey","year":"2020","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00075"},{"key":"e_1_3_2_1_27_1","volume-title":"Tensorir: An abstraction for automatic tensorized program optimization. arXiv preprint arXiv:2207.04296","author":"Feng Siyuan","year":"2022","unstructured":"Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody\u00a0Hao Yu, Yong Yu, 2022. Tensorir: An abstraction for automatic tensorized program optimization. arXiv preprint arXiv:2207.04296 (2022)."},{"key":"e_1_3_2_1_28_1","unstructured":"Yanjie Gao Xianyu Gu Hongyu Zhang Haoxiang Lin and Mao Yang. 2021. Runtime Performance Prediction for Deep Learning Models with Graph Neural Network. Technical Report. Technical Report MSR-TR-2021-3. Microsoft."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3368089.3417050"},{"key":"e_1_3_2_1_30_1","volume-title":"2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Geoffrey X\u00a0Yu","year":"2021","unstructured":"X\u00a0Yu Geoffrey, Yubo Gao, Pavel Golikov, and Gennady Pekhimenko. 2021. Habitat: A { Runtime-Based} Computational Performance Predictor for Deep Neural Network Training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 503\u2013521."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00140"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378530"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.243"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.5555\/3122009.3242044"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00070"},{"key":"e_1_3_2_1_37_1","first-page":"497","article-title":"Checkmate: Breaking the memory wall with optimal tensor rematerialization","volume":"2","author":"Jain Paras","year":"2020","unstructured":"Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. 2020. Checkmate: Breaking the memory wall with optimal tensor rematerialization. Proceedings of Machine Learning and Systems 2 (2020), 497\u2013511.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359630"},{"key":"e_1_3_2_1_39_1","volume-title":"Dissecting the graphcore ipu architecture via microbenchmarking. arXiv preprint arXiv:1912.03413","author":"Jia Zhe","year":"2019","unstructured":"Zhe Jia, Blake Tillman, Marco Maggioni, and Daniele\u00a0Paolo Scarpazza. 2019. Dissecting the graphcore ipu architecture via microbenchmarking. arXiv preprint arXiv:1912.03413 (2019)."},{"key":"e_1_3_2_1_40_1","volume-title":"Motivation for and evaluation of the first tensor processing unit. ieee Micro 38, 3","author":"Jouppi Norman","year":"2018","unstructured":"Norman Jouppi, Cliff Young, Nishant Patil, and David Patterson. 2018. Motivation for and evaluation of the first tensor processing unit. ieee Micro 38, 3 (2018), 10\u201319."},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/BigData.2018.8622396"},{"key":"e_1_3_2_1_42_1","first-page":"387","article-title":"A learned performance model for tensor processing units","volume":"3","author":"Kaufman Sam","year":"2021","unstructured":"Sam Kaufman, Phitchaya Phothilimthana, Yanqi Zhou, Charith Mendis, Sudip Roy, Amit Sabne, and Mike Burrows. 2021. A learned performance model for tensor processing units. Proceedings of Machine Learning and Systems 3 (2021), 387\u2013400.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2011.89"},{"key":"e_1_3_2_1_44_1","unstructured":"Andrew Kerr Haicheng Wu Manish Gupta Dustyn Blasig Pradeep Ramini Duane Merrill Aniket Shivam Piotr Majcher Paul Springer Markus Hohnerbach Jin Wang and Matt Nicely. 2022. CUTLASS. https:\/\/github.com\/NVIDIA\/cutlass"},{"key":"e_1_3_2_1_45_1","volume-title":"Dynamic tensor rematerialization. arXiv preprint arXiv:2006.09616","author":"Kirisame Marisa","year":"2020","unstructured":"Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, and Zachary Tatlock. 2020. Dynamic tensor rematerialization. arXiv preprint arXiv:2006.09616 (2020)."},{"key":"e_1_3_2_1_46_1","volume-title":"Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942","author":"Lan Zhenzhong","year":"2019","unstructured":"Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)."},{"key":"e_1_3_2_1_47_1","unstructured":"Rebecca Lewington. 2021. An AI Chip With Unprecedented Performance To Do the Unimaginable. (2021)."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"crossref","unstructured":"Xiaoyao Liang. 2019. Ascend AI Processor architecture and programming.","DOI":"10.1016\/B978-0-12-823488-4.00003-5"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/HOTCHIPS.2019.8875654"},{"key":"e_1_3_2_1_50_1","volume-title":"PerfNetRT: Platform-Aware Performance Modeling for Optimized Deep Neural Networks. In 2020 International Computer Symposium (ICS). IEEE, 153\u2013158","author":"Liao Ying-Chiao","year":"2020","unstructured":"Ying-Chiao Liao, Chuan-Chi Wang, Chia-Heng Tu, Ming-Chang Kao, Wen-Yew Liang, and Shih-Hao Hung. 2020. PerfNetRT: Platform-Aware Performance Modeling for Optimized Deep Neural Networks. In 2020 International Computer Symposium (ICS). IEEE, 153\u2013158."},{"key":"e_1_3_2_1_51_1","volume-title":"Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971","author":"Lillicrap P","year":"2015","unstructured":"Timothy\u00a0P Lillicrap, Jonathan\u00a0J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)."},{"key":"e_1_3_2_1_52_1","volume-title":"Building a Performance Model for Deep Learning Recommendation Model Training on GPUs. arXiv preprint arXiv:2201.07821","author":"Lin Zhongyi","year":"2022","unstructured":"Zhongyi Lin, Louis Feng, Ehsan\u00a0K Ardestani, Jaewon Lee, John Lundell, Changkyu Kim, Arun Kejariwal, and John\u00a0D Owens. 2022. Building a Performance Model for Deep Learning Recommendation Model Training on GPUs. arXiv preprint arXiv:2201.07821 (2022)."},{"key":"e_1_3_2_1_53_1","volume-title":"View. In 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 173\u2013185","author":"Liu Guodong","year":"2021","unstructured":"Guodong Liu, Sa Wang, and Yungang Bao. 2021. SEER: A Time Prediction Model for CNNs from GPU Kernel\u2019s View. In 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 173\u2013185."},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001179"},{"key":"e_1_3_2_1_55_1","unstructured":"Yao Lu Song Bian Lequn Chen Yongjun He Yulong Hui Matthew Lentz Beibin Li Fei Liu Jialin Li Qi Liu Rui Liu Xiaoxuan Liu Lin Ma Kexin Rong Jianguo Wang Yingjun Wu Yongji Wu Huanchen Zhang Minjia Zhang Qizhen Zhang Tianyi Zhou and Danyang Zhuo. 2024. Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native. arxiv:2401.12230\u00a0[cs.DC]"},{"key":"e_1_3_2_1_56_1","first-page":"336","article-title":"Mlperf training benchmark","volume":"2","author":"Mattson Peter","year":"2020","unstructured":"Peter Mattson, Christine Cheng, Gregory Diamos, Cody Coleman, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, 2020. Mlperf training benchmark. Proceedings of Machine Learning and Systems 2 (2020), 336\u2013349.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_57_1","volume-title":"VLDB DISPA Workshop","author":"Narayanan Deepak","year":"2020","unstructured":"Deepak Narayanan, Keshav Santhanam, Fiodar Kazhamiaka, Amar Phanishayee, and Matei Zaharia. 2020. Analysis and exploitation of dynamic pricing in the public cloud for ml training. In VLDB DISPA Workshop 2020."},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_3_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3453483.3454083"},{"key":"e_1_3_2_1_60_1","volume-title":"Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_3_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378505"},{"key":"e_1_3_2_1_62_1","volume-title":"Language models are unsupervised multitask learners. OpenAI blog 1, 8","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9."},{"key":"e_1_3_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/2499370.2462176"},{"key":"e_1_3_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW54120.2021.00112"},{"key":"e_1_3_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476205"},{"key":"e_1_3_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00057"},{"key":"e_1_3_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783721"},{"key":"e_1_3_2_1_68_1","volume-title":"Julian Schrittwieser, Ioannis Antonoglou","author":"Silver David","year":"2016","unstructured":"David Silver, Aja Huang, Chris\u00a0J Maddison, Arthur Guez, Laurent Sifre, George Van Den\u00a0Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, 2016. Mastering the game of Go with deep neural networks and tree search. nature 529, 7587 (2016), 484\u2013489."},{"key":"e_1_3_2_1_69_1","volume-title":"Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)."},{"key":"e_1_3_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1145\/3520312.3534863"},{"key":"e_1_3_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/HiPC50609.2020.00016"},{"key":"e_1_3_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i09.7123"},{"key":"e_1_3_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_3_2_1_74_1","volume-title":"International conference on machine learning. PMLR, 6105\u20136114","author":"Tan Mingxing","year":"2019","unstructured":"Mingxing Tan and Quoc Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning. PMLR, 6105\u20136114."},{"key":"e_1_3_2_1_75_1","volume-title":"Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)."},{"key":"e_1_3_2_1_76_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan\u00a0N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_2_1_77_1","volume-title":"15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21)","author":"Wang Haojie","year":"2021","unstructured":"Haojie Wang, Jidong Zhai, Mingyu Gao, Zixuan Ma, Shizhi Tang, Liyan Zheng, Yuanzhi Li, Kaiyuan Rong, Yuanyong Chen, and Zhihao Jia. 2021. { PET} : Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21). 37\u201354."},{"key":"e_1_3_2_1_78_1","first-page":"599","article-title":"Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models","volume":"3","author":"Wang Shang","year":"2021","unstructured":"Shang Wang, Peiming Yang, Yuxuan Zheng, Xin Li, and Gennady Pekhimenko. 2021. Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models. Proceedings of Machine Learning and Systems 3 (2021), 599\u2013623.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"e_1_3_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1145\/3379337.3415890"},{"key":"e_1_3_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575737"},{"key":"e_1_3_2_1_82_1","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Zhang Quanlu","year":"2020","unstructured":"Quanlu Zhang, Zhenhua Han, Fan Yang, Yuge Zhang, Zhe Liu, Mao Yang, and Lidong Zhou. 2020. Retiarii: A Deep Learning { Exploratory-Training} Framework. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 919\u2013936."},{"key":"e_1_3_2_1_83_1","doi-asserted-by":"publisher","DOI":"10.1145\/3158369"},{"key":"e_1_3_2_1_84_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00092"},{"key":"e_1_3_2_1_85_1","volume-title":"14th USENIX symposium on operating systems design and implementation (OSDI 20)","author":"Zheng Lianmin","year":"2020","unstructured":"Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody\u00a0Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, 2020. Ansor: Generating { High-Performance} Tensor Programs for Deep Learning. In 14th USENIX symposium on operating systems design and implementation (OSDI 20). 863\u2013879."},{"volume-title":"2018 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 88\u2013100","author":"Zhu Hongyu","key":"e_1_3_2_1_86_1","unstructured":"Hongyu Zhu, Mohamed Akrout, Bojian Zheng, Andrew Pelegris, Anand Jayarajan, Amar Phanishayee, Bianca Schroeder, and Gennady Pekhimenko. [n. d.]. Benchmarking and analyzing deep neural network training. In 2018 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 88\u2013100."},{"key":"e_1_3_2_1_87_1","volume-title":"2020 USENIX Annual Technical Conference (USENIX ATC 20)","author":"Zhu Hongyu","year":"2020","unstructured":"Hongyu Zhu, Amar Phanishayee, and Gennady Pekhimenko. 2020. Daydream: Accurately Estimating the Efficacy of Optimizations for { DNN} Training. In 2020 USENIX Annual Technical Conference (USENIX ATC 20). 337\u2013352."}],"event":{"name":"PACT '24: International Conference on Parallel Architectures and Compilation Techniques","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture"],"location":"Long Beach CA USA","acronym":"PACT '24"},"container-title":["Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3656019.3676950","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3656019.3676950","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T19:56:20Z","timestamp":1755892580000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3656019.3676950"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,13]]},"references-count":87,"alternative-id":["10.1145\/3656019.3676950","10.1145\/3656019"],"URL":"https:\/\/doi.org\/10.1145\/3656019.3676950","relation":{},"subject":[],"published":{"date-parts":[[2024,10,13]]},"assertion":[{"value":"2024-10-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}