{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T18:53:47Z","timestamp":1767984827811,"version":"3.49.0"},"reference-count":33,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,11,8]],"date-time":"2024-11-08T00:00:00Z","timestamp":1731024000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"AFRL under the DARPA RTML program","award":["award FA8650-20-2-7009"],"award-info":[{"award-number":["award FA8650-20-2-7009"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Des. Autom. Electron. Syst."],"published-print":{"date-parts":[[2025,1,31]]},"abstract":"<jats:p>Today\u2019s performance analysis frameworks for deep learning accelerators suffer from two significant limitations. First, although modern convolutional neural networks (CNNs) consist of many types of layers other than convolution, especially during training, these frameworks largely focus on convolution layers only. Second, these frameworks are generally targeted towards inference and lack support for training operations. This work proposes a novel open-source performance analysis framework, SimDIT, for general ASIC-based systolic hardware accelerator platforms. The modeling effort of SimDIT comprehensively covers convolution and non-convolution operations of both CNN inference and training on a highly parameterizable hardware substrate. SimDIT is integrated with a backend silicon implementation flow and provides detailed end-to-end performance statistics (i.e., data access cost, cycle counts, energy, and power) for executing CNN inference and training workloads. 
SimDIT-enabled performance analysis reveals that on a 64\u00d764 processing array, non-convolution operations constitute 59.5% of total runtime for ResNet-50 training workload. In addition, by optimally distributing available off-chip DRAM bandwidth and on-chip SRAM resources, SimDIT achieves 18\u00d7 performance improvement over a generic static resource allocation for ResNet-50 inference.<\/jats:p>","DOI":"10.1145\/3696665","type":"journal-article","created":{"date-parts":[[2024,9,26]],"date-time":"2024-09-26T11:14:21Z","timestamp":1727349261000},"page":"1-34","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Performance Analysis of CNN Inference\/Training with Convolution and Non-Convolution Operations on ASIC Accelerators"],"prefix":"10.1145","volume":"30","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8548-1039","authenticated-orcid":false,"given":"Hadi","family":"Esmaeilzadeh","sequence":"first","affiliation":[{"name":"Computer Science and Engineering, University of California San Diego, La Jolla, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5514-8027","authenticated-orcid":false,"given":"Soroush","family":"Ghodrati","sequence":"additional","affiliation":[{"name":"Computer Science and Engineering, University of California San Diego, La Jolla, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4490-5018","authenticated-orcid":false,"given":"Andrew B.","family":"Kahng","sequence":"additional","affiliation":[{"name":"Computer Science and Engineering, Electrical and Computer Engineering, University of California San Diego, La Jolla, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0955-585X","authenticated-orcid":false,"given":"Sean","family":"Kinzer","sequence":"additional","affiliation":[{"name":"Computer Science and Engineering, University of California San Diego, La Jolla, United 
States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9358-6255","authenticated-orcid":false,"given":"Susmita Dey","family":"Manasi","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering, University of Minnesota Twin Cities, Minneapolis, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5353-2364","authenticated-orcid":false,"given":"Sachin S.","family":"Sapatnekar","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering, University of Minnesota Twin Cities, Minneapolis, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6669-9702","authenticated-orcid":false,"given":"Zhiang","family":"Wang","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering, University of California San Diego, La Jolla, United States"}]}],"member":"320","published-online":{"date-parts":[[2024,11,8]]},"reference":[{"key":"e_1_3_3_2_2","unstructured":"2022. VTA Hardware Design Stack. (2022). Retrieved October 6, 2024 from https:\/\/github.com\/pasqoc\/incubator-tvm-vta"},{"key":"e_1_3_3_3_2","unstructured":"2023. SimDIT: A Simulation Framework for DNN Inference and Training on ASIC Accelerator Platforms. (2023). Retrieved October 6, 2024 from https:\/\/github.com\/VeriGOOD-ML\/public\/tree\/main\/genesys\/SimDIT"},{"key":"e_1_3_3_4_2","unstructured":"2023. VeriGOOD-ML. (2023). Retrieved October 6, 2024 from https:\/\/github.com\/VeriGOOD-ML\/public"},{"key":"e_1_3_3_5_2","unstructured":"Suvadeep Banerjee, Steve Burns, Pasquale Cocchini, Abhijit Davare, Shweta Jain, Desmond Kirkpatrick, Anton Sorokin, Jin Yang, and Zhenkun Yang. 2021. A highly configurable hardware\/software stack for DNN inference acceleration. arXiv:2111.15024. 
Retrieved from https:\/\/arxiv.org\/abs\/2111.15024"},{"key":"e_1_3_3_6_2","first-page":"578","volume-title":"Proceedings of the USENIX Conference on Operating Systems Design and Implementation","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the USENIX Conference on Operating Systems Design and Implementation. USENIX Association, Berkeley, CA, 578\u2013594."},{"key":"e_1_3_3_7_2","volume-title":"Proceedings of the IEEE\/ACM International Conference on Computer-Aided Design","author":"Esmaeilzadeh Hadi","year":"2021","unstructured":"Hadi Esmaeilzadeh, Soroush Ghodrati, Jie Gu, Shiyu Guo, Andrew B. Kahng, Joon Kyung Kim, Sean Kinzer, Rohan Mahapatra, Susmita Dey Manasi, Edwin Mascarenhas, Sachin S. Sapatnekar, Ravi Varadarajan, Zhiang Wang, Hanyang Xu, Brahmendra Reddy Yatham, and Ziqing Zeng. 2021. VeriGOOD-ML: An open-source flow for automated ML hardware synthesis. In Proceedings of the IEEE\/ACM International Conference on Computer-Aided Design. IEEE, Piscataway, NJ, 7."},{"key":"e_1_3_3_8_2","first-page":"769","volume-title":"Proceedings of the ACM\/IEEE Design Automation Conference","author":"Genc Hasan","year":"2021","unstructured":"Hasan Genc, Seah Kim, Alon Amid, Ameer Haj-Ali, Vighnesh Iyer, Pranav Prakash, Jerry Zhao, Daniel Grubb, Harrison Liew, Howard Mao, Albert Ou, Colin Schmidt, Samuel Steffl, John Wright, Ion Stoica, Jonathan Ragan-Kelley, Krste Asanovic, Borivoje Nikolic, and Yakun Sophia Shao. 2021. Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration. In Proceedings of the ACM\/IEEE Design Automation Conference. 
IEEE, Piscataway, NJ, 769\u2013774."},{"key":"e_1_3_3_9_2","unstructured":"Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, Fran\u00e7oise Beaufays, Sean Augenstein, Hubert Eichner, Chlo\u00e9 Kiddon, and Daniel Ramage. 2018. Federated learning for mobile keyboard prediction. arXiv:1811.03604. Retrieved from https:\/\/arxiv.org\/abs\/1811.03604"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_3_11_2","first-page":"448","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Ioffe Sergey","year":"2015","unstructured":"Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning. JMLR.org, 448\u2013456."},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3360307"},{"key":"e_1_3_3_13_2","first-page":"1","volume-title":"Proceedings of the ACM International Symposium on Computer Architecture","author":"Jouppi Norman P.","year":"2017","unstructured":"Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. 
Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the ACM International Symposium on Computer Architecture. ACM, New York, NY, 1\u201312."},{"key":"e_1_3_3_14_2","first-page":"1097","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems. ACM, New York, NY, 1097\u20131105."},{"key":"e_1_3_3_15_2","first-page":"899","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops","author":"Kukreja Navjot","year":"2019","unstructured":"Navjot Kukreja, Alena Shilova, Olivier Beaumont, Jan Huckelheim, Nicola Ferrier, Paul Hovland, and Gerard Gorman. 2019. Training on the edge: The why and the how. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops. 
IEEE, Piscataway, NJ, 899\u2013903."},{"key":"e_1_3_3_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2020.2985963"},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2019.2897634"},{"key":"e_1_3_3_19_2","doi-asserted-by":"crossref","first-page":"475","DOI":"10.1145\/3566097.3567863","volume-title":"Proceedings of the Asia-South Pacific Design Automation Conference","author":"Manasi Susmita Dey","year":"2023","unstructured":"Susmita Dey Manasi, Suvadeep Banerjee, Abhijit Davare, Anton A. Sorokin, Steven M. Burns, Desmond A. Kirkpatrick, and Sachin S. Sapatnekar. 2023. Reusing GEMM hardware for efficient execution of depthwise separable convolution on ASIC-based DNN accelerators. In Proceedings of the Asia-South Pacific Design Automation Conference. IEEE, Piscataway, NJ, 475\u2013482."},{"key":"e_1_3_3_20_2","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1145\/3394885.3431539","volume-title":"Proceedings of the Asia-South Pacific Design Automation Conference","author":"Manasi Susmita Dey","year":"2021","unstructured":"Susmita Dey Manasi and Sachin S. Sapatnekar. 2021. DeepOpt: Optimized scheduling of CNN workloads for ASIC-based systolic deep learning accelerators. In Proceedings of the Asia-South Pacific Design Automation Conference. IEEE, Piscataway, NJ, 235\u2013241."},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2019.2928962"},{"key":"e_1_3_3_22_2","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1145\/3123939.3124545","volume-title":"Proceedings of the IEEE\/ACM International Symposium on Microarchitecture","author":"O\u2019Connor Mike","year":"2017","unstructured":"Mike O\u2019Connor, Niladrish Chatterjee, Donghyuk Lee, John Wilson, Aditya Agrawal, Stephen W. Keckler, and William J. Dally. 2017. Fine-grained DRAM: Energy-efficient DRAM for extreme bandwidth systems. 
In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture. IEEE, Piscataway, NJ, 41\u201354."},{"key":"e_1_3_3_23_2","first-page":"304","volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software","author":"Parashar Angshuman","year":"2019","unstructured":"Angshuman Parashar, Priyanka Raina, Yakun Sophia Shao, Yu-Hsin Chen, Victor A. Ying, Anurag Mukkara, Rangharajan Venkatesan, Brucek Khailany, Stephen W. Keckler, and Joel Emer. 2019. Timeloop: A systematic approach to DNN accelerator evaluation. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, Piscataway, NJ, 304\u2013315."},{"key":"e_1_3_3_24_2","volume-title":"Computer Organization and Design: The Hardware\/Software Interface","author":"Patterson David A.","year":"2014","unstructured":"David A. Patterson and John L. Hennessy. 2014. Computer Organization and Design: The Hardware\/Software Interface. Morgan Kaufmann, Burlington, MA."},{"issue":"5","key":"e_1_3_3_25_2","first-page":"1648","article-title":"TRIM: A design space exploration model for deep neural networks inference and training accelerators","volume":"42","author":"Qi Yangjie","year":"2022","unstructured":"Yangjie Qi, Shuo Zhang, and Tarek M. Taha. 2022. TRIM: A design space exploration model for deep neural networks inference and training accelerators. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42, 5 (2022), 1648\u20131661.","journal-title":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems"},{"key":"e_1_3_3_26_2","first-page":"3058","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops","author":"Rajagopal Aditya","year":"2020","unstructured":"Aditya Rajagopal and Christos-Savvas Bouganis. 2020. Now that I can see, I can improve: Enabling data-driven finetuning of CNNs on the edge. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops. IEEE, Piscataway, NJ, 3058\u20133067."},{"key":"e_1_3_3_27_2","volume-title":"Proceedings of the Workshop on Architecture and System Support for Transformer Models","author":"Reidy Brendan C.","year":"2023","unstructured":"Brendan C. Reidy, Mohammadreza Mohammadi, Mohammed E. Elbtity, and Ramtin Zand. 2023. Efficient deployment of transformer models on edge TPU accelerators: A real system evaluation. In Proceedings of the Workshop on Architecture and System Support for Transformer Models. openreview.net, 7."},{"key":"e_1_3_3_28_2","unstructured":"Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2018. SCALE-Sim: Systolic CNN accelerator simulator. arXiv:1811.02883. Retrieved from https:\/\/arxiv.org\/abs\/1811.02883"},{"key":"e_1_3_3_29_2","first-page":"40","volume-title":"Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays","author":"Xu Pengfei","year":"2020","unstructured":"Pengfei Xu, Xiaofan Zhang, Cong Hao, Yang Zhao, Yongan Zhang, Yue Wang, Chaojian Li, Zetong Guan, Deming Chen, and Yingyan Lin. 2020. AutoDNNchip: An automated DNN chip predictor and builder for both FPGAs and ASICs. In Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, New York, NY, 40\u201350."},{"key":"e_1_3_3_30_2","first-page":"1916","volume-title":"Proceedings of the Asilomar Conference on Signals, Systems, and Computers","author":"Yang Tien-Ju","year":"2017","unstructured":"Tien-Ju Yang, Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2017. A method to estimate the energy consumption of deep neural networks. In Proceedings of the Asilomar Conference on Signals, Systems, and Computers. 
IEEE, Piscataway, NJ, 1916\u20131920."},{"key":"e_1_3_3_31_2","first-page":"369","volume-title":"Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems","author":"Yang Xuan","year":"2020","unstructured":"Xuan Yang, Mingyu Gao, Qiaoyi Liu, Jeff Setter, Jing Pu, Ankita Nayak, Steven Bell, Kaidi Cao, Heonjae Ha, Priyanka Raina, Christos Kozyrakis, and Mark Horowitz. 2020. Interstellar: Using halide\u2019s scheduling language to analyze DNN accelerators. In Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, NY, 369\u2013383."},{"key":"e_1_3_3_32_2","doi-asserted-by":"crossref","unstructured":"Kiran Seshadri, Berkin Akin, James Laudon, Ravi Narayanaswami, and Amir Yazdanbakhsh. 2022. An evaluation of edge TPU accelerators for convolutional neural networks. In Proceedings of the IEEE International Symposium on Workload Characterization. IEEE, Piscataway, NJ, 79\u201391.","DOI":"10.1109\/IISWC55918.2022.00017"},{"key":"e_1_3_3_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2022.3151820"},{"key":"e_1_3_3_34_2","first-page":"214","volume-title":"Proceedings of the IEEE International Symposium on Workload Characterization","author":"Zhou Yangjie","year":"2021","unstructured":"Yangjie Zhou, Mengtian Yang, Cong Guo, Jingwen Leng, Yun Liang, Quan Chen, Minyi Guo, and Yuhao Zhu. 2021. Characterizing and demystifying the implicit convolution algorithm on commercial matrix-multiplication accelerators. In Proceedings of the IEEE International Symposium on Workload Characterization. 
IEEE, Piscataway, NJ, 214\u2013225."}],"container-title":["ACM Transactions on Design Automation of Electronic Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3696665","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3696665","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:57:43Z","timestamp":1750294663000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3696665"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,8]]},"references-count":33,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,1,31]]}},"alternative-id":["10.1145\/3696665"],"URL":"https:\/\/doi.org\/10.1145\/3696665","relation":{},"ISSN":["1084-4309","1557-7309"],"issn-type":[{"value":"1084-4309","type":"print"},{"value":"1557-7309","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,8]]},"assertion":[{"value":"2024-01-15","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-09-03","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-08","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}