{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T16:15:03Z","timestamp":1775837703420,"version":"3.50.1"},"reference-count":79,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2020,11,10]],"date-time":"2020-11-10T00:00:00Z","timestamp":1604966400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000028","name":"Semiconductor Research Corporation","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100000028","id-type":"DOI","asserted-by":"crossref"}]},{"name":"U.S. Government, under the DARPA DSSoC program"},{"name":"NSF","award":["#CNS-1718160 and #1533737"],"award-info":[{"award-number":["#CNS-1718160 and #1533737"]}]},{"name":"Intel"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2020,12,31]]},"abstract":"<jats:p>In recent years, there has been tremendous advances in hardware acceleration of deep neural networks. However, most of the research has focused on optimizing accelerator microarchitecture for higher performance and energy efficiency on a per-layer basis. We find that for overall single-batch inference latency, the accelerator may only make up 25\u201340%, with the rest spent on data movement and in the deep learning software framework. Thus far, it has been very difficult to study end-to-end DNN performance during early stage design (before RTL is available), because there are no existing DNN frameworks that support end-to-end simulation with easy custom hardware accelerator integration. To address this gap in research infrastructure, we present SMAUG, the first DNN framework that is purpose-built for simulation of end-to-end deep learning applications. 
SMAUG offers researchers a wide range of capabilities for evaluating DNN workloads, from diverse network topologies to easy accelerator modeling and SoC integration. To demonstrate the power and value of SMAUG, we present case studies that show how we can optimize overall performance and energy efficiency for up to 1.8\u00d7\u20135\u00d7 speedup over a baseline system, without changing any part of the accelerator microarchitecture, as well as show how SMAUG can tune an SoC for a camera-powered deep learning pipeline.<\/jats:p>","DOI":"10.1145\/3424669","type":"journal-article","created":{"date-parts":[[2020,11,10]],"date-time":"2020-11-10T23:16:11Z","timestamp":1605050171000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":54,"title":["SMAUG"],"prefix":"10.1145","volume":"17","author":[{"given":"Sam (Likun)","family":"Xi","sequence":"first","affiliation":[{"name":"Harvard University, Cambridge, MA"}]},{"given":"Yuan","family":"Yao","sequence":"additional","affiliation":[{"name":"Harvard University, Cambridge, MA"}]},{"given":"Kshitij","family":"Bhardwaj","sequence":"additional","affiliation":[{"name":"Harvard University, Cambridge, MA"}]},{"given":"Paul","family":"Whatmough","sequence":"additional","affiliation":[{"name":"Harvard University and Arm ML Research, Cambridge, MA"}]},{"given":"Gu-Yeon","family":"Wei","sequence":"additional","affiliation":[{"name":"Harvard University, Cambridge, MA"}]},{"given":"David","family":"Brooks","sequence":"additional","affiliation":[{"name":"Harvard University, Cambridge, MA"}]}],"member":"320","published-online":{"date-parts":[[2020,11,10]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916)","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi , Paul Barham , Jianmin Chen , Zhifeng Chen , Andy Davis , Jeffrey Dean , Matthieu Devin , 
Sanjay Ghemawat , Geoffrey Irving , Michael Isard , et\u00a0al. 2016 . Tensorflow: A system for large-scale machine learning . In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916) ."},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA\u201918)","author":"Alsop Johnathan","unstructured":"Johnathan Alsop , Matthew D. Sinclair , and Sarita V. Adve . 2018. Spandex: A flexible interface for efficient heterogeneous coherence . In Proceedings of the International Symposium on Computer Architecture (ISCA\u201918) ."},{"key":"e_1_2_1_3_1","volume-title":"Tsung Tai Yeh, et\u00a0al","author":"Alsop Johnathan","year":"2019","unstructured":"Johnathan Alsop , Matthew D. Sinclair , Srikant Bharadwaj , Alexandru Dutu , Anthony Gutierrez , Onur Kayiran , Michael LeBeane , Sooraj Puthoor , Xianwei Zhang , Tsung Tai Yeh, et\u00a0al . 2019 . Optimizing GPU cache policies for MI workloads. arXiv:1910.00134. Retrieved from https:\/\/arxiv.org\/abs\/1910.00134."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783725"},{"key":"e_1_2_1_5_1","unstructured":"AMD. 2014. Compute Cores. Technical Report. Retrieved from www.amd.com\/computecores."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173162.3173177"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00051"},{"key":"e_1_2_1_8_1","volume-title":"Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274.","author":"Chen Tianqi","year":"2015","unstructured":"Tianqi Chen , Mu Li , Yutian Li , Min Lin , Naiyan Wang , Minjie Wang , Tianjun Xiao , Bing Xu , Chiyuan Zhang , and Zheng Zhang . 2015 . Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274. Retrieved from https:\/\/arxiv.org\/abs\/1512.01274."},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918)","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen , Thierry Moreau , Ziheng Jiang , Lianmin Zheng , Eddie Q. Yan , Haichen Shen , Meghan Cowan , Leyuan Wang , Yuwei Hu , Luis Ceze , Carlos Guestrin , and Arvind Krishnamurthy . 2018 . TVM: An automated end-to-end optimizing compiler for deep learning . In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918) ."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC.2016.7418007"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001140"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2637364.2591973"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1950413.1950435"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD.2015.7372595"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3124552"},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201916)","author":"Clevert Djork-Arn\u00e9","year":"2016","unstructured":"Djork-Arn\u00e9 Clevert , Thomas Unterthiner , and Sepp Hochreiter . 2016 . Fast and accurate deep network learning by exponential linear units (ELUs) . In Proceedings of the International Conference on Learning Representations (ICLR\u201916) ."},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201919)","author":"Qin Eric","year":"2019","unstructured":"Eric Qin , Ananda Samajdar , Hyoukjun Kwon , Vineet Nadella , Sudarshan Srinivasan , Dipankar Das , Bharat Kaul , and Tushar Krishna . 2019 . SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training . In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201919) ."},{"key":"e_1_2_1_18_1","unstructured":"Kayvon Fatahalian. 2011. A Camera Image Processing Pipeline. Retrieved from http:\/\/www.cs.cmu.edu\/afs\/cs\/academic\/class\/15869-f11\/www\/lectures\/16_camerapipeline1.pdf."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358253"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2017.25"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001163"},{"key":"e_1_2_1_22_1","volume-title":"Proceedings of the 25th International Conference on Pattern Recognition (ICPR\u201920)","author":"Hansen Patrick","unstructured":"Patrick Hansen , Alexey Vilkin , Yury Khrustalev , James Imber , David Hanwell , Matthew Mattina , and Paul N. Whatmough . 2020. ISP4ML: Understanding the role of image signal processing in efficient deep learning vision systems . In Proceedings of the 25th International Conference on Pattern Recognition (ICPR\u201920) ."},{"key":"e_1_2_1_23_1","unstructured":"Mark Harris. 2013. Unified Memory in CUDA 6. Retrieved from https:\/\/devblogs.nvidia.com\/parallelforall\/unified-memory-in-cuda-6\/."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD45719.2019.8942048"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358252"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654889"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783722"},{"key":"e_1_2_1_30_1","unstructured":"Karthik Chandrasekar Christian Weis Yonghui Li Sven Goossens Matthias Jung Omar Naji Benny Akesson Norbert Wehn and Kees Goossens. [n.d.]. DRAMPower: Open-Source DRAM Power and Energy Estimation Tool. Retrieved from http:\/\/drampower.info."},{"key":"e_1_2_1_31_1","volume-title":"Deep Learning with Python","author":"Ketkar Nikhil","unstructured":"Nikhil Ketkar . 2017. Introduction to PyTorch . In Deep Learning with Python . Springer ."},{"key":"e_1_2_1_32_1","volume-title":"Proceedings of the European Conference on Computer Systems (EuroSys\u201919)","author":"Kim Youngsok","year":"2019","unstructured":"Youngsok Kim , Joonsung Kim , Dongju Chae , Daehyun Kim , and Jangwoo Kim . 2019 . uLayer: Low latency on-device inference using cooperative single-layer acceleration and processor-friendly quantization . In Proceedings of the European Conference on Computer Systems (EuroSys\u201919) ."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750421"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173162.3173176"},{"key":"e_1_2_1_35_1","volume-title":"Proceedings of the 2019 56th ACM\/IEEE Design Automation Conference (DAC\u201919)","author":"Li H.","unstructured":"H. Li , M. Bhargav , P. N. Whatmough , and H. Philip Wong . 2019. On-chip memory technology design space explorations for mobile deep neural network accelerators . In Proceedings of the 2019 56th ACM\/IEEE Design Automation Conference (DAC\u201919) . 1--6."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.23919\/FPL.2017.8056775"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001179"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2020.2979965"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2017.29"},{"key":"e_1_2_1_40_1","unstructured":"Micron Technology 2014. Mobile LPDDR4 SDRAM. Micron Technology."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056066"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2014.2334635"},{"key":"e_1_2_1_43_1","volume-title":"Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201917)","author":"Olson Lena E.","unstructured":"Lena E. Olson , Mark D. Hill , and David A. Wood . 2017. Crossing guard: Mediating host-accelerator coherence interactions . In Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201917) ."},{"key":"e_1_2_1_44_1","volume-title":"Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201915)","author":"Olson Lena E.","unstructured":"Lena E. Olson , Jason Power , Mark D. Hill , and David A. Wood . 2015. Border control: Sandboxing accelerators . In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201915) ."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2019.00042"},{"key":"e_1_2_1_46_1","volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA\u201917)","author":"Parashar Angshuman","unstructured":"Angshuman Parashar , Minsoo Rhu , Anurag Mukkara , Antonio Puglielli , Rangharajan Venkatesan , Brucek Khailany , Joel S. Emer , Stephen W. Keckler , and William J. Dally . 2017. 
SCNN: An accelerator for compressed-sparse convolutional neural networks . In Proceedings of the International Symposium on Computer Architecture (ISCA\u201917) ."},{"key":"e_1_2_1_47_1","unstructured":"Jongsoo Park Maxim Naumov Protonu Basu Summer Deng Aravind Kalaiah Daya Khudia James Law Parth Malani Andrey Malevich Satish Nadathur Juan Pino Martin Schatz Alexander Sidorov Viswanath Sivakumar Andrew Tulloch Xiaodong Wang Yiming Wu Hector Yuen Utku Diril Dmytro Dzhulgakov Kim Hazelwood Bill Jia Yangqing Jia Lin Qiao Vijay Rao Nadav Rotem Sungjoo Yoo and Mikhail Smelyanskiy. 2018. Deep Learning Inference in Facebook Data Centers: Characterization Performance Optimizations and Hardware Implications. arxiv:cs.LG\/1811.09886. Retrieved from https:\/\/arxiv.org\/abs\/cs.LG\/1811.09886."},{"key":"e_1_2_1_48_1","volume-title":"Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC\u201917)","author":"Piccolboni Luca","unstructured":"Luca Piccolboni , Paolo Mantovani , Giuseppe Di Guglielmo , and Luca P. Carloni . 2017. 
Broadening the exploration of the accelerator design space in embedded scalable platforms . In Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC\u201917) . IEEE, 1--7."},{"key":"e_1_2_1_49_1","volume-title":"Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201914)","author":"Pichai Bharath","year":"2014","unstructured":"Bharath Pichai , Lisa Hsu , and Abhishek Bhattacharjee . 2014 . Architectural support for address translation on GPUs: designing memory management units for CPU\/GPUs with unified address spaces . In Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201914) ."},{"key":"e_1_2_1_50_1","volume-title":"Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201913)","author":"Power Jason","unstructured":"Jason Power , Arkaprava Basu , Junli Gu , Sooraj Puthoor , Bradford M. Beckmann , Mark D. Hill , Steven K. Reinhardt , and David A. Wood . 2013. Heterogeneous system coherence for integrated CPU-GPU systems . In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201913) ."},{"key":"e_1_2_1_51_1","volume-title":"Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201914)","author":"Power Jonathan","unstructured":"Jonathan Power , Mark D. Hill , and David A. Wood . 2014. Supporting x86-64 address translation for 100s of GPU lanes . In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201914) ."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/2491956.2462176"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001165"},{"key":"e_1_2_1_54_1","volume-title":"Dilip Sequeira, Ashish Sirasao, Fei Sun, Hanlin Tang, Michael Thomson, Frank Wei, Ephrem Wu, Lingjie Xu, Koichi Yamada, Bing Yu, George Yuan, Aaron Zhong, Peizhao Zhang, and Yuchen Zhou.","author":"Reddi Vijay Janapa","year":"2019","unstructured":"Vijay Janapa Reddi , Christine Cheng , David Kanter , Peter Mattson , Guenther Schmuelling , Carole-Jean Wu , Brian Anderson , Maximilien Breughe , Mark Charlebois , William Chou , Ramesh Chukka , Cody Coleman , Sam Davis , Pan Deng , Greg Diamos , Jared Duke , Dave Fick , J. Scott Gardner , Itay Hubara , Sachin Idgunji , Thomas B. Jablin , Jeff Jiao , Tom St. John , Pankaj Kanwar , David Lee , Jeffery Liao , Anton Lokhmotov , Francisco Massa , Peng Meng , Micikevicius, Colin Osborne , Gennady Pekhimenko , Arun Tejusve Raghunath Rajan , Dilip Sequeira, Ashish Sirasao, Fei Sun, Hanlin Tang, Michael Thomson, Frank Wei, Ephrem Wu, Lingjie Xu, Koichi Yamada, Bing Yu, George Yuan, Aaron Zhong, Peizhao Zhang, and Yuchen Zhou. 2019 . 
MLPerf Inference Benchmark. arxiv:cs.LG\/1911.02549. Retrieved from https:\/\/arxiv.org\/abs\/cs.LG\/1911.02549."},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2513683.2513688"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS48437.2020.00016"},{"key":"e_1_2_1_57_1","unstructured":"Sergey Zagoruyko. 2015. Torch Blog: 92.45 on CIFAR10 in Torch. Retrieved from https:\/\/torch.ch\/blog\/07\/30\/cifar.html."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001139"},{"key":"e_1_2_1_59_1","volume-title":"Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201919)","author":"Shao Yakun Sophia","unstructured":"Yakun Sophia Shao , Jason Clemons , Rangharajan Venkatesan , Brian Zimmer , Matthew Fojtik , Nan Jiang , Ben Keller , Alicia Klinefelter , Nathaniel Ross Pinckney , Priyanka Raina , Stephen G. Tell , Yanqing Zhang , William J. Dally , Joel S. Emer , C. Thomas Gray , Brucek Khailany , and Stephen W. Keckler . 2019. 
Simba: Scaling deep-learning inference with multi-chip-module-based architecture . In Proceedings of the IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201919) . 14--27."},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783751"},{"key":"e_1_2_1_61_1","volume-title":"Proceedings of the Workshop on Cognitive Architectures.","author":"Sharma Hardik","year":"2016","unstructured":"Hardik Sharma , Jongse Park , Emmanuel Amaro , Bradley Thwaites , Praneetha Kotha , Anmol Gupta , Joon Kyung Kim , Asit Mishra , and Hadi Esmaeilzadeh . 2016 . DNNWeaver: From high-level deep network models to FPGA acceleration . In Proceedings of the Workshop on Cognitive Architectures."},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299874.3317996"},{"key":"e_1_2_1_63_1","unstructured":"Synopsys Inc. [n.d.]. Synopsys Platform Architect. Retrieved from https:\/\/www.synopsys.com\/verification\/virtual-prototyping\/platform-architect.html\/."},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00042"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD45719.2019.8942127"},{"key":"e_1_2_1_66_1","article-title":"DLAU: A scalable deep learning accelerator unit on FPGA","volume":"36","author":"Wang Chao","year":"2017","unstructured":"Chao Wang , Lei Gong , Qi Yu , Xi Li , Yuan Xie , and Xuehai Zhou . 2017 . DLAU: A scalable deep learning accelerator unit on FPGA . IEEE Trans. CAD Integr. Circ. Syst. 36 , 3 (2017).","journal-title":"IEEE Trans. CAD Integr. Circ. Syst."},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/3195970.3196116"},{"key":"e_1_2_1_68_1","unstructured":"Yu Emma Wang Carole-Jean Wu Xiaodong Wang Kim Hazelwood and David Brooks. 2019. Exploiting Parallelism Opportunities with Deep Learning Frameworks. arxiv:cs.LG\/1908.04705. Retrieved from https:\/\/arxiv.org\/abs\/cs.LG\/1908.04705."},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1145\/3316781.3317875"},{"key":"e_1_2_1_70_1","volume-title":"Proceedings of the 2019 Symposium on VLSI Circuits. C34--C35","author":"Whatmough P. N.","unstructured":"P. N. Whatmough , S. K. Lee , M. Donato , H. Hsueh , S. Xi , U. Gupta , L. Pentecost , G. G. Ko , D. Brooks , and G. Wei . 2019. A 16nm 25mm2 SoC with a 54.5x flexibility-efficiency range from dual-core arm Cortex-A53 to eFPGA and cache-coherent accelerators . 
In Proceedings of the 2019 Symposium on VLSI Circuits. C34--C35 ."},{"key":"e_1_2_1_71_1","volume-title":"Proceedings of the Conference on Systems and Machine Learning (SysML\u201919)","author":"Whatmough Paul N.","year":"2019","unstructured":"Paul N. Whatmough , Chuteng Zhou , Patrick Hansen , Shreyas Kolala Venkataramanaiah , Jae sun Seo , and Matthew Mattina . 2019 . FixyNN: Efficient hardware for mobile computer vision via transfer learning . In Proceedings of the Conference on Systems and Machine Learning (SysML\u201919) ."},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD45719.2019.8942149"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2019.00029"},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378514"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1145\/2684746.2689060"},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240765.3240801"},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358269"},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00052"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3424669","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3424669","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3424669","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:23:31Z","timestamp":1750202611000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3424669"}},"subtitle":["End-to-End Full-Stack Simulation Infrastructure for Deep Learning 
Workloads"],"short-title":[],"issued":{"date-parts":[[2020,11,10]]},"references-count":79,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2020,12,31]]}},"alternative-id":["10.1145\/3424669"],"URL":"https:\/\/doi.org\/10.1145\/3424669","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,11,10]]},"assertion":[{"value":"2020-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-09-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-11-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}