{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T17:14:15Z","timestamp":1773249255148,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":95,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,6,17]],"date-time":"2023-06-17T00:00:00Z","timestamp":1686960000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,6,17]]},"DOI":"10.1145\/3579371.3589105","type":"proceedings-article","created":{"date-parts":[[2023,6,16]],"date-time":"2023-06-16T20:25:28Z","timestamp":1686947128000},"page":"1-16","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":45,"title":["Understanding and Mitigating Hardware Failures in Deep Learning Training Systems"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7206-4845","authenticated-orcid":false,"given":"Yi","family":"He","sequence":"first","affiliation":[{"name":"University of Chicago, Chicago, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-0139-7242","authenticated-orcid":false,"given":"Mike","family":"Hutton","sequence":"additional","affiliation":[{"name":"Google, Sunnyvale, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-5722-3467","authenticated-orcid":false,"given":"Steven","family":"Chan","sequence":"additional","affiliation":[{"name":"Google, Sunnyvale, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-7911-6213","authenticated-orcid":false,"given":"Robert","family":"De Gruijl","sequence":"additional","affiliation":[{"name":"Google, Sunnyvale, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-3783-7150","authenticated-orcid":false,"given":"Rama","family":"Govindaraju","sequence":"additional","affiliation":[{"name":"Google, Sunnyvale, 
USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6620-0038","authenticated-orcid":false,"given":"Nishant","family":"Patil","sequence":"additional","affiliation":[{"name":"Google, Sunnyvale, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0124-0463","authenticated-orcid":false,"given":"Yanjing","family":"Li","sequence":"additional","affiliation":[{"name":"University of Chicago, Chicago, USA"}]}],"member":"320","published-online":{"date-parts":[[2023,6,17]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2023. Fault injection framework. https:\/\/github.com\/YLab-UChicago\/TrainingFI.git."},{"key":"e_1_3_2_1_2_1","volume-title":"International journal of electrical and computer engineering systems 12, 2","author":"Adam Khalid","year":"2021","unstructured":"Khalid Adam, Izzeldin I Mohd, and Younis Ibrahim. 2021. Analyzing the resilience of convolutional neural networks implemented on gpus: Alexnet as a case study. International journal of electrical and computer engineering systems 12, 2 (2021), 91--103."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3076716"},{"key":"e_1_3_2_1_4_1","volume-title":"Jamie Ryan Kiros, and Geoffrey E Hinton","author":"Ba Jimmy Lei","year":"2016","unstructured":"Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016)."},{"key":"e_1_3_2_1_5_1","volume-title":"2008 IEEE International Solid-State Circuits Conference-Digest of Technical Papers. 
IEEE, 400--622","author":"Blaauw David","year":"2008","unstructured":"David Blaauw, Sudherssen Kalaiselvan, Kevin Lai, Wei-Hsiang Ma, Sanjay Pant, Carlos Tokunaga, Shidhartha Das, and David Bull. 2008. Razor II: In situ error detection and correction for PVT and SER tolerance. In 2008 IEEE International Solid-State Circuits Conference-Digest of Technical Papers. IEEE, 400--622."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-3302"},{"key":"e_1_3_2_1_7_1","volume-title":"International Test Conference Silicon Lifecycle Management Workshop. https:\/\/marcello.altervista.org\/SLM.tttc-events.org\/program.html#Keynote1","author":"Bonderson Rich","year":"2021","unstructured":"Rich Bonderson. 2021. Training in Turmoil: Silent Data Corruption in Systems at Scale. International Test Conference Silicon Lifecycle Management Workshop. https:\/\/marcello.altervista.org\/SLM.tttc-events.org\/program.html#Keynote1"},{"key":"e_1_3_2_1_8_1","volume-title":"High-Performance Large-Scale Image Recognition Without Normalization. CoRR abs\/2102.06171","author":"Brock Andrew","year":"2021","unstructured":"Andrew Brock, Soham De, Samuel L. Smith, and Karen Simonyan. 2021. 
High-Performance Large-Scale Image Recognition Without Normalization. CoRR abs\/2102.06171 (2021). arXiv:2102.06171 https:\/\/arxiv.org\/abs\/2102.06171"},{"key":"e_1_3_2_1_9_1","volume-title":"Resilient Low Voltage Accelerators for High Energy Efficiency. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). 147--158","author":"N. Chandramoorthy","year":"2019","unstructured":"N. Chandramoorthy et al. 2019. Resilient Low Voltage Accelerators for High Energy Efficiency. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). 147--158."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2019.00018"},{"key":"e_1_3_2_1_11_1","first-page":"13773","article-title":"Understanding gradient clipping in private SGD: A geometric perspective","volume":"33","author":"Chen Xiangyi","year":"2020","unstructured":"Xiangyi Chen, Steven Z Wu, and Mingyi Hong. 2020. Understanding gradient clipping in private SGD: A geometric perspective. Advances in Neural Information Processing Systems 33 (2020), 13773--13782.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.eng.2020.01.007"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2003.13874"},{"key":"e_1_3_2_1_14_1","volume-title":"Ranger: Boosting Error Resilience of Deep Neural Networks through Range Restriction. ArXiv abs\/2003.13874","author":"Chen Zitao","year":"2020","unstructured":"Zitao Chen, Guanpeng Li, and Karthik Pattabiraman. 
2020. Ranger: Boosting Error Resilience of Deep Neural Networks through Range Restriction. ArXiv abs\/2003.13874 (2020)."},{"key":"e_1_3_2_1_15_1","volume-title":"Cross-layer resilience to tolerate hardware errors in digital systems. Ph. D. Dissertation","author":"Cheng Eric","unstructured":"Eric Cheng. 2018. Cross-layer resilience to tolerate hardware errors in digital systems. Ph. D. Dissertation. Stanford University."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897937.2897996"},{"key":"e_1_3_2_1_17_1","volume-title":"Proceedings of the 50th Annual Design Automation Conference. 1--10","author":"H. Cho","year":"2013","unstructured":"H. Cho et al. 2013. Quantitative evaluation of soft error injection techniques for robust system design. In Proceedings of the 50th Annual Design Automation Conference. 1--10."},{"key":"e_1_3_2_1_18_1","volume-title":"Proceedings of the International Conference on Computer-Aided Design. 150--157","author":"Cong J.","unstructured":"J. Cong and K. Gururaj. 2011. Assuring application-level correctness against soft errors. In Proceedings of the International Conference on Computer-Aided Design. 
150--157."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2203.08989"},{"key":"e_1_3_2_1_20_1","volume-title":"Silent data corruptions at scale. arXiv preprint arXiv:2102.11245","author":"Dixit Harish Dattatraya","year":"2021","unstructured":"Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, and Sriram Sankar. 2021. Silent data corruptions at scale. arXiv preprint arXiv:2102.11245 (2021)."},{"key":"e_1_3_2_1_21_1","unstructured":"M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. [n. d.]. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http:\/\/www.pascal-network.org\/challenges\/VOC\/voc2012\/workshop\/index.html."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1735970.1736063"},{"key":"e_1_3_2_1_23_1","volume-title":"Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research","volume":"256","author":"Glorot Xavier","year":"2010","unstructured":"Xavier Glorot and Yoshua Bengio. 2010. 
Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research, Vol. 9), Yee Whye Teh and Mike Titterington (Eds.). PMLR, Chia Laguna Resort, Sardinia, Italy, 249--256. https:\/\/proceedings.mlr.press\/v9\/glorot10a.html"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISQED51717.2021.9424287"},{"key":"e_1_3_2_1_25_1","unstructured":"Google. 2019. Tensorflow. https:\/\/www.tensorflow.org."},{"key":"e_1_3_2_1_26_1","unstructured":"Google. 2021. Cloud TPU. https:\/\/cloud.google.com\/tpu."},{"key":"e_1_3_2_1_27_1","unstructured":"Google. 2021. Profile your model with Cloud TPU tools. https:\/\/cloud.google.com\/tpu\/docs\/cloud-tpu-tools."},{"key":"e_1_3_2_1_28_1","volume-title":"26th International Conference on Field Programmable Logic and Applications (FPL)","volume":"2017","author":"Gupta Prabhat K","year":"2016","unstructured":"Prabhat K Gupta. 2016. Accelerating datacenter workloads. In 26th International Conference on Field Programmable Logic and Applications (FPL), Vol. 2017. 20."},{"key":"e_1_3_2_1_29_1","volume-title":"Proceedings of the International Conference on Dependable Systems and Networks. 1--12","author":"Hari S. K. S.","unstructured":"S. K. S. Hari, S. V. Adve, and H. Naeimi. 2012. 
Low-cost program-level detectors for reducing silent data corruptions. In Proceedings of the International Conference on Dependable Systems and Networks. 1--12."},{"key":"e_1_3_2_1_30_1","volume-title":"Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVII). 123--134","author":"Sastry Hari Siva Kumar","year":"2012","unstructured":"Siva Kumar Sastry Hari, Sarita V. Adve, Helia Naeimi, and Pradeep Ramachandran. 2012. Relyzer: Exploiting Application-level Fault Equivalence to Analyze Application Resiliency to Transient Faults. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVII). 123--134."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2021.3063083"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00059"},{"key":"e_1_3_2_1_33_1","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs.CV]"},{"key":"e_1_3_2_1_34_1","volume-title":"Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. CoRR abs\/1502.01852","author":"He Kaiming","year":"2015","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. 
CoRR abs\/1502.01852 (2015). arXiv:1502.01852 http:\/\/arxiv.org\/abs\/1502.01852"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO50266.2020.00033"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ITC44170.2019.9000180"},{"key":"e_1_3_2_1_37_1","volume-title":"Muhammad Abdullah Hanif, and Muhammad Shafique","author":"Hoang Le Ha","year":"2019","unstructured":"Le Ha Hoang, Muhammad Abdullah Hanif, and Muhammad Shafique. 2019. FT-ClipAct: Resilience Analysis of Deep Neural Networks and Improving their Fault Tolerance using Clipped Activation. CoRR abs\/1912.00941 (2019). arXiv:1912.00941 http:\/\/arxiv.org\/abs\/1912.00941"},{"key":"e_1_3_2_1_38_1","volume-title":"Long short-term memory. Neural computation 9, 8","author":"Hochreiter Sepp","year":"1997","unstructured":"Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735--1780."},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458336.3465297"},{"key":"e_1_3_2_1_40_1","volume-title":"2010 IEEE International Test Conference. IEEE, 1--10","author":"Hong Ted","year":"2010","unstructured":"Ted Hong, Yanjing Li, Sung-Boem Park, Diana Mui, David Lin, Ziyad Abdel Kaleq, Nagib Hakim, Helia Naeimi, Donald S Gardner, and Subhasish Mitra. 2010. 
QED: Quick error detection tests for effective post-silicon validation. In 2010 IEEE International Test Conference. IEEE, 1--10."},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCRD54409.2022.9730377"},{"key":"e_1_3_2_1_42_1","volume-title":"Weinberger","author":"Huang Gao","year":"2018","unstructured":"Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2018. Densely Connected Convolutional Networks. arXiv:1608.06993 [cs.CV]"},{"key":"e_1_3_2_1_43_1","volume-title":"Walter","author":"Huynh Tri","year":"2019","unstructured":"Tri Huynh, Michael Maire, and Matthew R. Walter. 2019. Multigrid Neural Memory. CoRR abs\/1906.05948 (2019). arXiv:1906.05948 http:\/\/arxiv.org\/abs\/1906.05948"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.microrel.2020.113969"},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00010"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080225"},{"key":"e_1_3_2_1_47_1","unstructured":"Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. 
(2009)."},{"key":"e_1_3_2_1_48_1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17)","author":"Guanpeng","unstructured":"Guanpeng Li et al. 2017. Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17). 8:1--8:12."},{"key":"e_1_3_2_1_49_1","volume-title":"TensorFI: A Configurable Fault Injector for TensorFlow Applications. In 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). 313--320","author":"G. Li","year":"2018","unstructured":"G. Li et al. 2018. TensorFI: A Configurable Fault Injector for TensorFlow Applications. In 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). 313--320."},{"key":"e_1_3_2_1_50_1","volume-title":"Proceedings of the 13th international conference on Architectural support for programming languages and operating systems - ASPLOS XIII","volume":"42","author":"Li M.-L.","unstructured":"M.-L. Li, P. Ramachandran, S. K. Sahoo, S. V. 
Adve, V. S. Adve, and Y. Zhou. 2008. Understanding the propagation of hard errors to software and implications for resilient system design. In Proceedings of the 13th international conference on Architectural support for programming languages and operating systems - ASPLOS XIII, Vol. 42. 265."},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2014.2334301"},{"key":"e_1_3_2_1_52_1","volume-title":"IEEE Aerospace Conference Proceedings","volume":"5","author":"Lovellette M. N.","unstructured":"M. N. Lovellette, K. S. Wood, D. L. Wood, J. H. Beall, P. P. Shirvani, N. Oh, and E. J. McCluskey. 2002. Strategies for fault-tolerant, space-based computing: Lessons learned from the ARGOS testbed. In IEEE Aerospace Conference Proceedings, Vol. 5. 2109--2119."},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00070"},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSRE52982.2021.00025"},{"key":"e_1_3_2_1_55_1","unstructured":"J Markoff. 2022. Tiny Chips Big Headaches. https:\/\/arxiv.org\/abs\/2203.08989"},{"key":"e_1_3_2_1_56_1","volume-title":"Comprehensive Error Detection in Simple Cores. In 40th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO","author":"Meixner A.","year":"2007","unstructured":"A. Meixner, M. E. Bauer, and D. Sorin. 2007. Argus: Low-Cost, Comprehensive Error Detection in Simple Cores. 
In 40th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO 2007). 210--222."},{"key":"e_1_3_2_1_57_1","volume-title":"International Conference on Learning Representations.","author":"Menon Aditya Krishna","year":"2019","unstructured":"Aditya Krishna Menon, Ankit Singh Rawat, Sashank J Reddi, and Sanjiv Kumar. 2019. Can gradient clipping mitigate label noise? In International Conference on Learning Representations."},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.7873\/DATE.2015.0367"},{"key":"e_1_3_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2019.101689"},{"key":"e_1_3_2_1_60_1","unstructured":"MLCommons. 2021. v1.0 Results. https:\/\/mlcommons.org\/en\/training-normal-10\/."},{"key":"e_1_3_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/FDTC.2013.9"},{"key":"e_1_3_2_1_62_1","volume-title":"Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems. CoRR abs\/2003.09518","author":"Naumov Maxim","year":"2020","unstructured":"Maxim Naumov, John Kim, Dheevatsa Mudigere, Srinivas Sridharan, Xiaodong Wang, Whitney Zhao, Serhat Yilmaz, Changkyu Kim, Hector Yuen, Mustafa Ozdal, Krishnakumar Nair, Isabel Gao, Bor-Yiing Su, Jiyan Yang, and Mikhail Smelyanskiy. 2020. 
Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems. CoRR abs\/2003.09518 (2020). arXiv:2003.09518 https:\/\/arxiv.org\/abs\/2003.09518"},{"key":"e_1_3_2_1_63_1","unstructured":"Nvidia. 2021. Nvidia Ampere Architecture. https:\/\/www.nvidia.com\/en-us\/data-center\/ampere-architecture."},{"key":"e_1_3_2_1_64_1","unstructured":"NVIDIA Corporation. 2018. NVDLA Open Source Project. http:\/\/nvdla.org\/primer.html."},{"key":"e_1_3_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/24.994926"},{"key":"e_1_3_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/24.994913"},{"key":"e_1_3_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2020.3012209"},{"key":"e_1_3_2_1_68_1","volume-title":"Just Say Zero: Containing Critical Bit-Error Propagation in Deep Neural Networks With Anomalous Feature Suppression. In 2020 IEEE\/ACM International Conference On Computer Aided Design (ICCAD). 1--9.","author":"Ozen Elbruz","year":"2020","unstructured":"Elbruz Ozen and Alex Orailoglu. 2020. Just Say Zero: Containing Critical Bit-Error Propagation in Deep Neural Networks With Anomalous Feature Suppression. In 2020 IEEE\/ACM International Conference On Computer Aided Design (ICCAD). 1--9."},{"key":"e_1_3_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00075"},{"key":"e_1_3_2_1_70_1","volume-title":"Perturbation-based Fault Screening. In IEEE 13th International Symposium on High Performance Computer Architecture. 169--180","author":"Racunas P.","unstructured":"P. Racunas, K. 
Constantinides, S. Manne, and S. S. Mukherjee. 2007. Perturbation-based Fault Screening. In IEEE 13th International Symposium on High Performance Computer Architecture. 169--180."},{"key":"e_1_3_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.32"},{"key":"e_1_3_2_1_72_1","volume-title":"2018 55th ACM\/ESDA\/IEEE Design Automation Conference (DAC). 1--6.","author":"B. Reagen","year":"2018","unstructured":"B. Reagen et al. 2018. Ares: A framework for quantifying the resilience of deep neural networks. In 2018 55th ACM\/ESDA\/IEEE Design Automation Conference (DAC). 1--6."},{"key":"e_1_3_2_1_73_1","unstructured":"Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An Incremental Improvement. arXiv:1804.02767 [cs.CV]"},{"key":"e_1_3_2_1_74_1","volume-title":"Proceedings of the international symposium on Code generation and optimization. 1--12","author":"Reis G. A.","unstructured":"G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August. 2004. SWIFT: Software Implemented Fault Tolerance. In Proceedings of the international symposium on Code generation and optimization. 
1--12."},{"key":"e_1_3_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC49654.2021.9622867"},{"key":"e_1_3_2_1_76_1","volume-title":"Proceedings of the International Conference on Dependable Systems and Networks. 70--79","author":"Sahoo S. K.","unstructured":"S. K. Sahoo, M. L. Li, P. Ramachandran, S. V. Adve, V. S. Adve, and Y. Zhou. 2008. Using likely program invariants to detect hardware errors. In Proceedings of the International Conference on Dependable Systems and Networks. 70--79."},{"key":"e_1_3_2_1_77_1","volume-title":"Steven Hesley, and Subhasish Mitra.","author":"Sankar Sriram","year":"2021","unstructured":"Sriram Sankar, Rama Govindaraju, Arjan Van De Ven, Steven Hesley, and Subhasish Mitra. 2021. Panel: Hardware Operation at Scale Reliability to Address Silent Data Corruptions."},{"key":"e_1_3_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2013.30"},{"key":"e_1_3_2_1_79_1","volume-title":"International Conference on Machine Learning. PMLR, 9367--9376","author":"Schmidt Robin M","year":"2021","unstructured":"Robin M Schmidt, Frank Schneider, and Philipp Hennig. 2021. Descending through a crowded valley-benchmarking deep learning optimizers. In International Conference on Machine Learning. 
PMLR, 9367--9376."},{"key":"e_1_3_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.23919\/DATE.2018.8342151"},{"key":"e_1_3_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.23919\/DATE.2019.8714885"},{"key":"e_1_3_2_1_82_1","volume-title":"Proceedings of the 50th Annual Design Automation Conference on - DAC13","author":"Shafique M.","unstructured":"M. Shafique, S. Rehman, P. V. Aceituno, and J. Henkel. 2013. Exploiting program-level masking and error propagation for constrained reliability optimization. In Proceedings of the 50th Annual Design Automation Conference on - DAC13."},{"key":"e_1_3_2_1_83_1","volume-title":"International Conference on Computer Aided Verification. Springer, 104--125","author":"Singh Eshan","year":"2017","unstructured":"Eshan Singh, Clark Barrett, and Subhasish Mitra. 2017. E-QED: electrical bug localization during post-silicon validation enabled by quick error detection and formal methods. In International Conference on Computer Aided Verification. Springer, 104--125."},{"key":"e_1_3_2_1_84_1","volume-title":"Going Deeper With Convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Christian","unstructured":"Christian Szegedy et al. 2015. Going Deeper With Convolutions. 
In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_3_2_1_85_1","volume-title":"Le","author":"Tan Mingxing","year":"2020","unstructured":"Mingxing Tan and Quoc V. Le. 2020. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv:1905.11946 [cs.LG]"},{"key":"e_1_3_2_1_86_1","unstructured":"Tensorflow. 2021. Training checkpoints. https:\/\/www.tensorflow.org\/guide\/checkpoint."},{"key":"e_1_3_2_1_87_1","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. arXiv:1706.03762 [cs.CL]"},{"key":"e_1_3_2_1_88_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783745"},{"key":"e_1_3_2_1_89_1","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2006.40"},{"key":"e_1_3_2_1_90_1","volume-title":"2017 IEEE International Solid-State Circuits Conference (ISSCC). 242--243","author":"Whatmough P. N.","year":"2017","unstructured":"P. N. Whatmough et al. 2017. 14.3 A 28nm SoC with a 1.2GHz 568nJ\/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications. In 2017 IEEE International Solid-State Circuits Conference (ISSCC). 242--243."},{"key":"e_1_3_2_1_91_1","unstructured":"Wikipedia. 2022. Backpropagation. 
https:\/\/en.wikipedia.org\/wiki\/Backpropagation."},{"key":"e_1_3_2_1_92_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.1905.11881"},{"key":"e_1_3_2_1_93_1","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER.2019.8890989"},{"key":"e_1_3_2_1_94_1","doi-asserted-by":"publisher","DOI":"10.1109\/tpds.2020.3043449"},{"key":"e_1_3_2_1_95_1","volume-title":"Culotta (Eds.)","volume":"23","author":"Zinkevich Martin","year":"2010","unstructured":"Martin Zinkevich, Markus Weimer, Lihong Li, and Alex Smola. 2010. Parallelized Stochastic Gradient Descent. In Advances in Neural Information Processing Systems, J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta (Eds.), Vol. 23. Curran Associates, Inc. 
https:\/\/proceedings.neurips.cc\/paper\/2010\/file\/abea47ba24142ed16b7d8fbf2c740e0d-Paper.pdf"}],"event":{"name":"ISCA '23: 50th Annual International Symposium on Computer Architecture","location":"Orlando FL USA","acronym":"ISCA '23","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture","IEEE"]},"container-title":["Proceedings of the 50th Annual International Symposium on Computer Architecture"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3579371.3589105","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:46:40Z","timestamp":1750178800000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3579371.3589105"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,17]]},"references-count":95,"alternative-id":["10.1145\/3579371.3589105","10.1145\/3579371"],"URL":"https:\/\/doi.org\/10.1145\/3579371.3589105","relation":{},"subject":[],"published":{"date-parts":[[2023,6,17]]},"assertion":[{"value":"2023-06-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}