{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,17]],"date-time":"2026-01-17T17:36:04Z","timestamp":1768671364611,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":38,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,6,5]],"date-time":"2023-06-05T00:00:00Z","timestamp":1685923200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,6,5]]},"DOI":"10.1145\/3579370.3594777","type":"proceedings-article","created":{"date-parts":[[2023,6,22]],"date-time":"2023-06-22T23:40:16Z","timestamp":1687477216000},"page":"124-135","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Predicting GPU Failures With High Precision Under Deep Learning Workloads"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0321-347X","authenticated-orcid":false,"given":"Heting","family":"Liu","sequence":"first","affiliation":[{"name":"ByteDance Inc., San Jose, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7388-7341","authenticated-orcid":false,"given":"Zhichao","family":"Li","sequence":"additional","affiliation":[{"name":"ByteDance Inc., Seattle, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1420-5125","authenticated-orcid":false,"given":"Cheng","family":"Tan","sequence":"additional","affiliation":[{"name":"Northeastern University, Boston, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-2360-4791","authenticated-orcid":false,"given":"Rongqiu","family":"Yang","sequence":"additional","affiliation":[{"name":"ByteDance Inc., Beijing, 
USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2115-7165","authenticated-orcid":false,"given":"Guohong","family":"Cao","sequence":"additional","affiliation":[{"name":"Pennsylvania State University, State College, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-4910-6095","authenticated-orcid":false,"given":"Zherui","family":"Liu","sequence":"additional","affiliation":[{"name":"ByteDance Inc., Beijing, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0730-8468","authenticated-orcid":false,"given":"Chuanxiong","family":"Guo","sequence":"additional","affiliation":[{"name":"Bytedance, Seattle, USA"}]}],"member":"320","published-online":{"date-parts":[[2023,6,22]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Automatically detecting data drift in machine learning classifiers. arXiv preprint arXiv:2111.05672","author":"Ackerman Samuel","year":"2021","unstructured":"Samuel Ackerman, Orna Raz, Marcel Zalmanovici, and Aviad Zlotnick. 2021. Automatically detecting data drift in machine learning classifiers. arXiv preprint arXiv:2111.05672 (2021)."},{"key":"e_1_3_2_1_2_1","volume-title":"2014 44th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks.","author":"Birke Robert","year":"2014","unstructured":"Robert Birke, Ioana Giurgiu, Lydia Y Chen, Dorothea Wiesmann, and Ton Engbersen. 2014. Failure analysis of virtual and physical machines: patterns, causes and characteristics. 
In 2014 44th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939699"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/1015330.1015432"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3208040.3208051"},{"key":"e_1_3_2_1_6_1","volume-title":"2014 44th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks.","author":"Martino Catello Di","year":"2014","unstructured":"Catello Di Martino, Zbigniew Kalbarczyk, Ravishankar K Iyer, Fabio Baccanico, Joseph Fullop, and William Kramer. 2014. Lessons learned from the analysis of system failures at petascale: The case of blue waters. In 2014 44th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks."},{"key":"e_1_3_2_1_7_1","volume-title":"2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 1--6.","author":"Fawaz Hassan Ismail","year":"2019","unstructured":"Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. 2019. Deep neural network ensembles for time series classification. In 2019 International Joint Conference on Neural Networks (IJCNN). 
IEEE, 1--6."},{"key":"e_1_3_2_1_8_1","unstructured":"Daniel Ford, Fran\u00e7ois Labelle, Florentina Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan. 2010. Availability in globally distributed storage systems. (2010)."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSC.2020.2993728"},{"key":"e_1_3_2_1_10_1","volume-title":"2015 45th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks.","author":"Gupta Saurabh","year":"2015","unstructured":"Saurabh Gupta, Devesh Tiwari, Christopher Jantzi, James Rogers, and Don Maxwell. 2015. Understanding and exploiting spatial properties of system failures on extreme-scale HPC systems. In 2015 45th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.58871"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2020.106622"},{"key":"e_1_3_2_1_13_1","volume-title":"Long short-term memory. Neural computation 9, 8","author":"Hochreiter Sepp","year":"1997","unstructured":"Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735--1780."},{"key":"e_1_3_2_1_14_1","volume-title":"7th International Colloquium on Signal Processing and its Applications. 
IEEE, 112--116","author":"Izzeldin Huzaifa","year":"2011","unstructured":"Huzaifa Izzeldin, Vijanth S Asirvadam, and Nordin Saad. 2011. Online sliding-window based for training MLP networks using advanced conjugate gradient. In 2011 IEEE 7th International Colloquium on Signal Processing and its Applications. IEEE, 112--116."},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3130348.3130374"},{"key":"e_1_3_2_1_16_1","volume-title":"SC'18: International Conference for High Performance Computing, Networking, Storage and Analysis.","author":"Kalra Charu","year":"2018","unstructured":"Charu Kalra, Fritz Previlon, Xiangyu Li, Norman Rubin, and David Kaeli. 2018. Prism: Predicting resilience of gpu applications using statistical methods. In SC'18: International Conference for High Performance Computing, Networking, Storage and Analysis."},{"key":"e_1_3_2_1_17_1","volume-title":"Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems 30","author":"Ke Guolin","year":"2017","unstructured":"Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Lightgbm: A highly efficient gradient boosting decision tree. 
Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2006.18"},{"key":"e_1_3_2_1_19_1","volume-title":"36th annual computer software and applications conference","author":"Malheiros Yuri","unstructured":"Yuri Malheiros, Alan Moraes, Cleyton Trindade, and Silvio Meira. 2012. A source code recommender system to support newcomers. In 36th annual computer software and applications conference. IEEE, 19--24."},{"key":"e_1_3_2_1_20_1","volume-title":"2015 45th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks.","author":"Meza Justin","year":"2015","unstructured":"Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu. 2015. Revisiting memory errors in large-scale production data centers: Analysis and modeling of new trends from the field. In 2015 45th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1016\/0925-2312(91)90023-5"},{"key":"e_1_3_2_1_22_1","volume-title":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).","author":"Nie Bin","year":"2016","unstructured":"
Bin Nie, Devesh Tiwari, Saurabh Gupta, Evgenia Smirni, and James H Rogers. 2016. A large-scale study of soft-errors on GPUs in the field. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)."},{"key":"e_1_3_2_1_23_1","volume-title":"2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS).","author":"Nie Bin","year":"2017","unstructured":"Bin Nie, Ji Xue, Saurabh Gupta, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. 2017. Characterizing temperature, power, and soft-error behaviors in data center systems: Insights, challenges, and opportunities. In 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)."},{"key":"e_1_3_2_1_24_1","volume-title":"2018 48th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN).","author":"Nie Bin","year":"2018","unstructured":"Bin Nie, Ji Xue, Saurabh Gupta, Tirthak Patel, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. 2018. Machine learning models for GPU error prediction in a large scale HPC system. 
In 2018 48th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN)."},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2007.103"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126960"},{"key":"e_1_3_2_1_27_1","unstructured":"Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz Andr\u00e9 Barroso. 2007. Failure trends in a large disk drive population. (2007)."},{"key":"e_1_3_2_1_28_1","volume-title":"The AAAI-19 Workshop on Engineering Dependable and Secure Machine Learning Systems Software Engineering for Machine Learning (EDSMLS","author":"Raz Orna","year":"2019","unstructured":"Orna Raz, Marcel Zalmanovici, Aviad Zlotnick, and Eitan Farchi. 2019. Automatically detecting data drift in machine learning based classifiers. In The AAAI-19 Workshop on Engineering Dependable and Secure Machine Learning Systems Software Engineering for Machine Learning (EDSMLS 2019)."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.5555\/2930583.2930589"},{"key":"e_1_3_2_1_30_1","volume-title":"SC'13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis.","author":"Sridharan Vilas","year":"2013","unstructured":"
Vilas Sridharan, Jon Stearley, Nathan DeBardeleben, Sean Blanchard, and Sudhanva Gurumurthi. 2013. Feng shui of supercomputer memory positional effects in DRAM and SRAM faults. In SC'13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis."},{"key":"e_1_3_2_1_31_1","volume-title":"16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19)","author":"Tan Cheng","year":"2019","unstructured":"Cheng Tan, Ze Jin, Chuanxiong Guo, Tianrong Zhang, Haitao Wu, Karl Deng, Dongming Bi, and Dong Xiang. 2019. NetBouncer: Active Device and Link Failure Localization in Data Center Networks. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). 599--614."},{"key":"e_1_3_2_1_32_1","volume-title":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).","author":"Tiwari Devesh","year":"2015","unstructured":"Devesh Tiwari, Saurabh Gupta, James Rogers, Don Maxwell, Paolo Rech, Sudharshan Vazhkudai, Daniel Oliveira, Dave Londo, Nathan DeBardeleben, Philippe Navaux, et al. 2015. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. 
In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)."},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807128.1807161"},{"key":"e_1_3_2_1_34_1","volume-title":"2017 47th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN).","author":"Wang Guosai","year":"2017","unstructured":"Guosai Wang, Lifei Zhang, and Wei Xu. 2017. What can we learn from four years of data center hardware failures? In 2017 47th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN)."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISI.2017.8004872"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CNSM.2015.7367348"},{"key":"e_1_3_2_1_37_1","volume-title":"Phyo Phyo San, Xiao Li Li, and Shonali Krishnaswamy.","author":"Yang Jianbo","year":"2015","unstructured":"Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, and Shonali Krishnaswamy. 2015. Deep convolutional neural networks on multichannel time series for human activity recognition. In Twenty-fourth international joint conference on artificial intelligence."},{"key":"e_1_3_2_1_38_1","volume-title":"International Conference on Machine Learning. PMLR, 1604--1612","author":"Zhu Xiaodan","year":"2015","unstructured":"
Xiaodan Zhu, Parinaz Sobihani, and Hongyu Guo. 2015. Long short-term memory over recursive structures. In International Conference on Machine Learning. PMLR, 1604--1612."}],"event":{"name":"SYSTOR '23: 16th ACM International Conference on Systems and Storage","location":"Haifa Israel","acronym":"SYSTOR '23","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems"]},"container-title":["Proceedings of the 16th ACM International Conference on Systems and Storage"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3579370.3594777","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3579370.3594777","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:28Z","timestamp":1750182568000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3579370.3594777"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,5]]},"references-count":38,"alternative-id":["10.1145\/3579370.3594777","10.1145\/3579370"],"URL":"https:\/\/doi.org\/10.1145\/3579370.3594777","relation":{},"subject":[],"published":{"date-parts":[[2023,6,5]]},"assertion":[{"value":"2023-06-22","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}