{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,19]],"date-time":"2026-03-19T04:41:56Z","timestamp":1773895316262,"version":"3.50.1"},"reference-count":69,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,11,20]],"date-time":"2024-11-20T00:00:00Z","timestamp":1732060800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62025203"],"award-info":[{"award-number":["62025203"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2024,12,31]]},"abstract":"<jats:p>Silent Data Corruption (SDC) in processors can lead to various application-level issues, such as incorrect calculations and even data loss. Since traditional techniques are not effective in detecting these errors, it is very hard to address problems caused by SDCs in processors. For the same reason, knowledge about these SDCs in the wild is limited.<\/jats:p>\n          <jats:p>In this article, we conduct an extensive study on CPU SDCs in a large production CPU population, encompassing over one million processors. In addition to collecting overall statistics, we perform a detailed study to understand (1) whether certain processor features are particularly vulnerable and their potential impacts on applications; (2) the reproducibility of CPU SDCs and the triggering conditions (e.g., temperature) of those less reproducible SDCs; and (3) the challenges to mitigate and handle CPU SDCs.<\/jats:p>\n          <jats:p>We further investigate the implications that our observations obtained from the above researches have on the SDC fault models, SDC mitigation strategies, and the future research fields. In addition, we design an efficient SDC mitigation approach called Farron, which uses prioritized testing to detect highly reproducible SDCs and temperature control to mitigate less-reproducible SDCs. Our experimental results indicate that Farron can achieve better coverage of CPU SDCs with lower overall overhead, compared to the baseline used in Alibaba Cloud. This demonstrates that our observations are able to assist in SDC mitigation.<\/jats:p>","DOI":"10.1145\/3690825","type":"journal-article","created":{"date-parts":[[2024,9,2]],"date-time":"2024-09-02T10:53:50Z","timestamp":1725274430000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["Understanding Silent Data Corruption in Processors for Mitigating its Effects"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-6842-3148","authenticated-orcid":false,"given":"Shaobu","family":"Wang","sequence":"first","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3480-5902","authenticated-orcid":false,"given":"Guangyan","family":"Zhang","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0688-6370","authenticated-orcid":false,"given":"Junyu","family":"Wei","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9721-4923","authenticated-orcid":false,"given":"Yang","family":"Wang","sequence":"additional","affiliation":[{"name":"The Ohio State University, Columbus, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7417-5469","authenticated-orcid":false,"given":"Jiesheng","family":"Wu","sequence":"additional","affiliation":[{"name":"Alibaba Cloud, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-3805-5377","authenticated-orcid":false,"given":"Qingchao","family":"Luo","sequence":"additional","affiliation":[{"name":"Alibaba Cloud, Hangzhou, China"}]}],"member":"320","published-online":{"date-parts":[[2024,11,20]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1973.5009108"},{"key":"e_1_3_2_3_2","article-title":"Detection and prevention of silent data corruption in an exabyte-scale database system","author":"Bacon David F.","year":"2022","unstructured":"David F. Bacon. 2022. Detection and prevention of silent data corruption in an exabyte-scale database system. In Workshop on Silicon Errors in Logic: System Effects. IEEE.","journal-title":"Workshop on Silicon Errors in Logic: System Effects. IEEE"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/1416944.1416947"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/TDMR.2005.853449"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCC-CSS-ICESS.2015.9"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1177\/1094342014532297"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/2749246.2749253"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/LADC53747.2021.9672590"},{"key":"e_1_3_2_10_2","first-page":"173","volume-title":"USENIX Conference on Operating Systems Design and Implementation.","year":"1999","unstructured":"M. Castro and B. Liskov. 1999. Practical Byzantine fault tolerance. In USENIX Conference on Operating Systems Design and Implementation.173\u2013186."},{"key":"e_1_3_2_11_2","first-page":"19","article-title":"Replication","volume":"5959","author":"Charron-Bost Bernadette","year":"2010","unstructured":"Bernadette Charron-Bost, Fernando Pedone, and Andr\u00e9 Schiper. 2010. Replication. Lect. Notes Comput. Sci. 5959 (2010), 19\u201340.","journal-title":"Lect. Notes Comput. Sci."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/23.903758"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.1964.1053662"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/1629575.1629602"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2008.4630077"},{"key":"e_1_3_2_16_2","article-title":"ECC for L2 Cache Data Memory","author":"Corporation Intel","unstructured":"Intel Corporation. 2022. ECC for L2 Cache Data Memory. Retrieved from https:\/\/www.intel.com\/content\/www\/us\/en\/docs\/programmable\/683360\/18-0\/ecc-for-l2-cache-data-memory.html","journal-title":"https:\/\/www.intel.com\/content\/www\/us\/en\/docs\/programmable\/683360\/18-0\/ecc-for-l2-cache-data-memory.html"},{"key":"e_1_3_2_17_2","article-title":"Intel Skylake\/Kaby Lake Processors: Broken Hyper-threading","author":"Corporation Intel","unstructured":"Intel Corporation. 2017. Intel Skylake\/Kaby Lake Processors: Broken Hyper-threading. Retrieved from https:\/\/lists.debian.org\/debian-devel\/2017\/06\/msg00308.html","journal-title":"https:\/\/lists.debian.org\/debian-devel\/2017\/06\/msg00308.html"},{"key":"e_1_3_2_18_2","article-title":"OpenDCDiag","author":"Corporation Intel","unstructured":"Intel Corporation. 2022. OpenDCDiag. Retrieved from https:\/\/github.com\/opendcdiag\/opendcdiag","journal-title":"https:\/\/github.com\/opendcdiag\/opendcdiag"},{"key":"e_1_3_2_19_2","article-title":"Optimizing Storage Solutions using the Intel\u00ae Intelligent Storage Acceleration Library.","author":"Corporation Intel","unstructured":"Intel Corporation. 2014. Optimizing Storage Solutions using the Intel\u00ae Intelligent Storage Acceleration Library. Retrieved from https:\/\/software.intel.com\/en-us\/articles\/optimizing-storage-solutions-using-the-intel-intelligent-storage-acceleration-library","journal-title":"https:\/\/software.intel.com\/en-us\/articles\/optimizing-storage-solutions-using-the-intel-intelligent-storage-acceleration-library"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNS.2018.2852606"},{"key":"e_1_3_2_21_2","unstructured":"Parth Deshmukh Sean Maginnis and Josh Chandler. 2011. Jerasure 2.0 (2011). Chancellor\u2019s Honors Program Projects."},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/CCGrid.2015.17"},{"key":"e_1_3_2_23_2","article-title":"Detecting silent data corruptions in the wild","author":"Dixit Harish Dattatraya","year":"2022","unstructured":"Harish Dattatraya Dixit, Laura Boyle, Gautham Vunnam, Sneha Pendharkar, Matt Beadon, and Sriram Sankar. 2022. Detecting silent data corruptions in the wild. arXiv:2203.08989 (2022).","journal-title":"arXiv:2203.08989"},{"key":"e_1_3_2_24_2","article-title":"Silent data corruptions at scale","author":"Dixit Harish Dattatraya","year":"2021","unstructured":"Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, and Sriram Sankar. 2021. Silent data corruptions at scale. arXiv preprint arXiv:2102.11245 (2021).","journal-title":"arXiv preprint arXiv:2102.11245"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3483840"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.2172\/1089338"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.5555\/2388996.2389102"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/945445.945450"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/2751504.2751512"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3242086"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-20943-2"},{"key":"e_1_3_2_32_2","unstructured":"Unified EFI Forum Inc. Advanced Configuration and Power Interface Specification. Retrieved from https:\/\/uefi.org\/sites\/default\/files\/resources\/ACPI_6_2.pdf"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458336.3465297"},{"key":"e_1_3_2_34_2","first-page":"15","volume-title":"USENIX Annual Technical Conference (USENIX ATC\u201912)","author":"Huang Cheng","year":"2012","unstructured":"Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin. 2012. Erasure coding in Windows Azure storage. In USENIX Annual Technical Conference (USENIX ATC\u201912). 15\u201326."},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1984.1676475"},{"key":"e_1_3_2_36_2","volume-title":"Impact of Parameter Variations on Multi-core Chips","author":"Humenay Eric","year":"2006","unstructured":"Eric Humenay, David Tarjan, and Kevin Skadron. 2006. Impact of Parameter Variations on Multi-core Chips. Technical Report. University of Virginia Department of Computer Science."},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/DFT.2016.7684076"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/6420.6422"},{"key":"e_1_3_2_39_2","first-page":"237","volume-title":"10th USENIX Conference on Operating Systems Design and Implementation (OSDI\u201912)","author":"Kapritsos Manos","year":"2012","unstructured":"Manos Kapritsos, Yang Wang, Vivien Quema, Allen Clement, Lorenzo Alvisi, and Mike Dahlin. 2012. All about Eve: Execute-verify replication for multi-core servers. In 10th USENIX Conference on Operating Systems Design and Implementation (OSDI\u201912). USENIX Association, USA, 237\u2013250."},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/ITC50571.2021.00011"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1023\/A:1012203009875"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476195"},{"key":"e_1_3_2_43_2","first-page":"1049","volume-title":"Asian Conference on Machine Learning","author":"Liu Cheng","year":"2019","unstructured":"Cheng Liu, Jingjing Gu, Zujia Yan, Fuzhen Zhuang, and Yunyun Wang. 2019. SDC-causing error detection based on lightweight vulnerability prediction. In Asian Conference on Machine Learning. PMLR, 1049\u20131064."},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/1064978.1065034"},{"key":"e_1_3_2_45_2","first-page":"287","volume-title":"17th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201923)","year":"2023","unstructured":"Jialun Lyu, Marisa You, Celine Irvene, Mark Jung, Tyler Narmore, Jacob Shapiro, Luke Marshall, Savyasachi Samal, Ioannis Manousakis, Lisa Hsu, Preetha Subbarayalu, Ashish Raniwala, Brijesh Warrier, Ricardo Bianchini, Bianca Schroeder, and Daniel S. Berger. 2023. Hyrax: Fail-in-Place server operation in cloud platforms. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201923). 287\u2013304."},{"key":"e_1_3_2_46_2","unstructured":"W. Kahan. 1996. IEEE standard 754 for binary floating-point arithmetic. Lecture Notes on the Status of IEEE 754 (94720-1776) 11."},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/TDMR.2005.855685"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2007.99"},{"key":"e_1_3_2_49_2","first-page":"61","volume-title":"USENIX Annual Technical Conference, General Track","author":"Moore Justin D.","year":"2005","unstructured":"Justin D. Moore, Jeffrey S. Chase, Parthasarathy Ranganathan, and Ratnesh K. Sharma. 2005. Making scheduling \u201cCool\u201d: Temperature-aware workload placement in data centers. In USENIX Annual Technical Conference, General Track. 61\u201375."},{"key":"e_1_3_2_50_2","volume-title":"Architecture Design for Soft Errors","author":"Mukherjee Shubu","year":"2011","unstructured":"Shubu Mukherjee. 2011. Architecture Design for Soft Errors. Morgan Kaufmann."},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2005.37"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC53511.2021.00021"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2023.3285094"},{"key":"e_1_3_2_54_2","first-page":"5","volume-title":"Linux Symposium","author":"Petersen Martin K.","year":"2008","unstructured":"Martin K. Petersen. 2008. Linux data integrity extensions. In Linux Symposium. 5."},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2005.61"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.23919\/DATE54114.2022.9774600"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/263876.263881"},{"key":"e_1_3_2_58_2","article-title":"SiliFuzz: Fuzzing CPUs by proxy","author":"Serebryany Kostya","year":"2021","unstructured":"Kostya Serebryany, Maxim Lifantsev, Konstantin Shtoyk, Doug Kwan, and Peter Hochschild. 2021. SiliFuzz: Fuzzing CPUs by proxy. arXiv preprint arXiv:2110.11519 (2021).","journal-title":"arXiv preprint arXiv:2110.11519"},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/VTS56346.2023.10139970"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/ITC50671.2022.00046"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-011-0635-z"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613149"},{"key":"e_1_3_2_63_2","first-page":"357","volume-title":"10th USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201913)","author":"Wang Yang","year":"2013","unstructured":"Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin. 2013. Robustness in the Salus scalable block store. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI\u201913). 357\u2013370."},{"key":"e_1_3_2_64_2","doi-asserted-by":"publisher","DOI":"10.1145\/3552326.3567505"},{"key":"e_1_3_2_65_2","first-page":"1","article-title":"Calculating Useful Lifetimes of Embedded Processors","author":"Webber Allan","year":"2014","unstructured":"Allan Webber. 2014. Calculating Useful Lifetimes of Embedded Processors. Application Report, Texas Instruments. 1\u20136.","journal-title":"Application Report, Texas Instruments"},{"key":"e_1_3_2_66_2","volume-title":"Managing Temperature Effects in Nanoscale Adaptive Systems","author":"Wolpert David Solomon","year":"2011","unstructured":"David Solomon Wolpert. 2011. Managing Temperature Effects in Nanoscale Adaptive Systems. University of Rochester."},{"key":"e_1_3_2_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2905842"},{"key":"e_1_3_2_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPADS47876.2019.00127"},{"key":"e_1_3_2_69_2","first-page":"29","volume-title":"USENIX Conference on File and Storage Technologies (FAST\u201910)","author":"Zhang Yupu","year":"2010","unstructured":"Yupu Zhang, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2010. End-to-end data integrity for file systems: A ZFS case study. In USENIX Conference on File and Storage Technologies (FAST\u201910). 29\u201342."},{"key":"e_1_3_2_70_2","volume-title":"1st Workshop on Data Integrity and Secure Cloud Computing (DISCC\u201922)","author":"Zuo Gefei","year":"2022","unstructured":"Gefei Zuo, Jiacheng Ma, Andrew Quinn, and Baris Kasikci. 2022. Tolerate silent data errors with coded computation. In 1st Workshop on Data Integrity and Secure Cloud Computing (DISCC\u201922)."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3690825","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3690825","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:58:06Z","timestamp":1750294686000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3690825"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,20]]},"references-count":69,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,12,31]]}},"alternative-id":["10.1145\/3690825"],"URL":"https:\/\/doi.org\/10.1145\/3690825","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,20]]},"assertion":[{"value":"2024-02-10","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-08-20","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}