{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T09:35:16Z","timestamp":1763458516048,"version":"3.45.0"},"reference-count":53,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2016,12,8]],"date-time":"2016-12-08T00:00:00Z","timestamp":1481155200000},"content-version":"vor","delay-in-days":366,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"SRC STARnet Centers"},{"DOI":"10.13039\/100007245","name":"MARCO","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100007245","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Center for Future Architectures Research"},{"DOI":"10.13039\/100000185","name":"DARPA","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2016,1,7]]},"abstract":"<jats:p>As memory systems scale, maintaining their Reliability, Availability, and Serviceability (RAS) is becoming more complex. To make matters worse, recent studies of DRAM failures in data centers and supercomputer environments have highlighted that large-granularity failures are common in DRAM chips. Furthermore, the move toward 3D-stacked memories can make the system vulnerable to new failure modes, such as those caused by faults in Through-Silicon Vias (TSVs). To architect future systems and to use emerging technology, system designers will need to employ strong error correction and repair techniques. Unfortunately, evaluating the relative effectiveness of these reliability mechanisms is often difficult and is traditionally done with analytical models, which are both error-prone and time-consuming to develop. 
To this end, this article proposes F<jats:sc>ault<\/jats:sc>S<jats:sc>im<\/jats:sc>, a fast, configurable memory-reliability simulation tool for 2D and 3D-stacked memory systems. FaultSim employs Monte Carlo simulations driven by real-world failure statistics. We discuss the novel algorithms and data structures used in FaultSim to accelerate the evaluation of different resilience schemes. We implement BCH-1 (SECDED) and ChipKill codes using FaultSim and validate them against an analytical model. FaultSim's results for BCH-1 and ChipKill deviate from the analytical model by only 0.032% and 8.41%, respectively. FaultSim can simulate 1 million Monte Carlo trials (each covering a period of 7 years) of BCH-1 and ChipKill codes in only 34 seconds and 33 seconds, respectively.<\/jats:p>","DOI":"10.1145\/2831234","type":"journal-article","created":{"date-parts":[[2015,12,10]],"date-time":"2015-12-10T09:22:10Z","timestamp":1449739330000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":16,"title":["F<scp>ault<\/scp>S<scp>im<\/scp>"],"prefix":"10.1145","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1732-4314","authenticated-orcid":false,"given":"Prashant J.","family":"Nair","sequence":"first","affiliation":[{"name":"School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA"}]},{"given":"David A.","family":"Roberts","sequence":"additional","affiliation":[{"name":"AMD Research, Advanced Micro Devices Inc."}]},{"given":"Moinuddin K.","family":"Qureshi","sequence":"additional","affiliation":[{"name":"School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, 
GA"}]}],"member":"320","published-online":{"date-parts":[[2015,12,8]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2012.6168941"},{"key":"e_1_2_2_2_1","volume-title":"Exponential Distribution: Theory, Methods and Applications","author":"Balakrishnan K.","year":"1996","unstructured":"K. Balakrishnan. 1996. Exponential Distribution: Theory, Methods and Applications. Taylor & Francis."},{"key":"e_1_2_2_3_1","unstructured":"K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick. 2008. ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. Technical Report."},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD.2012.6378642"},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/339647.339657"},{"volume-title":"Proceedings of the 2015 45th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN\u201915)","author":"Cai Y.","key":"e_1_2_2_6_1","unstructured":"Y. Cai, Y. Luo, S. Ghose, and O. Mutlu. 2015. Read disturb errors in MLC NAND flash memory: Characterization, mitigation, and recovery. In Proceedings of the 2015 45th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN\u201915). 438--449."},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063454"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1147\/rd.282.0124"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.5555\/2665671.2665683"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.1964.1053699"},{"volume-title":"Proceedings of the 2015 45th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN\u201915)","author":"Chou C.","key":"e_1_2_2_11_1","unstructured":"C. Chou, P. Nair, and M. K. Qureshi. 
2015. Reducing refresh power in mobile devices with morphable ECC. In Proceedings of the 2015 45th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN\u201915). 355--366."},{"key":"e_1_2_2_12_1","first-page":"1470","article-title":"Partial parity cache and data cache management method to improve the performance of an SSD-based RAID","volume":"99","author":"Chung H. H.","year":"2013","unstructured":"H. H. Chung, C. 2013. Partial parity cache and data cache management method to improve the performance of an SSD-based RAID. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 99, 1470--1480.","journal-title":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems"},{"key":"e_1_2_2_13_1","unstructured":"N. DeBardeleben. 2013. Reliability models for double chipkill detect\/correct memory systems. In Los Alamos National Laboratory Associate Directorate for Theory, Simulation, and Computation (ADTSC\u201913) LAUR 13-20839."},{"key":"e_1_2_2_14_1","unstructured":"T. J. Dell. 1997. A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory. Technical Report 11\/19\/97. IBM."},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.microrel.2004.01.016"},{"key":"e_1_2_2_16_1","unstructured":"H. Emery. 2013. The IBM zEnterprise EC12 (zEC12) System: Processor, Memory and System Structure Enhancements. Technical Report."},{"key":"e_1_2_2_17_1","unstructured":"E. Fujiwara and D. K. Pradhan. 1989. Error-Control Coding in Computers Systems. 
Prentice-Hall."},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.5555\/2665671.2665685"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/PRDC.2013.18"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.5555\/2492708.2492905"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/TR.1975.5215337"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2014.2332177"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/1331699.1331719"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.5555\/2665671.2665726"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1669112.1669172"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063445"},{"volume-title":"Using the Weibull Distribution: Reliability, Modeling and Inference","author":"McCool J. I.","key":"e_1_2_2_27_1","unstructured":"J. I. McCool. 2012. Using the Weibull Distribution: Reliability, Modeling and Inference. Wiley."},{"volume-title":"Proceedings of the 2010 IEEE 16th International Symposium on High Performance Computer Architecture (HPCA\u201910)","author":"Miller J. E.","key":"e_1_2_2_28_1","unstructured":"J. E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal. 2010. Graphite: A distributed parallel simulator for multicores. In Proceedings of the 2010 IEEE 16th International Symposium on High Performance Computer Architecture (HPCA\u201910). 1--12."},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.30"},{"volume-title":"Proceedings of the 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA\u201915)","author":"Nair P. J.","key":"e_1_2_2_30_1","unstructured":"P. J. Nair, C. Chou, B. Rajendran, and M. K. Qureshi. 2015. Reducing read latency of phase change memory via early read and Turbo Read. 
In Proceedings of the 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA\u201915). 309--319."},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485929"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.57"},{"key":"e_1_2_2_33_1","doi-asserted-by":"crossref","unstructured":"J. T. Pawlowski. 2011. Hybrid memory cube (HMC). In HOT-CHIPS 23.","DOI":"10.1109\/HOTCHIPS.2011.7477494"},{"key":"e_1_2_2_34_1","volume-title":"Proceedings of the IEEE International Conference on Computer Design","author":"Pellegrini A.","year":"2008","unstructured":"A. Pellegrini, K. Constantinides, Dan Zhang, S. Sudhakar, V. Bertacco, and T. Austin. 2008. CrashTest: A fast high-fidelity FPGA-based resiliency analysis framework. In Proceedings of the IEEE International Conference on Computer Design, 2008 (ICCD\u201908). 363--370."},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","unstructured":"M. K. Qureshi. 2011. Pay-As-You-Go: Low-overhead hard-error correction for phase change memories. In MICRO. 318--328. 10.1145\/2155620.2155658","DOI":"10.1145\/2155620.2155658"},{"volume-title":"Proceedings of the 2015 45th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN\u201915)","author":"Qureshi M. K.","key":"e_1_2_2_36_1","unstructured":"M. K. Qureshi, D.-H. Kim, S. Khan, P. J. Nair, and O. Mutlu. 2015. AVATAR: A variable-retention-time (VRT) aware refresh for DRAM systems. In Proceedings of the 2015 45th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN\u201915). 427--437."},{"key":"e_1_2_2_37_1","unstructured":"D. A. Roberts and P. J. Nair. 2014. FaultSim: A fast configurable memory-resilience simulator. 
In The Memory Forum: In conjunction with ISCA-41."},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2013.30"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2009.4"},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/2492101.1555372"},{"key":"e_1_2_2_41_1","unstructured":"Silicon Power 2010. DDR3 ECC Unbuffered DIMM Spec Sheet. Silicon Power."},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485958"},{"volume-title":"Proceedings of the 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA\u201915)","author":"Son Y. H.","key":"e_1_2_2_43_1","unstructured":"Y. H. Son, S. Lee, S. O, S. Kwon, N. S. Kim, and J. H. Ahn. 2015. CiDRA: A cache-inspired DRAM resilience architecture. In Proceedings of the 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA\u201915). 502--513."},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2694344.2694348"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.5555\/2388996.2389100"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2503257"},{"key":"e_1_2_2_47_1","unstructured":"JEDEC Standard. 2013. High bandwidth memory (HBM) DRAM. In JESD235."},{"volume-title":"Octopus 8-Port DRAM for Die-Stack Applications: TSC100801\/2\/4","author":"Tezzaron Corp. 2010.","key":"e_1_2_2_48_1","unstructured":"Tezzaron Corp. 2010. Octopus 8-Port DRAM for Die-Stack Applications: TSC100801\/2\/4. Tezzaron Corp."},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/71.595583"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.5555\/2337159.2337192"},{"key":"e_1_2_2_51_1","volume-title":"SELSE","author":"Wang S.","year":"2015","unstructured":"S. Wang, H. (C.) Hu, H. Zheng, and P. Gupta. 2015. MEMRES: A fast memory system reliability simulator. 
In SELSE 2015."},{"volume-title":"Proceedings of the 2011 Annual Reliability and Maintainability Symposium (RAMS\u201911)","author":"White M.","key":"e_1_2_2_52_1","unstructured":"M. White, J. Qin, and J. B. Bernstein. 2011. A study of scaling effects on DRAM reliability. In Proceedings of the 2011 Annual Reliability and Maintainability Symposium (RAMS\u201911). 1--6."},{"volume-title":"Introduction to Reliability Analysis","author":"Zacks S.","key":"e_1_2_2_53_1","unstructured":"S. Zacks. 1992. Introduction to Reliability Analysis. Springer Texts in Statistics."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2831234","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2831234","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2831234","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T09:30:21Z","timestamp":1763458221000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2831234"}},"subtitle":["A Fast, Configurable Memory-Reliability Simulator for Conventional and 3D-Stacked 
Systems"],"short-title":[],"issued":{"date-parts":[[2015,12,8]]},"references-count":53,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2016,1,7]]}},"alternative-id":["10.1145\/2831234"],"URL":"https:\/\/doi.org\/10.1145\/2831234","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2015,12,8]]},"assertion":[{"value":"2015-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-09-01","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-12-08","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}