{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,29]],"date-time":"2025-09-29T08:21:31Z","timestamp":1759134091593,"version":"3.41.0"},"reference-count":22,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2006,4,1]],"date-time":"2006-04-01T00:00:00Z","timestamp":1143849600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGOPS Oper. Syst. Rev."],"published-print":{"date-parts":[[2006,4]]},"abstract":"<jats:p>As the size of high performance clusters multiplies, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. Checkpoint-based fault tolerance methods are effective approaches at dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a fault occurs, the application is restarted from a recent checkpoint. However, the application developer is required to write significant additional code for checkpointing and restarting. This paper describes disk-based and memory-based checkpointing fault tolerance schemes that automate the task of checkpointing and restarting. The schemes also allow the program to be restarted on a different number of processors. These schemes are based on self-checkpointable, migratable objects supported by the Adaptive MPI (AMPI) and Charm++ run-time and can be applied to a wide class of applications written using MPI or message-driven languages. 
We demonstrate the effectiveness of the strategies and evaluate their performance.<\/jats:p>","DOI":"10.1145\/1131322.1131340","type":"journal-article","created":{"date-parts":[[2006,7,24]],"date-time":"2006-07-24T17:00:26Z","timestamp":1153760426000},"page":"90-99","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":14,"title":["Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++"],"prefix":"10.1145","volume":"40","author":[{"given":"Gengbin","family":"Zheng","sequence":"first","affiliation":[{"name":"University of Illinois at Urbana-Champaign"}]},{"given":"Chao","family":"Huang","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign"}]},{"given":"Laxmikant V.","family":"Kal\u00e9","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign"}]}],"member":"320","published-online":{"date-parts":[[2006,4]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"An overview of the bluegene\/1 supercomputer","author":"Adiga NR","year":"2002","unstructured":"NR Adiga , G Almasi , GS Almasi , Y Aridor , R Barik , D Beece , R Bellofatto , G Bhanot , R Bickford , M Blumrich , AA Bright , and J. An overview of the bluegene\/1 supercomputer , 2002 .]] NR Adiga, G Almasi, GS Almasi, Y Aridor, R Barik, D Beece, R Bellofatto, G Bhanot, R Bickford, M Blumrich, AA Bright, and J. An overview of the bluegene\/1 supercomputer, 2002.]]"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.5555\/795672.796978"},{"key":"e_1_2_1_3_1","first-page":"496","volume-title":"Proc. 3rd Workshop on Runtime Systems for Parallel Programming (RTSPP)","author":"Antoniu Gabriel","year":"1999","unstructured":"Gabriel Antoniu , Luc Bouge , and Raymond Namyst . An efficient and transparent thread migration scheme in the PM2 runtime system . In Proc. 3rd Workshop on Runtime Systems for Parallel Programming (RTSPP) San Juan , Puerto Rico. 
Lecture Notes in Computer Science 1586, pages 496 -- 510 . Springer-Verlag , April 1999 .]] Gabriel Antoniu, Luc Bouge, and Raymond Namyst. An efficient and transparent thread migration scheme in the PM2 runtime system. In Proc. 3rd Workshop on Runtime Systems for Parallel Programming (RTSPP) San Juan, Puerto Rico. Lecture Notes in Computer Science 1586, pages 496--510. Springer-Verlag, April 1999.]]"},{"key":"e_1_2_1_4_1","volume-title":"LNCS 672","author":"Barak Amnon","year":"1993","unstructured":"Amnon Barak , Shai Guday , and Richard G. Wheeler . The mosix distributed operating system . In LNCS 672 . Springer , 1993 .]] Amnon Barak, Shai Guday, and Richard G. Wheeler. The mosix distributed operating system. In LNCS 672. Springer, 1993.]]"},{"key":"e_1_2_1_5_1","series-title":"Lecture Notes in Computer Science","doi-asserted-by":"crossref","first-page":"385","DOI":"10.1007\/3-540-44467-X_35","volume-title":"Proceedings of the International Conference on High Performance Computing (HiPC","author":"Bhandarkar Milind","year":"2000","unstructured":"Milind Bhandarkar and L. V. Kal\u00e9 . A Parallel Framework for Explicit FEM . In M. Valero, V. K. Prasanna, and S. Vajpeyam, editors, Proceedings of the International Conference on High Performance Computing (HiPC 2000 ), Lecture Notes in Computer Science , volume 1970 , pages 385 -- 395 . Springer Verlag , December 2000.]] Milind Bhandarkar and L. V. Kal\u00e9. A Parallel Framework for Explicit FEM. In M. Valero, V. K. Prasanna, and S. Vajpeyam, editors, Proceedings of the International Conference on High Performance Computing (HiPC 2000), Lecture Notes in Computer Science, volume 1970, pages 385--395. Springer Verlag, December 2000.]]"},{"key":"e_1_2_1_6_1","first-page":"207","volume-title":"IEEE International Symposium on Reliability, Distributed Software, and Databases","author":"Briatico D.","year":"1984","unstructured":"D. Briatico , A. Ciuffoletti , and L. Simoncini . 
A distributed domino-effect free recovery algorithm . In IEEE International Symposium on Reliability, Distributed Software, and Databases , pages 207 -- 215 , December 1984 .]] D. Briatico, A. Ciuffoletti, and L. Simoncini. A distributed domino-effect free recovery algorithm. In IEEE International Symposium on Reliability, Distributed Software, and Databases, pages 207--215, December 1984.]]"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/781498.781513"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/214451.214456"},{"key":"e_1_2_1_9_1","unstructured":"Charm++ website. http:\/\/charm.cs.uiuc.edu\/.]]  Charm++ website. http:\/\/charm.cs.uiuc.edu\/.]]"},{"key":"e_1_2_1_10_1","volume-title":"CLIP: A checkpointing tool for message-passing parallel programs","author":"Chen Yuqun","year":"1997","unstructured":"Yuqun Chen , Kai Li , and James S. Plank . CLIP: A checkpointing tool for message-passing parallel programs . 1997 .]] Yuqun Chen, Kai Li, and James S. Plank. CLIP: A checkpointing tool for message-passing parallel programs. 1997.]]"},{"key":"e_1_2_1_11_1","unstructured":"Epcc blue gene\/1. http:\/\/www.epcc.ed.ac.uk\/.]]  Epcc blue gene\/1. http:\/\/www.epcc.ed.ac.uk\/.]]"},{"key":"e_1_2_1_12_1","volume-title":"Dept. of Computer Science","author":"Huang Chao","year":"2004","unstructured":"Chao Huang . System support for checkpoint and restart of charm++ and ampi applications. Master's thesis , Dept. of Computer Science , University of Illinois , 2004 .]] Chao Huang. System support for checkpoint and restart of charm++ and ampi applications. Master's thesis, Dept. of Computer Science, University of Illinois, 2004.]]"},{"key":"e_1_2_1_13_1","first-page":"306","volume-title":"Adaptive MPI. In Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2003","author":"Huang Chao","year":"2003","unstructured":"Chao Huang , Orion Lawlor , and L. V. Kal\u00e9 . Adaptive MPI. 
In Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2003 ), LNCS 2958, pages 306 -- 322 , College Station, Texas , October 2003 .]] Chao Huang, Orion Lawlor, and L. V. Kal\u00e9. Adaptive MPI. In Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2003), LNCS 2958, pages 306--322, College Station, Texas, October 2003.]]"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1122971.1122976"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2004.1303335"},{"key":"e_1_2_1_16_1","volume-title":"LACSI 2002","author":"Kal\u00e9 Laxmikant V.","year":"2002","unstructured":"Laxmikant V. Kal\u00e9 . The virtualization model of parallel programming: Runtime optimizations and the state of art . In LACSI 2002 , Albuquerque , October 2002 .]] Laxmikant V. Kal\u00e9. The virtualization model of parallel programming: Runtime optimizations and the state of art. In LACSI 2002, Albuquerque, October 2002.]]"},{"key":"e_1_2_1_18_1","volume-title":"24th Annual International Symposium on Fault-Tolerant Computing","author":"James","year":"1994","unstructured":"James S. Plank and Kai Li. Faster checkpointing with n+1 parity . In 24th Annual International Symposium on Fault-Tolerant Computing , June 1994 .]] James S. Plank and Kai Li. Faster checkpointing with n+1 parity. In 24th Annual International Symposium on Fault-Tolerant Computing, June 1994.]]"},{"issue":"2","key":"e_1_2_1_19_1","first-page":"226","volume":"1","author":"Randell B.","year":"1975","unstructured":"B. Randell . System structure for software fault-tolerance. In IEEE Trans. on Software on Software Engineering , volume SE-1 ( 2 ), pages 226 -- 232 , June 1975 .]] B. Randell. System structure for software fault-tolerance. In IEEE Trans. on Software on Software Engineering, volume SE-1 (2), pages 226--232, June 1975.]]","journal-title":"Trans. 
on Software on Software Engineering"},{"key":"e_1_2_1_20_1","volume-title":"Stellner. CoCheck: Checkpointing and Process Migration for MPI. In Proceedings of the 10th International Parallel Processing Symposium (IPPS'96)","author":"Georg","year":"1996","unstructured":"Georg Stellner. CoCheck: Checkpointing and Process Migration for MPI. In Proceedings of the 10th International Parallel Processing Symposium (IPPS'96) , Honolulu, Hawaii , 1996 .]] Georg Stellner. CoCheck: Checkpointing and Process Migration for MPI. In Proceedings of the 10th International Parallel Processing Symposium (IPPS'96), Honolulu, Hawaii, 1996.]]"},{"key":"e_1_2_1_21_1","first-page":"32","volume-title":"13th International Conference on Parallel Processing","author":"Tamir Y.","year":"1984","unstructured":"Y. Tamir and C. Equin . Error recovery in multicomputers using global checkpoints . In 13th International Conference on Parallel Processing , pages 32 -- 41 , August 1984 .]] Y. Tamir and C. Equin. Error recovery in multicomputers using global checkpoints. In 13th International Conference on Parallel Processing, pages 32--41, August 1984.]]"},{"key":"e_1_2_1_22_1","unstructured":"Turing cluster. http:\/\/www.cse.uiuc.edu\/turing.]]  Turing cluster. http:\/\/www.cse.uiuc.edu\/turing.]]"},{"key":"e_1_2_1_25_1","volume-title":"2004 IEEE International Conference on Cluster Computing","author":"Zheng Gengbin","year":"2004","unstructured":"Gengbin Zheng , Lixia Shi , and Laxmikant V. Kal\u00e9 . Ftc-charm++: An in-memory checkpoint-based fault tolerant runtime for charm++ and mpi . In 2004 IEEE International Conference on Cluster Computing , San Dieago, CA , September 2004 .]] Gengbin Zheng, Lixia Shi, and Laxmikant V. Kal\u00e9. Ftc-charm++: An in-memory checkpoint-based fault tolerant runtime for charm++ and mpi. 
In 2004 IEEE International Conference on Cluster Computing, San Diego, CA, September 2004.]]"}],"container-title":["ACM SIGOPS Operating Systems Review"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1131322.1131340","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1131322.1131340","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T15:06:16Z","timestamp":1750259176000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1131322.1131340"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2006,4]]},"references-count":22,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2006,4]]}},"alternative-id":["10.1145\/1131322.1131340"],"URL":"https:\/\/doi.org\/10.1145\/1131322.1131340","relation":{},"ISSN":["0163-5980"],"issn-type":[{"type":"print","value":"0163-5980"}],"subject":[],"published":{"date-parts":[[2006,4]]},"assertion":[{"value":"2006-04-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}