{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:16:20Z","timestamp":1750306580094,"version":"3.41.0"},"publisher-location":"New York, New York, USA","reference-count":26,"publisher":"ACM Press","license":[{"start":{"date-parts":[[2014,1,1]],"date-time":"2014-01-01T00:00:00Z","timestamp":1388534400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Lawrence Berkeley National Laboratory"},{"name":"NSF","award":["1058779"],"award-info":[{"award-number":["1058779"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2014]]},"DOI":"10.1145\/2663165.2663325","type":"proceedings-article","created":{"date-parts":[[2014,11,26]],"date-time":"2014-11-26T15:45:24Z","timestamp":1417016724000},"page":"121-132","source":"Crossref","is-referenced-by-count":2,"title":["Affinity-aware checkpoint restart"],"prefix":"10.1145","author":[{"given":"Ajay","family":"Saini","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Arash","family":"Rezaei","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Frank","family":"Mueller","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Paul","family":"Hargrove","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Eric","family":"Roman","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","reference":[{"key":"key-10.1145\/2663165.2663325-1","doi-asserted-by":"crossref","unstructured":"S. Agarwal, R. Garg, M. S. Gupta, and J. E. Moreira. Adaptive incremental checkpointing for massively parallel systems. 
In Proceedings of the 18th annual international conference on Supercomputing, pages 277--286. ACM, 2004.","DOI":"10.1145\/1006209.1006248"},{"key":"key-10.1145\/2663165.2663325-2","doi-asserted-by":"crossref","unstructured":"J. Ansel, K. Arya, and G. Cooperman. Dmtcp: Transparent checkpointing for cluster computations and the desktop. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1--12. IEEE, 2009.","DOI":"10.1109\/IPDPS.2009.5161063"},{"key":"key-10.1145\/2663165.2663325-3","doi-asserted-by":"crossref","unstructured":"D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, et al. The nas parallel benchmarks. International Journal of High Performance Computing Applications, 5(3):63--73, 1991.","DOI":"10.1177\/109434209100500306"},{"key":"key-10.1145\/2663165.2663325-4","doi-asserted-by":"crossref","unstructured":"Z. Chen. Online-abft: an online algorithm based fault tolerance scheme for soft error detection in iterative methods. In Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 167--176. ACM, 2013.","DOI":"10.1145\/2442516.2442533"},{"key":"key-10.1145\/2663165.2663325-5","unstructured":"cryopid-devel@lists.berlios.de. Cryopid - a process freezer for linux. https:\/\/github.com\/maaziz\/cryopid, 2004."},{"key":"key-10.1145\/2663165.2663325-6","doi-asserted-by":"crossref","unstructured":"D. Dice, V. J. Marathe, and N. Shavit. Lock cohorting: A general technique for designing numa locks. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 247--256, New York, NY, USA, 2012. ACM.","DOI":"10.1145\/2145816.2145848"},{"key":"key-10.1145\/2663165.2663325-7","doi-asserted-by":"crossref","unstructured":"J. Duell. The design and implementation of berkeley lab's linux checkpoint\/restart. 
2005.","DOI":"10.2172\/891617"},{"key":"key-10.1145\/2663165.2663325-8","doi-asserted-by":"crossref","unstructured":"K. B. Ferreira, R. Riesen, R. Brightwell, P. Bridges, and D. Arnold. libhashckpt: hash-based incremental checkpointing using gpu's. In Recent Advances in the Message Passing Interface, pages 272--281. Springer, 2011.","DOI":"10.1007\/978-3-642-24449-0_31"},{"key":"key-10.1145\/2663165.2663325-9","doi-asserted-by":"crossref","unstructured":"A. Guermouche, T. Ropars, E. Brunet, M. Snir, and F. Cappello. Uncoordinated checkpointing without domino effect for send-deterministic mpi applications. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 989--1000. IEEE, 2011.","DOI":"10.1109\/IPDPS.2011.95"},{"key":"key-10.1145\/2663165.2663325-10","unstructured":"H. Jin, M. Frumkin, and J. Yan. The openmp implementation of nas parallel benchmarks and its performance. Technical report, Technical Report NAS-99-011, NASA Ames Research Center, 1999."},{"key":"key-10.1145\/2663165.2663325-11","doi-asserted-by":"crossref","unstructured":"L. V. Kale and S. Krishnan. Charm++: A portable concurrent object oriented system based on c++. In Proceedings of the Eighth Annual Conference on Object-oriented Programming Systems, Languages, and Applications, pages 91--108, 1993.","DOI":"10.1145\/165854.165874"},{"key":"key-10.1145\/2663165.2663325-12","doi-asserted-by":"crossref","unstructured":"I. Karlin, A. Bhatele, J. Keasler, B. L. Chamberlain, J. Cohen, Z. DeVito, R. Haque, D. Laney, E. Luke, F. Wang, et al. Exploring traditional and emerging parallel programming models using a proxy application. In Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pages 919--932. IEEE, 2013.","DOI":"10.1109\/IPDPS.2013.115"},{"key":"key-10.1145\/2663165.2663325-13","doi-asserted-by":"crossref","unstructured":"D. Li, Z. Chen, P. Wu, and J. S. Vetter. 
Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2013.","DOI":"10.1145\/2503210.2503226"},{"key":"key-10.1145\/2663165.2663325-14","doi-asserted-by":"crossref","unstructured":"J. Marathe and F. Mueller. Hardware profile-guided automatic page placement for ccNUMA systems. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 90--99, Mar. 2006.","DOI":"10.1145\/1122971.1122987"},{"key":"key-10.1145\/2663165.2663325-15","doi-asserted-by":"crossref","unstructured":"A. Moody, G. Bronevetsky, K. Mohror, and B. R. De Supinski. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, pages 1--11. IEEE, 2010.","DOI":"10.1109\/SC.2010.18"},{"key":"key-10.1145\/2663165.2663325-16","doi-asserted-by":"crossref","unstructured":"N. Naksinehaboon, Y. Liu, C. Leangsuksun, R. Nassar, M. Paun, and S. L. Scott. Reliability-aware approach: An incremental checkpoint\/restart model in hpc environments. In Cluster Computing and the Grid, 2008. CCGRID'08. 8th IEEE International Symposium on, pages 783--788. IEEE, 2008.","DOI":"10.1109\/CCGRID.2008.109"},{"key":"key-10.1145\/2663165.2663325-17","unstructured":"X. Ni, E. Meneses, N. Jain, and L. V. Kalé. Acr: automatic checkpoint\/restart for soft and hard error protection. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, page 7. ACM, 2013."},{"key":"key-10.1145\/2663165.2663325-18","doi-asserted-by":"crossref","unstructured":"B. Nicolae and F. Cappello. Ai-ckpt: leveraging memory access patterns for adaptive asynchronous incremental checkpointing. In Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, pages 155--166. 
ACM, 2013.","DOI":"10.1145\/2493123.2462918"},{"key":"key-10.1145\/2663165.2663325-19","doi-asserted-by":"crossref","unstructured":"A. J. Oliner, L. Rudolph, and R. K. Sahoo. Cooperative checkpointing: a robust approach to large-scale systems reliability. In Proceedings of the 20th annual international conference on Supercomputing, pages 14--23. ACM, 2006.","DOI":"10.1145\/1183401.1183406"},{"key":"key-10.1145\/2663165.2663325-20","doi-asserted-by":"crossref","unstructured":"J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kale, and K. Schulten. Scalable molecular dynamics with namd. Journal of computational chemistry, 26(16):1781--1802, 2005.","DOI":"10.1002\/jcc.20289"},{"key":"key-10.1145\/2663165.2663325-21","unstructured":"E. Roman. A survey of checkpoint\/restart implementations. In Lawrence Berkeley National Laboratory, Tech. LBNL, 2002."},{"key":"key-10.1145\/2663165.2663325-22","doi-asserted-by":"crossref","unstructured":"O. Sarood, E. Meneses, and L. V. Kale. A “cool” way of improving the reliability of hpc machines. In Proceedings of The International Conference for High Performance Computing, Networking, Storage and Analysis, 2013.","DOI":"10.1145\/2503210.2503228"},{"key":"key-10.1145\/2663165.2663325-23","doi-asserted-by":"crossref","unstructured":"K. Sato, N. Maruyama, K. Mohror, A. Moody, T. Gamblin, B. R. de Supinski, and S. Matsuoka. Design and modeling of a non-blocking checkpointing system. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 19. IEEE Computer Society Press, 2012.","DOI":"10.1109\/SC.2012.46"},{"key":"key-10.1145\/2663165.2663325-24","unstructured":"A. Schiper, F. Cappello, T. Martsinkevich, A. Guermouche, and T. Ropars. Spbc: Leveraging the characteristics of mpi hpc applications for scalable checkpointing. 
In International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13), number EPFL-CONF-189836, 2013."},{"key":"key-10.1145\/2663165.2663325-25","doi-asserted-by":"crossref","unstructured":"C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. Hybrid checkpointing for mpi jobs in hpc environments. In Proceedings of the 2010 IEEE 16th International Conference on Parallel and Distributed Systems, ICPADS '10, pages 524--533, Washington, DC, USA, 2010. IEEE Computer Society.","DOI":"10.1109\/ICPADS.2010.48"},{"key":"key-10.1145\/2663165.2663325-26","unstructured":"H. Zhong and J. Nieh. Crak: Linux checkpoint\/restart as a kernel module. Technical report, CUCS-014-01, Department of Computer Science, Columbia University, 2001."}],"event":{"number":"15","sponsor":["Raytheon BBN Technologies","IFIP","Conseil R\u00e9gional d'Aquitaine","USENIX","ACM, Association for Computing Machinery","LaBRI","HP","Bordeaux, City of Bordeaux","GDR ASR, GDR Architecture, Syst\u00e8mes et R\u00e9seaux"],"acronym":"Middleware '14","name":"the 15th International Middleware Conference","start":{"date-parts":[[2014,12,8]]},"location":"Bordeaux, France","end":{"date-parts":[[2014,12,12]]}},"container-title":["Proceedings of the 15th International Middleware Conference on - Middleware 
'14"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2663165.2663325","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/dl.acm.org\/ft_gateway.cfm?id=2663325&amp;ftid=1515755&amp;dwn=1","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T06:13:46Z","timestamp":1750227226000},"score":1,"resource":{"primary":{"URL":"http:\/\/dl.acm.org\/citation.cfm?doid=2663165.2663325"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014]]},"references-count":26,"URL":"https:\/\/doi.org\/10.1145\/2663165.2663325","relation":{},"subject":[],"published":{"date-parts":[[2014]]}}}