{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,3,25]],"date-time":"2025-03-25T04:16:29Z","timestamp":1742876189988,"version":"3.40.2"},"reference-count":27,"publisher":"IGI Global","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2012,1,1]]},"abstract":"<p>In this paper, the authors present a new approach to algorithm-based fault tolerance (ABFT) for high-performance computing systems. The ABFT approach transforms a system that does not tolerate a specific type of fault, called the fault-intolerant system, into a system that provides a specific level of fault tolerance, namely recovery. ABFT techniques that detect errors rely on comparing parity values computed in two ways: the parallel processing of input parity values produces output parity values that are comparable with parity values regenerated from the original processed outputs. These techniques can use convolutional codes for the redundancy. This method is a new approach to concurrent error correction in fault-tolerant computing systems. This paper proposes a novel computing paradigm to provide fault tolerance for numerical algorithms. 
The authors also present, implement, and evaluate early detection in ABFT.<\/p>","DOI":"10.4018\/jghpc.2012010103","type":"journal-article","created":{"date-parts":[[2012,4,3]],"date-time":"2012-04-03T15:24:59Z","timestamp":1333466699000},"page":"37-51","source":"Crossref","is-referenced-by-count":9,"title":["Analysis and Evaluation of a New Algorithm Based Fault Tolerance for Computing Systems"],"prefix":"10.4018","volume":"4","author":[{"given":"Hodjat","family":"Hamidi","sequence":"first","affiliation":[{"name":"University of Isfahan, Iran"}]},{"given":"Abbas","family":"Vafaei","sequence":"additional","affiliation":[{"name":"University of Isfahan, Iran"}]},{"given":"Seyed Amir Hassan","family":"Monadjemi","sequence":"additional","affiliation":[{"name":"University of Isfahan, Iran"}]}],"member":"2432","reference":[{"key":"jghpc.2012010103-0","doi-asserted-by":"crossref","unstructured":"Acree, R. K., Ullah, N., Karia, A., Rahmeh, J. T., & Abraham, J. A. (1993). An object-oriented approach for implementing algorithm-based fault tolerance. In Proceedings of the 12th Annual International Phoenix Computers and Communications Conference (pp. 210-216).","DOI":"10.1109\/PCCC.1993.344462"},{"key":"jghpc.2012010103-1","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2008.2004587"},{"key":"jghpc.2012010103-2","doi-asserted-by":"publisher","DOI":"10.1109\/12.57055"},{"key":"jghpc.2012010103-3","doi-asserted-by":"crossref","DOI":"10.1007\/978-1-4899-3276-1","author":"J.Baylis","year":"1998","journal-title":"Error-correcting codes: A mathematical introduction"},{"key":"jghpc.2012010103-4","doi-asserted-by":"publisher","DOI":"10.1504\/IJCCBS.2010.031709"},{"key":"jghpc.2012010103-5","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2008.58"},{"key":"jghpc.2012010103-6","doi-asserted-by":"crossref","unstructured":"Choi, J., Dongarra, J. J., & Walker, D. W. (1996). PB-BLAS: A set of parallel block basic linear algebra subprograms. 
In Proceedings of the Conference on Scalable High-Performance Computing (pp. 534-541).","DOI":"10.1002\/(SICI)1096-9128(199609)8:7<517::AID-CPE226>3.0.CO;2-W"},{"journal-title":"Error control coding fundamentals and applications","year":"2004","author":"D.Costello","key":"jghpc.2012010103-7"},{"key":"jghpc.2012010103-8","unstructured":"Dongarra, J. J., & Whaley, R. C. (1995). A user\u2019s guide to the BLACS v1.0 (Tech. Rep. No. CS-95-281). Knoxville, TN: University of Tennessee."},{"key":"jghpc.2012010103-9","doi-asserted-by":"publisher","DOI":"10.1016\/j.asoc.2008.04.011"},{"key":"jghpc.2012010103-10","doi-asserted-by":"crossref","unstructured":"Elnozahy, E. N., Johnson, D. B., & Zwaenepoel, W. (October 1992). The performance of consistent checkpointing. In Proceedings of the 11th Symposium on Reliable Distributed Systems (pp. 39-47).","DOI":"10.1109\/RELDIS.1992.235144"},{"key":"jghpc.2012010103-11","doi-asserted-by":"crossref","unstructured":"Hakkarinen, D., & Chen, Z. (2010, April 19-23). Algorithmic Cholesky factorization fault recovery. In Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium, Atlanta, GA.","DOI":"10.1109\/IPDPS.2010.5470436"},{"key":"jghpc.2012010103-12","doi-asserted-by":"publisher","DOI":"10.3923\/jas.2009.3947.3956"},{"key":"jghpc.2012010103-13","unstructured":"Hamidi, H., Vafaei, A., & Monadjemi, A. H. (2010). A fault-tolerant approach for matrix functions in image processing. 
Paper presented at the 6th Iranian Machine Vision and Image Processing Conference."},{"key":"jghpc.2012010103-14","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1984.1676475"},{"key":"jghpc.2012010103-15","doi-asserted-by":"publisher","DOI":"10.1109\/PROC.1986.13535"},{"key":"jghpc.2012010103-16","doi-asserted-by":"publisher","DOI":"10.1109\/12.4606"},{"key":"jghpc.2012010103-17","doi-asserted-by":"publisher","DOI":"10.1109\/TSP.2009.2031727"},{"key":"jghpc.2012010103-18","doi-asserted-by":"publisher","DOI":"10.1142\/S0218126607003708"},{"key":"jghpc.2012010103-19","doi-asserted-by":"publisher","DOI":"10.1002\/0470035706"},{"key":"jghpc.2012010103-20","doi-asserted-by":"publisher","DOI":"10.1109\/12.54836"},{"key":"jghpc.2012010103-21","doi-asserted-by":"crossref","unstructured":"Rexford, J., & Jha, N. K. (1992). Algorithm-based fault tolerance for floating-point operations in massively parallel systems. In Proceedings of the International Symposium on Circuits and Systems (pp. 649-652).","DOI":"10.1109\/ISCAS.1992.230168"},{"key":"jghpc.2012010103-22","doi-asserted-by":"publisher","DOI":"10.1145\/1670679.1670680"},{"key":"jghpc.2012010103-23","doi-asserted-by":"crossref","unstructured":"Turmon, M., Granat, R., & Katz, D. (2000). Software-implemented fault detection for high-performance space applications. In Proceedings of the IEEE International Conference on Dependable Systems and Networks (pp. 107-116).","DOI":"10.1109\/ICDSN.2000.857522"},{"key":"jghpc.2012010103-24","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2003.1197125"},{"key":"jghpc.2012010103-25","doi-asserted-by":"crossref","unstructured":"Veeravalli, V. S. (2009). Fault tolerance for arithmetic and logic unit. In Proceedings of the IEEE Southeast Conference (pp. 329-334).","DOI":"10.1109\/SECON.2009.5174100"},{"journal-title":"Principles of digital communication and coding","year":"1985","author":"A. 
J.Viterbi","key":"jghpc.2012010103-26"}],"container-title":["International Journal of Grid and High Performance Computing"],"original-title":[],"language":"ng","link":[{"URL":"https:\/\/www.igi-global.com\/viewtitle.aspx?TitleId=62996","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,25]],"date-time":"2025-03-25T02:26:15Z","timestamp":1742869575000},"score":1,"resource":{"primary":{"URL":"https:\/\/services.igi-global.com\/resolvedoi\/resolve.aspx?doi=10.4018\/jghpc.2012010103"}},"subtitle":[""],"short-title":[],"issued":{"date-parts":[[2012,1,1]]},"references-count":27,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2012,1]]}},"URL":"https:\/\/doi.org\/10.4018\/jghpc.2012010103","relation":{},"ISSN":["1938-0259","1938-0267"],"issn-type":[{"type":"print","value":"1938-0259"},{"type":"electronic","value":"1938-0267"}],"subject":[],"published":{"date-parts":[[2012,1,1]]}}}