{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,31]],"date-time":"2025-12-31T12:08:53Z","timestamp":1767182933129,"version":"build-2065373602"},"reference-count":36,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2021,5,31]],"date-time":"2021-05-31T00:00:00Z","timestamp":1622419200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100015515","name":"Malaysia Ministry of Education","doi-asserted-by":"publisher","award":["GPF097C-2020"],"award-info":[{"award-number":["GPF097C-2020"]}],"id":[{"id":"10.13039\/501100015515","id-type":"DOI","asserted-by":"publisher"}]},{"name":"GPF097C-2020","award":["GPF097C-2020"],"award-info":[{"award-number":["GPF097C-2020"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Hadoop MapReduce reactively detects and recovers faults after they occur based on the static heartbeat detection and the re-execution from scratch techniques. However, these techniques lead to excessive response time penalties and inefficient resource consumption during detection and recovery. Existing fault-tolerance solutions intend to mitigate the limitations without considering critical conditions such as fail-slow faults, the impact of faults at various infrastructure levels and the relationship between the detection and recovery stages. This paper analyses the response time under two main conditions: fail-stop and fail-slow, when they manifest with node, service, and the task at runtime. In addition, we focus on the relationship between the time for detecting and recovering faults. The experimental analysis is conducted on a real Hadoop cluster comprising MapReduce, YARN and HDFS frameworks. Our analysis shows that the recovery of a single fault leads to an average of 67.6% response time penalty. 
Even though the detection and recovery times are well-tuned, data locality and resource availability must also be considered to obtain the optimum tolerance time and the lowest penalties.<\/jats:p>","DOI":"10.3390\/s21113799","type":"journal-article","created":{"date-parts":[[2021,5,31]],"date-time":"2021-05-31T03:45:29Z","timestamp":1622432729000},"page":"3799","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["Experimental Analysis in Hadoop MapReduce: A Closer Look at Fault Detection and Recovery Techniques"],"prefix":"10.3390","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1467-3292","authenticated-orcid":false,"given":"Muntadher","family":"Saadoon","sequence":"first","affiliation":[{"name":"Department of Software Engineering, Faculty of Computer Science and Information Technology, University Malaya, Kuala Lumpur 50603, Malaysia"}],"role":[{"role":"author","vocab":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9598-8813","authenticated-orcid":false,"given":"Siti Hafizah Ab","family":"Hamid","sequence":"additional","affiliation":[{"name":"Department of Software Engineering, Faculty of Computer Science and Information Technology, University Malaya, Kuala Lumpur 50603, Malaysia"}],"role":[{"role":"author","vocab":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8375-6441","authenticated-orcid":false,"given":"Hazrina","family":"Sofian","sequence":"additional","affiliation":[{"name":"Department of Software Engineering, Faculty of Computer Science and Information Technology, University Malaya, Kuala Lumpur 50603, Malaysia"}],"role":[{"role":"author","vocab":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5486-9882","authenticated-orcid":false,"given":"Hamza","family":"Altarturi","sequence":"additional","affiliation":[{"name":"Department of Software Engineering, Faculty of Computer Science and Information Technology, University Malaya, Kuala Lumpur 50603, 
Malaysia"}],"role":[{"role":"author","vocab":"crossref"}]},{"given":"Nur","family":"Nasuha","sequence":"additional","affiliation":[{"name":"Department of Software Engineering, Faculty of Computer Science and Information Technology, University Malaya, Kuala Lumpur 50603, Malaysia"}],"role":[{"role":"author","vocab":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8314-6464","authenticated-orcid":false,"given":"Zati Hakim","family":"Azizul","sequence":"additional","affiliation":[{"name":"Department of Software Engineering, Faculty of Computer Science and Information Technology, University Malaya, Kuala Lumpur 50603, Malaysia"}],"role":[{"role":"author","vocab":"crossref"}]},{"given":"Asmiza Abdul","family":"Sani","sequence":"additional","affiliation":[{"name":"Department of Software Engineering, Faculty of Computer Science and Information Technology, University Malaya, Kuala Lumpur 50603, Malaysia"}],"role":[{"role":"author","vocab":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9193-2430","authenticated-orcid":false,"given":"Adeleh","family":"Asemi","sequence":"additional","affiliation":[{"name":"Department of Software Engineering, Faculty of Computer Science and Information Technology, University Malaya, Kuala Lumpur 50603, Malaysia"}],"role":[{"role":"author","vocab":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,5,31]]},"reference":[{"key":"ref_1","unstructured":"Ean, J., and Ghemawat, S. (2008, January 6\u20138). MapReduce: Simplified data processing on large cluster. Proceedings of the 6th Symposium on Operating Systems Design & Implementation, Berkeley, CA, USA."},{"key":"ref_2","first-page":"185","article-title":"Improving fault diagnosis performance using hadoop mapreduce for efficient classification and analysis of large data sets","volume":"29","author":"Alkasem","year":"2018","journal-title":"J. 
Comput."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Azeez, N.A., Ayemobola, T.J., Misra, S., Maskeli\u016bnas, R., and Dama\u0161evi\u010dius, R. (2019). Network intrusion detection with a hashing based apriori algorithm using Hadoop MapReduce. Computers, 8.","DOI":"10.3390\/computers8040086"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Kumar Behera, R., Kumar Rath, S., Misra, S., Dama\u0161evi\u010dius, R., and Maskeli\u016bnas, R. (2019). Distributed centrality analysis of social network data using MapReduce. Algorithms, 12.","DOI":"10.3390\/a12080161"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"17322","DOI":"10.1109\/ACCESS.2017.2742698","article-title":"Fault and error tolerance in neural networks: A review","volume":"5","author":"Girau","year":"2017","journal-title":"IEEE Access"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"54","DOI":"10.1016\/j.jnca.2015.11.014","article-title":"Availability in the cloud: State of the art","volume":"60","author":"Nabi","year":"2016","journal-title":"J. Netw. Comput. Appl."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Liu, J., Shen, H., Chi, H., Narman, H.S., Yang, Y., Cheng, L., and Chung, W. (2020). A Low-Cost Multi-Failure Resilient Replication Scheme for High-Data Availability in Cloud Storage. IEEE\/ACM Trans. Netw.","DOI":"10.1109\/TNET.2020.3027814"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Asghar, H., and Nazir, B. (2021). Analysis and implementation of reactive fault tolerance techniques in Hadoop: A comparative study. J. Supercomput., 1\u201327.","DOI":"10.1007\/s11227-020-03491-9"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"112","DOI":"10.1016\/j.ins.2016.08.013","article-title":"Failure detector abstractions for MapReduce-based systems","volume":"379","author":"Memishi","year":"2017","journal-title":"Inf. 
Sci."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"2310","DOI":"10.1002\/cpe.3044","article-title":"Towards self-caring MapReduce: A study of performance penalties under faults","volume":"27","author":"Kadirvel","year":"2015","journal-title":"Concurr. Comput. Pract. Exp."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Faghri, F., Bazarbayev, S., Overholt, M., Farivar, R., Campbell, R.H., and Sanders, W.H. (2012, January 4). Failure scenario as a service (FSaaS) for Hadoop clusters. Proceedings of the Workshop on Secure and Dependable Middleware for Cloud Monitoring and Management, Montreal, QC, USA.","DOI":"10.1145\/2405186.2405191"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Dinu, F., and Ng, T.E. (2012, January 18\u201322). Understanding the effects and implications of compute node related failures in hadoop. Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, Delft, The Netherlands.","DOI":"10.1145\/2287076.2287108"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Vavilapalli, V., Murthy, A., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., and Seth, S. (2013, January 1). Apache hadoop yarn: Yet another resource negotiator. Proceedings of the 4th annual Symposium on Cloud Computing, Santa Clara, CA, USA.","DOI":"10.1145\/2523616.2523633"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Rahman, M.T., Gabriel, E., and Subhlok, J. (2017, January 5\u20138). Performance implications of failures on MapReduce applications. 
Proceedings of the 2017 IEEE International Conference on Cluster Computing (CLUSTER), Honolulu, HI, USA.","DOI":"10.1109\/CLUSTER.2017.87"},{"key":"ref_15","first-page":"7","article-title":"Improving MapReduce performance in heterogeneous environments","volume":"8","author":"Zaharia","year":"2008","journal-title":"Osdi"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Chen, Q., Zhang, D., Guo, M., Deng, Q., and Guo, S. (July, January 29). Samr: A self-adaptive mapreduce scheduling algorithm in heterogeneous environment. Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology, Bradford, UK.","DOI":"10.1109\/CIT.2010.458"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Gupta, C., Bansal, M., Chuang, T.C., Sinha, R., and Ben-Romdhane, S. (July, January 27). Astro: A predictive model for anomaly detection and feedback-based scheduling on Hadoop. Proceedings of the 2014 IEEE International Conference on Big Data (Big Data), Anchorage, AK, USA.","DOI":"10.1109\/BigData.2014.7004315"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Rosa, A., Chen, L.Y., and Binder, W. (2015, January 15\u201316). Catching failures of failures at big-data clusters: A two-level neural network approach. Proceedings of the 2015 IEEE 23rd International Symposium on Quality of Service (IWQoS), Portland, OR, USA.","DOI":"10.1109\/IWQoS.2015.7404739"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Soualhia, M., Khomh, F., and Tahar, S. (2015, January 14\u201316). ATLAS: An adaptive failure-aware scheduler for hadoop. 
Proceedings of the 2015 IEEE 34th International Performance Computing and Communications Conference (IPCCC), Nanjing, China.","DOI":"10.1109\/PCCC.2015.7410316"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"553","DOI":"10.1109\/TCC.2018.2805812","article-title":"A dynamic and failure-aware task scheduling framework for hadoop","volume":"8","author":"Soualhia","year":"2018","journal-title":"IEEE Trans. Cloud Comput."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"208","DOI":"10.1016\/j.future.2016.02.015","article-title":"Enabling fast failure recovery in shared Hadoop clusters: Towards failure-aware scheduling","volume":"74","author":"Yildiz","year":"2017","journal-title":"Future Gener. Comput. Syst."},{"key":"ref_22","unstructured":"Kadirvel, S., Ho, J., and Fortes, J.A. (2013, January 26\u201328). Fault management in Map-Reduce through early detection of anomalous nodes. Proceedings of the 10th International Conference on Autonomic Computing (ICAC 13), San Jose, CA, USA."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Quian\u00e9-Ruiz, J.A., Pinkel, C., Schad, J., and Dittrich, J. (2011, January 11\u201316). RAFTing MapReduce: Fast recovery on the RAFT. Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, Hannover, Germany.","DOI":"10.1109\/ICDE.2011.5767877"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"3572","DOI":"10.1007\/s11227-018-2716-8","article-title":"Fast Recovery MapReduce (FAR-MR) to accelerate failure recovery in big data applications","volume":"76","author":"Zhu","year":"2020","journal-title":"J. Supercomput."},{"key":"ref_25","unstructured":"Liu, J., Wang, P., Zhou, J., and Li, K. (2019). McTAR: A multi-trigger checkpointing tactic for fast task recovery in MapReduce. IEEE Trans. Serv. Comput."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Shvachko, K., Kuang, H., Radia, S., and Chansler, R. (2010, January 3\u20137). 
The hadoop distributed file system. Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Lake Tahoe, NV, USA.","DOI":"10.1109\/MSST.2010.5496972"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"832","DOI":"10.1007\/s10766-015-0395-0","article-title":"MapReduce parallel programming model: A state-of-the-art survey","volume":"44","author":"Li","year":"2016","journal-title":"Int. J. Parallel Program."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"375","DOI":"10.1145\/568522.568525","article-title":"A survey of rollback-recovery protocols in message-passing systems","volume":"34","author":"Elnozahy","year":"2002","journal-title":"ACM Comput. Surv."},{"key":"ref_29","unstructured":"Xie, J., Yin, S., Ruan, X., Ding, Z., Tian, Y., Majors, J., Manzanares, A., and Qin, X. (2010, January 19\u201323). Improving mapreduce performance through data placement in heterogeneous hadoop clusters. Proceedings of the 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), Atlanta, GA, USA."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1109\/TDSC.2004.2","article-title":"Basic concepts and taxonomy of dependable and secure computing","volume":"1","author":"Avizienis","year":"2004","journal-title":"IEEE Trans. Dependable Secur. Comput."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"34","DOI":"10.1109\/COMST.2008.4564478","article-title":"Fault tolerance for highly available internet services: Concepts, approaches, and issues","volume":"10","author":"Ayari","year":"2008","journal-title":"IEEE Commun. Surv. 
Tutor."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"63862","DOI":"10.1109\/ACCESS.2020.2984778","article-title":"A Novel Configuration Tuning Method Based on Feature Selection for Hadoop MapReduce","volume":"8","author":"Liu","year":"2020","journal-title":"IEEE Access"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1016\/j.jnca.2017.08.011","article-title":"Cloud storage reliability for big data applications: A state of the art survey","volume":"97","author":"Nachiappan","year":"2017","journal-title":"J. Netw. Comput. Appl."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Zhu, H., and Chen, H. (2011, January 12\u201315). Adaptive failure detection via heartbeat under Hadoop. Proceedings of the 2011 IEEE Asia-Pacific Services Computing Conference, Jeju, Korea.","DOI":"10.1109\/APSCC.2011.46"},{"key":"ref_35","unstructured":"Chen, Y., Ganapathi, A.S., Griffith, R., and Katz, R.H. (2010). A methodology for understanding mapreduce performance under diverse workloads. EECS Department, University of California, Berkeley, Tech. Rep. UCB\/EECS-2010-135, University of California."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Chen, Y., Alspaugh, S., and Katz, R. (2012). Interactive analytical processing in big data systems: A cross-industry study of mapreduce workloads. 
arXiv.","DOI":"10.21236\/ADA561769"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/11\/3799\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:09:13Z","timestamp":1760162953000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/11\/3799"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,5,31]]},"references-count":36,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2021,6]]}},"alternative-id":["s21113799"],"URL":"https:\/\/doi.org\/10.3390\/s21113799","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2021,5,31]]}}}