{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,20]],"date-time":"2025-11-20T13:18:29Z","timestamp":1763644709059,"version":"3.45.0"},"reference-count":40,"publisher":"Wiley","issue":"25-26","license":[{"start":{"date-parts":[[2025,10,7]],"date-time":"2025-10-07T00:00:00Z","timestamp":1759795200000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"}],"content-domain":{"domain":["onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["Concurrency and Computation"],"published-print":{"date-parts":[[2025,11,30]]},"abstract":"<jats:title>ABSTRACT<\/jats:title>\n                  <jats:p>In today's distributed computing environments, the rapid generation of large\u2010scale data from diverse sources poses significant challenges in terms of storage, management, and processing, particularly for traditional relational databases. Hadoop has emerged as a widely adopted framework for handling such data through parallel processing across distributed clusters. Despite its advantages in scalability, flexibility, and fault tolerance, Hadoop suffers from inefficiencies related to high data access latency, redundant computations, and I\/O overhead, which degrade overall system performance. To mitigate these issues, researchers have proposed various caching mechanisms aimed at improving data access time, enhancing data locality, minimizing duplicate computations, and optimizing resource utilization. This paper provides a comprehensive survey and novel classification of existing caching strategies for Hadoop, categorizing them based on the specific Hadoop performance bottlenecks they address. A detailed comparative analysis is provided based on critical caching characteristics such as cached item type, cache management policies, replacement strategies, and access patterns. 
The effectiveness of these caching mechanisms is assessed by evaluating their impact on key Hadoop performance metrics. Statistical insights are also presented, highlighting the percentage of reviewed studies addressing specific Hadoop performance challenges and the frequency of performance metrics used for evaluation. Finally, this survey identifies hybrid caching as a promising future trend and proposes a novel approach termed Hybrid Intelligent Cache (HIC) as an example. HIC combines the strengths of two previously developed methods from distinct categories. The first method is the Hybrid Support Vector Machine\u2013Least Recently Used (H\u2010SVM\u2010LRU) algorithm, which enhances the traditional LRU cache replacement strategy by employing a Support Vector Machine (SVM) to predict future data access patterns for intelligent eviction. The second is Cache Locality with Q\u2010Learning in MapReduce Scheduling (CLQLMRS), a reinforcement learning\u2013based scheduling technique that optimizes task allocation by maximizing both cache locality and data locality. 
Experimental results demonstrate that HIC yields an average 31.2% improvement in job execution time, marking a significant advancement in intelligent caching for Hadoop ecosystems.<\/jats:p>","DOI":"10.1002\/cpe.70337","type":"journal-article","created":{"date-parts":[[2025,10,8]],"date-time":"2025-10-08T06:46:12Z","timestamp":1759905972000},"update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["A Systematic Overview of Caching Mechanisms to Improve Hadoop Performance"],"prefix":"10.1002","volume":"37","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-7857-2231","authenticated-orcid":false,"given":"Rana","family":"Ghazali","sequence":"first","affiliation":[{"name":"Department of Computer Engineering North Tehran Branch, Islamic Azad University  Tehran Iran"},{"name":"Department of Computing and Software McMaster University  Hamilton Ontario Canada"}]},{"given":"Douglas G.","family":"Down","sequence":"additional","affiliation":[{"name":"Department of Computing and Software McMaster University  Hamilton Ontario Canada"}]}],"member":"311","published-online":{"date-parts":[[2025,10,7]]},"reference":[{"key":"e_1_2_11_2_1","doi-asserted-by":"publisher","DOI":"10.32604\/cmc.2021.016462"},{"key":"e_1_2_11_3_1","first-page":"104","volume-title":"Proceedings\u20142022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS","author":"Uta A.","year":"2022"},{"key":"e_1_2_11_4_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.comcom.2022.07.008"},{"key":"e_1_2_11_5_1","unstructured":"J.ShahandH.Somani \u201cSurvey of Caching Mechanisms in Hadoop \u201d(2016) 562\u2013565."},{"key":"e_1_2_11_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/PERVASIVE.2015.7087016"},{"key":"e_1_2_11_7_1","first-page":"12","volume-title":"Proceedings\u2014IEEE\/ACM International Workshop on Grid Computing","author":"Zhang 
J.","year":"2012"},{"key":"e_1_2_11_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2013.83"},{"key":"e_1_2_11_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CCGrid.2014.11"},{"key":"e_1_2_11_10_1","unstructured":"G.Singh P.Chandra andR.Tahir \u201cA Dynamic Caching Mechanism for Hadoop Using Memcached \u201d(2012)."},{"key":"e_1_2_11_11_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920881"},{"key":"e_1_2_11_12_1","doi-asserted-by":"crossref","unstructured":"P.Bhatotia A.Wieder R.Rodrigues U. A.Acar andR.Pasquini \u201cIncoop: MapReduce for Incremental Computations. Proceedings of the 2nd ACM Symposium on Cloud Computing SOCC 2011 \u201d(2011) https:\/\/doi.org\/10.1145\/2038916.2038923.","DOI":"10.1145\/2038916.2038923"},{"key":"e_1_2_11_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/TST.2014.6733207"},{"key":"e_1_2_11_14_1","doi-asserted-by":"publisher","DOI":"10.14778\/2733004.2733037"},{"key":"e_1_2_11_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/BigMM.2016.10"},{"key":"e_1_2_11_16_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jnca.2024.104080"},{"key":"e_1_2_11_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/PDCAT.2016.060"},{"key":"e_1_2_11_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/384265.291048"},{"key":"e_1_2_11_19_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10586-017-0920-6"},{"key":"e_1_2_11_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2987550.2987553"},{"key":"e_1_2_11_21_1","doi-asserted-by":"publisher","DOI":"10.1186\/s13677-022-00322-5"},{"key":"e_1_2_11_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICIT.2014.18"},{"key":"e_1_2_11_23_1","doi-asserted-by":"crossref","unstructured":"S.Li M. A.Maddah\u2010Ali andA. S.Avestimehr \u201cCoded MapReduce \u201d(2015) 1\u201316.","DOI":"10.1109\/GLOCOM.2016.7841765"},{"key":"e_1_2_11_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2016.40"},{"key":"e_1_2_11_25_1","unstructured":"G.Ananthanarayanan A.Ghodsi A.Wang et al. 
\u201cPacman: Coordinated Memory Caching for Parallel Jobs \u201dUniversity of California Berkeley Facebook Microsoft Research KTH\/Sweden."},{"key":"e_1_2_11_26_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2015.04.087"},{"key":"e_1_2_11_27_1","unstructured":"M.ShrivastavaandH.Bischof \u201cHadoop\u2010Collaborative Caching in Real\u2010Time HDFS Thesis \u201d(2012) Rochester Institute of Technology Rochester New York."},{"key":"e_1_2_11_28_1","unstructured":"L.Code \u201cMaster's Thesis Cache Affinity\u2010Aware In\u2010Memory Caching Management for Hadoop \u201dJaewon Kwak."},{"key":"e_1_2_11_29_1","doi-asserted-by":"crossref","unstructured":"J.Kwak E.Hwang T.Yoo B.Nam andY.Choi \u201cIn\u2010Memory Caching Orchestration for Hadoop \u201d(2016) https:\/\/doi.org\/10.1109\/CCGrid.2016.73.","DOI":"10.1109\/CCGrid.2016.73"},{"key":"e_1_2_11_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2987550.2987553"},{"key":"e_1_2_11_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDEW.2019.00-21"},{"key":"e_1_2_11_32_1","doi-asserted-by":"crossref","unstructured":"R.Ghazali S.Adabi A.Rezaee D. G.Down andA.Movaghar \u201cHadoop\u2010Oriented SVM\u2010LRU (H\u2010SVM\u2010LRU): An Intelligent Cache Replacement Algorithm to Improve MapReduce Performance \u201d(2023) https:\/\/doi.org\/10.48550\/arXiv.2309.16471.","DOI":"10.21203\/rs.3.rs-4971579\/v1"},{"key":"e_1_2_11_33_1","first-page":"2405","volume-title":"Proceedings\u2014International Conference on Distributed Computing Systems","author":"Luo Y.","year":"2017"},{"key":"e_1_2_11_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICACCS.2019.8728373"},{"key":"e_1_2_11_35_1","unstructured":"J.Lee K. 
T.Kim andT.Youn\u2010Chen \u201cMapReduce Performance Scaling Using Data Prefetching \u201d(2022) 9 26\u201331."},{"key":"e_1_2_11_36_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.robot.2022.104228"},{"key":"e_1_2_11_37_1","first-page":"41","volume-title":"Proceedings\u20142nd IEEE International Conference on Cloud Computing Technology and Science","author":"Dong B.","year":"2010"},{"key":"e_1_2_11_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2022.3175596"},{"key":"e_1_2_11_39_1","first-page":"1","article-title":"Smart Data Prefetching Using KNN to Improve Hadoop Performance","volume":"12","author":"Ghazali R.","year":"2025","journal-title":"EAI Endorsed Transactions on Scalable Information Systems"},{"key":"e_1_2_11_40_1","unstructured":"S.Huang J.Huang J.Dai T.Xie andB.Huang \u201cThe HiBench Benchmark Suite: Characterization of the MapReduce\u2010Based Data Analysis \u201d(2014) 47 https:\/\/doi.org\/10.1109\/ICDEW.2010.54527."},{"key":"e_1_2_11_41_1","unstructured":"\u201cHibench \u201dhttps:\/\/github.com\/Intel\u2010bigdata\/HiBench."}],"container-title":["Concurrency and Computation: Practice and 
Experience"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/cpe.70337","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T09:03:53Z","timestamp":1762938233000},"score":1,"resource":{"primary":{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/10.1002\/cpe.70337"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,7]]},"references-count":40,"journal-issue":{"issue":"25-26","published-print":{"date-parts":[[2025,11,30]]}},"alternative-id":["10.1002\/cpe.70337"],"URL":"https:\/\/doi.org\/10.1002\/cpe.70337","archive":["Portico"],"relation":{},"ISSN":["1532-0626","1532-0634"],"issn-type":[{"type":"print","value":"1532-0626"},{"type":"electronic","value":"1532-0634"}],"subject":[],"published":{"date-parts":[[2025,10,7]]},"assertion":[{"value":"2024-12-20","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-23","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-07","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}],"article-number":"e70337"}}