{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,23]],"date-time":"2025-09-23T14:16:56Z","timestamp":1758637016471,"version":"3.40.3"},"publisher-location":"Cham","reference-count":20,"publisher":"Springer Nature Switzerland","isbn-type":[{"type":"print","value":"9783031697654"},{"type":"electronic","value":"9783031697661"}],"license":[{"start":{"date-parts":[[2024,1,1]],"date-time":"2024-01-01T00:00:00Z","timestamp":1704067200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,8,26]],"date-time":"2024-08-26T00:00:00Z","timestamp":1724630400000},"content-version":"vor","delay-in-days":238,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In the last years, applications related to Artificial Intelligence and big data, among others, have been involved. There is a need to improve I\/O operations to avoid bottlenecks in accessing a larger amount of data. 
For this purpose, the Expand Ad-Hoc parallel file system is being designed and developed.<\/jats:p><jats:p>Since these applications have very long execution times, fault tolerance mechanisms in the file system are necessary to allow them to continue running in the presence of failures.<\/jats:p><jats:p>This work introduces a fault-tolerant design based on data replication for the Expand Ad-Hoc parallel file system and an initial evaluation conducted on the HPC4AI Laboratory supercomputer in Torino.<\/jats:p><jats:p>The evaluation of Expand Ad-Hoc with fault tolerance found that, despite data replication, its performance and scalability are generally better than those of other parallel file systems without fault tolerance.<\/jats:p>","DOI":"10.1007\/978-3-031-69766-1_5","type":"book-chapter","created":{"date-parts":[[2024,8,25]],"date-time":"2024-08-25T19:02:05Z","timestamp":1724612525000},"page":"62-76","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Fault Tolerant in\u00a0the\u00a0Expand Ad-Hoc Parallel File 
System"],"prefix":"10.1007","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-3574-9189","authenticated-orcid":false,"given":"Dario","family":"Mu\u00f1oz-Mu\u00f1oz","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5067-1502","authenticated-orcid":false,"given":"Felix","family":"Garcia-Carballeira","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7561-3619","authenticated-orcid":false,"given":"Diego","family":"Camarmas-Alonso","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6185-653X","authenticated-orcid":false,"given":"Alejandro","family":"Calderon-Mateos","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1413-4793","authenticated-orcid":false,"given":"Jesus","family":"Carretero","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,8,26]]},"reference":[{"key":"5_CR1","unstructured":"BeeGFS: BeeGFS documentation 7.4.2 \u00bb architecture (2024). https:\/\/doc.beegfs.io\/7.4.2\/architecture\/overview.html#mirroring (Accessed 18 March 2024)"},{"key":"5_CR2","unstructured":"Braam, P.: The lustre storage architecture. CoRR arXiv: 1903.01955 (2019)"},{"issue":"1","key":"5_CR3","doi-asserted-by":"publisher","first-page":"4","DOI":"10.1007\/s11390-020-9801-1","volume":"35","author":"A Brinkmann","year":"2020","unstructured":"Brinkmann, A., et al.: Ad hoc file systems for high-performance computing. J. Comput. Sci. Technol. 35(1), 4\u201326 (2020)","journal-title":"J. Comput. Sci. Technol."},{"key":"5_CR4","unstructured":"BSC: MareNostrum specification (2023). https:\/\/www.bsc.es\/marenostrum\/marenostrum\/technical-information, (Accessed 18 March 2024)"},{"key":"5_CR5","doi-asserted-by":"crossref","unstructured":"Devarajan, H., Zheng, H., Kougkas, A., Sun, X.H., Vishwanath, V.: Dlio: A data-centric benchmark for scientific deep learning applications. 
In: 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), vol. 1(81\u201391) (2021)","DOI":"10.1109\/CCGrid51090.2021.00018"},{"key":"5_CR6","doi-asserted-by":"publisher","unstructured":"Dong, S., Kryczka, A., Jin, Y., Stumm, M.: Rocksdb: evolution of development priorities in a key-value store serving large-scale applications. ACM Trans. Storage 17(4) (2021). https:\/\/doi.org\/10.1145\/3483840","DOI":"10.1145\/3483840"},{"key":"5_CR7","doi-asserted-by":"publisher","unstructured":"Garcia-Carballeira, F., Camarmas-Alonso, D., Caderon-Mateos, A., Carretero, J.: A new ad-hoc parallel file system for hpc environments based on the expand parallel file system. In: 2023 22nd International Symposium on Parallel and Distributed Computing (ISPDC), pp. 69\u201376 (2023). https:\/\/doi.org\/10.1109\/ISPDC59212.2023.00015","DOI":"10.1109\/ISPDC59212.2023.00015"},{"key":"5_CR8","doi-asserted-by":"publisher","unstructured":"Han, J., Kim, D., Eom, H.: Improving the performance of lustre file system in hpc environments. In: 2016 IEEE 1st International Workshops on Foundations and Applications of Self* Systems (FAS*W), pp. 84\u201389 (2016). https:\/\/doi.org\/10.1109\/FAS-W.2016.29","DOI":"10.1109\/FAS-W.2016.29"},{"key":"5_CR9","unstructured":"Herold, F., Breuner, S., Heichler, J.: An introduction to beegfs. Tech. Rep, ThinkParQ, Kaiserslautern, Germany (2014)"},{"key":"5_CR10","unstructured":"HPC4AI: HPC4AI laboratory specification (2024). https:\/\/hpc4ai.unito.it\/. 
(Accessed March 18 2024)"},{"key":"5_CR11","unstructured":"Masih, Z.: On demand file systems with beegfs (2023)"},{"key":"5_CR12","doi-asserted-by":"crossref","unstructured":"Mu\u00f1oz-Mu\u00f1oz, D., Garcia-Carballeira, F., Camarmas-Alonso, D., Calderon-Mateos, A., Carretero, J.: Fault tolerant in the Expand Ad-Hoc parallel file system (2024)","DOI":"10.1007\/978-3-031-69766-1_5"},{"key":"5_CR13","doi-asserted-by":"publisher","unstructured":"Reed, D.A.: Beowulf clusters: from research curiosity to exascale. In: Proceedings of the 20 Years of Beowulf Workshop on Honor of Thomas Sterling\u2019s 65th Birthday, pp. 28-33. Beowulf 2014, Association for Computing Machinery, New York (2014). https:\/\/doi.org\/10.1145\/2737909.2737913","DOI":"10.1145\/2737909.2737913"},{"key":"5_CR14","unstructured":"Schmuck, F., Haskin, R.: GPFS: A Shared-Disk file system for large computing clusters. In: Conference on File and Storage Technologies (FAST 02). USENIX Association, Monterey, CA (Jan 2002). https:\/\/www.usenix.org\/conference\/fast-02\/gpfs-shared-disk-file-system-large-computing-clusters"},{"issue":"24","key":"5_CR15","doi-asserted-by":"publisher","first-page":"3056","DOI":"10.3390\/rs11243056","volume":"11","author":"R Sedona","year":"2019","unstructured":"Sedona, R., Cavallaro, G., Jitsev, J., Strube, A., Riedel, M., Benediktsson, J.A.: Remote sensing big data classification with high performance distributed deep learning. Remote Sensing 11(24), 3056 (2019)","journal-title":"Remote Sensing"},{"key":"5_CR16","unstructured":"SLURM: Slurm workload manager (2023). https:\/\/slurm.schedmd.com\/documentation.html, (Accessed 18 March 2024"},{"key":"5_CR17","doi-asserted-by":"crossref","unstructured":"Soumagne, J., et al.: Mercury: enabling remote procedure call for high-performance computing. In: 2013 IEEE International Conference on Cluster Computing (CLUSTER), pp.\u00a01\u20138. 
IEEE (2013)","DOI":"10.1109\/CLUSTER.2013.6702617"},{"key":"5_CR18","doi-asserted-by":"crossref","unstructured":"Vef, M.A., et al.: Gekkofs-a temporary distributed file system for hpc applications. In: 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp. 319\u2013324. IEEE (2018)","DOI":"10.1109\/CLUSTER.2018.00049"},{"issue":"1","key":"5_CR19","doi-asserted-by":"publisher","first-page":"72","DOI":"10.1007\/s11390-020-9797-6","volume":"35","author":"MA Vef","year":"2020","unstructured":"Vef, M.A., Moti, N., S\u00fc\u00df, T., Tacke, M., Tocci, T., Nou, R., Miranda, A., Cortes, T., Brinkmann, A.: Gekkofs: A temporary burst buffer file system for hpc applications. J. Comput. Sci. Technol. 35(1), 72\u201391 (2020)","journal-title":"J. Comput. Sci. Technol."},{"key":"5_CR20","doi-asserted-by":"crossref","unstructured":"Wang, T., Mohror, K., Moody, A., Sato, K., Yu, W.: An ephemeral burst-buffer file system for scientific applications. In: SC 2016: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 807\u2013818. 
IEEE (2016)","DOI":"10.1109\/SC.2016.68"}],"container-title":["Lecture Notes in Computer Science","Euro-Par 2024: Parallel Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-031-69766-1_5","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,27]],"date-time":"2024-11-27T06:06:27Z","timestamp":1732687587000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-031-69766-1_5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"ISBN":["9783031697654","9783031697661"],"references-count":20,"URL":"https:\/\/doi.org\/10.1007\/978-3-031-69766-1_5","relation":{},"ISSN":["0302-9743","1611-3349"],"issn-type":[{"type":"print","value":"0302-9743"},{"type":"electronic","value":"1611-3349"}],"subject":[],"published":{"date-parts":[[2024]]},"assertion":[{"value":"26 August 2024","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"The authors have no competing interests to declare that are relevant to the content of this article.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Disclosure of Interests"}},{"value":"Euro-Par","order":1,"name":"conference_acronym","label":"Conference Acronym","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"European Conference on Parallel Processing","order":2,"name":"conference_name","label":"Conference Name","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Madrid","order":3,"name":"conference_city","label":"Conference City","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Spain","order":4,"name":"conference_country","label":"Conference Country","group":{"name":"ConferenceInfo","label":"Conference 
Information"}},{"value":"2024","order":5,"name":"conference_year","label":"Conference Year","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"26 August 2024","order":7,"name":"conference_start_date","label":"Conference Start Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"30 August 2024","order":8,"name":"conference_end_date","label":"Conference End Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"30","order":9,"name":"conference_number","label":"Conference Number","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"europar2024","order":10,"name":"conference_id","label":"Conference ID","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"https:\/\/2024.euro-par.org\/","order":11,"name":"conference_url","label":"Conference URL","group":{"name":"ConferenceInfo","label":"Conference Information"}}]}}