{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,18]],"date-time":"2025-09-18T04:12:51Z","timestamp":1758168771077,"version":"3.44.0"},"reference-count":28,"publisher":"Springer Science and Business Media LLC","issue":"9","license":[{"start":{"date-parts":[[2025,8,30]],"date-time":"2025-08-30T00:00:00Z","timestamp":1756512000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,8,30]],"date-time":"2025-08-30T00:00:00Z","timestamp":1756512000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100011104","name":"Universitat Aut\u00f2noma de Barcelona","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100011104","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Cluster Comput"],"published-print":{"date-parts":[[2025,10]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Accessing large volumes of data presents a significant challenge when finding the best strategies to manage the data efficiently. Deep learning applications require the processing of massive amounts of data, which implies a considerable access Input\/Output (I\/O) load on computer systems. During training, interaction with the I\/O system intensifies as files are continuously accessed to read data sets. This persistent access could overload the file system, which, in turn, adversely impacts application performance and efficient storage system utilization. Several factors influence the I\/O of these applications, and one of the most relevant is the variety of file formats in which datasets can be stored. The choice of file format depends on the use case, as each format defines how information is stored. Some file formats have features that promote efficient access to datasets during the training phase, which can improve the performance of deep learning applications. Likewise, it is also important that the format adapts to the context, in this case, to an HPC system with a parallel file system. We will propose an image preprocessing method for cases where performance improves with parallel file access. This method will transform image data sets from their original JPEG format to the more efficient HDF5 format. Thus, our research will focus on the importance of understanding the mode of data access, spatial and temporal patterns, and the level of parallelism in file access to determine whether it is advisable to change the storage format.<\/jats:p>","DOI":"10.1007\/s10586-025-05283-3","type":"journal-article","created":{"date-parts":[[2025,8,30]],"date-time":"2025-08-30T11:13:29Z","timestamp":1756552409000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Deep learning data handling: exploring file formats and access strategies"],"prefix":"10.1007","volume":"28","author":[{"given":"Edixon","family":"Parraga","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Betzabeth","family":"Leon","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sandra","family":"Mendez","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dolores","family":"Rexachs","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Daniel","family":"Franco","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Emilio","family":"Luque","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2025,8,30]]},"reference":[{"key":"5283_CR1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.Companion.2012.236","author":"B Behzad","year":"2012","unstructured":"Behzad, B., Huchette, J., Luu, H., Aydt, R., Koziol, Q., Prabhat, M., Byna, S., Chaarawi, M., Yao, Y.: SC companion: high performance computing. Netw. Storage Anal. (2012). https:\/\/doi.org\/10.1109\/SC.Companion.2012.236","journal-title":"Netw. Storage Anal."},{"issue":"1","key":"5283_CR2","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-019-0281-5","volume":"7","author":"AZ Faroukhi","year":"2020","unstructured":"Faroukhi, A.Z., El Alaoui, I., Gahi, Y., Amine, A.: Big data monetization throughout big data value chain: a comprehensive review. J. Big Data 7(1), 1\u201322 (2020)","journal-title":"J. Big Data"},{"key":"5283_CR3","doi-asserted-by":"crossref","unstructured":"Chien, S.W., Markidis, S., Sishtla, C.P., Santos, L., Herman, P., Narasimhamurthy, S., Laure, E.: Characterizing deep-learning I\/O workloads in TensorFlow. In: 2018 IEEE\/ACM 3rd International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS), pp. 54\u201363. IEEE\u00a0(2018)","DOI":"10.1109\/PDSW-DISCS.2018.00011"},{"key":"5283_CR4","doi-asserted-by":"crossref","unstructured":"Rajesh, N., Bateman, K., Bez, J.L., Byna, S., Kougkas, A., Sun, X.-H.: TunIO: An AI-powered Framework for Optimizing HPC I\/O. In: 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 494\u2013505. IEEE,\u00a0(2024)","DOI":"10.1109\/IPDPS57955.2024.00050"},{"key":"5283_CR5","doi-asserted-by":"publisher","unstructured":"Behzad, B., Luu, H.V.T., Huchette, J., Byna, S., Prabhat, Aydt, R., Koziol, Q., Snir, M.: Taming parallel I\/O complexity with auto-tuning. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. SC \u201913. Association for Computing Machinery, New York (2013). https:\/\/doi.org\/10.1145\/2503210.2503278","DOI":"10.1145\/2503210.2503278"},{"key":"5283_CR6","unstructured":"The HDF Group: HDF5 User\u2019s Guide. https:\/\/docs.hdfgroup.org\/hdf5\/v1_14\/index.html. Accessed 20 March 2024\u00a0"},{"issue":"2","key":"5283_CR7","doi-asserted-by":"publisher","first-page":"96","DOI":"10.1109\/MMUL.2017.38","volume":"24","author":"G Hudson","year":"2017","unstructured":"Hudson, G., L\u00e9ger, A., Niss, B., Sebesty\u00e9n, I.: JPEG at 25: still going strong. IEEE MultiMed 24(2), 96\u2013103 (2017). https:\/\/doi.org\/10.1109\/MMUL.2017.38","journal-title":"IEEE MultiMed"},{"key":"5283_CR8","doi-asserted-by":"publisher","unstructured":"Wallace, G.K. (1992) The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics.\u00a0https:\/\/doi.org\/10.1109\/30.125072","DOI":"10.1109\/30.125072"},{"key":"5283_CR9","volume-title":"Compressed Image File Formats JPEG, PNG, GIF, XBM, BMP","author":"J Miano","year":"2000","unstructured":"Miano, J.: Compressed Image File Formats JPEG, PNG, GIF, XBM, BMP, 2nd edn. Addison-Wesley, Massachusetts (2000)","edition":"2"},{"key":"5283_CR10","unstructured":"Chapman, Hall: High performance parallel I\/O, 1st edn. CRC Press is an imprint of Taylor & Francis Group an Informa business, Boca Raton (2015)"},{"key":"5283_CR11","doi-asserted-by":"publisher","first-page":"16","DOI":"10.3389\/fninf.2018.00016","volume":"12","author":"S-A Dragly","year":"2018","unstructured":"Dragly, S.-A., Hobbi Mobarhan, M., Lepper\u00f8d, M.E., Tenn\u00f8e, S., Fyhn, M., Hafting, T., Malthe-S\u00f8renssen, A.: Experimental directory structure (Exdir): an alternative to HDF5 without introducing a new file format. Front. Neuroinf. 12, 16 (2018)","journal-title":"Front. Neuroinf."},{"key":"5283_CR12","doi-asserted-by":"publisher","unstructured":"Zheng, H., Vishwanath, V., Koziol, Q., Tang, H., Ravi, J., Mainzer, J., Byna, S.: HDF5 Cache VOL: Efficient and scalable parallel I\/O through caching data on node-local storage. In: 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pp. 61\u201370 (2022). https:\/\/doi.org\/10.1109\/CCGrid54584.2022.00015","DOI":"10.1109\/CCGrid54584.2022.00015"},{"issue":"4","key":"5283_CR13","doi-asserted-by":"publisher","first-page":"891","DOI":"10.1109\/TPDS.2021.3090322","volume":"33","author":"H Tang","year":"2022","unstructured":"Tang, H., Koziol, Q., Ravi, J., Byna, S.: Transparent asynchronous parallel I\/O using background threads. IEEE Trans. Parallel Distrib. Syst. 33(4), 891\u2013902 (2022). https:\/\/doi.org\/10.1109\/TPDS.2021.3090322","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"5283_CR14","unstructured":"HPCwire: MLPerf Issues HPC 1.0 Benchmark Results Featuring Impressive Systems (Think Fugaku). (2021).\u00a0Accessed 26 March 2024\u00a0https:\/\/www.hpcwire.com\/2021\/11\/19\/mlperf-issues-hpc-1-0-benchmark-results-\/featuring-impressive-systems-think-fugaku\/"},{"key":"5283_CR15","doi-asserted-by":"crossref","unstructured":"Bae, M., Jeong, M., Yeo, S., Oh, S., Kwon, O.-K.: I\/O performance evaluation of large-scale deep learning on an hpc system. In: 2019 International Conference on High Performance Computing & Simulation (HPCS), pp. 436\u2013439 IEEE (2019)","DOI":"10.1109\/HPCS48598.2019.9188225"},{"key":"5283_CR16","unstructured":"Zhang, Z., Huang, L., Manor, U., Fang, L., Merlo, G., Michoski, C., Cazes, J., Gaffney, N.: FanStore: Enabling efficient and scalable I\/O for distributed deep learning. Preprint at https:\/\/arxiv.org\/abs\/quant-ph\/1809.1079 (2018)"},{"issue":"2","key":"5283_CR17","doi-asserted-by":"publisher","first-page":"34","DOI":"10.1145\/3331526","volume":"6","author":"S Pumma","year":"2019","unstructured":"Pumma, S., Si, M., Feng, W.-C., Balaji, P.: Scalable deep learning via i\/o analysis and optimization. ACM Trans. Parallel Comput. 6(2), 34 (2019). https:\/\/doi.org\/10.1145\/3331526","journal-title":"ACM Trans. Parallel Comput."},{"key":"5283_CR18","doi-asserted-by":"publisher","unstructured":"Kim, Y., Choi, H., Lee, J., Kim, J.-S., Jei, H., Roh, H.: Efficient large-scale deep learning framework for heterogeneous multi-GPU cluster. In: 2019 IEEE 4th International Workshops on Foundations and Applications of Self* Systems (FAS*W), pp. 176\u2013181 (2019). https:\/\/doi.org\/10.1109\/FAS-W.2019.00050","DOI":"10.1109\/FAS-W.2019.00050"},{"key":"5283_CR19","doi-asserted-by":"crossref","unstructured":"Byna, S., Chen, Y., Sun, X.-H., Thakur, R., Gropp, W.: Parallel I\/O prefetching using MPI file caching and I\/O signatures. In: SC\u201908: Proceedings of the 2008 ACM\/IEEE Conference on Supercomputing, pp. 1\u201312 IEEE\u00a0(2008)","DOI":"10.1109\/SC.2008.5213604"},{"issue":"3","key":"5283_CR20","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","volume":"115","author":"O Russakovsky","year":"2015","unstructured":"Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV) 115(3), 211\u2013252 (2015). https:\/\/doi.org\/10.1007\/s11263-015-0816-y","journal-title":"Int. J. Comput. Vis. (IJCV)"},{"key":"5283_CR21","doi-asserted-by":"publisher","first-page":"74","DOI":"10.1007\/978-3-031-70807-7_6","volume-title":"Cloud Comput. Big Data and Emerg. Top.","author":"E Parraga","year":"2025","unstructured":"Parraga, E., Leon, B., Mendez, S., Rexachs, D., Suppi, R., Luque, E.: An empirical method for processing I\/O traces to analyze the performance of DL applications. In: Naiouf, M., De Giusti, L., Chichizola, F., Libutti, L. (eds.) Cloud Comput. Big Data and Emerg. Top., pp. 74\u201390. Springer, Cham (2025)"},{"key":"5283_CR22","unstructured":"Vijayvargiya, G., Silakari, S., Pandey, R.: A Survey: Various techniques of image compression.https:\/\/arxiv.org\/abs\/quant-ph\/1311.6877  (2013)\u00a0"},{"key":"5283_CR23","doi-asserted-by":"publisher","unstructured":"Wang, Z., Lin, Z., Xu, L., Zhao, Y., Xin, J.: Batch images compression algorithm based on the common features. In: 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), pp. 1\u20136 (2017). https:\/\/doi.org\/10.1109\/CISP-BMEI.2017.8301937","DOI":"10.1109\/CISP-BMEI.2017.8301937"},{"key":"5283_CR24","doi-asserted-by":"publisher","unstructured":"Carns, P., Harms, K., Allcock, W., Bacon, C., Lang, S., Latham, R., Ross, R.: Understanding and improving computational science storage access through continuous characterization. ACM Trans. Storage 7(3) (2011).\u00a0https:\/\/doi.org\/10.1145\/2027066.2027068","DOI":"10.1145\/2027066.2027068"},{"issue":"6","key":"5283_CR25","doi-asserted-by":"publisher","first-page":"84","DOI":"10.1145\/3065386","volume":"60","author":"A Krizhevsky","year":"2017","unstructured":"Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84\u201390 (2017)","journal-title":"Commun. ACM"},{"key":"5283_CR26","unstructured":"Sergeev, A., Balso, M.D.: Horovod: fast and easy distributed deep learning in TensorFlow.\u00a0Preprint at\u00a0https:\/\/arxiv.org\/abs\/quant-ph\/1802.05799\u00a0(2018)"},{"key":"5283_CR27","unstructured":"Diederik, P.K.: Adam: A method for stochastic optimization (2014)"},{"key":"5283_CR28","unstructured":"Le, Y., Yang, X.S.: Tiny imagenet visual recognition challenge. (2015)"}],"container-title":["Cluster Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10586-025-05283-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10586-025-05283-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10586-025-05283-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,17]],"date-time":"2025-09-17T21:23:33Z","timestamp":1758144213000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10586-025-05283-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,30]]},"references-count":28,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2025,10]]}},"alternative-id":["5283"],"URL":"https:\/\/doi.org\/10.1007\/s10586-025-05283-3","relation":{},"ISSN":["1386-7857","1573-7543"],"issn-type":[{"type":"print","value":"1386-7857"},{"type":"electronic","value":"1573-7543"}],"subject":[],"published":{"date-parts":[[2025,8,30]]},"assertion":[{"value":"13 September 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 March 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 April 2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 August 2025","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"613"}}