{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,2]],"date-time":"2026-05-02T08:25:57Z","timestamp":1777710357105,"version":"3.51.4"},"reference-count":22,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2026,3,2]],"date-time":"2026-03-02T00:00:00Z","timestamp":1772409600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"INdAM\u2014GNCS","award":["E53C23001670001"],"award-info":[{"award-number":["E53C23001670001"]}]},{"name":"INdAM\u2014GNCS","award":["E53C24001950001"],"award-info":[{"award-number":["E53C24001950001"]}]},{"DOI":"10.13039\/501100000780","name":"European Union","doi-asserted-by":"crossref","award":["B77G24000050001"],"award-info":[{"award-number":["B77G24000050001"]}],"id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100000780","name":"European Union","doi-asserted-by":"crossref","award":["PE00000001"],"award-info":[{"award-number":["PE00000001"]}],"id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100000780","name":"European Union","doi-asserted-by":"crossref","award":["E83C22004640001"],"award-info":[{"award-number":["E83C22004640001"]}],"id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>With the rapid growth of Next Generation Sequencing (NGS) technologies, large amounts of \u201comics\u201d data are daily collected and need to be processed. Indexing and compressing large sequences datasets are some of the most important tasks in this context. Here, we propose a novel approach for the computation of Burrows Wheeler transform relying on Big Data technologies, i.e., Apache Spark and Hadoop. We implement three algorithms based on the MapReduce framework, distributing the index computation and not only the input dataset, differently than previous approaches from the literature. Experimental results performed on real datasets show that the proposed approach is promising.<\/jats:p>","DOI":"10.3390\/data11030048","type":"journal-article","created":{"date-parts":[[2026,3,2]],"date-time":"2026-03-02T16:06:59Z","timestamp":1772467619000},"page":"48","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Burrows Wheeler Transform on a Large Scale: Algorithms Implemented in Apache Spark"],"prefix":"10.3390","volume":"11","author":[{"given":"Ylenia","family":"Galluzzo","sequence":"first","affiliation":[{"name":"Department of Mathematics and Computer Science, University of Palermo, 90123 Palermo, Italy"}]},{"given":"Raffaele","family":"Giancarlo","sequence":"additional","affiliation":[{"name":"Department of Mathematics and Computer Science, University of Palermo, 90123 Palermo, Italy"}]},{"given":"Mario","family":"Randazzo","sequence":"additional","affiliation":[{"name":"Department of Mathematics and Computer Science, University of Palermo, 90123 Palermo, Italy"}]},{"given":"Simona E.","family":"Rombo","sequence":"additional","affiliation":[{"name":"Department of Mathematics and Computer Science, University of Palermo, 90123 Palermo, Italy"}]}],"member":"1968","published-online":{"date-parts":[[2026,3,2]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"90","DOI":"10.1016\/j.ins.2016.08.085","article-title":"Indexing Next-Generation Sequencing data","volume":"384","author":"Jalili","year":"2017","journal-title":"Inf. Sci."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1016\/S1570-8667(03)00065-0","article-title":"Replacing suffix trees with enhanced suffix arrays","volume":"2","author":"Abouelhoda","year":"2004","journal-title":"J. Discret. Algorithms"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"552","DOI":"10.1145\/1082036.1082039","article-title":"Indexing compressed text","volume":"52","author":"Ferragina","year":"2005","journal-title":"J. ACM"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Lee, W.P., Stromberg, M.P., Ward, A., Stewart, C., Garrison, E.P., and Marth, G.T. (2014). MOSAIK: A Hash-Based Algorithm for Accurate Next-Generation Sequencing Short-Read Mapping. PLoS ONE, 9.","DOI":"10.1371\/journal.pone.0090581"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"712","DOI":"10.1016\/j.drudis.2017.01.014","article-title":"Next-generation sequencing: Big data meets high performance computing","volume":"22","author":"Schmidt","year":"2017","journal-title":"Drug Discov. Today"},{"key":"ref_6","unstructured":"Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauly, M., Franklin, M.J., Shenker, S., and Stoica, I. (2012). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Proceedings of the Presented as Part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), San Jose, CA, USENIX Association."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Nandimath, J., Banerjee, E., Patil, A., Kakade, P., Vaidya, S., and Chaturvedi, D. (2013). Big data analysis using Apache Hadoop. Proceedings of the 2013 IEEE 14th International Conference on Information Reuse & Integration (IRI), IEEE.","DOI":"10.1109\/IRI.2013.6642536"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Menon, R.K., Bhat, G.P., and Schatz, M.C. (2011). Rapid Parallel Genome Indexing with MapReduce. MapReduce \u201911: Proceedings of the Second International Workshop on MapReduce and Its Applications, New York, NY, USA, Association for Computing Machinery.","DOI":"10.1145\/1996092.1996104"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"4003","DOI":"10.1093\/bioinformatics\/btv506","article-title":"BigBWA: Approaching the Burrows-Wheeler aligner to Big Data technologies","volume":"31","author":"Pichel","year":"2015","journal-title":"Bioinformatics"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"72","DOI":"10.1145\/1629175.1629198","article-title":"MapReduce: A flexible data processing tool","volume":"53","author":"Dean","year":"2010","journal-title":"Commun. ACM"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1754","DOI":"10.1093\/bioinformatics\/btp324","article-title":"Fast and accurate short read alignment with Burrows-Wheeler transform","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"btae717","DOI":"10.1093\/bioinformatics\/btae717","article-title":"BWT construction and search at the terabase scale","volume":"40","author":"Li","year":"2024","journal-title":"Bioinformatics"},{"key":"ref_13","first-page":"28","article-title":"A Big Data Approach for Sequences Indexing on the Cloud via Burrows Wheeler Transform","volume":"2820","author":"Zumpano","year":"2020","journal-title":"Proceedings of the First International AAI4H\u2014Advances in Artificial Intelligence for Healthcare Workshop Co-Located with the 24th European Conference on Artificial Intelligence (ECAI 2020), Santiago de Compostela, Spain, 4 September 2020"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Flick, P., and Aluru, S. (2015, January 15\u201320). Parallel distributed memory construction of suffix and longest common prefix arrays. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015, Austin, TX, USA.","DOI":"10.1145\/2807591.2807609"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"918","DOI":"10.1145\/1217856.1217858","article-title":"Linear work suffix array construction","volume":"53","author":"Sanders","year":"2006","journal-title":"J. ACM"},{"key":"ref_16","unstructured":"Manzini, G., and Navarro, G. (2026, February 13). The Pizza and Chili Corpus Home Page. Available online: https:\/\/pizzachili.dcc.uchile.cl\/texts.html."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"103809","DOI":"10.1016\/j.chemolab.2019.07.008","article-title":"On the internal correlations of protein sequences probed by non-alignment methods: Novel signatures for drug and antibody targets via the Burrows-Wheeler Transform","volume":"193","author":"Graham","year":"2019","journal-title":"Chemom. Intell. Lab. Syst."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Raff, E., Nicholas, C., and McLean, M. (2019). A New Burrows Wheeler Transform Markov Distance. arXiv.","DOI":"10.1609\/aaai.v34i04.5994"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"1162","DOI":"10.14778\/3389133.3389135","article-title":"The PGM-index: A fully-dynamic compressed learned index with provable worst-case bounds","volume":"13","author":"Ferragina","year":"2020","journal-title":"Proc. VLDB Endow."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"109806","DOI":"10.1016\/j.engappai.2024.109806","article-title":"An efficient perceptual video compression scheme based on deep learning-assisted video saliency and just noticeable distortion","volume":"141","author":"Zhang","year":"2025","journal-title":"Eng. Appl. Artif. Intell."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Badkobeh, G., Bannai, H., and K\u00f6ppl, D. (2024). Bijective BWT based compression schemes. Proceedings of the International Symposium on String Processing and Information Retrieval, Springer.","DOI":"10.1007\/978-3-031-72200-4_2"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"38653","DOI":"10.1007\/s11042-025-20687-4","article-title":"A comprehensive survey of image compression methods: From prediction models to advanced techniques","volume":"84","author":"Devadason","year":"2025","journal-title":"Multimed. Tools Appl."}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/11\/3\/48\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,4]],"date-time":"2026-03-04T05:14:30Z","timestamp":1772601270000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/11\/3\/48"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,2]]},"references-count":22,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2026,3]]}},"alternative-id":["data11030048"],"URL":"https:\/\/doi.org\/10.3390\/data11030048","relation":{},"ISSN":["2306-5729"],"issn-type":[{"value":"2306-5729","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,2]]}}}