{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,12]],"date-time":"2026-01-12T22:47:26Z","timestamp":1768258046463,"version":"3.49.0"},"reference-count":40,"publisher":"Oxford University Press (OUP)","issue":"11","license":[{"start":{"date-parts":[[2018,1,12]],"date-time":"2018-01-12T00:00:00Z","timestamp":1515715200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/about_us\/legal\/notices"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2018,6,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>Information theoretic and compositional\/linguistic analysis of genomes have a central role in bioinformatics, even more so since the associated methodologies are becoming very valuable also for epigenomic and meta-genomic studies. The kernel of those methods is based on the collection of k-mer statistics, i.e. how many times each k-mer in {A,C,G,T}k occurs in a DNA sequence. Although this problem is computationally very simple and efficiently solvable on a conventional computer, the sheer amount of data available now in applications demands to resort to parallel and distributed computing. Indeed, those type of algorithms have been developed to collect k-mer statistics in the realm of genome assembly. 
However, they are so specialized to this domain that they do not extend easily to the computation of informational and linguistic indices, concurrently on sets of genomes.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>Following the well-established approach in many disciplines, and with a growing success also in bioinformatics, to resort to MapReduce and Hadoop to deal with \u2018Big Data\u2019 problems, we present KCH, the first set of MapReduce algorithms able to perform concurrently informational and linguistic analysis of large collections of genomic sequences on a Hadoop cluster. The benchmarking of KCH that we provide indicates that it is quite effective and versatile. It is also competitive with respect to the parallel and distributed algorithms highly specialized to k-mer statistics collection for genome assembly problems. In conclusion, KCH is a much needed addition to the growing number of algorithms and tools that use MapReduce for bioinformatics core applications.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>The software, including instructions for running it over Amazon AWS, as well as the datasets are available at http:\/\/www.di-srv.unisa.it\/KCH.<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information<\/jats:title><jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/bty018","type":"journal-article","created":{"date-parts":[[2018,1,9]],"date-time":"2018-01-09T23:20:59Z","timestamp":1515540059000},"page":"1826-1833","source":"Crossref","is-referenced-by-count":20,"title":["Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms"],"prefix":"10.1093","volume":"34","author":[{"given":"Umberto","family":"Ferraro Petrillo","sequence":"first","affiliation":[{"name":"Dipartimento di Scienze Statistiche, Universit\u00e0 di Roma \u2013 La 
Sapienza, Rome, Italy"}]},{"given":"Gianluca","family":"Roscigno","sequence":"additional","affiliation":[{"name":"Dipartimento di Informatica, Universit\u00e0 di Salerno, Fisciano, SA, Italy"}]},{"given":"Giuseppe","family":"Cattaneo","sequence":"additional","affiliation":[{"name":"Dipartimento di Informatica, Universit\u00e0 di Salerno, Fisciano, SA, Italy"}]},{"given":"Raffaele","family":"Giancarlo","sequence":"additional","affiliation":[{"name":"Dipartimento di Matematica ed Informatica, Universit\u00e0 di Palermo, Palermo, Italy"}]}],"member":"286","published-online":{"date-parts":[[2018,1,12]]},"reference":[{"key":"2023012713544692300_bty018-B1","doi-asserted-by":"crossref","first-page":"e1022.","DOI":"10.1371\/journal.pone.0001022","article-title":"Nullomers: really a matter of natural selection?","volume":"2","author":"Acquisti","year":"2007","journal-title":"Plos One"},{"key":"2023012713544692300_bty018-B2","doi-asserted-by":"crossref","first-page":"2070","DOI":"10.1093\/bioinformatics\/btu152","article-title":"KAnalyze: a fast versatile pipelined k-mer toolkit","volume":"30","author":"Audano","year":"2014","journal-title":"Bioinformatics"},{"key":"2023012713544692300_bty018-B3","doi-asserted-by":"crossref","first-page":"026004","DOI":"10.1088\/1478-3975\/13\/2\/026004","article-title":"The bulk and the tail of minimal absent words in genome sequences","volume":"13","author":"Aurell","year":"2016","journal-title":"Phys. Biol"},{"key":"2023012713544692300_bty018-B4","volume-title":"Principles of Concurrent and Distributed Programming","author":"Ben-Ari","year":"2006"},{"key":"2023012713544692300_bty018-B5","doi-asserted-by":"crossref","first-page":"e94.","DOI":"10.7717\/peerj-cs.94","article-title":"Multiple comparative metagenomics using multiset k-mer counting","volume":"2","author":"Benoit","year":"2016","journal-title":"PeerJ Comput. 
Sci"},{"key":"2023012713544692300_bty018-B6","doi-asserted-by":"crossref","first-page":"D36","DOI":"10.1093\/nar\/gks1195","article-title":"Genbank","volume":"41","author":"Benson","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2023012713544692300_bty018-B7","author":"Bhatia","year":"2011"},{"key":"2023012713544692300_bty018-B8","doi-asserted-by":"crossref","first-page":"1492","DOI":"10.1093\/bioinformatics\/btt178","article-title":"Assembling the 20 Gb white spruce (picea glauca) genome from whole-genome shotgun sequencing data","volume":"29","author":"Birol","year":"2013","journal-title":"Bioinformatics"},{"key":"2023012713544692300_bty018-B9","doi-asserted-by":"crossref","first-page":"1467","DOI":"10.1007\/s11227-016-1835-3","article-title":"An effective extension of the applicability of alignment-free biological sequence comparison algorithms with Hadoop","volume":"73","author":"Cattaneo","year":"2017","journal-title":"J. Supercomput"},{"key":"2023012713544692300_bty018-B10","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1007\/978-3-319-57711-1_5","volume-title":"Advances in Artificial Life, Evolutionary Computation, and Systems Chemistry: 11th Italian Workshop, WIVACE 2016, Fisciano, Italy, October 4-6, 2016, Revised Selected Papers","author":"Cattaneo","year":"2017"},{"key":"2023012713544692300_bty018-B11","doi-asserted-by":"crossref","first-page":"R108.","DOI":"10.1186\/gb-2009-10-10-r108","article-title":"Genomic DNA k-mer spectra: models and modalities","volume":"10","author":"Chor","year":"2009","journal-title":"Genome Biol"},{"key":"2023012713544692300_bty018-B12","doi-asserted-by":"crossref","first-page":"987","DOI":"10.1038\/nbt.2023","article-title":"How to apply de Bruijn graphs to genome assembly","volume":"29","author":"Compeau","year":"2011","journal-title":"Nat. 
Biotechnol"},{"key":"2023012713544692300_bty018-B13","first-page":"137","article-title":"MapReduce: simplified data processing on large clusters","author":"Dean","year":"2004","journal-title":"6th Symposium on Operating System Design and Implementation (OSDI)"},{"key":"2023012713544692300_bty018-B14","doi-asserted-by":"crossref","first-page":"153","DOI":"10.1145\/356571.356573","article-title":"Virtual memory","volume":"2","author":"Denning","year":"1970","journal-title":"ACM Comput. Surv. (CSUR)"},{"key":"2023012713544692300_bty018-B15","first-page":"1575","article-title":"FASTdoop: a versatile and efficient library for the input of FASTA and FASTQ files for MapReduce Hadoop bioinformatics applications","volume":"33","author":"Ferraro Petrillo","year":"2017","journal-title":"Bioinformatics (Oxford, England)"},{"key":"2023012713544692300_bty018-B16","first-page":"100","author":"Ferraro Petrillo","year":"2017"},{"key":"2023012713544692300_bty018-B17","doi-asserted-by":"crossref","first-page":"1575","DOI":"10.1093\/bioinformatics\/btp117","article-title":"Textual data compression in computational biology: a synopsis","volume":"25","author":"Giancarlo","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012713544692300_bty018-B18","doi-asserted-by":"crossref","first-page":"390","DOI":"10.1093\/bib\/bbt088","article-title":"Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies","volume":"15","author":"Giancarlo","year":"2014","journal-title":"Brief. 
Bioinf"},{"key":"2023012713544692300_bty018-B19","doi-asserted-by":"crossref","first-page":"2939","DOI":"10.1093\/bioinformatics\/btv295","article-title":"Epigenomic k-mer dictionaries: shedding light on how sequence composition influences nucleosome positioning in vivo","volume":"31","author":"Giancarlo","year":"2015","journal-title":"Bioinformatics"},{"key":"2023012713544692300_bty018-B20","first-page":"355","volume-title":"Pac Symp Biocomput","author":"Hampikian","year":"2007"},{"key":"2023012713544692300_bty018-B21","doi-asserted-by":"crossref","first-page":"W7","DOI":"10.1093\/nar\/gku398","article-title":"Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches","volume":"42","author":"Horwege","year":"2014","journal-title":"Nucleic Acids Res"},{"key":"2023012713544692300_bty018-B22","volume-title":"Integrated Taxonomic Information System On-line Database","author":"ITIS Partnership","year":"2010"},{"key":"2023012713544692300_bty018-B23","first-page":"2759","article-title":"KMC 3: counting and manipulating k-mer statistics","volume":"33","author":"Kokot","year":"2017","journal-title":"Bioinformatics"},
{"key":"2023012713544692300_bty018-B24","doi-asserted-by":"crossref","first-page":"971","DOI":"10.1093\/bioinformatics\/btw776","article-title":"Fast and accurate phylogeny reconstruction using filtered spaced-word matches","volume":"33","author":"Leimeister","year":"2017","journal-title":"Bioinformatics"},{"key":"2023012713544692300_bty018-B25","first-page":"114","volume-title":"Alignment Free Dissimilarities for Nucleosome Classification","author":"Lo Bosco","year":"2016"},{"key":"2023012713544692300_bty018-B26","doi-asserted-by":"crossref","first-page":"764","DOI":"10.1093\/bioinformatics\/btr011","article-title":"A fast, lock-free approach for efficient parallel counting of occurrences of k-mers","volume":"27","author":"Mar\u00e7ais","year":"2011","journal-title":"Bioinformatics"},{"key":"2023012713544692300_bty018-B27","doi-asserted-by":"crossref","first-page":"3014","DOI":"10.1093\/bioinformatics\/btt528","article-title":"BioPig: a Hadoop-based analytic toolkit for large-scale sequence data","volume":"29","author":"Nordberg","year":"2013","journal-title":"Bioinformatics"},{"key":"2023012713544692300_bty018-B28","doi-asserted-by":"crossref","first-page":"325","DOI":"10.1038\/nbt.2515","article-title":"Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers","volume":"31","author":"Nordstrom","year":"2013","journal-title":"Nat. 
Biotechnol"},{"key":"2023012713544692300_bty018-B29","doi-asserted-by":"crossref","first-page":"579","DOI":"10.1038\/nature12211","article-title":"The norway spruce genome sequence and conifer genome evolution","volume":"497","author":"Nystedt","year":"2013","journal-title":"Nature"},{"key":"2023012713544692300_bty018-B30","doi-asserted-by":"crossref","first-page":"408.","DOI":"10.1186\/1471-2105-12-408","article-title":"A motif-independent metric for DNA sequence specificity","volume":"12","author":"Pinello","year":"2011","journal-title":"BMC Bioinformatics"},{"key":"2023012713544692300_bty018-B31","doi-asserted-by":"crossref","first-page":"1756.","DOI":"10.1186\/s13104-016-1972-z","article-title":"Absent words and the (dis)similarity analysis of dna sequences: an experimental study","volume":"9","author":"Rahman","year":"2016","journal-title":"BMC Res. Notes"},{"key":"2023012713544692300_bty018-B32","doi-asserted-by":"crossref","first-page":"652","DOI":"10.1093\/bioinformatics\/btt020","article-title":"DSK: k-mer counting with very low memory usage","volume":"29","author":"Rizk","year":"2013","journal-title":"Bioinformatics"},{"key":"2023012713544692300_bty018-B33","author":"Shvachko","year":"2010"},{"key":"2023012713544692300_bty018-B34","doi-asserted-by":"crossref","first-page":"26.","DOI":"10.1186\/s13742-015-0058-5","article-title":"A quantitative assessment of the hadoop framework for analyzing massively parallel dna sequencing data","volume":"4","author":"Siretskiy","year":"2015","journal-title":"GigaScience"},{"key":"2023012713544692300_bty018-B35","doi-asserted-by":"crossref","first-page":"835","DOI":"10.1093\/bioinformatics\/btv679","article-title":"The intrinsic combinatorial organization and information theoretic content of a sequence are correlated to the DNA encoded nucleosome organization of eukaryotic 
genomes","volume":"32","author":"Utro","year":"2016","journal-title":"Bioinformatics"},{"key":"2023012713544692300_bty018-B36","doi-asserted-by":"crossref","first-page":"e0164540.","DOI":"10.1371\/journal.pone.0164540","article-title":"Nullomers and high order nullomers in genomic sequences","volume":"11","author":"Vergni","year":"2016","journal-title":"Plos One"},{"key":"2023012713544692300_bty018-B37","volume-title":"Hadoop: The Definitive Guide","author":"White","year":"2015","edition":"4th edn."},{"key":"2023012713544692300_bty018-B38","first-page":"95","article-title":"Spark: cluster computing with working sets","volume":"10","author":"Zaharia","year":"2010","journal-title":"HotCloud"},{"key":"2023012713544692300_bty018-B39","doi-asserted-by":"crossref","first-page":"1090","DOI":"10.1093\/bioinformatics\/btw750","article-title":"Metaspark: a spark-based distributed processing tool to recruit metagenomic reads to reference genomes","volume":"33","author":"Zhou","year":"2017","journal-title":"Bioinformatics"},{"key":"2023012713544692300_bty018-B40","doi-asserted-by":"crossref","first-page":"875","DOI":"10.1534\/genetics.113.159715","article-title":"Sequencing and assembly of the 22-Gb loblolly pine 
genome","volume":"196","author":"Zimin","year":"2014","journal-title":"Genetics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/11\/1826\/48937852\/bioinformatics_34_11_1826.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/11\/1826\/48937852\/bioinformatics_34_11_1826.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,8,30]],"date-time":"2023-08-30T10:41:58Z","timestamp":1693392118000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/34\/11\/1826\/4802227"}},"subtitle":[],"editor":[{"given":"Alfonso","family":"Valencia","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2018,1,12]]},"references-count":40,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2018,6,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bty018","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2018,6,1]]},"published":{"date-parts":[[2018,1,12]]}}}