{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,5,7]],"date-time":"2025-05-07T04:17:04Z","timestamp":1746591424710,"version":"3.40.5"},"reference-count":27,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,5,6]],"date-time":"2025-05-06T00:00:00Z","timestamp":1746489600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,5,6]],"date-time":"2025-05-06T00:00:00Z","timestamp":1746489600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100003032","name":"Association Nationale de la Recherche et de la Technologie","doi-asserted-by":"publisher","award":["CIFRE 2019\/1231"],"award-info":[{"award-number":["CIFRE 2019\/1231"]}],"id":[{"id":"10.13039\/501100003032","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100022110","name":"bioM\u00e9rieux","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100022110","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001665","name":"Agence Nationale de la Recherche","doi-asserted-by":"publisher","award":["ANR-16-CE02-0005-01","ANR-16-CE02-0005-01"],"award-info":[{"award-number":["ANR-16-CE02-0005-01","ANR-16-CE02-0005-01"]}],"id":[{"id":"10.13039\/501100001665","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>Genome sequence databases are growing exponentially, but with high redundancy and uneven data quality. For these reasons, selecting representative subsets of genomes is an essential step for almost all studies. 
However, most current sampling approaches are biased and unable to process large datasets in a reasonable time.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Methods<\/jats:title>\n            <jats:p>Here we present MPS-Sampling (Multiple-Protein Similarity-based Sampling), a fast, scalable, and efficient method for selecting reliable and\u00a0representative samples of genomes from very large datasets. Using families of homologous proteins as input, MPS-Sampling delineates homogeneous groups of genomes through two successive clustering steps. Representative genomes are then selected within these groups according to predefined or user-defined priority criteria.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>MPS-Sampling was applied to a dataset of 48\u00a0ribosomal protein families from 178,203\u00a0bacterial genomes to generate representative genome sets of various size, corresponding to a sampling of 32.17% down to 0.3% of the complete dataset. An in-depth analysis shows that the selected genomes are both taxonomically and phylogenetically representative of the complete dataset, demonstrating the relevance of the approach.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusion<\/jats:title>\n            <jats:p>MPS-Sampling provides an efficient, fast and scalable way to sample large collections of genomes in an acceptable computational time. MPS-Sampling does not rely on taxonomic information and does not require the inference of phylogenetic trees, thus avoiding the biases inherent in these approaches. 
As such, MPS-Sampling meets the needs of a growing number of users.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/s12859-025-06095-3","type":"journal-article","created":{"date-parts":[[2025,5,6]],"date-time":"2025-05-06T12:48:19Z","timestamp":1746535699000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Multi-proteins similarity-based sampling to select representative genomes from large databases"],"prefix":"10.1186","volume":"26","author":[{"given":"R\u00e9mi-Vinh","family":"Coudert","sequence":"first","affiliation":[]},{"given":"Jean-Philippe","family":"Charrier","sequence":"additional","affiliation":[]},{"given":"Fr\u00e9d\u00e9ric","family":"Jauffrit","sequence":"additional","affiliation":[]},{"given":"Jean-Pierre","family":"Flandrois","sequence":"additional","affiliation":[]},{"given":"C\u00e9line","family":"Brochier-Armanet","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,5,6]]},"reference":[{"key":"6095_CR1","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1371\/journal.pbio.1002195","volume":"13","author":"ZD Stephens","year":"2015","unstructured":"Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, et al. Big data: Astronomical or genomical? PLoS Biol. 2015;13:1\u201311.","journal-title":"PLoS Biol"},{"key":"6095_CR2","doi-asserted-by":"publisher","first-page":"141","DOI":"10.1007\/s10142-015-0433-4","volume":"15","author":"M Land","year":"2015","unstructured":"Land M, Hauser L, Jun SR, Nookaew I, Leuze MR, Ahn TH, et al. Insights from 20 years of bacterial genome sequencing. Funct Integr Genomics. 2015;15:141\u201361.","journal-title":"Funct Integr Genomics"},{"key":"6095_CR3","doi-asserted-by":"publisher","first-page":"2396","DOI":"10.1093\/molbev\/msab034","volume":"38","author":"PS Garcia","year":"2021","unstructured":"Garcia PS, Duchemin W, Flandrois JP, Gribaldo S, Grangeasse C, Brochier-Armanet C. 
A comprehensive evolutionary scenario of cell division and associated processes in the firmicutes. Mol Biol Evol. 2021;38:2396\u2013412.","journal-title":"Mol Biol Evol"},{"key":"6095_CR4","doi-asserted-by":"publisher","first-page":"1533","DOI":"10.1038\/s41564-017-0012-7","volume":"2","author":"DH Parks","year":"2017","unstructured":"Parks DH, Rinke C, Chuvochina M, Chaumeil PA, Woodcroft BJ, Evans PN, et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat Microbiol. 2017;2:1533\u201342.","journal-title":"Nat Microbiol"},{"key":"6095_CR5","doi-asserted-by":"publisher","first-page":"461","DOI":"10.1099\/ijsem.0.002516","volume":"68","author":"J Chun","year":"2018","unstructured":"Chun J, Oren A, Ventosa A, Christensen H, Arahal DR, da Costa MS, et al. Proposed minimal standards for the use of genome data for the taxonomy of prokaryotes. Int J Syst Evol Microbiol. 2018;68:461\u20136.","journal-title":"Int J Syst Evol Microbiol"},{"key":"6095_CR6","doi-asserted-by":"publisher","first-page":"1079","DOI":"10.1038\/s41587-020-0501-8","volume":"38","author":"DH Parks","year":"2020","unstructured":"Parks DH, Chuvochina M, Chaumeil PA, Rinke C, Mussig AJ, Hugenholtz P. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat Biotechnol. 2020;38:1079\u201386.","journal-title":"Nat Biotechnol"},{"key":"6095_CR7","doi-asserted-by":"publisher","first-page":"1125","DOI":"10.1016\/S1286-4579(02)01637-4","volume":"4","author":"R Lan","year":"2002","unstructured":"Lan R, Reeves PR. Escherichia coli in disguise: Molecular origins of Shigella. Microbes Infect. 2002;4:1125\u201332.","journal-title":"Microbes Infect"},{"key":"6095_CR8","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s12859-018-2164-8","volume":"19","author":"F Menardo","year":"2018","unstructured":"Menardo F, Loiseau C, Brites D, Coscolla M, Gygli SM, Rutaihwa LK, et al. 
Treemmer: A tool to reduce large phylogenetic datasets with minimal loss of diversity. BMC Bioinformatics. 2018;19:1\u20138.","journal-title":"BMC Bioinformatics"},{"key":"6095_CR9","doi-asserted-by":"publisher","first-page":"1580","DOI":"10.1093\/molbev\/msz053","volume":"36","author":"AX Han","year":"2019","unstructured":"Han AX, Parker E, Scholer F, Maurer-Stroh S, Russell CA. Phylogenetic clustering by linear integer programming (PhyCLiP). Mol Biol Evol. 2019;36:1580\u201395.","journal-title":"Mol Biol Evol"},{"key":"6095_CR10","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1371\/journal.pone.0221068","volume":"14","author":"M Balaban","year":"2019","unstructured":"Balaban M, Moshiri N, Mai U, Jia X, Mirarab S. TreeCluster: Clustering biological sequences using phylogenetic trees. PLoS ONE. 2019;14:1\u201320.","journal-title":"PLoS ONE"},{"key":"6095_CR11","doi-asserted-by":"publisher","first-page":"663","DOI":"10.1093\/bioinformatics\/btab723","volume":"38","author":"L Pipes","year":"2022","unstructured":"Pipes L, Nielsen R. AncestralClust: clustering of divergent nucleotide sequences by ancestral sequence reconstruction using phylogenetic trees. Bioinformatics. 2022;38:663\u201370.","journal-title":"Bioinformatics"},{"key":"6095_CR12","first-page":"1","volume":"2017","author":"H Philippe","year":"2017","unstructured":"Philippe H, De Vienne DM, Ranwez V, Roure B, Baurain D, Delsuc F. Pitfalls in supermatrix phylogenomics. Eur J Taxon. 2017;2017:1\u201325.","journal-title":"Eur J Taxon"},{"key":"6095_CR13","doi-asserted-by":"publisher","first-page":"316","DOI":"10.1099\/ijs.0.054171-0","volume":"64","author":"J Chun","year":"2014","unstructured":"Chun J, Rainey FA. Integrating genomics into the taxonomy and systematics of the Bacteria and Archaea. Int J Syst Evol Microbiol. 
2014;64:316\u201324.","journal-title":"Int J Syst Evol Microbiol"},{"key":"6095_CR14","doi-asserted-by":"publisher","first-page":"81","DOI":"10.1099\/ijs.0.64483-0","volume":"57","author":"J Goris","year":"2007","unstructured":"Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, Tiedje JM. DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. Int J Syst Evol Microbiol. 2007;57:81\u201391.","journal-title":"Int J Syst Evol Microbiol"},{"key":"6095_CR15","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13059-016-0997-x","volume":"17","author":"BD Ondov","year":"2016","unstructured":"Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, et al. Mash: Fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17:1\u201314.","journal-title":"Genome Biol"},{"key":"6095_CR16","doi-asserted-by":"publisher","first-page":"2210","DOI":"10.1128\/JB.01688-14","volume":"196","author":"QL Qin","year":"2014","unstructured":"Qin QL, Bin XB, Zhang XY, Chen XL, Zhou BC, Zhou J, et al. A proposed genus boundary for the prokaryotes based on genomic insights. J Bacteriol. 2014;196:2210\u20135.","journal-title":"J Bacteriol"},{"issue":"1","key":"6095_CR17","doi-asserted-by":"publisher","first-page":"2542","DOI":"10.1038\/s41467-018-04964-5","volume":"9","author":"M Steinegger","year":"2018","unstructured":"Steinegger M, S\u00f6ding J. Clustering huge protein sequence sets in linear time. Nat Commun. 2018;9(1):2542.","journal-title":"Nat Commun"},{"key":"6095_CR18","doi-asserted-by":"crossref","unstructured":"Slav\u00edk P. A tight analysis of the greedy algorithm for set cover. Proc Annu ACM Symp Theory Comput. 1996;Part F1294:435\u201341.","DOI":"10.1145\/237814.237991"},{"key":"6095_CR19","doi-asserted-by":"crossref","unstructured":"Dice LR. Measures of the Amount of Ecologic Association Between Species. 
Ecology. 1945;26:297\u2013302.","DOI":"10.2307\/1932409"},{"key":"6095_CR20","doi-asserted-by":"publisher","first-page":"2170","DOI":"10.1093\/molbev\/msw088","volume":"33","author":"F Jauffrit","year":"2016","unstructured":"Jauffrit F, Penel S, Delmotte S, Rey C, De Vienne DM, Gouy M, et al. RiboDB database: a comprehensive resource for prokaryotic systematics. Mol Biol Evol. 2016;33:2170\u20132.","journal-title":"Mol Biol Evol"},{"key":"6095_CR21","doi-asserted-by":"publisher","first-page":"D785","DOI":"10.1093\/nar\/gkab776","volume":"50","author":"DH Parks","year":"2022","unstructured":"Parks DH, Chuvochina M, Rinke C, Mussig AJ, Chaumeil PA, Hugenholtz P. GTDB: An ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 2022;50:D785\u201394.","journal-title":"Nucleic Acids Res"},{"key":"6095_CR22","doi-asserted-by":"publisher","first-page":"W293","DOI":"10.1093\/nar\/gkab301","volume":"49","author":"I Letunic","year":"2021","unstructured":"Letunic I, Bork P. Interactive tree of life (iTOL) v5: An online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 2021;49:W293\u20136.","journal-title":"Nucleic Acids Res"},{"key":"6095_CR23","doi-asserted-by":"publisher","first-page":"225","DOI":"10.1038\/s41579-020-00458-8","volume":"19","author":"WH Lewis","year":"2020","unstructured":"Lewis WH, Tahon G, Geesink P, Sousa DZ, Ettema TJG. Innovations to culturing the uncultured microbial majority. Nat Rev Microbiol. 2020;19:225\u201340.","journal-title":"Nat Rev Microbiol"},{"key":"6095_CR24","unstructured":"S\u00f8rensen T. A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. 
1948."},{"key":"6095_CR25","doi-asserted-by":"publisher","first-page":"270","DOI":"10.1177\/1094428112470848","volume":"16","author":"H Aguinis","year":"2013","unstructured":"Aguinis H, Gottfredson RK, Joo H. Best-Practice recommendations for defining, identifying, and handling outliers. Organ Res Methods. 2013;16:270\u2013301.","journal-title":"Organ Res Methods"},{"key":"6095_CR26","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0036972","volume":"7","author":"N Yutin","year":"2012","unstructured":"Yutin N, Puigb\u00f2 P, Koonin EV, Wolf YI. Phylogenomics of prokaryotic ribosomal proteins. PLoS ONE. 2012;7: e36972.","journal-title":"PLoS ONE"},{"key":"6095_CR27","doi-asserted-by":"publisher","first-page":"103","DOI":"10.1016\/j.ympev.2014.02.013","volume":"75","author":"HG Ramulu","year":"2014","unstructured":"Ramulu HG, Groussin M, Talla E, Planel R, Daubin V, Brochier-Armanet C. Ribosomal proteins: toward a next generation standard for prokaryotic systematics? Mol Phylogenet Evol. 
2014;75:103\u201317.","journal-title":"Mol Phylogenet Evol"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-025-06095-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12859-025-06095-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-025-06095-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,6]],"date-time":"2025-05-06T12:48:23Z","timestamp":1746535703000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/s12859-025-06095-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,6]]},"references-count":27,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["6095"],"URL":"https:\/\/doi.org\/10.1186\/s12859-025-06095-3","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,6]]},"assertion":[{"value":"25 January 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 February 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"6 May 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not 
applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"121"}}