{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,12]],"date-time":"2026-02-12T17:48:00Z","timestamp":1770918480918,"version":"3.50.1"},"reference-count":34,"publisher":"Oxford University Press (OUP)","issue":"8","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2012,4,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: Proteins can be naturally classified into families of homologous sequences that derive from a common ancestor. The comparison of homologous sequences and the analysis of their phylogenetic relationships provide useful information regarding the function and evolution of genes. One important difficulty of clustering methods is to distinguish highly divergent homologous sequences from sequences that only share partial homology due to evolution by protein domain rearrangements. Existing clustering methods require parameters that have to be set a priori. Given the variability in the evolution pattern among proteins, these parameters cannot be optimal for all gene families.<\/jats:p>\n               <jats:p>Results: We propose a strategy that aims at clustering sequences homologous over their entire length, and that takes into account the pattern of substitution specific to each gene family. Sequences are first all compared with each other and clustered into pre-families, based on pairwise similarity criteria, with permissive parameters to optimize sensitivity. Pre-families are then divided into homogeneous clusters, based on the topology of the similarity network. Finally, clusters are progressively merged into families, for which we compute multiple alignments, and we use a model selection technique to find the optimal tradeoff between the number of families and multiple alignment likelihood. To evaluate this method, called HiFiX, we analyzed simulated sequences and manually curated datasets. These tests showed that HiFiX is the only method robust to both sequence divergence and domain rearrangements. HiFiX is fast enough to be used on very large datasets.<\/jats:p>\n               <jats:p>Availability and implementation: The Python software HiFiX is freely available at http:\/\/lbbe.univ-lyon1.fr\/hifix<\/jats:p>\n               <jats:p>Contact: \u00a0vincent.miele@univ-lyon1.fr<\/jats:p>\n               <jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/bts098","type":"journal-article","created":{"date-parts":[[2012,2,26]],"date-time":"2012-02-26T05:52:45Z","timestamp":1330235565000},"page":"1078-1085","source":"Crossref","is-referenced-by-count":28,"title":["High-quality sequence clustering guided by network topology and multiple alignment likelihood"],"prefix":"10.1093","volume":"28","author":[{"given":"Vincent","family":"Miele","sequence":"first","affiliation":[]},{"given":"Simon","family":"Penel","sequence":"additional","affiliation":[]},{"given":"Vincent","family":"Daubin","sequence":"additional","affiliation":[]},{"given":"Franck","family":"Picard","sequence":"additional","affiliation":[]},{"given":"Daniel","family":"Kahn","sequence":"additional","affiliation":[]},{"given":"Laurent","family":"Duret","sequence":"additional","affiliation":[]}],"member":"286","published-online":{"date-parts":[[2012,2,25]]},"reference":[{"key":"2023012711534558300_B1","doi-asserted-by":"crossref","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","article-title":"Gapped BLAST and PSI-BLAST: a new generation of protein database search programs","volume":"25","author":"Altschul","year":"1997","journal-title":"Nucleic Acids Res."},{"key":"2023012711534558300_B2","doi-asserted-by":"crossref","first-page":"e1001131","DOI":"10.1371\/journal.pcbi.1001131","article-title":"Detecting network communities: an application to phylogenetic analysis","volume":"7","author":"Andrade","year":"2011","journal-title":"PLoS Comput. Biol."},{"key":"2023012711534558300_B3","doi-asserted-by":"crossref","first-page":"326","DOI":"10.1093\/bioinformatics\/btq655","article-title":"Improving the quality of protein similarity network clustering algorithms using the network edge weight distribution","volume":"27","author":"Apeltsin","year":"2011","journal-title":"Bioinformatics"},{"key":"2023012711534558300_B4","doi-asserted-by":"crossref","first-page":"e4345","DOI":"10.1371\/journal.pone.0004345","article-title":"Using sequence similarity networks for visualization of relationships across diverse protein superfamilies","volume":"4","author":"Atkinson","year":"2009","journal-title":"PLoS ONE"},{"key":"2023012711534558300_B5","doi-asserted-by":"crossref","first-page":"719","DOI":"10.1109\/34.865189","article-title":"Assessing a mixture model for clustering with the integrated completed likelihood","volume":"22","author":"Biernacki","year":"2000","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"2023012711534558300_B6","doi-asserted-by":"crossref","first-page":"P10008+","DOI":"10.1088\/1742-5468\/2008\/10\/P10008","article-title":"Fast unfolding of communities in large networks","volume":"2008","author":"Blondel","year":"2008","journal-title":"J. Stat. Mech.-Theory E."},{"key":"2023012711534558300_B7","doi-asserted-by":"crossref","first-page":"R8","DOI":"10.1186\/gb-2006-7-1-r8","article-title":"A gold standard set of mechanistically diverse enzyme superfamilies","volume":"7","author":"Brown","year":"2006","journal-title":"Genome Biol."},{"key":"2023012711534558300_B8","doi-asserted-by":"crossref","first-page":"D212","DOI":"10.1093\/nar\/gki034","article-title":"The ProDom database of protein domain families: more emphasis on 3D","volume":"33","author":"Bru","year":"2005","journal-title":"Nucleic Acids Res."},{"key":"2023012711534558300_B9","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511790492","volume-title":"Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.","author":"Durbin","year":"1998"},{"key":"2023012711534558300_B10","first-page":"205","article-title":"A new generation of homology search tools based on probabilistic inference","volume":"23","author":"Eddy","year":"2009","journal-title":"Genome Inform."},{"key":"2023012711534558300_B11","doi-asserted-by":"crossref","first-page":"1575","DOI":"10.1093\/nar\/30.7.1575","article-title":"An efficient algorithm for large-scale detection of protein families","volume":"30","author":"Enright","year":"2002","journal-title":"Nucleic Acids Res."},{"key":"2023012711534558300_B12","doi-asserted-by":"crossref","first-page":"D211","DOI":"10.1093\/nar\/gkp985","article-title":"The Pfam protein families database","volume":"38","author":"Finn","year":"2010","journal-title":"Nucleic Acids Res."},{"key":"2023012711534558300_B13","doi-asserted-by":"crossref","first-page":"1879","DOI":"10.1093\/molbev\/msp098","article-title":"INDELible: a flexible simulator of biological sequence evolution","volume":"26","author":"Fletcher","year":"2009","journal-title":"Mol. Biol. Evol."},{"key":"2023012711534558300_B14","doi-asserted-by":"crossref","first-page":"86","DOI":"10.1186\/1471-2105-11-86","article-title":"Enrichment of homologs in insignificant BLAST hits by co-complex network alignment","volume":"11","author":"Fokkens","year":"2010","journal-title":"BMC Bioinformatics"},{"key":"2023012711534558300_B15","doi-asserted-by":"crossref","first-page":"75","DOI":"10.1016\/j.physrep.2009.11.002","article-title":"Community detection in graphs","volume":"486","author":"Fortunato","year":"2010","journal-title":"Phys. Rep."},{"key":"2023012711534558300_B16","doi-asserted-by":"crossref","first-page":"150","DOI":"10.1016\/j.mib.2010.01.005","article-title":"Diversity of structure and function of response regulator output domains","volume":"13","author":"Galperin","year":"2010","journal-title":"Curr. Opin. Microbiol."},{"key":"2023012711534558300_B17","doi-asserted-by":"crossref","first-page":"7821","DOI":"10.1073\/pnas.122653799","article-title":"Community structure in social and biological networks","volume":"99","author":"Girvan","year":"2002","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023012711534558300_B18","doi-asserted-by":"crossref","first-page":"2177","DOI":"10.1093\/nar\/gkp1219","article-title":"Homologous over-extension: a challenge for iterative similarity searches","volume":"38","author":"Gonzalez","year":"2010","journal-title":"Nucleic Acids Res."},{"key":"2023012711534558300_B19","doi-asserted-by":"crossref","first-page":"1590","DOI":"10.1109\/TASL.2008.2002085","article-title":"Strategies to improve the robustness of agglomerative hierarchical clustering under data source variation for speaker diarization","volume":"16","author":"Han","year":"2008","journal-title":"IEEE T Audio Speech"},{"key":"2023012711534558300_B20","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1007\/978-1-59745-251-9_3","article-title":"Multiple alignment of DNA sequences with MAFFT","volume":"537","author":"Katoh","year":"2009","journal-title":"Methods Mol. Biol."},{"key":"2023012711534558300_B21","doi-asserted-by":"crossref","first-page":"e173","DOI":"10.1371\/journal.pcbi.0020173","article-title":"Protein homology network families reveal step-wise diversification of Type III and Type IV secretion systems","volume":"2","author":"Medini","year":"2006","journal-title":"PLoS Comput. Biol."},{"key":"2023012711534558300_B22","doi-asserted-by":"crossref","first-page":"116","DOI":"10.1186\/1471-2105-12-116","article-title":"Ultra-fast sequence clustering from similarity networks with SiLiX","volume":"12","author":"Miele","year":"2011","journal-title":"BMC Bioinformatics"},{"key":"2023012711534558300_B23","doi-asserted-by":"crossref","first-page":"1077","DOI":"10.1198\/016214501753208735","article-title":"Estimation and prediction for stochastic blockstructures","volume":"96","author":"Nowicki","year":"2001","journal-title":"J. Am. Stat. Assoc."},{"key":"2023012711534558300_B24","doi-asserted-by":"crossref","first-page":"1571","DOI":"10.1093\/nar\/gkj515","article-title":"Spectral clustering of protein sequences","volume":"34","author":"Paccanaro","year":"2006","journal-title":"Nucleic Acids Res."},{"issue":"Suppl. 6","key":"2023012711534558300_B25","doi-asserted-by":"crossref","first-page":"S3","DOI":"10.1186\/1471-2105-10-S6-S3","article-title":"Databases of homologous gene families for comparative genomics","volume":"10","author":"Penel","year":"2009","journal-title":"BMC Bioinformatics"},{"issue":"Suppl. 6","key":"2023012711534558300_B26","doi-asserted-by":"crossref","first-page":"S17","DOI":"10.1186\/1471-2105-10-S6-S17","article-title":"Deciphering the connectivity structure of biological networks using MixNet","volume":"10","author":"Picard","year":"2009","journal-title":"BMC Bioinformatics"},{"key":"2023012711534558300_B27","doi-asserted-by":"crossref","first-page":"7188","DOI":"10.1093\/nar\/gkm864","article-title":"SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB","volume":"35","author":"Pruesse","year":"2007","journal-title":"Nucleic Acids Res."},{"key":"2023012711534558300_B28","doi-asserted-by":"crossref","first-page":"D735","DOI":"10.1093\/nar\/gkm1005","article-title":"TreeFam: 2008 update","volume":"36","author":"Ruan","year":"2008","journal-title":"Nucleic Acids Res."},{"key":"2023012711534558300_B29","doi-asserted-by":"crossref","first-page":"2498","DOI":"10.1101\/gr.1239303","article-title":"Cytoscape: a software environment for integrated models of biomolecular interaction networks","volume":"13","author":"Shannon","year":"2003","journal-title":"Genome Res."},{"key":"2023012711534558300_B30","doi-asserted-by":"crossref","first-page":"e1000063","DOI":"10.1371\/journal.pcbi.1000063","article-title":"Sequence similarity network reveals common ancestry of multidomain proteins","volume":"4","author":"Song","year":"2008","journal-title":"PLoS Comput. Biol."},{"key":"2023012711534558300_B31","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1093\/nar\/29.1.22","article-title":"The COG database: new developments in phylogenetic classification of proteins from complete genomes","volume":"29","author":"Tatusov","year":"2001","journal-title":"Nucleic Acids Res."},{"key":"2023012711534558300_B32","doi-asserted-by":"crossref","first-page":"327","DOI":"10.1101\/gr.073585.107","article-title":"EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates","volume":"19","author":"Vilella","year":"2009","journal-title":"Genome Res."},{"key":"2023012711534558300_B33","doi-asserted-by":"crossref","first-page":"419","DOI":"10.1038\/nmeth0610-419","article-title":"Partitioning biological data with transitivity clustering","volume":"7","author":"Wittkop","year":"2010","journal-title":"Nat. Methods"},{"key":"2023012711534558300_B34","doi-asserted-by":"crossref","first-page":"627","DOI":"10.1089\/cmb.2009.0028","article-title":"Phylogeny inference based on spectral graph clustering","volume":"18","author":"Zhang","year":"2011","journal-title":"J. Comput. Biol."}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/8\/1078\/48930549\/bioinformatics_28_8_1078.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/8\/1078\/48930549\/bioinformatics_28_8_1078.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,27]],"date-time":"2023-01-27T12:24:51Z","timestamp":1674822291000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/28\/8\/1078\/195985"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,2,25]]},"references-count":34,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2012,4,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bts098","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2012,4]]},"published":{"date-parts":[[2012,2,25]]}}}