{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,3,19]],"date-time":"2025-03-19T12:21:02Z","timestamp":1742386862132},"reference-count":40,"publisher":"Oxford University Press (OUP)","issue":"13","license":[{"start":{"date-parts":[[2016,10,2]],"date-time":"2016-10-02T00:00:00Z","timestamp":1475366400000},"content-version":"vor","delay-in-days":3015,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/2.0\/uk\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2008,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: The classification of proteins into homologous groups (families) allows their structure and function to be analysed and compared in an evolutionary context. The modular nature of eukaryotic proteins presents a considerable challenge to the delineation of families, as different local regions within a single protein may share common ancestry with distinct, even mutually exclusive, sets of homologs, thereby creating an intricate web of homologous relationships if full-length sequences are taken as the unit of evolution. We attempt to disentangle this web by developing a fully automated pipeline to delineate protein subsequences that represent sensible units for homology inference, and clustering them into putatively homologous families using the Markov clustering algorithm.<\/jats:p>\n               <jats:p>Results: Using six eukaryotic proteomes as input, we clustered 162 349 protein sequences into 19 697\u201377 415 subsequence families depending on granularity of clustering. We validated these Markov clusters of homologous subsequences (MACHOS) against the manually curated Pfam domain families, using a quality measure to assess overlap. Our subsequence families correspond well to known domain families and achieve higher quality scores than do groups generated by fully automated domain family classification methods. We illustrate our approach by analysis of a group of proteins that contains the glutamyl\/glutaminyl-tRNA synthetase domain, and conclude that our method can produce high-coverage decomposition of protein sequence space into precise homologous families in a way that takes the modularity of eukaryotic proteins into account. This approach allows for a fine-scale examination of evolutionary histories of proteins encoded in eukaryotic genomes.<\/jats:p>\n               <jats:p>Contact: \u00a0m.ragan@imb.uq.edu.au<\/jats:p>\n               <jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online. MACHOS for the six proteomes are available as FASTA-formatted files: http:\/\/research1t.imb.uq.edu.au\/ragan\/machos<\/jats:p>","DOI":"10.1093\/bioinformatics\/btn144","type":"journal-article","created":{"date-parts":[[2008,6,27]],"date-time":"2008-06-27T07:43:13Z","timestamp":1214552593000},"page":"i77-i85","source":"Crossref","is-referenced-by-count":13,"title":["MACHOS: Markov clusters of homologous subsequences"],"prefix":"10.1093","volume":"24","author":[{"given":"Simon","family":"Wong","sequence":"first","affiliation":[{"name":"ARC Centre of Excellence in Bioinformatics and Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD 4072, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mark A.","family":"Ragan","sequence":"additional","affiliation":[{"name":"ARC Centre of Excellence in Bioinformatics and Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD 4072, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2008,7,1]]},"reference":[{"key":"2023020210394769000_B1","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","article-title":"Basic local alignment search tool","volume":"215","author":"Altschul","year":"1990","journal-title":"J. Mol. Biol"},{"key":"2023020210394769000_B2","doi-asserted-by":"crossref","first-page":"311","DOI":"10.1006\/jmbi.2001.4776","article-title":"Domain combinations in archaeal, eubacterial and eukaryotic proteomes","volume":"310","author":"Apic","year":"2001","journal-title":"J. Mol. Biol"},{"key":"2023020210394769000_B3","doi-asserted-by":"crossref","first-page":"D138","DOI":"10.1093\/nar\/gkh121","article-title":"The Pfam protein families database","volume":"32","author":"Bateman","year":"2004","journal-title":"Nucleic Acids Res"},{"key":"2023020210394769000_B4","doi-asserted-by":"crossref","first-page":"D21","DOI":"10.1093\/nar\/gkl986","article-title":"GenBank","volume":"35","author":"Benson","year":"2007","journal-title":"Nucleic Acids Res"},{"key":"2023020210394769000_B5","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1093\/nar\/28.1.235","article-title":"The Protein Data Bank","volume":"28","author":"Berman","year":"2000","journal-title":"Nucleic Acids Res"},{"key":"2023020210394769000_B6","doi-asserted-by":"crossref","first-page":"D556","DOI":"10.1093\/nar\/gkj133","article-title":"Ensembl 2006","volume":"34","author":"Birney","year":"2006","journal-title":"Nucleic Acids Res"},{"key":"2023020210394769000_B7","doi-asserted-by":"crossref","first-page":"47","DOI":"10.1016\/0014-5793(91)80937-X","article-title":"Shuffled domains in extracellular proteins","volume":"286","author":"Bork","year":"1991","journal-title":"FEBS Lett"},{"key":"2023020210394769000_B8","doi-asserted-by":"crossref","first-page":"D212","DOI":"10.1093\/nar\/gki034","article-title":"The ProDom database of protein domain families: more emphasis on 3D","volume":"33","author":"Bru","year":"2005","journal-title":"Nucleic Acids Res"},{"key":"2023020210394769000_B9","doi-asserted-by":"crossref","first-page":"1559","DOI":"10.1126\/science.1112014","article-title":"The transcriptional landscape of the mammalian genome","volume":"309","author":"Carninci","year":"2005","journal-title":"Science"},{"key":"2023020210394769000_B10","doi-asserted-by":"crossref","first-page":"1377","DOI":"10.1126\/science.2255907","article-title":"How big is the universe of exons?","volume":"250","author":"Dorit","year":"1990","journal-title":"Science"},{"key":"2023020210394769000_B11","doi-asserted-by":"crossref","first-page":"1575","DOI":"10.1093\/nar\/30.7.1575","article-title":"An efficient algorithm for large-scale detection of protein families","volume":"30","author":"Enright","year":"2002","journal-title":"Nucleic Acids Res"},{"key":"2023020210394769000_B12","doi-asserted-by":"crossref","first-page":"D247","DOI":"10.1093\/nar\/gkj149","article-title":"Pfam: clans, web tools and services","volume":"34","author":"Finn","year":"2006","journal-title":"Nucleic Acids Res"},{"key":"2023020210394769000_B13","doi-asserted-by":"crossref","first-page":"99","DOI":"10.2307\/2412448","article-title":"Distinguishing homologous from analogous proteins","volume":"19","author":"Fitch","year":"1970","journal-title":"Syst. Zool"},{"key":"2023020210394769000_B14","doi-asserted-by":"crossref","first-page":"174","DOI":"10.1093\/bioinformatics\/14.2.174","article-title":"Automated protein sequence database classification. II. Delineation of domain boundaries from sequence similarities","volume":"14","author":"Gracy","year":"1998","journal-title":"Bioinformatics"},{"key":"2023020210394769000_B15","volume-title":"Homology. The hierarchical basis of comparative biology","author":"Hall","year":"1994"},{"key":"2023020210394769000_B16","doi-asserted-by":"crossref","first-page":"45","DOI":"10.1186\/1471-2105-5-45","article-title":"A hybrid clustering approach to recognition of protein families in 114 microbial genomes","volume":"5","author":"Harlow","year":"2004","journal-title":"BMC Bioinformatics"},{"key":"2023020210394769000_B17","doi-asserted-by":"crossref","first-page":"749","DOI":"10.1016\/S0022-2836(03)00269-9","article-title":"Exhaustive enumeration of protein domain families","volume":"328","author":"Heger","year":"2003","journal-title":"J. Mol. Biol"},{"key":"2023020210394769000_B18","doi-asserted-by":"crossref","first-page":"595","DOI":"10.1126\/science.273.5275.595","article-title":"Mapping the protein universe","volume":"273","author":"Holm","year":"1996","journal-title":"Science"},{"key":"2023020210394769000_B19","first-page":"373","article-title":"A space-efficient algorithm for local similarities","volume":"6","author":"Huang","year":"1990","journal-title":"Comput. Appl. Biosci"},{"key":"2023020210394769000_B20","doi-asserted-by":"crossref","first-page":"e363","DOI":"10.1371\/journal.pbio.0020363","article-title":"Human microRNA targets","volume":"2","author":"John","year":"2004","journal-title":"PLoS Biol"},{"key":"2023020210394769000_B21","doi-asserted-by":"crossref","first-page":"233","DOI":"10.1002\/pro.5560070202","article-title":"Domain assignment for protein structures using a consensus approach: characterization and analysis","volume":"7","author":"Jones","year":"1998","journal-title":"Protein Sci"},{"key":"2023020210394769000_B22","doi-asserted-by":"crossref","first-page":"309","DOI":"10.1146\/annurev.genet.39.073003.114725","article-title":"Orthologs, paralogs, and evolutionary genomics","volume":"39","author":"Koonin","year":"2005","journal-title":"Annu. Rev. Genet"},{"key":"2023020210394769000_B23","doi-asserted-by":"crossref","first-page":"15","DOI":"10.1186\/1471-2105-6-15","article-title":"Large scale hierarchical clustering of protein sequences","volume":"6","author":"Krause","year":"2005","journal-title":"BMC Bioinformatics"},{"key":"2023020210394769000_B24","doi-asserted-by":"crossref","first-page":"334","DOI":"10.1016\/S0959-440X(00)00211-6","article-title":"Clustering and analysis of protein families","volume":"11","author":"Kriventseva","year":"2001","journal-title":"Curr. Opin. Struct. Biol"},{"key":"2023020210394769000_B25","doi-asserted-by":"crossref","first-page":"2618","DOI":"10.1093\/bioinformatics\/bti386","article-title":"The properties of protein family space depend on experimental design","volume":"21","author":"Kunin","year":"2005","journal-title":"Bioinformatics"},{"key":"2023020210394769000_B26","doi-asserted-by":"crossref","first-page":"34","DOI":"10.1080\/00222937008696201","article-title":"On the use of the term homology in modern zoology","volume":"6","author":"Lankester","year":"1870","journal-title":"Ann. Mag. Nat. Hist"},{"key":"2023020210394769000_B27","doi-asserted-by":"crossref","first-page":"960","DOI":"10.1145\/185675.306789","article-title":"On the hardness of approximating minimization problems","volume":"41","author":"Lund","year":"1994","journal-title":"J. ACM"},{"key":"2023020210394769000_B28","doi-asserted-by":"crossref","first-page":"127","DOI":"10.1126\/science.163.3863.127.a","article-title":"Homology: a definition","volume":"163","author":"Margoliash","year":"1969","journal-title":"Science"},{"key":"2023020210394769000_B29","first-page":"379","volume-title":"Lectures on the comparative anatomy and physiology of the invertebrate animals, delivered at the Royal College of Surgeons, I 1843","author":"Owen","year":"1843"},{"key":"2023020210394769000_B30","doi-asserted-by":"crossref","first-page":"635","DOI":"10.1016\/0888-7543(91)90071-L","article-title":"Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms","volume":"11","author":"Pearson","year":"1991","journal-title":"Genomics"},{"key":"2023020210394769000_B31","doi-asserted-by":"crossref","first-page":"3824","DOI":"10.1093\/bioinformatics\/bti627","article-title":"Statistical evaluation of pairwise protein sequence comparison with the Bayesian bootstrap","volume":"21","author":"Price","year":"2005","journal-title":"Bioinformatics"},{"key":"2023020210394769000_B32","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1016\/S0065-3233(08)60520-3","article-title":"The anatomy and taxonomy of protein structure","volume":"34","author":"Richardson","year":"1981","journal-title":"Adv. Protein Chem"},{"key":"2023020210394769000_B33","doi-asserted-by":"crossref","first-page":"246","DOI":"10.1093\/bib\/3.3.246","article-title":"ProDom: automated clustering of homologous domains","volume":"3","author":"Servant","year":"2002","journal-title":"Brief. Bioinform"},{"key":"2023020210394769000_B34","doi-asserted-by":"crossref","first-page":"195","DOI":"10.1016\/0022-2836(81)90087-5","article-title":"Identification of common molecular subsequences","volume":"147","author":"Smith","year":"1981","journal-title":"J. Mol. Biol"},{"key":"2023020210394769000_B35","doi-asserted-by":"crossref","first-page":"D193","DOI":"10.1093\/nar\/gkl929","article-title":"The Universal Protein Resource (UniProt)","volume":"35","author":"The Uniprot Consortium","year":"2007","journal-title":"Nucleic Acids Res"},{"key":"2023020210394769000_B36","volume-title":"Graph Clustering by Flow Simulation","author":"van Dongen","year":"2000"},{"key":"2023020210394769000_B37","doi-asserted-by":"crossref","first-page":"360","DOI":"10.1002\/(SICI)1097-0134(19991115)37:3<360::AID-PROT5>3.0.CO;2-Z","article-title":"ProtoMap: automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space","volume":"37","author":"Yona","year":"1999","journal-title":"Proteins"},{"key":"2023020210394769000_B38","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1093\/nar\/28.1.49","article-title":"ProtoMap: automatic classification of protein sequences and hierarchy of protein families","volume":"28","author":"Yona","year":"2000","journal-title":"Nucleic Acids Res"},{"key":"2023020210394769000_B39","doi-asserted-by":"crossref","first-page":"97","DOI":"10.1016\/B978-1-4832-2734-4.50017-6","article-title":"Evolutionary divergence and convergence in proteins. In","volume-title":"Evolving Genes and Proteins","author":"Zuckerkandl","year":"1965"},{"key":"2023020210394769000_B40","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1016\/0022-5193(65)90083-4","article-title":"Molecules as documents of evolutionary history","volume":"8","author":"Zuckerkandl","year":"1965","journal-title":"J. Theor. Biol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/24\/13\/i77\/49052476\/bioinformatics_24_13_i77.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/24\/13\/i77\/49052476\/bioinformatics_24_13_i77.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,2]],"date-time":"2023-02-02T12:23:01Z","timestamp":1675340581000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/24\/13\/i77\/227117"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2008,7,1]]},"references-count":40,"journal-issue":{"issue":"13","published-print":{"date-parts":[[2008,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btn144","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2008,7,1]]},"published":{"date-parts":[[2008,7,1]]}}}