{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,26]],"date-time":"2025-10-26T14:14:20Z","timestamp":1761488060350,"version":"3.34.0"},"reference-count":25,"publisher":"Oxford University Press (OUP)","issue":"16","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2008,8,15]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Motivation: Automatic clustering of protein sequences is an important problem in computational biology. The recent explosion in genome sequences has given biological researchers a vast number of novel protein sequences. However, the majority of these sequences have no experimental evidence for their molecular function in the cell, and the responsibility for correctly annotating these sequences falls upon the bioinformatics community. Ideally, we would like to be able to group sequences of similar or identical molecular function in an automatic fashion, without relying on experimental evidence.<\/jats:p><jats:p>Results: In this article I present a novel probabilistic framework that models subfamilies within a known protein family. Given a multiple sequence alignment, the model uses Dirichlet mixture densities to estimate amino acid preferences within subfamily clusters, and places a Dirichlet process prior on the overall set of clusters. Based on results from several datasets, the model breaks data accurately into functional subgroups.<\/jats:p><jats:p>Availability: The algorithm is implemented as c++ software available at bpg-research.berkeley.edu\/~duncanb\/dpcluster\/<\/jats:p><jats:p>Contact: \u00a0duncan_brown@merck.com<\/jats:p><jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btn244","type":"journal-article","created":{"date-parts":[[2008,5,30]],"date-time":"2008-05-30T00:33:45Z","timestamp":1212107625000},"page":"1765-1771","source":"Crossref","is-referenced-by-count":9,"title":["Efficient functional clustering of protein sequences using the Dirichlet process"],"prefix":"10.1093","volume":"24","author":[{"given":"Duncan P.","family":"Brown","sequence":"first","affiliation":[{"name":"1 Department of Bioengineering, UC Berkeley and 2Merck & Co., Inc., 1700 Owens St, San Francisco, CA 94158, USA"},{"name":"1 Department of Bioengineering, UC Berkeley and 2Merck & Co., Inc., 1700 Owens St, San Francisco, CA 94158, USA"}]}],"member":"286","published-online":{"date-parts":[[2008,5,29]]},"reference":[{"key":"2023020210501504700_B1","doi-asserted-by":"crossref","first-page":"908","DOI":"10.1093\/bioinformatics\/18.7.908","article-title":"Clustering of proximal sequence space for the identification of protein families","volume":"18","author":"Abascal","year":"2002","journal-title":"Bioinformatics"},{"key":"2023020210501504700_B2","doi-asserted-by":"crossref","first-page":"353","DOI":"10.1214\/aos\/1176342372","article-title":"Ferguson distributions via Polya Urn schemes","volume":"1","author":"Blackwell","year":"1973","journal-title":"Ann. Stat"},{"key":"2023020210501504700_B3","doi-asserted-by":"crossref","first-page":"e160","DOI":"10.1371\/journal.pcbi.0030160","article-title":"Automated protein subfamily identification and classification","volume":"3","author":"Brown","year":"2007","journal-title":"PLoS Comput. Biol"},{"key":"2023020210501504700_B4","doi-asserted-by":"crossref","first-page":"343","DOI":"10.1186\/gb-2004-5-9-343","article-title":"Structural genomics and structural biology: compare and contrast","volume":"5","author":"Chandonia","year":"2004","journal-title":"Genome Biol"},{"key":"2023020210501504700_B5","article-title":"An improved merge-split sampler for conjugate dirichlet process mixture models","volume-title":"Technical Report 1086.","author":"Dahl","year":"2003"},{"key":"2023020210501504700_B6","first-page":"399","article-title":"Clustering protein sequence and structure space with infinite Gaussian mixture models","author":"Dubey","year":"2004","journal-title":"Pac. Symp. Biocomput"},{"key":"2023020210501504700_B7","doi-asserted-by":"crossref","first-page":"1792","DOI":"10.1093\/nar\/gkh340","article-title":"MUSCLE: multiple sequence alignment with high accuracy and high throughput","volume":"32","author":"Edgar","year":"2004","journal-title":"Nucleic Acids Res"},{"key":"2023020210501504700_B8","doi-asserted-by":"crossref","first-page":"451","DOI":"10.1093\/bioinformatics\/16.5.451","article-title":"GeneRAGE: a robust algorithm for sequence clustering and domain detection","volume":"16","author":"Enright","year":"2000","journal-title":"Bioinformatics"},{"key":"2023020210501504700_B9","doi-asserted-by":"crossref","first-page":"346","DOI":"10.1093\/nar\/29.1.346","article-title":"Collecting and harvesting biological data: the GPCRDB and NucleaRDB information systems","volume":"29","author":"Horn","year":"2001","journal-title":"Nucleic Acids Res"},{"key":"2023020210501504700_B10","doi-asserted-by":"crossref","first-page":"294","DOI":"10.1093\/nar\/gkg103","article-title":"GPCRDB information system for G protein-coupled receptors","volume":"31","author":"Horn","year":"2003","journal-title":"Nucleic Acids Res"},{"key":"2023020210501504700_B11","article-title":"A split-merge markov chain Monte Carlo procedure for the dirichlet mixture model","volume-title":"Technical Report 2003.","author":"Jain","year":"2000"},{"key":"2023020210501504700_B12","doi-asserted-by":"crossref","first-page":"134","DOI":"10.1002\/(SICI)1097-0134(1997)1+<134::AID-PROT18>3.0.CO;2-P","article-title":"Predicting protein structure using hidden Markov models","author":"Karplus","year":"1997","journal-title":"Proteins"},{"key":"2023020210501504700_B13","doi-asserted-by":"crossref","first-page":"430","DOI":"10.1093\/bioinformatics\/14.5.430","article-title":"A set-theoretic approach to database searching and clustering","volume":"14","author":"Krause","year":"1998","journal-title":"Bioinformatics"},{"key":"2023020210501504700_B14","doi-asserted-by":"crossref","first-page":"1658","DOI":"10.1093\/bioinformatics\/btl158","article-title":"cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences","volume":"22","author":"Li","year":"2006","journal-title":"Bioinformatics"},{"key":"2023020210501504700_B15","doi-asserted-by":"crossref","first-page":"83","DOI":"10.1186\/1471-2105-6-83","article-title":"Bayesian coestimation of phylogeny and sequence alignment","volume":"6","author":"Lunter","year":"2005","journal-title":"BMC Bioinformatics"},{"key":"2023020210501504700_B16","first-page":"173","article-title":"Comparing clusterings by the variation of information","volume-title":"Learning Theory And Kernel Machines. Vol. 2777 of Lecture Notes In Artificial Intelligence.","author":"Meila","year":"2003"},{"key":"2023020210501504700_B17","doi-asserted-by":"crossref","first-page":"660","DOI":"10.1007\/s002390010253","article-title":"Assessing variability by joint sampling of alignments and mutation rates","volume":"53","author":"Metzler","year":"2001","journal-title":"J. Mol. Evol"},{"key":"2023020210501504700_B18","doi-asserted-by":"crossref","first-page":"249","DOI":"10.1080\/10618600.2000.10474879","article-title":"Markov chain sampling methods for Dirichlet process mixture","volume":"9","author":"Neal","year":"2000","journal-title":"J. Comput. Graph. Stat"},{"volume-title":"Enzyme nomenclature 1992: recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the nomenclature and classification of enzymes.","year":"1992","author":"Webb","key":"2023020210501504700_B19"},{"key":"2023020210501504700_B20","doi-asserted-by":"crossref","first-page":"2545","DOI":"10.1021\/bi052101l","article-title":"Leveraging enzyme structure-function relationships for functional inference and experimental design: the structure-function linkage database","volume":"45","author":"Pegg","year":"2006","journal-title":"Biochemistry"},{"key":"2023020210501504700_B21","first-page":"639","article-title":"A constructive definition of dirichlet priors","volume":"4","author":"Sethuraman","year":"1994","journal-title":"Stat. Sin"},{"key":"2023020210501504700_B22","first-page":"165","article-title":"Phylogenetic inference in protein superfamilies: analysis of SH2 domains","volume":"6","author":"Sj\u00f6lander","year":"1998","journal-title":"Proc. Int. Conf. Intell. Syst. Mol. Biol"},{"key":"2023020210501504700_B23","first-page":"327","article-title":"Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology","volume":"12","author":"Sj\u00f6lander","year":"1996","journal-title":"Comput. Appl. Biosci"},{"key":"2023020210501504700_B24","doi-asserted-by":"crossref","first-page":"1435","DOI":"10.1093\/oxfordjournals.molbev.a003929","article-title":"Secator: a program for inferring protein subfamilies from phylogenetic trees","volume":"18","author":"Wicker","year":"2001","journal-title":"Mol. Biol. Evol"},{"key":"2023020210501504700_B25","first-page":"212","article-title":"A map of the protein space\u2013an automatic hierarchical classification of all protein sequences","volume":"6","author":"Yona","year":"1998","journal-title":"Proc. Int. Conf. Intel. Syst. Mol. Biol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/24\/16\/1765\/49052898\/bioinformatics_24_16_1765.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/24\/16\/1765\/49052898\/bioinformatics_24_16_1765.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,30]],"date-time":"2025-01-30T13:35:28Z","timestamp":1738244128000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/24\/16\/1765\/199047"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2008,5,29]]},"references-count":25,"journal-issue":{"issue":"16","published-print":{"date-parts":[[2008,8,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btn244","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"type":"electronic","value":"1367-4811"},{"type":"print","value":"1367-4803"}],"subject":[],"published-other":{"date-parts":[[2008,8,15]]},"published":{"date-parts":[[2008,5,29]]}}}