{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,2,1]],"date-time":"2024-02-01T18:10:17Z","timestamp":1706811017880},"reference-count":26,"publisher":"Springer Science and Business Media LLC","issue":"1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Background<\/jats:title>\n                <jats:p>The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30000 known and predicted human coding genes are characterized and have been assigned at least one function, there remains a fair number of genes (about 12000) for which no annotation has been made. The recent sequencing of other genomes has provided us with a huge amount of auxiliary sequence data which could help in the characterization of the human genes. Clustering these sequences into families is one of the first steps to perform comparative studies across several genomes.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>Here we report a novel clustering algorithm (CLUGEN) that has been used to cluster sequences of experimentally verified and predicted proteins from all sequenced genomes using a novel distance metric which is a neural network score between a pair of protein sequences. This distance metric is based on the pairwise sequence similarity score and the similarity between their domain structures. The distance metric is the probability that a pair of protein sequences are of the same Interpro family\/domain, which facilitates the modelling of transitive homology closure to detect remote homologues. The hierarchical average clustering method is applied with the new distance metric.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Conclusion<\/jats:title>\n                <jats:p>Benchmarking studies of our algorithm versus those reported in the literature shows that our algorithm provides clustering results with lower false positive and false negative rates. The clustering algorithm is applied to cluster several eukaryotic genomes and several dozens of prokaryotic genomes.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1186\/1471-2105-6-242","type":"journal-article","created":{"date-parts":[[2005,10,4]],"date-time":"2005-10-04T06:14:18Z","timestamp":1128406458000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks"],"prefix":"10.1186","volume":"6","author":[{"given":"Qicheng","family":"Ma","sequence":"first","affiliation":[]},{"given":"Gung-Wei","family":"Chirn","sequence":"additional","affiliation":[]},{"given":"Richard","family":"Cai","sequence":"additional","affiliation":[]},{"given":"Joseph D","family":"Szustakowski","sequence":"additional","affiliation":[]},{"given":"NR","family":"Nirmala","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2005,10,3]]},"reference":[{"issue":"5338","key":"567_CR1","doi-asserted-by":"publisher","first-page":"631","DOI":"10.1126\/science.278.5338.631","volume":"278","author":"RL Tatusov","year":"1997","unstructured":"Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science 1997, 278(5338):631\u2013637.","journal-title":"Science"},{"issue":"3\u20134","key":"567_CR2","doi-asserted-by":"publisher","first-page":"333","DOI":"10.1016\/S0097-8485(99)00011-X","volume":"23","author":"J Gouzy","year":"1999","unstructured":"Gouzy J, Corpet F, Kahn D: Whole genome protein domain analysis using a new method for domain clustering. Comput Chem 1999, 23(3\u20134):333\u2013340.","journal-title":"Comput Chem"},{"issue":"3","key":"567_CR3","doi-asserted-by":"publisher","first-page":"272","DOI":"10.1093\/bioinformatics\/17.3.272","volume":"17","author":"A Heger","year":"2001","unstructured":"Heger A, Holm L: Picasso: generating a covering set of protein family profiles. Bioinformatics 2001, 17(3):272\u2013279.","journal-title":"Bioinformatics"},{"issue":"17","key":"567_CR4","doi-asserted-by":"publisher","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","volume":"25","author":"SF Altschul","year":"1997","unstructured":"Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389\u20133402.","journal-title":"Nucleic Acids Res"},{"issue":"5","key":"567_CR5","doi-asserted-by":"publisher","first-page":"430","DOI":"10.1093\/bioinformatics\/14.5.430","volume":"14","author":"A Krause","year":"1998","unstructured":"Krause A, Vingron MA: set-theoretic approach to database searching and clustering. Bioinformatics 1998, 14(5):430\u2013438.","journal-title":"Bioinformatics"},{"issue":"3","key":"567_CR6","doi-asserted-by":"publisher","first-page":"839","DOI":"10.1006\/jmbi.2001.5387","volume":"316","author":"RA George","year":"2002","unstructured":"George RA, Heringa J: SnapDRAGON: a method to delineate protein structural domains from sequence data. J Mol Biol 2002, 316(3):839\u2013851.","journal-title":"J Mol Biol"},{"key":"567_CR7","first-page":"224","volume-title":"Proceedings of the seventh annual international conference on Computational molecular biology","author":"N Nagarajan","year":"2003","unstructured":"Nagarajan N, Yona G: A multi-expert system for the automatic detection of protein domains from sequence information. In Proceedings of the seventh annual international conference on Computational molecular biology. Berlin, Germany; 2003:224\u2013234."},{"issue":"1","key":"567_CR8","doi-asserted-by":"publisher","first-page":"33","DOI":"10.1093\/nar\/29.1.33","volume":"29","author":"EV Kriventseva","year":"2001","unstructured":"Kriventseva EV, Fleischmann W, Zdobnov EM, Apweiler R: CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins. Nucleic Acids Res 2001, 29(1):33\u201336.","journal-title":"Nucleic Acids Res"},{"issue":"10","key":"567_CR9","doi-asserted-by":"publisher","first-page":"935","DOI":"10.1093\/bioinformatics\/17.10.935","volume":"17","author":"E Bolten","year":"2001","unstructured":"Bolten E, Schliep A, Schneckener S, Schomburg D, Schrader R: Clustering protein sequences\u2013structure prediction by transitivehomology. Bioinformatics 2001, 17(10):935\u2013941.","journal-title":"Bioinformatics"},{"key":"567_CR10","first-page":"S182","volume-title":"Bioinformatics","author":"P Pipenbacher","year":"2002","unstructured":"Pipenbacher P, Schliep A, Schneckener S, Schonhuth A, Schomburg D, Schrader R: ProClust: improved clustering of protein sequences with an extended graph-based approach. Bioinformatics 2002, (Suppl 2):S182\u2013191."},{"key":"567_CR11","first-page":"S14","volume-title":"Bioinformatics","author":"O Sasson","year":"2002","unstructured":"Sasson O, Linial N, Linial M: The metric space of proteins-comparative study of clustering algorithms. Bioinformatics 2002, (Suppl 1):S14\u201321."},{"issue":"3","key":"567_CR12","doi-asserted-by":"publisher","first-page":"360","DOI":"10.1002\/(SICI)1097-0134(19991115)37:3<360::AID-PROT5>3.0.CO;2-Z","volume":"37","author":"G Yona","year":"1999","unstructured":"Yona G, Linial N, Linial M: ProtoMap: automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space. Proteins 1999, 37(3):360\u2013378.","journal-title":"Proteins"},{"issue":"7","key":"567_CR13","doi-asserted-by":"publisher","first-page":"1575","DOI":"10.1093\/nar\/30.7.1575","volume":"30","author":"AJ Enright","year":"2002","unstructured":"Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002, 30(7):1575\u201384.","journal-title":"Nucleic Acids Res"},{"issue":"2","key":"567_CR14","doi-asserted-by":"publisher","first-page":"117","DOI":"10.1093\/bioinformatics\/16.2.117","volume":"16","author":"SA Teichmann","year":"2000","unstructured":"Teichmann SA, Chothia C, Church GM, Park J: Fast assignment of protein structures to sequences using the intermediate sequence library PDB-ISL. Bioinformatics 2000, 16(2):117\u2013124.","journal-title":"Bioinformatics"},{"issue":"1","key":"567_CR15","doi-asserted-by":"publisher","first-page":"349","DOI":"10.1006\/jmbi.1997.1288","volume":"273","author":"J Park","year":"1997","unstructured":"Park J, Teichmann SA, Hubbard T, Chothia C: Intermediate sequences increase the detection of homology between sequences. J Mol Biol 1997, 273(1):349\u2013354.","journal-title":"J Mol Biol"},{"issue":"8","key":"567_CR16","doi-asserted-by":"publisher","first-page":"707","DOI":"10.1093\/bioinformatics\/14.8.707","volume":"14","author":"M Gerstein","year":"1998","unstructured":"Gerstein M: Measurement of the effectiveness of transitive sequence comparison, through a third 'intermediate' sequence. Bioinformatics 1998, 14(8):707\u2013714.","journal-title":"Bioinformatics"},{"issue":"4","key":"567_CR17","doi-asserted-by":"publisher","first-page":"1201","DOI":"10.1006\/jmbi.1998.2221","volume":"284","author":"J Park","year":"1998","unstructured":"Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C: Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol 1998, 284(4):1201\u20131210.","journal-title":"J Mol Biol"},{"issue":"1","key":"567_CR18","doi-asserted-by":"publisher","first-page":"365","DOI":"10.1093\/nar\/gkg095","volume":"31","author":"B Boeckmann","year":"2003","unstructured":"Boeckmann B, et al.: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31(1):365\u2013370.","journal-title":"Nucleic Acids Res"},{"issue":"19","key":"567_CR19","doi-asserted-by":"publisher","first-page":"847","DOI":"10.1093\/bioinformatics\/17.9.847","volume":"17","author":"EM Zdobnov","year":"2001","unstructured":"Zdobnov EM, Apweiler R: InterProScan \u2013 an integration platform for the signature-recognition methods in InterPro. Bioinformatics 2001, 17(19):847\u2013848.","journal-title":"Bioinformatics"},{"key":"567_CR20","doi-asserted-by":"publisher","first-page":"78s","DOI":"10.1183\/09031936.02.00400202","volume":"36","author":"ST Cole","year":"2002","unstructured":"Cole ST: Comparative mycobacterial genomics as a tool for drug target and antigen discovery. Eur Respir J Suppl 2002, 36: 78s-86s.","journal-title":"Eur Respir J Suppl"},{"issue":"8","key":"567_CR21","doi-asserted-by":"publisher","first-page":"4285","DOI":"10.1073\/pnas.96.8.4285","volume":"96","author":"M Pellegrini","year":"1999","unstructured":"Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 1999, 96(8):4285\u20134288.","journal-title":"Proc Natl Acad Sci U S A"},{"issue":"8","key":"567_CR22","doi-asserted-by":"publisher","first-page":"700","DOI":"10.1093\/bioinformatics\/17.8.700","volume":"17","author":"J Pei","year":"2001","unstructured":"Pei J, Grishin NV: AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics 2001, 17(8):700\u2013712.","journal-title":"Bioinformatics"},{"issue":"2","key":"567_CR23","first-page":"426","volume":"40","author":"TJ Wang","year":"2001","unstructured":"Wang TJ, Ma Q, Shasha D, Wu C: New techniques for extracting features from protein sequences. IBM Systems Journal, Special Issue on Deep Computing for the Life Sciences 2001, 40(2):426\u2013441.","journal-title":"IBM Systems Journal, Special Issue on Deep Computing for the Life Sciences"},{"key":"567_CR24","doi-asserted-by":"crossref","DOI":"10.1093\/oso\/9780198538493.001.0001","volume-title":"Neural Networks for Pattern Recognition","author":"CM Bishop","year":"1995","unstructured":"Bishop CM: Neural Networks for Pattern Recognition. Oxford University Press, New York, New York; 1995."},{"key":"567_CR25","volume-title":"Mastering MATLAB 5: A comprehensive tutorial and reference","author":"DC Hanselman","year":"1998","unstructured":"Hanselman DC: Mastering MATLAB 5: A comprehensive tutorial and reference. Prentice Hall, Upper Saddle River, New Jersey; 1998."},{"key":"567_CR26","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-387-21606-5","volume-title":"The Elements of Statistical Learning","author":"T Hastie","year":"2001","unstructured":"Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. Springer, New York; 2001."}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-6-242.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,1]],"date-time":"2024-02-01T17:56:22Z","timestamp":1706810182000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-6-242"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2005,10,3]]},"references-count":26,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2005,12]]}},"alternative-id":["567"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-6-242","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2005,10,3]]},"assertion":[{"value":"12 April 2005","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 October 2005","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 October 2005","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"242"}}