{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T21:26:58Z","timestamp":1775078818480,"version":"3.50.1"},"reference-count":32,"publisher":"Springer Science and Business Media LLC","issue":"1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2012,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Background<\/jats:title><jats:p>Hidden Markov Models (HMMs) are a powerful tool for protein domain identification. The Pfam database notably provides a large collection of HMMs which are widely used for the annotation of proteins in newly sequenced organisms. In Pfam, each domain family is represented by a curated multiple sequence alignment from which a profile HMM is built. In spite of their high specificity, HMMs may lack sensitivity when searching for domains in divergent organisms. This is particularly the case for species with a biased amino-acid composition, such as <jats:italic>P. falciparum<\/jats:italic>, the main causal agent of human malaria. In this context, fitting HMMs to the specificities of the target proteome can help identify additional domains.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>Using <jats:italic>P. falciparum<\/jats:italic> as an example, we compare approaches that have been proposed for this problem, and present two alternative methods. Because previous attempts strongly rely on known domain occurrences in the target species or its close relatives, they mainly improve the detection of domains which belong to already identified families. Our methods learn global correction rules that adjust amino-acid distributions associated with the match states of HMMs. These rules are applied to all match states of the whole HMM library, thus enabling the detection of domains from previously absent families. 
Additionally, we propose a procedure to estimate the proportion of false positives among the newly discovered domains. Starting with the Pfam standard library, we build several new libraries with the different HMM-fitting approaches. These libraries are first used to detect new domain occurrences with low E-values. Second, by applying the Co-Occurrence Domain Discovery (CODD) procedure we have recently proposed, the libraries are further used to identify likely occurrences among potential domains with higher E-values.<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusion<\/jats:title><jats:p>We show that the new approaches allow identification of several domain families previously absent in the <jats:italic>P. falciparum<\/jats:italic> proteome and the Apicomplexa phylum, and identify many domains that are not detected by previous approaches. In terms of the number of newly discovered domains, the new approaches outperform the previous ones when no close species are available or when they are used to identify likely occurrences among potential domains with high E-values. All predictions on <jats:italic>P. falciparum<\/jats:italic> have been integrated into a dedicated website which pools all known\/new annotations of protein domains and functions for this organism. 
Software implementing the two proposed approaches is available at the same address: <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"http:\/\/www.lirmm.fr\/~terrapon\/HMMfit\/\" ext-link-type=\"uri\">http:\/\/www.lirmm.fr\/~terrapon\/HMMfit\/<\/jats:ext-link><\/jats:p><\/jats:sec>","DOI":"10.1186\/1471-2105-13-67","type":"journal-article","created":{"date-parts":[[2012,5,1]],"date-time":"2012-05-01T10:14:07Z","timestamp":1335867247000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["Fitting hidden Markov models of protein domains to a target species: application to Plasmodium falciparum"],"prefix":"10.1186","volume":"13","author":[{"given":"Nicolas","family":"Terrapon","sequence":"first","affiliation":[]},{"given":"Olivier","family":"Gascuel","sequence":"additional","affiliation":[]},{"given":"\u00c9ric","family":"Mar\u00e9chal","sequence":"additional","affiliation":[]},{"given":"Laurent","family":"Br\u00e9h\u00e9lin","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2012,5,1]]},"reference":[{"key":"5309_CR1","doi-asserted-by":"publisher","first-page":"167","DOI":"10.1016\/S0065-3233(08)60520-3","volume":"34","author":"J Richardson","year":"1981","unstructured":"Richardson J: The anatomy and taxonomy of protein structure. Adv Protein Chem 1981, 34: 167.","journal-title":"Adv Protein Chem"},{"issue":"4","key":"5309_CR2","first-page":"536","volume":"247","author":"A Murzin","year":"1995","unstructured":"Murzin A, Brenner S, Hubbard T, Chothia C: SCOP: a Structural Classification of Proteins database for the investigation of sequences and structures. 
J Mol Biol 1995, 247(4):536.","journal-title":"J Mol Biol"},{"issue":"Database issue","key":"5309_CR3","doi-asserted-by":"publisher","first-page":"D211","DOI":"10.1093\/nar\/gkn785","volume":"37","author":"S Hunter","year":"2009","unstructured":"Hunter S: InterPro: the integrative protein signature database. Nucleic Acid Res 2009, 37(Database issue):D211.","journal-title":"Nucleic Acid Res"},{"issue":"Database issue","key":"5309_CR4","doi-asserted-by":"publisher","first-page":"D211","DOI":"10.1093\/nar\/gkp985","volume":"38","author":"R Finn","year":"2010","unstructured":"Finn R, Mistry J, Tate J, Coggill P, Heger A, Pollington J, Gavin O, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer E, Eddy S, Bateman A: The Pfam protein families database. Nucleic Acids Res 2010, 38(Database issue):D211.","journal-title":"Nucleic Acids Res"},{"key":"5309_CR5","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511790492","volume-title":"Biological sequence analysis: Probabilistic models of proteins and nucleic acids","author":"R Durbin","year":"1998","unstructured":"Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press; 1998."},{"issue":"Database issue","key":"5309_CR6","doi-asserted-by":"publisher","first-page":"D169","DOI":"10.1093\/nar\/gkn664","volume":"37","author":"The UniProt Consortium","year":"2009","unstructured":"The UniProt Consortium: The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res 2009, 37(Database issue):D169.","journal-title":"Nucleic Acids Res"},{"key":"5309_CR7","doi-asserted-by":"publisher","first-page":"25","DOI":"10.1038\/75556","volume":"25","author":"The Gene Ontology Consortium","year":"2000","unstructured":"The Gene Ontology Consortium: Gene Ontology: tool for the unification of biology. Nat Genet 2000, 25: 25. 
10.1038\/75556","journal-title":"Nat Genet"},{"key":"5309_CR8","volume-title":"Nat Genet","author":"World Health Organization","year":"2010","unstructured":"World Health Organization: World Malaria Report. Nat Genet 2010."},{"issue":"2","key":"5309_CR9","doi-asserted-by":"publisher","first-page":"218","DOI":"10.1101\/gr.GR-1522R","volume":"11","author":"E Pizzi","year":"2001","unstructured":"Pizzi E, Frontali C: Low-complexity regions in Plasmodium falciparum proteins. Genome Res 2001, 11(2):218. 10.1101\/gr.GR-1522R","journal-title":"Genome Res"},{"issue":"2","key":"5309_CR10","doi-asserted-by":"publisher","first-page":"163","DOI":"10.1016\/j.gene.2004.04.029","volume":"336","author":"O Bastien","year":"2004","unstructured":"Bastien O, Lespinats S, Roy S, M\u00e9tayer K, Fertil B, Codani J, Mar\u00e9chal E: Analysis of the compositional biases in Plasmodium falciparum genome and proteome using Arabidopsis thaliana as a reference. Gene 2004, 336(2):163. 10.1016\/j.gene.2004.04.029","journal-title":"Gene"},{"key":"5309_CR11","doi-asserted-by":"publisher","first-page":"56","DOI":"10.1186\/1471-2105-5-56","volume":"5","author":"L Coin","year":"2004","unstructured":"Coin L, Bateman A, Durbin R: Enhanced protein domain discovery using taxonomy. BMC Bioinformatics 2004, 5: 56. 10.1186\/1471-2105-5-56","journal-title":"BMC Bioinformatics"},{"key":"5309_CR12","doi-asserted-by":"publisher","first-page":"97","DOI":"10.1186\/1471-2164-8-97","volume":"8","author":"I Alam","year":"2007","unstructured":"Alam I, Hubbard S, Oliver S, Rattray M: A kingdom-specific protein domain HMM library for improved annotation of fungal genomes. BMC Genomics 2007, 8: 97. 
10.1186\/1471-2164-8-97","journal-title":"BMC Genomics"},{"issue":"23","key":"5309_CR13","doi-asserted-by":"publisher","first-page":"3077","DOI":"10.1093\/bioinformatics\/btp560","volume":"25","author":"N Terrapon","year":"2009","unstructured":"Terrapon N, Gascuel O, Mar\u00e9chal E, Br\u00e9h\u00e9lin L: Detection of new protein domains using co-occurrence: application to Plasmodium falciparum. Bioinformatics 2009, 25(23):3077. 10.1093\/bioinformatics\/btp560","journal-title":"Bioinformatics"},{"key":"5309_CR14","volume-title":"HMMER User\u2019s Guide Version 2.3.2","author":"S Eddy","year":"2003","unstructured":"Eddy S: HMMER User\u2019s Guide Version 2.3.2. 2003."},{"issue":"3","key":"5309_CR15","doi-asserted-by":"publisher","first-page":"275","DOI":"10.1093\/bioinformatics\/8.3.275","volume":"8","author":"D Jones","year":"1992","unstructured":"Jones D, Taylor W, Thornton J: The rapid generation of mutation data matrices from protein sequences. Bioinformatics 1992, 8(3):275. 10.1093\/bioinformatics\/8.3.275","journal-title":"Bioinformatics"},{"issue":"5","key":"5309_CR16","doi-asserted-by":"publisher","first-page":"691","DOI":"10.1093\/oxfordjournals.molbev.a003851","volume":"18","author":"S Whelan","year":"2001","unstructured":"Whelan S, Goldman N: A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 2001, 18(5):691. 10.1093\/oxfordjournals.molbev.a003851","journal-title":"Mol Biol Evol"},{"key":"5309_CR17","volume-title":"Inferring Phylogenies","author":"J Felsenstein","year":"2003","unstructured":"Felsenstein J: Inferring Phylogenies. Sinauer Associates, Sunderland, Massachusetts; 2003."},{"issue":"7","key":"5309_CR18","doi-asserted-by":"publisher","first-page":"1307","DOI":"10.1093\/molbev\/msn067","volume":"25","author":"S Le","year":"2008","unstructured":"Le S, Gascuel O: An improved general amino acid replacement matrix. Mol Biol Evol 2008, 25(7):1307. 
10.1093\/molbev\/msn067","journal-title":"Mol Biol Evol"},{"key":"5309_CR19","unstructured":"Lloyd S: Least squares quantization in PCM. Technical Report 1957."},{"issue":"2","key":"5309_CR20","doi-asserted-by":"publisher","first-page":"311","DOI":"10.1006\/jmbi.2001.4776","volume":"310","author":"G Apic","year":"2001","unstructured":"Apic G, Gough J, Teichmann S: Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J Mol Biol 2001, 310(2):311. 10.1006\/jmbi.2001.4776","journal-title":"J Mol Biol"},{"key":"5309_CR21","doi-asserted-by":"crossref","first-page":"36","DOI":"10.1080\/00031305.1983.10483087","volume":"37","author":"B Efron","year":"1983","unstructured":"Efron B, Gong G: A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation. Am Statistician 1983, 37: 36.","journal-title":"Am Statistician"},{"issue":"2","key":"5309_CR22","doi-asserted-by":"publisher","first-page":"149","DOI":"10.1016\/0097-8485(93)85006-X","volume":"17","author":"J Wootton","year":"1993","unstructured":"Wootton J, Federhen S: Statistics of local complexity in amino acid sequences and sequence databases. Comput Chem 1993, 17(2):149. 10.1016\/0097-8485(93)85006-X","journal-title":"Comput Chem"},{"issue":"4","key":"5309_CR23","doi-asserted-by":"publisher","first-page":"698","DOI":"10.1016\/j.meegid.2010.09.008","volume":"11","author":"A Ghouila","year":"2010","unstructured":"Ghouila A, Terrapon N, Gascuel O, Guerfali FZ, Laouini D, Mar\u00e9chal E, Br\u00e9h\u00e9lin L: EuPathDomains: the Divergent Domain Database for Eukaryotic Pathogens. Infection Genetic and Evolution 2010, 11(4):698.","journal-title":"Infection Genetic and Evolution"},{"issue":"15","key":"5309_CR24","doi-asserted-by":"publisher","first-page":"1681","DOI":"10.1093\/bioinformatics\/btn312","volume":"24","author":"K Forslund","year":"2008","unstructured":"Forslund K, Sonnhammer E: Predicting protein function from domain content. Bioinformatics 2008, 24(15):1681. 
10.1093\/bioinformatics\/btn312","journal-title":"Bioinformatics"},{"issue":"2","key":"5309_CR25","doi-asserted-by":"publisher","first-page":"228","DOI":"10.1101\/gr.101063.109","volume":"20","author":"N Ponts","year":"2010","unstructured":"Ponts N, Harris E, Prudhomme J, Wick I, Eckhardt-Ludka C, Hicks G, Hardiman G, Lonardi S, Le Roch K: Nucleosome landscape and control of transcription in the human malaria parasite. Genome Res 2010, 20(2):228. 10.1101\/gr.101063.109","journal-title":"Genome Res"},{"issue":"2","key":"5309_CR26","doi-asserted-by":"publisher","first-page":"60","DOI":"10.1016\/j.pt.2003.11.001","volume":"20","author":"G McConkey","year":"2004","unstructured":"McConkey G, Pinney J, Westhead D, Plueckhahn K, Fitzpatrick T, Macheroux P, Kappes B: Annotating the Plasmodium genome and the enigma of the shikimate pathway. TRENDS Parasitology 2004, 20(2):60. 10.1016\/j.pt.2003.11.001","journal-title":"TRENDS Parasitology"},{"issue":"Database issue","key":"5309_CR27","doi-asserted-by":"publisher","first-page":"D233","DOI":"10.1093\/nar\/gkn663","volume":"37","author":"B Cantarel","year":"2009","unstructured":"Cantarel B, Coutinho P, Rancurel C, Bernard T, Lombard V, Henrissat B: The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics. Nucleic Acids Res 2009, 37(Database issue):D233.","journal-title":"Nucleic Acids Res"},{"issue":"8","key":"5309_CR28","doi-asserted-by":"publisher","first-page":"1285","DOI":"10.1007\/s00018-011-0646-1","volume":"68","author":"S Sato","year":"2011","unstructured":"Sato S: The apicomplexan plastid and its evolution. Cell Mol Life Sci 2011, 68(8):1285. 
10.1007\/s00018-011-0646-1","journal-title":"Cell Mol Life Sci"},{"issue":"13","key":"5309_CR29","doi-asserted-by":"publisher","first-page":"1602","DOI":"10.1093\/bioinformatics\/btp265","volume":"25","author":"A Kumar","year":"2009","unstructured":"Kumar A, Cowen L: Augmented training of hidden Markov models to recognize remote homologs via simulated evolution. Bioinformatics 2009, 25(13):1602. 10.1093\/bioinformatics\/btp265","journal-title":"Bioinformatics"},{"issue":"3","key":"5309_CR30","doi-asserted-by":"publisher","first-page":"361","DOI":"10.1089\/cmb.1996.3.361","volume":"3","author":"H Mamitsuka","year":"1996","unstructured":"Mamitsuka H: A learning method of hidden Markov models for sequence discrimination. J Comput Biol 1996, 3(3):361. 10.1089\/cmb.1996.3.361","journal-title":"J Comput Biol"},{"key":"5309_CR31","first-page":"322","volume-title":"Pacific Symposium on Biocomputing, Volume 10","author":"D Brown","year":"2005","unstructured":"Brown D, Krishnamurthy N, Dale J, Christopher W, Sj\u00f6lander K: Subfamily hmms in functional genomics. Pacific Symposium on Biocomputing, Volume 10 2005, 322\u2013333."},{"key":"5309_CR32","doi-asserted-by":"publisher","first-page":"104","DOI":"10.1186\/1471-2105-8-104","volume":"8","author":"P Srivastava","year":"2007","unstructured":"Srivastava P, Desai D, Nandi S, Lynn A: HMM-ModE \u2013 Improved classification using profile hidden Markov models by optimising the discrimination threshold and modifying emission probabilities with negative training sequences. BMC Bioinformatics 2007, 8: 104. 
10.1186\/1471-2105-8-104","journal-title":"BMC Bioinformatics"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-13-67.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,4,24]],"date-time":"2024-04-24T02:13:14Z","timestamp":1713924794000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-13-67"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,5,1]]},"references-count":32,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2012,12]]}},"alternative-id":["5309"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-13-67","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2012,5,1]]},"assertion":[{"value":"24 October 2011","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 May 2012","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 May 2012","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"67"}}