{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,7]],"date-time":"2025-11-07T08:48:20Z","timestamp":1762505300326},"reference-count":19,"publisher":"Springer Science and Business Media LLC","issue":"1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2007,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>Expressed sequence tags (ESTs) analyses are a fundamental tool for gene identification in organisms. Given a preliminary EST sample from a certain library, several statistical prediction problems arise. In particular, it is of interest to estimate how many new genes can be detected in a future EST sample of given size and also to determine the gene discovery rate: these estimates represent the basis for deciding whether to proceed sequencing the library and, in case of a positive decision, a guideline for selecting the size of the new sample. Such information is also useful for establishing sequencing efficiency in experimental design and for measuring the degree of redundancy of an EST library.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>In this work we propose a Bayesian nonparametric approach for tackling statistical problems related to EST surveys. In particular, we provide estimates for: a) the coverage, defined as the proportion of unique genes in the library represented in the given sample of reads; b) the number of new unique genes to be observed in a future sample; c) the discovery rate of new genes as a function of the future sample size. The Bayesian nonparametric model we adopt conveys, in a statistically rigorous way, the available information into prediction. Our proposal has appealing properties over frequentist nonparametric methods, which become unstable when prediction is required for large future samples. EST libraries, previously studied with frequentist methods, are analyzed in detail.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusion<\/jats:title>\n            <jats:p>The Bayesian nonparametric approach we undertake yields valuable tools for gene capture and prediction in EST libraries. The estimators we obtain do not feature the kind of drawbacks associated with frequentist estimators and are reliable for any size of the additional sample.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/1471-2105-8-339","type":"journal-article","created":{"date-parts":[[2007,9,14]],"date-time":"2007-09-14T18:13:36Z","timestamp":1189793616000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":20,"title":["A Bayesian nonparametric method for prediction in EST analysis"],"prefix":"10.1186","volume":"8","author":[{"given":"Antonio","family":"Lijoi","sequence":"first","affiliation":[]},{"given":"Rams\u00e9s H","family":"Mena","sequence":"additional","affiliation":[]},{"given":"Igor","family":"Pr\u00fcnster","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2007,9,14]]},"reference":[{"key":"1711_CR1","doi-asserted-by":"publisher","first-page":"1651","DOI":"10.1126\/science.2047873","volume":"252","author":"M Adams","year":"1991","unstructured":"Adams M, Kelley J, Gocayne J, Mark D, Polymeropoulos M, Xiao H, Merril C, Wu A, Olde B, Moreno R, Kerlavage A, McCombe W, Venter J: Complementary DNA Sequencing: Expressed Sequence Tags and Human Genome Project. Science. 1991, 252: 1651-1656. 10.1126\/science.2047873.","journal-title":"Science"},{"key":"1711_CR2","doi-asserted-by":"publisher","first-page":"69","DOI":"10.1101\/gr.5145806","volume":"17","author":"S Emrich","year":"2007","unstructured":"Emrich S, Barbazuk W, Li L, Schnable P: Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Res. 2007, 17: 69-73. 10.1101\/gr.5145806.","journal-title":"Genome Res"},{"key":"1711_CR3","doi-asserted-by":"publisher","first-page":"237","DOI":"10.1093\/biomet\/40.3-4.237","volume":"40","author":"IJ Good","year":"1953","unstructured":"Good IJ: The population frequencies of species and the estimation of population parameters. Biometrika. 1953, 40: 237-264.","journal-title":"Biometrika"},{"key":"1711_CR4","doi-asserted-by":"publisher","first-page":"45","DOI":"10.1093\/biomet\/43.1-2.45","volume":"43","author":"IJ Good","year":"1956","unstructured":"Good IJ, Toulmin GH: The number of new species, and the increase in population coverage, when a sample is increased. Biometrika. 1956, 43: 45-63.","journal-title":"Biometrika"},{"key":"1711_CR5","doi-asserted-by":"publisher","first-page":"1108","DOI":"10.1198\/016214504000001709","volume":"99","author":"CX Mao","year":"2004","unstructured":"Mao CX: Prediction of the conditional probability of discovering a new class. J Amer Statist Assoc. 2004, 99: 1108-1118. 10.1198\/016214504000001709.","journal-title":"J Amer Statist Assoc"},{"key":"1711_CR6","doi-asserted-by":"publisher","first-page":"2279","DOI":"10.1093\/bioinformatics\/bth239","volume":"20","author":"E Susko","year":"2004","unstructured":"Susko E, Roger AJ: Estimating and comparing the rates of gene discovery and expressed sequence tag (EST) frequencies in EST surveys. Bioinformatics. 2004, 20: 2279-2287. 10.1093\/bioinformatics\/bth239.","journal-title":"Bioinformatics"},{"key":"1711_CR7","doi-asserted-by":"publisher","first-page":"300","DOI":"10.1186\/1471-2105-6-300","volume":"6","author":"JPZ Wang","year":"2005","unstructured":"Wang JPZ, Lindsay BG, Cui L, Wall PK, Marion J, Zhang J, dePamphilis CW: Gene capture prediction and overlap estimation in EST sequencing from one or multiple libraries. BMC Bioinformatics. 2005, 6: 300-10.1186\/1471-2105-6-300.","journal-title":"BMC Bioinformatics"},{"key":"1711_CR8","volume-title":"Statistica Sinica","author":"CX Mao","year":"2007","unstructured":"Mao CX: Estimating species accumulation curves and diversity indices. Statistica Sinica. 2007,"},{"key":"1711_CR9","doi-asserted-by":"publisher","first-page":"668","DOI":"10.1080\/01621459.1979.10481668","volume":"74","author":"BM Hill","year":"1979","unstructured":"Hill BM: Posterior moments of the number of species in a finite population and the posterior probability of finding a new species. J Amer Statist Assoc. 1979, 74: 668-673. 10.2307\/2286989.","journal-title":"J Amer Statist Assoc"},{"key":"1711_CR10","volume-title":"Biometrika","author":"A Lijoi","year":"2007","unstructured":"Lijoi A, Mena R, Pr\u00fcnster I: Bayesian nonparametric estimation of the probability of discovering new species. Biometrika. 2007,"},{"key":"1711_CR11","volume-title":"Combinatorial Stochastic Processes. Lecture Notes in Mathematics 1875","author":"J Pitman","year":"2006","unstructured":"Pitman J: Combinatorial Stochastic Processes. Lecture Notes in Mathematics 1875. 2006, Berlin: Springer, 10.1093\/biomet\/asm061."},{"key":"1711_CR12","doi-asserted-by":"publisher","first-page":"2973","DOI":"10.1093\/bioinformatics\/bth342","volume":"20","author":"JPZ Wang","year":"2004","unstructured":"Wang JPZ, Lindsay BG, Cui L, Wall PK, Miller WC, dePamphilis CW: EST clustering error evaluation and correction. Bioinformatics. 2004, 20: 2973-2984. 10.1093\/bioinformatics\/bth342.","journal-title":"Bioinformatics"},{"key":"1711_CR13","doi-asserted-by":"publisher","DOI":"10.1002\/9780470316870","volume-title":"Bayesian theory","author":"JM Bernardo","year":"1994","unstructured":"Bernardo JM, Smith AFM: Bayesian theory. 1994, Chichester: Wiley"},{"key":"1711_CR14","doi-asserted-by":"publisher","first-page":"145","DOI":"10.1007\/BF01213386","volume":"102","author":"J Pitman","year":"1995","unstructured":"Pitman J: Exchangeable and partially exchangeable random partitions. Probab Theory Related Fields. 1995, 102: 145-158. 10.1007\/BF01213386.","journal-title":"Probab Theory Related Fields"},{"key":"1711_CR15","doi-asserted-by":"publisher","first-page":"87","DOI":"10.1016\/0040-5809(72)90035-4","volume":"3","author":"WJ Ewens","year":"1972","unstructured":"Ewens WJ: The sampling theory of selectively neutral alleles. Theor Popul Biol. 1972, 3: 87-112. 10.1016\/0040-5809(72)90035-4.","journal-title":"Theor Popul Biol"},{"key":"1711_CR16","doi-asserted-by":"publisher","first-page":"249","DOI":"10.1111\/j.1751-5823.2001.tb00458.x","volume":"69","author":"M Gyllenberg","year":"2001","unstructured":"Gyllenberg M, Koski T: Probabilistic models for bacterial taxonomy. Int Statist Review. 2001, 69: 249-276. 10.1111\/j.1751-5823.2001.tb00458.x.","journal-title":"Int Statist Review"},{"key":"1711_CR17","doi-asserted-by":"publisher","first-page":"1988","DOI":"10.1093\/bioinformatics\/btl284","volume":"22","author":"S Zhaohui","year":"2006","unstructured":"Zhaohui S: Clustering microarray gene expression data using weighted Chinese restaurant process. Bioinformatics. 2006, 22: 1988-1997. 10.1093\/bioinformatics\/btl284.","journal-title":"Bioinformatics"},{"key":"1711_CR18","doi-asserted-by":"publisher","first-page":"161","DOI":"10.1198\/016214501750332758","volume":"96","author":"H Ishwaran","year":"2001","unstructured":"Ishwaran H, James LF: Gibbs sampling methods for stick-breaking priors. J Amer Statist Assoc. 2001, 96: 161-173. 10.1198\/016214501750332758.","journal-title":"J Amer Statist Assoc"},{"key":"1711_CR19","volume-title":"Proceedings of the Annual Meeting of the Association for Computational Linguistics","author":"YW Teh","year":"2006","unstructured":"Teh YW: A hierarchical Bayesian language model based on Pitman-Yor processes. Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2006, 44:"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-8-339.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T01:44:23Z","timestamp":1630460663000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-8-339"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2007,9,14]]},"references-count":19,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2007,12]]}},"alternative-id":["1711"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-8-339","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2007,9,14]]},"assertion":[{"value":"27 February 2007","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 September 2007","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 September 2007","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"339"}}