{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,3,14]],"date-time":"2024-03-14T17:40:25Z","timestamp":1710438025617},"reference-count":39,"publisher":"Springer Science and Business Media LLC","issue":"S6","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2009,6]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Background<\/jats:title><jats:p>The identification of protein coding elements in sets of mammalian conserved elements is one of the major challenges in the current molecular biology research. Many features have been proposed for automatically distinguishing coding and non coding conserved sequences, making so necessary a systematic statistical assessment of their differences. A comprehensive study should be composed of an association study, i.e. a comparison of the distributions of the features in the two classes, and a prediction study in which the prediction accuracies of classifiers trained on single and groups of features are analyzed, conditionally to the compared species and to the sequence lengths.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>In this paper we compared distributions of a set of comparative and non comparative features and evaluated the prediction accuracy of classifiers trained for discriminating sequence elements conserved among human, mouse and rat species. The association study showed that the analyzed features are statistically different in the two classes. In order to study the influence of the sequence lengths on the feature performances, a predictive study was performed on different data sets composed of coding and non coding alignments in equal number and equally long with an ascending average length. 
We found that the most discriminant feature was a comparative measure indicating the proportion of synonymous nucleotide substitutions per synonymous sites. Moreover, linear discriminant classifiers trained by using comparative features in general outperformed classifiers based on intrinsic ones. Finally, the prediction accuracy of classifiers trained on comparative features increased significantly by adding intrinsic features to the set of input variables, independently on sequence length (Kolmogorov-Smirnov P-value \u2264 0.05).<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusion<\/jats:title><jats:p>We observed distinct and consistent patterns for individual and combined use of comparative and intrinsic classifiers, both with respect to different lengths of sequences\/alignments and with respect to error rates in the classification of coding and non-coding elements. In particular, we noted that comparative features tend to be more accurate in the classification of coding sequences \u2013 this is likely related to the fact that such features capture deviations from strictly neutral evolution expected as a consequence of the characteristics of the genetic code.<\/jats:p><\/jats:sec>","DOI":"10.1186\/1471-2105-10-s6-s2","type":"journal-article","created":{"date-parts":[[2009,6,16]],"date-time":"2009-06-16T18:15:51Z","timestamp":1245176151000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Statistical assessment of discriminative features for protein-coding and non coding cross-species conserved sequence elements"],"prefix":"10.1186","volume":"10","author":[{"given":"Teresa M","family":"Creanza","sequence":"first","affiliation":[]},{"given":"David 
S","family":"Horner","sequence":"additional","affiliation":[]},{"given":"Annarita","family":"D'Addabbo","sequence":"additional","affiliation":[]},{"given":"Rosalia","family":"Maglietta","sequence":"additional","affiliation":[]},{"given":"Flavio","family":"Mignone","sequence":"additional","affiliation":[]},{"given":"Nicola","family":"Ancona","sequence":"additional","affiliation":[]},{"given":"Graziano","family":"Pesole","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2009,6,16]]},"reference":[{"key":"3297_CR1","doi-asserted-by":"publisher","first-page":"219","DOI":"10.1038\/nature06340","volume":"450","author":"A Stark","year":"2007","unstructured":"Stark A, et al.: Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 2007, 450: 219\u2013232. 10.1038\/nature06340","journal-title":"Nature"},{"key":"3297_CR2","doi-asserted-by":"publisher","first-page":"520","DOI":"10.1038\/nature01262","volume":"420","author":"MGS Consortium","year":"2002","unstructured":"Consortium MGS: Initial sequencing and comparative analysis of the mouse genome. Nature 2002, 420: 520\u2013562. 10.1038\/nature01262","journal-title":"Nature"},{"key":"3297_CR3","doi-asserted-by":"publisher","first-page":"493","DOI":"10.1038\/nature02426","volume":"428","author":"RGSP Consortium","year":"2004","unstructured":"Consortium RGSP: Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 2004, 428: 493\u2013521. 10.1038\/nature02426","journal-title":"Nature"},{"key":"3297_CR4","doi-asserted-by":"publisher","first-page":"517","DOI":"10.1101\/gr.1984404","volume":"14","author":"S Yang","year":"2004","unstructured":"Yang S, Smit AF, Schwartz S, Chiaromonte F, Roskin KM, Haussler D, Miller W, Hardison RC: Patterns of insertions and their covariation with substitutions in the rat, mouse, and human genomes. Genome Research 2004, 14: 517\u2013527. 
10.1101\/gr.1984404","journal-title":"Genome Research"},{"key":"3297_CR5","doi-asserted-by":"publisher","first-page":"528","DOI":"10.1101\/gr.1970304","volume":"14","author":"MI Jensen-Seaman","year":"2004","unstructured":"Jensen-Seaman MI, Furey TS, Payseur BA, Lu Y, Roskin KM, Chen CF, Thomas MA, Haussler D, Jacob HJ: Comparative Recombination rates in the rat, mouse and human genomes. Genome Research 2004, 14: 528\u2013538. 10.1101\/gr.1970304","journal-title":"Genome Research"},{"key":"3297_CR6","doi-asserted-by":"publisher","first-page":"319","DOI":"10.1089\/1066527041410319","volume":"11","author":"M Kellis","year":"2004","unstructured":"Kellis M, Patterson N, Birren B, Berger B, Lander ES: Methods in comparative genomics: genome correspondence, gene identification and regulatory motif discovery. J Comput Biol 2004, 11: 319\u2013355. 10.1089\/1066527041410319","journal-title":"J Comput Biol"},{"key":"3297_CR7","first-page":"183","volume":"13","author":"H Noguchi","year":"2002","unstructured":"Noguchi H, Yada T, Sakaki Y: A novel index which precisely derives protein coding regions from cross-species genome alignments. Genome Informatics 2002, 13: 183\u2013191.","journal-title":"Genome Informatics"},{"key":"3297_CR8","doi-asserted-by":"publisher","first-page":"8","DOI":"10.1186\/1471-2105-2-8","volume":"2","author":"E Rivas","year":"2001","unstructured":"Rivas E, Eddy S: Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics 2001, 2: 8. 10.1186\/1471-2105-2-8","journal-title":"BMC Bioinformatics"},{"issue":"15","key":"3297_CR9","doi-asserted-by":"publisher","first-page":"4639","DOI":"10.1093\/nar\/gkg483","volume":"31","author":"F Mignone","year":"2003","unstructured":"Mignone F, Grillo G, Liuni S, Pesole G: Computational identification of protein coding potential of conserved sequence tags through cross-species evolutionary analysis. Nucleic Acids Res 2003, 31(15):4639\u20134645. 
10.1093\/nar\/gkg483","journal-title":"Nucleic Acids Res"},{"key":"3297_CR10","doi-asserted-by":"publisher","first-page":"157","DOI":"10.1016\/0378-1119(84)90116-1","volume":"30","author":"ML Bibb","year":"1984","unstructured":"Bibb ML, Findlay PR, Johnson MW: The relationship between base composition and codon usage in bacterial genes and its use for the simple and reliable identification of protein-coding sequences. GENE 1984, 30: 157\u2013166. 10.1016\/0378-1119(84)90116-1","journal-title":"GENE"},{"key":"3297_CR11","volume-title":"Eurekah Bioscience Collection","author":"SV Buldyrev","year":"2005","unstructured":"Buldyrev SV: Power Law Correlations in DNA Sequences. Eurekah Bioscience Collection 2005."},{"issue":"17","key":"3297_CR12","doi-asserted-by":"publisher","first-page":"5303","DOI":"10.1093\/nar\/10.17.5303","volume":"10","author":"JW Fickett","year":"1982","unstructured":"Fickett JW: Recognition of protein coding regions in DNA sequences. Nucleic Acids Research 1982, 10(17):5303\u201318. 10.1093\/nar\/10.17.5303","journal-title":"Nucleic Acids Research"},{"key":"3297_CR13","doi-asserted-by":"publisher","first-page":"8","DOI":"10.1109\/79.939833","volume":"18","author":"D Anastassiou","year":"2001","unstructured":"Anastassiou D: Genomic Signal Processing. IEEE Signal Processing Magazine 2001, 18: 8\u201320. 10.1109\/79.939833","journal-title":"IEEE Signal Processing Magazine"},{"key":"3297_CR14","doi-asserted-by":"publisher","first-page":"3805","DOI":"10.1103\/PhysRevLett.68.3805","volume":"68","author":"R Voss","year":"1992","unstructured":"Voss R: Evolution of long-range fractal correlations and 1\/f noise in DNA base sequences. Phys Rev Lett 1992, 68: 3805\u20133808. 
10.1103\/PhysRevLett.68.3805","journal-title":"Phys Rev Lett"},{"key":"3297_CR15","doi-asserted-by":"publisher","first-page":"6441","DOI":"10.1093\/nar\/20.24.6441","volume":"20","author":"JW Fickett","year":"1992","unstructured":"Fickett JW, Tung CS: Assessment of protein coding measures. Nucleic Acids Research 1992, 20: 6441\u20136450. 10.1093\/nar\/20.24.6441","journal-title":"Nucleic Acids Research"},{"key":"3297_CR16","doi-asserted-by":"publisher","first-page":"673","DOI":"10.1093\/bioinformatics\/btg467","volume":"20","author":"F Gao","year":"2004","unstructured":"Gao F, Zhang CT: Comparison of various algorithms for recognizing short coding sequences of human genes. Bioinformatics 2004, 20: 673\u2013681. 10.1093\/bioinformatics\/btg467","journal-title":"Bioinformatics"},{"key":"3297_CR17","doi-asserted-by":"publisher","first-page":"198","DOI":"10.1101\/gr.200901","volume":"12","author":"A Nekrutenko","year":"2002","unstructured":"Nekrutenko A, Makova K, Li WH: The KA\/KS ratio test for assessing the protein-coding capacity of genomic regions: An empirical and simulation study. Genome Research 2002, 12: 198\u2013202. 10.1101\/gr.200901","journal-title":"Genome Research"},{"key":"3297_CR18","doi-asserted-by":"publisher","first-page":"W624","DOI":"10.1093\/nar\/gkh486","volume":"32","author":"T Castrignan\u00f2","year":"2004","unstructured":"Castrignan\u00f2 T, Canali A, Grillo G, Liuni S, Mignone F, Pesole G: CSTminer: a web tool for the identification of coding and noncoding conserved sequence tags through cross-species genome comparison. Nucleic Acids Research 2004, 32: W624-W627. 10.1093\/nar\/gkh486","journal-title":"Nucleic Acids Research"},{"key":"3297_CR19","doi-asserted-by":"publisher","first-page":"512","DOI":"10.1093\/oxfordjournals.molbev.a026133","volume":"16","author":"JH Badger","year":"1999","unstructured":"Badger JH, Olsen GJ: CRITICA: Coding region identification tool invoking comparative analysis. 
Mol Biol Evol 1999, 16: 512\u2013524.","journal-title":"Mol Biol Evol"},{"issue":"4","key":"3297_CR20","doi-asserted-by":"publisher","first-page":"e29","DOI":"10.1371\/journal.pgen.0020029","volume":"2","author":"J Liu","year":"2006","unstructured":"Liu J, Gough J, Rost B: Distinguishing protein-coding from non-coding RNAs through support vector machines. PLoS Genet 2006, 2(4):e29. 10.1371\/journal.pgen.0020029","journal-title":"PLoS Genet"},{"key":"3297_CR21","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4757-2440-0","volume-title":"The Nature of Statistical Learning Theory","author":"V Vapnik","year":"1995","unstructured":"Vapnik V: The Nature of Statistical Learning Theory. New York: Springer Verlag; 1995."},{"key":"3297_CR22","volume-title":"Nonparametric statistical methods","author":"M Hollander","year":"1999","unstructured":"Hollander M, Wolfe DA: Nonparametric statistical methods. 2nd revised edition. New York: Wiley Series in Probability and Statistics; 1999.","edition":"2nd revised"},{"key":"3297_CR23","doi-asserted-by":"publisher","first-page":"119","DOI":"10.1089\/106652703321825928","volume":"10","author":"S Mukherjee","year":"2003","unstructured":"Mukherjee S, Tamayo P, Rogers S, Rifkin R, Engle A, Campbell C, Golub TR, Mesirov JP: Estimating dataset size requirements for classifying DNA microarray data. J Comput Biol 2003, 10: 119\u2013142. 10.1089\/106652703321825928","journal-title":"J Comput Biol"},{"key":"3297_CR24","doi-asserted-by":"publisher","first-page":"488","DOI":"10.1016\/S0140-6736(05)17866-0","volume":"365","author":"S Michiels","year":"2005","unstructured":"Michiels S, Koscielny S, Hill C: Predictor of cancer outcome with microarrays: a multiple random validation strategy. Lancet 2005, 365: 488\u2013492. 
10.1016\/S0140-6736(05)17866-0","journal-title":"Lancet"},{"key":"3297_CR25","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4757-2346-5","volume-title":"Permutation tests: a practical guide to resampling methods for testing hypotheses","author":"P Good","year":"1994","unstructured":"Good P: Permutation tests: a practical guide to resampling methods for testing hypotheses. New York: Springer-Verlag; 1994."},{"key":"3297_CR26","volume-title":"An introduction to multivariate statistical analysis","author":"TW Anderson","year":"1958","unstructured":"Anderson TW: An introduction to multivariate statistical analysis. New York: John Wiley; 1958."},{"issue":"4","key":"3297_CR27","doi-asserted-by":"publisher","first-page":"656","DOI":"10.1101\/gr.229202","volume":"12","author":"W Kent","year":"2002","unstructured":"Kent W: BLAT-the BLAST-like alignment tool. Genome Res 2002, 12(4):656\u201364.","journal-title":"Genome Res"},{"issue":"17","key":"3297_CR28","doi-asserted-by":"publisher","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","volume":"25","author":"S Altschul","year":"1997","unstructured":"Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997, 25(17):3389\u20133402. 10.1093\/nar\/25.17.3389","journal-title":"Nucleic Acids Research"},{"issue":"5","key":"3297_CR29","first-page":"418","volume":"3","author":"M Nei","year":"1986","unstructured":"Nei M, Gojobori T: Simple Methods for Estimating the Numbers of Synonymous and Nonsynonymous Nucleotide Substitutions. Mol Biol Evol 1986, 3(5):418\u2013426.","journal-title":"Mol Biol Evol"},{"key":"3297_CR30","doi-asserted-by":"crossref","DOI":"10.1093\/oso\/9780195135848.001.0001","volume-title":"Molecular Evolution and Phylogenetics","author":"M Nei","year":"2000","unstructured":"Nei M, Kumar S: Synonymous and nonsynonymous nucleotide substitutions. 
Molecular Evolution and Phylogenetics 2000."},{"key":"3297_CR31","doi-asserted-by":"publisher","first-page":"21","DOI":"10.1016\/B978-1-4832-3211-9.50009-7","volume-title":"Mammalian protein metabolism III","author":"TH Jukes","year":"1969","unstructured":"Jukes TH, Cantor CR: Evolution of protein molecules. In Mammalian protein metabolism III. Edited by: Munro HN. New York: Academic Press; 1969:21\u2013132."},{"key":"3297_CR32","doi-asserted-by":"publisher","first-page":"10915","DOI":"10.1073\/pnas.89.22.10915","volume":"89","author":"S Henikoff","year":"1992","unstructured":"Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992, 89: 10915\u201310919. 10.1073\/pnas.89.22.10915","journal-title":"Proc Natl Acad Sci USA"},{"key":"3297_CR33","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511802843","volume-title":"Bootstrap methods and Their Application","author":"AC Davison","year":"1997","unstructured":"Davison AC, Hinkley DV: Bootstrap methods and Their Application. Cambridge University Press; 1997."},{"key":"3297_CR34","volume-title":"Statistical Methods in Bioinformatics","author":"WJ Ewens","year":"2004","unstructured":"Ewens WJ, Grant GR: Statistical Methods in Bioinformatics. Second Revised edition. New York: Springer-Verlag; 2004.","edition":"Second Revised"},{"issue":"6","key":"3297_CR35","doi-asserted-by":"publisher","first-page":"493","DOI":"10.1007\/BF02102651","volume":"32","author":"B Aissani","year":"1991","unstructured":"Aissani B, et al.: The compositional properties of human genes. J Mol Evol 1991, 32(6):493\u2013503. 10.1007\/BF02102651","journal-title":"J Mol Evol"},{"key":"3297_CR36","doi-asserted-by":"crossref","unstructured":"Lin MF, Deoras AN, Rasmussen MD, Kellis M: Performance and Scalability of Discriminative Metrics for Comparative Gene Identification in 12 Drosophila Genomes. 
PLoS Computational Biology 2008, 4(4).","DOI":"10.1371\/journal.pcbi.1000067"},{"key":"3297_CR37","doi-asserted-by":"publisher","first-page":"367","DOI":"10.1007\/978-1-59745-514-5_23","volume":"395","author":"A Ganley","year":"2007","unstructured":"Ganley A, Kobayashi T: Phylogenetic footprinting to find functional DNA elements. Methods Mol Biol 2007, 395: 367\u201380.","journal-title":"Methods Mol Biol"},{"issue":"8","key":"3297_CR38","doi-asserted-by":"publisher","first-page":"1034","DOI":"10.1101\/gr.3715005","volume":"15","author":"A Siepel","year":"2005","unstructured":"Siepel A, et al.: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005, 15(8):1034\u201350. 10.1101\/gr.3715005","journal-title":"Genome Res"},{"issue":"4","key":"3297_CR39","doi-asserted-by":"publisher","first-page":"497","DOI":"10.1093\/bioinformatics\/bti754","volume":"22","author":"T Castrignan\u00f2","year":"2006","unstructured":"Castrignan\u00f2 T, Meo PDD, Grillo G, Liuni S, Mignone F, Talamo I, Pesole G: GenoMiner: a tool for genome-wide search of coding and non-coding conserved sequence tags. Bioinformatics 2006, 22(4):497\u2013499. 
10.1093\/bioinformatics\/bti754","journal-title":"Bioinformatics"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-10-S6-S2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,3,14]],"date-time":"2024-03-14T17:10:43Z","timestamp":1710436243000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-10-S6-S2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,6]]},"references-count":39,"journal-issue":{"issue":"S6","published-print":{"date-parts":[[2009,6]]}},"alternative-id":["3297"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-10-s6-s2","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2009,6]]},"assertion":[{"value":"16 June 2009","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"S2"}}