{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T23:45:10Z","timestamp":1773272710701,"version":"3.50.1"},"reference-count":27,"publisher":"Oxford University Press (OUP)","issue":"4","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2007,2,15]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Motivation: Prediction of the coding potential for stretches of DNA is crucial in gene calling and genome annotation, where it is used to identify potential exons and to position their boundaries in conjunction with functional sites, such as splice sites and translation initiation sites. The ability to discriminate between coding and non-coding sequences relates to the structure of coding sequences, which are organized in codons, and by their biased usage. For statistical reasons, the longer the sequences, the easier it is to detect this codon bias. However, in many eukaryotic genomes, where genes harbour many introns, both introns and exons might be small and hard to distinguish based on coding potential.<\/jats:p><jats:p>Results: Here, we present novel approaches that specifically aim at a better detection of coding potential in short sequences. The methods use complementary sequence features, combined with identification of which features are relevant in discriminating between coding and non-coding sequences. These newly developed methods are evaluated on different species, representative of four major eukaryotic kingdoms, and extensively compared to state-of-the-art Markov models, which are often used for predicting coding potential. 
The main conclusions drawn from our analyses are that (1) combining complementary sequence features clearly outperforms current Markov models for coding potential prediction in short sequence fragments, (2) coding potential prediction benefits from length-specific models, and these models are not necessarily the same for different sequence lengths and (3) comparing the results across several species indicates that, although our combined method consistently performs extremely well, there are important differences across genomes.<\/jats:p><jats:p>Supplementary data: \u00a0<\/jats:p><jats:p>Contact: \u00a0yvan.saeys@psb.ugent.be<\/jats:p>","DOI":"10.1093\/bioinformatics\/btl639","type":"journal-article","created":{"date-parts":[[2007,1,5]],"date-time":"2007-01-05T01:49:58Z","timestamp":1167961798000},"page":"414-420","source":"Crossref","is-referenced-by-count":33,"title":["In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists"],"prefix":"10.1093","volume":"23","author":[{"given":"Yvan","family":"Saeys","sequence":"first","affiliation":[]},{"given":"Pierre","family":"Rouz\u00e9","sequence":"additional","affiliation":[{"name":"Laboratoire Associ\u00e9 de l'INRA (France), Ghent University, Technologiepark 927, B-9052 Ghent, Belgium"}]},{"given":"Yves","family":"Van de Peer","sequence":"additional","affiliation":[]}],"member":"286","published-online":{"date-parts":[[2007,1,4]]},"reference":[{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"2242","DOI":"10.1126\/science.1103388","article-title":"Global identification of human transcribed sequences with genome tiling arrays","volume":"306","author":"Bertone","year":"2006","journal-title":"Science"},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"123","DOI":"10.1016\/0097-8485(93)85004-V","article-title":"Genmark: parallel gene recognition for both DNA 
strands","volume":"17","author":"Borodovsky","year":"1993","journal-title":"Comput. Chem."},{"key":"2023041109265291300_","first-page":"144","article-title":"A training algorithm for optimal margin classifiers","author":"Boser","year":"1992"},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"264","DOI":"10.1016\/j.sbi.2004.05.007","article-title":"Recent advances in gene structure prediction","volume":"14","author":"Brent","year":"2004","journal-title":"Curr. Opin. Struct. Biol."},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"6441","DOI":"10.1093\/nar\/20.24.6441","article-title":"Assessment of protein coding measures","volume":"20","author":"Fickett","year":"1992","journal-title":"Nucleic Acids Res."},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"673","DOI":"10.1093\/bioinformatics\/btg467","article-title":"Comparison of various algorithms for recognizing short coding sequences of human genes","volume":"20","author":"Gao","year":"2004","journal-title":"Bioinformatics"},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"931","DOI":"10.1038\/nature03001","article-title":"Finishing the euchromatic sequence of the human genome","volume":"431","author":"International Human Genome Sequencing Consortium","year":"2004","journal-title":"Nature"},{"key":"2023041109265291300_","first-page":"169","article-title":"Making large-scale support vector machine learning practical","author":"Joachims","year":"1998","journal-title":"Advances in Kernel Methods: Support Vector Machines"},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"273","DOI":"10.1016\/S0004-3702(97)00043-X","article-title":"Wrappers for feature subset selection","volume":"97","author":"Kohavi","year":"1997","journal-title":"Artif. 
Intell."},{"key":"2023041109265291300_","first-page":"284","article-title":"Toward optimal feature selection","author":"Koller","year":"1996"},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"1930","DOI":"10.1101\/gr.1261703","article-title":"Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions","volume":"13","author":"Kotlar","year":"2003","journal-title":"Genome Res."},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"3601","DOI":"10.1093\/nar\/gkg527","article-title":"GlimmerM, Exonomy and Unveil: three ab initio eukaryotic genefinders","volume":"31","author":"Majoros","year":"2003","journal-title":"Nucleic Acids Res."},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"2878","DOI":"10.1093\/bioinformatics\/bth315","article-title":"TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders","volume":"20","author":"Majoros","year":"2004","journal-title":"Bioinformatics"},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"4103","DOI":"10.1093\/nar\/gkf543","article-title":"Current methods of gene prediction, their strengths and weaknesses","volume":"30","author":"Math\u00e9","year":"2002","journal-title":"Nucleic Acids Res."},{"key":"2023041109265291300_","article-title":"On discriminative vs. 
generative classifiers: a comparison of logistic regression and naive bayes","volume-title":"Advances in Neural Information Processing Systems 14.","author":"Ng","year":"2002"},{"key":"2023041109265291300_","first-page":"43","article-title":"Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions","author":"Provost","year":"1997"},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"252","DOI":"10.1109\/34.75512","article-title":"Small sample size effects in statistical pattern recognition: recommendations for practitioners","volume":"13","author":"Raudys","year":"1991","journal-title":"IEEE Trans. PAMI"},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"64","DOI":"10.1186\/1471-2105-5-64","article-title":"Feature selection for splice site prediction: a new method using EDA-based feature ranking","volume":"21","author":"Saeys","year":"2004","journal-title":"BMC Bioinformatics"},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"544","DOI":"10.1093\/nar\/26.2.544","article-title":"Microbial gene identification using interpolated Markov models","volume":"26","author":"Salzberg","year":"1998","journal-title":"Nucleic Acids Res."},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"24","DOI":"10.1006\/geno.1999.5854","article-title":"Interpolated Markov models for eukaryotic gene finding","volume":"59","author":"Salzberg","year":"1999","journal-title":"Genomics"},{"key":"2023041109265291300_","first-page":"111","article-title":"EuG\u00e8ne: an Eukaryotic Gene Finder that combines several sources of evidence","volume-title":"Proceedings of the Lect. Notes Comput. Sc. 
2006","author":"Schiex","year":"2001"},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"295","DOI":"10.1016\/S0022-5193(86)80060-1","article-title":"A measure of DNA periodicity","volume":"118","author":"Silverman","year":"1986","journal-title":"J. Theor. Biol."},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"62","DOI":"10.1186\/1471-2105-7-62","article-title":"Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources","volume":"7","author":"Stanke","year":"2006","journal-title":"BMC Bioinformatics"},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"4453","DOI":"10.1073\/pnas.0408203102","article-title":"Identification of transcribed sequences in Arabidopsis thaliana by using high-resolution genome tiling arrays","volume":"102","author":"Stolc","year":"2005","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023041109265291300_","first-page":"263","article-title":"Prediction of probable genes by Fourier analysis of genomic sequences","volume":"13","author":"Tiwari","year":"1997","journal-title":"Comput. Appl. Biosci."},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"3805","DOI":"10.1103\/PhysRevLett.68.3805","article-title":"Evolution of long-range fractal correlations and 1\/f noise in DNA base sequences","volume":"68","author":"Voss","year":"1992","journal-title":"Phys. Rev. 
Lett."},{"key":"2023041109265291300_","doi-asserted-by":"crossref","first-page":"207","DOI":"10.1002\/bip.10054","article-title":"Recognizing shorter coding regions of human genes based on the statistics of stop codons","volume":"63","author":"Wang","year":"2002","journal-title":"Biopolymers"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/23\/4\/414\/49829724\/bioinformatics_23_4_414.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/23\/4\/414\/49829724\/bioinformatics_23_4_414.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,10]],"date-time":"2023-05-10T08:18:04Z","timestamp":1683706684000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/23\/4\/414\/182361"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2007,1,4]]},"references-count":27,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2007,2,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btl639","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2007,2,15]]},"published":{"date-parts":[[2007,1,4]]}}}