{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,15]],"date-time":"2026-01-15T02:29:01Z","timestamp":1768444141495,"version":"3.49.0"},"reference-count":53,"publisher":"Oxford University Press (OUP)","issue":"6","license":[{"start":{"date-parts":[[2022,9,12]],"date-time":"2022-09-12T00:00:00Z","timestamp":1662940800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62003165"],"award-info":[{"award-number":["62003165"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61973155"],"award-info":[{"award-number":["61973155"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100002858","name":"China Postdoctoral Science Foundation","doi-asserted-by":"publisher","award":["2019M661817"],"award-info":[{"award-number":["2019M661817"]}],"id":[{"id":"10.13039\/501100002858","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"publisher","award":["NP2018109"],"award-info":[{"award-number":["NP2018109"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000925","name":"National Health and Medical Research Council","doi-asserted-by":"publisher","award":["APP1127948"],"award-info":[{"award-number":["APP1127948"]}],"id":[{"id":"10.13039\/501100000925","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000925","name":"National Health and Medical Research Council","doi-asserted-by":"publisher","award":["APP1144652"],"award-info":[{"award-number":["APP1144652"]}],"id":[{"id":"10.13039\/501100000925","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000060","name":"National Institute of Allergy and Infectious Diseases","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000060","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["R01 AI111965"],"award-info":[{"award-number":["R01 AI111965"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"name":"CJ Martin Early Career Research Fellowship","award":["1143366"],"award-info":[{"award-number":["1143366"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,11,19]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Short open reading frames (sORFs) refer to the small nucleic fragments no longer than 303\u00a0nt in length that probably encode small peptides. To date, translatable sORFs have been found in both untranslated regions of messenger ribonucleic acids (RNAs; mRNAs) and long non-coding RNAs (lncRNAs), playing vital roles in a myriad of biological processes. As not all sORFs are translated or essentially translatable, it is important to develop a highly accurate computational tool for characterizing the coding potential of sORFs, thereby facilitating discovery of novel functional peptides. In light of this, we designed a series of ensemble models by integrating Efficient-CapsNet and LightGBM, collectively termed csORF-finder, to differentiate the coding sORFs (csORFs) from non-coding sORFs in Homo sapiens, Mus musculus and Drosophila melanogaster, respectively. To improve the performance of csORF-finder, we introduced a novel feature encoding scheme named trinucleotide deviation from expected mean (TDE) and computed all types of in-frame sequence-based features, such as i-framed-3mer, i-framed-CKSNAP and i-framed-TDE. Benchmarking results showed that these features could significantly boost the performance compared to the original 3-mer, CKSNAP and TDE features. Our performance comparisons showed that csORF-finder achieved a superior performance than the state-of-the-art methods for csORF prediction on multi-species and non-ATG initiation independent test datasets. Furthermore, we applied csORF-finder to screen the lncRNA datasets for identifying potential csORFs. The resulting data serve as an important computational repository for further experimental validation. We hope that csORF-finder can be exploited as a powerful platform for high-throughput identification of csORFs and functional characterization of these csORFs encoded peptides.<\/jats:p>","DOI":"10.1093\/bib\/bbac392","type":"journal-article","created":{"date-parts":[[2022,9,12]],"date-time":"2022-09-12T11:55:33Z","timestamp":1662983733000},"source":"Crossref","is-referenced-by-count":24,"title":["csORF-finder: an effective ensemble learning framework for accurate identification of multi-species coding short open reading frames"],"prefix":"10.1093","volume":"23","author":[{"given":"Meng","family":"Zhang","sequence":"first","affiliation":[{"name":"Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics , Nanjing 211106, China"}]},{"given":"Jian","family":"Zhao","sequence":"additional","affiliation":[{"name":"Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics , Nanjing 211106, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1847-754X","authenticated-orcid":false,"given":"Chen","family":"Li","sequence":"additional","affiliation":[{"name":"Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University , Melbourne, VIC 3800, Australia"}]},{"given":"Fang","family":"Ge","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Nanjing University of Science and Technology , 200 Xiaolingwei, Nanjing 210094, China"}]},{"given":"Jing","family":"Wu","sequence":"additional","affiliation":[{"name":"School of Biomedical Engineering and Informatics, Nanjing Medical University , Nanjing 211166, China"}]},{"given":"Bin","family":"Jiang","sequence":"additional","affiliation":[{"name":"College of Automation Engineering, Nanjing University of Aeronautics and Astronautics , Nanjing 211106, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8031-9086","authenticated-orcid":false,"given":"Jiangning","family":"Song","sequence":"additional","affiliation":[{"name":"Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University , Melbourne, VIC 3800, Australia"},{"name":"Monash Data Futures Institute, Monash University , Melbourne, VIC 3800, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7445-4302","authenticated-orcid":false,"given":"Xiaofeng","family":"Song","sequence":"additional","affiliation":[{"name":"Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics , Nanjing 211106, China"}]}],"member":"286","published-online":{"date-parts":[[2022,9,12]]},"reference":[{"key":"2022112111113586500_ref1","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1016\/j.tig.2017.12.009","article-title":"The definition of open reading frame revisited","volume":"34","author":"Sieber","year":"2018","journal-title":"Trends Genet"},{"key":"2022112111113586500_ref2","doi-asserted-by":"crossref","first-page":"193","DOI":"10.1038\/nrg3520","article-title":"Emerging evidence for functional peptides encoded by short open reading frames","volume":"15","author":"Andrews","year":"2014","journal-title":"Nat Rev Genet"},{"key":"2022112111113586500_ref3","doi-asserted-by":"crossref","first-page":"575","DOI":"10.1038\/nrm.2017.58","article-title":"Classification and function of small open reading frames","volume":"18","author":"Couso","year":"2017","journal-title":"Nat Rev Mol Cell Biol"},{"key":"2022112111113586500_ref4","doi-asserted-by":"crossref","first-page":"e03523","DOI":"10.7554\/eLife.03523","article-title":"Long non-coding RNAs as a source of new peptides","volume":"3","author":"Ruiz-Orera","year":"2014","journal-title":"Elife"},{"key":"2022112111113586500_ref5","doi-asserted-by":"crossref","first-page":"651","DOI":"10.1038\/nrm4069","article-title":"Ribosome profiling reveals the what, when, where and how of protein synthesis","volume":"16","author":"Brar","year":"2015","journal-title":"Nat Rev Mol Cell Biol"},{"key":"2022112111113586500_ref6","doi-asserted-by":"crossref","first-page":"458","DOI":"10.1038\/s41589-019-0425-0","article-title":"Accurate annotation of human protein-coding small open reading frames","volume":"16","author":"Martinez","year":"2020","journal-title":"Nat Chem Biol"},{"key":"2022112111113586500_ref7","first-page":"636","article-title":"SmProt: a database of small proteins encoded by annotated coding and non-coding RNA loci","volume":"19","author":"Hao","year":"2018","journal-title":"Brief Bioinform"},{"key":"2022112111113586500_ref8","doi-asserted-by":"crossref","first-page":"602","DOI":"10.1016\/j.gpb.2021.09.002","article-title":"SmProt: a reliable repository with comprehensive annotation of small proteins identified from ribosome profiling","volume":"19","author":"Li","year":"2021","journal-title":"Genom Proteom Bioinform"},{"key":"2022112111113586500_ref9","doi-asserted-by":"crossref","first-page":"D497","DOI":"10.1093\/nar\/gkx1130","article-title":"An update on sORFs.org: a repository of small ORFs identified by ribosome profiling","volume":"46","author":"Olexiouk","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2022112111113586500_ref10","doi-asserted-by":"crossref","first-page":"174","DOI":"10.1038\/nchembio.2249","article-title":"A human microprotein that interacts with the mRNA decapping complex","volume":"13","author":"D'Lima","year":"2017","journal-title":"Nat Chem Biol"},{"key":"2022112111113586500_ref11","doi-asserted-by":"crossref","first-page":"1853","DOI":"10.1093\/bib\/bby055","article-title":"The small peptide world in long noncoding RNAs","volume":"20","author":"Choi","year":"2019","journal-title":"Brief Bioinform"},{"key":"2022112111113586500_ref12","doi-asserted-by":"crossref","first-page":"1248636","DOI":"10.1126\/science.1248636","article-title":"Toddler: an embryonic signal that promotes cell movement via apelin receptors","volume":"343","author":"Pauli","year":"2014","journal-title":"Science"},{"key":"2022112111113586500_ref13","doi-asserted-by":"crossref","DOI":"10.3389\/fphys.2017.00230","article-title":"Lnc RNA-Six1 encodes a micropeptide to activate Six1 in Cis and is involved in cell proliferation and muscle growth","volume":"8","author":"Cai","year":"2017","journal-title":"Front Physiol"},{"key":"2022112111113586500_ref14","doi-asserted-by":"crossref","first-page":"307","DOI":"10.1016\/j.omtn.2021.06.027","article-title":"A putative long noncoding RNA-encoded micropeptide maintains cellular homeostasis in pancreatic beta cells","volume":"26","author":"Li","year":"2021","journal-title":"Mol Ther Nucleic acids"},{"key":"2022112111113586500_ref15","doi-asserted-by":"crossref","DOI":"10.7554\/eLife.53734","article-title":"A small protein encoded by a putative lncRNA regulates apoptosis and tumorigenicity in human colorectal cancer cells","volume":"9","author":"Li","year":"2020","journal-title":"Elife"},{"key":"2022112111113586500_ref16","doi-asserted-by":"crossref","DOI":"10.1038\/s41587-021-01021-3","article-title":"Unannotated proteins expand the MHC-I-restricted immunopeptidome in cancer","volume":"40","author":"Ouspenskaia","year":"2022","journal-title":"Nat Biotechnol"},{"key":"2022112111113586500_ref17","doi-asserted-by":"crossref","first-page":"108815","DOI":"10.1016\/j.celrep.2021.108815","article-title":"Most non-canonical proteins uniquely populate the proteome or immunopeptidome","volume":"34","author":"Cuevas","year":"2021","journal-title":"Cell Rep"},{"key":"2022112111113586500_ref18","doi-asserted-by":"crossref","first-page":"179","DOI":"10.1186\/s13059-015-0742-x","article-title":"Extensive identification and analysis of conserved small ORFs in animals","volume":"16","author":"Mackowiak","year":"2015","journal-title":"Genome Biol"},{"key":"2022112111113586500_ref19","doi-asserted-by":"crossref","first-page":"399","DOI":"10.1093\/bioinformatics\/btp688","article-title":"sORF finder: a program package to identify small open reading frames with high coding potential","volume":"26","author":"Hanada","year":"2010","journal-title":"Bioinformatics"},{"key":"2022112111113586500_ref20","doi-asserted-by":"crossref","DOI":"10.1186\/1471-2105-15-36","article-title":"uPEPperoni: an online tool for upstream open reading frame location and analysis of transcript conservation","volume":"15","author":"Skarshewski","year":"2014","journal-title":"BMC Bioinform"},{"key":"2022112111113586500_ref21","doi-asserted-by":"crossref","first-page":"559","DOI":"10.1186\/s12859-019-3033-9","article-title":"MiPepid: microPeptide identification tool using machine learning","volume":"20","author":"Zhu","year":"2019","journal-title":"BMC Bioinform"},{"key":"2022112111113586500_ref22","doi-asserted-by":"crossref","first-page":"272","DOI":"10.52586\/4943","article-title":"Comprehensive evaluation of protein-coding sORFs prediction based on a random sequence strategy","volume":"26","author":"Yu","year":"2021","journal-title":"Front Biosci-Landmark"},{"key":"2022112111113586500_ref23","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pgen.0020029","article-title":"Distinguishing protein-coding from non-coding RNAs through support vector machines","volume":"2","author":"Liu","year":"2006","journal-title":"PLoS Genet"},{"key":"2022112111113586500_ref24","doi-asserted-by":"crossref","first-page":"W345","DOI":"10.1093\/nar\/gkm391","article-title":"CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine","volume":"35","author":"Kong","year":"2007","journal-title":"Nucleic Acids Res"},{"key":"2022112111113586500_ref25","doi-asserted-by":"crossref","first-page":"e166","DOI":"10.1093\/nar\/gkt646","article-title":"Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts","volume":"41","author":"Sun","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2022112111113586500_ref26","doi-asserted-by":"crossref","first-page":"W12","DOI":"10.1093\/nar\/gkx428","article-title":"CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features","volume":"45","author":"Kang","year":"2017","journal-title":"Nucleic Acids Res"},{"key":"2022112111113586500_ref27","doi-asserted-by":"crossref","first-page":"W516","DOI":"10.1093\/nar\/gkz400","article-title":"CNIT: a fast and accurate web tool for identifying protein-coding and long non-coding transcripts based on intrinsic sequence composition","volume":"47","author":"Guo","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2022112111113586500_ref28","doi-asserted-by":"crossref","first-page":"e43","DOI":"10.1093\/nar\/gkz087","article-title":"CPPred: coding potential prediction based on the global description of RNA sequence","volume":"47","author":"Tong","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2022112111113586500_ref29","doi-asserted-by":"crossref","first-page":"2073","DOI":"10.1093\/bib\/bbaa039","article-title":"DeepCPP: a deep neural network based on nucleotide bias information and minimum distribution similarity feature selection for RNA coding potential prediction","volume":"22","author":"Zhang","year":"2021","journal-title":"Brief Bioinform"},{"key":"2022112111113586500_ref30","doi-asserted-by":"crossref","first-page":"e74","DOI":"10.1093\/nar\/gkt006","article-title":"CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model","volume":"41","author":"Wang","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2022112111113586500_ref31","doi-asserted-by":"crossref","first-page":"lqz024","DOI":"10.1093\/nargab\/lqz024","article-title":"RNAsamba: neural network-based assessment of the protein-coding potential of RNA sequences","volume":"2","author":"Camargo","year":"2020","journal-title":"NAR Genom Bioinform"},{"key":"2022112111113586500_ref32","doi-asserted-by":"crossref","first-page":"14634","DOI":"10.1038\/s41598-021-93977-0","article-title":"Efficient-CapsNet: capsule network with self-attention routing","volume":"11","author":"Mazzia","year":"2021","journal-title":"Sci Rep"},{"key":"2022112111113586500_ref33","volume-title":"Proceedings of the 31st International Conference on Neural Information Processing Systems","author":"Ke","year":"2017"},{"key":"2022112111113586500_ref34","doi-asserted-by":"crossref","first-page":"e9550","DOI":"10.1371\/journal.pone.0009550","article-title":"A rapid method for characterization of protein relatedness using feature vectors","volume":"5","author":"Carr","year":"2010","journal-title":"Plos One"},{"key":"2022112111113586500_ref35","doi-asserted-by":"crossref","DOI":"10.1093\/database\/baw093","article-title":"The Ensembl gene annotation system","volume":"2016","author":"Aken","year":"2016","journal-title":"Database"},{"key":"2022112111113586500_ref36","doi-asserted-by":"crossref","first-page":"D203","DOI":"10.1093\/nar\/gkv1252","article-title":"NONCODE 2016: an informative and valuable data source of long non-coding RNAs","volume":"44","author":"Zhao","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2022112111113586500_ref37","doi-asserted-by":"crossref","first-page":"D756","DOI":"10.1093\/nar\/gkt1114","article-title":"RefSeq: an update on mammalian reference sequences","volume":"42","author":"Pruitt","year":"2014","journal-title":"Nucleic Acids Res"},{"key":"2022112111113586500_ref38","doi-asserted-by":"crossref","DOI":"10.1155\/2020\/8858489","article-title":"Succinylation site prediction based on protein sequences using the IFS-LightGBM (BO) model","volume":"2020","author":"Zhang","year":"2020","journal-title":"Comput Math Methods Med"},{"key":"2022112111113586500_ref39","first-page":"2960","article-title":"Practical Bayesian optimization of machine learning algorithms","volume":"25","author":"Jasper","year":"2012","journal-title":"Adv Neural Inf Process Syst (NIPS)"},{"key":"2022112111113586500_ref40","doi-asserted-by":"crossref","first-page":"1658","DOI":"10.1093\/bioinformatics\/btl158","article-title":"Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences","volume":"22","author":"Li","year":"2006","journal-title":"Bioinformatics"},{"key":"2022112111113586500_ref41","doi-asserted-by":"crossref","first-page":"3645","DOI":"10.1093\/bioinformatics\/btx469","article-title":"ggseqlogo: a versatile R package for drawing sequence logos","volume":"33","author":"Wagih","year":"2017","journal-title":"Bioinformatics"},{"key":"2022112111113586500_ref42","doi-asserted-by":"crossref","first-page":"890","DOI":"10.1038\/s41559-018-0506-6","article-title":"Translation of neutrally evolving peptides provides a basis for de novo gene evolution","volume":"2","author":"Ruiz-Orera","year":"2018","journal-title":"Nat Ecol Evol"},{"key":"2022112111113586500_ref43","first-page":"388","volume-title":"Proceedings of 7th IEEE International Conference on Tools with Artificial Intelligence","author":"Liu","year":"1995"},{"key":"2022112111113586500_ref44","article-title":"Research on intrusion detection method based on Pearson correlation coefficient feature selection algorithm","volume":"1757","author":"Pengtian","year":"2021","journal-title":"J Phys Conf Ser"},{"key":"2022112111113586500_ref45","doi-asserted-by":"crossref","first-page":"142","DOI":"10.1186\/s12859-016-0990-0","article-title":"McTwo: a two-step feature selection algorithm based on maximal information coefficient","volume":"17","author":"Ge","year":"2016","journal-title":"BMC Bioinform"},{"key":"2022112111113586500_ref46","doi-asserted-by":"crossref","first-page":"2957","DOI":"10.1093\/bioinformatics\/btz016","article-title":"MULTiPly: a novel multi-layer predictor for discovering general and specific types of promoters","volume":"35","author":"Zhang","year":"2019","journal-title":"Bioinformatics"},{"key":"2022112111113586500_ref47","doi-asserted-by":"crossref","DOI":"10.1093\/bib\/bbab376","article-title":"STALLION: a stacking-based ensemble learning framework for prokaryotic lysine acetylation site prediction","volume":"23","author":"Basith","year":"2022","journal-title":"Brief Bioinform"},{"key":"2022112111113586500_ref48","article-title":"Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction","volume":"23","author":"Zhang","year":"2022","journal-title":"Brief Bioinform"},{"key":"2022112111113586500_ref49","doi-asserted-by":"crossref","first-page":"2126","DOI":"10.1093\/bib\/bbaa049","article-title":"Computational prediction and interpretation of both general and specific types of promoters in Escherichia coli by exploiting a stacked ensemble-learning framework","volume":"22","author":"Li","year":"2021","journal-title":"Brief Bioinform"},{"key":"2022112111113586500_ref50","doi-asserted-by":"crossref","first-page":"427","DOI":"10.1016\/j.ipm.2009.03.002","article-title":"A systematic analysis of performance measures for classification tasks","volume":"45","author":"Sokolova","year":"2009","journal-title":"Inf Process Manag"},{"key":"2022112111113586500_ref51","doi-asserted-by":"crossref","first-page":"40","DOI":"10.4161\/rna.3.1.2789","article-title":"Discrimination of non-protein-coding transcripts from protein-coding mRNA","volume":"3","author":"Frith","year":"2006","journal-title":"RNA Biol"},{"key":"2022112111113586500_ref52","article-title":"Positive-unlabeled learning in bioinformatics and computational biology: a brief review","volume":"23","author":"Li","year":"2022","journal-title":"Brief Bioinform"},{"key":"2022112111113586500_ref53","article-title":"Positive-unlabelled learning of glycosylation sites in the human proteome","volume":"20","author":"Li","year":"2019","journal-title":"BMC Bioinform"}],"container-title":["Briefings in Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bib\/article-pdf\/23\/6\/bbac392\/47144337\/bbac392.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bib\/article-pdf\/23\/6\/bbac392\/47144337\/bbac392.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,11,21]],"date-time":"2022-11-21T11:16:59Z","timestamp":1669029419000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bib\/article\/doi\/10.1093\/bib\/bbac392\/6696144"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,9,12]]},"references-count":53,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2022,11,19]]}},"URL":"https:\/\/doi.org\/10.1093\/bib\/bbac392","relation":{},"ISSN":["1467-5463","1477-4054"],"issn-type":[{"value":"1467-5463","type":"print"},{"value":"1477-4054","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,11]]},"published":{"date-parts":[[2022,9,12]]},"article-number":"bbac392"}}