{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,3]],"date-time":"2026-06-03T22:19:19Z","timestamp":1780525159806,"version":"3.54.1"},"reference-count":28,"publisher":"Oxford University Press (OUP)","issue":"13","license":[{"start":{"date-parts":[[2016,10,2]],"date-time":"2016-10-02T00:00:00Z","timestamp":1475366400000},"content-version":"vor","delay-in-days":1201,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/3.0"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2013,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base pair level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat\/Cufflinks pipeline. A common observation is that a sizable fraction of the fragments\/reads align to multiple locations of the genome. These multiple alignments pose substantial challenges to existing RNA-seq analysis tools. Inappropriate treatment may result in reporting spurious expressed genes (false positives) and missing the real expressed genes (false negatives). Such errors impact the subsequent analysis, such as differential expression analysis. In our study, we observe that \u223c3.5% of transcripts reported by TopHat\/Cufflinks pipeline correspond to annotated nonfunctional pseudogenes. Moreover, \u223c10.0% of reported transcripts are not annotated in the Ensembl database. These genes could be either novel expressed genes or false discoveries.<\/jats:p>\n               <jats:p>Results: We examine the underlying genomic features that lead to multiple alignments and investigate how they generate systematic errors in RNA-seq analysis. 
We develop a general tool, GeneScissors, which exploits machine learning techniques guided by biological knowledge to detect and correct spurious transcriptome inference by existing RNA-seq analysis methods. In our simulated study, GeneScissors can predict spurious transcriptome calls owing to misalignment with an accuracy close to 90%. It provides substantial improvement over the widely used TopHat\/Cufflinks or MapSplice\/Cufflinks pipelines in both precision and F-measurement. On real data, GeneScissors reports 53.6% less pseudogenes and 0.97% more expressed and annotated transcripts, when compared with the TopHat\/Cufflinks pipeline. In addition, among the 10.0% unannotated transcripts reported by TopHat\/Cufflinks, GeneScissors finds that >16.3% of them are false positives.<\/jats:p>\n               <jats:p>Availability: The software can be downloaded at http:\/\/csbio.unc.edu\/genescissors\/<\/jats:p>\n               <jats:p>Contact: \u00a0weiwang@cs.ucla.edu<\/jats:p>\n               <jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btt216","type":"journal-article","created":{"date-parts":[[2013,6,27]],"date-time":"2013-06-27T05:33:26Z","timestamp":1372311206000},"page":"i291-i299","source":"Crossref","is-referenced-by-count":12,"title":["GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads 
misalignment"],"prefix":"10.1093","volume":"29","author":[{"given":"Zhaojun","family":"Zhang","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Shunping","family":"Huang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jack","family":"Wang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xiang","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Fernando","family":"Pardo Manuel de Villena","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Leonard","family":"McMillan","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Wei","family":"Wang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"286","published-online":{"date-parts":[[2013,6,19]]},"reference":[{"key":"2023062614273991900_btt216-B1","doi-asserted-by":"crossref","first-page":"R106","DOI":"10.1186\/gb-2010-11-10-r106","article-title":"Differential expression analysis for sequence count data","volume":"11","author":"Anders","year":"2010","journal-title":"Genome Biol."},{"key":"2023062614273991900_btt216-B2","doi-asserted-by":"crossref","first-page":"4570","DOI":"10.1093\/nar\/gkq211","article-title":"Detection of splice junctions from paired-end RNA-seq data by SpliceMap","volume":"38","author":"Au","year":"2010","journal-title":"Nucleic Acids Res."},{"key":"2023062614273991900_btt216-B3","doi-asserted-by":"crossref","first-page":"123","DOI":"10.1146\/annurev.genet.37.040103.103949","article-title":"Pseudogenes: are they \u201cjunk\u201d or functional DNA?","volume":"37","author":"Balakirev","year":"2003","journal-title":"Ann. Rev. 
Genet."},{"key":"2023062614273991900_btt216-B4","doi-asserted-by":"crossref","first-page":"1691","DOI":"10.1093\/bioinformatics\/btr174","article-title":"BamTools: a C++ API and toolkit for analyzing and managing BAM files","volume":"27","author":"Barnett","year":"2011","journal-title":"Bioinformatics"},{"key":"2023062614273991900_btt216-B5","doi-asserted-by":"crossref","first-page":"S9","DOI":"10.1186\/1471-2105-13-S6-S9","article-title":"A context-based approach to identify the most likely mapping for RNA-seq experiments","volume":"13","author":"Bonfert","year":"2012","journal-title":"BMC Bioinformatics"},{"key":"2023062614273991900_btt216-B6","doi-asserted-by":"crossref","first-page":"D84","DOI":"10.1093\/nar\/gkr991","article-title":"Ensembl 2012","volume":"40","author":"Flicek","year":"2011","journal-title":"Nucleic Acids Res."},{"key":"2023062614273991900_btt216-B7","doi-asserted-by":"crossref","first-page":"644","DOI":"10.1038\/nbt.1883","article-title":"Full-length transcriptome assembly from RNA-seq data without a reference genome","volume":"29","author":"Grabherr","year":"2011","journal-title":"Nat. Biotechnol."},{"key":"2023062614273991900_btt216-B8","doi-asserted-by":"crossref","first-page":"643","DOI":"10.1126\/science.1190830","article-title":"High-resolution analysis of parent-of-origin allelic expression in the mouse brain","volume":"329","author":"Gregg","year":"2010","journal-title":"Science"},{"key":"2023062614273991900_btt216-B9","doi-asserted-by":"crossref","first-page":"503","DOI":"10.1038\/nbt.1633","article-title":"Ab initio reconstruction of cell type\u2013specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs","volume":"28","author":"Guttman","year":"2010","journal-title":"Nat. 
Biotechnol."},{"key":"2023062614273991900_btt216-B10","doi-asserted-by":"crossref","first-page":"1033","DOI":"10.1093\/nar\/gkg169","article-title":"Identification of pseudogenes in the Drosophila melanogaster genome","volume":"31","author":"Harrison","year":"2003","journal-title":"Nucleic Acids Res."},{"key":"2023062614273991900_btt216-B11","doi-asserted-by":"crossref","first-page":"1793","DOI":"10.1007\/s00018-007-7084-0","article-title":"Useful \u2018junk\u2019: Alu RNAs in the human transcriptome","volume":"64","author":"H\u00e4sler","year":"2007","journal-title":"Cell. Mol. Life Sci."},{"key":"2023062614273991900_btt216-B12","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1038\/nature01535","article-title":"An expressed pseudogene regulates the messenger-RNA stability of its homologous coding gene","volume":"423","author":"Hirotsune","year":"2003","journal-title":"Nature"},{"key":"2023062614273991900_btt216-B13","doi-asserted-by":"crossref","first-page":"e206","DOI":"10.1371\/journal.pbio.0020206","article-title":"Gene duplication: the genomic trade in spare parts","volume":"2","author":"Hurles","year":"2004","journal-title":"PLoS Biol."},{"key":"2023062614273991900_btt216-B14","doi-asserted-by":"crossref","first-page":"4775","DOI":"10.1073\/pnas.85.13.4775","article-title":"A fundamental division in the Alu family of repeated sequences","volume":"85","author":"Jurka","year":"1988","journal-title":"Proc. Natl Acad. Sci. 
USA"},{"key":"2023062614273991900_btt216-B15","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1038\/nature10413","article-title":"Mouse genomic variation and its effect on phenotypes and gene regulation","volume":"477","author":"Keane","year":"2011","journal-title":"Nature"},{"key":"2023062614273991900_btt216-B16","first-page":"D59","article-title":"HOPPSIGEN: a database of human and mouse processed pseudogenes","volume":"33","author":"Khelifi","year":"2005","journal-title":"Nucleic Acids Res."},{"key":"2023062614273991900_btt216-B17","doi-asserted-by":"crossref","first-page":"1302","DOI":"10.1126\/science.1209658","article-title":"Comment on \u2018Widespread RNA and DNA sequence differences in the human transcriptome\u2019","volume":"335","author":"Kleinman","year":"2012","journal-title":"Science"},{"key":"2023062614273991900_btt216-B18","doi-asserted-by":"crossref","first-page":"1181","DOI":"10.2140\/pjm.1960.10.1181","article-title":"An approximation theorem for the poisson binomial distribution","volume":"10","author":"Le Cam","year":"1960","journal-title":"Pacific J. Math."},{"key":"2023062614273991900_btt216-B19","doi-asserted-by":"crossref","first-page":"493","DOI":"10.1093\/bioinformatics\/btp692","article-title":"RNA-Seq gene expression estimation with read mapping uncertainty","volume":"26","author":"Li","year":"2010","journal-title":"Bioinformatics"},{"key":"2023062614273991900_btt216-B20","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1126\/science.1207018","article-title":"Widespread RNA and DNA sequence differences in the human transcriptome","volume":"333","author":"Li","year":"2011","journal-title":"Science"},{"key":"2023062614273991900_btt216-B21","doi-asserted-by":"crossref","first-page":"87","DOI":"10.1038\/nrg2934","article-title":"RNA sequencing: advances, challenges and opportunities","volume":"12","author":"Ozsolak","year":"2011","journal-title":"Nat. Rev. 
Genet."},{"key":"2023062614273991900_btt216-B22","doi-asserted-by":"crossref","first-page":"909","DOI":"10.1038\/nmeth.1517","article-title":"De novo assembly and analysis of RNA-seq data","volume":"7","author":"Robertson","year":"2010","journal-title":"Nat. Methods"},{"key":"2023062614273991900_btt216-B23","doi-asserted-by":"crossref","first-page":"1105","DOI":"10.1093\/bioinformatics\/btp120","article-title":"TopHat: discovering splice junctions with RNA-seq","volume":"25","author":"Trapnell","year":"2009","journal-title":"Bioinformatics"},{"key":"2023062614273991900_btt216-B24","doi-asserted-by":"crossref","first-page":"516","DOI":"10.1038\/nbt.1621","article-title":"Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation","volume":"28","author":"Trapnell","year":"2010","journal-title":"Nat. Biotechnol."},{"key":"2023062614273991900_btt216-B25","doi-asserted-by":"crossref","first-page":"562","DOI":"10.1038\/nprot.2012.016","article-title":"Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks","volume":"7","author":"Trapnell","year":"2012","journal-title":"Nat. Protoc."},{"key":"2023062614273991900_btt216-B26","doi-asserted-by":"crossref","first-page":"253","DOI":"10.1146\/annurev.ge.19.120185.001345","article-title":"Processed pseudogenes: characteristics and evolution","volume":"19","author":"Vanin","year":"1985","journal-title":"Ann. Rev. 
Genet."},{"key":"2023062614273991900_btt216-B27","doi-asserted-by":"crossref","first-page":"e178","DOI":"10.1093\/nar\/gkq622","article-title":"MapSplice: accurate mapping of RNA-seq reads for splice junction discovery","volume":"38","author":"Wang","year":"2010","journal-title":"Nucleic Acids Res."},{"key":"2023062614273991900_btt216-B28","doi-asserted-by":"crossref","first-page":"2541","DOI":"10.1101\/gr.1429003","article-title":"Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome","volume":"13","author":"Zhang","year":"2003","journal-title":"Genome Res."}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/29\/13\/i291\/50702746\/bioinformatics_29_13_i291.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/29\/13\/i291\/50702746\/bioinformatics_29_13_i291.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,26]],"date-time":"2023-06-26T15:27:28Z","timestamp":1687793248000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/29\/13\/i291\/190663"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2013,6,19]]},"references-count":28,"journal-issue":{"issue":"13","published-print":{"date-parts":[[2013,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btt216","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2013,7]]},"published":{"date-parts":[[2013,6,19]]}}}