{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,6]],"date-time":"2026-05-06T09:03:03Z","timestamp":1778058183719,"version":"3.51.4"},"reference-count":32,"publisher":"Oxford University Press (OUP)","issue":"7","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2006,4,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: Generation of alternative transcripts from the same gene is an important biological event due to their contribution in creating functional diversity in eukaryotes. In this work, we choose the task of extracting information around this complex topic using a two-step procedure involving machine learning and information extraction.<\/jats:p>\n               <jats:p>Results: In the first step, we trained a classifier that inductively learns to identify sentences about physiological transcript diversity from the MEDLINE abstracts. Using a large hand-built corpus, we compared the sentence classification performance of various text categorization methods. Support vector machines (SVMs) followed by the maximum entropy classifier outperformed other methods for the sentence classification task. The SVM with the radial basis function kernel and optimized parameters achieved F\u03b2-measure of 91% during the 4-fold cross validation and of 74% when applied to all sentences in more than 12 million abstracts of MEDLINE. In the second step, we identified eight frequently present semantic categories in the sentences and performed a limited amount of semantic role labeling. The role labeling step also achieved very high F\u03b2-measure for all eight categories.<\/jats:p>\n               <jats:p>Availability: The results of our two-step procedure are summarized in the LSAT database of alternative transcripts. 
LSAT is available at<\/jats:p>\n               <jats:p>Contact: shah@embl.de<\/jats:p>\n               <jats:p>Supplementary information: Supplementary data are available at Bioinformatics online<\/jats:p>","DOI":"10.1093\/bioinformatics\/btk044","type":"journal-article","created":{"date-parts":[[2006,1,13]],"date-time":"2006-01-13T01:54:20Z","timestamp":1137117260000},"page":"857-865","source":"Crossref","is-referenced-by-count":14,"title":["LSAT: learning about alternative transcripts in MEDLINE"],"prefix":"10.1093","volume":"22","author":[{"given":"Parantu K.","family":"Shah","sequence":"first","affiliation":[{"name":"European Molecular Biology Laboratory, Heidelberg, Germany"},{"name":"Max Delbr\u00fcck Centre for Molecular Medicine, Berlin-Buch, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Peer","family":"Bork","sequence":"additional","affiliation":[{"name":"European Molecular Biology Laboratory, Heidelberg, Germany"},{"name":"Max Delbr\u00fcck Centre for Molecular Medicine, Berlin-Buch, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2006,1,12]]},"reference":[{"key":"2023012409113586000_b1","doi-asserted-by":"crossref","first-page":"367","DOI":"10.1016\/S0092-8674(00)00128-8","article-title":"Protein diversity from alternative splicing: a challenge for bioinformatics and post-genome biology","volume":"103","author":"Black","year":"2000","journal-title":"Cell"},{"key":"2023012409113586000_b2","first-page":"123","article-title":"The potential use of SUISEKI as a protein interaction discovery tool","volume":"12","author":"Blaschke","year":"2001","journal-title":"Genome Inform. Ser. 
Workshop Genome Inform."},{"key":"2023012409113586000_b3","doi-asserted-by":"crossref","first-page":"365","DOI":"10.1093\/nar\/gkg095","article-title":"The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003","volume":"31","author":"Boeckmann","year":"2003","journal-title":"Nucleic Acids Res."},{"key":"2023012409113586000_b4","doi-asserted-by":"crossref","first-page":"1031","DOI":"10.1002\/bies.10371","article-title":"Alternative splicing and evolution","volume":"25","author":"Boue","year":"2003","journal-title":"Bioessays"},{"key":"2023012409113586000_b5","doi-asserted-by":"crossref","first-page":"37","DOI":"10.1177\/001316446002000104","article-title":"Coefficient of agreement for nominal scales","volume":"20","author":"Cohen","year":"1960","journal-title":"Educ. Psychol. Meas."},{"key":"2023012409113586000_b6","first-page":"77","article-title":"Constructing biological knowledgebases by extracting information from text sources","author":"Craven","year":"1999"},{"key":"2023012409113586000_b7","doi-asserted-by":"crossref","first-page":"604","DOI":"10.1093\/bioinformatics\/btg452","article-title":"Extracting human protein interactions from MEDLINE using a full-sentence parser","volume":"20","author":"Daraselia","year":"2004","journal-title":"Bioinformatics"},{"key":"2023012409113586000_b8","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1186\/1471-2105-4-11","article-title":"PreBIND and Textomy\u2014mining the biomedical literature for protein\u2013protein interactions using a support vector machine","volume":"4","author":"Donaldson","year":"2003","journal-title":"BMC Bioinformatics"},{"key":"2023012409113586000_b9","first-page":"148","article-title":"Inductive learning algorithms and representations for text categorization","author":"Dumais","year":"1998"},{"key":"2023012409113586000_b10","doi-asserted-by":"crossref","first-page":"2547","DOI":"10.1093\/nar\/25.13.2547","article-title":"Alternative poly(A) site selection in complex 
transcription units: means to an end?","volume":"25","author":"Edwalds-Gilbert","year":"1997","journal-title":"Nucleic Acids Res."},{"key":"2023012409113586000_b11","doi-asserted-by":"crossref","first-page":"S1","DOI":"10.1186\/1471-2105-6-S1-S1","article-title":"Overview of BioCreAtIvE: critical assessment of information extraction for biology","volume":"6","author":"Hirschman","year":"2005","journal-title":"BMC Bioinformatics"},{"key":"2023012409113586000_b12","article-title":"Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms","author":"Joachims","year":"2001"},{"key":"2023012409113586000_b13","doi-asserted-by":"crossref","first-page":"224","DOI":"10.1186\/gb-2005-6-7-224","article-title":"Text-mining and information-retrieval services for molecular biology","volume":"6","author":"Krallinger","year":"2005","journal-title":"Genome Biol."},{"key":"2023012409113586000_b14","doi-asserted-by":"crossref","first-page":"231","DOI":"10.1186\/gb-2004-5-7-231","article-title":"Analysis of alternative splicing with microarrays: successes and challenges","volume":"5","author":"Lee","year":"2004","journal-title":"Genome Biol."},{"key":"2023012409113586000_b15","doi-asserted-by":"crossref","first-page":"I241","DOI":"10.1093\/bioinformatics\/bth904","article-title":"Protein names precisely peeled off free text","volume":"20","author":"Mika","year":"2004","journal-title":"Bioinformatics"},{"key":"2023012409113586000_b16","volume-title":"Machine Learning","author":"Mitchell","year":"1997"},{"key":"2023012409113586000_b17","doi-asserted-by":"crossref","first-page":"265","DOI":"10.1016\/S0168-9525(02)02665-3","article-title":"Statistical issues with microarrays: processing and analysis","volume":"18","author":"Nadon","year":"2002","journal-title":"Trends Genet."},{"key":"2023012409113586000_b18","first-page":"61","article-title":"Using maximum entropy for text 
classification","author":"Nigam","year":"1999"},{"key":"2023012409113586000_b19","doi-asserted-by":"crossref","first-page":"1699","DOI":"10.1093\/bioinformatics\/btg207","article-title":"MedScan, a natural language processing engine for MEDLINE abstracts","volume":"19","author":"Novichkova","year":"2003","journal-title":"Bioinformatics"},{"key":"2023012409113586000_b20","article-title":"Shallow semantic parsing using support vector machines","author":"Pradhan","year":"2004"},{"key":"2023012409113586000_b21","first-page":"1273","article-title":"Representing sentence structure in hidden Markov models for information extraction","author":"Ray","year":"2001"},{"key":"2023012409113586000_b22","volume-title":"Modern Information Retrieval","author":"Ribeiro-Neto","year":"1999"},{"key":"2023012409113586000_b23","first-page":"44","article-title":"Probabilistic part-of-speech tagging using decision trees","author":"Schmid","year":"1994"},{"key":"2023012409113586000_b24","doi-asserted-by":"crossref","first-page":"e10","DOI":"10.1371\/journal.pcbi.0010010","article-title":"Extraction of transcript diversity from scientific literature","volume":"1","author":"Shah","year":"2005","journal-title":"PLoS Computat. Biol."},{"key":"2023012409113586000_b25","doi-asserted-by":"crossref","first-page":"821","DOI":"10.1089\/106652703322756104","article-title":"Mining the biomedical literature in the genomic era: an overview","volume":"10","author":"Shatkay","year":"2003","journal-title":"J. Comput. Biol."},{"key":"2023012409113586000_b26","doi-asserted-by":"crossref","first-page":"529","DOI":"10.1016\/S0306-4573(01)00045-0","article-title":"The use of bigrams to enhance text categorization","volume":"30","author":"Tan","year":"2002","journal-title":"J. Inform. Process. 
Manage."},{"key":"2023012409113586000_b27","doi-asserted-by":"crossref","first-page":"D64","DOI":"10.1093\/nar\/gkh030","article-title":"ASD: the alternative splicing database","volume":"32","author":"Thanaraj","year":"2004","journal-title":"Nucleic Acids Res."},{"key":"2023012409113586000_b28","doi-asserted-by":"crossref","first-page":"155","DOI":"10.1186\/1471-2105-5-155","article-title":"PASBio: predicate-argument structures for event extraction in molecular biology","volume":"5","author":"Wattarujeekrit","year":"2004","journal-title":"BMC Bioinformatics"},{"key":"2023012409113586000_b29","first-page":"408","article-title":"Event extraction from biomedical papers using a full parser","author":"Yakushiji","year":"2001","journal-title":"Pac. Symp. Biocomput."},{"key":"2023012409113586000_b30","doi-asserted-by":"crossref","first-page":"i331","DOI":"10.1093\/bioinformatics\/btg1046","article-title":"Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup","volume":"19","author":"Yeh","year":"2003","journal-title":"Bioinformatics"},{"key":"2023012409113586000_b31","first-page":"42","article-title":"A re-examination of text categorization methods","author":"Yiming Yang","year":"1999"},{"key":"2023012409113586000_b32","doi-asserted-by":"crossref","first-page":"1290","DOI":"10.1101\/gr.1017303","article-title":"Impact of alternative initiation, splicing, and termination on the diversity of the mRNA transcripts encoded by the mouse transcriptome","volume":"13","author":"Zavolan","year":"2003","journal-title":"Genome 
Res."}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/22\/7\/857\/48840464\/bioinformatics_22_7_857.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/22\/7\/857\/48840464\/bioinformatics_22_7_857.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,24]],"date-time":"2023-01-24T09:47:46Z","timestamp":1674553666000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/22\/7\/857\/201978"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2006,1,12]]},"references-count":32,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2006,4,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btk044","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2006,4,1]]},"published":{"date-parts":[[2006,1,12]]}}}