{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,26]],"date-time":"2025-10-26T22:46:12Z","timestamp":1761518772466},"reference-count":30,"publisher":"Springer Science and Business Media LLC","issue":"1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Background<\/jats:title>\n                <jats:p>The majority of information in the biological literature resides in full text articles, instead of abstracts. Yet, abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often extracted from curated gene or protein databases. This is a limitation, because such databases have low coverage of the many name variants which are used to refer to biological entities in the literature.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>We present an approach to recognize named entities in full text. The approach collects high frequency terms in an article, and uses support vector machines (SVM) to identify biological entity names. It is also computationally efficient and robust to noise commonly found in full text material. We use the method to create a protein name dictionary from a set of 80,528 full text articles. Only 8.3% of the names in this dictionary match SwissProt description lines. We assess the quality of the dictionary by studying its protein name recognition performance in full text.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Conclusion<\/jats:title>\n                <jats:p>This dictionary term lookup method compares favourably to other published methods, supporting the significance of our direct extraction approach. The method is strong in recognizing name variants not found in SwissProt.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1186\/1471-2105-6-88","type":"journal-article","created":{"date-parts":[[2005,4,8]],"date-time":"2005-04-08T06:57:49Z","timestamp":1112943469000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["Building a protein name dictionary from full text: a machine learning term extraction approach"],"prefix":"10.1186","volume":"6","author":[{"given":"Lei","family":"Shi","sequence":"first","affiliation":[]},{"given":"Fabien","family":"Campagne","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2005,4,7]]},"reference":[{"key":"413_CR1","first-page":"773","volume":"2004","author":"W Hersh","year":"2004","unstructured":"Hersh W, Bhupatiraju RT, Corley S: Enhancing Access to the Bibliome: The TREC Genomics Track. Medinfo 2004, 2004: 773\u2013777.","journal-title":"Medinfo"},{"key":"413_CR2","doi-asserted-by":"publisher","first-page":"i331","DOI":"10.1093\/bioinformatics\/btg1046","volume":"19 Suppl 1","author":"AS Yeh","year":"2003","unstructured":"Yeh AS, Hirschman L, Morgan AA: Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup. Bioinformatics 2003, 19 Suppl 1: i331\u20139. 10.1093\/bioinformatics\/btg1046","journal-title":"Bioinformatics"},{"key":"413_CR3","doi-asserted-by":"publisher","first-page":"21","DOI":"10.1038\/88213","volume":"28","author":"TK Jenssen","year":"2001","unstructured":"Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. Nat Genet 2001, 28: 21\u201328. 10.1038\/88213","journal-title":"Nat Genet"},{"key":"413_CR4","doi-asserted-by":"publisher","first-page":"5241","DOI":"10.1073\/pnas.0307740100","volume":"101 Suppl 1","author":"DM Wilkinson","year":"2004","unstructured":"Wilkinson DM, Huberman BA: A method for finding communities of related genes. Proc Natl Acad Sci U S A 2004, 101 Suppl 1: 5241\u20135248. 10.1073\/pnas.0307740100","journal-title":"Proc Natl Acad Sci U S A"},{"key":"413_CR5","doi-asserted-by":"publisher","first-page":"557","DOI":"10.1093\/bioinformatics\/btg449","volume":"20","author":"F Horn","year":"2004","unstructured":"Horn F, Lau AL, Cohen FE: Automated extraction of mutation data from the literature: application of MuteXt to G protein-coupled receptors and nuclear hormone receptors. Bioinformatics 2004, 20: 557\u2013568. 10.1093\/bioinformatics\/btg449","journal-title":"Bioinformatics"},{"key":"413_CR6","doi-asserted-by":"publisher","first-page":"1555","DOI":"10.1210\/me.2002-0424","volume":"17","author":"S Albert","year":"2003","unstructured":"Albert S, Gaudan S, Knigge H, Raetsch A, Delgado A, Huhse B, Kirsch H, Albers M, Rebholz-Schuhmann D, Koegl M: Computer-assisted generation of a protein-interaction database for nuclear receptors. Mol Endocrinol 2003, 17: 1555\u20131567. 10.1210\/me.2002-0424","journal-title":"Mol Endocrinol"},{"key":"413_CR7","doi-asserted-by":"publisher","first-page":"43","DOI":"10.1016\/j.jbi.2003.10.001","volume":"37","author":"A Rzhetsky","year":"2004","unstructured":"Rzhetsky A, Iossifov I, Koike T, Krauthammer M, Kra P, Morris M, Yu H, Duboue PA, Weng W, Wilbur WJ, Hatzivassiloglou V, Friedman C: GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J Biomed Inform 2004, 37: 43\u201353. 10.1016\/j.jbi.2003.10.001","journal-title":"J Biomed Inform"},{"key":"413_CR8","doi-asserted-by":"publisher","first-page":"E309","DOI":"10.1371\/journal.pbio.0020309","volume":"2","author":"HM Muller","year":"2004","unstructured":"Muller HM, Kenny EE, Sternberg PW: Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature. PLoS Biol 2004, 2: E309.","journal-title":"PLoS Biol"},{"key":"413_CR9","first-page":"1","volume-title":"Research Memo CS-97-02","author":"A Cunningham","year":"1997","unstructured":"Cunningham A: Information Extraction a User Guide. In Research Memo CS-97\u201302. Sheffield, University of Sheffield; 1997:1\u201320."},{"key":"413_CR10","doi-asserted-by":"publisher","first-page":"247","DOI":"10.1016\/S1532-0464(03)00014-5","volume":"35","author":"L Hirschman","year":"2002","unstructured":"Hirschman L, Morgan AA, Yeh AS: Rutabaga by any other name: extracting biological names. J Biomed Inform 2002, 35: 247\u2013259. 10.1016\/S1532-0464(03)00014-5","journal-title":"J Biomed Inform"},{"key":"413_CR11","doi-asserted-by":"publisher","first-page":"1178","DOI":"10.1093\/bioinformatics\/bth060","volume":"20","author":"G Zhou","year":"2004","unstructured":"Zhou G, Zhang J, Su J, Shen D, Tan C: Recognizing names in biomedical texts: a machine learning approach. Bioinformatics 2004, 20: 1178\u20131190. 10.1093\/bioinformatics\/bth060","journal-title":"Bioinformatics"},{"key":"413_CR12","doi-asserted-by":"publisher","first-page":"I241","DOI":"10.1093\/bioinformatics\/bth904","volume":"20 Suppl 1","author":"S Mika","year":"2004","unstructured":"Mika S, Rost B: Protein names precisely peeled off free text. Bioinformatics 2004, 20 Suppl 1: I241-I247. 10.1093\/bioinformatics\/bth904","journal-title":"Bioinformatics"},{"key":"413_CR13","doi-asserted-by":"publisher","first-page":"245","DOI":"10.1016\/S0378-1119(00)00431-5","volume":"259","author":"M Krauthammer","year":"2000","unstructured":"Krauthammer M, Rzhetsky A, Morozov P, Friedman C: Using BLAST for identifying gene and protein names in journal articles. Gene 2000, 259: 245\u2013252. 10.1016\/S0378-1119(00)00431-5","journal-title":"Gene"},{"key":"413_CR14","doi-asserted-by":"publisher","first-page":"216","DOI":"10.1093\/bioinformatics\/btg393","volume":"20","author":"JT Chang","year":"2004","unstructured":"Chang JT, Schutze H, Altman RB: GAPSCORE: finding gene and protein names one word at a time. Bioinformatics 2004, 20: 216\u2013225. 10.1093\/bioinformatics\/btg393","journal-title":"Bioinformatics"},{"key":"413_CR15","doi-asserted-by":"publisher","first-page":"2597","DOI":"10.1093\/bioinformatics\/bth291","volume":"20","author":"MJ Schuemie","year":"2004","unstructured":"Schuemie MJ, Weeber M, Schijvenaars BJ, van Mulligen EM, van der Eijk CC, Jelier R, Mons B, Kors JA: Distribution of information in biomedical abstracts and full-text publications. Bioinformatics 2004, 20: 2597\u20132604. 10.1093\/bioinformatics\/bth291","journal-title":"Bioinformatics"},{"key":"413_CR16","doi-asserted-by":"publisher","first-page":"3206","DOI":"10.1093\/bioinformatics\/bth386","volume":"20","author":"DP Corney","year":"2004","unstructured":"Corney DP, Buxton BF, Langdon WB, Jones DT: BioRAT: extracting biological information from full-length papers. Bioinformatics 2004, 20: 3206\u20133213. 10.1093\/bioinformatics\/bth386","journal-title":"Bioinformatics"},{"key":"413_CR17","doi-asserted-by":"publisher","DOI":"10.3115\/1567594.1567618","volume-title":"Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets.: ; Geneva, Switzerland.","author":"B Settles","year":"2004","unstructured":"Settles B: Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets.: ; Geneva, Switzerland. ; 2004."},{"key":"413_CR18","volume-title":"Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data","author":"J Lafferty","year":"2001","unstructured":"Lafferty J, McCallum A, Pereira F: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. 2001."},{"key":"413_CR19","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1186\/1471-2105-6-5","volume":"6","author":"M Srdanovic","year":"2005","unstructured":"Srdanovic M, Schenk U, Schwieger M, Campagne F: Critical evaluation of the JDO API for the persistence and portability requirements of complex biological databases. BMC Bioinformatics 2005, 6: 5. 10.1186\/1471-2105-6-5","journal-title":"BMC Bioinformatics"},{"key":"413_CR20","volume-title":"MG4J: Managing Gigabytes for Java","author":"P Boldi","year":"2004","unstructured":"Boldi P, Vigna S: MG4J: Managing Gigabytes for Java.2004. [http:\/\/mg4j.dsi.unimi.it]"},{"key":"413_CR21","volume-title":"SVMLight","author":"T Joachims","year":"2004","unstructured":"Joachims T: SVMLight.2004. [http:\/\/svmlight.joachims.org\/]"},{"key":"413_CR22","doi-asserted-by":"publisher","first-page":"S97","DOI":"10.1093\/bioinformatics\/17.suppl_1.S97","volume":"17 Suppl 1","author":"V Hatzivassiloglou","year":"2001","unstructured":"Hatzivassiloglou V, Duboue PA, Rzhetsky A: Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics 2001, 17 Suppl 1: S97\u2013106.","journal-title":"Bioinformatics"},{"key":"413_CR23","first-page":"D289","volume":"33 Database Iss","author":"JD Wren","year":"2005","unstructured":"Wren JD, Chang JT, Pustejovsky J, Adar E, Garner HR, Altman RB: Biomedical term mapping databases. Nucleic Acids Res 2005, 33 Database Issue: D289\u201393.","journal-title":"Nucleic Acids Res"},{"key":"413_CR24","doi-asserted-by":"publisher","first-page":"611","DOI":"10.1142\/S0219720004000399","volume":"1","author":"L Tanabe","year":"2004","unstructured":"Tanabe L, Wilbur WJ: Generation of a large gene\/protein lexicon by morphological pattern analysis. J Bioinform Comput Biol 2004, 1: 611\u2013626. 10.1142\/S0219720004000399","journal-title":"J Bioinform Comput Biol"},{"key":"413_CR25","volume-title":"Explorations in the Document Vector Model of Information Retrieval","author":"JJ Paijmans","year":"1999","unstructured":"Paijmans JJ: Explorations in the Document Vector Model of Information Retrieval [http:\/\/pi0959.kub.nl\/Paai\/Onderw\/V-I\/Content\/evaluation.html]. , Katholieke Universiteit Brabant; 1999."},{"key":"413_CR26","volume-title":"Advances in Kernel Methods - Support Vector Learning","author":"T Joachims","year":"1999","unstructured":"Joachims T: Making large-Scale SVM Learning Practical. In Advances in Kernel Methods - Support Vector Learning. Edited by: Sch\u00f6lkopf B, Burges C and Smola A. Cambridge, MIT-Press; 1999."},{"key":"413_CR27","first-page":"422","volume-title":"Advances in Large Margin Classifiers","author":"JC Platt","year":"1999","unstructured":"Platt JC: Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers. Edited by: Smola AJ, Bartlett P, Sch\u00f6lkopf B and Schuurmans D. , MIT Press; 1999:422."},{"key":"413_CR28","first-page":"205","volume-title":"Kluwer international series in engineering and computer science","author":"T Joachims","year":"2001","unstructured":"Joachims T: Learning To Classify Text Using Support Vector Machines. In Kluwer international series in engineering and computer science. Dordrecht, Kluwer Academic Publishers; 2001:205."},{"key":"413_CR29","doi-asserted-by":"publisher","first-page":"49","DOI":"10.1016\/S1386-5056(02)00052-7","volume":"67","author":"K Franzen","year":"2002","unstructured":"Franzen K, Eriksson G, Olsson F, Asker L, Liden P, Coster J: Protein names and how to find them. Int J Med Inform 2002, 67: 49\u201361. 10.1016\/S1386-5056(02)00052-7","journal-title":"Int J Med Inform"},{"key":"413_CR30","doi-asserted-by":"publisher","first-page":"D35","DOI":"10.1093\/nar\/gkh073","volume":"32 Database iss","author":"DL Wheeler","year":"2004","unstructured":"Wheeler DL, Church DM, Edgar R, Federhen S, Helmberg W, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Suzek TO, Tatusova TA, Wagner L: Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res 2004, 32 Database issue: D35\u201340. 10.1093\/nar\/gkh073","journal-title":"Nucleic Acids Res"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-6-88.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,1]],"date-time":"2024-02-01T17:53:03Z","timestamp":1706809983000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-6-88"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2005,4,7]]},"references-count":30,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2005,12]]}},"alternative-id":["413"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-6-88","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2005,4,7]]},"assertion":[{"value":"1 February 2005","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 April 2005","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 April 2005","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"88"}}