{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,23]],"date-time":"2025-10-23T16:33:44Z","timestamp":1761237224953},"reference-count":21,"publisher":"Springer Science and Business Media LLC","issue":"S1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2005,5]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>In task 1A of the BioCreAtIvE evaluation, systems had to be devised that recognize words and phrases forming gene or protein names in natural language sentences. We approach this problem by building a word classification system based on a sliding window approach with a Support Vector Machine, combined with a pattern-based post-processing for the recognition of phrases. The performance of such a system crucially depends on the type of features chosen for consideration by the classification method, such as pre- or postfixes, character n-grams, patterns of capitalization, or classification of preceding or following words. We present a systematic approach to evaluate the performance of different feature sets based on recursive feature elimination, RFE. Based on a systematic reduction of the number of features used by the system, we can quantify the impact of different feature sets on the results of the word classification problem. This helps us to identify descriptive features, to learn about the structure of the problem, and to design systems that are faster and easier to understand. We observe that the SVM is robust to redundant features. RFE improves the performance by 0.7%, compared to using the complete set of attributes. Moreover, a performance that is only 2.3% below this maximum can be obtained using fewer than 5% of the features.<\/jats:p>","DOI":"10.1186\/1471-2105-6-s1-s9","type":"journal-article","created":{"date-parts":[[2005,5,25]],"date-time":"2005-05-25T06:13:24Z","timestamp":1117001604000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":16,"title":["Systematic feature evaluation for gene name recognition"],"prefix":"10.1186","volume":"6","author":[{"given":"J\u00f6rg","family":"Hakenberg","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Steffen","family":"Bickel","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Conrad","family":"Plake","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ulf","family":"Brefeld","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hagen","family":"Zahn","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lukas","family":"Faulstich","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ulf","family":"Leser","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tobias","family":"Scheffer","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2005,5,24]]},"reference":[{"key":"644_CR1","unstructured":"BioCreAtIvE Challenge Cup2003. [http:\/\/www.pdg.cnb.uam.es\/BioLINK\/BioCreative.eval.html]"},{"issue":"Suppl 1","key":"644_CR2","doi-asserted-by":"publisher","first-page":"S2","DOI":"10.1186\/1471-2105-6-S1-S2","volume":"6","author":"A Yeh","year":"2005","unstructured":"Yeh A, Morgan A, Colosimo M, Hirschman L: BioCreAtIvE task 1A: gene mention finding evaluation. BMC Bioinformatics 2005, 6(Suppl 1):S2. 10.1186\/1471-2105-6-S1-S2","journal-title":"BMC Bioinformatics"},{"key":"644_CR3","first-page":"1","volume-title":"Proc EFMI Workshop on Natural Language Processing in Biomedical Applications, Nicosia, Cyprus","author":"B de Bruijn","year":"2002","unstructured":"de Bruijn B, Martin J: Literature mining in molecular biology. Proc EFMI Workshop on Natural Language Processing in Biomedical Applications, Nicosia, Cyprus 2002, 1\u20135."},{"issue":"6","key":"644_CR4","doi-asserted-by":"publisher","first-page":"821","DOI":"10.1089\/106652703322756104","volume":"10","author":"H Shatkay","year":"2003","unstructured":"Shatkay H, Feldman R: Mining the Biomedical Literature in the Genomic Era: An Overview. Journal of Computational Biology 2003, 10(6):821\u2013856. 10.1089\/106652703322756104","journal-title":"Journal of Computational Biology"},{"key":"644_CR5","volume-title":"BioCreAtIvE Workshop, Granada, Spain","author":"G Zhou","year":"2004","unstructured":"Zhou G, Shen D, Zhang J, Su J, Soon TH, Tan CL: Recognition of Protein\/Gene Names from Text using an Ensemble of Classifiers and Effective Abbreviation Detection. BioCreAtIvE Workshop, Granada, Spain 2004."},{"issue":"7","key":"644_CR6","doi-asserted-by":"publisher","first-page":"1178","DOI":"10.1093\/bioinformatics\/bth060","volume":"20","author":"G Zhou","year":"2004","unstructured":"Zhou G, Zhang J, Su J, Shen D, Tan CL: Recognizing names in biomedical texts: a machine learning approach. Bioinformatics 2004, 20(7):1178\u20131190. 10.1093\/bioinformatics\/bth060","journal-title":"Bioinformatics"},{"key":"644_CR7","volume-title":"BioCreAtIvE Workshop, Granada, Spain","author":"S Kinoshita","year":"2004","unstructured":"Kinoshita S, Ogren P, Cohen KB, Hunter L: Entity identification in the molecular biology domain with a stochastic POS tagger: the BioCreative task. BioCreAtIvE Workshop, Granada, Spain 2004."},{"key":"644_CR8","volume-title":"BioCreAtIvE Workshop, Granada, Spain","author":"R McDonald","year":"2004","unstructured":"McDonald R, Pereira F: Identifying Gene and Protein Mentions in Text Using Conditional Random Fields. BioCreAtIvE Workshop, Granada, Spain 2004."},{"key":"644_CR9","volume-title":"BioCreAtIvE Workshop, Granada, Spain","author":"Y Song","year":"2004","unstructured":"Song Y, Yi E, Kim E, Lee GG: POSBIOTM-NER: A Machine Learning Approach. BioCreAtIvE Workshop, Granada, Spain 2004."},{"key":"644_CR10","volume-title":"BioCreAtIvE Workshop, Granada, Spain","author":"T Mitsumori","year":"2004","unstructured":"Mitsumori T, Fation S, Murata M, Doi K, Doi H: Gene\/protein name recognition using Support Vector Machine after dictionary matching. BioCreAtIvE Workshop, Granada, Spain 2004."},{"issue":"1\u20133","key":"644_CR11","doi-asserted-by":"publisher","first-page":"389","DOI":"10.1023\/A:1012487302797","volume":"46","author":"I Guyon","year":"2002","unstructured":"Guyon I, Weston J, Barnhill S, Vapnik VN: Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning 2002, 46(1\u20133):389\u2013422. 10.1023\/A:1012487302797","journal-title":"Machine Learning"},{"key":"644_CR12","doi-asserted-by":"publisher","first-page":"169","DOI":"10.1093\/nar\/30.1.169","volume":"30","author":"H Wain","year":"2002","unstructured":"Wain H, Lush M, Ducluzeau F, Povey S: Genew: The Human Nomenclature Database. Nuc Acids Res 2002, 30: 169. 10.1093\/nar\/30.1.169","journal-title":"Nuc Acids Res"},{"issue":"2","key":"644_CR13","doi-asserted-by":"publisher","first-page":"216","DOI":"10.1093\/bioinformatics\/btg393","volume":"20","author":"JT Chang","year":"2004","unstructured":"Chang JT, Sch\u00fctze H, Altman RB: GAPSCORE: finding gene and protein names one word at a time. Bioinformatics 2004, 20(2):216\u2013225. 10.1093\/bioinformatics\/btg393","journal-title":"Bioinformatics"},{"key":"644_CR14","volume-title":"Proceedings of the Computational Systems Bioinformatics Conference (CSB)","author":"K Seki","year":"2003","unstructured":"Seki K, Mostafa J: A Probabilistic Model for Identifying Protein Names and their Name Boundaries. Proceedings of the Computational Systems Bioinformatics Conference (CSB) 2003."},{"key":"644_CR15","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4757-2440-0","volume-title":"The Nature of Statistical Learning Theory","author":"VN Vapnik","year":"1995","unstructured":"Vapnik VN: The Nature of Statistical Learning Theory. New York: Springer-Verlag; 1995."},{"key":"644_CR16","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511801389","volume-title":"An Introduction to Support Vector Machines and other kernel-based learning methods","author":"N Cristianini","year":"2000","unstructured":"Cristianini N, Shawe-Taylor J: An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge: Cambridge University Press; 2000."},{"key":"644_CR17","volume-title":"Proceedings of ECML-98, 10th European Conference on Machine Learning, Springer","author":"T Joachims","year":"1998","unstructured":"Joachims T: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of ECML-98, 10th European Conference on Machine Learning, Springer 1998."},{"key":"644_CR18","volume-title":"BioCreAtIvE Workshop, Granada, Spain","author":"S Bickel","year":"2004","unstructured":"Bickel S, Brefeld U, Faulstich L, Hakenberg J, Leser U, Plake C, Scheffer T: A Support Vector Classifier for Gene Name Recognition. BioCreAtIvE Workshop, Granada, Spain 2004."},{"key":"644_CR19","volume-title":"Proceedings of ANLP-92, 3rd Conference on Applied Natural Language Processing, Trento, Italy","author":"E Brill","year":"1992","unstructured":"Brill E: A simple rule-based part of speech tagger. Proceedings of ANLP-92, 3rd Conference on Applied Natural Language Processing, Trento, Italy 1992."},{"key":"644_CR20","first-page":"313","volume":"19","author":"MP Marcus","year":"1993","unstructured":"Marcus MP, Santorini B, Marcinkiewicz MA: Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 1993, 19: 313\u2013330.","journal-title":"Computational Linguistics"},{"key":"644_CR21","first-page":"E-007","volume-title":"Report to the U.S. Office of Education on Cooperative Research Project","author":"WN Francis","year":"1964","unstructured":"Francis WN: A standard sample of present-day English for use with digital computers. Report to the U.S. Office of Education on Cooperative Research Project 1964, E-007."}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-6-S1-S9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T10:10:19Z","timestamp":1630491019000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-6-S1-S9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2005,5]]},"references-count":21,"journal-issue":{"issue":"S1","published-print":{"date-parts":[[2005,5]]}},"alternative-id":["644"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-6-s1-s9","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2005,5]]},"assertion":[{"value":"24 May 2005","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"S9"}}