{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T11:04:43Z","timestamp":1775646283764,"version":"3.50.1"},"reference-count":39,"publisher":"Springer Science and Business Media LLC","issue":"S5","license":[{"start":{"date-parts":[[2006,12,1]],"date-time":"2006-12-01T00:00:00Z","timestamp":1164931200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/2.0"},{"start":{"date-parts":[[2006,12,18]],"date-time":"2006-12-18T00:00:00Z","timestamp":1166400000000},"content-version":"vor","delay-in-days":17,"URL":"https:\/\/creativecommons.org\/licenses\/by\/2.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2006,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>Biomedical named entity recognition (Bio-NER) is a challenging problem because, in general, biomedical named entities of the same category (e.g., proteins and genes) do not follow one standard nomenclature. They have many irregularities and sometimes appear in ambiguous contexts. In recent years, machine-learning (ML) approaches have become increasingly common and now represent the cutting edge of Bio-NER technology. This paper addresses three problems faced by ML-based Bio-NER systems. First, most ML approaches usually employ singleton features that comprise one linguistic property (e.g., the current word is capitalized) and at least one class tag (e.g., <jats:italic>B-protein<\/jats:italic>, the beginning of a protein name). However, such features may be insufficient in cases where multiple properties must be considered. 
Adding conjunction features that contain multiple properties can be beneficial, but it would be infeasible to include all conjunction features in an NER model since memory resources are limited and some features are ineffective. To resolve the problem, we use a sequential forward search algorithm to select an effective set of features. Second, variations in the numerical parts of biomedical terms (e.g., \"2\" in the biomedical term IL2) cause data sparseness and generate many redundant features. In this case, we apply numerical normalization, which solves the problem by replacing all numerals in a term with one representative numeral to help classify named entities. Third, the assignment of NE tags does not depend solely on the target word's closest neighbors, but may depend on words outside the context window (e.g., a context window of five consists of the current word plus two preceding and two subsequent words). We use global patterns generated by the Smith-Waterman local alignment algorithm to identify such structures and modify the results of our ML-based tagger. This is called pattern-based post-processing.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>To develop our ML-based Bio-NER system, we employ conditional random fields, which have performed effectively in several well-known tasks, as our underlying ML model. Adding selected conjunction features, applying numerical normalization, and employing pattern-based post-processing improve the F-scores by 1.67%, 1.04%, and 0.57%, respectively. The combined increase of 3.28% yields a total score of 72.98%, which is better than the baseline system that only uses singleton features.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusion<\/jats:title>\n            <jats:p>We demonstrate the benefits of using the sequential forward search algorithm to select effective conjunction feature groups. 
In addition, we show that numerical normalization can effectively reduce the number of redundant and unseen features. Furthermore, the Smith-Waterman local alignment algorithm can help ML-based Bio-NER deal with difficult cases that need longer context windows.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/1471-2105-7-s5-s11","type":"journal-article","created":{"date-parts":[[2006,12,19]],"date-time":"2006-12-19T07:17:42Z","timestamp":1166512662000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":79,"title":["NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition"],"prefix":"10.1186","volume":"7","author":[{"given":"Richard Tzong-Han","family":"Tsai","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Cheng-Lung","family":"Sung","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hong-Jie","family":"Dai","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hsieh-Chuan","family":"Hung","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ting-Yi","family":"Sung","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wen-Lian","family":"Hsu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2006,12,18]]},"reference":[{"key":"1361_CR1","doi-asserted-by":"publisher","first-page":"409","DOI":"10.1016\/j.compbiolchem.2004.09.010","volume":"28","author":"ZZ Hu","year":"2004","unstructured":"Hu ZZ, Mani I, Hermoso V, Liu H, Wu CH: iProLINK: an integrated protein resource for literature mining. Comput Biol Chem 2004, 28: 409\u2013416. 
10.1016\/j.compbiolchem.2004.09.010","journal-title":"Comput Biol Chem"},{"key":"1361_CR2","volume-title":"Artificial Intelligence and Systems Biology. Springer","author":"KB Cohen","year":"2005","unstructured":"Cohen KB, Hunter L: Natural Language Processing and Systems Biology. In Artificial Intelligence and Systems Biology. Springer. Edited by: Dubitzky W, Azuaje F. ; 2005."},{"key":"1361_CR3","volume-title":"Message Understanding Conference","author":"N Chinchor","year":"1998","unstructured":"Chinchor N: Message Understanding Conference Proceedings. Message Understanding Conference 1998."},{"issue":"6","key":"1361_CR4","doi-asserted-by":"publisher","first-page":"821","DOI":"10.1089\/106652703322756104","volume":"10","author":"H Shatkay","year":"2003","unstructured":"Shatkay H, Feldman R: Mining the biomedical literature in the genomic era: an overview. Journal of Computational Biology 2003, 10(6):821\u2013855. 10.1089\/106652703322756104","journal-title":"Journal of Computational Biology"},{"key":"1361_CR5","volume-title":"the 40th Annual Meeting of the Association for Computational Linguistics (ACL)","author":"S Pakhomov","year":"2002","unstructured":"Pakhomov S: Semi-supervised maximum entropy based approach to acronym and abbreviation normalization in medical text. the 40th Annual Meeting of the Association for Computational Linguistics (ACL) 2002."},{"key":"1361_CR6","volume-title":"Pacific Symposium on Biocomputing '03","author":"D Hanisch","year":"2003","unstructured":"Hanisch D, Fluck J, Mevissen H, Zimmer R: Playing biology's name game: identifying protein names in scientific text. Pacific Symposium on Biocomputing '03 2003."},{"key":"1361_CR7","volume-title":"Pacific Symposium on Biocomputing '98","author":"K Fukuda","year":"1998","unstructured":"Fukuda K, Tsunoda T, Tamura A, Takagi T: Toward information extraction: identifying protein names from biological papers. 
Pacific Symposium on Biocomputing '98 1998."},{"key":"1361_CR8","volume-title":"ACL-02 Workshop on Natural Language Processing in Biomedical Applications","author":"J Kazama","year":"2002","unstructured":"Kazama J, Makino T, Ohta Y, Tsujii J: Tuning support vector machines for biomedical named entity recognition. ACL-02 Workshop on Natural Language Processing in Biomedical Applications 2002."},{"key":"1361_CR9","volume-title":"ACL-03 Workshop on Natural Language Processing in Biomedicine","author":"K-J Lee","year":"2003","unstructured":"Lee K-J, Hwang Y-S, Rim H-C: Two phase biomedical NE Recognition based on SVMs. ACL-03 Workshop on Natural Language Processing in Biomedicine 2003."},{"key":"1361_CR10","volume-title":"COLING 2004 International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA)","author":"B Settles","year":"2004","unstructured":"Settles B: Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets. COLING 2004 International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA) 2004."},{"key":"1361_CR11","volume-title":"COLING 2004 International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA)","author":"S Zhao","year":"2004","unstructured":"Zhao S: Named Entity Recognition in Biomedical Texts using an HMM Model. COLING 2004 International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA) 2004."},{"key":"1361_CR12","volume-title":"JNLPBA-04","author":"G Zhou","year":"2004","unstructured":"Zhou G, Su J: Exploring Deep Knowledge Resources in Biomedical Name Recognition. 
JNLPBA-04 2004."},{"key":"1361_CR13","volume-title":"COLING 2004 International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA)","author":"J Finkel","year":"2004","unstructured":"Finkel J, Dingare S, Nguyen H, Nissim M, Manning C, Sinclair G: Exploiting Context for Biomedical Entity Recognition: From Syntax to the Web. COLING 2004 International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA) 2004."},{"key":"1361_CR14","doi-asserted-by":"publisher","first-page":"9","DOI":"10.3115\/1610230.1610233","volume-title":"ACL-05 Workshop on Feature Engineering","author":"SF Adafre","year":"2005","unstructured":"Adafre SF, Rijke Md: Feature Engineering and Post-Processing for Temporal Expression Recognition Using Conditional Random Fields. In ACL-05 Workshop on Feature Engineering. Ann Arbor; 2005:9\u201316."},{"key":"1361_CR15","volume-title":"UAI-03","author":"A McCallum","year":"2003","unstructured":"McCallum A: Efficiently Inducing Features of Conditional Random Fields. UAI-03 2003."},{"issue":"Suppl 1","key":"1361_CR16","doi-asserted-by":"publisher","first-page":"S6","DOI":"10.1186\/1471-2105-6-S1-S6","volume":"6","author":"R McDonald","year":"2005","unstructured":"McDonald R, Pereira F: Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics 2005, 6(Suppl 1):S6. 10.1186\/1471-2105-6-S1-S6","journal-title":"BMC Bioinformatics"},{"key":"1361_CR17","volume-title":"EACL-99","author":"EFTK Sang","year":"1999","unstructured":"Sang EFTK, Veenstra J: Representing text chunks. EACL-99 1999."},{"key":"1361_CR18","doi-asserted-by":"publisher","first-page":"260","DOI":"10.1109\/TIT.1967.1054010","volume":"13","author":"AJ Viterbi","year":"1967","unstructured":"Viterbi AJ: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 1967, 13: 260\u2013269. 
10.1109\/TIT.1967.1054010","journal-title":"IEEE Transactions on Information Theory"},{"key":"1361_CR19","volume-title":"ICML' 00","author":"A McCallum","year":"2000","unstructured":"McCallum A, Freitag D, Pereira F: Maximum entropy Markov models for information extraction and segmentation. ICML' 00 2000."},{"key":"1361_CR20","volume-title":"Universit\u00e9 de Paris XI","author":"L Bottou","year":"1991","unstructured":"Bottou L: Une approche th\u00e9orique de l'apprentissage connexionniste: Applications \u00e0 la reconnaissance de la parole. Universit\u00e9 de Paris XI 1991."},{"key":"1361_CR21","volume-title":"ICML' 01","author":"J Lafferty","year":"2001","unstructured":"Lafferty J, McCallum A, Pereira F: Conditional random fields: probabilistic models for segmenting and labeling sequence data. ICML' 01 2001."},{"key":"1361_CR22","volume-title":"HLT\/NAACL-03","author":"F Sha","year":"2003","unstructured":"Sha F, Pereira F: Shallow Parsing with Conditional Random Fields. HLT\/NAACL-03 2003."},{"key":"1361_CR23","volume-title":"CIS, U of Pennsylvania","author":"H Wallach","year":"2004","unstructured":"Wallach H: Conditional Random Fields: An Introduction. CIS, U of Pennsylvania 2004."},{"key":"1361_CR24","volume-title":"ACM SIGIR' 03","author":"D Pinto","year":"2003","unstructured":"Pinto D, McCallum A, Wei X, Croft WB: Table extraction using conditional random fields. ACM SIGIR' 03 2003."},{"key":"1361_CR25","volume-title":"HLT & NAACL' 03","author":"F Sha","year":"2003","unstructured":"Sha F, Pereira F: Shallow parsing with conditional random fields. HLT & NAACL' 03 2003."},{"key":"1361_CR26","volume-title":"the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications","author":"J-D Kim","year":"2004","unstructured":"Kim J-D, Ohta T, Tsuruoka Y, Tateisi Y, Collier N: Introduction to the Bio-Entity Task at JNLPBA. 
the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications 2004."},{"key":"1361_CR27","doi-asserted-by":"publisher","first-page":"1178","DOI":"10.1093\/bioinformatics\/bth060","volume":"20","author":"G Zhou","year":"2004","unstructured":"Zhou G, Zhang J, Su J, Shen D, Tan C: Recognizing names in biomedical texts: a machine learning approach. Bioinformatics 2004, 20: 1178\u20131190. 10.1093\/bioinformatics\/bth060","journal-title":"Bioinformatics"},{"key":"1361_CR28","volume-title":"CoNLL-03","author":"EFTK Sang","year":"2003","unstructured":"Sang EFTK, Meulder FD: Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. CoNLL-03 2003."},{"key":"1361_CR29","doi-asserted-by":"crossref","unstructured":"Tsai RT-H, Wu S-H, Chou W-C, Lin Y-C, He D, Hsiang J, Sung T-Y, Hsu W-L: Various criteria in the evaluation of biomedical named entity recognition. BMC Bioinformatics 2006., 7(92):","DOI":"10.1186\/1471-2105-7-92"},{"key":"1361_CR30","volume-title":"Using biological resources to bootstrap text mining","author":"L Hirschman","year":"2003","unstructured":"Hirschman L: Using biological resources to bootstrap text mining. 2003."},{"key":"1361_CR31","volume-title":"Third Meeting of the Special Interest Group on Text Mining","author":"G Demetrious","year":"2003","unstructured":"Demetrious G, Gaizauskas R: Corpus resources for development and evaluation of a biological text mining system. Third Meeting of the Special Interest Group on Text Mining 2003."},{"key":"1361_CR32","first-page":"43","volume-title":"Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04)","author":"J Gim\u00e9nez","year":"2004","unstructured":"Gim\u00e9nez J, M\u00e1rquez L: SVMTool: A general POS tagger generator based on Support Vector Machines. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04). 
Lisbon, Portugal; 2004:43\u201346."},{"key":"1361_CR33","first-page":"168","volume-title":"CoNLL-03","author":"R Florian","year":"2003","unstructured":"Florian R, Ittycheriah A, Jing H, Zhang T: Named Entity Recognition through Classifier Combination. In CoNLL-03. Edmonton, Canada; 2003:168\u2013171."},{"key":"1361_CR34","volume-title":"HLT\/NAACL-03","author":"K Toutanova","year":"2003","unstructured":"Toutanova K, Klein D, Manning C, Singer Y: Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. HLT\/NAACL-03 2003."},{"key":"1361_CR35","volume-title":"HLT\/EMNLP-05","author":"Y Tsuruoka","year":"2005","unstructured":"Tsuruoka Y, Tsujii Ji: Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data. HLT\/EMNLP-05 2005."},{"key":"1361_CR36","first-page":"440","volume-title":"IDA-05","author":"L Talavera","year":"2005","unstructured":"Talavera L: An evaluation of filter and wrapper methods for feature selection in categorical clustering. In IDA-05. Madrid, Spain: Springer Verlag; 2005:440\u2013451."},{"key":"1361_CR37","volume-title":"ICME-04","author":"Y Liu","year":"2004","unstructured":"Liu Y, Kender JR: Video Feature Selection Using Fast-converging Sort-Merge Tree. In ICME-04. Taipei, Taiwan; 2004."},{"key":"1361_CR38","volume-title":"ICML-03","author":"L Yu","year":"2003","unstructured":"Yu L, Liu H: Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. ICML-03 2003."},{"key":"1361_CR39","doi-asserted-by":"publisher","first-page":"195","DOI":"10.1016\/0022-2836(81)90087-5","volume":"147","author":"TF Smith","year":"1981","unstructured":"Smith TF, Waterman MS: Identification of common molecular subsequences. Journal of Molecular Biology 1981, 147: 195\u2013197. 
10.1016\/0022-2836(81)90087-5","journal-title":"Journal of Molecular Biology"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-7-S5-S11.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/1471-2105-7-S5-S11\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-7-S5-S11.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T02:32:09Z","timestamp":1630463529000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-7-S5-S11"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2006,12]]},"references-count":39,"journal-issue":{"issue":"S5","published-print":{"date-parts":[[2006,12]]}},"alternative-id":["1361"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-7-s5-s11","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2006,12]]},"assertion":[{"value":"18 December 2006","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"S11"}}