{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,16]],"date-time":"2026-03-16T11:49:36Z","timestamp":1773661776758,"version":"3.50.1"},"reference-count":29,"publisher":"Oxford University Press (OUP)","issue":"2","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2005,1,15]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Motivation: With more and more scientific literature published online, the effective management and reuse of this knowledge has become problematic. Natural language processing (NLP) may be a potential solution by extracting, structuring and organizing biomedical information in online literature in a timely manner. One essential task is to recognize and identify genomic entities in text. \u2018Recognition\u2019 can be accomplished using pattern matching and machine learning. But for \u2018identification\u2019 these techniques are not adequate. In order to identify genomic entities, NLP needs a comprehensive resource that specifies and classifies genomic entities as they occur in text and that associates them with normalized terms and also unique identifiers so that the extracted entities are well defined. Online organism databases are an excellent resource to create such a lexical resource. However, gene name ambiguity is a serious problem because it affects the appropriate identification of gene entities. In this paper, we explore the extent of the problem and suggest ways to address it.<\/jats:p><jats:p>Results: We obtained gene information from 21 organisms and quantified naming ambiguities within species, across species, with English words and with medical terms. When the case (of letters) was retained, official symbols displayed negligible intra-species ambiguity (0.02%) and modest ambiguities with general English words (0.57%) and medical terms (1.01%). 
In contrast, the across-species ambiguity was high (14.20%). The inclusion of gene synonyms increased intra-species ambiguity substantially and full names contributed greatly to gene-medical-term ambiguity. A comprehensive lexical resource that covers gene information for the 21 organisms was then created and used to identify gene names by using a straightforward string matching program to process 45\u2009000 abstracts associated with the mouse model organism while ignoring case and gene names that were also English words. We found that 85.1% of correctly retrieved mouse genes were ambiguous with other gene names. When gene names that were also English words were included, 233% additional \u2018gene\u2019 instances were retrieved, most of which were false positives. We also found that authors prefer to use synonyms (74.7%) to official symbols (17.7%) or full names (7.6%) in their publications.<\/jats:p><jats:p>Contact: \u00a0lifeng.chen@dbmi.columbia.edu<\/jats:p>","DOI":"10.1093\/bioinformatics\/bth496","type":"journal-article","created":{"date-parts":[[2004,8,28]],"date-time":"2004-08-28T01:15:02Z","timestamp":1093655702000},"page":"248-256","source":"Crossref","is-referenced-by-count":93,"title":["Gene name ambiguity of eukaryotic nomenclatures"],"prefix":"10.1093","volume":"21","author":[{"given":"Lifeng","family":"Chen","sequence":"first","affiliation":[]},{"given":"Hongfang","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Carol","family":"Friedman","sequence":"additional","affiliation":[]}],"member":"286","published-online":{"date-parts":[[2004,8,27]]},"reference":[{"key":"2023013107193130400_B1","unstructured":"Blake, J.A., Richardson, J.E., Bult, C.J., Kadin, J.A., Eppig, J.T. 2003MGD: the Mouse Genome Database. Nucleic Acids Res.31193\u2013195"},{"key":"2023013107193130400_B2","unstructured":"Cherry, J.M., Adler, C., Ball, C., Chervitz, S.A., Dwight, S.S., Hester, E.T., Jia, Y., Juvik, G., Roe, T., Schroeder, M., Weng, S., Botstein, D. 
1998SGD: Saccharomyces Genome Database. Nucleic Acids Res.2673\u201379"},{"key":"2023013107193130400_B3","doi-asserted-by":"crossref","unstructured":"Christensen, L., Haug, P., Fiszman, M. 2002MPLUS: a probabilistic medical language understanding system. Proceedings of ACL Workshop in Natural Language Processing , Philadelphia, PA , pp. 29\u201336 July 2002","DOI":"10.3115\/1118149.1118154"},{"key":"2023013107193130400_B4","unstructured":"Dolf, G. 1999DogMap: an international collaboration toward a low-resolution canine genetic marker map. J. Hered.903\u20136"},{"key":"2023013107193130400_B5","unstructured":"Friedman, C., Alderson, P.O., Austin, J., Cimino, J.J., Johnson, S.B. 1994A general natural language text processor for clinical radiology. JAMIA1161\u2013174"},{"key":"2023013107193130400_B6","doi-asserted-by":"crossref","unstructured":"Friedman, C., Kra, P., Yu, H., Krauthammer, M., Rzhetsky, A. 2001GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics17Suppl. 1,S74\u2013S82","DOI":"10.1093\/bioinformatics\/17.suppl_1.S74"},{"key":"2023013107193130400_B7","unstructured":"Fukuda, K., Tsunoda, T., Tamura, A., Takagi, T. 1998Information extraction: identifying protein names from biological papers. Proceedings of the Pacific Symposium on Biocomputing'98 (PSB'98) , Hawaii , pp. 707\u2013718 January 1998"},{"key":"2023013107193130400_B8","doi-asserted-by":"crossref","unstructured":"Hanisch, D., Fluck, J., Mevissen, H.T., Zimmer, R. 2003Playing biology's name game: identifying protein names in scientific text. Pac. Symp. Biocomput. , Kauai, HI , pp. 403\u2013414","DOI":"10.1142\/9789812776303_0038"},{"key":"2023013107193130400_B9","doi-asserted-by":"crossref","unstructured":"Harris, T.W., Chen, N., Cunningham, F., Tello-Ruiz, M., Antoshechkin, I., Bastiani, C., Bieri, T., Blasiar, D., Bradnam, K., Chan, J., et al. 2004WormBase: a multi-species resource for nematode biology and genomics. 
Nucleic Acids Res.32D411\u2013D417","DOI":"10.1093\/nar\/gkh066"},{"key":"2023013107193130400_B10","unstructured":"Hirschman, L., Morgan, A.A., Yeh, A.S. 2002Rutabaga by any other name: extracting biological names. J. Biomed. Inform.35247\u2013259"},{"key":"2023013107193130400_B11","unstructured":"Hirschman, L., Park, J.C., Tsujii, J., Wu, C.H. 2002Accomplishments and challenges in literature data mining for biology. Bioinformatics181553\u20131561"},{"key":"2023013107193130400_B12","unstructured":"Hu, J., Mungall, C., Law, A., Papworth, R., Nelson, J.P., Brown, A., Simpson, I., Leckie, S., Burt, D.W., Hillyard, A.L., Archibald, A.L. 2001The ARKdb: genome databases for farmed and other animals. Nucleic Acids Res.29106\u2013110"},{"key":"2023013107193130400_B13","unstructured":"Jenssen, T. and Vinterbo, S.A. 2000A set-covering approach to specific search for literature about human genes. Proceedings of the AMIA Symposium , Los Angeles, CA , pp. 384\u2013388 October 2000"},{"key":"2023013107193130400_B14","doi-asserted-by":"crossref","unstructured":"Jenssen, T.-K., Laegreid, A., Komorowski, J., Hovig, E. 2001A literature network of human genes for high-throughput analysis of gene expression. Nat. Genet.2821\u201328","DOI":"10.1038\/ng0501-21"},{"key":"2023013107193130400_B15","unstructured":"Lindberg, D., Humphreys, B., McCray, A.T. 1993The Unified Medical Language System. Meth. Inform. Med.32281\u2013291"},{"key":"2023013107193130400_B16","unstructured":"Liu, H., Lussier, Y., Friedman, C. 2001A study of abbreviations in the UMLS. Proceedings of the AMIA Symposium , Philadelphia, PA Hanley&Belfus, pp. 393\u2013397"},{"key":"2023013107193130400_B17","unstructured":"Liu, H. and Wu, C. 2004A study of text categorization for model organism databases. Proceedings of NAACLIHLT 2004 , Boston, MA , pp. 25\u201332"},{"key":"2023013107193130400_B18","doi-asserted-by":"crossref","unstructured":"Narayanaswamy, M., Ravikumar, K.E., Vijay-Shanker, K. 
2003A biological named entity recognizer. Pac. Symp. Biocomput. , Kauai, HI , pp. 427\u2013438","DOI":"10.1142\/9789812776303_0040"},{"key":"2023013107193130400_B19","unstructured":"Proux, D., Rechenmann, F., Julliard, L., Pillet, V., Jacq, B. 1998Detecting gene symbols and names in biological texts: a first step toward pertinent information extraction. Genome Inform. Ser Workshop Genome Inform.972\u201380"},{"key":"2023013107193130400_B20","doi-asserted-by":"crossref","unstructured":"Pruitt, K.D. and Maglott, D.R. 2001RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res.29137\u2013140","DOI":"10.1093\/nar\/29.1.137"},{"key":"2023013107193130400_B21","unstructured":"Sager, N., Lyman, M., Nhan, N.T., Tick, L.J. 1995Medical language processing: applications to patient data representation and automatic encoding. Meth. Inform. Med.34140\u2013146"},{"key":"2023013107193130400_B22","doi-asserted-by":"crossref","unstructured":"Shen, D., Zhang, J., Zhou, G., Su, J., Tan, C. 2003Effective adaptation of a hidden Markov model-based named entity recognizer for biomedical domain. Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine , Japan Sapporo, pp. 49\u201356","DOI":"10.3115\/1118958.1118965"},{"key":"2023013107193130400_B23","doi-asserted-by":"crossref","unstructured":"Sprague, J., Doerry, E., Douglas, S., Westerfield, M. 2001The Zebrafish Information Network (ZFIN): a resource for genetic, genomic and developmental research. Nucleic Acids Res.2987\u201390","DOI":"10.1093\/nar\/29.1.87"},{"key":"2023013107193130400_B24","doi-asserted-by":"crossref","unstructured":"Steen, R.G., Kwitek-Black, A.E., Glenn, C., Gullings-Handley, J., Van Etten, W., Atkinson, O.S., Appel, D., Twigger, S., Muir, M., Mull, T., et al. 1999A high-density integrated genetic linkage and radiation hybrid map of the laboratory rat. Genome Res.9793","DOI":"10.1101\/gr.9.6.AP1"},{"key":"2023013107193130400_B25","unstructured":"The FlyBase Consortium. 
2003The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res.31172\u2013175"},{"key":"2023013107193130400_B26","unstructured":"Tuason, O., Chen, L., Liu, H., Blake, J., Friedman, C. 2004Acquisition of lexical knowledge using biological nomenclatures. Pac. Symp. Biocomput.238\u2013249"},{"key":"2023013107193130400_B27","unstructured":"Wain, H.M., Lush, M., Ducluzeau, F., Povey, S. 2002Genew: the human gene nomenclature database. Nucleic Acids Res.30169\u2013171"},{"key":"2023013107193130400_B28","doi-asserted-by":"crossref","unstructured":"Wain, H.M., Bruford, E.A., Lovering, R.C., Lush, M.J., Wright, M.W., Povey, S. 2002Guidelines for Human Gene Nomenclature. Genomics79464\u2013470","DOI":"10.1006\/geno.2002.6748"},{"key":"2023013107193130400_B29","doi-asserted-by":"crossref","unstructured":"Yamamoto, K., Kudo, T., Konagaya, A., Matsumoto, Y. 2003Protein Name Tagging for Biomedical Annotation in Text. Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine , Japan Sapporo, pp. 
65\u201372","DOI":"10.3115\/1118958.1118967"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/21\/2\/248\/48961890\/bioinformatics_21_2_248.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/21\/2\/248\/48961890\/bioinformatics_21_2_248.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,12,18]],"date-time":"2024-12-18T12:13:58Z","timestamp":1734524038000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/21\/2\/248\/187296"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2004,8,27]]},"references-count":29,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2005,1,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bth496","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2005,1,15]]},"published":{"date-parts":[[2004,8,27]]}}}