{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,3]],"date-time":"2025-09-03T10:53:37Z","timestamp":1756896817727,"version":"3.41.2"},"reference-count":27,"publisher":"Oxford University Press (OUP)","funder":[{"name":"Intramural Research Program of the National Library of Medicine (NLM), National Institutes of Health"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,10,13]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The automatic assignment of species information to the corresponding genes in a research article is a critically important step in the gene normalization task, whereby a gene mention is normalized and linked to a database record or an identifier by a text-mining algorithm. Existing methods typically rely on heuristic rules based on gene and species co-occurrence in the article, but their accuracy is suboptimal. We therefore developed a high-performance method, using a novel deep learning-based framework, to identify whether there is a relation between a gene and a species. Instead of the traditional binary classification framework in which all possible pairs of genes and species in the same article are evaluated, we treat the problem as a sequence labeling task such that only a fraction of the pairs needs to be considered. Our benchmarking results show that our approach obtains significantly higher performance compared to that of the rule-based baseline method for the species assignment task (from 65.8\u201381.3% in accuracy). The source code and data for species assignment are freely available.<\/jats:p><jats:p>Database URL https:\/\/github.com\/ncbi\/SpeciesAssignment<\/jats:p>","DOI":"10.1093\/database\/baac090","type":"journal-article","created":{"date-parts":[[2022,10,13]],"date-time":"2022-10-13T13:32:34Z","timestamp":1665667954000},"source":"Crossref","is-referenced-by-count":4,"title":["Assigning species information to corresponding genes by a sequence labeling framework"],"prefix":"10.1093","volume":"2022","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5141-0259","authenticated-orcid":false,"given":"Ling","family":"Luo","sequence":"first","affiliation":[{"name":"National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH) , 8600 Rockville Pike, Bethesda, MD 20894, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5094-7321","authenticated-orcid":false,"given":"Chih-Hsuan","family":"Wei","sequence":"additional","affiliation":[{"name":"National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH) , 8600 Rockville Pike, Bethesda, MD 20894, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2025-318X","authenticated-orcid":false,"given":"Po-Ting","family":"Lai","sequence":"additional","affiliation":[{"name":"National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH) , 8600 Rockville Pike, Bethesda, MD 20894, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6036-1516","authenticated-orcid":false,"given":"Qingyu","family":"Chen","sequence":"additional","affiliation":[{"name":"National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH) , 8600 Rockville Pike, Bethesda, MD 20894, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5651-1860","authenticated-orcid":false,"given":"Rezarta","family":"Islamaj","sequence":"additional","affiliation":[{"name":"National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH) , 8600 Rockville Pike, Bethesda, MD 20894, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9998-916X","authenticated-orcid":false,"given":"Zhiyong","family":"Lu","sequence":"additional","affiliation":[{"name":"National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH) , 8600 Rockville Pike, Bethesda, MD 20894, USA"}]}],"member":"286","published-online":{"date-parts":[[2022,10,13]]},"reference":[{"key":"2022101312371500400_R1","doi-asserted-by":"crossref","first-page":"3454","DOI":"10.1093\/bioinformatics\/btx439","article-title":"On expert curation and scalability: UniProtKB\/Swiss-Prot as a case study","volume":"33","author":"Poux","year":"2017","journal-title":"Bioinformatics"},{"key":"2022101312371500400_R2","doi-asserted-by":"crossref","DOI":"10.1093\/database\/bas049","article-title":"BioCreative-2012 virtual issue","volume":"2012","author":"Wu","year":"2012","journal-title":"Database"},{"key":"2022101312371500400_R3","doi-asserted-by":"crossref","first-page":"D1534","DOI":"10.1093\/nar\/gkaa952","article-title":"LitCovid: an open database of COVID-19 literature","volume":"49","author":"Chen","year":"2021","journal-title":"Nucleic Acids Res."},{"key":"2022101312371500400_R4","doi-asserted-by":"crossref","first-page":"W530","DOI":"10.1093\/nar\/gky355","article-title":"LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC","volume":"46","author":"Allot","year":"2018","journal-title":"Nucleic Acids Res."},{"key":"2022101312371500400_R5","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pcbi.1006390","article-title":"Scaling up data curation using deep learning: an application to literature triage in genomic variation resources","volume":"14","author":"Lee","year":"2018","journal-title":"PLoS Comput. Biol."},{"key":"2022101312371500400_R6","doi-asserted-by":"crossref","first-page":"W587","DOI":"10.1093\/nar\/gkz389","article-title":"PubTator central: automated concept annotation for biomedical full text articles","volume":"47","author":"Wei","year":"2019","journal-title":"Nucleic Acids Res."},{"key":"2022101312371500400_R7","doi-asserted-by":"crossref","DOI":"10.1155\/2015\/918710","article-title":"GNormPlus: an integrative approach for tagging genes, gene families, and protein domains","volume":"2015","author":"Wei","year":"2015","journal-title":"Biomed. Res. Int."},{"key":"2022101312371500400_R8","doi-asserted-by":"crossref","DOI":"10.1186\/1471-2105-12-S8-S2","article-title":"The gene normalization task in BioCreative III","volume":"12","author":"Lu","year":"2011","journal-title":"BMC Bioinform."},{"key":"2022101312371500400_R9","doi-asserted-by":"crossref","first-page":"2769","DOI":"10.1093\/bioinformatics\/btr455","article-title":"The GNAT library for local and remote gene mention normalization","volume":"27","author":"Hakenberg","year":"2011","journal-title":"Bioinformatics"},{"key":"2022101312371500400_R10","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/1471-2105-11-85","article-title":"LINNAEUS: a species name identification system for biomedical literature","volume":"11","author":"Gerner","year":"2010","journal-title":"BMC Bioinform."},{"key":"2022101312371500400_R11","doi-asserted-by":"crossref","first-page":"2721","DOI":"10.1093\/bioinformatics\/btr452","article-title":"OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents","volume":"27","author":"Naderi","year":"2011","journal-title":"Bioinformatics"},{"key":"2022101312371500400_R12","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pone.0065390","article-title":"The SPECIES and ORGANISMS resources for fast and accurate identification of taxonomic names in text","volume":"8","author":"Pafilis","year":"2013","journal-title":"PLoS One"},{"key":"2022101312371500400_R13","doi-asserted-by":"crossref","first-page":"462","DOI":"10.1109\/TCBB.2010.48","article-title":"Exploring species-based strategies for gene normalization","volume":"7","author":"Verspoor","year":"2010","journal-title":"IEEE\/ACM Trans. Comput. Biol. Bioinform. Biol. Insights"},{"key":"2022101312371500400_R14","doi-asserted-by":"crossref","first-page":"1032","DOI":"10.1093\/bioinformatics\/btr042","article-title":"GeneTUKit: a software for document-level gene normalization","volume":"27","author":"Huang","year":"2011","journal-title":"Bioinformatics"},{"key":"2022101312371500400_R15","article-title":"SR4GN: a species recognition software tool for gene normalization","volume":"7","author":"Wei","year":"2012","journal-title":"PLoS One"},{"key":"2022101312371500400_R16","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3458754","article-title":"Domain-specific language model pretraining for biomedical natural language processing","volume":"3","author":"Gu","year":"2021","journal-title":"ACM Trans. Comput. Healthcare"},{"key":"2022101312371500400_R17","first-page":"272","article-title":"Team bioformer at BioCreative VII LitCovid Track: multic-label topic classification for COVID-19 literature with a compact BERT model","author":"Fang","year":"2021"},{"key":"2022101312371500400_R18","doi-asserted-by":"crossref","first-page":"2792","DOI":"10.1093\/bioinformatics\/btab042","article-title":"HunFlair: an easy-to-use tool for state-of-the-art biomedical named entity recognition","volume":"37","author":"Weber","year":"2021","journal-title":"Bioinformatics"},{"article-title":"Systema naturae; sive, Regna tria naturae: systematice proposita per classes, ordines, genera & species","year":"1735","author":"Linnaeus","key":"2022101312371500400_R19"},{"key":"2022101312371500400_R20","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s12859-020-3457-2","article-title":"Exploiting sequence labeling framework to extract document-level relations from biomedical texts","volume":"21","author":"Li","year":"2020","journal-title":"BMC Bioinform."},{"key":"2022101312371500400_R21","doi-asserted-by":"crossref","DOI":"10.1016\/j.jbi.2020.103384","article-title":"A neural network-based joint learning approach for biomedical entity and relation extraction from biomedical literature","volume":"103","author":"Luo","year":"2020","journal-title":"J. Biomed. Inform."},{"key":"2022101312371500400_R22","first-page":"26","article-title":"Extracting drug-protein interaction using an ensemble of biomedical pre-trained language models through sequence labeling and text classification techniques","author":"Luo","year":"2021"},{"key":"2022101312371500400_R23","doi-asserted-by":"crossref","first-page":"4087","DOI":"10.1093\/bioinformatics\/bty449","article-title":"Transfer learning for biomedical named entity recognition with neural networks","volume":"34","author":"Giorgi","year":"2018","journal-title":"Bioinformatics"},{"key":"2022101312371500400_R24","doi-asserted-by":"crossref","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","article-title":"BioBERT: a pre-trained biomedical language representation model for biomedical text mining","volume":"36","author":"Lee","year":"2020","journal-title":"Bioinformatics"},{"key":"2022101312371500400_R25","doi-asserted-by":"crossref","first-page":"295","DOI":"10.1093\/bioinformatics\/btz528","article-title":"HUNER: improving biomedical NER with pretraining","volume":"36","author":"Weber","year":"2020","journal-title":"Bioinformatics"},{"key":"2022101312371500400_R26","doi-asserted-by":"crossref","DOI":"10.1016\/j.jbi.2021.103779","article-title":"NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition","volume":"118","author":"Islamaj","year":"2021","journal-title":"J. Biomed. Inform."},{"key":"2022101312371500400_R27","first-page":"1","article-title":"Adam: a method for stochastic optimization","author":"Kingma","year":"2015"}],"container-title":["Database"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baac090\/46497302\/baac090.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baac090\/46497302\/baac090.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,3,7]],"date-time":"2023-03-07T22:46:55Z","timestamp":1678229215000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/database\/article\/doi\/10.1093\/database\/baac090\/6760187"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,1,1]]},"references-count":27,"URL":"https:\/\/doi.org\/10.1093\/database\/baac090","relation":{},"ISSN":["1758-0463"],"issn-type":[{"type":"electronic","value":"1758-0463"}],"subject":[],"published-other":{"date-parts":[[2022,1,1]]},"published":{"date-parts":[[2022,1,1]]},"article-number":"baac090"}}