{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T23:46:31Z","timestamp":1773272791550,"version":"3.50.1"},"reference-count":24,"publisher":"Oxford University Press (OUP)","issue":"20","license":[{"start":{"date-parts":[[2016,10,2]],"date-time":"2016-10-02T00:00:00Z","timestamp":1475366400000},"content-version":"vor","delay-in-days":3339,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/2.0\/uk\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2007,10,15]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Motivation: One of the bottlenecks of biomedical data integration is variation of terms. Exact string matching often fails to associate a name with its biological concept, i.e. ID or accession number in the database, due to seemingly small differences of names. Soft string matching potentially enables us to find the relevant ID by considering the similarity between the names. However, the accuracy of soft matching highly depends on the similarity measure employed.<\/jats:p><jats:p>Results: We used logistic regression for learning a string similarity measure from a dictionary. 
Experiments using several large-scale gene\/protein name dictionaries showed that the logistic regression-based similarity measure outperforms existing similarity measures in dictionary look-up tasks.<\/jats:p><jats:p>Availability: A dictionary look-up system using the similarity measures described in this article is available at http:\/\/text0.mib.man.ac.uk\/software\/mldic\/<\/jats:p><jats:p>Contact: yoshimasa.tsuruoka@manchester.ac.uk<\/jats:p>","DOI":"10.1093\/bioinformatics\/btm393","type":"journal-article","created":{"date-parts":[[2007,8,14]],"date-time":"2007-08-14T00:12:36Z","timestamp":1187050356000},"page":"2768-2774","source":"Crossref","is-referenced-by-count":62,"title":["Learning string similarity measures for gene\/protein name dictionary look-up using logistic regression"],"prefix":"10.1093","volume":"23","author":[{"given":"Yoshimasa","family":"Tsuruoka","sequence":"first","affiliation":[{"name":"1 School of Computer Science, The University of Manchester, Manchester, 2 National Centre for Text Mining (NaCTeM), Manchester, UK and 3 Department of Computer Science, The University of Tokyo, Japan"}]},{"given":"John","family":"McNaught","sequence":"additional","affiliation":[{"name":"1 School of Computer Science, The University of Manchester, Manchester, 2 National Centre for Text Mining (NaCTeM), Manchester, UK and 3 Department of Computer Science, The University of Tokyo, Japan"}]},{"given":"Jun'ichi","family":"Tsujii","sequence":"additional","affiliation":[{"name":"1 School of Computer Science, The University of Manchester, Manchester, 2 National Centre for Text Mining (NaCTeM), Manchester, UK and 3 Department of Computer Science, The University of Tokyo, Japan"}]},{"given":"Sophia","family":"Ananiadou","sequence":"additional","affiliation":[{"name":"1 School of Computer Science, The University of Manchester, Manchester, 2 National Centre for Text Mining (NaCTeM), Manchester, UK and 3 Department of Computer Science, The University of Tokyo, Japan"}]}],
"member":"286","published-online":{"date-parts":[[2007,8,12]]},"reference":[{"key":"2023041106211547200_","first-page":"39","article-title":"Adaptive duplicate detection using learnable string similarity measures","author":"Bilenko","year":"2003"},{"key":"2023041106211547200_","first-page":"58","article-title":"Adaptive product normalization: Using online learning for record linkage in comparison shopping","author":"Bilenko","year":"2005"},{"key":"2023041106211547200_","first-page":"14","article-title":"Contrast and variability in gene names","author":"Cohen","year":"2002"},{"key":"2023041106211547200_","first-page":"475","article-title":"Learning to match and cluster large high-dimensional data sets for data integration","author":"Cohen","year":"2002"},{"issue":"7","key":"2023041106211547200_","doi-asserted-by":"crossref","first-page":"440","DOI":"10.1186\/1471-2105-7-440","article-title":"A graph-search framework for associating gene identifiers with documents","author":"Cohen","year":"2006","journal-title":"BMC 
Bioinformatics"},{"key":"2023041106211547200_","doi-asserted-by":"crossref","first-page":"S13","DOI":"10.1186\/1471-2105-6-S1-S13","article-title":"Automatically annotating documents with normalized gene lists","volume":"6","author":"Crim","year":"2005","journal-title":"BMC Bioinformatics"},{"key":"2023041106211547200_","first-page":"41","article-title":"Human gene name normalization using text matching with automatically extracted synonym dictionaries","author":"Fang","year":"2006"},{"key":"2023041106211547200_","doi-asserted-by":"crossref","first-page":"S14","DOI":"10.1186\/1471-2105-6-S1-S14","article-title":"ProMiner: rule-based protein and gene entity recognition","volume":"6","author":"Hanisch","journal-title":"BMC Bioinformatics"},{"key":"2023041106211547200_","first-page":"21","article-title":"Implementing the iHOP concept for navigation of biomedical literature","author":"Hoffmann","year":"2005","journal-title":"BMC Bioinformatics"},{"key":"2023041106211547200_","doi-asserted-by":"crossref","first-page":"1231","DOI":"10.1101\/gr.835903","article-title":"Kinase pathway database: An integrated protein-kinase and NLP-based protein\u2013interaction resource","volume":"13","author":"Koike","year":"2003","journal-title":"Genome Res"},{"key":"2023041106211547200_","doi-asserted-by":"crossref","first-page":"245","DOI":"10.1016\/S0378-1119(00)00431-5","article-title":"Using BLAST for identifying gene and protein names in journal articles","volume":"259","author":"Krauthammer","year":"2000","journal-title":"Gene"},{"key":"2023041106211547200_","first-page":"8","article-title":"Binary codes capable of correcting spurious insertions and deletions of ones","volume":"1","author":"Levenshtein","year":"1965","journal-title":"Prob. Inf. 
Transm"},{"key":"2023041106211547200_","doi-asserted-by":"crossref","first-page":"103","DOI":"10.1093\/bioinformatics\/bti749","article-title":"BioThesaurus: a web-based thesaurus of protein and gene names","volume":"22","author":"Liu","year":"2006","journal-title":"Bioinformatics"},{"key":"2023041106211547200_","first-page":"388","article-title":"A conditional random field for discriminatively-trained finite-state string edit distance","author":"McCallum","year":"2005"},{"key":"2023041106211547200_","first-page":"1017","article-title":"Semantic retrieval for the accurate identification of relational concepts in massive textbases","author":"Miyao","year":"2006"},{"key":"2023041106211547200_","first-page":"17","article-title":"Overview of BioCreative II gene normalization","author":"Morgan","year":"2007"},{"key":"2023041106211547200_","doi-asserted-by":"crossref","first-page":"396","DOI":"10.1016\/j.jbi.2004.08.010","article-title":"Gene name identification and normalization using a model organism database","volume":"37","author":"Morgan","year":"2004","journal-title":"J. Biomed. Inform"},{"key":"2023041106211547200_","doi-asserted-by":"crossref","first-page":"522","DOI":"10.1109\/34.682181","article-title":"Learning string-edit distance","volume":"20","author":"Ristad","year":"1998","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell"},{"issue":"27","key":"2023041106211547200_","doi-asserted-by":"crossref","first-page":"77","DOI":"10.1016\/S1476-9271(02)00096-8","article-title":"Hidden Markov models and optimized sequence alignments","author":"Smith","year":"2003","journal-title":"Comput. Biol. 
Chem"},{"key":"2023041106211547200_","doi-asserted-by":"crossref","first-page":"2748","DOI":"10.1093\/bioinformatics\/bti338","article-title":"MaSTerClass: a case-based reasoning system for the classification of biomedical terms","volume":"21","author":"Spasic","year":"2005","journal-title":"Bioinformatics"},{"key":"2023041106211547200_","doi-asserted-by":"crossref","first-page":"461","DOI":"10.1016\/j.jbi.2004.08.003","article-title":"Improving the performance of dictionary-based approaches in protein name recognition","volume":"37","author":"Tsuruoka","year":"2004","journal-title":"J. Biomed. Inform"},{"key":"2023041106211547200_","first-page":"9","article-title":"Adaptive string similarity metrics for biomedical reference resolution","author":"Wellner","year":"2005"},{"key":"2023041106211547200_","article-title":"The state of record linkage and current research problems","volume-title":"Technical report","author":"Winkler","year":"1999"},{"key":"2023041106211547200_","doi-asserted-by":"crossref","first-page":"97","DOI":"10.1016\/j.compbiolchem.2003.12.003","article-title":"Identification of related gene\/protein names based on an HMM of name variations","volume":"28","author":"Yeganova","year":"2004","journal-title":"Comput. Biol. 
Chem"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/23\/20\/2768\/49816871\/bioinformatics_23_20_2768.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/23\/20\/2768\/49816871\/bioinformatics_23_20_2768.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,17]],"date-time":"2024-02-17T10:57:44Z","timestamp":1708167464000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/23\/20\/2768\/229308"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2007,8,12]]},"references-count":24,"journal-issue":{"issue":"20","published-print":{"date-parts":[[2007,10,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btm393","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2007,10,15]]},"published":{"date-parts":[[2007,8,12]]}}}