{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,11]],"date-time":"2026-01-11T11:54:05Z","timestamp":1768132445697,"version":"3.49.0"},"reference-count":34,"publisher":"Oxford University Press (OUP)","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2019,1,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>This study proposes a text similarity model to help biocuration efforts of the Conserved Domain Database (CDD). CDD is a curated resource that catalogs annotated multiple sequence alignment models for ancient domains and full-length proteins. These models allow for fast searching and quick identification of conserved motifs in protein sequences via Reverse PSI-BLAST. In addition, CDD curators prepare summaries detailing the function of these conserved domains and specific protein families, based on published peer-reviewed articles. To facilitate information access for database users, it is desirable to specifically identify the referenced articles that support the assertions of curator-composed sentences. Moreover, CDD curators desire an alert system that scans the newly published literature and proposes related articles of relevance to the existing CDD records. Our approach to address these needs is a text similarity method that automatically maps a curator-written statement to candidate sentences extracted from the list of referenced articles, as well as the articles in the PubMed Central database. To evaluate this proposal, we paired CDD description sentences with the top 10 matching sentences from the literature, which were given to curators for review. Through this exercise, we discovered that we were able to map the articles in the reference list to the CDD description statements with an accuracy of 77%. In the dataset that was reviewed by curators, we were able to successfully provide references for 86% of the curator statements. In addition, we suggested new articles for curator review, which were accepted by curators to be added into the reference list at an acceptance rate of 50%. Through this process, we developed a substantial corpus of similar sentences from biomedical articles on protein sequence, structure and function research, which constitute the CDD text similarity corpus. This corpus contains 5159 sentence pairs judged for their similarity on a scale from 1 (low) to 5 (high) doubly annotated by four CDD curators. Curator-assigned similarity scores have a Pearson correlation coefficient of 0.70 and an inter-annotator agreement of 85%. To date, this is the largest biomedical text similarity resource that has been manually judged, evaluated and made publicly available to the community to foster research and development of text similarity algorithms.<\/jats:p>","DOI":"10.1093\/database\/baz064","type":"journal-article","created":{"date-parts":[[2019,4,24]],"date-time":"2019-04-24T19:11:42Z","timestamp":1556133102000},"source":"Crossref","is-referenced-by-count":16,"title":["PubMed Text Similarity Model and its application to curation efforts in the Conserved Domain Database"],"prefix":"10.1093","volume":"2019","author":[{"given":"Rezarta","family":"Islamaj","sequence":"first","affiliation":[{"name":"National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA"}]},{"given":"W John","family":"Wilbur","sequence":"additional","affiliation":[{"name":"National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA"}]},{"given":"Natalie","family":"Xie","sequence":"additional","affiliation":[{"name":"National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA"}]},{"given":"Noreen R","family":"Gonzales","sequence":"additional","affiliation":[{"name":"National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA"}]},{"given":"Narmada","family":"Thanki","sequence":"additional","affiliation":[{"name":"National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA"}]},{"given":"Roxanne","family":"Yamashita","sequence":"additional","affiliation":[{"name":"National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA"}]},{"given":"Chanjuan","family":"Zheng","sequence":"additional","affiliation":[{"name":"National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA"}]},{"given":"Aron","family":"Marchler-Bauer","sequence":"additional","affiliation":[{"name":"National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA"}]},{"given":"Zhiyong","family":"Lu","sequence":"additional","affiliation":[{"name":"National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA"}]}],"member":"286","published-online":{"date-parts":[[2019,7,2]]},"reference":[{"key":"2019070300224996400_ref1","doi-asserted-by":"crossref","first-page":"i41","DOI":"10.1093\/bioinformatics\/btm229","article-title":"Manual curation is not sufficient for annotation of genomic databases","volume":"23","author":"Baumgartner","year":"2007","journal-title":"Bioinformatics"},{"key":"2019070300224996400_ref2","doi-asserted-by":"crossref","first-page":"S16","DOI":"10.1038\/527S16a","article-title":"Perspective: sustaining the big-data ecosystem","volume":"527","author":"Bourne","year":"2015","journal-title":"Nature"},{"key":"2019070300224996400_ref3","doi-asserted-by":"crossref","first-page":"3454","DOI":"10.1093\/bioinformatics\/btx439","article-title":"On expert curation and scalability: UniProtKB\/Swiss-Prot as a case study","volume":"33","author":"Poux","year":"2017","journal-title":"Bioinformatics"},{"key":"2019070300224996400_ref4","doi-asserted-by":"crossref","first-page":"bas020","DOI":"10.1093\/database\/bas020","article-title":"Text mining for the biocuration workflow","volume":"2012","author":"Hirschman","year":"2012","journal-title":"Database (Oxford)"},{"key":"2019070300224996400_ref5","doi-asserted-by":"crossref","first-page":"S1","DOI":"10.1186\/gb-2008-9-s2-s1","article-title":"Evaluation of text-mining systems for biology: overview of the second BioCreative community challenge","volume":"9","author":"Krallinger","year":"2008","journal-title":"Genome Biol."},{"key":"2019070300224996400_ref6","doi-asserted-by":"crossref","first-page":"bas056","DOI":"10.1093\/database\/bas056","article-title":"An overview of the BioCreative 2012 Workshop Track III: interactive text mining task","volume":"2013","author":"Arighi","year":"2013","journal-title":"Database (Oxford)"},{"key":"2019070300224996400_ref7","doi-asserted-by":"crossref","first-page":"D200","DOI":"10.1093\/nar\/gkw1129","article-title":"CDD\/SPARCLE: functional classification of proteins via subfamily domain architectures","volume":"45","author":"Marchler-Bauer","year":"2017","journal-title":"Nucleic Acids Res."},{"key":"2019070300224996400_ref8","doi-asserted-by":"crossref","first-page":"D279","DOI":"10.1093\/nar\/gkv1344","article-title":"The Pfam protein families database: towards a more sustainable future","volume":"44","author":"Finn","year":"2016","journal-title":"Nucleic Acids Res."},{"key":"2019070300224996400_ref9","doi-asserted-by":"crossref","first-page":"D257","DOI":"10.1093\/nar\/gku949","article-title":"SMART: recent updates, new developments and status in 2015","volume":"43","author":"Letunic","year":"2015","journal-title":"Nucleic Acids Res."},{"key":"2019070300224996400_ref10","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1093\/nar\/29.1.22","article-title":"The COG database: new developments in phylogenetic classification of proteins from complete genomes","volume":"29","author":"Tatusov","year":"2001","journal-title":"Nucleic Acids Res."},{"key":"2019070300224996400_ref11","doi-asserted-by":"crossref","first-page":"D387","DOI":"10.1093\/nar\/gks1234","article-title":"TIGRFAMs and genome properties in 2013","volume":"41","author":"Haft","year":"2013","journal-title":"Nucleic Acids Res."},{"key":"2019070300224996400_ref12","doi-asserted-by":"crossref","first-page":"D216","DOI":"10.1093\/nar\/gkn734","article-title":"The National Center for Biotechnology Information's Protein Clusters Database","volume":"37","author":"Klimke","year":"2009","journal-title":"Nucleic Acids Res."},{"key":"2019070300224996400_ref13","doi-asserted-by":"crossref","DOI":"10.1075\/li.30.1.03nad","article-title":"A survey of named entity recognition and classification","author":"Nadeau","year":"2007","journal-title":"Lingvisticae Investigationes"},{"key":"2019070300224996400_ref14","doi-asserted-by":"crossref","first-page":"S1","DOI":"10.1186\/1758-2946-7-S1-S1","article-title":"CHEMDNER: the drugs and chemical names extraction challenge","volume":"7","author":"Krallinger","year":"2015","journal-title":"J. Cheminform."},{"key":"2019070300224996400_ref15","first-page":"2145","volume-title":"Proceedings of the 27th International Conference on Computational Linguistics","author":"Yadav","year":"2018"},{"key":"2019070300224996400_ref16","volume-title":"Pacific Symposium on Biocomputing","author":"Chun","year":"2006"},{"key":"2019070300224996400_ref17","volume-title":"Natural Language Processing and Text Mining","author":"Bunescu","year":"2007"},{"key":"2019070300224996400_ref18","author":"Yang","year":"2016","journal-title":"Proceedings of Workshop on Biomedical Language Processing"},{"key":"2019070300224996400_ref19","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/P17-1171","article-title":"Reading Wikipedia to answer open-domain questions","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics","author":"Chen","year":"2017"},{"key":"2019070300224996400_ref20","author":"Allahyari","year":"2017","journal-title":"Text summarization techniques: a brief survey"},{"key":"2019070300224996400_ref21","doi-asserted-by":"crossref","first-page":"218","DOI":"10.1145\/1111449.1111496","volume-title":"Proceedings of the 11th International Conference on Intelligent User Interfaces","author":"Badi","year":"2006"},{"key":"2019070300224996400_ref22","doi-asserted-by":"crossref","first-page":"W518","DOI":"10.1093\/nar\/gkt441","article-title":"PubTator: a web-based text mining tool for assisting biocuration","volume":"41","author":"Wei","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2019070300224996400_ref23","first-page":"13","article-title":"A survey of text similarity approaches","volume":"68","author":"Gomaa","year":"2013","journal-title":"Int. J. Comput. Appl."},{"key":"2019070300224996400_ref24","first-page":"16","author":"Metzler","year":"2007"},{"key":"2019070300224996400_ref25","first-page":"385","volume-title":"Proceedings of the First Joint Conference on Lexical and Computational Semantics\u2014Volume 1: Proceedings of the main conference and the shared task and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation","author":"Agirre","year":"2012"},{"key":"2019070300224996400_ref26","doi-asserted-by":"crossref","first-page":"i49","DOI":"10.1093\/bioinformatics\/btx238","article-title":"BIOSSES: a semantic sentence similarity estimation system for the biomedical domain","volume":"33","author":"So\u011fanc\u0131o\u011flu","year":"2017","journal-title":"Bioinformatics"},{"key":"2019070300224996400_ref27","volume-title":"ACMBCB\u201918: 9th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics 2018","author":"Chen","year":"2018"},{"key":"2019070300224996400_ref28","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1613\/jair.2985","article-title":"A survey of paraphrasing and textual entailment methods","volume":"38","author":"Androutsopoulos","year":"2010","journal-title":"J. Artif. Intell. Res."},{"key":"2019070300224996400_ref29","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511809071","volume-title":"Introduction to Information Retrieval","author":"Manning","year":"2008"},{"key":"2019070300224996400_ref30","first-page":"1411","volume-title":"Short text similarity with word embeddings. CIKM","author":"Kenter","year":"2015"},{"key":"2019070300224996400_ref31","first-page":"1275","author":"Song","year":"2015"},{"key":"2019070300224996400_ref32","volume-title":"Evaluating web-based question answering systems","author":"Radev","year":"2002"},{"key":"2019070300224996400_ref33","doi-asserted-by":"crossref","first-page":"509","DOI":"10.1007\/s10791-008-9069-5","article-title":"The ineffectiveness of within\u2014document term frequency in text classification","volume":"12","author":"Wilbur","year":"2009","journal-title":"Inf. Retr."},{"key":"2019070300224996400_ref34","volume-title":"Proceedings of the 26th Annual Conference on Learning Theory (COLT 2013)","author":"Wang","year":"2013"}],"container-title":["Database"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baz064\/28895934\/baz064.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,9,16]],"date-time":"2022-09-16T20:10:56Z","timestamp":1663359056000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/database\/article\/doi\/10.1093\/database\/baz064\/5527151"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,1,1]]},"references-count":34,"URL":"https:\/\/doi.org\/10.1093\/database\/baz064","relation":{},"ISSN":["1758-0463"],"issn-type":[{"value":"1758-0463","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2019]]},"published":{"date-parts":[[2019,1,1]]},"article-number":"baz064"}}