{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,15]],"date-time":"2026-01-15T06:26:33Z","timestamp":1768458393633,"version":"3.49.0"},"reference-count":17,"publisher":"Oxford University Press (OUP)","license":[{"start":{"date-parts":[[2021,9,29]],"date-time":"2021-09-29T00:00:00Z","timestamp":1632873600000},"content-version":"vor","delay-in-days":28,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100009633","name":"Eunice Kennedy Shriver National Institute of Child Health and Human Development","doi-asserted-by":"publisher","award":["P41 HD064556"],"award-info":[{"award-number":["P41 HD064556"]}],"id":[{"id":"10.13039\/100009633","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100009633","name":"Eunice Kennedy Shriver National Institute of Child Health and Human Development","doi-asserted-by":"publisher","award":["P41 HD095831"],"award-info":[{"award-number":["P41 HD095831"]}],"id":[{"id":"10.13039\/100009633","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,9,29]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>A keyword-based search of comprehensive databases such as PubMed may return irrelevant papers, especially if the keywords are used in multiple fields of study. In such cases, domain experts (curators) need to verify the results and remove the irrelevant articles. Automating this filtering process will save time, but it has to be done well enough to ensure few relevant papers are rejected and few irrelevant papers are accepted. A good solution would be fast, work with the limited amount of data freely available (full paper body may be missing), handle ambiguous keywords and be as domain-neutral as possible. In this paper, we evaluate a number of classification algorithms for identifying a domain-specific set of papers about echinoderm species and show that the resulting tool satisfies most of the abovementioned requirements. Echinoderms consist of a number of very different organisms, including brittle stars, sea stars (starfish), sea urchins and sea cucumbers. While their taxonomic identifiers are specific, the common names are used in many other contexts, creating ambiguity and making a keyword search prone to error. We try classifiers using Linear, Na\u00efve Bayes, Nearest Neighbor, Tree, SVM, Bagging, AdaBoost and Neural Network learning models and compare their performance. We show how effective the resulting classifiers are in filtering irrelevant articles returned from PubMed. The methodology used is more dependent on the good selection of training data and is a practical solution that can be applied to other fields of study facing similar challenges.<\/jats:p>\n               <jats:p>Database URL: The code and date reported in this paper are freely available at http:\/\/xenbaseturbofrog.org\/pub\/Text-Topic-Classifier\/<\/jats:p>","DOI":"10.1093\/database\/baab062","type":"journal-article","created":{"date-parts":[[2021,9,17]],"date-time":"2021-09-17T03:12:02Z","timestamp":1631848322000},"source":"Crossref","is-referenced-by-count":9,"title":["Classifying domain-specific text documents containing ambiguous keywords"],"prefix":"10.1093","volume":"2021","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5733-9409","authenticated-orcid":false,"given":"Kamran","family":"Karimi","sequence":"first","affiliation":[{"name":"Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada"}]},{"given":"Sergei","family":"Agalakov","sequence":"additional","affiliation":[{"name":"Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada"}]},{"given":"Cheryl A","family":"Telmer","sequence":"additional","affiliation":[{"name":"Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA"}]},{"given":"Thomas R","family":"Beatman","sequence":"additional","affiliation":[{"name":"Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2340-5356","authenticated-orcid":false,"given":"Troy J","family":"Pells","sequence":"additional","affiliation":[{"name":"Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada"}]},{"given":"Bradley Im","family":"Arshinoff","sequence":"additional","affiliation":[{"name":"Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada"}]},{"given":"Carolyn J","family":"Ku","sequence":"additional","affiliation":[{"name":"Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1791-2837","authenticated-orcid":false,"given":"Saoirse","family":"Foley","sequence":"additional","affiliation":[{"name":"Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA"}]},{"given":"Veronica F","family":"Hinman","sequence":"additional","affiliation":[{"name":"Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA"}]},{"given":"Charles A","family":"Ettensohn","sequence":"additional","affiliation":[{"name":"Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA"}]},{"given":"Peter D","family":"Vize","sequence":"additional","affiliation":[{"name":"Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada"}]}],"member":"286","published-online":{"date-parts":[[2021,9,29]]},"reference":[{"key":"2022031012251320100_R1","doi-asserted-by":"crossref","DOI":"10.1186\/s12859-019-2607-x","article-title":"BioReader: a text mining tool for performing classification of biomedical literature","volume":"19","author":"Simon","year":"2019","journal-title":"BMC Bioinform."},{"key":"2022031012251320100_R2","doi-asserted-by":"publisher","DOI":"10.5772\/intechopen.75924","article-title":"Application of biomedical text mining, artificial intelligence - emerging trends and applications","author":"Gong","year":"2018","journal-title":"IntechOpen"},{"key":"2022031012251320100_R3","doi-asserted-by":"crossref","first-page":"97","DOI":"10.1016\/j.ymeth.2015.01.015","article-title":"Application of text mining in the biomedical domain","volume":"74","author":"Fleuren","year":"2015","journal-title":"Methods"},{"key":"2022031012251320100_R4","doi-asserted-by":"crossref","DOI":"10.1093\/database\/bas020","article-title":"Text mining for the biocuration workflow","volume":"2012","author":"Hirschman","year":"2012","journal-title":"Database"},{"key":"2022031012251320100_R5","doi-asserted-by":"crossref","DOI":"10.1093\/database\/bas043","article-title":"Biocuration workflows and text mining: overview of the BioCreative 2012 Workshop Track II","volume":"2012","author":"Lu","year":"2012","journal-title":"Database"},{"key":"2022031012251320100_R6","doi-asserted-by":"crossref","DOI":"10.1126\/science.abc7839","article-title":"Scientists are drowning in COVID-19 papers. Can new tools keep them afloat?","author":"Brainard","year":"2020","journal-title":"Science"},{"key":"2022031012251320100_R7","doi-asserted-by":"crossref","first-page":"781","DOI":"10.1093\/bib\/bbaa296","article-title":"Text mining approaches for dealing with the rapidly expanding literature on COVID-19","volume":"22","author":"Wang","year":"2021","journal-title":"Brief. Bioinf."},{"key":"2022031012251320100_R8","doi-asserted-by":"crossref","DOI":"10.1201\/b17320","volume-title":"Data Classification: Algorithms and Applications","author":"Aggarwal","year":"2014"},{"key":"2022031012251320100_R9","doi-asserted-by":"crossref","DOI":"10.1093\/database\/bax017","article-title":"Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)","volume":"2017","author":"Jiang","year":"2017","journal-title":"Database"},{"key":"2022031012251320100_R10","doi-asserted-by":"crossref","DOI":"10.1093\/database\/bas040","article-title":"Text mining in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR","volume":"2012","author":"Van Auken","year":"2012","journal-title":"Database"},{"key":"2022031012251320100_R11","volume-title":"Starfish, Urchins, and Other Echinoderms","author":"Gilpin","year":"2006"},{"key":"2022031012251320100_R12","doi-asserted-by":"crossref","first-page":"349","DOI":"10.1007\/978-1-4939-7737-6_12","article-title":"EchinoBase: tools for echinoderm genome analyses","volume":"1757","author":"Cary","year":"2018","journal-title":"Methods Mol. Biol."},{"key":"2022031012251320100_R13","volume-title":"Entrez Programming Utilities Help [Internet]","author":"Sayers","year":"2010"},{"key":"2022031012251320100_R14","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pcbi.1005962","article-title":"A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts","volume":"14","author":"Westergaard","year":"2018","journal-title":"PLoS Comput. Biol."},{"key":"2022031012251320100_R15","first-page":"2825","article-title":"Scikit-learn: machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"JMLR"},{"key":"2022031012251320100_R16","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkab326","article-title":"LitSuggest: a web-based system for literature recommendation and curation using machine learning","author":"Allot","year":"2021","journal-title":"Nucleic Acids Res."},{"key":"2022031012251320100_R17","doi-asserted-by":"crossref","first-page":"D861","DOI":"10.1093\/nar\/gkx936","article-title":"Xenbase: a genomic, epigenomic and transcriptomic model organism database","volume":"46","author":"Karimi","year":"2018","journal-title":"Nucleic Acids Res."}],"container-title":["Database"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baab062\/42803547\/baab062.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baab062\/42803547\/baab062.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,3,10]],"date-time":"2022-03-10T12:26:57Z","timestamp":1646915217000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/database\/article\/doi\/10.1093\/database\/baab062\/6377760"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,9,1]]},"references-count":17,"URL":"https:\/\/doi.org\/10.1093\/database\/baab062","relation":{},"ISSN":["1758-0463"],"issn-type":[{"value":"1758-0463","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,9,1]]},"published":{"date-parts":[[2021,9,1]]},"article-number":"baab062"}}