{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,30]],"date-time":"2025-07-30T11:48:27Z","timestamp":1753876107974,"version":"3.41.2"},"reference-count":31,"publisher":"Oxford University Press (OUP)","license":[{"start":{"date-parts":[[2020,4,15]],"date-time":"2020-04-15T00:00:00Z","timestamp":1586908800000},"content-version":"vor","delay-in-days":105,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000092","name":"National Library of Medicine","doi-asserted-by":"publisher","award":["R01LM012527","R01LM011945"],"award-info":[{"award-number":["R01LM012527","R01LM011945"]}],"id":[{"id":"10.13039\/100000092","id-type":"DOI","asserted-by":"publisher"}]},{"name":"National Institute for Child Health and Development","award":["P41 HD062499"],"award-info":[{"award-number":["P41 HD062499"]}]},{"DOI":"10.13039\/100000051","name":"National Human Genome Research Institute","doi-asserted-by":"publisher","award":["U41HG000330"],"award-info":[{"award-number":["U41HG000330"]}],"id":[{"id":"10.13039\/100000051","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,1,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title\/>\n                  <jats:p>Gathering information from the scientific literature is essential for biomedical research, as much knowledge is conveyed through publications. However, the large and rapidly increasing publication rate makes it impractical for researchers to quickly identify all and only those documents related to their interest. As such, automated biomedical document classification attracts much interest. Such classification is critical in the curation of biological databases, because biocurators must scan through a vast number of articles to identify pertinent information within documents most relevant to the database. This is a slow, labor-intensive process that can benefit from effective automation.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title\/>\n                  <jats:p>We present a document classification scheme aiming to identify papers containing information relevant to a specific topic, among a large collection of articles, for supporting the biocuration classification task. Our framework is based on a meta-classification scheme we have introduced before; here we incorporate into it features gathered from figure captions, in addition to those obtained from titles and abstracts. We trained and tested our classifier over a large imbalanced dataset, originally curated by the Gene Expression Database (GXD). GXD collects all the gene expression information in the Mouse Genome Informatics (MGI) resource. As part of the MGI literature classification pipeline, GXD curators identify MGI-selected papers that are relevant for GXD. The dataset consists of ~60\u2009000 documents (5469 labeled as relevant; 52\u2009866 as irrelevant), gathered throughout 2012\u20132016, in which each document is represented by the text of its title, abstract and figure captions. Our classifier attains precision 0.698, recall 0.784, f-measure 0.738 and Matthews correlation coefficient 0.711, demonstrating that the proposed framework effectively addresses the high imbalance in the GXD classification task. Moreover, our classifier\u2019s performance is significantly improved by utilizing information from image captions compared to using titles and abstracts alone; this observation clearly demonstrates that image captions provide substantial information for supporting biomedical document classification and curation.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title\/>\n                  <jats:p>Database URL:<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/database\/baaa024","type":"journal-article","created":{"date-parts":[[2020,3,18]],"date-time":"2020-03-18T20:09:50Z","timestamp":1584562190000},"source":"Crossref","is-referenced-by-count":11,"title":["Integrating image caption information into biomedical document classification in support of biocuration"],"prefix":"10.1093","volume":"2020","author":[{"given":"Xiangying","family":"Jiang","sequence":"first","affiliation":[{"name":"The Computational Biomedicine and Machine Learning Lab, Department of Computer & Information Sciences, University of Delaware, 18 Amstel Ave, Newark, DE 19716, USA"}]},{"given":"Pengyuan","family":"Li","sequence":"first","affiliation":[{"name":"The Computational Biomedicine and Machine Learning Lab, Department of Computer & Information Sciences, University of Delaware, 18 Amstel Ave, Newark, DE 19716, USA"}]},{"given":"James","family":"Kadin","sequence":"first","affiliation":[{"name":"The Jackson Laboratory, 600 Main Street, Bar Harbor, ME, USA"}]},{"given":"Judith A","family":"Blake","sequence":"first","affiliation":[{"name":"The Jackson Laboratory, 600 Main Street, Bar Harbor, ME, USA"}]},{"given":"Martin","family":"Ringwald","sequence":"first","affiliation":[{"name":"The Jackson Laboratory, 600 Main Street, Bar Harbor, ME, USA"}]},{"given":"Hagit","family":"Shatkay","sequence":"first","affiliation":[{"name":"The Computational Biomedicine and Machine Learning Lab, Department of Computer & Information Sciences, University of Delaware, 18 Amstel Ave, Newark, DE 19716, USA"}]}],"member":"286","published-online":{"date-parts":[[2020,4,15]]},"reference":[{"issue":"12","key":"2020041511352948900_ref1","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pone.0115892","article-title":"Machine learning for biomedical literature triage","volume":"9","author":"Almeida","year":"2014","journal-title":"PLoS One"},{"key":"2020041511352948900_ref2","first-page":"1027","volume-title":"Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms","author":"Arthur","year":"2007"},{"issue":"13","key":"2020041511352948900_ref3","doi-asserted-by":"crossref","first-page":"i41","DOI":"10.1093\/bioinformatics\/btm229","article-title":"Manual curation is not sufficient for annotation of genomic databases","volume":"23","author":"Baumgartner","year":"2007","journal-title":"Bioinformatics"},{"key":"2020041511352948900_ref4","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1162\/tacl_a_00051","article-title":"Enriching word vectors with subword information","volume":"5","author":"Bojanowski","year":"2017","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2020041511352948900_ref5","doi-asserted-by":"crossref","DOI":"10.1093\/database\/baz034","article-title":"Building deep learning models for evidence classification from the open access biomedical literature","author":"Burns","year":"2019"},{"key":"2020041511352948900_ref6","first-page":"161","article-title":"An effective general purpose approach for automated biomedical document classification","author":"Cohen","year":"2006","journal-title":"Proceedings of Annual Symposium of the American Medical Informatics Association (AMIA)"},{"issue":"3","key":"2020041511352948900_ref7","doi-asserted-by":"crossref","first-page":"273","DOI":"10.1007\/BF00994018","article-title":"Support-vector networks","volume":"20","author":"Cortes","year":"1995","journal-title":"Mach. Learn."},{"key":"2020041511352948900_ref8","first-page":"bap019","article-title":"Integrating text mining into the MGI biocuration workflow","author":"Dowell","year":"2009","journal-title":"Database"},{"key":"2020041511352948900_ref9","first-page":"bay076","article-title":"Hierarchical bi-directional attention-based RNNs for supporting document classification on protein\u2013protein interactions affected by genetic mutations","author":"Fergadis","year":"2018","journal-title":"Database"},{"key":"2020041511352948900_ref10","first-page":"278","volume-title":"Proceedings of the Third International Conference on Document Analysis and Recognition","author":"Ho","year":"1995"},{"key":"2020041511352948900_ref11","first-page":"bay091","article-title":"Assisting document triage for human kinome curation via machine learning","author":"Hsu","year":"2018","journal-title":"Database"},{"key":"2020041511352948900_ref12","first-page":"baz045","article-title":"An effective biomedical document classification scheme in support of biocuration: addressing class imbalance","author":"Jiang","year":"2019","journal-title":"Database"},{"key":"2020041511352948900_ref13","first-page":"bax017","article-title":"Effective biomedical document classification for identifying publications relevant to the mouse gene expression database (GXD)","author":"Jiang","year":"2017","journal-title":"Database"},{"issue":"3","key":"2020041511352948900_ref14","doi-asserted-by":"crossref","first-page":"421","DOI":"10.1109\/TCBB.2010.49","article-title":"Empirical investigations into full-text protein interaction article categorization task (ACT) in the BioCreative II. 5 Challenge","volume":"7","author":"Lan","year":"2010","journal-title":"IEEE\/ACM T. Comput. Biol. Bioinf."},{"issue":"8","key":"2020041511352948900_ref15","doi-asserted-by":"crossref","first-page":"e1006390","DOI":"10.1371\/journal.pcbi.1006390","article-title":"Scaling up data curation using deep learning: an application to literature triage in genomic variation resources","volume":"14","author":"Lee","year":"2018","journal-title":"PLoS Comput. Biol."},{"issue":"21","key":"2020041511352948900_ref16","doi-asserted-by":"crossref","first-page":"4381","DOI":"10.1093\/bioinformatics\/btz228","article-title":"Figure and caption extraction from biomedical documents","volume":"35","author":"Li","year":"2019","journal-title":"Bioinformatics"},{"issue":"1","key":"2020041511352948900_ref17","doi-asserted-by":"crossref","first-page":"46","DOI":"10.1186\/1471-2105-10-46","article-title":"Is searching full text more effective than searching abstracts?","volume":"10","author":"Lin","year":"2009","journal-title":"BMC Bioinf."},{"key":"2020041511352948900_ref18","first-page":"496","article-title":"Introduction to Information Retrieval as indicated in the manuscript","volume-title":"Introduction to Information Retrieval","author":"Manning","year":"2008"},{"author":"Mouse Genome Informatics","key":"2020041511352948900_ref19"},{"issue":"1","key":"2020041511352948900_ref20","doi-asserted-by":"crossref","first-page":"94","DOI":"10.1186\/s12859-018-2103-8","article-title":"Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature","volume":"19","author":"M\u00fcller","year":"2018","journal-title":"BMC Bioinf."},{"issue":"15","key":"2020041511352948900_ref21","doi-asserted-by":"crossref","first-page":"1915","DOI":"10.1093\/bioinformatics\/btt317","article-title":"BeCAS: biomedical concept recognition services and visualization","volume":"29","author":"Nunes","year":"2013","journal-title":"Bioinformatics"},{"key":"2020041511352948900_ref22","doi-asserted-by":"crossref","first-page":"1532","DOI":"10.3115\/v1\/D14-1162","article-title":"Glove: global vectors for word representation","author":"Pennington","year":"2014","journal-title":"Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing"},{"key":"2020041511352948900_ref23","first-page":"2227","article-title":"Deep contextualized word representations","volume-title":"Proceedings of the 2018 Conference of the North American Chapter Association for Computational Linguistics: Human Language Technologies","author":"Peters","year":"2018"},{"author":"PMC Author Manuscript Collection","key":"2020041511352948900_ref24"},{"issue":"2","key":"2020041511352948900_ref25","doi-asserted-by":"crossref","first-page":"90","DOI":"10.1145\/772862.772874","article-title":"Rule-based extraction of experimental evidence in the biomedical domain: the KDD Cup 2002 (task 1)","volume":"4","author":"Regev","year":"2002","journal-title":"ACM SIGKDD Explor. Newslett."},{"issue":"14","key":"2020041511352948900_ref26","doi-asserted-by":"crossref","first-page":"e446","DOI":"10.1093\/bioinformatics\/btl235","article-title":"Integrating image data into biomedical text categorization","volume":"22","author":"Shatkay","year":"2006","journal-title":"Bioinformatics"},{"key":"2020041511352948900_ref27","doi-asserted-by":"crossref","first-page":"98","DOI":"10.1145\/2382936.2382949","volume-title":"Proceedings of the ACM Conference on Bioinformatics, Comput. Biol. Biomed","author":"Shatkay","year":"2012"},{"issue":"13","key":"2020041511352948900_ref28","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1186\/s12859-019-2607-x","article-title":"BioReader: a text mining tool for performing classification of biomedical literature","volume":"19","author":"Simon","year":"2019","journal-title":"BMC Bioinf."},{"key":"2020041511352948900_ref29","first-page":"235","article-title":"Probability & Statistics for Engineers & Scientists","author":"Walpole","year":"2012","journal-title":"Prentice Hall"},{"issue":"W1","key":"2020041511352948900_ref30","doi-asserted-by":"crossref","first-page":"W518","DOI":"10.1093\/nar\/gkt441","article-title":"PubTator: a web-based text mining tool for assisting biocuration","volume":"41","author":"Wei","year":"2013","journal-title":"Nucleic Acids Res."},{"author":"WormBase","key":"2020041511352948900_ref31"}],"container-title":["Database"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baaa024\/33047414\/baaa024.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baaa024\/33047414\/baaa024.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,4,15]],"date-time":"2020-04-15T17:12:05Z","timestamp":1586970725000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/database\/article\/doi\/10.1093\/database\/baaa024\/5819650"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,1,1]]},"references-count":31,"URL":"https:\/\/doi.org\/10.1093\/database\/baaa024","relation":{},"ISSN":["1758-0463"],"issn-type":[{"type":"electronic","value":"1758-0463"}],"subject":[],"published-other":{"date-parts":[[2020]]},"published":{"date-parts":[[2020,1,1]]},"article-number":"baaa024"}}