{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T05:12:48Z","timestamp":1768281168166,"version":"3.49.0"},"reference-count":34,"publisher":"Springer Science and Business Media LLC","issue":"1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2007,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>The Immune Epitope Database contains information on immune epitopes curated manually from the scientific literature. Like similar projects in other knowledge domains, significant effort is spent on identifying which articles are relevant for this purpose.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>We here report our experience in automating this process using Na\u00efve Bayes classifiers trained on 20,910 abstracts classified by domain experts. Improvements on the basic classifier performance were made by a) utilizing information stored in PubMed beyond the abstract itself b) applying standard feature selection criteria and c) extracting domain specific feature patterns that e.g. identify peptides sequences. We have implemented the classifier into the curation process determining if abstracts are clearly relevant, clearly irrelevant, or if no certain classification can be made, in which case the abstracts are manually classified. 
Testing this classification scheme on an independent dataset, we achieve 95% sensitivity and specificity in the 51.1% of abstracts that were automatically classified.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusion<\/jats:title>\n            <jats:p>By implementing text classification, we have sped up the reference selection process without sacrificing sensitivity or specificity of the human expert classification. This study provides both practical recommendations for users of text classification tools, as well as a large dataset which can serve as a benchmark for tool developers.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/1471-2105-8-269","type":"journal-article","created":{"date-parts":[[2007,8,2]],"date-time":"2007-08-02T18:14:04Z","timestamp":1186078444000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":36,"title":["Automating document classification for the Immune Epitope Database"],"prefix":"10.1186","volume":"8","author":[{"given":"Peng","family":"Wang","sequence":"first","affiliation":[]},{"given":"Alexander A","family":"Morgan","sequence":"additional","affiliation":[]},{"given":"Qing","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Alessandro","family":"Sette","sequence":"additional","affiliation":[]},{"given":"Bjoern","family":"Peters","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2007,7,26]]},"reference":[{"issue":"Database issue","key":"1641_CR1","doi-asserted-by":"publisher","first-page":"D115","DOI":"10.1093\/nar\/gkh131","volume":"32","author":"R Apweiler","year":"2004","unstructured":"Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS: UniProt: the Universal Protein knowledgebase. Nucleic acids research 2004, 32(Database issue):D115\u20139. 
10.1093\/nar\/gkh131","journal-title":"Nucleic acids research"},{"key":"1641_CR2","unstructured":"GeneRIF[http:\/\/www.ncbi.nlm.nih.gov\/projects\/GeneRIF\/GeneRIFhelp.html]"},{"issue":"Database issue","key":"1641_CR3","doi-asserted-by":"publisher","first-page":"D471","DOI":"10.1093\/nar\/gki113","volume":"33","author":"JT Eppig","year":"2005","unstructured":"Eppig JT, Bult CJ, Kadin JA, Richardson JE, Blake JA, Anagnostopoulos A, Baldarelli RM, Baya M, Beal JS, Bello SM, Boddy WJ, Bradt DW, Burkart DL, Butler NE, Campbell J, Cassell MA, Corbani LE, Cousins SL, Dahmen DJ, Dene H, Diehl AD, Drabkin HJ, Frazer KS, Frost P, Glass LH, Goldsmith CW, Grant PL, Lennon-Pierce M, Lewis J, Lu I, Maltais LJ, McAndrews-Hill M, McClellan L, Miers DB, Miller LA, Ni L, Ormsby JE, Qi D, Reddy TB, Reed DJ, Richards-Smith B, Shaw DR, Sinclair R, Smith CL, Szauter P, Walker MB, Walton DO, Washburn LL, Witham IT, Zhu Y: The Mouse Genome Database (MGD): from genes to mice--a community resource for mouse biology. Nucleic acids research 2005, 33(Database issue):D471\u20135. 10.1093\/nar\/gki113","journal-title":"Nucleic acids research"},{"issue":"Database issue","key":"1641_CR4","doi-asserted-by":"publisher","first-page":"D277","DOI":"10.1093\/nar\/gkh063","volume":"32","author":"M Kanehisa","year":"2004","unstructured":"Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic acids research 2004, 32(Database issue):D277\u201380. 10.1093\/nar\/gkh063","journal-title":"Nucleic acids research"},{"issue":"1","key":"1641_CR5","doi-asserted-by":"publisher","first-page":"303","DOI":"10.1093\/nar\/30.1.303","volume":"30","author":"I Xenarios","year":"2002","unstructured":"Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic acids research 2002, 30(1):303\u2013305. 
10.1093\/nar\/30.1.303","journal-title":"Nucleic acids research"},{"issue":"1","key":"1641_CR6","doi-asserted-by":"publisher","first-page":"248","DOI":"10.1093\/nar\/gkg056","volume":"31","author":"GD Bader","year":"2003","unstructured":"Bader GD, Betel D, Hogue CW: BIND: the Biomolecular Interaction Network Database. Nucleic acids research 2003, 31(1):248\u2013250. 10.1093\/nar\/gkg056","journal-title":"Nucleic acids research"},{"issue":"3","key":"1641_CR7","doi-asserted-by":"publisher","first-page":"e91","DOI":"10.1371\/journal.pbio.0030091","volume":"3","author":"B Peters","year":"2005","unstructured":"Peters B, Sidney J, Bourne P, Bui HH, Buus S, Doh G, Fleri W, Kronenberg M, Kubo R, Lund O, Nemazee D, Ponomarenko JV, Sathiamurthy M, Schoenberger S, Stewart S, Surko P, Way S, Wilson S, Sette A: The immune epitope database and analysis resource: from vision to blueprint. PLoS biology 2005, 3(3):e91. 10.1371\/journal.pbio.0030091","journal-title":"PLoS biology"},{"issue":"5","key":"1641_CR8","doi-asserted-by":"publisher","first-page":"326","DOI":"10.1007\/s00251-005-0803-5","volume":"57","author":"B Peters","year":"2005","unstructured":"Peters B, Sidney J, Bourne P, Bui HH, Buus S, Doh G, Fleri W, Kronenberg M, Kubo R, Lund O, Nemazee D, Ponomarenko JV, Sathiamurthy M, Schoenberger SP, Stewart S, Surko P, Way S, Wilson S, Sette A: The design and implementation of the immune epitope database and analysis resource. Immunogenetics 2005, 57(5):326\u2013336. 10.1007\/s00251-005-0803-5","journal-title":"Immunogenetics"},{"key":"1641_CR9","doi-asserted-by":"publisher","first-page":"341","DOI":"10.1186\/1471-2105-7-341","volume":"7","author":"R Vita","year":"2006","unstructured":"Vita R, Vaughan K, Zarebski L, Salimi N, Fleri W, Grey H, Sathiamurthy M, Mokili J, Bui HH, Bourne PE, Ponomarenko J, de Castro R Jr., Chan RK, Sidney J, Wilson SS, Stewart S, Way S, Peters B, Sette A: Curation of complex, context-dependent immunological data. BMC bioinformatics 2006, 7: 341. 
10.1186\/1471-2105-7-341","journal-title":"BMC bioinformatics"},{"key":"1641_CR10","volume-title":"Foundations of Statistical Natural Language Processing","author":"C Manning","year":"1999","unstructured":"Manning C, Sch\u00fctze H: Foundations of Statistical Natural Language Processing. 1999."},{"issue":"4","key":"1641_CR11","doi-asserted-by":"publisher","first-page":"344","DOI":"10.1093\/bib\/6.4.344","volume":"6","author":"W Hersh","year":"2005","unstructured":"Hersh W: Evaluation of biomedical text-mining systems: lessons learned from information retrieval. Briefings in bioinformatics 2005, 6(4):344\u2013356. 10.1093\/bib\/6.4.344","journal-title":"Briefings in bioinformatics"},{"key":"1641_CR12","volume-title":"TREC 2006 Genomics Track Overview: Gaithersburg, MD.","author":"W Hersh","year":"2006","unstructured":"Hersh W, Cohen AM, Roberts P, Rekapalli HK: TREC 2006 Genomics Track Overview: Gaithersburg, MD. ; 2006."},{"key":"1641_CR13","first-page":"321","volume-title":"In Proceeding of the Sixth IEEE CAIA,","author":"P Hayes","year":"1990","unstructured":"Hayes P, Andersen P, Nirenburg I, Schmandt L: TCS: A Shell for ContentBased Text Categorization. In Proceeding of the Sixth IEEE CAIA, 1990, 321--325."},{"issue":"1","key":"1641_CR14","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/505282.505283","volume":"34","author":"F Sebastiani","year":"2002","unstructured":"Sebastiani F: Machine learning in automated text categorization. ACM Computing Surveys 2002, 34(1):1--47. 10.1145\/505282.505283","journal-title":"ACM Computing Surveys"},{"key":"1641_CR15","volume-title":"In AAAI-98 Workshop on Learning for Text Categorization, 1998","author":"AN McCallum","year":"1998","unstructured":"McCallum AN, Nigam K: A comparison of event models for Naive Bayes text classification. 
In AAAI-98 Workshop on Learning for Text Categorization, 1998 1998."},{"issue":"3","key":"1641_CR16","doi-asserted-by":"publisher","first-page":"233","DOI":"10.1145\/183422.183423","volume":"12","author":"C Apte","year":"1994","unstructured":"Apte C, Damerau F, Weiss SM: Automated Learning of Decision Rules for Text Categorization. ACM Transactions on Information Systems 1994, 12(3):233\u2013251. 10.1145\/183422.183423","journal-title":"ACM Transactions on Information Systems"},{"key":"1641_CR17","first-page":"137","volume-title":"Text categorization with support vector machines: learning with many relevant features","author":"T Joachims","year":"1998","unstructured":"Joachims T: Text categorization with support vector machines: learning with many relevant features. 1998, 137--142."},{"key":"1641_CR18","doi-asserted-by":"publisher","first-page":"11","DOI":"10.1186\/1471-2105-4-11","volume":"4","author":"I Donaldson","year":"2003","unstructured":"Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, Tuekam B, Zhang S, Baskin B, Bader GD, Michalickova K, Pawson T, Hogue CW: PreBIND and Textomy--mining the biomedical literature for protein-protein interactions using a support vector machine. BMC bioinformatics 2003, 4: 11. 10.1186\/1471-2105-4-11","journal-title":"BMC bioinformatics"},{"key":"1641_CR19","doi-asserted-by":"publisher","first-page":"i91","DOI":"10.1093\/bioinformatics\/btg1011","volume":"19 Suppl 1","author":"PB Dobrokhotov","year":"2003","unstructured":"Dobrokhotov PB, Goutte C, Veuthey AL, Gaussier E: Combining NLP and probabilistic categorisation for document and term selection for Swiss-Prot medical annotation. Bioinformatics (Oxford, England) 2003, 19 Suppl 1: i91\u20134. 
10.1093\/bioinformatics\/btg1011","journal-title":"Bioinformatics (Oxford, England)"},{"issue":"2","key":"1641_CR20","first-page":"32","volume":"16","author":"O Miotto","year":"2005","unstructured":"Miotto O, Tan TW, Brusic V: Supporting the curation of biological databases with reusable text mining. Genome informatics 2005, 16(2):32\u201344.","journal-title":"Genome informatics"},{"key":"1641_CR21","doi-asserted-by":"publisher","first-page":"370","DOI":"10.1186\/1471-2105-7-370","volume":"7","author":"D Chen","year":"2006","unstructured":"Chen D, Muller HM, Sternberg PW: Automatic document classification of biological literature. BMC bioinformatics 2006, 7: 370. 10.1186\/1471-2105-7-370","journal-title":"BMC bioinformatics"},{"key":"1641_CR22","first-page":"792","volume-title":"Proceedings of AAAI-98, 15th Conference of the American Association for Artificial Intelligence","author":"K Nigam","year":"1998","unstructured":"Nigam K, McCallum AK, Thrun S, Mitchell TM: Learning to classify text from labeled and unlabeled documents. Proceedings of AAAI-98, 15th Conference of the American Association for Artificial Intelligence 1998, 792--799."},{"key":"1641_CR23","doi-asserted-by":"publisher","first-page":"129","DOI":"10.1002\/asi.4630270302","volume":"27","author":"SE Robertson","year":"1976","unstructured":"Robertson SE, Sp\u00e4rck Jones K: Relevance weighting of search terms. Journal of the American Society for Information Science 1976, 27: 129\u2013146. 10.1002\/asi.4630270302","journal-title":"Journal of the American Society for Information Science"},{"key":"1641_CR24","volume-title":"AAAI'98 Workshop on Learning for Text Categorization","author":"M Sahami","year":"1998","unstructured":"Sahami M, Dumais S, Heckerman D, Horvitz E: A Bayesian Approach to Filtering Junk E-Mail. 
AAAI'98 Workshop on Learning for Text Categorization 1998."},{"key":"1641_CR25","volume-title":"Data Mining: Practical machine learning tools and techniques","author":"IH Witten","year":"2005","unstructured":"Witten IH, Frank E: Data Mining: Practical machine learning tools and techniques. 2nd edition. San Francisco, Morgan Kaufmann; 2005.","edition":"2nd Edition"},{"key":"1641_CR26","doi-asserted-by":"publisher","first-page":"130","DOI":"10.1108\/eb046814","volume":"14","author":"M Porter","year":"1980","unstructured":"Porter M: An algorithm for suffix stripping. Program (Automated Library and Information Systems) 1980, 14: 130\u2013137.","journal-title":"Program (Automated Library and Information Systems)"},{"issue":"17","key":"1641_CR27","doi-asserted-by":"publisher","first-page":"2136","DOI":"10.1093\/bioinformatics\/btl350","volume":"22","author":"B Han","year":"2006","unstructured":"Han B, Obradovic Z, Hu ZZ, Wu CH, Vucetic S: Substring selection for biomedical document classification. Bioinformatics (Oxford, England) 2006, 22(17):2136\u20132142. 10.1093\/bioinformatics\/btl350","journal-title":"Bioinformatics (Oxford, England)"},{"key":"1641_CR28","first-page":"191","volume-title":"Viewing morphology as an inference process: Pittsburgh.","author":"R Krovetz","year":"1993","unstructured":"Krovetz R: Viewing morphology as an inference process: Pittsburgh. ; 1993:191\u2013203."},{"key":"1641_CR29","doi-asserted-by":"publisher","first-page":"75","DOI":"10.1186\/1471-2105-6-75","volume":"6","author":"BP Suomela","year":"2005","unstructured":"Suomela BP, Andrade MA: Ranking the whole MEDLINE database according to a large training set using text indexing. BMC bioinformatics 2005, 6: 75. 
10.1186\/1471-2105-6-75","journal-title":"BMC bioinformatics"},{"key":"1641_CR30","unstructured":"BioCreAtIvE[http:\/\/biocreative.sourceforge.net\/]"},{"key":"1641_CR31","volume-title":"McGraw-Hill Series in Computer Science","author":"TM Mitchell","year":"1997","unstructured":"Mitchell TM: Machine Learning. In McGraw-Hill Series in Computer Science. Edited by: Liu CL. New York , MIT press and The McGraw-Hill Companies, Inc; 1997."},{"key":"1641_CR32","first-page":"1137","volume-title":"A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection","author":"R Kohavi","year":"1995","unstructured":"Kohavi R: A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. 1995, 1137\u20131145."},{"issue":"1","key":"1641_CR33","doi-asserted-by":"publisher","first-page":"29","DOI":"10.1148\/radiology.143.1.7063747","volume":"143","author":"JA Hanley","year":"1982","unstructured":"Hanley JA, McNeil BJ: The meaning and use of the area under a receiver operating characteristic (ROC) curve. 
Radiology 1982, 143(1):29\u201336.","journal-title":"Radiology"},{"key":"1641_CR34","unstructured":"R[http:\/\/www.r-project.org\/]"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-8-269.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T10:15:01Z","timestamp":1630491301000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-8-269"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2007,7,26]]},"references-count":34,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2007,12]]}},"alternative-id":["1641"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-8-269","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2007,7,26]]},"assertion":[{"value":"15 March 2007","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 July 2007","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 July 2007","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"269"}}