{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,25]],"date-time":"2026-01-25T05:30:15Z","timestamp":1769319015659,"version":"3.49.0"},"reference-count":20,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2014,8,12]],"date-time":"2014-08-12T00:00:00Z","timestamp":1407801600000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"published-print":{"date-parts":[[2014,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches in order to assist in the triage process, both for individual scientists and also for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier, which is able to successfully distinguish between publications that are \u2018ChEMBL-like\u2019 (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology make the ChEMBL corpus a unique resource for text mining.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>The method has been implemented as a data protocol\/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9) respectively. 
Both workflows and models are freely available at: <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"ftp:\/\/ftp.ebi.ac.uk\/pub\/databases\/chembl\/text-mining\" ext-link-type=\"uri\">ftp:\/\/ftp.ebi.ac.uk\/pub\/databases\/chembl\/text-mining<\/jats:ext-link>. These can be readily modified to include additional keyword constraints to further focus searches.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusions<\/jats:title>\n            <jats:p>Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/s13321-014-0040-8","type":"journal-article","created":{"date-parts":[[2014,8,11]],"date-time":"2014-08-11T03:34:31Z","timestamp":1407728071000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["A document classifier for medicinal chemistry publications trained on the ChEMBL corpus"],"prefix":"10.1186","volume":"6","author":[{"given":"George","family":"Papadatos","sequence":"first","affiliation":[]},{"given":"Gerard JP","family":"van Westen","sequence":"additional","affiliation":[]},{"given":"Samuel","family":"Croset","sequence":"additional","affiliation":[]},{"given":"Rita","family":"Santos","sequence":"additional","affiliation":[]},{"given":"Simone","family":"Trubian","sequence":"additional","affiliation":[]},{"given":"John 
P","family":"Overington","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2014,8,12]]},"reference":[{"key":"40_CR1","doi-asserted-by":"publisher","first-page":"D1083","DOI":"10.1093\/nar\/gkt1031","volume":"42","author":"AP Bento","year":"2014","unstructured":"Bento AP, Gaulton A, Hersey A, Bellis LJ, Chambers J, Davies M, Kr\u00fcger FA, Light Y, Mak L, McGlinchey S, Nowotka M, Papadatos G, Santos R, Overington JP: The ChEMBL bioactivity database: an update. Nucleic Acids Res. 2014, 42: D1083-D1090. 10.1093\/nar\/gkt1031.","journal-title":"Nucleic Acids Res"},{"key":"40_CR2","doi-asserted-by":"publisher","first-page":"e65","DOI":"10.1371\/journal.pbio.0030065","volume":"3","author":"D Rebholz-Schuhmann","year":"2005","unstructured":"Rebholz-Schuhmann D, Kirsch H, Couto F: Facts from text\u2013is text mining ready to deliver?. PLoS Biol. 2005, 3: e65-10.1371\/journal.pbio.0030065.","journal-title":"PLoS Biol"},{"key":"40_CR3","first-page":"bar059","volume":"2012","author":"S Burge","year":"2012","unstructured":"Burge S, Attwood TK, Bateman A, Berardini TZ, Cherry M, O\u2019Donovan C, Xenarios L, Gaudet P: Biocurators and biocuration: surveying the 21st century challenges. Database (Oxford). 2012, 2012: bar059-","journal-title":"Database (Oxford)"},{"key":"40_CR4","unstructured":"Europe PubMed Central. [], [http:\/\/europepmc.org\/]"},{"key":"40_CR5","unstructured":"PubMed\/MEDLINE. [], [http:\/\/www.pubmed.org]"},{"key":"40_CR6","doi-asserted-by":"publisher","first-page":"296","DOI":"10.1093\/bioinformatics\/btm557","volume":"24","author":"D Rebholz-Schuhmann","year":"2008","unstructured":"Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A: Text processing through web services: calling Whatizit. Bioinformatics. 2008, 24: 296-298. 
10.1093\/bioinformatics\/btm557.","journal-title":"Bioinformatics"},{"key":"40_CR7","doi-asserted-by":"publisher","first-page":"41","DOI":"10.1186\/1758-2946-3-41","volume":"3","author":"DM Jessop","year":"2011","unstructured":"Jessop DM, Adams SE, Willighagen EL, Hawizy L, Murray-Rust P: OSCAR4: a flexible architecture for chemical text-mining. J Cheminform. 2011, 3: 41-10.1186\/1758-2946-3-41.","journal-title":"J Cheminform"},{"key":"40_CR8","doi-asserted-by":"publisher","first-page":"1633","DOI":"10.1093\/bioinformatics\/bts183","volume":"28","author":"T Rockt\u00e4schel","year":"2012","unstructured":"Rockt\u00e4schel T, Weidlich M, Leser U: ChemSpot: a hybrid system for chemical named entity recognition. Bioinformatics. 2012, 28: 1633-1640. 10.1093\/bioinformatics\/bts183.","journal-title":"Bioinformatics"},{"key":"40_CR9","volume-title":"Proceedings of the fourth BioCreative challenge evaluation workshop","author":"CN Arighi","year":"2013","unstructured":"Arighi CN, Cohen KB, Hirschman L, Lu Z, Tudor CO, Wiegers T, Wilbur WJ, Wu CH: Proceedings of the fourth BioCreative challenge evaluation workshop. 2013, Maryland, USA, Bethesda"},{"key":"40_CR10","doi-asserted-by":"publisher","first-page":"e58201","DOI":"10.1371\/journal.pone.0058201","volume":"8","author":"AP Davis","year":"2013","unstructured":"Davis AP, Wiegers TC, Johnson RJ, Lay JM, Lennon-Hopkins K, Saraceni-Richards C, Sciaky D, Murphy CG, Mattingly CJ: Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the comparative toxicogenomics database. PLoS One. 2013, 8: e58201-10.1371\/journal.pone.0058201.","journal-title":"PLoS One"},{"key":"40_CR11","doi-asserted-by":"publisher","first-page":"bas050","DOI":"10.1093\/database\/bas050","volume":"2012","author":"D Vishnyakova","year":"2012","unstructured":"Vishnyakova D, Pasche E, Ruch P: Using binary classification to prioritize and curate articles for the comparative toxicogenomics database. 
Database (Oxford). 2012, 2012: bas050-10.1093\/database\/bas050.","journal-title":"Database (Oxford)"},{"key":"40_CR12","volume-title":"Machine learning","author":"TM Mitchell","year":"1997","unstructured":"Mitchell TM: Machine learning. 1997, McGraw-Hill, Inc., New York, NY, USA"},{"key":"40_CR13","doi-asserted-by":"publisher","first-page":"103","DOI":"10.1023\/A:1007413511361","volume":"29","author":"P Domingos","year":"1997","unstructured":"Domingos P, Pazzani M: On the optimality of the simple bayesian classifier under zero\u2013one loss. Mach Learn. 1997, 29: 103-130. 10.1023\/A:1007413511361.","journal-title":"Mach Learn"},{"key":"40_CR14","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1023\/A:1010933404324","volume":"45","author":"L Breiman","year":"2001","unstructured":"Breiman L: Random forests. Mach Learn. 2001, 45: 5-32. 10.1023\/A:1010933404324.","journal-title":"Mach Learn"},{"key":"40_CR15","unstructured":"Pipeline pilot. 2012"},{"key":"40_CR16","volume-title":"KNIME: the konstanz information miner","author":"MR Berthold","year":"2007","unstructured":"Berthold MR, Cebron N, Dill F, Gabriel TR, K\u00f6tter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B: KNIME: the konstanz information miner. 2007, Springer, In Stud. Classif. Data Anal. Knowl. Organ"},{"key":"40_CR17","doi-asserted-by":"publisher","first-page":"D198","DOI":"10.1093\/nar\/gkl999","volume":"35","author":"T Liu","year":"2007","unstructured":"Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK: BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res. 2007, 35: D198-D201. 10.1093\/nar\/gkl999.","journal-title":"Nucleic Acids Res"},{"key":"40_CR18","doi-asserted-by":"publisher","first-page":"e1003559","DOI":"10.1371\/journal.pcbi.1003559","volume":"10","author":"GJP Van Westen","year":"2014","unstructured":"Van Westen GJP, Gaulton A, Overington JP: Chemical, target, and bioactive properties of allosteric modulation. 
PLoS Comput Biol. 2014, 10: e1003559-10.1371\/journal.pcbi.1003559.","journal-title":"PLoS Comput Biol"},{"key":"40_CR19","doi-asserted-by":"publisher","first-page":"98","DOI":"10.3163\/1536-5050.100.2.007","volume":"100","author":"HL Brown","year":"2012","unstructured":"Brown HL: Pay-per-view in interlibrary loan: a case study. J Med Libr Assoc. 2012, 100: 98-103. 10.3163\/1536-5050.100.2.007.","journal-title":"J Med Libr Assoc"},{"key":"40_CR20","unstructured":"Malaria-data resource. [], [https:\/\/www.ebi.ac.uk\/chembl\/malaria\/]"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/article\/10.1186\/s13321-014-0040-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-014-0040-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-014-0040-8","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-014-0040-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,2]],"date-time":"2021-09-02T05:11:24Z","timestamp":1630559484000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-014-0040-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,8,12]]},"references-count":20,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2014,12]]}},"alternative-id":["40"],"URL":"https:\/\/doi.org\/10.1186\/s13321-014-0040-8","relation":{},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2014,8,12]]},"assertion":[{"value":"7 April 
2014","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 July 2014","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 August 2014","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"40"}}