{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,8]],"date-time":"2026-05-08T06:09:23Z","timestamp":1778220563930,"version":"3.51.4"},"reference-count":39,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2020,9,1]],"date-time":"2020-09-01T00:00:00Z","timestamp":1598918400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,9,1]],"date-time":"2020-09-01T00:00:00Z","timestamp":1598918400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100004440","name":"Wellcome Trust","doi-asserted-by":"publisher","award":["WT086151\/Z\/08\/Z"],"award-info":[{"award-number":["WT086151\/Z\/08\/Z"]}],"id":[{"id":"10.13039\/100004440","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100004440","name":"Wellcome Trust","doi-asserted-by":"publisher","award":["WT104104\/Z\/14\/Z"],"award-info":[{"award-number":["WT104104\/Z\/14\/Z"]}],"id":[{"id":"10.13039\/100004440","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100013060","name":"European Molecular Biology Laboratory","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100013060","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"published-print":{"date-parts":[[2020,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Background<\/jats:title>\n                    <jats:p>The ChEMBL database is one of a number of public databases that contain bioactivity data on small molecule compounds curated from diverse sources. Incoming compounds are typically not standardised according to consistent rules. 
In order to maintain the quality of the final database and to easily compare and integrate data on the same compound from different sources it is necessary for the chemical structures in the database to be appropriately standardised.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>\n                      A chemical curation pipeline has been developed using the open source toolkit RDKit. It comprises three components: a\n                      <jats:italic>Checker<\/jats:italic>\n                      to test the validity of chemical structures and flag any serious errors; a\n                      <jats:italic>Standardizer<\/jats:italic>\n                      which formats compounds according to defined rules and conventions and a\n                      <jats:italic>GetParent<\/jats:italic>\n                      component that removes any salts and solvents from the compound to create its parent. This pipeline has been applied to the latest version of the ChEMBL database as well as uncurated datasets from other sources to test the robustness of the process and to identify common issues in database molecular structures.\n                    <\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Conclusion<\/jats:title>\n                    <jats:p>All the components of the structure pipeline have been made freely available for other researchers to use and adapt for their own use. The code is available in a GitHub repository and it can also be accessed via the ChEMBL Beaker webservices. 
It has been used successfully to standardise the nearly 2 million compounds in the ChEMBL database and the compound validity checker has been used to identify compounds with the most serious issues so that they can be prioritised for manual curation.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1186\/s13321-020-00456-1","type":"journal-article","created":{"date-parts":[[2020,8,31]],"date-time":"2020-08-31T21:03:32Z","timestamp":1598907812000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":586,"title":["An open source chemical structure curation pipeline using RDKit"],"prefix":"10.1186","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1424-480X","authenticated-orcid":false,"given":"A. Patr\u00edcia","family":"Bento","sequence":"first","affiliation":[]},{"given":"Anne","family":"Hersey","sequence":"additional","affiliation":[]},{"given":"Eloy","family":"F\u00e9lix","sequence":"additional","affiliation":[]},{"given":"Greg","family":"Landrum","sequence":"additional","affiliation":[]},{"given":"Anna","family":"Gaulton","sequence":"additional","affiliation":[]},{"given":"Francis","family":"Atkinson","sequence":"additional","affiliation":[]},{"given":"Louisa J.","family":"Bellis","sequence":"additional","affiliation":[]},{"given":"Marleen","family":"De Veij","sequence":"additional","affiliation":[]},{"given":"Andrew R.","family":"Leach","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2020,9,1]]},"reference":[{"issue":"D1","key":"456_CR1","doi-asserted-by":"publisher","first-page":"D930","DOI":"10.1093\/nar\/gky1075","volume":"47","author":"D Mendez","year":"2019","unstructured":"Mendez D, Gaulton A, Bento AP, Chambers J, De Veij M, Felix E et al (2019) ChEMBL: towards direct deposition of bioassay data. 
Nucleic Acids Res 47(D1):D930\u2013D940","journal-title":"Nucleic Acids Res"},{"issue":"D1","key":"456_CR2","doi-asserted-by":"publisher","first-page":"D1045","DOI":"10.1093\/nar\/gkv1072","volume":"44","author":"MK Gilson","year":"2016","unstructured":"Gilson MK, Liu T, Baitaluk M, Nicola G, Hwang L, Chong J (2016) BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res 44(D1):D1045\u2013D1053","journal-title":"Nucleic Acids Res"},{"issue":"D1","key":"456_CR3","doi-asserted-by":"publisher","first-page":"D1102","DOI":"10.1093\/nar\/gky1033","volume":"47","author":"S Kim","year":"2019","unstructured":"Kim S, Chen J, Cheng T, Gindulyte A, He J, He S et al (2019) PubChem 2019 update: improved access to chemical data. Nucleic Acids Res 47(D1):D1102","journal-title":"Nucleic Acids Res"},{"key":"456_CR4","doi-asserted-by":"publisher","first-page":"244","DOI":"10.1021\/ci00007a012","volume":"32","author":"A Dalby","year":"1992","unstructured":"Dalby A, Nourse JG, Hounshell WD, Gushurst AKI, Grier DL, Leland BA, Laufer J (1992) Description of several chemical structure formats used by computer programs developed at Molecular Design Limited. J Chem Inf Comput Sci 32:244\u2013255","journal-title":"J Chem Inf Comput Sci"},{"issue":"1","key":"456_CR5","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1021\/ci00057a005","volume":"28","author":"D Weininger","year":"1988","unstructured":"Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28(1):31\u201336","journal-title":"J Chem Inf Comput Sci"},{"key":"456_CR6","doi-asserted-by":"publisher","first-page":"33","DOI":"10.1186\/1758-2946-3-33","volume":"3","author":"NM O\u2019Boyle","year":"2011","unstructured":"O\u2019Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR (2011) Open Babel: an open chemical toolbox. J Cheminform. 
3:33","journal-title":"J Cheminform."},{"issue":"11","key":"456_CR7","doi-asserted-by":"publisher","first-page":"3016","DOI":"10.1016\/j.bmc.2018.05.011","volume":"26","author":"P Brear","year":"2018","unstructured":"Brear P, North A, Iegre J, Hadje Georgiou K, Lubin A, Carro L et al (2018) Novel non-ATP competitive small molecules targeting the CK2 alpha\/beta interface. Bioorg Med Chem 26(11):3016\u20133020","journal-title":"Bioorg Med Chem"},{"issue":"6","key":"456_CR8","doi-asserted-by":"publisher","first-page":"2422","DOI":"10.1021\/acs.jmedchem.7b01664","volume":"61","author":"DE Knutson","year":"2018","unstructured":"Knutson DE, Kodali R, Divovic B, Treven M, Stephen MR, Zahn NM et al (2018) Design and synthesis of novel deuterated ligands functionally selective for the gamma-aminobutyric acid type A receptor (GABAAR) alpha6 subtype with improved metabolic stability and enhanced bioavailability. J Med Chem 61(6):2422\u20132446","journal-title":"J Med Chem"},{"issue":"15","key":"456_CR9","doi-asserted-by":"publisher","first-page":"6830","DOI":"10.1021\/acs.jmedchem.8b00718","volume":"61","author":"DR Weiss","year":"2018","unstructured":"Weiss DR, Karpiak J, Huang XP, Sassano MF, Lyu J, Roth BL et al (2018) Selectivity challenges in docking screens for GPCR targets and antitargets. J Med Chem 61(15):6830\u20136845","journal-title":"J Med Chem"},{"issue":"1","key":"456_CR10","doi-asserted-by":"publisher","first-page":"7","DOI":"10.1186\/1758-2946-5-7","volume":"5","author":"S Heller","year":"2013","unstructured":"Heller S, McNaught A, Stein S, Tchekhovskoi D, Pletnev I (2013) InChI\u2014the worldwide chemical structure identifier standard. J Cheminform. 
5(1):7","journal-title":"J Cheminform."},{"key":"456_CR11","doi-asserted-by":"publisher","first-page":"23","DOI":"10.1186\/s13321-015-0068-4","volume":"7","author":"SR Heller","year":"2015","unstructured":"Heller SR, McNaught A, Pletnev I, Stein S, Tchekhovskoi D (2015) InChI, the IUPAC International Chemical Identifier. J Cheminform. 7:23","journal-title":"J Cheminform."},{"key":"456_CR12","unstructured":"InChI Trust Downloads. https:\/\/www.inchi-trust.org\/downloads\/. Accessed 07 Aug 2020"},{"issue":"1","key":"456_CR13","doi-asserted-by":"publisher","first-page":"36","DOI":"10.1186\/s13321-018-0293-8","volume":"10","author":"VD Hahnke","year":"2018","unstructured":"Hahnke VD, Kim S, Bolton EE (2018) PubChem chemical structure standardization. J Cheminform. 10(1):36","journal-title":"J Cheminform."},{"issue":"6\u20137","key":"456_CR14","doi-asserted-by":"publisher","first-page":"521","DOI":"10.1007\/s10822-010-9346-4","volume":"24","author":"M Sitzmann","year":"2010","unstructured":"Sitzmann M, Ihlenfeldt WD, Nicklaus MC (2010) Tautomerism in large databases. J Comput Aided Mol Des 24(6\u20137):521\u2013551","journal-title":"J Comput Aided Mol Des"},{"key":"456_CR15","doi-asserted-by":"publisher","first-page":"30","DOI":"10.1186\/s13321-015-0072-8","volume":"7","author":"K Karapetyan","year":"2015","unstructured":"Karapetyan K, Batchelor C, Sharpe D, Tkachenko V, Williams AJ (2015) The Chemical Validation and Standardization Platform (CVSP): large-scale automated validation of chemical structure datasets. J Cheminform. 7:30","journal-title":"J Cheminform."},{"key":"456_CR16","unstructured":"ChemSpider | Search and share chemistry. http:\/\/www.chemspider.com\/. 
Accessed 07 Aug 2020"},{"issue":"21\u201322","key":"456_CR17","doi-asserted-by":"publisher","first-page":"1188","DOI":"10.1016\/j.drudis.2012.05.016","volume":"17","author":"AJ Williams","year":"2012","unstructured":"Williams AJ, Harland L, Groth P, Pettifer S, Chichester C, Willighagen EL et al (2012) Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today. 17(21\u201322):1188\u20131198","journal-title":"Drug Discov Today."},{"key":"456_CR18","unstructured":"Open PHACTS ops-crs package. https:\/\/github.com\/openphacts\/ops-crs\/tree\/master\/CVSP. Accessed 07 Aug 2020"},{"key":"456_CR19","unstructured":"ChemSpider Blog. http:\/\/cvsp.chemspider.com\/. Accessed 07 Aug 2020"},{"issue":"1","key":"456_CR20","doi-asserted-by":"publisher","first-page":"61","DOI":"10.1186\/s13321-017-0247-6","volume":"9","author":"AJ Williams","year":"2017","unstructured":"Williams AJ, Grulke CM, Edwards J, McEachran AD, Mansouri K, Baker NC et al (2017) The CompTox Chemistry Dashboard: a community data resource for environmental chemistry. J Cheminform. 9(1):61","journal-title":"J Cheminform."},{"key":"456_CR21","unstructured":"ChemIDplus Advanced. https:\/\/chem.nlm.nih.gov\/chemidplus\/. Accessed 07 Aug 2020"},{"issue":"100096","key":"456_CR22","first-page":"1","volume":"12","author":"CM Grulke","year":"2019","unstructured":"Grulke CM, Williams AJ, Thillanadarajah I, Richard AM (2019) EPA\u2019s DSSTox database: history of development of a curated chemistry resource supporting computational toxicology research. Comput Toxicol. 12(100096):1\u201315","journal-title":"Comput Toxicol."},{"key":"456_CR23","unstructured":"FDA | FDA\u2019s Global Substance Registration System. https:\/\/www.fda.gov\/industry\/fda-resources-data-standards\/fdas-global-substance-registration-system. Accessed 07 Aug 2020"},{"key":"456_CR24","unstructured":"Chemical Structure Representation Toolkit | ChemAxon. https:\/\/chemaxon.com\/products\/chemical-structure-representation-toolkit. 
Accessed 07 Aug 2020"},{"key":"456_CR25","unstructured":"BioVia Chemical Representation Guide. http:\/\/help.accelrysonline.com\/insight\/2017\/content\/pdf_files\/bioviachemicalrepresentation2017.pdf. Accessed 07 Aug 2020"},{"key":"456_CR26","unstructured":"MolVS: Molecule Validation and Standardization. https:\/\/molvs.readthedocs.io\/en\/latest\/. Accessed 07 Aug 2020"},{"key":"456_CR27","unstructured":"RDKit: Open-Source Cheminformatics Software. https:\/\/www.rdkit.org. Accessed 07 Aug 2020"},{"key":"456_CR28","unstructured":"ChEMBL chembl_structure_pipeline package. https:\/\/github.com\/chembl\/ChEMBL_Structure_Pipeline\/releases\/tag\/1.0.0. Accessed 07 Aug 2020"},{"key":"456_CR29","unstructured":"ChEMBL standardiser package. https:\/\/github.com\/chembl\/standardiser. Accessed 07 Aug 2020"},{"issue":"12","key":"456_CR30","doi-asserted-by":"publisher","first-page":"811","DOI":"10.1038\/nrd.2017.177","volume":"16","author":"F Sanz","year":"2017","unstructured":"Sanz F, Pognan F, Steger-Hartmann T, Diaz C, Cases M et al (2017) Legacy data sharing to improve drug safety assessment: the eTOX project. Nat Rev Drug Discov. 16(12):811\u2013812","journal-title":"Nat Rev Drug Discov."},{"key":"456_CR31","unstructured":"FDA | Food and Drug Administration Substance Registration System Standard Operation Procedure Substance Definition Manual. https:\/\/www.fda.gov\/downloads\/ForIndustry\/DataStandards\/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII\/ucm127743.pdf. Accessed 07 Aug 2020"},{"issue":"10","key":"456_CR32","doi-asserted-by":"publisher","first-page":"1897","DOI":"10.1351\/pac200678101897","volume":"78","author":"J Brecher","year":"2006","unstructured":"Brecher J (2006) Graphical representation of stereochemical configuration (IUPAC Recommendations 2006). Pure Appl Chem 78(10):1897\u20131970","journal-title":"Pure Appl Chem"},{"key":"456_CR33","unstructured":"American Medical Association (AMA) list of pharmacological salts. 
https:\/\/www.ama-assn.org\/system\/files\/2019-04\/radicals-and-anions-list.pdf. Accessed 07 Aug 2020"},{"key":"456_CR34","unstructured":"FDA | Approved Drug Products with Therapeutic Equivalence Evaluations (Orange Book). https:\/\/www.fda.gov\/drugs\/drug-approvals-and-databases\/approved-drug-products-therapeutic-equivalence-evaluations-orange-book. Accessed 07 Aug 2020"},{"key":"456_CR35","unstructured":"Anaconda Cloud chembl_structure_pipeline package. https:\/\/anaconda.org\/chembl\/chembl_structure_pipeline. Accessed 07 Aug 2020"},{"key":"456_CR36","unstructured":"ChEMBL Beaker. https:\/\/www.ebi.ac.uk\/chembl\/api\/utils\/docs. Accessed 07 Aug 2020"},{"issue":"9","key":"456_CR37","doi-asserted-by":"publisher","first-page":"885","DOI":"10.1007\/s10822-015-9860-5","volume":"29","author":"G Papadatos","year":"2015","unstructured":"Papadatos G, Gaulton A, Hersey A, Overington JP (2015) Activity, assay and target data curation and quality in the ChEMBL database. J Comput Aided Mol Des 29(9):885\u2013896","journal-title":"J Comput Aided Mol Des"},{"key":"456_CR38","unstructured":"ChEMBL: Downloads. ftp:\/\/ftp.ebi.ac.uk\/pub\/databases\/chembl\/ChEMBLdb\/latest\/. Accessed 07 Aug 2020"},{"key":"456_CR39","author":"Power User Gateway (PUG)","year":"2020","unstructured":"Power User Gateway (PUG): PubChem Standardization Tasks. https:\/\/pubchemdocs.ncbi.nlm.nih.gov\/power-user-gateway$_3-3. 
Accessed 07 Aug 2020"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-020-00456-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-020-00456-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-020-00456-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,8,31]],"date-time":"2021-08-31T20:42:16Z","timestamp":1630442536000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-020-00456-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,9,1]]},"references-count":39,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2020,12]]}},"alternative-id":["456"],"URL":"https:\/\/doi.org\/10.1186\/s13321-020-00456-1","relation":{"references":[{"id-type":"uri","id":"https:\/\/pubchemdocs.ncbi.nlm.nih.gov\/power-user-gateway$_3-3","asserted-by":"subject"}],"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-34715\/v2","asserted-by":"object"},{"id-type":"doi","id":"10.21203\/rs.3.rs-34715\/v1","asserted-by":"object"}]},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,9,1]]},"assertion":[{"value":"11 June 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 August 2020","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 September 2020","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article 
History"}},{"value":"The authors have no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"51"}}