{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T01:55:25Z","timestamp":1774922125428,"version":"3.50.1"},"reference-count":30,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2022,5,28]],"date-time":"2022-05-28T00:00:00Z","timestamp":1653696000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,5,28]],"date-time":"2022-05-28T00:00:00Z","timestamp":1653696000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100000289","name":"Cancer Research UK","doi-asserted-by":"publisher","award":["C35696\/A23187"],"award-info":[{"award-number":["C35696\/A23187"]}],"id":[{"id":"10.13039\/501100000289","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000289","name":"Cancer Research UK","doi-asserted-by":"publisher","award":["C35696\/A23187"],"award-info":[{"award-number":["C35696\/A23187"]}],"id":[{"id":"10.13039\/501100000289","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000289","name":"Cancer Research UK","doi-asserted-by":"publisher","award":["C35696\/A23187"],"award-info":[{"award-number":["C35696\/A23187"]}],"id":[{"id":"10.13039\/501100000289","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000289","name":"Cancer Research UK","doi-asserted-by":"publisher","award":["C309\/A11566"],"award-info":[{"award-number":["C309\/A11566"]}],"id":[{"id":"10.13039\/501100000289","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100010269","name":"Wellcome Trust","doi-asserted-by":"publisher","award":["212969\/Z\/18\/Z"],"award-info":[{"award-number":["212969\/Z\/18\/Z"]}],"id":[{"id":"10.13039\/100010269","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100010269","name":"Wellcome Trust","doi-asserted-by":"publisher","award":["204735\/Z\/16\/Z"],"award-info":[{"award-number":["204735\/Z\/16\/Z"]}],"id":[{"id":"10.13039\/100010269","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100011264","name":"FP7 People: Marie-Curie Actions","doi-asserted-by":"publisher","award":["FP7\/2007\u20132013"],"award-info":[{"award-number":["FP7\/2007\u20132013"]}],"id":[{"id":"10.13039\/100011264","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"published-print":{"date-parts":[[2022,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Background<\/jats:title>\n                <jats:p>Integration of medicinal chemistry data from numerous public resources is an increasingly important part of academic drug discovery and translational research because it can bring a wealth of important knowledge related to compounds in one place. However, different data sources can report the same or related compounds in various forms (e.g., tautomers, racemates, etc.), thus highlighting the need of organising related compounds in hierarchies that alert the user on important bioactivity data that may be relevant. To generate these compound hierarchies, we have developed and implemented canSARchem, a new compound registration and standardization pipeline as part of the canSAR public knowledgebase. canSARchem builds on previously developed ChEMBL and PubChem pipelines and is developed using KNIME. We describe the pipeline which we make publicly available, and we provide examples on the strengths and limitations of the use of hierarchies for bioactivity data exploration. Finally, we identify canonicalization enrichment in FDA-approved drugs, illustrating the benefits of our approach.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>We created a chemical registration and standardization pipeline in KNIME and made it freely available to the research community. The pipeline consists of five steps to register the compounds and create the compounds\u2019 hierarchy: 1. Structure checker, 2. Standardization, 3. Generation of canonical tautomers and representative structures, 4. Salt strip, and 5. Generation of abstract structure to generate the compound hierarchy. Unlike ChEMBL\u2019s RDKit pipeline, we carry out compound canonicalization ahead of getting the parent structure, similar to PubChem\u2019s OpenEye pipeline. canSARchem has a lower rejection rate compared to both PubChem and ChEMBL. We use our pipeline to assess the impact of grouping the compounds in hierarchies for bioactivity data exploration. We find that FDA-approved drugs show statistically significant sensitivity to canonicalization compared to the majority of bioactive compounds which demonstrates the importance of this step.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Conclusions<\/jats:title>\n                <jats:p>We use canSARchem to standardize all the compounds uploaded in canSAR (&gt;\u20093\u00a0million) enabling efficient data integration and the rapid identification of alternative compound forms with useful bioactivity data. Comparison with PubChem and ChEMBL pipelines evidenced comparable performances in compound standardization, but only PubChem and canSAR canonicalize tautomers and canSAR has a slightly lower rejection rate. Our results highlight the importance of compound hierarchies for bioactivity data exploration. We make canSARchem available under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/gitlab.icr.ac.uk\/cansar-public\/compound-registration-pipeline\">https:\/\/gitlab.icr.ac.uk\/cansar-public\/compound-registration-pipeline<\/jats:ext-link>.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1186\/s13321-022-00606-7","type":"journal-article","created":{"date-parts":[[2022,5,28]],"date-time":"2022-05-28T16:02:43Z","timestamp":1653753763000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":16,"title":["canSAR chemistry registration and standardization pipeline"],"prefix":"10.1186","volume":"14","author":[{"given":"Daniela","family":"Dolciami","sequence":"first","affiliation":[]},{"given":"Eloy","family":"Villasclaras-Fernandez","sequence":"additional","affiliation":[]},{"given":"Christos","family":"Kannas","sequence":"additional","affiliation":[]},{"given":"Mirco","family":"Meniconi","sequence":"additional","affiliation":[]},{"given":"Bissan","family":"Al-Lazikani","sequence":"additional","affiliation":[]},{"given":"Albert A.","family":"Antolin","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,5,28]]},"reference":[{"key":"606_CR1","doi-asserted-by":"publisher","first-page":"D1074","DOI":"10.1093\/nar\/gkaa1059","volume":"49","author":"C Mitsopoulos","year":"2021","unstructured":"Mitsopoulos C, Di Micco P, Fernandez EV et al (2021) CanSAR: Update to the cancer translational research and drug discovery knowledgebase. Nucleic Acids Res 49:D1074\u2013D1082. https:\/\/doi.org\/10.1093\/nar\/gkaa1059","journal-title":"Nucleic Acids Res"},{"key":"606_CR2","doi-asserted-by":"publisher","DOI":"10.26434\/CHEMRXIV.12286877.V1","author":"C Mitsopoulos","year":"2020","unstructured":"Mitsopoulos C, Antolin AA, Fernandez EV et al (2020) Coronavirus canSAR\u2014a data-driven, AI-enabled. Drug Discov Resour Res Commun. https:\/\/doi.org\/10.26434\/CHEMRXIV.12286877.V1","journal-title":"Drug Discov Resour Res Commun"},{"issue":"8","key":"606_CR3","doi-asserted-by":"publisher","first-page":"536","DOI":"10.1038\/nchembio.1867","volume":"11","author":"CH Arrowsmith","year":"2015","unstructured":"Arrowsmith CH, Audia JE, Austin C et al (2015) The promise and peril of chemical probes. Nat Chem Biol 11(8):536\u2013541. https:\/\/doi.org\/10.1038\/nchembio.1867","journal-title":"Nat Chem Biol"},{"key":"606_CR4","doi-asserted-by":"publisher","first-page":"D930","DOI":"10.1093\/nar\/gky1075","volume":"47","author":"D Mendez","year":"2019","unstructured":"Mendez D, Gaulton A, Bento AP et al (2019) ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res 47:D930\u2013D940. https:\/\/doi.org\/10.1093\/nar\/gky1075","journal-title":"Nucleic Acids Res"},{"key":"606_CR5","doi-asserted-by":"publisher","first-page":"D1045","DOI":"10.1093\/nar\/gkv1072","volume":"44","author":"MK Gilson","year":"2016","unstructured":"Gilson MK, Liu T, Baitaluk M et al (2016) BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res 44:D1045\u2013D1053. https:\/\/doi.org\/10.1093\/nar\/gkv1072","journal-title":"Nucleic Acids Res"},{"key":"606_CR6","doi-asserted-by":"publisher","first-page":"D344","DOI":"10.1093\/NAR\/GKZ853","volume":"48","author":"M Varadi","year":"2020","unstructured":"Consortium Pdb-K, Varadi M, Berrisford J et al (2020) PDBe-KB: a community-driven resource for structural and functional annotations. Nucleic Acids Res 48:D344\u2013D353. https:\/\/doi.org\/10.1093\/NAR\/GKZ853","journal-title":"Nucleic Acids Res"},{"key":"606_CR7","doi-asserted-by":"publisher","first-page":"194","DOI":"10.1016\/j.chembiol.2017.11.004","volume":"25","author":"AA Antolin","year":"2018","unstructured":"Antolin AA, Tym JE, Komianou A et al (2018) Objective, quantitative, data-driven assessment of chemical probes. Cell Chem Biol 25:194-205.e5. https:\/\/doi.org\/10.1016\/j.chembiol.2017.11.004","journal-title":"Cell Chem Biol"},{"key":"606_CR8","doi-asserted-by":"publisher","first-page":"51","DOI":"10.1186\/s13321-020-00456-1","volume":"12","author":"AP Bento","year":"2020","unstructured":"Bento AP, Hersey A, F\u00e9lix E et al (2020) An open source chemical structure curation pipeline using RDKit. J Cheminformatics 12:51. https:\/\/doi.org\/10.1186\/s13321-020-00456-1","journal-title":"J Cheminformatics"},{"key":"606_CR9","doi-asserted-by":"publisher","first-page":"36","DOI":"10.1186\/s13321-018-0293-8","volume":"10","author":"VD H\u00e4hnke","year":"2018","unstructured":"H\u00e4hnke VD, Kim S, Bolton EE (2018) PubChem chemical structure standardization. J Cheminformatics 10:36. https:\/\/doi.org\/10.1186\/s13321-018-0293-8","journal-title":"J Cheminformatics"},{"key":"606_CR10","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-78246-9_38","author":"MR Berthold","year":"2008","unstructured":"Berthold MR, Cebron N, Dill F et al (2008) KNIME: the Konstanz information miner. Stud Classif Data Anal Knowl Organ. https:\/\/doi.org\/10.1007\/978-3-540-78246-9_38","journal-title":"Stud Classif Data Anal Knowl Organ"},{"key":"606_CR11","doi-asserted-by":"publisher","first-page":"17","DOI":"10.1016\/j.ddtec.2015.01.005","volume":"14","author":"A Hersey","year":"2015","unstructured":"Hersey A, Chambers J, Bellis L et al (2015) Chemical databases: curation or integration by user-defined equivalence? Drug Discov Today Technol 14:17\u201324. https:\/\/doi.org\/10.1016\/j.ddtec.2015.01.005","journal-title":"Drug Discov Today Technol"},{"key":"606_CR12","doi-asserted-by":"publisher","first-page":"35","DOI":"10.1186\/1758-2946-4-35","volume":"4","author":"SA Akhondi","year":"2012","unstructured":"Akhondi SA, Kors JA, Muresan S (2012) Consistency of systematic chemical identifiers within and between small-molecule databases. J Cheminformatics 4:35. https:\/\/doi.org\/10.1186\/1758-2946-4-35","journal-title":"J Cheminformatics"},{"key":"606_CR13","doi-asserted-by":"publisher","first-page":"557","DOI":"10.1002\/wcms.36","volume":"1","author":"WA Warr","year":"2011","unstructured":"Warr WA (2011) Representation of chemical structures. Wiley Interdiscip Rev Comput Mol Sci 1:557\u2013579. https:\/\/doi.org\/10.1002\/wcms.36","journal-title":"Wiley Interdiscip Rev Comput Mol Sci"},{"key":"606_CR14","doi-asserted-by":"publisher","first-page":"747","DOI":"10.1016\/j.drudis.2011.07.007","volume":"16","author":"AJ Williams","year":"2011","unstructured":"Williams AJ, Ekins S (2011) A quality alert and call for improved curation of public chemistry databases. Drug Discov Today 16:747\u2013750. https:\/\/doi.org\/10.1016\/j.drudis.2011.07.007","journal-title":"Drug Discov Today"},{"key":"606_CR15","doi-asserted-by":"publisher","first-page":"685","DOI":"10.1016\/j.drudis.2012.02.013","volume":"17","author":"AJ Williams","year":"2012","unstructured":"Williams AJ, Ekins S, Tkachenko V (2012) Towards a gold standard: regarding quality in public domain chemistry databases and approaches to improving the situation. Drug Discov Today 17:685\u2013701. https:\/\/doi.org\/10.1016\/j.drudis.2012.02.013","journal-title":"Drug Discov Today"},{"key":"606_CR16","doi-asserted-by":"publisher","first-page":"521","DOI":"10.1007\/s10822-010-9346-4","volume":"24","author":"M Sitzmann","year":"2010","unstructured":"Sitzmann M, Ihlenfeldt WD, Nicklaus MC (2010) Tautomerism in large databases. J Comput Aided Mol Des 24:521\u2013551. https:\/\/doi.org\/10.1007\/s10822-010-9346-4","journal-title":"J Comput Aided Mol Des"},{"key":"606_CR17","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1021\/ci00057a005","volume":"28","author":"D Weininger","year":"1988","unstructured":"Weininger D (1988) SMILES, a chemical language and information system: 1: introduction to methodology and encoding rules. J Chem Inf Comput Sci 28:31\u201336. https:\/\/doi.org\/10.1021\/ci00057a005","journal-title":"J Chem Inf Comput Sci"},{"key":"606_CR18","doi-asserted-by":"publisher","first-page":"97","DOI":"10.1021\/ci00062a008","volume":"29","author":"D Weininger","year":"1989","unstructured":"Weininger D, Weininger A, Weininger JL (1989) SMILES. 2. Algorithm for generation of unique SMILES notation. J Chem Inf Comput Sci 29:97\u2013101. https:\/\/doi.org\/10.1021\/ci00062a008","journal-title":"J Chem Inf Comput Sci"},{"key":"606_CR19","doi-asserted-by":"publisher","first-page":"7","DOI":"10.1186\/1758-2946-5-7","volume":"5","author":"S Heller","year":"2013","unstructured":"Heller S, McNaught A, Stein S et al (2013) InChI\u2014the worldwide chemical structure identifier standard. J Cheminformatics 5:7. https:\/\/doi.org\/10.1186\/1758-2946-5-7","journal-title":"J Cheminformatics"},{"key":"606_CR20","doi-asserted-by":"publisher","first-page":"23","DOI":"10.1186\/s13321-015-0068-4","volume":"7","author":"SR Heller","year":"2015","unstructured":"Heller SR, McNaught A, Pletnev I et al (2015) InChI, the IUPAC international chemical identifier. J Cheminformatics 7:23. https:\/\/doi.org\/10.1186\/s13321-015-0068-4","journal-title":"J Cheminformatics"},{"key":"606_CR21","unstructured":"Technical FAQ\u2014InChI Trust. https:\/\/www.inchi-trust.org\/technical-faq-2\/#2.6. Accessed 20 May 2021"},{"key":"606_CR22","unstructured":"KNIME Analytics Platform|KNIME. https:\/\/www.knime.com\/knime-analytics-platform. Accessed 28 Apr 2021"},{"key":"606_CR23","unstructured":"RDKit. http:\/\/www.rdkit.org\/. Accessed 15 Dec 2021"},{"key":"606_CR24","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1080\/10629360701843540","volume":"19","author":"M Sitzmann","year":"2008","unstructured":"Sitzmann M, Filippov IV, Nicklaus MC (2008) Internet resources integrating many small-molecule databases. SAR QSAR Environ Res 19:1\u20139. https:\/\/doi.org\/10.1080\/10629360701843540","journal-title":"SAR QSAR Environ Res"},{"key":"606_CR25","unstructured":"MolVS: Molecule Validation and Standardization\u2014MolVS 0.1.1 documentation. https:\/\/molvs.readthedocs.io\/en\/latest\/. Accessed 20 Mar 2022"},{"key":"606_CR26","doi-asserted-by":"publisher","first-page":"1253","DOI":"10.1021\/acs.jcim.9b01080","volume":"60","author":"DK Dhaked","year":"2020","unstructured":"Dhaked DK, Ihlenfeldt WD, Patel H et al (2020) Toward a comprehensive treatment of tautomerism in chemoinformatics including in InChI V2. J Chem Inf Model 60:1253\u20131275. https:\/\/doi.org\/10.1021\/acs.jcim.9b01080","journal-title":"J Chem Inf Model"},{"key":"606_CR27","doi-asserted-by":"publisher","first-page":"475","DOI":"10.1007\/s10822-010-9359-z","volume":"24","author":"AR Katritzky","year":"2010","unstructured":"Katritzky AR, Dennis Hall C, El-Gendy BEDM, Draghici B (2010) Tautomerism in drug discovery. J Comput Aided Mol Des 24:475\u2013484. https:\/\/doi.org\/10.1007\/s10822-010-9359-z","journal-title":"J Comput Aided Mol Des"},{"key":"606_CR28","doi-asserted-by":"publisher","first-page":"2149","DOI":"10.1021\/acs.jcim.6b00338","volume":"56","author":"L Guasch","year":"2016","unstructured":"Guasch L, Yapamudiyansel W, Peach ML et al (2016) Experimental and chemoinformatics study of tautomerism in a database of commercially available screening samples. J Chem Inf Model 56:2149\u20132161. https:\/\/doi.org\/10.1021\/acs.jcim.6b00338","journal-title":"J Chem Inf Model"},{"key":"606_CR29","unstructured":"MolVS: molecule validation and standardization\u2014MolVS 0.1.1 documentation. https:\/\/molvs.readthedocs.io\/en\/latest\/. Accessed 28 Apr 2021"},{"key":"606_CR30","unstructured":"rdkit.Chem.MolStandardize.rdMolStandardize module\u2014The RDKit 2021.03.1 documentation. https:\/\/www.rdkit.org\/docs\/source\/rdkit.Chem.MolStandardize.rdMolStandardize.html. Accessed 30 Jul 2021"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-022-00606-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-022-00606-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-022-00606-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,5,28]],"date-time":"2022-05-28T16:02:47Z","timestamp":1653753767000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-022-00606-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,5,28]]},"references-count":30,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,12]]}},"alternative-id":["606"],"URL":"https:\/\/doi.org\/10.1186\/s13321-022-00606-7","relation":{},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,5,28]]},"assertion":[{"value":"4 February 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 April 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 May 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"DD, EV-F, CK, MM, AAA and BA-L are\/were employees of The Institute of Cancer Research (ICR), which has a commercial interest in a range of drug targets. The ICR operates a Rewards to Inventors scheme whereby employees of the ICR may receive financial benefits following the commercial licensing of a project. BA-L is an employee of MD Anderson Cancer Center which also operates a Reward to Inventors Scheme. BA-L declares commercial interest in Exscientia and Astra Zeneca. BA-L is\/was a consultant\/ scientific advisory board member for GSK, Open Targets, Astex Pharmaceuticals, Astellas Pharma and is an ex-employee of Inpharmatica Ltd. AAA is\/was a consultant of Darwin Health. CK is an employee of Astra Zeneca. DD, EV-F, CK, AAA and BA-L have been instrumental in the creation\/development of canSAR and\/or Probe Miner. BA-L was instrumental in the creation of ChEMBL and is a Director of the non-profit Chemical Probes Portal.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"28"}}