{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,3]],"date-time":"2026-03-03T01:47:37Z","timestamp":1772502457452,"version":"3.50.1"},"reference-count":44,"publisher":"IOP Publishing","issue":"3","license":[{"start":{"date-parts":[[2023,8,8]],"date-time":"2023-08-08T00:00:00Z","timestamp":1691452800000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,8,8]],"date-time":"2023-08-08T00:00:00Z","timestamp":1691452800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/iopscience.iop.org\/info\/page\/text-and-data-mining"}],"funder":[{"DOI":"10.13039\/501100001711","name":"Schweizerischer Nationalfonds zur F\u00f6rderung der Wissenschaftlichen Forschung","doi-asserted-by":"crossref","award":["NCCR-Catalysis 180544"],"award-info":[{"award-number":["NCCR-Catalysis 180544"]}],"id":[{"id":"10.13039\/501100001711","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["iopscience.iop.org"],"crossmark-restriction":false},"short-container-title":["Mach. Learn.: Sci. Technol."],"published-print":{"date-parts":[[2023,9,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    With the growing amount of chemical data stored digitally, it has become crucial to represent chemical compounds accurately and consistently. Harmonized representations facilitate the extraction of insightful information from datasets, and are advantageous for machine learning applications. To achieve consistent representations throughout datasets, one relies on molecule standardization, which is typically accomplished using rule-based algorithms that modify descriptions of functional groups. Here, we present the first deep-learning model for molecular standardization. 
We enable custom standardization schemes based solely on data, which, as an additional benefit, support standardization options that are difficult to encode into rules. Our model achieves over 98% accuracy in learning two popular rule-based standardization protocols. We then follow a transfer learning approach to standardize metal-organic compounds (for which there is currently no automated standardization practice), based on a human-curated dataset of 1512 compounds. This model predicts the expected standardized molecular format with a test accuracy of 80.7%. As standardization can be considered, more broadly, a transformation from undesired to desired representations of compounds, the same data-driven architecture can be applied to other tasks. 
For instance, we demonstrate the application to compound canonicalization and to the determination of major tautomers in solution, based on computed and experimental data.\n                  <\/jats:p>","DOI":"10.1088\/2632-2153\/ace878","type":"journal-article","created":{"date-parts":[[2023,7,18]],"date-time":"2023-07-18T18:42:20Z","timestamp":1689705740000},"page":"035014","update-policy":"https:\/\/doi.org\/10.1088\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Standardizing chemical compounds with language models"],"prefix":"10.1088","volume":"4","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2192-5091","authenticated-orcid":false,"given":"Miruna T","family":"Cretu","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5218-8653","authenticated-orcid":false,"given":"Alessandra","family":"Toniato","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0403-4067","authenticated-orcid":false,"given":"Amol","family":"Thakkar","sequence":"additional","affiliation":[]},{"given":"Amin A","family":"Debabeche","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8717-0456","authenticated-orcid":false,"given":"Teodoro","family":"Laino","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7554-0288","authenticated-orcid":true,"given":"Alain C","family":"Vaucher","sequence":"additional","affiliation":[]}],"member":"266","published-online":{"date-parts":[[2023,8,8]]},"reference":[{"key":"mlstace878bib1","doi-asserted-by":"publisher","first-page":"5966","DOI":"10.1002\/chem.201605499","article-title":"Neural-symbolic machine learning for retrosynthesis and reaction prediction","volume":"23","author":"Segler","year":"2017","journal-title":"Eur. J. 
Chem."},{"key":"mlstace878bib2","doi-asserted-by":"publisher","first-page":"370","DOI":"10.1039\/C8SC04228D","article-title":"A graph-convolutional neural network model for the prediction of chemical reactivity","volume":"10","author":"Coley","year":"2019","journal-title":"Chem. Sci."},{"key":"mlstace878bib3","doi-asserted-by":"publisher","first-page":"1572","DOI":"10.1021\/acscentsci.9b00576","article-title":"Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction","volume":"5","author":"Schwaller","year":"2019","journal-title":"ACS Cent. Sci."},{"key":"mlstace878bib4","doi-asserted-by":"publisher","first-page":"604","DOI":"10.1038\/nature25978","article-title":"Planning chemical syntheses with deep neural networks and symbolic AI","volume":"555","author":"Segler","year":"2018","journal-title":"Nature"},{"key":"mlstace878bib5","doi-asserted-by":"publisher","first-page":"3316","DOI":"10.1039\/C9SC05704H","article-title":"Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy","volume":"11","author":"Schwaller","year":"2020","journal-title":"Chem. Sci."},{"key":"mlstace878bib6","doi-asserted-by":"publisher","first-page":"120","DOI":"10.1021\/acscentsci.7b00512","article-title":"Generating focused molecule libraries for drug discovery with recurrent neural networks","volume":"4","author":"Segler","year":"2018","journal-title":"ACS Cent. Sci."},{"key":"mlstace878bib7","article-title":"GT4SD: generative toolkit for scientific discovery","author":"Manica","year":"2022"},{"key":"mlstace878bib8","doi-asserted-by":"publisher","DOI":"10.1088\/2632-2153\/abc81d","article-title":"Prediction of chemical reaction yields using deep learning","volume":"2","author":"Schwaller","year":"2021","journal-title":"Mach. Learn.: Sci. 
Technol."},{"key":"mlstace878bib9","doi-asserted-by":"publisher","first-page":"29","DOI":"10.1016\/j.ddtec.2020.05.001","article-title":"Molecular property prediction: recent trends in the era of artificial intelligence","volume":"32-33","author":"Shen","year":"2019","journal-title":"Drug Discovery Today Technol."},{"key":"mlstace878bib10","article-title":"PubChem","author":""},{"key":"mlstace878bib11","doi-asserted-by":"publisher","first-page":"D1373","DOI":"10.1093\/nar\/gkac956","article-title":"PubChem 2023 update","volume":"51","author":"Kim","year":"2023","journal-title":"Nucleic Acids Res."},{"key":"mlstace878bib12","article-title":"ChEMBL","author":""},{"key":"mlstace878bib13","article-title":"ChEBI","author":""},{"key":"mlstace878bib14","doi-asserted-by":"publisher","first-page":"1189","DOI":"10.1021\/ci100176x","article-title":"Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research","volume":"50","author":"Fourches","year":"2010","journal-title":"J. Chem. Inf. Model."},{"key":"mlstace878bib15","article-title":"A guide to molecular standardization","author":"Apodaca"},{"key":"mlstace878bib16","doi-asserted-by":"publisher","first-page":"1368","DOI":"10.1039\/D2RE00008C","article-title":"The effect of chemical representation on active machine learning towards closed-loop optimization","volume":"7","author":"Pomberger","year":"2022","journal-title":"React. Chem. Eng."},{"key":"mlstace878bib17","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1021\/ci00057a005","article-title":"SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules","volume":"28","author":"Weininger","year":"1988","journal-title":"J. Chem. Inf. Comput. Sci."},{"key":"mlstace878bib18","doi-asserted-by":"publisher","first-page":"97","DOI":"10.1021\/ci00062a008","article-title":"SMILES 2. 
Algorithm for generation of unique SMILES notation","volume":"29","author":"Weininger","year":"1989","journal-title":"J. Chem. Inf. Comput. Sci."},{"key":"mlstace878bib19","doi-asserted-by":"publisher","first-page":"4166","DOI":"10.1126\/sciadv.abe4166","article-title":"Extraction of organic chemistry grammar from unsupervised learning of chemical reactions","volume":"7","author":"Schwaller","year":"2021","journal-title":"Sci. Adv."},{"key":"mlstace878bib20","doi-asserted-by":"publisher","first-page":"244","DOI":"10.1021\/ci00007a012","article-title":"Description of several chemical structure file formats used by computer programs developed at molecular design limited","volume":"32","author":"Dalby","year":"1992","journal-title":"J. Chem. Inf. Comput. Sci."},{"key":"mlstace878bib21","doi-asserted-by":"publisher","first-page":"56","DOI":"10.1186\/s13321-020-00460-5","article-title":"Molecular representations in AI-driven drug discovery: a review and practical guide","volume":"12","author":"David","year":"2020","journal-title":"J. Cheminform."},{"key":"mlstace878bib22","doi-asserted-by":"publisher","first-page":"36","DOI":"10.1186\/s13321-018-0293-8","article-title":"Pubchem chemical structure standardization","volume":"10","author":"Volker","year":"2018","journal-title":"J. Cheminform."},{"key":"mlstace878bib23","doi-asserted-by":"publisher","first-page":"51","DOI":"10.1186\/s13321-020-00456-1","article-title":"An open source chemical structure curation pipeline using RDKit","volume":"12","author":"Patr\u00edcia Bento","year":"2020","journal-title":"J. Cheminform."},{"key":"mlstace878bib24","doi-asserted-by":"publisher","DOI":"10.1002\/minf.202100119","article-title":"Reaction data curation I: chemical structures and transformations standardization","volume":"40","author":"Gimadiev","year":"2021","journal-title":"Mol. 
Inf."},{"key":"mlstace878bib25","doi-asserted-by":"publisher","first-page":"1742","DOI":"10.1021\/acs.jcim.8b00165","article-title":"Redesigning the materials and catalysts database construction process using ontologies","volume":"58","author":"Takahashi","year":"2018","journal-title":"J. Chem. Inf. Model."},{"key":"mlstace878bib26","doi-asserted-by":"publisher","first-page":"7482","DOI":"10.1021\/acs.jpclett.9b02976","article-title":"Visualizing scientists\u2019 cognitive representation of materials data through the application of ontology","volume":"10","author":"Takahashi","year":"2019","journal-title":"J. Phys. Chem."},{"key":"mlstace878bib27","doi-asserted-by":"publisher","first-page":"836","DOI":"10.1002\/cctc.202001132","article-title":"Open data in catalysis: from today\u2019s big picture to the future of small data","volume":"13","author":"Mendes","year":"2021","journal-title":"ChemCatChem"},{"key":"mlstace878bib28","doi-asserted-by":"publisher","first-page":"3223","DOI":"10.1002\/cctc.202001974","article-title":"A unified research data infrastructure for catalysis research \u2013 challenges and concepts","volume":"13","author":"Wulf","year":"2021","journal-title":"ChemCatChem"},{"key":"mlstace878bib29","doi-asserted-by":"publisher","first-page":"521","DOI":"10.1007\/s10822-010-9346-4","article-title":"Tautomerism in large databases","volume":"24","author":"Sitzmann","year":"2010","journal-title":"J. Comput.-Aided Mol. Des."},{"key":"mlstace878bib30","doi-asserted-by":"publisher","first-page":"2342","DOI":"10.1021\/ci060109b","article-title":"The impact of tautomer forms on pharmacophore-based virtual screening","volume":"46","author":"Oellien","year":"2006","journal-title":"J. Chem. Inf. 
Model."},{"key":"mlstace878bib31","doi-asserted-by":"publisher","first-page":"2742","DOI":"10.1021\/ci900364w","article-title":"The effect of ligand-based tautomer and protomer prediction on structure-based virtual screening","volume":"49","author":"Kalliokoski","year":"2009","journal-title":"J. Chem. Inf. Model."},{"key":"mlstace878bib32","doi-asserted-by":"publisher","first-page":"867","DOI":"10.1021\/ci200528d","article-title":"Recognizing pitfalls in virtual screening: a critical review","volume":"52","author":"Scior","year":"2012","journal-title":"J. Chem. Inf. Model."},{"key":"mlstace878bib33","article-title":"Attention is all you need","volume":"vol 30","author":"Vaswani","year":"2017"},{"key":"mlstace878bib34","article-title":"Pistachio","author":""},{"key":"mlstace878bib35","author":"Landrum"},{"key":"mlstace878bib36","doi-asserted-by":"publisher","first-page":"1085","DOI":"10.1021\/acs.jcim.0c00035","article-title":"Tautobase: an open tautomer database","volume":"60","author":"Wahl","year":"2020","journal-title":"J. Chem. Inf. Model."},{"key":"mlstace878bib37","doi-asserted-by":"publisher","first-page":"20","DOI":"10.1186\/s13321-015-0069-3","article-title":"Why is tanimoto index an appropriate choice for fingerprint-based similarity calculations?","volume":"7","author":"Bajusz","year":"2015","journal-title":"J. Cheminform."},{"key":"mlstace878bib38","doi-asserted-by":"publisher","first-page":"1695","DOI":"10.1038\/s41467-021-21895-w","article-title":"Quantitative interpretation explains machine learning models for chemical reaction prediction and uncovers bias","volume":"12","author":"Kov\u00e1cs","year":"2021","journal-title":"Nat. Commun."},{"key":"mlstace878bib39","doi-asserted-by":"publisher","first-page":"80","DOI":"10.3389\/fenvs.2015.00080","article-title":"Deeptox: toxicity prediction using deep learning","volume":"3","author":"Mayr","year":"2016","journal-title":"Front. Environ. 
Sci."},{"key":"mlstace878bib40","article-title":"PubChem Standardization Service","author":""},{"key":"mlstace878bib41","doi-asserted-by":"publisher","first-page":"8","DOI":"10.1186\/1758-2946-4-8","article-title":"Structure-based classification and ontology in chemistry","volume":"4","author":"Hastings","year":"2012","journal-title":"J. Cheminform."},{"key":"mlstace878bib42","article-title":"Quacpac C++Toolkit, version 1.9.0","author":""},{"key":"mlstace878bib43","doi-asserted-by":"publisher","first-page":"385","DOI":"10.1002\/anie.196603851","article-title":"Specification of molecular chirality","volume":"5","author":"Cahn","year":"1966","journal-title":"Angew. Chem., Int. Ed."},{"key":"mlstace878bib44","article-title":"OEChem C++Toolkit, version 1.9.0","author":""}],"container-title":["Machine Learning: Science and Technology"],"original-title":[],"link":[{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ace878","content-type":"text\/html","content-version":"am","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ace878\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ace878","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ace878\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ace878\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ace878\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ace878\/pdf","content-type":"ap
plication\/pdf","content-version":"am","intended-application":"similarity-checking"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ace878\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,8,8]],"date-time":"2023-08-08T08:05:30Z","timestamp":1691481930000},"score":1,"resource":{"primary":{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ace878"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8,8]]},"references-count":44,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2023,8,8]]},"published-print":{"date-parts":[[2023,9,1]]}},"URL":"https:\/\/doi.org\/10.1088\/2632-2153\/ace878","relation":{"has-preprint":[{"id-type":"doi","id":"10.26434\/chemrxiv-2022-14ztf-v2","asserted-by":"object"}]},"ISSN":["2632-2153"],"issn-type":[{"value":"2632-2153","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,8,8]]},"assertion":[{"value":"Standardizing chemical compounds with language models","name":"article_title","label":"Article Title"},{"value":"Machine Learning: Science and Technology","name":"journal_title","label":"Journal Title"},{"value":"paper","name":"article_type","label":"Article Type"},{"value":"\u00a9 2023 The Author(s). Published by IOP Publishing Ltd","name":"copyright_information","label":"Copyright Information"},{"value":"2023-03-20","name":"date_received","label":"Date Received","group":{"name":"publication_dates","label":"Publication dates"}},{"value":"2023-07-18","name":"date_accepted","label":"Date Accepted","group":{"name":"publication_dates","label":"Publication dates"}},{"value":"2023-08-08","name":"date_epub","label":"Online publication date","group":{"name":"publication_dates","label":"Publication dates"}}]}}