{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T04:30:06Z","timestamp":1772166606513,"version":"3.50.1"},"reference-count":58,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,5,29]],"date-time":"2023-05-29T00:00:00Z","timestamp":1685318400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,5,29]],"date-time":"2023-05-29T00:00:00Z","timestamp":1685318400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100003654","name":"Korea Environmental Industry and Technology Institute","doi-asserted-by":"publisher","award":["KEITI:2020002960002"],"award-info":[{"award-number":["KEITI:2020002960002"]}],"id":[{"id":"10.13039\/501100003654","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100003654","name":"Korea Environmental Industry and Technology Institute","doi-asserted-by":"publisher","award":["KEITI:2020002960002"],"award-info":[{"award-number":["KEITI:2020002960002"]}],"id":[{"id":"10.13039\/501100003654","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100002701","name":"Ministry of Education","doi-asserted-by":"publisher","award":["5120200513755"],"award-info":[{"award-number":["5120200513755"]}],"id":[{"id":"10.13039\/501100002701","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100003725","name":"National Research Foundation of Korea","doi-asserted-by":"publisher","award":["NRF-2020M3A9G7103933"],"award-info":[{"award-number":["NRF-2020M3A9G7103933"]}],"id":[{"id":"10.13039\/501100003725","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100003725","name":"National Research Foundation of 
Korea","doi-asserted-by":"publisher","award":["NRF-2020M3A9G7103933"],"award-info":[{"award-number":["NRF-2020M3A9G7103933"]}],"id":[{"id":"10.13039\/501100003725","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Tokenization is an important preprocessing step in natural language processing that may have a significant influence on prediction quality. This research showed that the traditional SMILES tokenization has a certain limitation that results in tokens failing to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme that eliminates ambiguities in the generic nature of SMILES tokens. Our results in multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method for generating higher-quality SMILES sequences from AI-based chemical models compared to other tokenization and representation schemes. We investigated the degrees of token degeneration of various schemes and analyzed their adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples were incorporated for qualitative examination. 
We believe that the atom-in-SMILES tokenization has great potential for adoption by a broad range of related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models.<\/jats:p>","DOI":"10.1186\/s13321-023-00725-9","type":"journal-article","created":{"date-parts":[[2023,5,29]],"date-time":"2023-05-29T12:02:03Z","timestamp":1685361723000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":41,"title":["Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization"],"prefix":"10.1186","volume":"15","author":[{"given":"Umit V.","family":"Ucak","sequence":"first","affiliation":[]},{"given":"Islambek","family":"Ashyrmamatov","sequence":"additional","affiliation":[]},{"given":"Juyong","family":"Lee","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,5,29]]},"reference":[{"key":"725_CR1","doi-asserted-by":"publisher","unstructured":"Domingo M, Garc\u00eda-Mart\u00ednez M, Helle A, et al (2018) How Much Does Tokenization Affect Neural Machine Translation? Arxiv. https:\/\/doi.org\/10.48550\/arxiv.1812.08621","DOI":"10.48550\/arxiv.1812.08621"},{"issue":"1","key":"725_CR2","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1021\/ci00057a005","volume":"28","author":"D Weininger","year":"1988","unstructured":"Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comp Sci 28(1):31\u201336. https:\/\/doi.org\/10.1021\/ci00057a005","journal-title":"J Chem Inf Comp Sci"},{"issue":"1","key":"725_CR3","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1021\/ar00109a003","volume":"18","author":"RFW Bader","year":"1985","unstructured":"Bader RFW (1985) Atoms in molecules. Acc Chem Res 18(1):9\u201315. 
https:\/\/doi.org\/10.1021\/ar00109a003","journal-title":"Acc Chem Res"},{"issue":"31","key":"725_CR4","doi-asserted-by":"publisher","first-page":"8108","DOI":"10.1002\/anie.201403708","volume":"53","author":"A Cadeddu","year":"2014","unstructured":"Cadeddu A, Wylie EK, Jurczak J, Wampler-Doty M, Grzybowski BA (2014) Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angew Chem Int Ed 53(31):8108\u20138112. https:\/\/doi.org\/10.1002\/anie.201403708","journal-title":"Angew Chem Int Ed"},{"key":"725_CR5","first-page":"164","volume":"30","author":"S Lesniewski","year":"1927","unstructured":"Lesniewski S (1927) O podstawach matematyki (on the foundations of mathematics). Przeglad filozoficzny 30:164\u2013206","journal-title":"Przeglad filozoficzny"},{"issue":"3","key":"725_CR6","doi-asserted-by":"publisher","first-page":"259","DOI":"10.1016\/S0169-023X(96)00017-1","volume":"20","author":"AC Varzi","year":"1996","unstructured":"Varzi AC (1996) Parts, wholes, and part-whole relations: the prospects of mereotopology. Data Knowl Eng 20(3):259\u2013286. https:\/\/doi.org\/10.1016\/S0169-023X(96)00017-1","journal-title":"Data Knowl Eng"},{"key":"725_CR7","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1905.09139","author":"G Borb\u00e9ly","year":"2019","unstructured":"Borb\u00e9ly G, Kornai A (2019) Sentence Length. arXiv Preprint. https:\/\/doi.org\/10.48550\/arXiv.1905.09139","journal-title":"arXiv Preprint"},{"key":"725_CR8","doi-asserted-by":"publisher","unstructured":"Bojar O, Chatterjee R, Federmann C, Graham Y, Haddow B, Huck M, Jimeno\u00a0Yepes A, Koehn P, Logacheva V, Monz C, Negri M, N\u00e9v\u00e9ol A, Neves M, Popel M, Post M, Rubino R, Scarton C, Specia L, Turchi M, Verspoor K, Zampieri M (2016) Findings of the 2016 conference on machine translation. In: proceedings of the first conference on machine translation: volume 2, shared task papers, pp. 131\u2013198. 
Association for Computational Linguistics, Berlin, Germany . https:\/\/doi.org\/10.18653\/v1\/W16-2301","DOI":"10.18653\/v1\/W16-2301"},{"key":"725_CR9","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1904.09751","author":"A Holtzman","year":"2019","unstructured":"Holtzman A, Buys J, Du L, Forbes M, Choi Y (2019) The curious case of neural text degeneration. arXiv Preprint. https:\/\/doi.org\/10.48550\/arXiv.1904.09751","journal-title":"arXiv Preprint"},{"key":"725_CR10","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1908.04319","author":"S Welleck","year":"2019","unstructured":"Welleck S, Kulikov I, Roller S, Dinan E, Cho K, Weston J (2019) Neural text generation with unlikelihood training. arXiv Preprint. https:\/\/doi.org\/10.48550\/arXiv.1908.04319","journal-title":"arXiv Preprint"},{"issue":"1","key":"725_CR11","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-019-0393-0","volume":"11","author":"J Ar\u00fas-Pous","year":"2019","unstructured":"Ar\u00fas-Pous J, Johansson SV, Prykhodko O, Bjerrum EJ, Tyrchan C, Reymond JL, Chen H, Engkvist O (2019) Randomized SMILES strings improve the quality of molecular generative models. J Cheminform 11(1):1\u201313. https:\/\/doi.org\/10.1186\/s13321-019-0393-0","journal-title":"J Cheminform"},{"issue":"9","key":"725_CR12","doi-asserted-by":"publisher","first-page":"1523","DOI":"10.1021\/acscentsci.9b00476","volume":"5","author":"T-S Lin","year":"2019","unstructured":"Lin T-S, Coley CW, Mochigase H, Beech HK, Wang W, Wang Z, Woods E, Craig SL, Johnson JA, Kalow JA, Jensen KF, Olsen BD (2019) Bigsmiles: a structurally-based line notation for describing macromolecules. ACS Cent Sci 5(9):1523\u20131531. 
https:\/\/doi.org\/10.1021\/acscentsci.9b00476","journal-title":"ACS Cent Sci"},{"issue":"1","key":"725_CR13","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/1758-2946-3-1","volume":"3","author":"A Drefahl","year":"2011","unstructured":"Drefahl A (2011) CurlySMILES: a chemical language to customize and annotate encodings of molecular and nanodevice structures. J Cheminform 3(1):1\u20137. https:\/\/doi.org\/10.1186\/1758-2946-3-1","journal-title":"J Cheminform"},{"key":"725_CR14","unstructured":"ChemAxon Extended SMILES and SMARTS - CXSMILES and CXSMARTS - Documentation. https:\/\/docs.chemaxon.com\/display\/docs\/chemaxon-smiles-extensions.md. Accessed: 10 Feb 2022"},{"key":"725_CR15","unstructured":"OpenSMILES. Home page http:\/\/opensmiles.org. Accessed: 10 Dec 2021"},{"key":"725_CR16","doi-asserted-by":"publisher","DOI":"10.26434\/chemrxiv.7097960.v1","author":"NM O\u2019Boyle","year":"2018","unstructured":"O\u2019Boyle NM, Dalke A (2018) DeepSMILES: an adaptation of SMILES for use in machine-learning of chemical structures. ChemRxiv. https:\/\/doi.org\/10.26434\/chemrxiv.7097960.v1","journal-title":"ChemRxiv"},{"issue":"4","key":"725_CR17","doi-asserted-by":"publisher","DOI":"10.1088\/2632-2153\/aba947","volume":"1","author":"M Krenn","year":"2020","unstructured":"Krenn M, H\u00e4se F, Nigam A, Friederich P, Aspuru-Guzik A (2020) Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Mach Learn Sci Technol 1(4):045024. https:\/\/doi.org\/10.1088\/2632-2153\/aba947","journal-title":"Mach Learn Sci Technol"},{"issue":"9","key":"725_CR18","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/1758-2946-4-22","volume":"4","author":"NM O\u2019Boyle","year":"2012","unstructured":"O\u2019Boyle NM (2012) Towards a universal SMILES representation - a standard method to generate canonical SMILES based on the InChI. J Cheminform 4(9):1\u201314. 
https:\/\/doi.org\/10.1186\/1758-2946-4-22","journal-title":"J Cheminform"},{"issue":"10","key":"725_CR19","doi-asserted-by":"publisher","first-page":"2111","DOI":"10.1021\/acs.jcim.5b00543","volume":"55","author":"N Schneider","year":"2015","unstructured":"Schneider N, Sayle RA, Landrum GA (2015) Get your atoms in order-an open-source implementation of a novel and robust molecular canonicalization algorithm. J Chem Inf Model 55(10):2111\u20132120. https:\/\/doi.org\/10.1021\/acs.jcim.5b00543","journal-title":"J Chem Inf Model"},{"issue":"1","key":"725_CR20","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-015-0076-4","volume":"7","author":"VD H\u00e4hnke","year":"2015","unstructured":"H\u00e4hnke VD, Bolton EE, Bryant SH (2015) PubChem atom environments. J Cheminform 7(1):1\u201337. https:\/\/doi.org\/10.1186\/s13321-015-0076-4","journal-title":"J Cheminform"},{"issue":"4","key":"725_CR21","doi-asserted-by":"publisher","first-page":"1560","DOI":"10.1021\/acs.jcim.0c01127","volume":"61","author":"X Li","year":"2021","unstructured":"Li X, Fourches D (2021) SMILES pair encoding: a data-driven substructure tokenization algorithm for deep learning. J Chem Inf Model 61(4):1560\u20131569. https:\/\/doi.org\/10.1021\/acs.jcim.0c01127","journal-title":"J Chem Inf Model"},{"key":"725_CR22","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1810.04805","author":"J Devlin","year":"2018","unstructured":"Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv. https:\/\/doi.org\/10.48550\/arXiv.1810.04805","journal-title":"arXiv"},{"key":"725_CR23","unstructured":"Radford A, Wu J, Child R, Luan D, Amodei D & Sutskever I (2019) Language Models are Unsupervised Multitask Learners. OpenAI. 
https:\/\/www.openai.com\/blog\/better-language-models\/"},{"key":"725_CR24","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1901.07291","author":"G Lample","year":"2019","unstructured":"Lample G, Conneau A (2019) Cross-lingual language model pretraining. arXiv. https:\/\/doi.org\/10.48550\/arXiv.1901.07291","journal-title":"arXiv"},{"issue":"1","key":"725_CR25","doi-asserted-by":"publisher","first-page":"23","DOI":"10.1186\/s13321-018-0279-6","volume":"10","author":"M Quir\u00f3s","year":"2018","unstructured":"Quir\u00f3s M, Gra\u1e91ulis S, Girdzijauskait\u0117 S, Merkys A, Vaitkus A (2018) Using SMILES strings for the description of chemical connectivity in the crystallography open database. J Cheminform 10(1):23. https:\/\/doi.org\/10.1186\/s13321-018-0279-6","journal-title":"J Cheminform"},{"key":"725_CR26","doi-asserted-by":"publisher","DOI":"10.1021\/ci900161g","author":"K Hansen","year":"2009","unstructured":"Hansen K, Mika S, Schroeter T, Sutter A, ter Laak A, Steger-Hartmann T, Heinrich N, M\u00fcller K (2009) Benchmark data set for in silico prediction of ames mutagenicity. J Chem Inform Model. https:\/\/doi.org\/10.1021\/ci900161g","journal-title":"J Chem Inform Model"},{"issue":"563","key":"725_CR27","doi-asserted-by":"publisher","first-page":"2964","DOI":"10.1126\/scisignal.aaw2964","volume":"12","author":"VB O\u2019Donnell","year":"2019","unstructured":"O\u2019Donnell VB, Dennis EA, Wakelam MJO, Subramaniam S (2019) LIPID MAPS: serving the next generation of lipid researchers with tools, resources, data, and training. Sci Signal 12(563):2964. https:\/\/doi.org\/10.1126\/scisignal.aaw2964","journal-title":"Sci Signal"},{"issue":"4","key":"725_CR28","doi-asserted-by":"publisher","first-page":"62839","DOI":"10.1371\/journal.pone.0062839","volume":"8","author":"J Gu","year":"2013","unstructured":"Gu J, Gui Y, Chen L, Yuan G, Lu H-Z, Xu X (2013) Use of natural products as chemical library for drug discovery and network pharmacology. 
PLoS ONE 8(4):62839. https:\/\/doi.org\/10.1371\/journal.pone.0062839","journal-title":"PLoS ONE"},{"issue":"suppl\u20131","key":"725_CR29","doi-asserted-by":"publisher","first-page":"1035","DOI":"10.1093\/nar\/gkq1126","volume":"39","author":"C Knox","year":"2011","unstructured":"Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, Djoumbou Y, Eisner R, Guo AC, Wishart DS (2011) DrugBank 3.0: a comprehensive resource for \u2018Omics\u2019 research on drugs. Nucleic Acids Res 39(suppl\u20131):1035\u20131041. https:\/\/doi.org\/10.1093\/nar\/gkq1126","journal-title":"Nucleic Acids Res"},{"key":"725_CR30","first-page":"5999","volume":"2017\u2013Decem","author":"A Vaswani","year":"2017","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser \u0141, Polosukhin I (2017) Attention is all you need. Adv. Neural Inf. Process Syst. 2017\u2013Decem:5999\u20136009","journal-title":"Adv. Neural Inf. Process Syst."},{"key":"725_CR31","unstructured":"Bahdanau D, Cho KH, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. 3rd Int. Conf. Learn. Represent. ICLR 2015 - Conf. Track Proc., 1\u201315"},{"issue":"25","key":"725_CR32","doi-asserted-by":"publisher","first-page":"8732","DOI":"10.1021\/ja902302h","volume":"131","author":"LC Blum","year":"2009","unstructured":"Blum LC, Reymond J-L (2009) 970 Million druglike small molecules for virtual screening in the chemical universe database GDB-13. J Am Chem Soc 131(25):8732\u20138733. https:\/\/doi.org\/10.1021\/ja902302h","journal-title":"J Am Chem Soc"},{"issue":"7","key":"725_CR33","doi-asserted-by":"publisher","first-page":"637","DOI":"10.1007\/s10822-011-9436-y","volume":"25","author":"LC Blum","year":"2011","unstructured":"Blum LC, Deursen Rv, Reymond J-L (2011) Visualisation and subsets of the chemical universe database GDB-13 for virtual screening. J Comput Aided Mol Des 25(7):637\u2013647. 
https:\/\/doi.org\/10.1007\/s10822-011-9436-y","journal-title":"J Comput Aided Mol Des"},{"key":"725_CR34","unstructured":"GDB-13 Database. Home page https:\/\/gdb.unibe.ch\/downloads\/. Accessed: 02 Nov 2022"},{"issue":"17","key":"725_CR35","doi-asserted-by":"publisher","DOI":"10.1063\/1.4965818","volume":"145","author":"UV Ucak","year":"2016","unstructured":"Ucak UV, Ji H, Singh Y, Jung Y (2016) A soft damping function for dispersion corrections with less overfitting. J. Chem. Phys. 145(17):174104. https:\/\/doi.org\/10.1063\/1.4965818","journal-title":"J. Chem. Phys."},{"issue":"25","key":"725_CR36","doi-asserted-by":"publisher","first-page":"5966","DOI":"10.1002\/chem.201605499","volume":"23","author":"MHS Segler","year":"2017","unstructured":"Segler MHS, Waller MP (2017) Neural-symbolic machine learning for retrosynthesis and reaction prediction. Eur J Chem 23(25):5966\u20135971. https:\/\/doi.org\/10.1002\/chem.201605499","journal-title":"Eur J Chem"},{"key":"725_CR37","first-page":"2608","volume":"2017\u2013Decem","author":"W Jin","year":"2017","unstructured":"Jin W, Coley CW, Barzilay R, Jaakkola T (2017) Predicting organic reaction outcomes with weisfeiler-lehman network. Adv Neural Inf Process Syst 2017\u2013Decem:2608\u20132617","journal-title":"Adv Neural Inf Process Syst"},{"issue":"5","key":"725_CR38","doi-asserted-by":"publisher","first-page":"1281","DOI":"10.1021\/acs.accounts.8b00087","volume":"51","author":"CW Coley","year":"2018","unstructured":"Coley CW, Green WH, Jensen KF (2018) Machine learning in computer-aided synthesis planning. Acc Chem Res 51(5):1281\u20131289. 
https:\/\/doi.org\/10.1021\/acs.accounts.8b00087","journal-title":"Acc Chem Res"},{"issue":"10","key":"725_CR39","doi-asserted-by":"publisher","first-page":"1103","DOI":"10.1021\/acscentsci.7b00303","volume":"3","author":"B Liu","year":"2017","unstructured":"Liu B, Ramsundar B, Kawthekar P, Shi J, Gomes J, Luu Nguyen Q, Ho S, Sloane J, Wender P, Pande V (2017) Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Cent Sci 3(10):1103\u20131113. https:\/\/doi.org\/10.1021\/acscentsci.7b00303","journal-title":"ACS Cent Sci"},{"key":"725_CR40","doi-asserted-by":"crossref","unstructured":"Karpov P, Godin G, Tetko IV (2019) A transformer model for retrosynthesis. In: artificial neural networks and machine learning \u2013 ICANN 2019: workshop and special sessions, pp. 817\u2013830. Springer, Cham","DOI":"10.1007\/978-3-030-30493-5_78"},{"issue":"1","key":"725_CR41","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41467-020-19266-y","volume":"11","author":"IV Tetko","year":"2020","unstructured":"Tetko IV, Karpov P, Van Deursen R, Godin G (2020) State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis. Nat Commun 11(1):1\u201311. https:\/\/doi.org\/10.1038\/s41467-020-19266-y","journal-title":"Nat Commun"},{"issue":"12","key":"725_CR42","doi-asserted-by":"publisher","first-page":"3316","DOI":"10.1039\/c9sc05704h","volume":"11","author":"P Schwaller","year":"2020","unstructured":"Schwaller P, Petraglia R, Zullo V, Nair VH, Haeuselmann RA, Pisoni R, Bekas C, Iuliano A, Laino T (2020) Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chem Sci 11(12):3316\u20133325. 
https:\/\/doi.org\/10.1039\/c9sc05704h","journal-title":"Chem Sci"},{"issue":"1","key":"725_CR43","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-020-00482-z","volume":"13","author":"UV Ucak","year":"2021","unstructured":"Ucak UV, Kang T, Ko J, Lee J (2021) Substructure-based neural machine translation for retrosynthetic prediction. J Cheminform 13(1):1\u201315. https:\/\/doi.org\/10.1186\/s13321-020-00482-z","journal-title":"J Cheminform"},{"issue":"1","key":"725_CR44","doi-asserted-by":"publisher","first-page":"1186","DOI":"10.1038\/s41467-022-28857-w","volume":"13","author":"UV Ucak","year":"2022","unstructured":"Ucak UV, Ashyrmamatov I, Ko J, Lee J (2022) Retrosynthetic reaction pathway prediction through neural machine translation of atomic environments. Nat Commun 13(1):1186. https:\/\/doi.org\/10.1038\/s41467-022-28857-w","journal-title":"Nat Commun"},{"issue":"20","key":"725_CR45","doi-asserted-by":"publisher","first-page":"5904","DOI":"10.1002\/anie.201506101","volume":"55","author":"S Szymku\u0107","year":"2016","unstructured":"Szymku\u0107 S, Gajewska EP, Klucznik T, Molga K, Dittwald P, Startek M, Bajczyk M, Grzybowski BA (2016) Computer-assisted synthetic planning: the end of the beginning. Angew Chem Int Ed 55(20):5904\u20135937. https:\/\/doi.org\/10.1002\/anie.201506101","journal-title":"Angew Chem Int Ed"},{"issue":"5","key":"725_CR46","doi-asserted-by":"publisher","first-page":"434","DOI":"10.1021\/acscentsci.7b00064","volume":"3","author":"CW Coley","year":"2017","unstructured":"Coley CW, Barzilay R, Jaakkola TS, Green WH, Jensen KF (2017) Prediction of organic reaction outcomes using machine learning. ACS Cent Sci 3(5):434\u2013443. 
https:\/\/doi.org\/10.1021\/acscentsci.7b00064","journal-title":"ACS Cent Sci"},{"issue":"3","key":"725_CR47","doi-asserted-by":"publisher","first-page":"593","DOI":"10.1021\/ci800228y","volume":"49","author":"J Law","year":"2009","unstructured":"Law J, Zsoldos Z, Simon A, Reid D, Liu Y, Khew SY, Johnson AP, Major S, Wade RA, Ando HY (2009) Route designer: a retrosynthetic analysis tool utilizing automated retrosynthetic rule generation. J Chem Inf Model 49(3):593\u2013602. https:\/\/doi.org\/10.1021\/ci800228y","journal-title":"J Chem Inf Model"},{"key":"725_CR48","doi-asserted-by":"publisher","unstructured":"Lowe DM (2012) Extraction of chemical structures and reactions from the literature. PhD thesis, University of Cambridge. https:\/\/doi.org\/10.17863\/CAM.16293","DOI":"10.17863\/CAM.16293"},{"key":"725_CR49","doi-asserted-by":"publisher","DOI":"10.6084\/m9.figshare.5104873.v1","author":"D Lowe","year":"2017","unstructured":"Lowe D (2017) Chemical reactions from US patents (1976-Sep2016). Figshare. https:\/\/doi.org\/10.6084\/m9.figshare.5104873.v1","journal-title":"Figshare"},{"issue":"5","key":"725_CR50","doi-asserted-by":"publisher","first-page":"742","DOI":"10.1021\/ci100050t","volume":"50","author":"D Rogers","year":"2010","unstructured":"Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742\u2013754. https:\/\/doi.org\/10.1021\/ci100050t","journal-title":"J Chem Inf Model"},{"issue":"2","key":"725_CR51","doi-asserted-by":"publisher","first-page":"84","DOI":"10.1039\/d1dd00013f","volume":"1","author":"K Rajan","year":"2022","unstructured":"Rajan K, Steinbeck C, Zielesny A (2022) Performance of chemical structure string representations for chemical image recognition using transformers. Digit Discov 1(2):84\u201390. https:\/\/doi.org\/10.1039\/d1dd00013f","journal-title":"Digit Discov"},{"key":"725_CR52","doi-asserted-by":"crossref","unstructured":"Nair P, Singh AK (2021) On reducing repetition in abstractive summarization. 
In: proceedings of the student research workshop associated with RANLP 2021, pp. 126\u2013134. INCOMA Ltd., Online. Accessed 17 Apr 2023. https:\/\/aclanthology.org\/2021.ranlp-srw.18","DOI":"10.26615\/issn.2603-2821.2021_018"},{"key":"725_CR53","doi-asserted-by":"publisher","unstructured":"Jawahar G, Abdul-Mageed M, Lakshmanan LVS (2020) Automatic detection of machine generated text: A critical survey. In: proceedings of the 28th international conference on computational linguistics, pp. 2296\u20132309. International Committee on Computational Linguistics, Barcelona, Spain (Online). Accessed 17 Apr 2023. https:\/\/doi.org\/10.18653\/v1\/2020.coling-main.208, https:\/\/aclanthology.org\/2020.coling-main.208","DOI":"10.18653\/v1\/2020.coling-main.208"},{"key":"725_CR54","doi-asserted-by":"publisher","DOI":"10.1101\/2022.03.09.483666","author":"N Ferruz","year":"2022","unstructured":"Ferruz N, Schmidt S, H\u00f6cker B (2022) A deep unsupervised language model for protein design. BioRxiv. https:\/\/doi.org\/10.1101\/2022.03.09.483666","journal-title":"BioRxiv"},{"key":"725_CR55","doi-asserted-by":"publisher","DOI":"10.48550\/arxiv.2204.11817","author":"C Edwards","year":"2022","unstructured":"Edwards C, Lai T, Ros K, Honke G, Cho K, Ji H (2022) Translation between molecules and natural language. arXiv. https:\/\/doi.org\/10.48550\/arxiv.2204.11817","journal-title":"arXiv"},{"key":"725_CR56","doi-asserted-by":"publisher","DOI":"10.48550\/arxiv.2012.14660","author":"Z Fu","year":"2020","unstructured":"Fu Z, Lam W, So AM-C, Shi B (2020) A theoretical analysis of the repetition problem in text generation. arXiv. 
https:\/\/doi.org\/10.48550\/arxiv.2012.14660","journal-title":"arXiv"},{"issue":"12","key":"725_CR57","doi-asserted-by":"publisher","first-page":"3355","DOI":"10.1039\/c9sc03666k","volume":"11","author":"K Lin","year":"2020","unstructured":"Lin K, Xu Y, Pei J, Lai L (2020) Automatic retrosynthetic route planning using template-free models. Chem Sci 11(12):3355\u20133364. https:\/\/doi.org\/10.1039\/c9sc03666k","journal-title":"Chem Sci"},{"key":"725_CR58","doi-asserted-by":"publisher","first-page":"513","DOI":"10.1039\/C7SC02664A","volume":"9","author":"Z Wu","year":"2018","unstructured":"Wu Z, Ramsundar B, Feinberg E, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V (2018) Moleculenet: a benchmark for molecular machine learning. Chem Sci 9:513\u2013530. https:\/\/doi.org\/10.1039\/C7SC02664A","journal-title":"Chem Sci"}],"updated-by":[{"DOI":"10.1186\/s13321-023-00740-w","type":"correction","label":"Correction","source":"publisher","updated":{"date-parts":[[2023,7,31]],"date-time":"2023-07-31T00:00:00Z","timestamp":1690761600000}}],"container-title":["Journal of 
Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00725-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-023-00725-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00725-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,7,31]],"date-time":"2023-07-31T12:07:16Z","timestamp":1690805236000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-023-00725-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,29]]},"references-count":58,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["725"],"URL":"https:\/\/doi.org\/10.1186\/s13321-023-00725-9","relation":{"has-preprint":[{"id-type":"doi","id":"10.26434\/chemrxiv-2022-9xx75","asserted-by":"object"}]},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,5,29]]},"assertion":[{"value":"16 January 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 May 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 May 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"31 July 2023","order":4,"name":"change_date","label":"Change Date","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"Correction","order":5,"name":"change_type","label":"Change Type","group":{"name":"ArticleHistory","label":"Article 
History"}},{"value":"A Correction to this paper has been published:","order":6,"name":"change_details","label":"Change Details","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"https:\/\/doi.org\/10.1186\/s13321-023-00740-w","URL":"https:\/\/doi.org\/10.1186\/s13321-023-00740-w","order":7,"name":"change_details","label":"Change Details","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"55"}}