{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T04:28:53Z","timestamp":1772166533009,"version":"3.50.1"},"reference-count":46,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,4,12]],"date-time":"2023-04-12T00:00:00Z","timestamp":1681257600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,4,12]],"date-time":"2023-04-12T00:00:00Z","timestamp":1681257600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001691","name":"Japan Society for the Promotion of Science","doi-asserted-by":"publisher","award":["21K06663"],"award-info":[{"award-number":["21K06663"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100009619","name":"Japan Agency for Medical Research and Development","doi-asserted-by":"publisher","award":["JP22mk0101250h"],"award-info":[{"award-number":["JP22mk0101250h"]}],"id":[{"id":"10.13039\/100009619","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Descriptor generation methods using latent representations of encoder\u2013decoder (ED) models with SMILES as input are useful because of the continuity of descriptor and restorability to the structure. However, it is not clear how the structure is recognized in the learning progress of ED models. In this work, we created ED models of various learning progress and investigated the relationship between structural information and learning progress. 
We showed that compound substructures were learned early in ED models by monitoring the accuracy of downstream tasks and input\u2013output substructure similarity using substructure-based descriptors, which suggests that existing evaluation methods based on the accuracy of downstream tasks may not be sensitive enough to evaluate the performance of ED models with SMILES as descriptor generation methods. On the other hand, we showed that structure restoration was time-consuming, and in particular, insufficient learning led to the estimation of a larger structure than the actual one. It can be inferred that determining the endpoint of the structure is a difficult task for the model. To our knowledge, this is the first study to link the learning progress of SMILES by ED model to chemical structures for a wide range of chemicals.<\/jats:p>\n                  <jats:p>\n                    <jats:bold>Graphical Abstract<\/jats:bold>\n                  <\/jats:p>","DOI":"10.1186\/s13321-023-00713-z","type":"journal-article","created":{"date-parts":[[2023,4,12]],"date-time":"2023-04-12T10:04:15Z","timestamp":1681293855000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["Investigation of chemical structure recognition by encoder\u2013decoder models in learning progress"],"prefix":"10.1186","volume":"15","author":[{"given":"Shumpei","family":"Nemoto","sequence":"first","affiliation":[]},{"given":"Tadahaya","family":"Mizuno","sequence":"additional","affiliation":[]},{"given":"Hiroyuki","family":"Kusuhara","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,4,12]]},"reference":[{"key":"713_CR1","doi-asserted-by":"publisher","first-page":"2338","DOI":"10.1093\/bioinformatics\/btw168","volume":"32","author":"Z Wang","year":"2016","unstructured":"Wang Z, Clark NR, Ma\u2019ayan A (2016) Drug-induced adverse events prediction with the LINCS L1000 data. 
Bioinformatics 32:2338\u20132345. https:\/\/doi.org\/10.1093\/bioinformatics\/btw168","journal-title":"Bioinformatics"},{"key":"713_CR2","doi-asserted-by":"publisher","first-page":"1199","DOI":"10.1021\/tx400110f","volume":"26","author":"Y Low","year":"2013","unstructured":"Low Y, Sedykh A, Fourches D et al (2013) Integrative chemical-biological read-across approach for chemical hazard classification. Chem Res Toxicol 26:1199\u20131208. https:\/\/doi.org\/10.1021\/tx400110f","journal-title":"Chem Res Toxicol"},{"key":"713_CR3","doi-asserted-by":"publisher","first-page":"1283","DOI":"10.1021\/acs.jnatprod.0c01381","volume":"84","author":"S Nemoto","year":"2021","unstructured":"Nemoto S, Morita K, Mizuno T, Kusuhara H (2021) Decomposition profile data analysis for deep understanding of multiple effects of natural products. J Nat Prod 84:1283\u20131293. https:\/\/doi.org\/10.1021\/acs.jnatprod.0c01381","journal-title":"J Nat Prod"},{"key":"713_CR4","doi-asserted-by":"publisher","first-page":"8705","DOI":"10.1021\/acs.jmedchem.0c00385","volume":"63","author":"KV Chuang","year":"2020","unstructured":"Chuang KV, Gunsalus LM, Keiser MJ (2020) Learning molecular representations for medicinal chemistry. J Med Chem 63:8705\u20138722. https:\/\/doi.org\/10.1021\/acs.jmedchem.0c00385","journal-title":"J Med Chem"},{"key":"713_CR5","doi-asserted-by":"publisher","first-page":"4538","DOI":"10.1016\/j.csbj.2021.08.011","volume":"19","author":"P Carracedo-Reboredo","year":"2021","unstructured":"Carracedo-Reboredo P, Li\u00f1ares-Blanco J, Rodr\u00edguez-Fern\u00e1ndez N et al (2021) A review on machine learning approaches and trends in drug discovery. Comput Struct Biotechnol J 19:4538\u20134558. 
https:\/\/doi.org\/10.1016\/j.csbj.2021.08.011","journal-title":"Comput Struct Biotechnol J"},{"key":"713_CR6","doi-asserted-by":"publisher","first-page":"268","DOI":"10.1021\/acscentsci.7b00572","volume":"4","author":"R G\u00f3mez-Bombarelli","year":"2018","unstructured":"G\u00f3mez-Bombarelli R, Wei JN, Duvenaud D et al (2018) Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent Sci 4:268\u2013276. https:\/\/doi.org\/10.1021\/acscentsci.7b00572","journal-title":"ACS Cent Sci"},{"key":"713_CR7","doi-asserted-by":"crossref","unstructured":"Bowman SR, Vilnis L, Vinyals O, et al (2015) Generating sentences from a continuous space. arXiv:1511.06349","DOI":"10.18653\/v1\/K16-1002"},{"key":"713_CR8","unstructured":"Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Proceedings of the 27th international conference on neural information processing systems, pp 3104\u20133112"},{"key":"713_CR9","unstructured":"Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv:1409.0473"},{"key":"713_CR10","doi-asserted-by":"publisher","first-page":"1692","DOI":"10.1039\/c8sc04175j","volume":"10","author":"R Winter","year":"2019","unstructured":"Winter R, Montanari F, No\u00e9 F, Clevert DA (2019) Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem Sci 10:1692\u20131701. https:\/\/doi.org\/10.1039\/c8sc04175j","journal-title":"Chem Sci"},{"key":"713_CR11","doi-asserted-by":"crossref","unstructured":"Cho K, van Merrienboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: encoder\u2013decoder approaches. 
In: Proceedings of SSST-8, eighth workshop on syntax, semantics and structure in statistical translation, pp 103\u2013111","DOI":"10.3115\/v1\/W14-4012"},{"key":"713_CR12","unstructured":"Kalchbrenner N, Blunsom P (2013) Recurrent continuous translation models. EMNLP 2013\u20142013 Conf Empir Methods Nat Lang Process Proc Conf, pp 1700\u20131709"},{"key":"713_CR13","doi-asserted-by":"crossref","unstructured":"Harel S, Radinsky K (2018) Accelerating prototype-based drug discovery using conditional diversity networks. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. ACM, New York, NY, USA, pp 331\u2013339","DOI":"10.1145\/3219819.3219882"},{"key":"713_CR14","doi-asserted-by":"publisher","first-page":"26","DOI":"10.1186\/s13321-021-00497-0","volume":"13","author":"J He","year":"2021","unstructured":"He J, You H, Sandstr\u00f6m E et al (2021) Molecular optimization by capturing chemist\u2019s intuition using deep neural networks. J Cheminform 13:26. https:\/\/doi.org\/10.1186\/s13321-021-00497-0","journal-title":"J Cheminform"},{"key":"713_CR15","doi-asserted-by":"publisher","first-page":"1700111","DOI":"10.1002\/minf.201700111","volume":"37","author":"A Gupta","year":"2018","unstructured":"Gupta A, M\u00fcller AT, Huisman BJH et al (2018) Generative recurrent networks for de novo drug design. Mol Inform 37:1700111. https:\/\/doi.org\/10.1002\/minf.201700111","journal-title":"Mol Inform"},{"key":"713_CR16","doi-asserted-by":"publisher","first-page":"10378","DOI":"10.1039\/d0sc03115a","volume":"11","author":"T Le","year":"2020","unstructured":"Le T, Winter R, No\u00e9 F, Clevert DA (2020) Neuraldecipher-reverse-engineering extended-connectivity fingerprints (ECFPs) to their molecular structures. Chem Sci 11:10378\u201310389. 
https:\/\/doi.org\/10.1039\/d0sc03115a","journal-title":"Chem Sci"},{"key":"713_CR17","doi-asserted-by":"publisher","first-page":"1273","DOI":"10.1021\/ci010132r","volume":"42","author":"JL Durant","year":"2002","unstructured":"Durant JL, Leland BA, Henry DR, Nourse JG (2002) Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci 42:1273\u20131280. https:\/\/doi.org\/10.1021\/ci010132r","journal-title":"J Chem Inf Comput Sci"},{"key":"713_CR18","doi-asserted-by":"publisher","first-page":"27","DOI":"10.1021\/acs.jcim.7b00616","volume":"58","author":"S Jaeger","year":"2018","unstructured":"Jaeger S, Fulle S, Turk S (2018) Mol2vec: unsupervised machine learning approach with chemical intuition. J Chem Inf Model 58:27\u201335. https:\/\/doi.org\/10.1021\/acs.jcim.7b00616","journal-title":"J Chem Inf Model"},{"key":"713_CR19","unstructured":"Duvenaud D, Maclaurin D, Aguilera-Iparraguirre J, et al (2015) Convolutional networks on graphs for learning molecular fingerprints. Adv Neural Inf Process Syst 2015-Janua:2224\u20132232"},{"key":"713_CR20","unstructured":"Goodfellow IJ, Pouget-Abadie J, Mirza M, et al (2014) Generative adversarial nets. In: Advances in neural information processing systems, p 27"},{"key":"713_CR21","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-022-00623-6","volume":"14","author":"M Abbasi","year":"2022","unstructured":"Abbasi M, Santos BP, Pereira TC et al (2022) Designing optimized drug candidates with generative adversarial network. J Cheminform 14:1\u201316. https:\/\/doi.org\/10.1186\/s13321-022-00623-6","journal-title":"J Cheminform"},{"key":"713_CR22","first-page":"1","volume":"2022","author":"K Maziarz","year":"2021","unstructured":"Maziarz K, Jackson-Flux H, Cameron P et al (2021) Learning to extend molecular scaffolds with structural motifs. 
ICLR 2022:1\u201322","journal-title":"ICLR"},{"key":"713_CR23","doi-asserted-by":"publisher","first-page":"1194","DOI":"10.1021\/acs.jcim.7b00690","volume":"58","author":"E Putin","year":"2018","unstructured":"Putin E, Asadulaev A, Ivanenkov Y et al (2018) Reinforced adversarial neural computer for de novo molecular design. J Chem Inf Model 58:1194\u20131204. https:\/\/doi.org\/10.1021\/acs.jcim.7b00690","journal-title":"J Chem Inf Model"},{"key":"713_CR24","doi-asserted-by":"publisher","first-page":"74","DOI":"10.1186\/s13321-019-0397-9","volume":"11","author":"O Prykhodko","year":"2019","unstructured":"Prykhodko O, Johansson SV, Kotsias P-C et al (2019) A de novo molecular generation method using latent vector based generative adversarial network. J Cheminform 11:74. https:\/\/doi.org\/10.1186\/s13321-019-0397-9","journal-title":"J Cheminform"},{"key":"713_CR25","doi-asserted-by":"publisher","unstructured":"Martinelli DD (2022) Generative machine learning for de novo drug discovery: a systematic review. Comput Biol Med 145:105403. https:\/\/doi.org\/10.1016\/j.compbiomed.2022.105403","DOI":"10.1016\/j.compbiomed.2022.105403"},{"key":"713_CR26","doi-asserted-by":"publisher","first-page":"2099","DOI":"10.1093\/bib\/bbz125","volume":"21","author":"X Lin","year":"2020","unstructured":"Lin X, Quan Z, Wang ZJ et al (2020) A novel molecular representation with BiGRU neural networks for learning atom. Brief Bioinform 21:2099\u20132111. https:\/\/doi.org\/10.1093\/bib\/bbz125","journal-title":"Brief Bioinform"},{"key":"713_CR27","doi-asserted-by":"publisher","first-page":"2324","DOI":"10.1021\/acs.jcim.5b00559","volume":"55","author":"T Sterling","year":"2015","unstructured":"Sterling T, Irwin JJ (2015) ZINC 15\u2014ligand discovery for everyone. J Chem Inf Model 55:2324\u20132337. 
https:\/\/doi.org\/10.1021\/acs.jcim.5b00559","journal-title":"J Chem Inf Model"},{"key":"713_CR28","unstructured":"Bjerrum EJ (2017) SMILES enumeration as data augmentation for neural network modeling of molecules. arXiv:1703.07076"},{"key":"713_CR29","unstructured":"United States Environmental Protection Agency. https:\/\/www.epa.gov\/"},{"key":"713_CR30","unstructured":"CompTox-ToxCast-tcpl. https:\/\/github.com\/USEPA\/CompTox-ToxCast-tcpl"},{"key":"713_CR31","doi-asserted-by":"publisher","unstructured":"Lamb J, Crawford ED, Peck D, et al (2006) The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313:1929\u20131935. https:\/\/doi.org\/10.1126\/science.1132939","DOI":"10.1126\/science.1132939"},{"key":"713_CR32","doi-asserted-by":"publisher","first-page":"D1202","DOI":"10.1093\/nar\/gkv951","volume":"44","author":"S Kim","year":"2016","unstructured":"Kim S, Thiessen PA, Bolton EE et al (2016) PubChem substance and compound databases. Nucleic Acids Res 44:D1202\u2013D1213. https:\/\/doi.org\/10.1093\/nar\/gkv951","journal-title":"Nucleic Acids Res"},{"key":"713_CR33","unstructured":"RDKit: Open-Source Cheminformatics Software."},{"key":"713_CR34","doi-asserted-by":"publisher","first-page":"270","DOI":"10.1162\/neco.1989.1.2.270","volume":"1","author":"RJ Williams","year":"1989","unstructured":"Williams RJ, Zipser D (1989) A learning algorithm for continually running fully recurrent neural networks. Neural Comput 1:270\u2013280. https:\/\/doi.org\/10.1162\/neco.1989.1.2.270","journal-title":"Neural Comput"},{"key":"713_CR35","doi-asserted-by":"crossref","unstructured":"Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 
ACM, New York, NY, USA, pp 785\u2013794","DOI":"10.1145\/2939672.2939785"},{"key":"713_CR36","doi-asserted-by":"crossref","unstructured":"Akiba T, Sano S, Yanase T, et al (2019) Optuna: a next-generation hyperparameter optimization framework. arXiv:1907.10902","DOI":"10.1145\/3292500.3330701"},{"key":"713_CR37","doi-asserted-by":"crossref","unstructured":"McInnes L, Healy J, Melville J (2018) UMAP: uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426v3","DOI":"10.21105\/joss.00861"},{"key":"713_CR38","doi-asserted-by":"publisher","first-page":"742","DOI":"10.1021\/ci100050t","volume":"50","author":"D Rogers","year":"2010","unstructured":"Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742\u2013754. https:\/\/doi.org\/10.1021\/ci100050t","journal-title":"J Chem Inf Model"},{"key":"713_CR39","doi-asserted-by":"publisher","first-page":"6","DOI":"10.1186\/s12864-019-6413-7","volume":"21","author":"D Chicco","year":"2020","unstructured":"Chicco D, Jurman G (2020) The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21:6. https:\/\/doi.org\/10.1186\/s12864-019-6413-7","journal-title":"BMC Genomics"},{"key":"713_CR40","unstructured":"Sun X, Yang D, Li X, et al (2021) Interpreting deep learning models in natural language processing: a review. arXiv:2110.10470"},{"key":"713_CR41","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1088\/2632-2153\/ac3ffb","volume":"3","author":"R Irwin","year":"2022","unstructured":"Irwin R, Dimitriadis S, He J, Bjerrum EJ (2022) Chemformer: a pre-trained transformer for computational chemistry. Mach Learn Sci Technol 3:1\u201315. 
https:\/\/doi.org\/10.1088\/2632-2153\/ac3ffb","journal-title":"Mach Learn Sci Technol"},{"key":"713_CR42","doi-asserted-by":"crossref","unstructured":"Hu F, Wang D, Hu Y, et al (2020) Generating novel compounds targeting SARS-CoV-2 main protease based on imbalanced dataset. In: 2020 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 432\u2013436","DOI":"10.1109\/BIBM49941.2020.9313317"},{"key":"713_CR43","unstructured":"Maziarka \u0141, Danel T, Mucha S, et al (2020) Molecule attention transformer. arXiv:2002.08264"},{"key":"713_CR44","doi-asserted-by":"publisher","first-page":"5804","DOI":"10.1021\/acs.jcim.1c01289","volume":"61","author":"H Kim","year":"2021","unstructured":"Kim H, Na J, Lee WB (2021) Generative chemical transformer: neural machine learning of molecular geometric structures from chemical language via attention. J Chem Inf Model 61:5804\u20135814. https:\/\/doi.org\/10.1021\/acs.jcim.1c01289","journal-title":"J Chem Inf Model"},{"key":"713_CR45","doi-asserted-by":"publisher","unstructured":"Mercado R, Rastemo T, Lindel\u00f6f E, et al (2021) Graph networks for molecular design. Mach Learn Sci Technol 2:025023. https:\/\/doi.org\/10.1088\/2632-2153\/abcf91","DOI":"10.1088\/2632-2153\/abcf91"},{"key":"713_CR46","unstructured":"Ertl P, Lewis R, Martin E, Polyakov V (2017) In silico generation of novel, drug-like chemical matter using the LSTM neural network. 
arXiv:1712.07449"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00713-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-023-00713-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00713-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,4,12]],"date-time":"2023-04-12T10:07:51Z","timestamp":1681294071000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-023-00713-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,4,12]]},"references-count":46,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["713"],"URL":"https:\/\/doi.org\/10.1186\/s13321-023-00713-z","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-2300113\/v1","asserted-by":"object"}]},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,4,12]]},"assertion":[{"value":"22 November 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 March 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 April 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing 
interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"45"}}