{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T16:45:42Z","timestamp":1774975542676,"version":"3.50.1"},"reference-count":50,"publisher":"Oxford University Press (OUP)","issue":"5","license":[{"start":{"date-parts":[[2025,5,8]],"date-time":"2025-05-08T00:00:00Z","timestamp":1746662400000},"content-version":"vor","delay-in-days":7,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003725","name":"National Research Foundation of Korea","doi-asserted-by":"publisher","award":["NRF-2019R1F1A1060250"],"award-info":[{"award-number":["NRF-2019R1F1A1060250"]}],"id":[{"id":"10.13039\/501100003725","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,5,6]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Molecular property prediction with deep learning has accelerated drug discovery and retrosynthesis. However, the shortage of labeled molecular data and the challenge of generalizing across the vast chemical spaces pose significant hurdles for leveraging deep learning in molecular property prediction. This study proposes a self-supervised framework designed to acquire a Simplified Molecular Input Line Entry System (SMILES) representation, which we have dubbed Simple SMILES contrastive learning (SimSon). SimSon was pre-trained using unlabeled SMILES data through contrastive learning to grasp the SMILES representations.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>Our findings demonstrate that contrastive learning with randomized SMILES enriches the ability of the model to generalize and its robustness as it captures the global semantic context at the molecular level. In downstream tasks, SimSon performs competitively when compared to graph-based methods and even outperforms them on certain benchmark datasets. These results indicate that SimSon effectively captures structural information from SMILES, exhibiting remarkable generalization and robustness. The potential applications of SimSon extend to bioinformatics and cheminformatics, encompassing areas such as drug discovery and drug\u2013drug interaction prediction.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>The source code is available at https:\/\/github.com\/lee00206\/SimSon.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaf275","type":"journal-article","created":{"date-parts":[[2025,5,8]],"date-time":"2025-05-08T23:00:31Z","timestamp":1746745231000},"source":"Crossref","is-referenced-by-count":3,"title":["SimSon: simple contrastive learning of SMILES for molecular property prediction"],"prefix":"10.1093","volume":"41","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-8245-8478","authenticated-orcid":false,"given":"Chae Eun","family":"Lee","sequence":"first","affiliation":[{"name":"Department of Industrial and Management Engineering, Korea University , Seoul 02841,","place":["Republic of Korea"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8819-4596","authenticated-orcid":false,"given":"Jin Sob","family":"Kim","sequence":"additional","affiliation":[{"name":"Department of Industrial and Management Engineering, Korea University , Seoul 02841,","place":["Republic of Korea"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9261-9235","authenticated-orcid":false,"given":"Jin Hong","family":"Min","sequence":"additional","affiliation":[{"name":"Department of Industrial and Management Engineering, Korea University , Seoul 02841,","place":["Republic of Korea"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0040-3542","authenticated-orcid":false,"given":"Sung Won","family":"Han","sequence":"additional","affiliation":[{"name":"Department of Industrial and Management Engineering, Korea University , Seoul 02841,","place":["Republic of Korea"]}]}],"member":"286","published-online":{"date-parts":[[2025,5,8]]},"reference":[{"key":"2025053011215136300_btaf275-B1","doi-asserted-by":"crossref","first-page":"71","DOI":"10.1186\/s13321-019-0393-0","article-title":"Randomized smiles strings improve the quality of molecular generative models","volume":"11","author":"Ar\u00fas-Pous","year":"2019","journal-title":"J Cheminform"},{"key":"2025053011215136300_btaf275-B2","doi-asserted-by":"crossref","first-page":"38","DOI":"10.1186\/s13321-020-00441-8","article-title":"Smiles-based deep generative scaffold decorator for de-novo drug design","volume":"12","author":"Ar\u00fas-Pous","year":"2020","journal-title":"J Cheminform"},{"key":"2025053011215136300_btaf275-B3","author":"Bjerrum","year":"2017"},{"key":"2025053011215136300_btaf275-B4","author":"Carlsson"},{"key":"2025053011215136300_btaf275-B5","doi-asserted-by":"crossref","first-page":"1241","DOI":"10.1016\/j.drudis.2018.01.039","article-title":"The rise of deep learning in drug discovery","volume":"23","author":"Chen","year":"2018","journal-title":"Drug Discov Today"},{"key":"2025053011215136300_btaf275-B6","first-page":"1597","author":"Chen","year":"2020"},{"key":"2025053011215136300_btaf275-B7","author":"Chithrananda","year":"2020"},{"key":"2025053011215136300_btaf275-B8","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1186\/s13321-020-00460-5","article-title":"Molecular representations in ai-driven drug discovery: a review and practical guide","volume":"12","author":"David","year":"2020","journal-title":"J Cheminform"},{"key":"2025053011215136300_btaf275-B9","author":"Fang","year":"2020"},{"key":"2025053011215136300_btaf275-B10","doi-asserted-by":"crossref","first-page":"127","DOI":"10.1038\/s42256-021-00438-4","article-title":"Geometry-enhanced molecular representation learning for property prediction","volume":"4","author":"Fang","year":"2022","journal-title":"Nat Mach Intell"},{"key":"2025053011215136300_btaf275-B11","doi-asserted-by":"crossref","first-page":"542","DOI":"10.1038\/s42256-023-00654-0","article-title":"Knowledge graph-enhanced molecular contrastive learning with functional prompt","volume":"5","author":"Fang","year":"2023","journal-title":"Nat Mach Intell"},{"key":"2025053011215136300_btaf275-B12","doi-asserted-by":"crossref","first-page":"268","DOI":"10.1021\/acscentsci.7b00572","article-title":"Automatic chemical design using a data-driven continuous representation of molecules","volume":"4","author":"G\u00f3mez-Bombarelli","year":"2018","journal-title":"ACS Cent Sci"},{"key":"2025053011215136300_btaf275-B13","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-long.242"},{"key":"2025053011215136300_btaf275-B14","first-page":"1735","author":"Hadsell","year":"2006"},{"key":"2025053011215136300_btaf275-B15","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1186\/s13321-015-0068-4","article-title":"INCHI, the IUPAC international chemical identifier","volume":"7","author":"Heller","year":"2015","journal-title":"J Cheminform"},{"key":"2025053011215136300_btaf275-B16","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput"},{"key":"2025053011215136300_btaf275-B17","doi-asserted-by":"crossref","first-page":"ii190","DOI":"10.1093\/bioinformatics\/btae386","article-title":"MolMVC: enhancing molecular representations for drug-related tasks through multi-view contrastive learning","volume":"40","author":"Huang","year":"2024","journal-title":"Bioinformatics"},{"key":"2025053011215136300_btaf275-B18","doi-asserted-by":"crossref","first-page":"177","DOI":"10.1021\/ci049714+","article-title":"Zinc\u2014a free database of commercially available compounds for virtual screening","volume":"45","author":"Irwin","year":"2005","journal-title":"J Chem Inf Model"},{"key":"2025053011215136300_btaf275-B19","doi-asserted-by":"crossref","first-page":"2","DOI":"10.3390\/technologies9010002","article-title":"A survey on contrastive self-supervised learning","volume":"9","author":"Jaiswal","year":"2020","journal-title":"Technologies"},{"key":"2025053011215136300_btaf275-B20","doi-asserted-by":"crossref","first-page":"D1388","DOI":"10.1093\/nar\/gkaa971","article-title":"PubChem in 2021: new data content and improved web interfaces","volume":"49","author":"Kim","year":"2021","journal-title":"Nucleic Acids Res"},{"key":"2025053011215136300_btaf275-B21","doi-asserted-by":"crossref","first-page":"S69","DOI":"10.1016\/j.toxlet.2017.07.175","article-title":"Deeptox: toxicity prediction using deep learning","volume":"280","author":"Klambauer","year":"2017","journal-title":"Toxicol Lett"},{"key":"2025053011215136300_btaf275-B22","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1208\/s12248-021-00644-3","article-title":"Machine learning and artificial intelligence in pharmaceutical research and development: a review","volume":"24","author":"Kolluri","year":"2022","journal-title":"AAPS J"},{"key":"2025053011215136300_btaf275-B23","first-page":"31","article-title":"Rdkit: a software suite for cheminformatics, computational chemistry, and predictive modeling","volume":"8","author":"Landrum","year":"2013","journal-title":"Greg Landrum"},{"key":"2025053011215136300_btaf275-B24","first-page":"6874","author":"Larsson","year":"2017"},{"key":"2025053011215136300_btaf275-B25","doi-asserted-by":"crossref","first-page":"103373","DOI":"10.1016\/j.drudis.2022.103373","article-title":"Deep learning methods for molecular representation and property prediction","volume":"27","author":"Li","year":"2022","journal-title":"Drug Discov Today"},{"key":"2025053011215136300_btaf275-B26","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1186\/s13321-018-0286-7","article-title":"Molecular generative model based on conditional variational autoencoder for de novo molecular design","volume":"10","author":"Lim","year":"2018","journal-title":"J Cheminform"},{"key":"2025053011215136300_btaf275-B27","author":"Liu","year":"2021"},{"key":"2025053011215136300_btaf275-B28","first-page":"1052","author":"Lu","year":"2019"},{"key":"2025053011215136300_btaf275-B29","doi-asserted-by":"crossref","first-page":"D930","DOI":"10.1093\/nar\/gky1075","article-title":"Chembl: towards direct deposition of bioassay data","volume":"47","author":"Mendez","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2025053011215136300_btaf275-B30","doi-asserted-by":"crossref","first-page":"14","DOI":"10.1186\/1758-2946-4-22","article-title":"Towards a universal smiles representation-a standard method to generate canonical smiles based on the inchi","volume":"4","author":"O\u2019Boyle","year":"2012","journal-title":"J Cheminform"},{"key":"2025053011215136300_btaf275-B31","doi-asserted-by":"crossref","first-page":"i821","DOI":"10.1093\/bioinformatics\/bty593","article-title":"Deepdta: deep drug\u2013target binding affinity prediction","volume":"34","author":"\u00d6zt\u00fcrk","year":"2018","journal-title":"Bioinformatics"},{"key":"2025053011215136300_btaf275-B32","author":"Radford","year":"2018"},{"key":"2025053011215136300_btaf275-B33","first-page":"12559","author":"Rong","year":"2020"},{"key":"2025053011215136300_btaf275-B34","doi-asserted-by":"crossref","first-page":"241722","DOI":"10.1063\/1.5019779","article-title":"Schnet\u2013a deep learning architecture for molecules and materials","volume":"148","author":"Sch\u00fctt","year":"2018","journal-title":"J Chem Phys"},{"key":"2025053011215136300_btaf275-B35","doi-asserted-by":"crossref","first-page":"132306","DOI":"10.1016\/j.physd.2019.132306","article-title":"Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network","volume":"404","author":"Sherstinsky","year":"2020","journal-title":"Phys D Nonlinear Phenomena"},{"key":"2025053011215136300_btaf275-B36","first-page":"2831","author":"Song","year":"2020"},{"key":"2025053011215136300_btaf275-B37","doi-asserted-by":"crossref","first-page":"3666","DOI":"10.1093\/bioinformatics\/bty374","article-title":"Development and evaluation of a deep learning model for protein\u2013ligand binding affinity prediction","volume":"34","author":"Stepniewska-Dziubinska","year":"2018","journal-title":"Bioinformatics"},{"key":"2025053011215136300_btaf275-B38","doi-asserted-by":"crossref","first-page":"i357","DOI":"10.1093\/bioinformatics\/btae260","article-title":"Mollm: a unified language model for integrating biomedical text with 2D and 3D molecular representations","volume":"40","author":"Tang","year":"2024","journal-title":"Bioinformatics"},{"key":"2025053011215136300_btaf275-B39","doi-asserted-by":"crossref","first-page":"463","DOI":"10.1038\/s41573-019-0024-5","article-title":"Applications of machine learning in drug discovery and development","volume":"18","author":"Vamathevan","year":"2019","journal-title":"Nat Rev Drug Discovery"},{"key":"2025053011215136300_btaf275-B40","first-page":"939","author":"Wang","year":"2022"},{"key":"2025053011215136300_btaf275-B41","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1038\/s42256-022-00447-x","article-title":"Molecular contrastive learning of representations via graph neural networks","volume":"4","author":"Wang","year":"2022","journal-title":"Nat Mach Intell"},{"key":"2025053011215136300_btaf275-B42","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1021\/ci00057a005","article-title":"Smiles, a chemical language and information system. 1. Introduction to methodology and encoding rules","volume":"28","author":"Weininger","year":"1988","journal-title":"J Chem Inf Comput Sci"},{"key":"2025053011215136300_btaf275-B43","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.ddtec.2020.11.009","article-title":"A compact review of molecular property prediction with graph neural networks","volume":"37","author":"Wieder","year":"2020","journal-title":"Drug Discov Today Technol"},{"key":"2025053011215136300_btaf275-B44","doi-asserted-by":"crossref","first-page":"66","DOI":"10.1038\/s41524-019-0203-2","article-title":"Machine-learning-assisted discovery of polymers with high thermal conductivity using a molecular design algorithm","volume":"5","author":"Wu","year":"2019","journal-title":"NPJ Computational Materials"},{"key":"2025053011215136300_btaf275-B45","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1039\/C7SC02664A","article-title":"Moleculenet: a benchmark for molecular machine learning","volume":"9","author":"Wu","year":"2018","journal-title":"Chem Sci"},{"key":"2025053011215136300_btaf275-B46","doi-asserted-by":"crossref","first-page":"3370","DOI":"10.1021\/acs.jcim.9b00237","article-title":"Analyzing learned molecular representations for property prediction","volume":"59","author":"Yang","year":"2019","journal-title":"J Chem Inf Model"},{"key":"2025053011215136300_btaf275-B47","doi-asserted-by":"crossref","first-page":"34","DOI":"10.1038\/s42004-023-00825-5","article-title":"Hierarchical molecular graph self-supervised learning for property prediction","volume":"6","author":"Zang","year":"2023","journal-title":"Commun Chem"},{"key":"2025053011215136300_btaf275-B48","first-page":"15870","article-title":"Motif-based graph self-supervised learning for molecular property prediction","volume":"34","author":"Zhang","year":"2021","journal-title":"Adv Neural Inf Process Syst"},{"key":"2025053011215136300_btaf275-B49","doi-asserted-by":"crossref","first-page":"914","DOI":"10.1021\/acs.jcim.8b00803","article-title":"Identifying structure\u2013property relationships through smiles syntax analysis with self-attention mechanism","volume":"59","author":"Zheng","year":"2019","journal-title":"J Chem Inf Model"},{"key":"2025053011215136300_btaf275-B50","doi-asserted-by":"publisher","DOI":"10.1145\/3580305.3599317"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaf275\/63129252\/btaf275.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/5\/btaf275\/63129252\/btaf275.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/5\/btaf275\/63129252\/btaf275.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,30]],"date-time":"2025-05-30T15:22:07Z","timestamp":1748618527000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btaf275\/8127203"}},"subtitle":[],"editor":[{"given":"Arne","family":"Elofsson","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2025,5]]},"references-count":50,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2025,5,6]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaf275","relation":{},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025,5]]},"published":{"date-parts":[[2025,5]]},"article-number":"btaf275"}}