{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,14]],"date-time":"2026-03-14T11:48:05Z","timestamp":1773488885459,"version":"3.50.1"},"reference-count":53,"publisher":"Oxford University Press (OUP)","issue":"3","license":[{"start":{"date-parts":[[2022,4,18]],"date-time":"2022-04-18T00:00:00Z","timestamp":1650240000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"publisher","award":["LZ19H300001"],"award-info":[{"award-number":["LZ19H300001"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"publisher","award":["LD22H300001"],"award-info":[{"award-number":["LD22H300001"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"publisher","award":["2021YFE0206400"],"award-info":[{"award-number":["2021YFE0206400"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,5,13]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Molecular property prediction models based on machine learning algorithms have become important tools to triage unpromising lead molecules in the early stages of drug discovery. 
Compared with the mainstream descriptor- and graph-based methods for molecular property predictions, SMILES-based methods can directly extract molecular features from SMILES without human expert knowledge, but they require more powerful algorithms for feature extraction and a larger amount of data for training, which makes SMILES-based methods less popular. Here, we show the great potential of pre-training in promoting the predictions of important pharmaceutical properties. By utilizing three pre-training tasks based on atom feature prediction, molecular feature prediction and contrastive learning, a new pre-training method K-BERT, which can extract chemical information from SMILES like chemists, was developed. The calculation results on 15 pharmaceutical datasets show that K-BERT outperforms well-established descriptor-based (XGBoost) and graph-based (Attentive FP and HRGCN+) models. In addition, we found that the contrastive learning pre-training task enables K-BERT to \u2018understand\u2019 SMILES not limited to canonical SMILES. Moreover, the general fingerprints K-BERT-FP generated by K-BERT exhibit comparable predictive power to MACCS on 15 pharmaceutical datasets and can also capture molecular size and chirality information that traditional binary fingerprints cannot capture. 
Our results illustrate the great potential of K-BERT in the practical applications of molecular property predictions in drug discovery.<\/jats:p>","DOI":"10.1093\/bib\/bbac131","type":"journal-article","created":{"date-parts":[[2022,3,21]],"date-time":"2022-03-21T12:08:20Z","timestamp":1647864500000},"source":"Crossref","is-referenced-by-count":81,"title":["Knowledge-based BERT: a method to extract molecular features like computational chemists"],"prefix":"10.1093","volume":"23","author":[{"given":"Zhenxing","family":"Wu","sequence":"first","affiliation":[{"name":"Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China"},{"name":"Cancer Center, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China"},{"name":"State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China"}]},{"given":"Dejun","family":"Jiang","sequence":"additional","affiliation":[{"name":"Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China"}]},{"given":"Jike","family":"Wang","sequence":"additional","affiliation":[{"name":"Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China"},{"name":"National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan 430072, Hubei, P. R. China"}]},{"given":"Xujun","family":"Zhang","sequence":"additional","affiliation":[{"name":"Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. 
China"}]},{"given":"Hongyan","family":"Du","sequence":"additional","affiliation":[{"name":"Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China"}]},{"given":"Lurong","family":"Pan","sequence":"additional","affiliation":[{"name":"Global Health Drug Discovery Institute, Beijing 100192, P. R. China"}]},{"given":"Chang-Yu","family":"Hsieh","sequence":"additional","affiliation":[{"name":"Tencent Quantum Laboratory, Tencent, Shenzhen 518057, Guangdong, P. R. China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3604-3785","authenticated-orcid":false,"given":"Dongsheng","family":"Cao","sequence":"additional","affiliation":[{"name":"Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410004, Hunan, P. R. China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7227-2580","authenticated-orcid":false,"given":"Tingjun","family":"Hou","sequence":"additional","affiliation":[{"name":"Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China"},{"name":"Cancer Center, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China"},{"name":"State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. 
China"}]}],"member":"286","published-online":{"date-parts":[[2022,4,18]]},"reference":[{"key":"2022051813444395600_ref1","doi-asserted-by":"crossref","first-page":"727","DOI":"10.1038\/90765","article-title":"Drug discovery\u2014an operating model for a new era","volume":"19","author":"Myers","year":"2001","journal-title":"Nat Biotechnol"},{"key":"2022051813444395600_ref2","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1016\/j.jhealeco.2016.01.012","article-title":"Innovation in the pharmaceutical industry: new estimates of R&D costs","volume":"47","author":"DiMasi","year":"2016","journal-title":"J Health Econ"},{"key":"2022051813444395600_ref3","doi-asserted-by":"crossref","first-page":"475","DOI":"10.1038\/nrd4609","article-title":"An analysis of the attrition of drug candidates from four major pharmaceutical companies","volume":"14","author":"Waring","year":"2015","journal-title":"Nat Rev Drug Discov"},{"key":"2022051813444395600_ref4","doi-asserted-by":"crossref","first-page":"457","DOI":"10.1038\/s42256-020-0209-y","article-title":"Minimal-uncertainty prediction of general drug-likeness based on Bayesian neural networks","volume":"2","author":"Beker","year":"2020","journal-title":"Nature Machine Intelligence"},{"key":"2022051813444395600_ref5","first-page":"1","article-title":"Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT","volume":"12","author":"Li","year":"2020","journal-title":"J Chem"},{"key":"2022051813444395600_ref6","doi-asserted-by":"crossref","first-page":"383","DOI":"10.1016\/j.cbi.2009.06.024","article-title":"Cholinesterase inhibitory activities of some flavonoid derivatives and chosen xanthone and their molecular docking studies","volume":"181","author":"Khan","year":"2009","journal-title":"Chem Biol Interact"},{"key":"2022051813444395600_ref7","doi-asserted-by":"crossref","first-page":"402","DOI":"10.1016\/S1367-5931(03)00055-3","article-title":"Profiling drug-like properties in 
discovery research","volume":"7","author":"Di","year":"2003","journal-title":"Curr Opin Chem Biol"},{"key":"2022051813444395600_ref8","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1016\/S0169-409X(02)00003-0","article-title":"Prediction of \u2018drug-likeness\u2019","volume":"54","author":"Walters","year":"2002","journal-title":"Adv Drug Deliv Rev"},{"key":"2022051813444395600_ref9","doi-asserted-by":"crossref","first-page":"6924","DOI":"10.1021\/acs.jmedchem.1c00421","article-title":"Mining Toxicity Information from Large Amounts of Toxicity Data","volume":"64","author":"Wu","year":"2021","journal-title":"J Med Chem"},{"key":"2022051813444395600_ref10","doi-asserted-by":"crossref","first-page":"30","DOI":"10.3389\/fchem.2018.00030","article-title":"In silico prediction of chemical toxicity for drug design using machine learning methods and structural alerts","volume":"6","author":"Yang","year":"2018","journal-title":"Front Chem"},{"key":"2022051813444395600_ref11","doi-asserted-by":"crossref","first-page":"4463","DOI":"10.1021\/jm0303195","article-title":"Classification of kinase inhibitors using a Bayesian model","volume":"47","author":"Xia","year":"2004","journal-title":"J Med Chem"},{"key":"2022051813444395600_ref12","doi-asserted-by":"crossref","first-page":"996","DOI":"10.1021\/mp300023x","article-title":"ADMET evaluation in drug discovery. 12. Development of binary classification models for prediction of hERG potassium channel blockage","volume":"9","author":"Wang","year":"2012","journal-title":"Mol Pharm"},{"key":"2022051813444395600_ref13","doi-asserted-by":"crossref","first-page":"2048","DOI":"10.1021\/ci0340916","article-title":"Drug discovery using support vector machines. 
The case studies of drug-likeness, agrochemical-likeness, and enzyme inhibition predictions","volume":"43","author":"Zernov","year":"2003","journal-title":"J Chem Inf Comput Sci"},{"key":"2022051813444395600_ref14","doi-asserted-by":"crossref","first-page":"115","DOI":"10.1080\/10629360701843482","article-title":"Prediction of PAH mutagenicity in human cells by QSAR classification","volume":"19","author":"Papa","year":"2008","journal-title":"SAR QSAR in Environmental Research"},{"key":"2022051813444395600_ref15","doi-asserted-by":"crossref","first-page":"1273","DOI":"10.1021\/ci010132r","article-title":"Reoptimization of MDL keys for use in drug discovery","volume":"42","author":"Durant","year":"2002","journal-title":"J Chem Inf Comput Sci"},{"key":"2022051813444395600_ref16","doi-asserted-by":"crossref","first-page":"595","DOI":"10.1007\/s10822-016-9938-8","article-title":"Molecular graph convolutions: moving beyond fingerprints","volume":"30","author":"Kearnes","year":"2016","journal-title":"J Comput Aided Mol Des"},{"key":"2022051813444395600_ref17","first-page":"2224","volume-title":"Advances in Neural Information Processing Systems","author":"Duvenaud","year":"2015"},{"key":"2022051813444395600_ref18","doi-asserted-by":"crossref","first-page":"3370","DOI":"10.1021\/acs.jcim.9b00237","article-title":"Analyzing learned molecular representations for property prediction","volume":"59","author":"Yang","year":"2019","journal-title":"J Chem Inf Model"},{"key":"2022051813444395600_ref19","doi-asserted-by":"crossref","first-page":"8749","DOI":"10.1021\/acs.jmedchem.9b00959","article-title":"Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism","volume":"63","author":"Xiong","year":"2019","journal-title":"J Med Chem"},{"key":"2022051813444395600_ref20","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1021\/acs.jcim.9b00587","article-title":"Graph Convolutional Neural Networks as \u201cGeneral-Purpose\u201d 
Property Predictors: The Universality and Limits of Applicability","volume":"60","author":"Korolev","year":"2019","journal-title":"J Chem Inf Model"},{"key":"2022051813444395600_ref21","doi-asserted-by":"crossref","first-page":"8778","DOI":"10.1021\/acs.jmedchem.9b01129","article-title":"Practical high-quality electrostatic potential surfaces for drug discovery using a graph-convolutional deep neural network","volume":"63","author":"Rathi","year":"2020","journal-title":"J Med Chem"},{"key":"2022051813444395600_ref22","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1021\/ci00057a005","article-title":"SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules","volume":"28","author":"Weininger","year":"1988","journal-title":"J Chem Inf Comput Sci"},{"key":"2022051813444395600_ref23","first-page":"5998","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani","year":"2017"},{"key":"2022051813444395600_ref24","article-title":"Bert: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin","year":"2018"},{"key":"2022051813444395600_ref25","volume-title":"Improving Language Understanding by Generative Pre-training","author":"Radford","year":"2018"},{"key":"2022051813444395600_ref26","doi-asserted-by":"crossref","first-page":"429","DOI":"10.1145\/3307339.3342186","volume-title":"Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","author":"Wang","year":"2019"},{"key":"2022051813444395600_ref27","article-title":"Smiles transformer: pre-trained molecular fingerprint for low data drug discovery","author":"Honda","year":"2019"},{"key":"2022051813444395600_ref28","article-title":"Do Transformers Really Perform Bad for Graph Representation?","author":"Ying","year":"2021"},{"key":"2022051813444395600_ref29","article-title":"Chemformer: a pre-trained transformer for computational 
chemistry","author":"Irwin","year":"2022","journal-title":"Machine Learning: Science and Technology"},{"key":"2022051813444395600_ref30","article-title":"Molecular representation learning with language models and domain-relevant auxiliary tasks","author":"Fabian","year":"2020"},{"key":"2022051813444395600_ref31","article-title":"Self-supervised graph transformer on large-scale molecular data","author":"Rong","year":"2020"},{"key":"2022051813444395600_ref32","article-title":"Strategies for pre-training graph neural networks","author":"Hu","year":"2019"},{"key":"2022051813444395600_ref33","article-title":"Molecule attention transformer","author":"Maziarka","year":"2020"},{"key":"2022051813444395600_ref34","article-title":"Adversarial examples in the physical world","author":"Kurakin","year":"2016"},{"key":"2022051813444395600_ref35","article-title":"Understanding neural networks through representation erasure","author":"Li","year":"2016"},{"key":"2022051813444395600_ref36","first-page":"8018","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Jin","year":"2020"},{"key":"2022051813444395600_ref37","first-page":"1","volume-title":"Xgboost: extreme gradient boosting","author":"Chen","year":"2015"},{"key":"2022051813444395600_ref38","doi-asserted-by":"crossref","DOI":"10.1093\/bib\/bbab112","article-title":"Hyperbolic relational graph convolution networks plus: a simple but highly efficient QSAR-modeling method","volume":"22","author":"Wu","year":"2021","journal-title":"Brief Bioinform"},{"key":"2022051813444395600_ref39","doi-asserted-by":"crossref","first-page":"D930","DOI":"10.1093\/nar\/gky1075","article-title":"ChEMBL: towards direct deposition of bioassay data","volume":"47","author":"Mendez","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2022051813444395600_ref40","doi-asserted-by":"crossref","first-page":"W5","DOI":"10.1093\/nar\/gkab255","article-title":"ADMETlab 2.0: an integrated online platform for accurate 
and comprehensive predictions of ADMET properties","volume":"49","author":"Xiong","year":"2021","journal-title":"Nucleic Acids Res"},{"key":"2022051813444395600_ref41","doi-asserted-by":"crossref","first-page":"344","DOI":"10.1038\/nature19804","article-title":"Diversity-oriented synthesis yields novel multistage antimalarial inhibitors","volume":"538","author":"Kato","year":"2016","journal-title":"Nature"},{"key":"2022051813444395600_ref42","article-title":"Message passing networks for molecules with tetrahedral chirality","author":"Pattanaik","year":"2020"},{"key":"2022051813444395600_ref43","doi-asserted-by":"crossref","first-page":"224","DOI":"10.1038\/s41586-019-0917-9","article-title":"Ultra-large library docking for discovering new chemotypes","volume":"566","author":"Lyu","year":"2019","journal-title":"Nature"},{"key":"2022051813444395600_ref44","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/P19-1452","article-title":"BERT rediscovers the classical NLP pipeline","author":"Tenney","year":"2019"},{"key":"2022051813444395600_ref45","doi-asserted-by":"crossref","first-page":"154290","DOI":"10.1109\/ACCESS.2019.2946594","article-title":"Target-dependent sentiment classification with BERT","volume":"7","author":"Gao","year":"2019","journal-title":"IEEE Access"},{"key":"2022051813444395600_ref46","first-page":"1","volume-title":"6th Italian Conference on Computational Linguistics, CLiC-it 2019","author":"Polignano","year":"2019"},{"key":"2022051813444395600_ref47","doi-asserted-by":"crossref","first-page":"6091","DOI":"10.1039\/C8SC02339E","article-title":"\u201cFound in Translation\u201d: predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models","volume":"9","author":"Schwaller","year":"2018","journal-title":"Chem Sci"},{"key":"2022051813444395600_ref48","article-title":"SMILES enumeration as data augmentation for neural network modeling of 
molecules","author":"Bjerrum","year":"2017"},{"key":"2022051813444395600_ref49","doi-asserted-by":"crossref","first-page":"1193","DOI":"10.1021\/ci8004644","article-title":"Comparison of nonbinary similarity coefficients for similarity searching, clustering and compound selection","volume":"49","author":"Khalifa","year":"2009","journal-title":"J Chem Inf Model"},{"key":"2022051813444395600_ref50","volume-title":"Advances in Neural Information Processing Systems","author":"Hinton","year":"2002"},{"key":"2022051813444395600_ref51","first-page":"1","article-title":"Visualization of very large high-dimensional data sets as minimum spanning trees","volume":"12","author":"Probst","year":"2020","journal-title":"J Chem"},{"key":"2022051813444395600_ref52","doi-asserted-by":"crossref","first-page":"D1074","DOI":"10.1093\/nar\/gkx1037","article-title":"DrugBank 5.0: a major update to the DrugBank database for 2018","volume":"46","author":"Wishart","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2022051813444395600_ref53","first-page":"1","article-title":"One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome","volume":"12","author":"Capecchi","year":"2020","journal-title":"J Chem"}],"container-title":["Briefings in 
Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bib\/article-pdf\/23\/3\/bbac131\/43745184\/bbac131.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bib\/article-pdf\/23\/3\/bbac131\/43745184\/bbac131.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,5,18]],"date-time":"2022-05-18T13:47:02Z","timestamp":1652881622000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bib\/article\/doi\/10.1093\/bib\/bbac131\/6570013"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,4,18]]},"references-count":53,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2022,5,13]]}},"URL":"https:\/\/doi.org\/10.1093\/bib\/bbac131","relation":{},"ISSN":["1467-5463","1477-4054"],"issn-type":[{"value":"1467-5463","type":"print"},{"value":"1477-4054","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,5]]},"published":{"date-parts":[[2022,4,18]]},"article-number":"bbac131"}}