{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T22:47:00Z","timestamp":1777502820057,"version":"3.51.4"},"reference-count":35,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,2,8]],"date-time":"2023-02-08T00:00:00Z","timestamp":1675814400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,2,8]],"date-time":"2023-02-08T00:00:00Z","timestamp":1675814400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Molecular similarity search is an often-used method in drug discovery, especially in virtual screening studies. While simple one- or two-dimensional similarity metrics can be applied to search databases containing billions of molecules in a reasonable amount of time, this is not the case for complex three-dimensional methods. In this work, we trained a transformer model to autoencode tokenized SMILES strings using a custom loss function developed to conserve similarities in latent space. This allows the direct sampling of molecules in the generated latent space based on their Euclidian distance. Reducing the similarity between molecules to their Euclidian distance in latent space allows the model to perform independent of the similarity metric it was trained on. While we test the method here using 2D similarity as proof-of-concept study, the algorithm will enable also high-content screening with time-consuming 3D similarity metrics. We show that the presence of a specific loss function for similarity conservation greatly improved the model\u2019s ability to predict highly similar molecules. When applying the model to a database containing 1.5 billion molecules, our model managed to reduce the relevant search space by 5 orders of magnitude. We also show that our model was able to generalize adequately when trained on a relatively small dataset of representative structures. The herein presented method thereby provides new means of substantially reducing the relevant search space in virtual screening approaches, thus highly increasing their throughput. Additionally, the distance awareness of the model causes the efficiency of this method to be independent of the underlying similarity metric.<\/jats:p>","DOI":"10.1186\/s13321-023-00686-z","type":"journal-article","created":{"date-parts":[[2023,2,8]],"date-time":"2023-02-08T11:06:11Z","timestamp":1675854371000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Efficient virtual high-content screening using a distance-aware transformer model"],"prefix":"10.1186","volume":"15","author":[{"given":"Manuel S.","family":"Sellner","sequence":"first","affiliation":[]},{"given":"Amr H.","family":"Mahmoud","sequence":"additional","affiliation":[]},{"given":"Markus A.","family":"Lill","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,2,8]]},"reference":[{"issue":"9","key":"686_CR1","doi-asserted-by":"publisher","first-page":"844","DOI":"10.1001\/jama.2020.1166","volume":"323","author":"OJ Wouters","year":"2020","unstructured":"Wouters OJ, McKee M, Luyten J (2020) Estimated research and development investment needed to bring a new medicine to market, 2009\u20132018. JAMA 323(9):844. https:\/\/doi.org\/10.1001\/jama.2020.1166","journal-title":"JAMA"},{"key":"686_CR2","doi-asserted-by":"publisher","first-page":"315","DOI":"10.3389\/fchem.2018.00315","volume":"6","author":"A Kumar","year":"2018","unstructured":"Kumar A, Zhang KYJ (2018) Advances in the development of shape similarity methods and their application in drug discovery. Front Chem 6:315. https:\/\/doi.org\/10.3389\/fchem.2018.00315","journal-title":"Front Chem"},{"issue":"2","key":"686_CR3","doi-asserted-by":"publisher","first-page":"137","DOI":"10.1517\/17460441.2016.1117070","volume":"11","author":"I Muegge","year":"2016","unstructured":"Muegge I, Mukherjee P (2016) An overview of molecular fingerprint similarity search in virtual screening. Expert Opin Drug Discov 11(2):137\u2013148. https:\/\/doi.org\/10.1517\/17460441.2016.1117070","journal-title":"Expert Opin Drug Discov"},{"issue":"7","key":"686_CR4","doi-asserted-by":"publisher","first-page":"1892","DOI":"10.1021\/ci500232g","volume":"54","author":"M Awale","year":"2014","unstructured":"Awale M, Reymond J-L (2014) Atom Pair 2D-fingerprints perceive 3D-molecular shape and pharmacophores for very fast virtual screening of ZINC and GDB-17. J Chem Inf Model 54(7):1892\u20131907. https:\/\/doi.org\/10.1021\/ci500232g","journal-title":"J Chem Inf Model"},{"issue":"1","key":"686_CR5","doi-asserted-by":"publisher","first-page":"43","DOI":"10.1186\/s13321-020-00445-4","volume":"12","author":"A Capecchi","year":"2020","unstructured":"Capecchi A, Probst D, Reymond J-L (2020) One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J Cheminformatics 12(1):43. https:\/\/doi.org\/10.1186\/s13321-020-00445-4","journal-title":"J Cheminformatics"},{"issue":"6","key":"686_CR6","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1093\/bib\/bbab291","volume":"22","author":"B Zagidullin","year":"2021","unstructured":"Zagidullin B, Wang Z, Guan Y, Pitk\u00e4nen E, Tang J (2021) Comparative analysis of molecular fingerprints in prediction of drug combination effects. Brief Bioinform 22(6):1\u201315. https:\/\/doi.org\/10.1093\/bib\/bbab291","journal-title":"Brief Bioinform"},{"issue":"17","key":"686_CR7","doi-asserted-by":"publisher","first-page":"7393","DOI":"10.1021\/acs.jmedchem.7b00696","volume":"60","author":"SD Axen","year":"2017","unstructured":"Axen SD, Huang X-P, C\u00e1ceres EL, Gendelev L, Roth BL, Keiser MJ (2017) A simple representation of three-dimensional molecular structure. J Med Chem 60(17):7393\u20137409. https:\/\/doi.org\/10.1021\/acs.jmedchem.7b00696","journal-title":"J Med Chem"},{"issue":"10","key":"686_CR8","doi-asserted-by":"publisher","first-page":"3626","DOI":"10.3390\/ijms21103626","volume":"21","author":"A Fischer","year":"2020","unstructured":"Fischer A, Sellner M, Neranjan S, Smie\u0161ko M, Lill MA (2020) Potential inhibitors for novel coronavirus protease identified by virtual screening of 606 million compounds. Int J Mol Sci 21(10):3626. https:\/\/doi.org\/10.3390\/ijms21103626","journal-title":"Int J Mol Sci"},{"key":"686_CR9","doi-asserted-by":"publisher","first-page":"58","DOI":"10.1016\/j.ymeth.2014.08.005","volume":"71","author":"A Cereto-Massagu\u00e9","year":"2015","unstructured":"Cereto-Massagu\u00e9 A, Ojeda MJ, Valls C, Mulero M, Garcia-Vallv\u00e9 S, Pujadas G (2015) Molecular fingerprint similarity search in virtual screening. Methods 71:58\u201363. https:\/\/doi.org\/10.1016\/j.ymeth.2014.08.005","journal-title":"Methods"},{"issue":"1","key":"686_CR10","doi-asserted-by":"publisher","first-page":"12","DOI":"10.1186\/1752-153X-1-12","volume":"1","author":"F Fontaine","year":"2007","unstructured":"Fontaine F, Bolton E, Borodina Y, Bryant SH (2007) Fast 3D shape screening of large chemical databases through alignment-recycling. Chem Central J 1(1):12. https:\/\/doi.org\/10.1186\/1752-153X-1-12","journal-title":"Chem Central J"},{"issue":"6","key":"686_CR11","doi-asserted-by":"publisher","first-page":"2858","DOI":"10.1021\/acs.jcim.0c00161","volume":"60","author":"Y Chen","year":"2020","unstructured":"Chen Y, Mathai N, Kirchmair J (2020) Scope of 3D shape-based approaches in predicting the macromolecular targets of structurally complex small molecules including natural products and macrocyclic ligands. J Chem Inf Model 60(6):2858\u20132875. https:\/\/doi.org\/10.1021\/acs.jcim.0c00161","journal-title":"J Chem Inf Model"},{"key":"686_CR12","doi-asserted-by":"crossref","unstructured":"Sun Y, Cheng C, Zhang Y, Zhang C, Zheng L, Wang Z, Wei Y (2020) Circle loss: a unified perspective of pair similarity optimization. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp 6397\u20136406.https:\/\/doi.org\/10.48550\/arxiv.2002.10857","DOI":"10.1109\/CVPR42600.2020.00643"},{"key":"686_CR13","doi-asserted-by":"publisher","unstructured":"Su\u00e1rez-D\u00edaz JL, Garc\u00eda S, Herrera F (2018) A tutorial on distance metric learning: mathematical foundations, algorithms, experimental analysis, prospects and challenges (with Appendices on Mathematical Background and Detailed Algorithms Explanation). ArXiv. https:\/\/doi.org\/10.48550\/arxiv.1812.05944","DOI":"10.48550\/arxiv.1812.05944"},{"key":"686_CR14","doi-asserted-by":"publisher","unstructured":"Jaiswal A, Babu AR, Zadeh MZ, Banerjee D, Makedon F (2020) A survey on contrastive self-supervised learning. Technologies 9(1), 2. https:\/\/doi.org\/10.3390\/technologies9010002. arXiv:2011.00362","DOI":"10.3390\/technologies9010002"},{"key":"686_CR15","unstructured":"Gutmann M, Hyv\u00e4rinen A (2010) Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, vol. 9, pp. 297\u2013304. https:\/\/proceedings.mlr.press\/v9\/gutmann10a.html"},{"key":"686_CR16","doi-asserted-by":"publisher","unstructured":"Hjelm RD, Fedorov A, Lavoie-Marchildon S, Grewal K, Bachman P, Trischler A, Bengio Y (2018) Learning deep representations by mutual information estimation and maximization. In: 7th International Conference on Learning Representations, ICLR 2019. https:\/\/doi.org\/10.48550\/arxiv.1808.06670","DOI":"10.48550\/arxiv.1808.06670"},{"key":"686_CR17","doi-asserted-by":"crossref","unstructured":"Misra I, van der Maaten L (2019) Self-supervised learning of pretext-invariant representations. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp 6706\u20136716.https:\/\/doi.org\/10.48550\/arxiv.1912.01991","DOI":"10.1109\/CVPR42600.2020.00674"},{"key":"686_CR18","doi-asserted-by":"publisher","unstructured":"Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. 37th International Conference on Machine Learning, ICML 2020 PartF16814, pp 1575\u20131585. https:\/\/doi.org\/10.48550\/arxiv.2002.05709","DOI":"10.48550\/arxiv.2002.05709"},{"key":"686_CR19","first-page":"12559","volume":"33","author":"Y Rong","year":"2020","unstructured":"Rong Y, Bian Y, Xu T, Xie W, WEI Y, Huang W, Huang J, (2020) Self-supervised graph transformer on large-scale molecular data. Adv Neural Inf Process Syst 33:12559\u201312571","journal-title":"Adv Neural Inf Process Syst"},{"issue":"2","key":"686_CR20","doi-asserted-by":"publisher","first-page":"2000203","DOI":"10.1002\/minf.202000203","volume":"40","author":"D Koge","year":"2021","unstructured":"Koge D, Ono N, Huang M, Altaf-Ul-Amin M, Kanaya S (2021) Embedding of molecular structure using molecular hypergraph variational autoencoder with metric learning. Mol Inform 40(2):2000203. https:\/\/doi.org\/10.1002\/minf.202000203","journal-title":"Mol Inform"},{"key":"686_CR21","doi-asserted-by":"publisher","unstructured":"Wang S, Guo , Wang Y, Sun H, Huang J (2019) SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction. In: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 429\u2013436. ACM, New York, NY, USA. https:\/\/doi.org\/10.1145\/3307339.3342186","DOI":"10.1145\/3307339.3342186"},{"issue":"6","key":"686_CR22","doi-asserted-by":"publisher","first-page":"1692","DOI":"10.1039\/C8SC04175J","volume":"10","author":"R Winter","year":"2019","unstructured":"Winter R, Montanari F, No\u00e9 F, Clevert D-A (2019) Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem Sci 10(6):1692\u20131701. https:\/\/doi.org\/10.1039\/C8SC04175J","journal-title":"Chem Sci"},{"key":"686_CR23","doi-asserted-by":"publisher","unstructured":"G\u00f3mez-Bombarelli R, Wei JN, Duvenaud D, Hern\u00e1ndez-Lobato JM, S\u00e1nchez-Lengeling B, Sheberla D, Aguilera-Iparraguirre J, Hirzel TD, Adams RP, Aspuru-Guzik A (2018) Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Sci 4(2), 268\u2013276. https:\/\/doi.org\/10.1021\/acscentsci.7b00572. arXiv:1610.02415","DOI":"10.1021\/acscentsci.7b00572"},{"key":"686_CR24","unstructured":"Honda S, Shi S, Ueda HR (2019) SMILES Transformer: pre-trained molecular fingerprint for low data drug discovery. ArXiv arXiv:1911.04738"},{"key":"686_CR25","doi-asserted-by":"publisher","unstructured":"Bjerrum E, Sattarov B (2018) Improving chemical autoencoder latent space and molecular de novo generation diversity with heteroencoders. Biomolecules 8(4):131. https:\/\/doi.org\/10.3390\/biom8040131arXiv:1806.09300","DOI":"10.3390\/biom8040131"},{"key":"686_CR26","doi-asserted-by":"publisher","unstructured":"Hong SH, Ryu S, Lim J, Kim WY (2020) Molecular generative model based on an adversarially regularized autoencoder. J Chem Inf Model 60(1), 29\u201336. https:\/\/doi.org\/10.1021\/acs.jcim.9b00694. arXiv:1912.05617","DOI":"10.1021\/acs.jcim.9b00694"},{"key":"686_CR27","doi-asserted-by":"publisher","unstructured":"Yan C, Wang S, Yang J, Xu T, Huang J (2020) Re-balancing variational autoencoder loss for molecule sequence generation. In: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, vol. 20, pp 1\u20137. ACM, New York, NY, USA. https:\/\/doi.org\/10.1145\/3388440.3412458. arXiv:1910.00698","DOI":"10.1145\/3388440.3412458"},{"key":"686_CR28","doi-asserted-by":"publisher","unstructured":"Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: A unified embedding for face recognition and clustering. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 07-12-June, IEEE, pp 815\u2013823. https:\/\/doi.org\/10.1109\/CVPR.2015.7298682. arXiv:1503.03832. http:\/\/ieeexplore.ieee.org\/document\/7298682\/","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"686_CR29","doi-asserted-by":"publisher","unstructured":"Misra I, Girdhar R, Joulin A (2021) An end-to-end transformer model for 3D object detection. 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), 2886\u20132897. https:\/\/doi.org\/10.1109\/ICCV48922.2021.00290. arXiv:2109.08141","DOI":"10.1109\/ICCV48922.2021.00290"},{"key":"686_CR30","doi-asserted-by":"publisher","unstructured":"Shi Y, Wang Y, Wu C, Yeh C-F, Chan J, Zhang F, Le D, Seltzer M (2020) Emformer: efficient memory transformer based acoustic model for low latency streaming speech recognition. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-June, 6783\u20136787. https:\/\/doi.org\/10.1109\/ICASSP39728.2021.9414560. arXiv:2010.10759","DOI":"10.1109\/ICASSP39728.2021.9414560"},{"key":"686_CR31","doi-asserted-by":"publisher","unstructured":"Farahani M, Gharachorloo M, Farahani M, Manthouri M (2020) ParsBERT: Transformer-based Model for Persian Language Understanding. Neural Process Lett 53(6):3831\u20133847. https:\/\/doi.org\/10.1007\/s11063-021-10528-4arXiv:2005.12515","DOI":"10.1007\/s11063-021-10528-4"},{"issue":"1","key":"686_CR32","doi-asserted-by":"publisher","first-page":"19541","DOI":"10.1038\/s41598-021-98915-8","volume":"11","author":"MA Hannan","year":"2021","unstructured":"Hannan MA, How DNT, Lipu MSH, Mansor M, Ker PJ, Dong ZY, Sahari KSM, Tiong SK, Muttaqi KM, Mahlia TMI, Blaabjerg F (2021) Deep learning approach towards accurate state of charge estimation for lithium-ion batteries using self-supervised transformer model. Sci Rep 11(1):19541. https:\/\/doi.org\/10.1038\/s41598-021-98915-8","journal-title":"Sci Rep"},{"key":"686_CR33","doi-asserted-by":"publisher","unstructured":"Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference 1, pp 4171\u20134186. https:\/\/doi.org\/10.48550\/arxiv.1810.04805","DOI":"10.48550\/arxiv.1810.04805"},{"key":"686_CR34","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser \u0141, Polosukhin I (2017) Attention is all you need. In: Advances in Neural Information Processing Systems, Neural information processing systems foundation, vol. 2017-December, pp 5999\u20136009. arXiv:1706.03762. https:\/\/arxiv.org\/abs\/1706.03762v5"},{"issue":"11","key":"686_CR35","doi-asserted-by":"publisher","first-page":"2324","DOI":"10.1021\/acs.jcim.5b00559","volume":"55","author":"T Sterling","year":"2015","unstructured":"Sterling T, Irwin JJ (2015) ZINC 15 - ligand discovery for everyone. J Chem Inf Model 55(11):2324\u20132337. https:\/\/doi.org\/10.1021\/acs.jcim.5b00559","journal-title":"J Chem Inf Model"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00686-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-023-00686-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00686-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,8]],"date-time":"2023-02-08T11:16:45Z","timestamp":1675855005000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-023-00686-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,8]]},"references-count":35,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["686"],"URL":"https:\/\/doi.org\/10.1186\/s13321-023-00686-z","relation":{},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,8]]},"assertion":[{"value":"19 June 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 January 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 February 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"18"}}