{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,30]],"date-time":"2026-03-30T17:13:57Z","timestamp":1774890837012,"version":"3.50.1"},"reference-count":84,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,2,6]],"date-time":"2023-02-06T00:00:00Z","timestamp":1675641600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,2,6]],"date-time":"2023-02-06T00:00:00Z","timestamp":1675641600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The identification of drug\/compound\u2013target interactions (DTIs) constitutes the basis of drug discovery, for which computational predictive approaches have been developed. As a relatively new data-driven paradigm, proteochemometric (PCM) modeling utilizes both protein and compound properties as a pair at the input level and processes them via statistical\/machine learning. The representation of input samples (i.e., proteins and their ligands) in the form of quantitative feature vectors is crucial for the extraction of interaction-related properties during the artificial learning and subsequent prediction of DTIs. Lately, the representation learning approach, in which input samples are automatically featurized via training and applying a machine\/deep learning model, has been utilized in biomedical sciences. 
In this study, we performed a comprehensive investigation of different computational approaches\/techniques for protein featurization (including both conventional approaches and the novel learned embeddings), data preparation and exploration, machine learning-based modeling, and performance evaluation with the aim of achieving better data representations and more successful learning in DTI prediction. For this, we first constructed realistic and challenging benchmark datasets on small, medium, and large scales to be used as reliable gold standards for specific DTI modeling tasks. We developed and applied a network analysis-based splitting strategy to divide datasets into structurally different training and test folds. Using these datasets together with various featurization methods, we trained and tested DTI prediction models and evaluated their performance from different angles. Our main findings can be summarized under three items: (i) random splitting of datasets into train and test folds leads to near-complete data memorization and produces highly over-optimistic results, and should therefore be avoided; (ii) learned protein sequence embeddings work well in DTI prediction and offer high potential, even though interaction-related properties (e.g.,\u00a0structures) of proteins are unused during their self-supervised model training; and (iii) during the learning process,\u00a0PCM models tend to rely heavily on\u00a0compound features while partially\u00a0ignoring\u00a0protein features, primarily due to the inherent bias in DTI data, indicating the requirement for new and unbiased datasets. 
We hope this study will aid researchers in designing robust and high-performing data-driven DTI prediction systems that have real-world translational value in drug discovery.<\/jats:p>","DOI":"10.1186\/s13321-023-00689-w","type":"journal-article","created":{"date-parts":[[2023,2,6]],"date-time":"2023-02-06T06:03:02Z","timestamp":1675663382000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":30,"title":["How to approach machine learning-based prediction of drug\/compound\u2013target interactions"],"prefix":"10.1186","volume":"15","author":[{"given":"Heval","family":"Atas Guvenilir","sequence":"first","affiliation":[]},{"given":"Tunca","family":"Do\u011fan","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,2,6]]},"reference":[{"key":"689_CR1","doi-asserted-by":"publisher","first-page":"1878","DOI":"10.1093\/bib\/bby061","volume":"20","author":"AS Rifaioglu","year":"2019","unstructured":"Rifaioglu AS, Atas H, Martin MJ et al (2019) Recent applications of deep learning and machine intelligence on in silico drug discovery: methods, tools and databases. Brief Bioinform 20:1878\u20131912. https:\/\/doi.org\/10.1093\/bib\/bby061","journal-title":"Brief Bioinform"},{"key":"689_CR2","doi-asserted-by":"publisher","first-page":"2531","DOI":"10.1039\/C9SC03414E","volume":"11","author":"AS Rifaioglu","year":"2020","unstructured":"Rifaioglu AS, Nalbat E, Atalay V et al (2020) DEEPScreen: high performance drug\u2013target interaction prediction with convolutional neural networks using 2-D structural compound representations. Chem Sci 11:2531\u20132557. 
https:\/\/doi.org\/10.1039\/C9SC03414E","journal-title":"Chem Sci"},{"key":"689_CR3","doi-asserted-by":"publisher","first-page":"2839","DOI":"10.2174\/09298673113209990001","volume":"20","author":"A Lavecchia","year":"2013","unstructured":"Lavecchia A, Di Giovanni C (2013) Virtual screening strategies in drug discovery: a critical review. Curr Med Chem 20:2839\u20132860","journal-title":"Curr Med Chem"},{"key":"689_CR4","doi-asserted-by":"publisher","first-page":"24","DOI":"10.1039\/C4MD00216D","volume":"6","author":"I Cort\u00e9s-Ciriano","year":"2015","unstructured":"Cort\u00e9s-Ciriano I, Ain QU, Subramanian V et al (2015) Polypharmacology modelling using proteochemometrics (PCM): recent methodological developments, applications to target families, and future prospects. Medchemcomm 6:24\u201350. https:\/\/doi.org\/10.1039\/C4MD00216D","journal-title":"Medchemcomm"},{"key":"689_CR5","doi-asserted-by":"publisher","first-page":"487","DOI":"10.1093\/bioinformatics\/bts412","volume":"28","author":"Y Tabei","year":"2012","unstructured":"Tabei Y, Pauwels E, Stoven V et al (2012) Identification of chemogenomic features from drug\u2013target interaction networks using interpretable classifiers. Bioinformatics 28:487\u2013494. https:\/\/doi.org\/10.1093\/bioinformatics\/bts412","journal-title":"Bioinformatics"},{"key":"689_CR6","doi-asserted-by":"publisher","first-page":"125","DOI":"10.1093\/bib\/bbw004","volume":"18","author":"T Qiu","year":"2017","unstructured":"Qiu T, Qiu J, Feng J et al (2017) The recent progress in proteochemometric modelling: focusing on target descriptors, cross-term descriptors and application scope. Brief Bioinform 18:125\u2013136. 
https:\/\/doi.org\/10.1093\/bib\/bbw004","journal-title":"Brief Bioinform"},{"key":"689_CR7","doi-asserted-by":"publisher","first-page":"58","DOI":"10.1016\/j.ymeth.2014.08.005","volume":"71","author":"A Cereto-Massagu\u00e9","year":"2015","unstructured":"Cereto-Massagu\u00e9 A, Ojeda MJ, Valls C et al (2015) Molecular fingerprint similarity search in virtual screening. Methods 71:58\u201363. https:\/\/doi.org\/10.1016\/j.ymeth.2014.08.005","journal-title":"Methods"},{"key":"689_CR8","doi-asserted-by":"publisher","first-page":"137","DOI":"10.1517\/17460441.2016.1117070","volume":"11","author":"I Muegge","year":"2016","unstructured":"Muegge I, Mukherjee P (2016) An overview of molecular fingerprint similarity search in virtual screening. Expert Opin Drug Discov 11:137\u2013148. https:\/\/doi.org\/10.1517\/17460441.2016.1117070","journal-title":"Expert Opin Drug Discov"},{"key":"689_CR9","doi-asserted-by":"publisher","first-page":"719","DOI":"10.1002\/minf.201400066","volume":"33","author":"R Sawada","year":"2014","unstructured":"Sawada R, Kotera M, Yamanishi Y (2014) Benchmarking a wide range of chemical descriptors for drug\u2013target interaction prediction using a chemogenomic approach. Mol Inform 33:719\u2013731. https:\/\/doi.org\/10.1002\/minf.201400066","journal-title":"Mol Inform"},{"key":"689_CR10","doi-asserted-by":"publisher","first-page":"10","DOI":"10.1093\/bioinformatics\/bth466","volume":"21","author":"K-C Chou","year":"2005","unstructured":"Chou K-C (2005) Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21:10\u201319. https:\/\/doi.org\/10.1093\/bioinformatics\/bth466","journal-title":"Bioinformatics"},{"key":"689_CR11","doi-asserted-by":"publisher","first-page":"300","DOI":"10.1186\/1471-2105-8-300","volume":"8","author":"SA Ong","year":"2007","unstructured":"Ong SA, Lin HH, Chen YZ et al (2007) Efficacy of different protein descriptors in predicting protein functional families. 
BMC Bioinformatics 8:300. https:\/\/doi.org\/10.1186\/1471-2105-8-300","journal-title":"BMC Bioinformatics"},{"key":"689_CR12","doi-asserted-by":"publisher","first-page":"41","DOI":"10.1186\/1758-2946-5-41","volume":"5","author":"GJP Van Westen","year":"2013","unstructured":"Van Westen GJP, Swier RF, Cortes-Ciriano I et al (2013) Benchmarking of protein descriptor sets in proteochemometric modeling (part 1): Modeling performance of 13 amino acid descriptor sets. J Cheminform 5:41. https:\/\/doi.org\/10.1186\/1758-2946-5-41","journal-title":"J Cheminform"},{"key":"689_CR13","doi-asserted-by":"publisher","first-page":"231","DOI":"10.1186\/s12859-016-1110-x","volume":"17","author":"M Sun","year":"2016","unstructured":"Sun M, Wang X, Zou C et al (2016) Accurate prediction of RNA-binding protein residues with two discriminative structural descriptors. BMC Bioinformatics 17:231. https:\/\/doi.org\/10.1186\/s12859-016-1110-x","journal-title":"BMC Bioinformatics"},{"key":"689_CR14","doi-asserted-by":"publisher","first-page":"212","DOI":"10.1186\/1471-2105-13-212","volume":"13","author":"D Wu","year":"2012","unstructured":"Wu D, Huang Q, Zhang Y et al (2012) Screening of selective histone deacetylase inhibitors by proteochemometric modeling. BMC Bioinformatics 13:212. https:\/\/doi.org\/10.1186\/1471-2105-13-212","journal-title":"BMC Bioinformatics"},{"key":"689_CR15","doi-asserted-by":"publisher","first-page":"648","DOI":"10.1089\/omi.2015.0095","volume":"19","author":"V Saravanan","year":"2015","unstructured":"Saravanan V, Gautham N (2015) Harnessing computational biology for exact linear B-cell epitope prediction: a novel amino acid composition-based feature descriptor. OMICS 19:648\u2013658. 
https:\/\/doi.org\/10.1089\/omi.2015.0095","journal-title":"OMICS"},{"key":"689_CR16","doi-asserted-by":"publisher","first-page":"133","DOI":"10.1089\/cmb.2010.0213","volume":"18","author":"L Perlman","year":"2011","unstructured":"Perlman L, Gottlieb A, Atias N et al (2011) Combining drug and gene similarity measures for drug\u2013target elucidation. J Comput Biol 18:133\u2013145. https:\/\/doi.org\/10.1089\/cmb.2010.0213","journal-title":"J Comput Biol"},{"key":"689_CR17","doi-asserted-by":"publisher","first-page":"e1009171","DOI":"10.1371\/JOURNAL.PCBI.1009171","volume":"17","author":"T Do\u01e7an","year":"2021","unstructured":"Do\u01e7an T, G\u00fczelcan EA, Baumann M et al (2021) Protein domain-based prediction of drug\/compound\u2013target interactions and experimental validation on LIM kinases. PLoS Comput Biol 17:e1009171. https:\/\/doi.org\/10.1371\/JOURNAL.PCBI.1009171","journal-title":"PLoS Comput Biol"},{"key":"689_CR18","doi-asserted-by":"publisher","first-page":"1183","DOI":"10.1021\/ci100476q","volume":"51","author":"Y Yamanishi","year":"2011","unstructured":"Yamanishi Y, Pauwels E, Saigo H, Stoven V (2011) Extracting sets of chemical substructures and protein domains governing drug\u2013target interactions. J Chem Inf Model 51:1183\u20131194. https:\/\/doi.org\/10.1021\/ci100476q","journal-title":"J Chem Inf Model"},{"key":"689_CR19","doi-asserted-by":"publisher","first-page":"e5298","DOI":"10.7717\/PEERJ.5298","volume":"6","author":"T Do\u011fan","year":"2018","unstructured":"Do\u011fan T (2018) HPO2GO: prediction of human phenotype ontology term associations for proteins using cross ontology annotation co-occurrences. PeerJ 6:e5298. 
https:\/\/doi.org\/10.7717\/PEERJ.5298","journal-title":"PeerJ"},{"key":"689_CR20","doi-asserted-by":"publisher","first-page":"2264","DOI":"10.1093\/BIOINFORMATICS\/BTW114","volume":"32","author":"T Do\u01e7an","year":"2016","unstructured":"Do\u01e7an T, Macdougall A, Saidi R et al (2016) UniProt-DAAC: domain architecture alignment and classification, a new method for automatic functional annotation in UniProtKB. Bioinformatics 32:2264. https:\/\/doi.org\/10.1093\/BIOINFORMATICS\/BTW114","journal-title":"Bioinformatics"},{"key":"689_CR21","doi-asserted-by":"publisher","first-page":"756","DOI":"10.17706\/jsw.11.8.756-767","volume":"11","author":"H Saini","year":"2016","unstructured":"Saini H, Raicar G, Lal S et al (2016) Protein fold recognition using genetic algorithm optimized voting scheme and profile bigram. J Softw 11:756\u2013767. https:\/\/doi.org\/10.17706\/jsw.11.8.756-767","journal-title":"J Softw"},{"key":"689_CR22","doi-asserted-by":"publisher","first-page":"227","DOI":"10.1038\/s42256-022-00457-9","volume":"4","author":"S Unsal","year":"2022","unstructured":"Unsal S, Atas H, Albayrak M et al (2022) Learning functional properties of proteins with language models. Nat Mach Intell 4:227","journal-title":"Nat Mach Intell"},{"key":"689_CR23","doi-asserted-by":"publisher","first-page":"141287","DOI":"10.1371\/journal.pone.0141287","volume":"10","author":"E Asgari","year":"2015","unstructured":"Asgari E, Mofrad MRK (2015) Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 10:141287. https:\/\/doi.org\/10.1371\/journal.pone.0141287","journal-title":"PLoS ONE"},{"key":"689_CR24","doi-asserted-by":"publisher","first-page":"1315","DOI":"10.1038\/s41592-019-0598-1","volume":"16","author":"EC Alley","year":"2019","unstructured":"Alley EC, Khimulya G, Biswas S et al (2019) Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 16:1315\u20131322. 
https:\/\/doi.org\/10.1038\/s41592-019-0598-1","journal-title":"Nat Methods"},{"key":"689_CR25","doi-asserted-by":"publisher","first-page":"723","DOI":"10.1186\/s12859-019-3220-8","volume":"20","author":"M Heinzinger","year":"2019","unstructured":"Heinzinger M, Elnaggar A, Wang Y et al (2019) Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 20:723. https:\/\/doi.org\/10.1186\/s12859-019-3220-8","journal-title":"BMC Bioinformatics"},{"key":"689_CR26","doi-asserted-by":"publisher","first-page":"e0220182","DOI":"10.1371\/JOURNAL.PONE.0220182","volume":"14","author":"C Mirabello","year":"2019","unstructured":"Mirabello C, Wallner B (2019) rawMSA: end-to-end deep learning using raw multiple sequence alignments. PLoS ONE 14:e0220182. https:\/\/doi.org\/10.1371\/JOURNAL.PONE.0220182","journal-title":"PLoS ONE"},{"key":"689_CR27","doi-asserted-by":"crossref","unstructured":"Rao R, Bhattacharya N, Thomas N et al (2019) Evaluating protein transfer learning with TAPE. In: 33rd Conference on Neural Information Processing Systems","DOI":"10.1101\/676825"},{"key":"689_CR28","doi-asserted-by":"publisher","first-page":"12882","DOI":"10.3390\/IJMS222312882\/S1","volume":"22","author":"PT Kim","year":"2021","unstructured":"Kim PT, Winter R, Clevert DA (2021) Unsupervised representation learning for proteochemometric modeling. Int J Mol Sci 22:12882. https:\/\/doi.org\/10.3390\/IJMS222312882\/S1","journal-title":"Int J Mol Sci"},{"key":"689_CR29","first-page":"04166","volume":"1902","author":"H \u00f6zt\u00fcrk","year":"2019","unstructured":"\u00f6zt\u00fcrk H, Ozkirimli E, \u00f6zg\u00fcr A (2019) WideDTA: prediction of drug-target binding affinity. 
ArXiv 1902:04166","journal-title":"ArXiv"},{"key":"689_CR30","doi-asserted-by":"publisher","first-page":"693","DOI":"10.1093\/BIOINFORMATICS\/BTAA858","volume":"37","author":"AS Rifaioglu","year":"2021","unstructured":"Rifaioglu AS, Cetin Atalay R, Cansen Kahraman D et al (2021) MDeePred: novel multi-channel protein featurization for deep learning-based binding affinity prediction in drug discovery. Bioinformatics 37:693\u2013704. https:\/\/doi.org\/10.1093\/BIOINFORMATICS\/BTAA858","journal-title":"Bioinformatics"},{"key":"689_CR31","doi-asserted-by":"publisher","first-page":"434","DOI":"10.1016\/J.COMPBIOLCHEM.2018.03.009","volume":"74","author":"A Dutta","year":"2018","unstructured":"Dutta A, Dubey T, Singh KK, Anand A (2018) SpliceVec: distributed feature representations for splice junction prediction. Comput Biol Chem 74:434\u2013441. https:\/\/doi.org\/10.1016\/J.COMPBIOLCHEM.2018.03.009","journal-title":"Comput Biol Chem"},{"key":"689_CR32","doi-asserted-by":"publisher","first-page":"82","DOI":"10.1016\/j.ymeth.2018.05.026","volume":"145","author":"R You","year":"2018","unstructured":"You R, Huang X, Zhu S (2018) DeepText2GO: improving large-scale protein function prediction with deep semantic text representation. Methods 145:82\u201390. https:\/\/doi.org\/10.1016\/j.ymeth.2018.05.026","journal-title":"Methods"},{"key":"689_CR33","doi-asserted-by":"publisher","first-page":"2401","DOI":"10.1093\/BIOINFORMATICS\/BTAA003","volume":"36","author":"N Strodthoff","year":"2020","unstructured":"Strodthoff N, Wagner P, Wenzel M, Samek W (2020) UDSMProt: universal deep sequence models for protein classification. Bioinformatics 36:2401. 
https:\/\/doi.org\/10.1093\/BIOINFORMATICS\/BTAA003","journal-title":"Bioinformatics"},{"key":"689_CR34","doi-asserted-by":"publisher","first-page":"1023","DOI":"10.1039\/C4IB00175C","volume":"6","author":"QU Ain","year":"2014","unstructured":"Ain QU, M\u00e9ndez-Lucio O, Ciriano IC et al (2014) Modelling ligand selectivity of serine proteases using integrative proteochemometric approaches improves model performance and allows the multi-target dependent interpretation of features. Integr Biol 6:1023\u20131033. https:\/\/doi.org\/10.1039\/C4IB00175C","journal-title":"Integr Biol"},{"key":"689_CR35","doi-asserted-by":"publisher","first-page":"42","DOI":"10.1186\/1758-2946-5-42","volume":"5","author":"GJ Van Westen","year":"2013","unstructured":"Van Westen GJ, Swier RF, Cortes-Ciriano I et al (2013) Benchmarking of protein descriptor sets in proteochemometric modeling (part 2): modeling performance of 13 amino acid descriptor sets. J Cheminform 5:42. https:\/\/doi.org\/10.1186\/1758-2946-5-42","journal-title":"J Cheminform"},{"key":"689_CR36","doi-asserted-by":"publisher","first-page":"2773","DOI":"10.1021\/acs.jcim.0c00073","volume":"60","author":"Y Xu","year":"2020","unstructured":"Xu Y, Verma D, Sheridan RP et al (2020) Deep dive into machine learning models for protein engineering. J Chem Inf Model 60:2773\u20132790. https:\/\/doi.org\/10.1021\/acs.jcim.0c00073","journal-title":"J Chem Inf Model"},{"key":"689_CR37","doi-asserted-by":"publisher","first-page":"45","DOI":"10.1186\/s13321-017-0232-0","volume":"9","author":"EB Lenselink","year":"2017","unstructured":"Lenselink EB, Ten Dijke N, Bongers B et al (2017) Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J Cheminform 9:45. 
https:\/\/doi.org\/10.1186\/s13321-017-0232-0","journal-title":"J Cheminform"},{"key":"689_CR38","doi-asserted-by":"publisher","first-page":"4490","DOI":"10.1093\/bioinformatics\/btaa495","volume":"36","author":"S Liang","year":"2020","unstructured":"Liang S, Yu H (2020) Revealing new therapeutic opportunities through drug target prediction: a class imbalance-tolerant machine learning approach. Bioinformatics 36:4490\u20134497. https:\/\/doi.org\/10.1093\/bioinformatics\/btaa495","journal-title":"Bioinformatics"},{"key":"689_CR39","doi-asserted-by":"publisher","first-page":"5441","DOI":"10.1039\/c8sc00148k","volume":"9","author":"A Mayr","year":"2018","unstructured":"Mayr A, Klambauer G, Unterthiner T et al (2018) Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem Sci 9:5441\u20135451. https:\/\/doi.org\/10.1039\/c8sc00148k","journal-title":"Chem Sci"},{"key":"689_CR40","doi-asserted-by":"publisher","first-page":"513","DOI":"10.1039\/C7SC02664A","volume":"9","author":"Z Wu","year":"2018","unstructured":"Wu Z, Ramsundar B, Feinberg EN et al (2018) MoleculeNet: a benchmark for molecular machine learning. Chem Sci 9:513\u2013530. https:\/\/doi.org\/10.1039\/C7SC02664A","journal-title":"Chem Sci"},{"key":"689_CR41","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41467-021-27137-3","volume":"12","author":"Q Ye","year":"2021","unstructured":"Ye Q, Hsieh CY, Yang Z et al (2021) A unified drug\u2013target interaction prediction framework based on knowledge graph and recommendation system. Nat Commun 12:1\u201312. https:\/\/doi.org\/10.1038\/s41467-021-27137-3","journal-title":"Nat Commun"},{"key":"689_CR42","doi-asserted-by":"publisher","first-page":"276","DOI":"10.1016\/S0168-9525(00)02024-2","volume":"16","author":"P Rice","year":"2000","unstructured":"Rice P, Longden I, Bleasby A (2000) EMBOSS: the european molecular biology open software suite. Trends Genet 16:276\u2013277. 
https:\/\/doi.org\/10.1016\/S0168-9525(00)02024-2","journal-title":"Trends Genet"},{"issue":"1","key":"689_CR43","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/S13321-019-0398-8","volume":"11","author":"A Dalke","year":"2019","unstructured":"Dalke A (2019) The chemfp project. J Cheminformat 11(1):1\u201321. https:\/\/doi.org\/10.1186\/S13321-019-0398-8","journal-title":"J Cheminformat"},{"key":"689_CR44","doi-asserted-by":"publisher","DOI":"10.4230\/DAGREP.5.4.18","author":"T Darrell","year":"2015","unstructured":"Darrell T, Kloft M, Pontil M et al (2015) Machine learning with interdependent and non-identically distributed data (Dagstuhl Seminar 15152). Dagstuhl Rep. https:\/\/doi.org\/10.4230\/DAGREP.5.4.18","journal-title":"Dagstuhl Rep"},{"key":"689_CR45","doi-asserted-by":"publisher","first-page":"e5518","DOI":"10.7717\/PEERJ.5518\/SUPP-1","volume":"2018","author":"T Hengl","year":"2018","unstructured":"Hengl T, Nussbaum M, Wright MN et al (2018) Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables. PeerJ 2018:e5518. https:\/\/doi.org\/10.7717\/PEERJ.5518\/SUPP-1","journal-title":"PeerJ"},{"key":"689_CR46","doi-asserted-by":"publisher","unstructured":"Dharani G, Nair NG, Satpathy P, Christopher J (2019) Covariate Shift: a review and analysis on classifiers. In: 2019 Global Conference for Advancement in Technology, GCAT 2019. https:\/\/doi.org\/10.1109\/GCAT47503.2019.8978471","DOI":"10.1109\/GCAT47503.2019.8978471"},{"key":"689_CR47","doi-asserted-by":"publisher","first-page":"2756","DOI":"10.1093\/bioinformatics\/btx302","volume":"33","author":"J Wang","year":"2017","unstructured":"Wang J, Yang B, Revote J et al (2017) POSSUM: a bioinformatics toolkit for generating numerical sequence feature descriptors based on PSSM profiles. Bioinformatics 33:2756\u20132758. 
https:\/\/doi.org\/10.1093\/bioinformatics\/btx302","journal-title":"Bioinformatics"},{"key":"689_CR48","doi-asserted-by":"publisher","first-page":"2499","DOI":"10.1093\/bioinformatics\/bty140","volume":"34","author":"Z Chen","year":"2018","unstructured":"Chen Z, Zhao P, Li F et al (2018) iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 34:2499\u20132502. https:\/\/doi.org\/10.1093\/bioinformatics\/bty140","journal-title":"Bioinformatics"},{"issue":"1","key":"689_CR49","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41467-021-23165-1","volume":"12","author":"A Cicho\u0144ska","year":"2021","unstructured":"Cicho\u0144ska A, Ravikumar B, Allaway RJ et al (2021) Crowdsourced mapping of unexplored target space of kinase inhibitors. Nat Commun 12(1):1\u201318. https:\/\/doi.org\/10.1038\/s41467-021-23165-1","journal-title":"Nat Commun"},{"key":"689_CR50","doi-asserted-by":"publisher","first-page":"893","DOI":"10.1080\/1062936X20161250229","volume":"27","author":"T Hanser","year":"2016","unstructured":"Hanser T, Barber C, Marchaland JF, Werner S (2016) Applicability domain: towards a more formal definition. SAR QSAR Environ Res 27:893\u2013909. https:\/\/doi.org\/10.1080\/1062936X20161250229","journal-title":"SAR QSAR Environ Res"},{"key":"689_CR51","doi-asserted-by":"publisher","first-page":"4791","DOI":"10.3390\/MOLECULES17054791","volume":"17","author":"F Sahigara","year":"2012","unstructured":"Sahigara F, Mansouri K, Ballabio D et al (2012) Comparison of different approaches to define the applicability domain of QSAR models. Molecules 17:4791. 
https:\/\/doi.org\/10.3390\/MOLECULES17054791","journal-title":"Molecules"},{"key":"689_CR52","doi-asserted-by":"publisher","first-page":"1037","DOI":"10.1039\/C6MD00701E","volume":"8","author":"V Subramanian","year":"2017","unstructured":"Subramanian V, Ain QU, Henno H et al (2017) 3D proteochemometrics: using three-dimensional information of proteins and ligands to address aspects of the selectivity of serine proteases. Medchemcomm 8:1037. https:\/\/doi.org\/10.1039\/C6MD00701E","journal-title":"Medchemcomm"},{"key":"689_CR53","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/1758-2946-6-35\/FIGURES\/6","volume":"6","author":"I Cortes-Ciriano","year":"2014","unstructured":"Cortes-Ciriano I, Van Westen GJP, Lenselink EB et al (2014) Proteochemometric modeling in a Bayesian framework. J Cheminform 6:1\u201316. https:\/\/doi.org\/10.1186\/1758-2946-6-35\/FIGURES\/6","journal-title":"J Cheminform"},{"key":"689_CR54","doi-asserted-by":"publisher","first-page":"e96","DOI":"10.1093\/nar\/gkab543","volume":"49","author":"T Do\u01e7an","year":"2021","unstructured":"Do\u01e7an T, Atas H, Joshi V et al (2021) CROssBAR: comprehensive resource of biomedical relations with knowledge graph representations. Nucleic Acids Res 49:e96. https:\/\/doi.org\/10.1093\/nar\/gkab543","journal-title":"Nucleic Acids Res"},{"key":"689_CR55","doi-asserted-by":"publisher","first-page":"D945","DOI":"10.1093\/nar\/gkw1074","volume":"45","author":"A Gaulton","year":"2017","unstructured":"Gaulton A, Hersey A, Nowotka M et al (2017) The ChEMBL database in 2017. Nucleic Acids Res 45:D945\u2013D954. https:\/\/doi.org\/10.1093\/nar\/gkw1074","journal-title":"Nucleic Acids Res"},{"key":"689_CR56","doi-asserted-by":"publisher","first-page":"591","DOI":"10.12688\/f1000research.8357.2","volume":"5","author":"S Jasial","year":"2016","unstructured":"Jasial S, Hu Y, Vogt M, Bajorath J (2016) Activity-relevant similarity values for fingerprints and implications for similarity searching. 
F1000Res 5:591. https:\/\/doi.org\/10.12688\/f1000research.8357.2","journal-title":"F1000Res"},{"key":"689_CR57","doi-asserted-by":"publisher","unstructured":"The UniProt Consortium (2021) UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 49. https:\/\/doi.org\/10.1093\/nar\/gkaa1100","DOI":"10.1093\/nar\/gkaa1100"},{"key":"689_CR58","doi-asserted-by":"publisher","first-page":"1046","DOI":"10.1038\/nbt.1990","volume":"29","author":"MI Davis","year":"2011","unstructured":"Davis MI, Hunt JP, Herrgard S et al (2011) Comprehensive analysis of kinase inhibitor selectivity. Nat Biotechnol 29:1046\u20131051. https:\/\/doi.org\/10.1038\/nbt.1990","journal-title":"Nat Biotechnol"},{"key":"689_CR59","doi-asserted-by":"publisher","first-page":"i821","DOI":"10.1093\/BIOINFORMATICS\/BTY593","volume":"34","author":"H \u00f6zt\u00fcrk","year":"2018","unstructured":"\u00f6zt\u00fcrk H, \u00f6zg\u00fcr A, Ozkirimli E (2018) DeepDTA: deep drug\u2013target binding affinity prediction. Bioinformatics 34:i821\u2013i829. https:\/\/doi.org\/10.1093\/BIOINFORMATICS\/BTY593","journal-title":"Bioinformatics"},{"key":"689_CR60","doi-asserted-by":"publisher","first-page":"926","DOI":"10.1093\/bioinformatics\/btu739","volume":"31","author":"BE Suzek","year":"2015","unstructured":"Suzek BE, Wang Y, Huang H et al (2015) UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31:926\u2013932. https:\/\/doi.org\/10.1093\/bioinformatics\/btu739","journal-title":"Bioinformatics"},{"key":"689_CR61","unstructured":"Landrum G (2016) RDKit: Open-Source Cheminformatics Software. http:\/\/www.rdkit.org\/"},{"key":"689_CR62","doi-asserted-by":"crossref","unstructured":"Hagberg A, Swart P, S Chult D (2008) Exploring Network Structure, Dynamics, and Function using NetworkX. 
United States","DOI":"10.25080\/TCWV9851"},{"key":"689_CR63","doi-asserted-by":"publisher","first-page":"P10008","DOI":"10.1088\/1742-5468\/2008\/10\/P10008","volume":"2008","author":"VD Blondel","year":"2008","unstructured":"Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008:P10008. https:\/\/doi.org\/10.1088\/1742-5468\/2008\/10\/P10008","journal-title":"J Stat Mech Theory Exp"},{"key":"689_CR64","doi-asserted-by":"publisher","first-page":"401","DOI":"10.1002\/(SICI)1097-0134(19990601)35:4<401::AID-PROT3>3.0.CO;2-K","volume":"35","author":"I Dubchak","year":"1999","unstructured":"Dubchak I, Muchnik I, Mayor C et al (1999) Recognition of a protein fold in the context of the SCOP classification. Proteins Struct Funct Genetics 35:401\u2013407. https:\/\/doi.org\/10.1002\/(SICI)1097-0134(19990601)35:4<401::AID-PROT3>3.0.CO;2-K","journal-title":"Proteins Struct Funct Genetics"},{"key":"689_CR65","doi-asserted-by":"publisher","first-page":"4337","DOI":"10.1073\/pnas.0607879104","volume":"104","author":"J Shen","year":"2007","unstructured":"Shen J, Zhang J, Luo X et al (2007) Predicting protein\u2013protein interactions based only on sequences information. Proc Natl Acad Sci USA 104:4337\u20134341. https:\/\/doi.org\/10.1073\/pnas.0607879104","journal-title":"Proc Natl Acad Sci USA"},{"key":"689_CR66","doi-asserted-by":"publisher","first-page":"115","DOI":"10.2307\/2986645","volume":"5","author":"RC Geary","year":"1954","unstructured":"Geary RC (1954) The contiguity ratio and statistical mapping. Incorporated Statist 5:115\u2013146","journal-title":"Incorporated Statist"},{"key":"689_CR67","doi-asserted-by":"publisher","first-page":"W32","DOI":"10.1093\/nar\/gkr284","volume":"34","author":"ZR Li","year":"2006","unstructured":"Li ZR, Lin HH, Han LY et al (2006) PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. 
Nucleic Acids Res 34:W32\u2013W37. https:\/\/doi.org\/10.1093\/nar\/gkr284","journal-title":"Nucleic Acids Res"},{"key":"689_CR68","doi-asserted-by":"publisher","first-page":"D427","DOI":"10.1093\/nar\/gky995","volume":"47","author":"S El-Gebali","year":"2019","unstructured":"El-Gebali S, Mistry J, Bateman A et al (2019) The Pfam protein families database in 2019. Nucleic Acids Res 47:D427\u2013D432. https:\/\/doi.org\/10.1093\/nar\/gky995","journal-title":"Nucleic Acids Res"},{"key":"689_CR69","doi-asserted-by":"publisher","first-page":"i221","DOI":"10.1093\/bioinformatics\/btv256","volume":"31","author":"H Liu","year":"2015","unstructured":"Liu H, Sun J, Guan J et al (2015) Improving compound\u2013protein interaction prediction by building up highly credible negative samples. Bioinformatics 31:i221\u2013i229. https:\/\/doi.org\/10.1093\/bioinformatics\/btv256","journal-title":"Bioinformatics"},{"key":"689_CR70","doi-asserted-by":"publisher","first-page":"335","DOI":"10.1016\/S0006-3495(94)80782-9","volume":"66","author":"G Schneider","year":"1994","unstructured":"Schneider G, Wrede P (1994) The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution: de novo design of an idealized leader peptidase cleavage site. Biophys J 66:335\u2013344","journal-title":"Biophys J"},{"key":"689_CR71","doi-asserted-by":"publisher","first-page":"477","DOI":"10.1006\/bbrc.2000.3815","volume":"278","author":"K-C Chou","year":"2000","unstructured":"Chou K-C (2000) Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem Biophys Res Commun 278:477\u2013483. 
https:\/\/doi.org\/10.1006\/bbrc.2000.3815","journal-title":"Biochem Biophys Res Commun"},{"key":"689_CR72","doi-asserted-by":"publisher","first-page":"122","DOI":"10.1016\/j.compbiolchem.2007.11.004","volume":"32","author":"OS Sarac","year":"2008","unstructured":"Sarac OS, G\u00fcrsoy-Y\u00fcz\u00fcg\u00fcll\u00fc O, Cetin-Atalay R, Atalay V (2008) Subsequence-based feature map for protein function classification. Comput Biol Chem 32:122\u2013130. https:\/\/doi.org\/10.1016\/j.compbiolchem.2007.11.004","journal-title":"Comput Biol Chem"},{"key":"689_CR73","doi-asserted-by":"publisher","first-page":"135","DOI":"10.1002\/PROT.25416","volume":"86","author":"AS Rifaioglu","year":"2018","unstructured":"Rifaioglu AS, Do\u011fan T, Sara\u00e7 \u00d6S et al (2018) Large-scale automated function prediction of protein sequences and an experimental case study validation on PTEN transcript variants. Proteins Struct Funct Bioinformat 86:135\u2013151. https:\/\/doi.org\/10.1002\/PROT.25416","journal-title":"Proteins Struct Funct Bioinformat"},{"key":"689_CR74","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/S12859-018-2368-Y","volume":"19","author":"A Dalkiran","year":"2018","unstructured":"Dalkiran A, Rifaioglu AS, Martin MJ et al (2018) ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC Bioinformatics 19:1\u201313. https:\/\/doi.org\/10.1186\/S12859-018-2368-Y","journal-title":"BMC Bioinformatics"},{"key":"689_CR75","doi-asserted-by":"publisher","first-page":"D202","DOI":"10.1093\/nar\/gkm998","volume":"36","author":"S Kawashima","year":"2008","unstructured":"Kawashima S, Pokarowski P, Pokarowska M et al (2008) AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 36:D202\u2013D205. 
https:\/\/doi.org\/10.1093\/nar\/gkm998","journal-title":"Nucleic Acids Res"},{"key":"689_CR76","doi-asserted-by":"publisher","first-page":"1493","DOI":"10.1016\/j.bbapap.2006.07.005","volume":"1764","author":"MM Gromiha","year":"2006","unstructured":"Gromiha MM, Suwa M (2006) Influence of amino acid properties for discriminating outer membrane proteins at better accuracy. Biochim Biophys Acta Proteins Proteom 1764:1493\u20131497. https:\/\/doi.org\/10.1016\/j.bbapap.2006.07.005","journal-title":"Biochim Biophys Acta Proteins Proteom"},{"key":"689_CR77","doi-asserted-by":"publisher","first-page":"416","DOI":"10.1016\/j.jmb.2016.10.013","volume":"429","author":"P Zhang","year":"2017","unstructured":"Zhang P, Tao L, Zeng X et al (2017) PROFEAT update: a protein features web server with added facility to compute network descriptors for studying omics-derived networks. J Mol Biol 429:416\u2013425. https:\/\/doi.org\/10.1016\/j.jmb.2016.10.013","journal-title":"J Mol Biol"},{"key":"689_CR78","unstructured":"Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. In: 31st Conference on Neural Information Processing Systems"},{"key":"689_CR79","doi-asserted-by":"publisher","first-page":"742","DOI":"10.1021\/ci100050t","volume":"50","author":"D Rogers","year":"2010","unstructured":"Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742\u2013754. https:\/\/doi.org\/10.1021\/ci100050t","journal-title":"J Chem Inf Model"},{"key":"689_CR80","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825\u20132830","journal-title":"J Mach Learn Res"},{"key":"689_CR81","first-page":"2579","volume":"9","author":"L Van Der Maaten","year":"2008","unstructured":"Van Der Maaten L, Hinton G (2008) Visualizing data using t-SNE. 
J Mach Learn Res 9:2579\u20132605","journal-title":"J Mach Learn Res"},{"key":"689_CR82","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s12864-019-6413-7","volume":"21","author":"D Chicco","year":"2020","unstructured":"Chicco D, Jurman G (2020) The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21:1\u201313. https:\/\/doi.org\/10.1186\/s12864-019-6413-7","journal-title":"BMC Genomics"},{"key":"689_CR83","doi-asserted-by":"publisher","first-page":"3021","DOI":"10.21105\/joss.03021","volume":"6","author":"M Waskom","year":"2021","unstructured":"Waskom M (2021) seaborn: statistical data visualization. J Open Source Softw 6:3021. https:\/\/doi.org\/10.21105\/joss.03021","journal-title":"J Open Source Softw"},{"key":"689_CR84","doi-asserted-by":"publisher","first-page":"90","DOI":"10.1109\/MCSE.2007.55","volume":"9","author":"JD Hunter","year":"2007","unstructured":"Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9:90\u201395. 
https:\/\/doi.org\/10.1109\/MCSE.2007.55","journal-title":"Comput Sci Eng"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00689-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-023-00689-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-023-00689-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,13]],"date-time":"2024-10-13T17:01:18Z","timestamp":1728838878000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-023-00689-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,6]]},"references-count":84,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["689"],"URL":"https:\/\/doi.org\/10.1186\/s13321-023-00689-w","relation":{},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,6]]},"assertion":[{"value":"2 September 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 January 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"6 February 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"There are no competing interests to declare.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing 
interests"}}],"article-number":"16"}}