{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T04:30:14Z","timestamp":1772166614193,"version":"3.50.1"},"reference-count":45,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,11,27]],"date-time":"2025-11-27T00:00:00Z","timestamp":1764201600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,12,30]],"date-time":"2025-12-30T00:00:00Z","timestamp":1767052800000},"content-version":"vor","delay-in-days":33,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001602","name":"Science Foundation Ireland","doi-asserted-by":"publisher","award":["18\/CRT\/6214"],"award-info":[{"award-number":["18\/CRT\/6214"]}],"id":[{"id":"10.13039\/501100001602","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Bioactive peptides are an important class of natural products with great functional versatility. Chemical modifications can improve their pharmacology, yet their structural diversity presents unique challenges for computational modeling. Furthermore, data for standard peptides (composed of the 20 canonical amino acids) is more abundant than for modified ones. Thus, we set out to identify whether predictive models fitted to standard data are reliable when applied to modified peptides. To do this, we first considered two critical aspects of the modeling problem, namely, choice of similarity function for guiding dataset partitioning and choice of molecular representation. Similarity-based dataset partitioning is an evaluation technique that divides the dataset into train and test subsets, such that the molecules in the test set are different from those used to fit the model.<\/jats:p>\n                  <jats:p>\n                    <jats:bold>Graphical Abstract<\/jats:bold>\n                  <\/jats:p>","DOI":"10.1186\/s13321-025-01115-z","type":"journal-article","created":{"date-parts":[[2025,11,27]],"date-time":"2025-11-27T17:34:58Z","timestamp":1764264898000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["How to build machine learning models able to extrapolate from standard to modified peptides"],"prefix":"10.1186","volume":"17","author":[{"given":"Ra\u00fal","family":"Fern\u00e1ndez-D\u00edaz","sequence":"first","affiliation":[]},{"given":"Rodrigo","family":"Ochoa","sequence":"additional","affiliation":[]},{"given":"Thanh Lam","family":"Hoang","sequence":"additional","affiliation":[]},{"given":"Vanessa","family":"Lopez","sequence":"additional","affiliation":[]},{"given":"Denis C.","family":"Shields","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,11,27]]},"reference":[{"key":"1115_CR1","doi-asserted-by":"publisher","first-page":"2625","DOI":"10.1038\/s41467-023-38328-5","volume":"14","author":"NR Bennett","year":"2023","unstructured":"Bennett NR, Coventry B, Goreshnik I, Huang B, Allen A, Vafeados D, Peng YP, Dauparas J, Baek M, Stewart L et al (2023) Improving de novo protein binder design with deep learning. Nat Commun 14:2625","journal-title":"Nat Commun"},{"key":"1115_CR2","doi-asserted-by":"crossref","first-page":"2","DOI":"10.1038\/s41467-024-54791-0","volume":"16","author":"W Yang","year":"2025","unstructured":"Yang W, Hicks DR, Ghosh A, Schwartze TA, Conventry B, Goreshnik I, Allen A, Halabiya SF, Kim CJ (2025) Hinck CS others (2001) Design of high-affinity binders to immune modulating receptors for cancer immunotherapy. Nat Commun 16:2","journal-title":"Nat Commun"},{"key":"1115_CR3","doi-asserted-by":"publisher","first-page":"8682","DOI":"10.1039\/D4SC07642G","volume":"16","author":"G Geylan","year":"2025","unstructured":"Geylan G, Janet JP, Tibo A, He J, Patronov A, Kabeshov M, Czechtizky W, David F, Engkvist O, De Maria L (2025) PepINVENT: generative peptide design beyond natural amino acids. Chem Sci 16:8682\u20138696","journal-title":"Chem Sci"},{"key":"1115_CR4","doi-asserted-by":"publisher","first-page":"48","DOI":"10.1038\/s41392-022-00904-4","volume":"7","author":"L Wang","year":"2022","unstructured":"Wang L, Wang N, Zhang W, Cheng X, Yan Z, Shao G, Wang X, Wang R, Fu C (2022) Therapeutic peptides: current applications and future directions. Signal Transduct Target Ther 7:48","journal-title":"Signal Transduct Target Ther"},{"key":"1115_CR5","doi-asserted-by":"publisher","first-page":"148570","DOI":"10.1109\/ACCESS.2020.3015792","volume":"8","author":"M Attique","year":"2020","unstructured":"Attique M, Farooq MS, Khelifi A, Abid A (2020) Prediction of therapeutic peptides using machine learning: computational models, datasets, and feature encodings. Ieee Access 8:148570\u2013148594","journal-title":"Ieee Access"},{"key":"1115_CR6","doi-asserted-by":"publisher","first-page":"309","DOI":"10.1038\/s41573-020-00135-8","volume":"20","author":"M Muttenthaler","year":"2021","unstructured":"Muttenthaler M, King GF, Adams DJ, Alewood PF (2021) Trends in peptide drug discovery. Nat Rev Drug Discov 20:309\u2013325","journal-title":"Nat Rev Drug Discov"},{"key":"1115_CR7","doi-asserted-by":"publisher","first-page":"baac011","DOI":"10.1093\/database\/baac011","volume":"2022","author":"S Ramazi","year":"2022","unstructured":"Ramazi S, Mohammadi N, Allahverdi A, Khalili E, Abdolmaleki P (2022) A review on antimicrobial peptides databases and the computational tools. Database 2022:baac011","journal-title":"Database"},{"issue":"8","key":"1115_CR8","doi-asserted-by":"publisher","DOI":"10.1016\/j.drudis.2025.104421","volume":"30","author":"N Bajiya","year":"2025","unstructured":"Bajiya N, Najrin S, Kumar P, Choudhury S, Tomer R, Raghava GP (2025) CPPsite3: an updated large repository of experimentally validated cell-penetrating peptides. Drug Discov Today 30(8):104421","journal-title":"Drug Discov Today"},{"key":"1115_CR9","doi-asserted-by":"crossref","unstructured":"Cabas-Mora G, Daza A, Soto-Garc\u00eda N, Garrido V, Alvarez D, Navarrete M, Sarmiento-Var\u00f3n L, Sep\u00falveda Ya\u00f1ez JH, Davari MD, Cadet F others (2024) Peptipedia v2. 0: A peptide sequence database and user-friendly web platform. A major update. Database, 2024:baae113","DOI":"10.1093\/database\/baae113"},{"key":"1115_CR10","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1021\/ci00057a005","volume":"28","author":"D Weininger","year":"1988","unstructured":"Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28:31\u201336","journal-title":"J Chem Inf Comput Sci"},{"key":"1115_CR11","doi-asserted-by":"crossref","unstructured":"Zhang T, Li H, Xi H, Stanton RV, Rotstein SH (2012) HELM: a hierarchical notation language for complex biomolecule structure representation","DOI":"10.1021\/ci3001925"},{"key":"1115_CR12","doi-asserted-by":"publisher","first-page":"3942","DOI":"10.1021\/acs.jcim.2c00703","volume":"62","author":"T Fox","year":"2022","unstructured":"Fox T, Bieler M, Haebel P, Ochoa R, Peters S, Weber A (2022) BILN: a human-readable line notation for complex peptides. J Chem Inf Model 62:3942\u20133947","journal-title":"J Chem Inf Model"},{"key":"1115_CR13","doi-asserted-by":"publisher","first-page":"557","DOI":"10.1021\/acsmedchemlett.3c00037","volume":"14","author":"JL Hickey","year":"2023","unstructured":"Hickey JL, Sindhikara D, Zultanski SL, Schultz DM (2023) Beyond 20 in the 21st century: prospects and challenges of non-canonical amino acids in peptide drug discovery. ACS Med Chem Lett 14:557\u2013565","journal-title":"ACS Med Chem Lett"},{"key":"1115_CR14","doi-asserted-by":"publisher","first-page":"lqad088","DOI":"10.1093\/nargab\/lqad088","volume":"5","author":"F Teufel","year":"2023","unstructured":"Teufel F, G\u00edslason MH, Almagro Armenteros JJ, Johansen AR, Winther O, Nielsen H (2023) GraphPart: homology partitioning for biological sequence analysis. NAR Genomics Bioinform 5:lqad088","journal-title":"NAR Genomics Bioinform"},{"key":"1115_CR15","doi-asserted-by":"publisher","first-page":"747","DOI":"10.1021\/ci9803381","volume":"39","author":"D Butina","year":"1999","unstructured":"Butina D (1999) Unsupervised data base clustering based on daylight\u2019s fingerprint and Tanimoto similarity: a fast and automated way to cluster small and large data sets. J Chem Inf Comput Sci 39:747\u2013750","journal-title":"J Chem Inf Comput Sci"},{"key":"1115_CR16","unstructured":"Ramsundar B, Eastman P, Walters P, Pande V, Leswing K, Wu Z (2019) Deep Learning for the Life Sciences; O\u2019Reilly Media. https:\/\/www.amazon.com\/Deep-Learning-Life-Sciences-Microscopy\/dp\/1492039837"},{"key":"1115_CR17","doi-asserted-by":"publisher","first-page":"759","DOI":"10.1039\/D2DD00146B","volume":"2","author":"G Tom","year":"2023","unstructured":"Tom G, Hickman RJ, Zinzuwadia A, Mohajeri A, Sanchez-Lengeling B, Aspuru-Guzik A (2023) Calibration and generalizability of probabilistic models on low-data chemical datasets with DIONYSUS. Digit Discov 2:759\u2013774","journal-title":"Digit Discov"},{"key":"1115_CR18","first-page":"64526","volume":"36","author":"S Steshin","year":"2023","unstructured":"Steshin S (2023) Lo-hi: practical ml drug discovery benchmark. Adv Neural Inf Process Syst 36:64526\u201364554","journal-title":"Adv Neural Inf Process Syst"},{"key":"1115_CR19","doi-asserted-by":"publisher","first-page":"697","DOI":"10.1021\/acs.jcim.3c01774","volume":"64","author":"P Tossou","year":"2024","unstructured":"Tossou P, Wognum C, Craig M, Mary H, Noutahi E (2024) Real-world molecular out-of-distribution: specification and investigation. J Chem Inf Model 64:697\u2013711","journal-title":"J Chem Inf Model"},{"key":"1115_CR20","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-025-01039-8","volume":"17","author":"Q Guo","year":"2025","unstructured":"Guo Q, Hernandez-Hernandez S, Ballester PJ (2025) UMAP-based clustering split for rigorous evaluation of AI models for virtual screening on cancer cell lines. J Cheminform 17:1\u201318","journal-title":"J Cheminform"},{"key":"1115_CR21","doi-asserted-by":"crossref","unstructured":"Ektefaie Y, Shen A, Bykova D, Marin M, Zitnik M, Farhat M (2024) Evaluating generalizability of artificial intelligence models for molecular datasets. bioRxiv","DOI":"10.1101\/2024.02.25.581982"},{"key":"1115_CR22","doi-asserted-by":"crossref","unstructured":"Fernandez-Diaz R, Lam HT, L\u00f3pez V, Shields DC (2025) A new framework for evaluating model out-of-distribution generalisation for the biochemical domain. The Thirteenth International Conference on Learning Representations","DOI":"10.1101\/2024.03.14.584508"},{"key":"1115_CR23","unstructured":"Adamczyk J, Ludynia P, Czech W (2025) Molecular Fingerprints Are Strong Models for Peptide Function Prediction. arXiv preprint arXiv:2501.17901"},{"key":"1115_CR24","doi-asserted-by":"publisher","first-page":"btae555","DOI":"10.1093\/bioinformatics\/btae555","volume":"40","author":"R Fern\u00e1ndez-D\u00edaz","year":"2024","unstructured":"Fern\u00e1ndez-D\u00edaz R, Cossio-P\u00e9rez R, Agoni C, Lam HT, Lopez V, Shields DC (2024) AutoPeptideML: a study on how to build more trustworthy peptide bioactivity predictors. Bioinformatics 40:btae555","journal-title":"Bioinformatics"},{"key":"1115_CR25","doi-asserted-by":"publisher","first-page":"742","DOI":"10.1021\/ci100050t","volume":"50","author":"D Rogers","year":"2010","unstructured":"Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742\u2013754","journal-title":"J Chem Inf Model"},{"key":"1115_CR26","doi-asserted-by":"publisher","first-page":"1924","DOI":"10.1021\/ci050413p","volume":"46","author":"P Gedeck","year":"2006","unstructured":"Gedeck P, Rohde B, Bartels C (2006) QSAR- how good is it in practice? Comparison of descriptor sets on an unbiased cross section of corporate data sets. J Chem Inf Model 46:1924\u20131936","journal-title":"J Chem Inf Model"},{"key":"1115_CR27","doi-asserted-by":"publisher","first-page":"1256","DOI":"10.1038\/s42256-022-00580-7","volume":"4","author":"J Ross","year":"2022","unstructured":"Ross J, Belgodere B, Chenthamarakshan V, Padhi I, Mroueh Y, Das P (2022) Large-scale chemical language representations capture molecular structure and properties. Nat Mach Intell 4:1256\u20131264","journal-title":"Nat Mach Intell"},{"key":"1115_CR28","unstructured":"Ahmad W, Simon E, Chithrananda S, Grand G, Ramsundar B (2022) Chemberta-2: Towards chemical foundation models. arXiv preprint arXiv:2209.01712"},{"issue":"2","key":"1115_CR29","doi-asserted-by":"publisher","first-page":"571","DOI":"10.1021\/acs.jcim.4c01441","volume":"65","author":"AL Feller","year":"2024","unstructured":"Feller AL, Wilke CO (2024) Peptide-aware chemical language model successfully predicts membrane diffusion of cyclic peptides. J Chem Inf Model 65(2):571\u2013579","journal-title":"J Chem Inf Model"},{"key":"1115_CR30","unstructured":"Zhang R, Wu H, Xiu Y, Li K, Chen N, Wang Y, Wang Y, Gao X, Zhou F (2023) PepLand: a large-scale pre-trained peptide representation model for a comprehensive landscape of both canonical and non-canonical amino acids. arXiv preprint arXiv:2311.04419"},{"key":"1115_CR31","doi-asserted-by":"publisher","first-page":"7112","DOI":"10.1109\/TPAMI.2021.3095381","volume":"44","author":"A Elnaggar","year":"2021","unstructured":"Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M et al (2021) Prottrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell 44:7112\u20137127","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"1115_CR32","unstructured":"Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, dos Santos Costa A, Fazel-Zarandi M, Sercu T, Candido S others (2022) Language models of protein sequences at the scale of evolution enable accurate structure prediction. BioRxiv 2022:500902"},{"key":"1115_CR33","doi-asserted-by":"publisher","first-page":"16","DOI":"10.53805\/lads.v3i1.63","volume":"3","author":"MDSP Gonzalez","year":"2023","unstructured":"Gonzalez MDSP, Cheohen C, Andriolo BV, da Silva ML (2023) Development of a database of peptides with potential for pharmacological intervention in human pathogen molecular targets. Latin Am Data Sci 3:16\u201321","journal-title":"Latin Am Data Sci"},{"key":"1115_CR34","doi-asserted-by":"crossref","unstructured":"Akiba T, Sano S, Yanase T, Ohta T, Koyama M (2019) Optuna: A next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. pp 2623\u20132631","DOI":"10.1145\/3292500.3330701"},{"key":"1115_CR35","doi-asserted-by":"publisher","first-page":"1026","DOI":"10.1038\/nbt.3988","volume":"35","author":"M Steinegger","year":"2017","unstructured":"Steinegger M, S\u00f6ding J (2017) MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35:1026\u20131028","journal-title":"Nat Biotechnol"},{"key":"1115_CR36","doi-asserted-by":"publisher","first-page":"276","DOI":"10.1016\/S0168-9525(00)02024-2","volume":"16","author":"P Rice","year":"2000","unstructured":"Rice P, Longden I, Bleasby A (2000) EMBOSS: the European molecular biology open software suite. Trends Genet 16:276\u2013277","journal-title":"Trends Genet"},{"key":"1115_CR37","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1186\/s13321-024-00849-6","volume":"16","author":"M Orsi","year":"2024","unstructured":"Orsi M, Reymond J-L (2024) One chiral fingerprint to find them all. J Cheminform 16:53","journal-title":"J Cheminform"},{"key":"1115_CR38","doi-asserted-by":"publisher","DOI":"10.1002\/psc.3666","volume":"31","author":"R Ochoa","year":"2025","unstructured":"Ochoa R, Deibler K (2025) PepFuNN: Novo Nordisk open-source toolkit to enable peptide in silico analysis. J Pept Sci 31:e3666","journal-title":"J Pept Sci"},{"key":"1115_CR39","doi-asserted-by":"crossref","unstructured":"Jain SM (2022) Introduction to transformers for NLP: with the hugging face library and models to solve problems; Springer, pp 51\u201367","DOI":"10.1007\/978-1-4842-8844-3_4"},{"key":"1115_CR40","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-025-00963-z","volume":"17","author":"B Zdrazil","year":"2025","unstructured":"Zdrazil B (2025) Fifteen years of ChEMBL and its role in cheminformatics and drug discovery. J Cheminform 17:1\u20139","journal-title":"J Cheminform"},{"key":"1115_CR41","doi-asserted-by":"crossref","unstructured":"Ash JR, Wognum C, Rodr\u00edguez-P\u00e9rez R, Aldeghi M, Cheng AC, Clevert D-A, Engkvist O, Fang C, Price DJ, Hughes-Oliver JM others (2024) Practically significant method comparison protocols for machine learning in small molecule drug discovery","DOI":"10.26434\/chemrxiv-2024-6dbwv"},{"key":"1115_CR42","first-page":"3","volume":"8","author":"C Bonferroni","year":"1936","unstructured":"Bonferroni C (1936) Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R istituto superiore di scienze economiche e commericiali di firenze 8:3\u201362","journal-title":"Pubblicazioni del R istituto superiore di scienze economiche e commericiali di firenze"},{"key":"1115_CR43","doi-asserted-by":"publisher","first-page":"1122","DOI":"10.1038\/s41592-021-01205-4","volume":"18","author":"I Walsh","year":"2021","unstructured":"Walsh I, Fishman D, Garcia-Gasulla D, Titma T, Pollastri G, Harrow J, Psomopoulos FE, Tosatto SC (2021) DOME: recommendations for supervised machine learning validation in biology. Nat Methods 18:1122\u20131127","journal-title":"Nat Methods"},{"key":"1115_CR44","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-020-00445-4","volume":"12","author":"A Capecchi","year":"2020","unstructured":"Capecchi A, Probst D, Reymond J-L (2020) One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J Cheminform 12:1\u201315","journal-title":"J Cheminform"},{"key":"1115_CR45","doi-asserted-by":"publisher","first-page":"2240","DOI":"10.1021\/acs.jcim.2c01573","volume":"63","author":"J Li","year":"2023","unstructured":"Li J, Yanagisawa K, Sugita M, Fujie T, Ohue M, Akiyama Y (2023) CycPeptMPDB: a comprehensive database of membrane permeability of cyclic peptides. J Chem Inf Model 63:2240\u20132250","journal-title":"J Chem Inf Model"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-025-01115-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-025-01115-z","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-025-01115-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,30]],"date-time":"2025-12-30T04:19:30Z","timestamp":1767068370000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1186\/s13321-025-01115-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,27]]},"references-count":45,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["1115"],"URL":"https:\/\/doi.org\/10.1186\/s13321-025-01115-z","relation":{"has-preprint":[{"id-type":"doi","id":"10.26434\/chemrxiv-2025-ggp8n-v3","asserted-by":"object"},{"id-type":"doi","id":"10.26434\/chemrxiv-2025-ggp8n-v4","asserted-by":"object"}]},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,27]]},"assertion":[{"value":"19 July 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 October 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 November 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no Conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"185"}}