{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T03:04:16Z","timestamp":1775531056395,"version":"3.50.1"},"reference-count":54,"publisher":"Oxford University Press (OUP)","issue":"9","license":[{"start":{"date-parts":[[2024,9,18]],"date-time":"2024-09-18T00:00:00Z","timestamp":1726617600000},"content-version":"vor","delay-in-days":17,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001602","name":"Science Foundation Ireland","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100001602","id-type":"DOI","asserted-by":"publisher"}]},{"name":"European Union\u2019s Horizon 2020 research and innovation programme"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,9,2]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Automated machine learning (AutoML) solutions can bridge the gap between new computational advances and their real-world applications by enabling experimental scientists to build their own custom models. We examine different steps in the development life-cycle of peptide bioactivity binary predictors and identify key steps where automation can not only result in a more accessible method, but also more robust and interpretable evaluation leading to more trustworthy models.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We present a new automated method for drawing negative peptides that achieves better balance between specificity and generalization than current alternatives. 
We study the effect of homology-based partitioning for generating the training and testing data subsets and demonstrate that model performance is overestimated when no such homology correction is used, which indicates that prior studies may have overestimated their performance when applied to new peptide sequences. We also conduct a systematic analysis of different protein language models as peptide representation methods and find that they can serve as better descriptors than a naive alternative, but that there is no significant difference across models with different sizes or algorithms. Finally, we demonstrate that an ensemble of optimized traditional machine learning algorithms can compete with more complex neural network models, while being more computationally efficient. We integrate these findings into AutoPeptideML, an easy-to-use AutoML tool to allow researchers without a computational background to build new predictive models for peptide bioactivity in a matter of minutes.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>Source code, documentation, and data are available at https:\/\/github.com\/IBM\/AutoPeptideML and a dedicated web-server at http:\/\/peptide.ucd.ie\/AutoPeptideML. 
A static version of the software to ensure the reproduction of the results is available at https:\/\/zenodo.org\/records\/13363975.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btae555","type":"journal-article","created":{"date-parts":[[2024,9,17]],"date-time":"2024-09-17T15:44:09Z","timestamp":1726587849000},"source":"Crossref","is-referenced-by-count":11,"title":["AutoPeptideML: a study on how to build more trustworthy peptide bioactivity predictors"],"prefix":"10.1093","volume":"40","author":[{"given":"Ra\u00fal","family":"Fern\u00e1ndez-D\u00edaz","sequence":"first","affiliation":[{"name":"IBM Research, Dublin , Dublin D15 HN66,","place":["Ireland"]},{"name":"School of Medicine, University College Dublin , Dublin D04 C1P1,","place":["Ireland"]},{"name":"Conway Institute of Biomolecular and Biomedical Science, University College Dublin , Dublin D04 C1P ,","place":["Ireland"]},{"name":"The SFI Centre for Research Training in Genomics Data Science ,","place":["Ireland"]}]},{"given":"Rodrigo","family":"Cossio-P\u00e9rez","sequence":"additional","affiliation":[{"name":"School of Medicine, University College Dublin , Dublin D04 C1P1,","place":["Ireland"]},{"name":"Conway Institute of Biomolecular and Biomedical Science, University College Dublin , Dublin D04 C1P ,","place":["Ireland"]},{"name":"Department of Science and Technology, National University of Quilmes , Bernal B1876, Provincia de Buenos Aires,","place":["Argentina"]}]},{"given":"Clement","family":"Agoni","sequence":"additional","affiliation":[{"name":"School of Medicine, University College Dublin , Dublin D04 C1P1,","place":["Ireland"]},{"name":"Conway Institute of Biomolecular and Biomedical Science, University College Dublin , Dublin D04 C1P ,","place":["Ireland"]},{"name":"Discipline of Pharmaceutical Sciences, School of Health Sciences, University of KwaZulu-Natal , Durban 4000,","place":["South Africa"]}]},{"given":"Hoang 
Thanh","family":"Lam","sequence":"additional","affiliation":[{"name":"IBM Research, Dublin , Dublin D15 HN66,","place":["Ireland"]}]},{"given":"Vanessa","family":"Lopez","sequence":"additional","affiliation":[{"name":"IBM Research, Dublin , Dublin D15 HN66,","place":["Ireland"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4015-2474","authenticated-orcid":false,"given":"Denis C","family":"Shields","sequence":"additional","affiliation":[{"name":"School of Medicine, University College Dublin , Dublin D04 C1P1,","place":["Ireland"]},{"name":"Conway Institute of Biomolecular and Biomedical Science, University College Dublin , Dublin D04 C1P ,","place":["Ireland"]}]}],"member":"286","published-online":{"date-parts":[[2024,9,18]]},"reference":[{"key":"2024092902115592900_btae555-B1","doi-asserted-by":"crossref","DOI":"10.1093\/bib\/bbaa153","article-title":"Anticp 2.0: an updated model for predicting anticancer peptides","volume":"22","author":"Agrawal","year":"2021","journal-title":"Brief Bioinform"},{"key":"2024092902115592900_btae555-B2","first-page":"2623","author":"Akiba","year":"2019"},{"key":"2024092902115592900_btae555-B3","first-page":"268","author":"Amirian","year":"2021"},{"key":"2024092902115592900_btae555-B4","doi-asserted-by":"crossref","first-page":"148570","DOI":"10.1109\/ACCESS.2020.3015792","article-title":"Prediction of therapeutic peptides using machine learning: computational models, datasets, and feature encodings","volume":"8","author":"Attique","year":"2020","journal-title":"IEEE Access"},{"key":"2024092902115592900_btae555-B5","doi-asserted-by":"crossref","first-page":"3732","DOI":"10.1021\/acs.jproteome.0c00276","article-title":"Prediction of neuropeptides from sequence information using ensemble classifier and hybrid features","volume":"19","author":"Bin","year":"2020","journal-title":"J Proteome 
Res"},{"key":"2024092902115592900_btae555-B6","doi-asserted-by":"crossref","first-page":"197","DOI":"10.1046\/j.1365-2796.2003.01228.x","article-title":"Antibacterial peptides: basic facts and emerging concepts","volume":"254","author":"Boman","year":"2003","journal-title":"J Intern Med"},{"key":"2024092902115592900_btae555-B7","doi-asserted-by":"crossref","first-page":"3017","DOI":"10.1038\/s41598-021-82513-9","article-title":"Improved prediction and characterization of anticancer activities of peptides using a novel flexible scoring card method","volume":"11","author":"Charoenkwan","year":"2021","journal-title":"Sci Rep"},{"key":"2024092902115592900_btae555-B8","doi-asserted-by":"crossref","first-page":"4125","DOI":"10.1021\/acs.jproteome.0c00590","article-title":"iDPPIV-SCM: a sequence-based predictor for identifying and analyzing dipeptidyl peptidase iv (DPP-IV) inhibitory peptides using a scoring card method","volume":"19","author":"Charoenkwan","year":"2020","journal-title":"J Proteome Res"},{"key":"2024092902115592900_btae555-B9","doi-asserted-by":"crossref","first-page":"32653","DOI":"10.1021\/acsomega.2c04305","article-title":"SCMRSA: a new approach for identifying and analyzing anti-mrsa peptides using estimated propensity scores of dipeptides","volume":"7","author":"Charoenkwan","year":"2022","journal-title":"ACS Omega"},{"key":"2024092902115592900_btae555-B10","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1016\/j.ymeth.2021.12.001","article-title":"StackDPPIV: a novel computational approach for accurate prediction of dipeptidyl peptidase iv (DPP-IV) inhibitory peptides","volume":"204","author":"Charoenkwan","year":"2022","journal-title":"Methods"},{"key":"2024092902115592900_btae555-B11","doi-asserted-by":"crossref","first-page":"113747","DOI":"10.1016\/j.ab.2020.113747","article-title":"iTTCA-Hybrid: improved and robust identification of tumor t cell antigens by utilizing hybrid feature 
representation","volume":"599","author":"Charoenkwan","year":"2020","journal-title":"Anal Biochem"},{"key":"2024092902115592900_btae555-B12","doi-asserted-by":"crossref","first-page":"41082","DOI":"10.1021\/acsomega.2c04465","article-title":"iAMAP-SCM: a novel computational tool for large-scale identification of antimalarial peptides using estimated propensity scores of dipeptides","volume":"7","author":"Charoenkwan","year":"2022","journal-title":"ACS Omega"},{"key":"2024092902115592900_btae555-B13","doi-asserted-by":"crossref","DOI":"10.1093\/bib\/bbac319","article-title":"NeuroPred-CLQ: incorporating deep temporal convolutional networks and multi-head attention mechanism to predict neuropeptides","volume":"23","author":"Chen","year":"2022","journal-title":"Brief Bioinform"},{"key":"2024092902115592900_btae555-B14","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1186\/s13040-023-00322-4","article-title":"The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification","volume":"16","author":"Chicco","year":"2023","journal-title":"BioData Min"},{"key":"2024092902115592900_btae555-B15","doi-asserted-by":"crossref","first-page":"13","DOI":"10.1186\/s13040-021-00244-z","article-title":"The matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation","volume":"14","author":"Chicco","year":"2021","journal-title":"BioData Min"},{"key":"2024092902115592900_btae555-B16","doi-asserted-by":"crossref","first-page":"525","DOI":"10.1021\/acs.jcim.0c01115","article-title":"Bbppred: sequence-based prediction of blood-brain barrier peptides with feature representation learning and logistic regression","volume":"61","author":"Dai","year":"2021","journal-title":"J Chem Inf 
Model"},{"key":"2024092902115592900_btae555-B17","doi-asserted-by":"crossref","first-page":"1947","DOI":"10.1007\/s10462-021-10058-4","article-title":"Machine learning in drug discovery: a review","volume":"55","author":"Dara","year":"2022","journal-title":"Artif Intell Rev"},{"key":"2024092902115592900_btae555-B18","doi-asserted-by":"crossref","first-page":"vbac021","DOI":"10.1093\/bioadv\/vbac021","article-title":"Lmpred: predicting antimicrobial peptides using pre-trained language models and deep learning","volume":"2","author":"Dee","year":"2022","journal-title":"Bioinform Adv"},{"key":"2024092902115592900_btae555-B19","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1093\/bib\/bbad135","article-title":"Unidl4biopep: a universal deep learning architecture for binary classification in peptide bioactivity","volume":"24","author":"Du","year":"2023","journal-title":"Brief Bioinform"},{"key":"2024092902115592900_btae555-B20","first-page":"3723","author":"Dvornik","year":"2019"},{"key":"2024092902115592900_btae555-B21","doi-asserted-by":"crossref","first-page":"140","DOI":"10.1073\/pnas.81.1.140","article-title":"The hydrophobic moment detects periodicity in protein hydrophobicity","volume":"81","author":"Eisenberg","year":"1984","journal-title":"Proc Natl Acad Sci USA"},{"key":"2024092902115592900_btae555-B22","doi-asserted-by":"crossref","first-page":"7112","DOI":"10.1109\/TPAMI.2021.3095381","article-title":"Prottrans: toward understanding the language of life through self-supervised learning","volume":"44","author":"Elnaggar","year":"2021","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"2024092902115592900_btae555-B23","doi-asserted-by":"publisher","author":"Fern\u00e1ndez-D\u00edaz","year":"2024","DOI":"10.1101\/2023.11.13.566825"},{"key":"2024092902115592900_btae555-B24","doi-asserted-by":"crossref","first-page":"111","DOI":"10.1016\/j.inffus.2010.06.010","article-title":"An empirical study of binary classifier fusion methods for 
multiclass classification","volume":"12","author":"Garc\u00eda-Pedrajas","year":"2011","journal-title":"Inf Fusion"},{"key":"2024092902115592900_btae555-B25","doi-asserted-by":"crossref","first-page":"106622","DOI":"10.1016\/j.knosys.2020.106622","article-title":"Automl: a survey of the state-of-the-art","volume":"212","author":"He","year":"2021","journal-title":"Knowledge-Based Syst"},{"key":"2024092902115592900_btae555-B26","doi-asserted-by":"publisher","author":"Heinzinger","year":"2023","DOI":"10.1101\/2023.07.23.550085"},{"key":"2024092902115592900_btae555-B27","first-page":"1895","article-title":"Thermostability and aliphatic index of globular proteins","volume":"88","author":"Ikai","year":"1980","journal-title":"J Biochem"},{"key":"2024092902115592900_btae555-B28","author":"Larralde"},{"key":"2024092902115592900_btae555-B29","doi-asserted-by":"publisher","author":"Li","year":"2024","DOI":"10.1101\/2024.02.05.578959"},{"key":"2024092902115592900_btae555-B30","doi-asserted-by":"publisher","author":"Lin","year":"2022","DOI":"10.1101\/2022.07.20.500902"},{"key":"2024092902115592900_btae555-B31","doi-asserted-by":"crossref","first-page":"1123","DOI":"10.1126\/science.ade2574","article-title":"Evolutionary-scale prediction of atomic-level protein structure with a language model","volume":"379","author":"Lin","year":"2023","journal-title":"Science"},{"key":"2024092902115592900_btae555-B32","doi-asserted-by":"crossref","first-page":"2757","DOI":"10.1093\/bioinformatics\/bty1047","article-title":"mAHTPred: a sequence-based meta-predictor for improving the prediction of anti-hypertensive peptides using effective feature representation","volume":"35","author":"Manavalan","year":"2019","journal-title":"Bioinformatics"},{"key":"2024092902115592900_btae555-B33","doi-asserted-by":"crossref","first-page":"21471","DOI":"10.1038\/s41598-020-78319-w","article-title":"Anoxpepred: using deep learning for the prediction of antioxidative properties of 
peptides","volume":"10","author":"Olsen","year":"2020","journal-title":"Sci Rep"},{"key":"2024092902115592900_btae555-B34","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1186\/s13321-024-00849-6","article-title":"One chiral fingerprint to find them all","volume":"16","author":"Orsi","year":"2024","journal-title":"J Cheminform"},{"key":"2024092902115592900_btae555-B35","doi-asserted-by":"crossref","first-page":"5368","DOI":"10.1093\/bioinformatics\/btac711","article-title":"Integrating transformer and imbalanced multi-label learning to identify antimicrobial peptides and their functional activities","volume":"38","author":"Pang","year":"2022","journal-title":"Bioinformatics"},{"key":"2024092902115592900_btae555-B36","doi-asserted-by":"crossref","first-page":"3141","DOI":"10.1021\/acs.jcim.1c00251","article-title":"Alignment-free antimicrobial peptide predictors: improving performance by a thorough analysis of the largest available data set","volume":"61","author":"Pinacho-Castellanos","year":"2021","journal-title":"J Chem Inf Model"},{"key":"2024092902115592900_btae555-B37","doi-asserted-by":"crossref","first-page":"baab055","DOI":"10.1093\/database\/baab055","article-title":"Peptipedia: a user-friendly web application and a comprehensive database for peptide research supported by machine learning approach","volume":"2021","author":"Quiroz","year":"2021","journal-title":"Database"},{"key":"2024092902115592900_btae555-B38","doi-asserted-by":"crossref","first-page":"e0120066","DOI":"10.1371\/journal.pone.0120066","article-title":"Prediction and analysis of quorum sensing peptides based on sequence features","volume":"10","author":"Rajput","year":"2015","journal-title":"PLoS One"},{"key":"2024092902115592900_btae555-B39","first-page":"9689","article-title":"Evaluating protein transfer learning with tape","volume":"32","author":"Rao","year":"2019","journal-title":"Adv Neural Inf Process 
Syst"},{"key":"2024092902115592900_btae555-B40","doi-asserted-by":"publisher","author":"Rao","year":"2020","DOI":"10.1101\/2020.12.15.422761"},{"key":"2024092902115592900_btae555-B41","doi-asserted-by":"crossref","first-page":"157","DOI":"10.1016\/j.compbiomed.2004.09.006","article-title":"Isoelectric point determination of proteins and other macromolecules: oscillating method","volume":"36","author":"Sillero","year":"2006","journal-title":"Comput Biol Med"},{"key":"2024092902115592900_btae555-B42","doi-asserted-by":"crossref","first-page":"603","DOI":"10.1038\/s41592-019-0437-4","article-title":"Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold","volume":"16","author":"Steinegger","year":"2019","journal-title":"Nat Methods"},{"key":"2024092902115592900_btae555-B43","doi-asserted-by":"crossref","first-page":"1026","DOI":"10.1038\/nbt.3988","article-title":"Mmseqs2 enables sensitive protein sequence searching for the analysis of massive data sets","volume":"35","author":"Steinegger","year":"2017","journal-title":"Nat Biotechnol"},{"key":"2024092902115592900_btae555-B44","doi-asserted-by":"crossref","first-page":"1282","DOI":"10.1093\/bioinformatics\/btm098","article-title":"Uniref: comprehensive and non-redundant uniprot reference clusters","volume":"23","author":"Suzek","year":"2007","journal-title":"Bioinformatics"},{"key":"2024092902115592900_btae555-B45","doi-asserted-by":"crossref","first-page":"lqad088","DOI":"10.1093\/nargab\/lqad088","article-title":"Graphpart: homology partitioning for biological sequence analysis","volume":"5","author":"Teufel","year":"2023","journal-title":"NAR Genom Bioinform"},{"key":"2024092902115592900_btae555-B46","doi-asserted-by":"crossref","first-page":"2850","DOI":"10.3390\/molecules25122850","article-title":"Antimicrobial peptides as anticancer agents: functional properties and biological 
activities","volume":"25","author":"Tornesello","year":"2020","journal-title":"Molecules"},{"key":"2024092902115592900_btae555-B47","doi-asserted-by":"crossref","first-page":"227","DOI":"10.1038\/s42256-022-00457-9","article-title":"Learning functional properties of proteins with language models","volume":"4","author":"Unsal","year":"2022","journal-title":"Nat Mach Intell"},{"key":"2024092902115592900_btae555-B48","doi-asserted-by":"crossref","first-page":"1122","DOI":"10.1038\/s41592-021-01205-4","article-title":"Dome: recommendations for supervised machine learning validation in biology","volume":"18","author":"Walsh","year":"2021","journal-title":"Nat Methods"},{"key":"2024092902115592900_btae555-B49","doi-asserted-by":"crossref","first-page":"48","DOI":"10.1038\/s41392-022-00904-4","article-title":"Therapeutic peptides: current applications and future directions","volume":"7","author":"Wang","year":"2022","journal-title":"Signal Transduct Target Ther"},{"key":"2024092902115592900_btae555-B50","first-page":"106","article-title":"Comparative analysis and prediction of quorum-sensing peptides using feature representation learning and machine learning algorithms","volume":"21","author":"Wei","year":"2020","journal-title":"Brief Bioinform"},{"key":"2024092902115592900_btae555-B51","doi-asserted-by":"crossref","first-page":"bbab041","DOI":"10.1093\/bib\/bbab041","article-title":"Atse: a peptide toxicity predictor by exploiting structural and evolutionary information based on graph neural network and attention mechanism","volume":"22","author":"Wei","year":"2021","journal-title":"Brief Bioinform"},{"key":"2024092902115592900_btae555-B52","doi-asserted-by":"crossref","first-page":"bbab209","DOI":"10.1093\/bib\/bbab209","article-title":"iAMP-CA2L: a new cnn-bilstm-svm classifier based on cellular automata image for identifying antimicrobial peptides and their functional types","volume":"22","author":"Xiao","year":"2021","journal-title":"Brief 
Bioinform"},{"key":"2024092902115592900_btae555-B53","doi-asserted-by":"crossref","first-page":"258","DOI":"10.1007\/s12539-021-00484-x","article-title":"Predapp: predicting anti-parasitic peptides with undersampling and ensemble approaches","volume":"14","author":"Zhang","year":"2022","journal-title":"Interdiscip Sci Comput Life Sci"},{"key":"2024092902115592900_btae555-B54","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1002\/prot.1071","article-title":"Some insights into protein structural class prediction","volume":"44","author":"Zhou","year":"2001","journal-title":"Proteins Struct Funct Bioinf"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btae555\/59183251\/btae555.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/9\/btae555\/59423926\/btae555.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/9\/btae555\/59423926\/btae555.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,28]],"date-time":"2024-09-28T22:12:44Z","timestamp":1727561564000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btae555\/7760207"}},"subtitle":[],"editor":[{"given":"Pier 
Luigi","family":"Martelli","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2024,9]]},"references-count":54,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2024,9,2]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btae555","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2023.11.13.566825","asserted-by":"object"}]},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024,9]]},"published":{"date-parts":[[2024,9]]},"article-number":"btae555"}}