{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,26]],"date-time":"2026-03-26T16:23:49Z","timestamp":1774542229548,"version":"3.50.1"},"reference-count":57,"publisher":"Public Library of Science (PLoS)","issue":"12","license":[{"start":{"date-parts":[[2022,12,1]],"date-time":"2022-12-01T00:00:00Z","timestamp":1669852800000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["www.ploscompbiol.org"],"crossmark-restriction":false},"short-container-title":["PLoS Comput Biol"],"abstract":"<jats:p>The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work, reading submitted manuscripts as well as published papers, we have noticed several recurring issues, which make some reported findings hard to understand and replicate. We suspect this may be due to biologists being unfamiliar with machine learning methodology, or conversely, machine learning experts may miss some of the knowledge needed to correctly apply their methods to proteins. Here, we aim to bridge this gap for developers of such methods. The most striking issues are linked to a lack of clarity: how were annotations of interest obtained; which benchmark metrics were used; how are positives and negatives defined. Others relate to a lack of rigor: If you sneak in structural information, your method is not sequence-based; if you compare your own model to \u201cstate-of-the-art,\u201d take the best methods; if you want to conclude that some method is better than another, obtain a significance estimate to support this claim. These, and other issues, we will cover in detail. These points may have seemed obvious to the authors during writing; however, they are not always clear-cut to the readers. We also expect many of these tips to hold for other machine learning-based applications in biology. Therefore, many computational biologists who develop methods in this particular subject will benefit from a concise overview of what to avoid and what to do instead.<\/jats:p>","DOI":"10.1371\/journal.pcbi.1010669","type":"journal-article","created":{"date-parts":[[2022,12,1]],"date-time":"2022-12-01T19:18:43Z","timestamp":1669922323000},"page":"e1010669","update-policy":"https:\/\/doi.org\/10.1371\/journal.pcbi.corrections_policy","source":"Crossref","is-referenced-by-count":17,"title":["Ten quick tips for sequence-based prediction of protein properties using machine learning"],"prefix":"10.1371","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7655-0899","authenticated-orcid":true,"given":"Qingzhen","family":"Hou","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8570-7640","authenticated-orcid":true,"given":"Katharina","family":"Waury","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8809-0861","authenticated-orcid":true,"given":"Dea","family":"Gogishvili","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6755-9667","authenticated-orcid":true,"given":"K. Anton","family":"Feenstra","sequence":"additional","affiliation":[]}],"member":"340","published-online":{"date-parts":[[2022,12,1]]},"reference":[{"issue":"1","key":"pcbi.1010669.ref001","doi-asserted-by":"crossref","first-page":"86","DOI":"10.1093\/bib\/bbk007","article-title":"Machine learning in bioinformatics","volume":"7","author":"P Larra\u00f1aga","year":"2006","journal-title":"Brief Bioinform"},{"issue":"11","key":"pcbi.1010669.ref002","doi-asserted-by":"crossref","first-page":"659","DOI":"10.1038\/s41580-019-0176-5","article-title":"Setting the standards for machine learning in biology","volume":"20","author":"DT Jones","year":"2019","journal-title":"Nat Rev Mol Cell Biol"},{"issue":"1","key":"pcbi.1010669.ref003","doi-asserted-by":"crossref","first-page":"40","DOI":"10.1038\/s41580-021-00407-0","article-title":"A guide to machine learning for biologists","volume":"23","author":"JG Greener","year":"2021","journal-title":"Nat Rev Mol Cell Biol"},{"issue":"3","key":"pcbi.1010669.ref004","doi-asserted-by":"crossref","first-page":"e1009803","DOI":"10.1371\/journal.pcbi.1009803","article-title":"Ten quick tips for deep learning in biology","volume":"18","author":"BD Lee","year":"2022","journal-title":"PLoS Comput Biol"},{"key":"pcbi.1010669.ref005","doi-asserted-by":"crossref","first-page":"20170387","DOI":"10.1098\/rsif.2017.0387","article-title":"Opportunities and obstacles for deep learning in biology and medicine.","volume":"15","author":"T Ching","year":"2018","journal-title":"J R Soc Interface"},{"issue":"2","key":"pcbi.1010669.ref006","doi-asserted-by":"crossref","first-page":"e1008531","DOI":"10.1371\/journal.pcbi.1008531","article-title":"Ten simple rules for engaging with artificial intelligence in biomedicine","volume":"17","author":"A Malik","year":"2021","journal-title":"PLoS Comput Biol"},{"issue":"1","key":"pcbi.1010669.ref007","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13040-017-0155-3","article-title":"Ten quick tips for machine learning in computational biology","volume":"10","author":"D. Chicco","year":"2017","journal-title":"BioData Min"},{"issue":"4","key":"pcbi.1010669.ref008","doi-asserted-by":"crossref","first-page":"e1004191","DOI":"10.1371\/journal.pcbi.1004191","article-title":"Ten simple rules for reducing overoptimistic reporting in methodological computational research.","volume":"11","author":"AL Boulesteix","year":"2015","journal-title":"PLoS Comput Biol"},{"issue":"10","key":"pcbi.1010669.ref009","doi-asserted-by":"crossref","first-page":"1122","DOI":"10.1038\/s41592-021-01205-4","article-title":"DOME: recommendations for supervised machine learning validation in biology","volume":"18","author":"I Walsh","year":"2021","journal-title":"Nat Methods"},{"issue":"1","key":"pcbi.1010669.ref010","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s12859-022-04565-6","article-title":"Assigning protein function from domain-function associations using DomFun","volume":"23","author":"E Rojano","year":"2022","journal-title":"BMC Bioinformatics"},{"issue":"14","key":"pcbi.1010669.ref011","doi-asserted-by":"crossref","first-page":"2403","DOI":"10.1093\/bioinformatics\/bty1006","article-title":"Improving prediction of protein secondary structure, backbone angles, solvent accessibility and contact numbers by using predicted contact maps and an ensemble of recurrent and residual convolutional neural networks","volume":"35","author":"J Hanson","year":"2019","journal-title":"Bioinformatics"},{"issue":"15","key":"pcbi.1010669.ref012","doi-asserted-by":"crossref","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","volume":"118","author":"A Rives","year":"2021","journal-title":"Proc Natl Acad Sci U S A"},{"issue":"1","key":"pcbi.1010669.ref013","doi-asserted-by":"crossref","first-page":"723","DOI":"10.1186\/s12859-019-3220-8","article-title":"Modeling aspects of the language of life through transfer-learning protein sequences","volume":"20","author":"M Heinzinger","year":"2019","journal-title":"BMC Bioinformatics"},{"issue":"1","key":"pcbi.1010669.ref014","doi-asserted-by":"crossref","first-page":"1160","DOI":"10.1038\/s41598-020-80786-0","article-title":"Embeddings from deep learning transfer GO annotations beyond homology.","volume":"11","author":"M Littmann","year":"2021","journal-title":"Sci Rep."},{"issue":"1","key":"pcbi.1010669.ref015","doi-asserted-by":"crossref","first-page":"10487","DOI":"10.1038\/s41598-022-13951-2","article-title":"Multi-task learning to leverage partially annotated data for PPI interface prediction.","volume":"12","author":"H Capel","year":"2022","journal-title":"Sci Rep"},{"issue":"1","key":"pcbi.1010669.ref016","doi-asserted-by":"crossref","first-page":"16047","DOI":"10.1038\/s41598-022-19608-4","article-title":"ProteinGLUE multi-task benchmark suite for self-supervised protein modeling.","volume":"12","author":"H Capel","year":"2022","journal-title":"Sci Rep"},{"issue":"8","key":"pcbi.1010669.ref017","doi-asserted-by":"crossref","first-page":"2111","DOI":"10.1093\/bioinformatics\/btac071","article-title":"PIPENN: protein interface prediction from sequence with an ensemble of neural nets","volume":"38","author":"B Stringer","year":"2022","journal-title":"Bioinformatics"},{"issue":"10","key":"pcbi.1010669.ref018","article-title":"Seeing the trees through the forest: Sequence-based homo- and heteromeric protein-protein interaction sites prediction using random forest","volume":"33","author":"Q Hou","year":"2017","journal-title":"Bioinformatics"},{"key":"pcbi.1010669.ref019","article-title":"SeRenDIP: SEquential REmasteriNg to DerIve profiles for fast and accurate predictions of PPI interface positions","author":"Q Hou","year":"2019","journal-title":"Bioinformatics"},{"issue":"20","key":"pcbi.1010669.ref020","doi-asserted-by":"crossref","first-page":"3421","DOI":"10.1093\/bioinformatics\/btab321","article-title":"SeRenDIP-CE: sequence-based interface prediction for conformational epitopes","volume":"37","author":"Q Hou","year":"2021","journal-title":"Bioinformatics"},{"key":"pcbi.1010669.ref021","first-page":"1","article-title":"How sticky are our proteins? Quantifying hydrophobicity of the human proteome.","author":"JHM van Gils","year":"2022","journal-title":"Bioinform Adv."},{"issue":"7873","key":"pcbi.1010669.ref022","doi-asserted-by":"crossref","first-page":"583","DOI":"10.1038\/s41586-021-03819-2","article-title":"Highly accurate protein structure prediction with AlphaFold","volume":"596","author":"J Jumper","year":"2021","journal-title":"Nature"},{"issue":"7873","key":"pcbi.1010669.ref023","doi-asserted-by":"crossref","first-page":"590","DOI":"10.1038\/s41586-021-03828-1","article-title":"Highly accurate protein structure prediction for the human proteome","volume":"596","author":"K Tunyasuvunakool","year":"2021","journal-title":"Nature"},{"key":"pcbi.1010669.ref024","doi-asserted-by":"crossref","first-page":"2102592","DOI":"10.1002\/advs.202102592","article-title":"Improved Protein Structure Prediction Using a New Multi-Scale Network and Homologous Templates.","author":"H Su","year":"2021","journal-title":"Adv Sci"},{"key":"pcbi.1010669.ref025","article-title":"Deep graph learning of inter-protein contacts","author":"Z Xie","year":"2021","journal-title":"Bioinformatics"},{"issue":"10","key":"pcbi.1010669.ref026","doi-asserted-by":"crossref","first-page":"1666","DOI":"10.1038\/s41591-021-01533-0","article-title":"AlphaFold heralds a data-driven revolution in biology and medicine","volume":"27","author":"JM Thornton","year":"2021","journal-title":"Nat Med"},{"issue":"1","key":"pcbi.1010669.ref027","doi-asserted-by":"crossref","first-page":"15","DOI":"10.1038\/s41592-021-01365-3","article-title":"The impact of AlphaFold2 one year on.","volume":"19","author":"DT Jones","year":"2022","journal-title":"Nat Methods"},{"issue":"10","key":"pcbi.1010669.ref028","doi-asserted-by":"crossref","first-page":"e1008281","DOI":"10.1371\/journal.pcbi.1008281","article-title":"Ten simple rules for biologists initiating a collaboration with computer scientists","volume":"16","author":"M. Cechova","year":"2020","journal-title":"PLoS Comput Biol"},{"issue":"5","key":"pcbi.1010669.ref029","doi-asserted-by":"crossref","first-page":"e1008879","DOI":"10.1371\/journal.pcbi.1008879","article-title":"Ten simple rules to cultivate transdisciplinary collaboration in data science","volume":"17","author":"F Sahneh","year":"2021","journal-title":"PLoS Comput Biol"},{"key":"pcbi.1010669.ref030","first-page":"864405","article-title":"End-to-end multitask learning, from protein language to protein features without alignments.","author":"A Elnaggar","year":"2019","journal-title":"bioRxiv."},{"issue":"8","key":"pcbi.1010669.ref031","article-title":"ProtTrans: Towards Cracking the Language of Life\u2019s Code Through Self-Supervised Deep Learning and High Performance Computing.","volume":"14","author":"A Elnaggar","year":"2020","journal-title":"bioRxiv"},{"key":"pcbi.1010669.ref032","doi-asserted-by":"crossref","first-page":"278","DOI":"10.12688\/f1000research.20559.1","article-title":"A community proposal to integrate structural bioinformatics activities in ELIXIR (3D-Bioinfo Community).","volume":"9","author":"C Orengo","year":"2020","journal-title":"F1000Res"},{"key":"pcbi.1010669.ref033","doi-asserted-by":"crossref","first-page":"160018","DOI":"10.1038\/sdata.2016.18","article-title":"The FAIR Guiding Principles for scientific data management and stewardship.","volume":"3","author":"MD Wilkinson","year":"2016","journal-title":"Sci Data."},{"issue":"3","key":"pcbi.1010669.ref034","doi-asserted-by":"crossref","first-page":"e1005399","DOI":"10.1371\/journal.pcbi.1005399","article-title":"Ten simple rules for responsible big data research.","volume":"13","author":"M Zook","year":"2017","journal-title":"PLoS Comput Biol"},{"key":"pcbi.1010669.ref035","article-title":"Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language.","author":"MR Crusoe","year":"2021","journal-title":"arXiv"},{"issue":"D1","key":"pcbi.1010669.ref036","doi-asserted-by":"crossref","first-page":"D1","DOI":"10.1093\/nar\/gkab1195","article-title":"The 2022 Nucleic Acids Research database issue and the online molecular biology database collection","volume":"50","author":"DJ Rigden","year":"2022","journal-title":"Nucleic Acids Res"},{"issue":"1","key":"pcbi.1010669.ref037","article-title":"Sequence specificity between interacting and non-interacting homologs identifies interface residues\u2014a homodimer and monomer use case","volume":"16","author":"Q Hou","year":"2015","journal-title":"BMC Bioinformatics"},{"issue":"D1","key":"pcbi.1010669.ref038","doi-asserted-by":"crossref","first-page":"D266","DOI":"10.1093\/nar\/gkaa1079","article-title":"CATH: increased structural coverage of functional space","volume":"49","author":"I Sillitoe","year":"2021","journal-title":"Nucleic Acids Res"},{"issue":"D1","key":"pcbi.1010669.ref039","doi-asserted-by":"crossref","first-page":"D304","DOI":"10.1093\/nar\/gkt1240","article-title":"SCOPe: Structural Classification of Proteins\u2013extended, integrating SCOP and ASTRAL data and classification of new structures","volume":"42","author":"NK Fox","year":"2014","journal-title":"Nucleic Acids Res"},{"issue":"4","key":"pcbi.1010669.ref040","doi-asserted-by":"crossref","first-page":"448","DOI":"10.1093\/bioinformatics\/btaa773","article-title":"EpiDope: a deep neural network for linear B-cell epitope prediction","volume":"37","author":"M Collatz","year":"2021","journal-title":"Bioinformatics"},{"issue":"W1","key":"pcbi.1010669.ref041","doi-asserted-by":"crossref","first-page":"W24","DOI":"10.1093\/nar\/gkx346","article-title":"BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes","volume":"45","author":"MC Jespersen","year":"2017","journal-title":"Nucleic Acids Res"},{"issue":"5","key":"pcbi.1010669.ref042","doi-asserted-by":"crossref","first-page":"821","DOI":"10.1093\/bib\/bbx022","article-title":"Review and comparative assessment of sequence-based predictors of protein-binding residues","volume":"19","author":"J Zhang","year":"2018","journal-title":"Brief Bioinform"},{"key":"pcbi.1010669.ref043","first-page":"4765","volume-title":"A Unified Approach to Interpreting Model Predictions.","author":"SM Lundberg","year":"2017"},{"issue":"W1","key":"pcbi.1010669.ref044","doi-asserted-by":"crossref","first-page":"W510","DOI":"10.1093\/nar\/gkac439","article-title":"NetSurfP-3.0: accurate and fast prediction of protein structural features by protein language models and deep learning","volume":"50","author":"MH H\u00f8ie","year":"2022","journal-title":"Nucleic Acids Res"},{"issue":"1","key":"pcbi.1010669.ref045","doi-asserted-by":"crossref","first-page":"29","DOI":"10.1148\/radiology.143.1.7063747","article-title":"The meaning and use of the area under a receiver operating characteristic (ROC) curve.","volume":"143","author":"JA Hanley","year":"1982","journal-title":"Radiology."},{"key":"pcbi.1010669.ref046","doi-asserted-by":"crossref","first-page":"742","DOI":"10.12688\/f1000research.15140.2","article-title":"Recommendations for the packaging and containerizing of bioinformatics software","volume":"7","author":"B Gruening","year":"2019","journal-title":"F1000Res."},{"issue":"11","key":"pcbi.1010669.ref047","doi-asserted-by":"crossref","first-page":"e1008316","DOI":"10.1371\/journal.pcbi.1008316","article-title":"Ten simple rules for writing Dockerfiles for reproducible data science","volume":"16","author":"D Nust","year":"2020","journal-title":"PLoS Comput Biol"},{"issue":"W1","key":"pcbi.1010669.ref048","doi-asserted-by":"crossref","first-page":"W537","DOI":"10.1093\/nar\/gky379","article-title":"The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update","volume":"46","author":"E Afgan","year":"2018","journal-title":"Nucleic Acids Res"},{"issue":"W1","key":"pcbi.1010669.ref049","doi-asserted-by":"crossref","first-page":"W345","DOI":"10.1093\/nar\/gkac247","article-title":"The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update","volume":"50","author":"TG Community","year":"2022","journal-title":"Nucleic Acids Res"},{"issue":"2","key":"pcbi.1010669.ref050","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1145\/2641190.2641198","article-title":"OpenML: networked science in machine learning","volume":"15","author":"J Vanschoren","year":"2014","journal-title":"ACM SIGKDD Explor"},{"key":"pcbi.1010669.ref051","author":"J Bai","year":"2019","journal-title":"Others. ONNX: Open Neural Network Exchange"},{"issue":"W1","key":"pcbi.1010669.ref052","doi-asserted-by":"crossref","first-page":"W52","DOI":"10.1093\/nar\/gkab425","article-title":"b2bTools: online predictions for protein biophysical features and their conservation","volume":"49","author":"LP Kagami","year":"2021","journal-title":"Nucleic Acids Res"},{"issue":"W1","key":"pcbi.1010669.ref053","doi-asserted-by":"crossref","first-page":"W1","DOI":"10.1093\/nar\/gkac525","article-title":"Editorial: the 20th annual Nucleic Acids Research Web Server Issue 2022","volume":"50","author":"J Bujnicki","year":"2022","journal-title":"Nucleic Acids Res"},{"issue":"1","key":"pcbi.1010669.ref054","first-page":"1","article-title":"SPRINT: Ultrafast protein-protein interaction prediction of the entire human interactome","volume":"18","author":"Y Li","year":"2017","journal-title":"BMC Bioinformatics"},{"issue":"D1","key":"pcbi.1010669.ref055","doi-asserted-by":"crossref","first-page":"D439","DOI":"10.1093\/nar\/gkab1061","article-title":"AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models","volume":"50","author":"M Varadi","year":"2022","journal-title":"Nucleic Acids Res"},{"issue":"10","key":"pcbi.1010669.ref056","doi-asserted-by":"crossref","first-page":"e1003858","DOI":"10.1371\/journal.pcbi.1003858","article-title":"Ten Simple Rules for Writing a PLOS Ten Simple Rules Article.","volume":"10","author":"H Dashnow","year":"2014","journal-title":"PLoS Comput Biol."},{"issue":"6","key":"pcbi.1010669.ref057","doi-asserted-by":"crossref","first-page":"e1002108","DOI":"10.1371\/journal.pcbi.1002108","article-title":"Ten Simple Rules for Building and Maintaining a Scientific Reputation.","volume":"7","author":"PE Bourne","year":"2011","journal-title":"PLoS Comput Biol."}],"container-title":["PLOS Computational Biology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1010669","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,1]],"date-time":"2022-12-01T19:19:24Z","timestamp":1669922364000},"score":1,"resource":{"primary":{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1010669"}},"subtitle":[],"editor":[{"given":"Patricia M.","family":"Palagi","sequence":"first","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2022,12,1]]},"references-count":57,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2022,12,1]]}},"URL":"https:\/\/doi.org\/10.1371\/journal.pcbi.1010669","relation":{},"ISSN":["1553-7358"],"issn-type":[{"value":"1553-7358","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,12,1]]}}}