{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,11]],"date-time":"2025-11-11T12:47:12Z","timestamp":1762865232164},"reference-count":36,"publisher":"Oxford University Press (OUP)","issue":"6","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2007,3,15]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Motivation: To predict which of the vast number of human single nucleotide polymorphisms (SNPs) are deleterious to gene function or likely to be disease associated is an important problem, and many methods have been reported in the literature. All methods require data sets of mutations classified as \u2018deleterious\u2019 or \u2018neutral\u2019 for training and\/or validation. While different workers have used different data sets there has been no study of which is best. Here, the three most commonly used data sets are analysed. We examine their contents and relate this to classifiers, with the aims of revealing the strengths and pitfalls of each data set, and recommending a best approach for future studies.<\/jats:p><jats:p>Results: The data sets examined are shown to be substantially different in content, particularly with regard to amino acid substitutions, reflecting the different ways in which they are derived. This leads to differences in classifiers and reveals some serious pitfalls of some data sets, making them less than ideal for non-synonymous SNP prediction.<\/jats:p><jats:p>Availability: Software is available on request from the authors.<\/jats:p><jats:p>Contact: \u00a0d.r.westhead@leeds.ac.uk<\/jats:p><jats:p>Supplementary information: Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btl649","type":"journal-article","created":{"date-parts":[[2007,1,19]],"date-time":"2007-01-19T01:13:12Z","timestamp":1169169192000},"page":"664-672","source":"Crossref","is-referenced-by-count":46,"title":["Deleterious SNP prediction: be mindful of your training data!"],"prefix":"10.1093","volume":"23","author":[{"given":"Matthew A.","family":"Care","sequence":"first","affiliation":[{"name":"1 Institute of Molecular and Cellular Biology, University of Leeds, Leeds, LS2 9JT, UK and 2School of Computing, University of Leeds, Leeds, LS2 9JT, UK"}]},{"given":"Chris J.","family":"Needham","sequence":"additional","affiliation":[{"name":"1 Institute of Molecular and Cellular Biology, University of Leeds, Leeds, LS2 9JT, UK and 2School of Computing, University of Leeds, Leeds, LS2 9JT, UK"}]},{"given":"Andrew J.","family":"Bulpitt","sequence":"additional","affiliation":[{"name":"1 Institute of Molecular and Cellular Biology, University of Leeds, Leeds, LS2 9JT, UK and 2School of Computing, University of Leeds, Leeds, LS2 9JT, UK"}]},{"given":"David R.","family":"Westhead","sequence":"additional","affiliation":[{"name":"1 Institute of Molecular and Cellular Biology, University of Leeds, Leeds, LS2 9JT, UK and 2School of Computing, University of Leeds, Leeds, LS2 9JT, UK"}]}],"member":"286","published-online":{"date-parts":[[2007,1,18]]},"reference":[{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"753","DOI":"10.1002\/prot.20176","article-title":"Accurate prediction of solvent accessibility using neural networks-based regression","volume":"56","author":"Adamczak","year":"2004","journal-title":"Proteins"},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"3754","DOI":"10.1021\/bi00387a002","article-title":"Temperature-sensitive mutations of bacteriophage T4 lysozyme occur at sites with low mobility and low solvent accessibility in the folded protein","volume":"26","author":"Alber","year":"1987","journal-title":"Biochemistry"},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","article-title":"Gapped BLAST and PSI-BLAST: a new generation of protein database search programs","volume":"25","author":"Altschul","year":"1997","journal-title":"Nucleic Acids Res."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"2185","DOI":"10.1093\/bioinformatics\/bti365","article-title":"Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information","volume":"21","author":"Bao","year":"2005","journal-title":"Bioinformatics"},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"1323","DOI":"10.1093\/protein\/7.11.1323","article-title":"Amino acid substitution during functionally constrained divergent evolution of protein sequences","volume":"7","author":"Benner","year":"1994","journal-title":"Protein Eng."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"365","DOI":"10.1093\/nar\/gkg095","article-title":"The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003","volume":"31","author":"Boeckmann","year":"2003","journal-title":"Nucleic Acids Res."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"178","DOI":"10.1002\/humu.20063","article-title":"Bayesian approach to discovering pathogenic SNPs in conserved protein domains","volume":"24","author":"Cai","year":"2004","journal-title":"Hum. Mutat."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"231","DOI":"10.1038\/10290","article-title":"Characterization of single-nucleotide polymorphisms in coding regions of human genes","volume":"22","author":"Cargill","year":"1999","journal-title":"Nat. Genet."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"683","DOI":"10.1006\/jmbi.2001.4510","article-title":"Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation","volume":"307","author":"Chasman","year":"2001","journal-title":"J. Mol. Biol."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"1229","DOI":"10.1101\/gr.8.12.1229","article-title":"A DNA polymorphism discovery resource for research on human genetic variation","volume":"8","author":"Collins","year":"1998","journal-title":"Genome Res."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"217","DOI":"10.1186\/1471-2105-7-217","article-title":"Predicting deleterious nsSNPs: an analysis of sequence and structural attributes","volume":"7","author":"Dobson","year":"2006","journal-title":"BMC Bioinformatics"},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"113","DOI":"10.1186\/1471-2105-5-113","article-title":"MUSCLE: a multiple sequence alignment method with reduced time and space complexity","volume":"5","author":"Edgar","year":"2004","journal-title":"BMC Bioinformatics"},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"878","DOI":"10.1002\/prot.20664","article-title":"Use of bioinformatics tools for the annotation of disease-associated mutations in animal models","volume":"61","author":"Ferrer-Costa","year":"2005","journal-title":"Proteins"},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"811","DOI":"10.1002\/prot.20252","article-title":"Sequence-based prediction of pathological mutations","volume":"57","author":"Ferrer-Costa","year":"2004","journal-title":"Proteins"},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"771","DOI":"10.1006\/jmbi.2001.5255","article-title":"Characterization of disease-associated single amino acid polymorphisms in terms of sequence and structure properties","volume":"315","author":"Ferrer-Costa","year":"2002","journal-title":"J. Mol. Biol."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"806","DOI":"10.1002\/prot.10458","article-title":"Prediction of deleterious functional effects of amino acid mutations using a library of structure-based function descriptors","volume":"53","author":"Herrgard","year":"2003","journal-title":"Proteins"},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"1022","DOI":"10.1016\/0022-2836(94)90009-4","article-title":"Wide variations in neighbor-dependent substitution rates","volume":"236","author":"Hess","year":"1994","journal-title":"J. Mol. Biol."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"2199","DOI":"10.1093\/bioinformatics\/btg297","article-title":"A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function","volume":"19","author":"Krishnan","year":"2003","journal-title":"Bioinformatics"},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"234","DOI":"10.1038\/85776","article-title":"Variation is the spice of life","volume":"27","author":"Kruglyak","year":"2001","journal-title":"Nat. Genet."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"421","DOI":"10.1006\/jmbi.1994.1458","article-title":"Genetic studies of the lac repressor. XIV. Analysis of 4000 altered Escherichia coli lac repressors reveals essential and non-essential residues, as well as \u201cspacers\u201d which do not require a specific sequence","volume":"240","author":"Markiewicz","year":"1994","journal-title":"J. Mol. Biol."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"405","DOI":"10.1186\/1471-2105-7-405","article-title":"Predicting the effect of missense mutations on protein function: analysis with Bayesian networks","volume":"7","author":"Needham","year":"2006","journal-title":"BMC Bioinformatics"},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"863","DOI":"10.1101\/gr.176601","article-title":"Predicting deleterious amino acid substitutions","volume":"11","author":"Ng","year":"2001","journal-title":"Genome Res."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"3894","DOI":"10.1093\/nar\/gkf493","article-title":"Human non-synonymous SNPs: server and survey","volume":"30","author":"Ramensky","year":"2002","journal-title":"Nucleic Acids Res."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"67","DOI":"10.1016\/0022-2836(91)90738-R","article-title":"Systematic mutation of bacteriophage T4 lysozyme","volume":"222","author":"Rennell","year":"1991","journal-title":"J. Mol. Biol."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"584","DOI":"10.1006\/jmbi.1993.1413","article-title":"Prediction of protein secondary structure at better than 70% accuracy","volume":"232","author":"Rost","year":"1993","journal-title":"J. Mol. Biol."},{"key":"2023041107503041000_","first-page":"260","article-title":"YaDT: Yet another Decision Tree builder. Proceedings of the 16th International Conference on Tools with Artificial Intelligence","volume":"0","author":"Ruggieri","year":"2004","journal-title":"IEEE Press"},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"891","DOI":"10.1016\/S0022-2836(02)00813-6","article-title":"Evaluation of structural and evolutionary contributions to deleterious mutation prediction","volume":"322","author":"Saunders","year":"2002","journal-title":"J. Mol. Biol."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"509","DOI":"10.1006\/jmbi.1996.0479","article-title":"Genetic studies of the Lac repressor. XV: 4000 single amino acid substitutions and analysis of the resulting phenotypes on the basis of the protein structure","volume":"261","author":"Suckow","year":"1996","journal-title":"J. Mol. Biol."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"591","DOI":"10.1093\/hmg\/10.6.591","article-title":"Prediction of deleterious human alleles","volume":"10","author":"Sunyaev","year":"2001","journal-title":"Hum. Mol. Genet."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"227","DOI":"10.1002\/prot.10146","article-title":"Scoring residue conservation","volume":"48","author":"Valdar","year":"2002","journal-title":"Proteins"},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1111\/j.1467-9876.2005.00478.x","article-title":"A hierarchical Bayesian model for predicting the functional consequences of amino-acid polymorphisms","volume":"54","author":"Verzilli","year":"2005","journal-title":"J. R. Stat. Soc. Ser. C-Appl. Stat."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"R72","DOI":"10.1186\/gb-2003-4-11-r72","article-title":"The amino-acid mutational spectrum of human genetic disease","volume":"4","author":"Vitkup","year":"2003","journal-title":"Genome Biol."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"263","DOI":"10.1002\/humu.22","article-title":"SNPs, protein structure, and disease","volume":"17","author":"Wang","year":"2001","journal-title":"Hum. Mutat."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"464","DOI":"10.1002\/humu.20021","article-title":"The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants","volume":"23","author":"Yip","year":"2004","journal-title":"Hum. Mutat."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"459","DOI":"10.1016\/j.jmb.2005.08.020","article-title":"Loss of protein structure stability as a major causative factor in monogenic disease","volume":"353","author":"Yue","year":"2005","journal-title":"J. Mol. Biol."},{"key":"2023041107503041000_","doi-asserted-by":"crossref","first-page":"1263","DOI":"10.1016\/j.jmb.2005.12.025","article-title":"Identification and Analysis of Deleterious Human SNPs","volume":"356","author":"Yue","year":"2006","journal-title":"J. Mol. Biol."}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/23\/6\/664\/49821334\/bioinformatics_23_6_664.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/23\/6\/664\/49821334\/bioinformatics_23_6_664.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,10]],"date-time":"2024-02-10T09:04:07Z","timestamp":1707555847000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/23\/6\/664\/414126"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2007,1,18]]},"references-count":36,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2007,3,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btl649","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2007,3,15]]},"published":{"date-parts":[[2007,1,18]]}}}