{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,20]],"date-time":"2025-09-20T18:28:50Z","timestamp":1758392930137},"reference-count":37,"publisher":"Oxford University Press (OUP)","issue":"16","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2014,8,15]]},"abstract":"<jats:p>Motivation: Structural knowledge, extracted from the Protein Data Bank (PDB), underlies numerous potential functions and prediction methods. The PDB, however, is highly biased: many proteins have more than one entry, while entire protein families are represented by a single structure, or even not at all. The standard solution to this problem is to limit the studies to non-redundant subsets of the PDB. While alleviating biases, this solution hides the many-to-many relations between sequences and structures. That is, non-redundant datasets conceal the diversity of sequences that share the same fold and the existence of multiple conformations for the same protein. A particularly disturbing aspect of non-redundant subsets is that they hardly benefit from the rapid pace of protein structure determination, as most newly solved structures fall within existing families.<\/jats:p>\n               <jats:p>Results: In this study we explore the concept of redundancy-weighted datasets, originally suggested by Miyazawa and Jernigan. Redundancy-weighted datasets include all available structures and associate them (or features thereof) with weights that are inversely proportional to the number of their homologs. Here, we provide the first systematic comparison of redundancy-weighted datasets with non-redundant ones. We test three weighting schemes and show that the distributions of structural features that they produce are smoother (having higher entropy) compared with the distributions inferred from non-redundant datasets. We further show that these smoothed distributions are both more robust and more correct than their non-redundant counterparts.<\/jats:p>\n               <jats:p>We suggest that the better distributions, inferred using redundancy-weighting, may improve the accuracy of knowledge-based potentials and increase the power of protein structure prediction methods. Consequently, they may enhance model-driven molecular biology.<\/jats:p>\n               <jats:p>Contact: \u00a0cheny@il.ibm.com or chen.keasar@gmail.com<\/jats:p>","DOI":"10.1093\/bioinformatics\/btu242","type":"journal-article","created":{"date-parts":[[2014,4,26]],"date-time":"2014-04-26T04:03:50Z","timestamp":1398485030000},"page":"2295-2301","source":"Crossref","is-referenced-by-count":11,"title":["Redundancy-weighting for better inference of protein structural features"],"prefix":"10.1093","volume":"30","author":[{"given":"Chen","family":"Yanover","sequence":"first","affiliation":[{"name":"1 \u00a01Machine Learning for Healthcare and Life-Sciences, Analytics Department, IBM Research Laboratory, Haifa, 3490002, 2Department of Software Engineering, Shamoon College of Engineering, Beer-Sheva 84100, Israel, 3Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305, USA, 4Department of Computer Science, University of Haifa, Mount Carmel, Haifa, 3498838 and 5Departments of Life Sciences and Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Natalia","family":"Vanetik","sequence":"additional","affiliation":[{"name":"1 \u00a01Machine Learning for Healthcare and Life-Sciences, Analytics Department, IBM Research Laboratory, Haifa, 3490002, 2Department of Software Engineering, Shamoon College of Engineering, Beer-Sheva 84100, Israel, 3Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305, USA, 4Department of Computer Science, University of Haifa, Mount Carmel, Haifa, 3498838 and 5Departments of Life Sciences and Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Michael","family":"Levitt","sequence":"additional","affiliation":[{"name":"1 \u00a01Machine Learning for Healthcare and Life-Sciences, Analytics Department, IBM Research Laboratory, Haifa, 3490002, 2Department of Software Engineering, Shamoon College of Engineering, Beer-Sheva 84100, Israel, 3Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305, USA, 4Department of Computer Science, University of Haifa, Mount Carmel, Haifa, 3498838 and 5Departments of Life Sciences and Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rachel","family":"Kolodny","sequence":"additional","affiliation":[{"name":"1 \u00a01Machine Learning for Healthcare and Life-Sciences, Analytics Department, IBM Research Laboratory, Haifa, 3490002, 2Department of Software Engineering, Shamoon College of Engineering, Beer-Sheva 84100, Israel, 3Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305, USA, 4Department of Computer Science, University of Haifa, Mount Carmel, Haifa, 3498838 and 5Departments of Life Sciences and Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chen","family":"Keasar","sequence":"additional","affiliation":[{"name":"1 \u00a01Machine Learning for Healthcare and Life-Sciences, Analytics Department, IBM Research Laboratory, Haifa, 3490002, 2Department of Software Engineering, Shamoon College of Engineering, Beer-Sheva 84100, Israel, 3Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305, USA, 4Department of Computer Science, University of Haifa, Mount Carmel, Haifa, 3498838 and 5Departments of Life Sciences and Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2014,4,25]]},"reference":[{"key":"2023012711523175000_btu242-B1","doi-asserted-by":"crossref","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","article-title":"Gapped BLAST and PSI-BLAST: a new generation of protein database search programs","volume":"25","author":"Altschul","year":"1997","journal-title":"Nucleic Acids Res."},{"key":"2023012711523175000_btu242-B2","doi-asserted-by":"crossref","first-page":"218","DOI":"10.1002\/bip.22132","article-title":"The future of the protein data bank","volume":"99","author":"Berman","year":"2013","journal-title":"Biopolymers"},{"key":"2023012711523175000_btu242-B3","doi-asserted-by":"crossref","first-page":"1036","DOI":"10.1016\/j.febslet.2012.12.029","article-title":"Trendspotting in the protein data bank","volume":"587","author":"Berman","year":"2013","journal-title":"FEBS Lett."},{"key":"2023012711523175000_btu242-B4","doi-asserted-by":"crossref","first-page":"535","DOI":"10.1016\/S0022-2836(77)80200-3","article-title":"The protein data bank: a computer-based archival file for macromolecular structures","volume":"112","author":"Bernstein","year":"1977","journal-title":"J. Mol. Biol."},{"key":"2023012711523175000_btu242-B6","doi-asserted-by":"crossref","first-page":"3481","DOI":"10.1073\/pnas.0914097107","article-title":"FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately","volume":"107","author":"Budowski-Tal","year":"2010","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023012711523175000_btu242-B7","doi-asserted-by":"crossref","first-page":"e55484","DOI":"10.1371\/journal.pone.0055484","article-title":"Maximising the size of non-redundant protein datasets using graph theory","volume":"8","author":"Bull","year":"2013","journal-title":"PLoS One"},{"key":"2023012711523175000_btu242-B8","doi-asserted-by":"crossref","first-page":"222","DOI":"10.1021\/bi00699a002","article-title":"Prediction of protein conformation","volume":"13","author":"Chou","year":"1974","journal-title":"Biochemistry"},{"key":"2023012711523175000_btu242-B10","doi-asserted-by":"crossref","DOI":"10.1016\/S0076-6879(97)77022-8","article-title":"VERIFY3D: assessment of protein models with three-dimensional profiles","author":"Eisenberg","year":"1997"},{"key":"2023012711523175000_btu242-B11","doi-asserted-by":"crossref","first-page":"97","DOI":"10.1016\/0022-2836(78)90297-8","article-title":"Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins","volume":"120","author":"Garnier","year":"1978","journal-title":"J. Mol. Biol."},{"key":"2023012711523175000_btu242-B12","doi-asserted-by":"crossref","first-page":"1923","DOI":"10.1002\/prot.23015","article-title":"Multibody coarse-grained potentials for native structure recognition and quality assessment of protein models","volume":"79","author":"Gniewek","year":"2011","journal-title":"Proteins"},{"key":"2023012711523175000_btu242-B13","doi-asserted-by":"crossref","first-page":"170","DOI":"10.1016\/j.sbi.2008.01.006","article-title":"The structure of protein evolution and the evolution of protein structure","volume":"18","author":"Goldstein","year":"2008","journal-title":"Curr. Opin. Struct. Biol."},{"key":"2023012711523175000_btu242-B14","doi-asserted-by":"crossref","first-page":"e23294","DOI":"10.1371\/journal.pone.0023294","article-title":"Generalized fragment picking in Rosetta: design, protocols and applications","volume":"6","author":"Gront","year":"2011","journal-title":"PLoS One"},{"key":"2023012711523175000_btu242-B15","volume-title":"Scientific Computing: An Introductory Survey","author":"Heath","year":"2002","edition":"2nd edn"},{"key":"2023012711523175000_btu242-B16","doi-asserted-by":"crossref","first-page":"522","DOI":"10.1002\/pro.5560030317","article-title":"Enlarged representative set of protein structures","volume":"3","author":"Hobohm","year":"1994","journal-title":"Protein Sci."},{"key":"2023012711523175000_btu242-B17","doi-asserted-by":"crossref","first-page":"2577","DOI":"10.1002\/bip.360221211","article-title":"Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features","volume":"22","author":"Kabsch","year":"1983","journal-title":"Biopolymers"},{"key":"2023012711523175000_btu242-B18","doi-asserted-by":"crossref","first-page":"3110","DOI":"10.1093\/bioinformatics\/btr541","article-title":"HHfrag: HMM-based fragment detection using HHpred","volume":"27","author":"Kalev","year":"2011","journal-title":"Bioinformatics"},{"key":"2023012711523175000_btu242-B19","doi-asserted-by":"crossref","first-page":"W492","DOI":"10.1093\/nar\/gkp403","article-title":"SAM-T08, HMM-based protein structure prediction","volume":"37","author":"Karplus","year":"2009","journal-title":"Nucleic Acids Res."},{"key":"2023012711523175000_btu242-B21","doi-asserted-by":"crossref","first-page":"891","DOI":"10.1002\/prot.21770","article-title":"Sequence-similar, structure-dissimilar protein pairs in the PDB","volume":"71","author":"Kosloff","year":"2008","journal-title":"Proteins"},{"key":"2023012711523175000_btu242-B22","doi-asserted-by":"crossref","first-page":"11079","DOI":"10.1073\/pnas.0905029106","article-title":"Nature of the protein universe","volume":"106","author":"Levitt","year":"2009","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023012711523175000_btu242-B23","doi-asserted-by":"crossref","first-page":"1658","DOI":"10.1093\/bioinformatics\/btl158","article-title":"Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences","volume":"22","author":"Li","year":"2006","journal-title":"Bioinformatics"},{"key":"2023012711523175000_btu242-B24","doi-asserted-by":"crossref","first-page":"145","DOI":"10.1109\/18.61115","article-title":"Divergence measures based on the Shannon entropy","volume":"37","author":"Lin","year":"1991","journal-title":"IEEE Trans. Inf. Theory"},{"key":"2023012711523175000_btu242-B25","doi-asserted-by":"crossref","first-page":"83","DOI":"10.1038\/356083a0","article-title":"Assessment of protein models with three-dimensional profiles","volume":"356","author":"L\u00fcthy","year":"1992","journal-title":"Nature"},{"key":"2023012711523175000_btu242-B26","doi-asserted-by":"crossref","first-page":"404","DOI":"10.1093\/bioinformatics\/16.4.404","article-title":"The PSIPRED protein structure prediction server","volume":"16","author":"McGuffin","year":"2000","journal-title":"Bioinformatics"},{"key":"2023012711523175000_btu242-B27","doi-asserted-by":"crossref","first-page":"534","DOI":"10.1021\/ma00145a039","article-title":"Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation","volume":"18","author":"Miyazawa","year":"1985","journal-title":"Macromolecules"},{"key":"2023012711523175000_btu242-B28","doi-asserted-by":"crossref","first-page":"623","DOI":"10.1006\/jmbi.1996.0114","article-title":"Residue\u2013residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading","volume":"256","author":"Miyazawa","year":"1996","journal-title":"J. Mol. Biol."},{"key":"2023012711523175000_btu242-B29","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1002\/(SICI)1097-0134(19990101)34:1<49::AID-PROT5>3.0.CO;2-L","article-title":"Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues","volume":"34","author":"Miyazawa","year":"1999","journal-title":"Proteins"},{"key":"2023012711523175000_btu242-B30","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1016\/0022-2836(70)90057-4","article-title":"A general method applicable to the search for similarities in the amino acid sequence of two proteins","volume":"48","author":"Needleman","year":"1970","journal-title":"J. Mol. Biol."},{"key":"2023012711523175000_btu242-B32","doi-asserted-by":"crossref","first-page":"2444","DOI":"10.1073\/pnas.85.8.2444","article-title":"Improved tools for biological sequence comparison","volume":"85","author":"Pearson","year":"1988","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023012711523175000_btu242-B34","doi-asserted-by":"crossref","DOI":"10.1016\/S0076-6879(96)66033-9","article-title":"PHD: predicting one-dimensional protein structure by profile-based neural networks","volume-title":"Computer Methods for Macromolecular Sequence Analysis","author":"Rost","year":"1996"},{"key":"2023012711523175000_btu242-B35","doi-asserted-by":"crossref","first-page":"895","DOI":"10.1006\/jmbi.1997.1479","article-title":"An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction","volume":"275","author":"Samudrala","year":"1998","journal-title":"J. Mol. Biol."},{"key":"2023012711523175000_btu242-B36","doi-asserted-by":"crossref","first-page":"e80493","DOI":"10.1371\/journal.pone.0080493","article-title":"Detecting protein candidate fragments using a structural alphabet profile comparison approach","volume":"8","author":"Shen","year":"2013","journal-title":"PLoS One"},{"key":"2023012711523175000_btu242-B37","doi-asserted-by":"crossref","first-page":"355","DOI":"10.1002\/prot.340170404","article-title":"Recognition of errors in three-dimensional structures of proteins","volume":"17","author":"Sippl","year":"1993","journal-title":"Proteins"},{"key":"2023012711523175000_btu242-B38","doi-asserted-by":"crossref","first-page":"3177","DOI":"10.1073\/pnas.0611593104","article-title":"Near-native structure refinement using in vacuo energy minimization","volume":"104","author":"Summa","year":"2007","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023012711523175000_btu242-B39","doi-asserted-by":"crossref","first-page":"945","DOI":"10.1021\/ma60054a013","article-title":"Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins","volume":"9","author":"Tanaka","year":"1976","journal-title":"Macromolecules"},{"key":"2023012711523175000_btu242-B40","doi-asserted-by":"crossref","first-page":"4673","DOI":"10.1093\/nar\/22.22.4673","article-title":"CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice","volume":"22","author":"Thompson","year":"1994","journal-title":"Nucleic Acids Res."},{"key":"2023012711523175000_btu242-B41","doi-asserted-by":"crossref","first-page":"1589","DOI":"10.1093\/bioinformatics\/btg224","article-title":"PISCES: a protein sequence culling server","volume":"19","author":"Wang","year":"2003","journal-title":"Bioinformatics"},{"key":"2023012711523175000_btu242-B42","doi-asserted-by":"crossref","first-page":"W94","DOI":"10.1093\/nar\/gki402","article-title":"PISCES: recent improvements to a PDB sequence culling server","volume":"33","author":"Wang","year":"2005","journal-title":"Nucleic Acids Res."}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/30\/16\/2295\/48925762\/bioinformatics_30_16_2295.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/30\/16\/2295\/48925762\/bioinformatics_30_16_2295.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,27]],"date-time":"2023-01-27T12:12:20Z","timestamp":1674821540000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/30\/16\/2295\/2748190"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,4,25]]},"references-count":37,"journal-issue":{"issue":"16","published-print":{"date-parts":[[2014,8,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btu242","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published":{"date-parts":[[2014,4,25]]}}}