{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,5]],"date-time":"2025-12-05T12:08:40Z","timestamp":1764936520666,"version":"3.37.0"},"reference-count":34,"publisher":"Oxford University Press (OUP)","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2010,2,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Motivation: To test whether protein folding constraints and secondary structure sequence preferences significantly reduce the space of amino acid words in proteins, we compared the frequencies of four- and five-amino acid word clumps (independent words) in proteins to the frequencies predicted by four random sequence models.<\/jats:p><jats:p>Results: While the human proteome has many overrepresented word clumps, these words come from large protein families with biased compositions (e.g. Zn-fingers). In contrast, in a non-redundant sample of Pfam-AB, only 1% of four-amino acid word clumps (4.7% of 5mer words) are 2-fold overrepresented compared with our simplest random model [MC(0)], and 0.1% (4mers) to 0.5% (5mers) are 2-fold overrepresented compared with a window-shuffled random model. Using a false discovery rate q-value analysis, the number of exceptional four- or five-letter words in real proteins is similar to the number found when comparing words from one random model to another. Consensus overrepresented words are not enriched in conserved regions of proteins, but four-letter words are enriched 1.18- to 1.56-fold in \u03b1-helical secondary structures (but not \u03b2-strands). Five-residue consensus exceptional words are enriched for \u03b1-helix 1.43- to 1.61-fold. Protein word preferences in regular secondary structure do not appear to significantly restrict the use of sequence words in unrelated proteins, although the consensus exceptional words have a secondary structure bias for \u03b1-helix. Globally, words in protein sequences appear to be under very few constraints; for the most part, they appear to be random.<\/jats:p><jats:p>Contact: \u00a0wrp@virginia.edu<\/jats:p><jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btp660","type":"journal-article","created":{"date-parts":[[2009,12,1]],"date-time":"2009-12-01T02:20:44Z","timestamp":1259634044000},"page":"310-318","source":"Crossref","is-referenced-by-count":14,"title":["Globally, unrelated protein sequences appear random"],"prefix":"10.1093","volume":"26","author":[{"given":"Daniel T.","family":"Lavelle","sequence":"first","affiliation":[{"name":"Department of Biochemistry and Molecular Genetics, University of Virginia, Jordan Hall Box 800733, Charlottesville, VA 22908, USA"}]},{"given":"William R.","family":"Pearson","sequence":"additional","affiliation":[{"name":"Department of Biochemistry and Molecular Genetics, University of Virginia, Jordan Hall Box 800733, Charlottesville, VA 22908, USA"}]}],"member":"286","published-online":{"date-parts":[[2009,11,30]]},"reference":[{"key":"2023012511003120200_B1","doi-asserted-by":"crossref","first-page":"178","DOI":"10.1186\/1471-2105-7-178","article-title":"Protein secondary structure prediction for a single-sequence using hidden semi-Markov models","volume":"7","author":"Aydin","year":"2006","journal-title":"BMC Bioinformatics"},{"key":"2023012511003120200_B2","doi-asserted-by":"crossref","first-page":"26","DOI":"10.1016\/S0968-0004(98)01346-2","article-title":"Is protein folding hierarchic? i. local structure and peptide folding","volume":"24","author":"Baldwin","year":"1999","journal-title":"Trends Biochem. Sci."},{"key":"2023012511003120200_B3","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1111\/j.2517-6161.1995.tb02031.x","article-title":"Controlling the false discovery rate: a practical and powerful approach to multiple testing","volume":"57","author":"Benjamini","year":"1995","journal-title":"J. R. Stat. Soc. Ser. B (Methodol.)"},{"key":"2023012511003120200_B4","doi-asserted-by":"crossref","first-page":"6073","DOI":"10.1073\/pnas.95.11.6073","article-title":"Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships","volume":"95","author":"Brenner","year":"1998","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023012511003120200_B5","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1021\/bi00699a001","article-title":"Conformational parameters for amino acids in helical, beta-sheet, and random coil regions calculated from proteins","volume":"13","author":"Chou","year":"1974","journal-title":"Biochemistry"},{"key":"2023012511003120200_B6","doi-asserted-by":"crossref","first-page":"1603","DOI":"10.1093\/bioinformatics\/bth132","article-title":"Protein secondary structure: entropy, correlations and prediction","volume":"20","author":"Crooks","year":"2004","journal-title":"Bioinformatics"},{"issue":"Suppl. 8","key":"2023012511003120200_B7","doi-asserted-by":"crossref","first-page":"118","DOI":"10.1002\/prot.21636","article-title":"Structure prediction for casp7 targets using extensive all-atom refinement with rosetta@home","volume":"69","author":"Das","year":"2007","journal-title":"Proteins"},{"key":"2023012511003120200_B8","doi-asserted-by":"crossref","first-page":"755","DOI":"10.1093\/bioinformatics\/14.9.755","article-title":"Profile hidden Markov models","volume":"14","author":"Eddy","year":"1998","journal-title":"Bioinformatics"},{"key":"2023012511003120200_B9","doi-asserted-by":"crossref","first-page":"149","DOI":"10.1016\/0097-8485(93)85006-X","article-title":"Statistics of local complexity in amino acid sequences and sequence databases","volume":"17","author":"Federhen","year":"1993","journal-title":"Comput. Chem."},{"key":"2023012511003120200_B10","doi-asserted-by":"crossref","first-page":"10869","DOI":"10.1073\/pnas.92.24.10869","article-title":"Optimization of rates of protein folding: the nucleation-condensation mechanism and its implications","volume":"92","author":"Fersht","year":"1995","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2023012511003120200_B11","doi-asserted-by":"crossref","first-page":"10428","DOI":"10.1021\/bi00107a010","article-title":"Folding of chymotrypsin inhibitor 2. 1. evidence for a two-state transition","volume":"30","author":"Jackson","year":"1991","journal-title":"Biochemistry"},{"issue":"Suppl. 8","key":"2023012511003120200_B12","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1002\/prot.21771","article-title":"Assessment of CASP7 structure predictions for template free targets","volume":"69","author":"Jauch","year":"2007","journal-title":"Proteins"},{"key":"2023012511003120200_B13","doi-asserted-by":"crossref","first-page":"195","DOI":"10.1006\/jmbi.1999.3091","article-title":"Protein secondary structure prediction based on position-specific scoring matrices","volume":"292","author":"Jones","year":"1999","journal-title":"J. Mol. Biol."},{"key":"2023012511003120200_B14","doi-asserted-by":"crossref","first-page":"650","DOI":"10.1002\/pro.5560030413","article-title":"Protein folding dynamics: the diffusion-collision model and experimental data","volume":"3","author":"Karplus","year":"1994","journal-title":"Protein Sci."},{"key":"2023012511003120200_B15","doi-asserted-by":"crossref","first-page":"863","DOI":"10.1038\/nature01428","article-title":"The complete folding pathway of a protein from nanoseconds to microseconds","volume":"421","author":"Mayor","year":"2003","journal-title":"Nature"},{"key":"2023012511003120200_B16","doi-asserted-by":"crossref","first-page":"1719","DOI":"10.1093\/bioinformatics\/bti203","article-title":"Porter: a new, accurate server for protein secondary structure prediction","volume":"21","author":"McLysaght","year":"2005","journal-title":"Bioinformatics"},{"key":"2023012511003120200_B17","doi-asserted-by":"crossref","first-page":"123","DOI":"10.1006\/jmbi.2001.4602","article-title":"Evolutionary conservation of the folding nucleus","volume":"308","author":"Mirny","year":"2001","journal-title":"J. Mol. Biol."},{"key":"2023012511003120200_B18","doi-asserted-by":"crossref","first-page":"285","DOI":"10.1016\/j.sbi.2005.05.011","article-title":"A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction","volume":"15","author":"Moult","year":"2005","journal-title":"Curr. Opin. Struct. Biol."},{"key":"2023012511003120200_B19","doi-asserted-by":"crossref","first-page":"149","DOI":"10.1093\/protein\/13.3.149","article-title":"Simplified amino acid alphabets for protein fold recognition and implications for folding","volume":"13","author":"Murphy","year":"2000","journal-title":"Protein Eng."},{"key":"2023012511003120200_B20","doi-asserted-by":"crossref","DOI":"10.2202\/1544-6115.1219","article-title":"Numerical solutions for patterns statistics on markov chains","volume":"5","author":"Nuel","year":"2006","journal-title":"Stat. Appl. Genet. Mol. Biol."},{"key":"2023012511003120200_B21","doi-asserted-by":"crossref","first-page":"1093","DOI":"10.1016\/S0969-2126(97)00260-8","article-title":"CATH\u2013a hierarchic classification of protein domain structures","volume":"5","author":"Orengo","year":"1997","journal-title":"Structure"},{"key":"2023012511003120200_B22","doi-asserted-by":"crossref","first-page":"254","DOI":"10.1016\/j.sbi.2005.05.005","article-title":"The limits of protein sequence comparison?","volume":"15","author":"Pearson","year":"2005","journal-title":"Curr. Opin. Struct. Biol."},{"key":"2023012511003120200_B23","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1002\/prot.10559","article-title":"Proteomic signatures: amino acid and oligopeptide compositions differentiate among phyla","volume":"54","author":"Pe'er","year":"2004","journal-title":"Proteins"},{"key":"2023012511003120200_B24","doi-asserted-by":"crossref","first-page":"228","DOI":"10.1002\/prot.10082","article-title":"Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles","volume":"47","author":"Pollastri","year":"2002","journal-title":"Proteins"},{"key":"2023012511003120200_B25","doi-asserted-by":"crossref","first-page":"655","DOI":"10.1006\/jmbi.1997.1620","article-title":"Protein folding and protein evolution: common folding nucleus in different subfamilies of c-type cytochromes?","volume":"278","author":"Ptitsyn","year":"1998","journal-title":"J. Mol. Biol."},{"key":"2023012511003120200_B26","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1080\/07391102.1986.10507651","article-title":"Protein structure and neutral theory of evolution","volume":"4","author":"Ptitsyn","year":"1986","journal-title":"J. Biomol. Struct. Dyn."},{"key":"2023012511003120200_B27","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1089\/10665270050081360","article-title":"Probabilistic and statistical properties of words: an overview","volume":"7","author":"Reinert","year":"2000","journal-title":"J. Comput. Biol."},{"key":"2023012511003120200_B28","doi-asserted-by":"crossref","first-page":"66","DOI":"10.1016\/S0076-6879(04)83004-0","article-title":"Protein structure prediction using Rosetta","volume":"383","author":"Rohl","year":"2004","journal-title":"Methods Enzymol."},{"key":"2023012511003120200_B29","doi-asserted-by":"crossref","first-page":"204","DOI":"10.1006\/jsbi.2001.4336","article-title":"Review: protein secondary structure prediction continues to rise","volume":"134","author":"Rost","year":"2001","journal-title":"J. Struct. Biol."},{"key":"2023012511003120200_B30","doi-asserted-by":"crossref","first-page":"7558","DOI":"10.1073\/pnas.90.16.7558","article-title":"Improved prediction of protein secondary structure by use of sequence profiles and neural networks","volume":"90","author":"Rost","year":"1993","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2023012511003120200_B31","doi-asserted-by":"crossref","first-page":"405","DOI":"10.1002\/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L","article-title":"Pfam: a comprehensive database of protein domain families based on seed alignments","volume":"28","author":"Sonnhammer","year":"1997","journal-title":"Proteins"},{"key":"2023012511003120200_B32","doi-asserted-by":"crossref","first-page":"9440","DOI":"10.1073\/pnas.1530509100","article-title":"Statistical significance for genomewide studies","volume":"100","author":"Storey","year":"2003","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023012511003120200_B33","doi-asserted-by":"crossref","first-page":"379","DOI":"10.1006\/jtbi.2000.2138","article-title":"Information content of protein sequences","volume":"206","author":"Weiss","year":"2000","journal-title":"J. Theor Biol."},{"key":"2023012511003120200_B34","first-page":"674","article-title":"Motif identification neural design for rapid and sensitive protein family search","author":"Wu","year":"1996","journal-title":"Pac. Symp. Biocomput."}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/26\/3\/310\/48860542\/bioinformatics_26_3_310.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/26\/3\/310\/48860542\/bioinformatics_26_3_310.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,13]],"date-time":"2025-02-13T14:38:38Z","timestamp":1739457518000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/26\/3\/310\/213945"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,11,30]]},"references-count":34,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2010,2,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btp660","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"type":"electronic","value":"1367-4811"},{"type":"print","value":"1367-4803"}],"subject":[],"published-other":{"date-parts":[[2010,2,1]]},"published":{"date-parts":[[2009,11,30]]}}}