{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,25]],"date-time":"2026-03-25T05:19:29Z","timestamp":1774415969211,"version":"3.50.1"},"reference-count":33,"publisher":"Springer Science and Business Media LLC","issue":"S17","license":[{"start":{"date-parts":[[2012,12,1]],"date-time":"2012-12-01T00:00:00Z","timestamp":1354320000000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/2.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2012,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>Existing methods for predicting protein solubility on overexpression in <jats:italic>Escherichia coli<\/jats:italic> advance performance by using ensemble classifiers such as two-stage support vector machine (SVM) based classifiers and a number of feature types such as physicochemical properties, amino acid and dipeptide composition, accompanied with feature selection. It is desirable to develop a simple and easily interpretable method for predicting protein solubility, compared to existing complex SVM-based methods.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>This study proposes a novel scoring card method (SCM) by using dipeptide composition only to estimate solubility scores of sequences for predicting protein solubility. SCM calculates the propensities of 400 individual dipeptides to be soluble using statistic discrimination between soluble and insoluble proteins of a training data set. Consequently, the propensity scores of all dipeptides are further optimized using an intelligent genetic algorithm. The solubility score of a sequence is determined by the weighted sum of all propensity scores and dipeptide composition. To evaluate SCM by performance comparisons, four data sets with different sizes and variation degrees of experimental conditions were used. The results show that the simple method SCM with interpretable propensities of dipeptides has promising performance, compared with existing SVM-based ensemble methods with a number of feature types. Furthermore, the propensities of dipeptides and solubility scores of sequences can provide insights to protein solubility. For example, the analysis of dipeptide scores shows high propensity of \u03b1-helix structure and thermophilic proteins to be soluble.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusions<\/jats:title>\n            <jats:p>The propensities of individual dipeptides to be soluble are varied for proteins under altered experimental conditions. For accurately predicting protein solubility using SCM, it is better to customize the score card of dipeptide propensities by using a training data set under the same specified experimental conditions. The proposed method SCM with solubility scores and dipeptide propensities can be easily applied to the protein function prediction problems that dipeptide composition features play an important role.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Availability<\/jats:title>\n            <jats:p>The used datasets, source codes of SCM, and supplementary files are available at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"http:\/\/iclab.life.nctu.edu.tw\/SCM\/\" ext-link-type=\"uri\">http:\/\/iclab.life.nctu.edu.tw\/SCM\/<\/jats:ext-link>.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/1471-2105-13-s17-s3","type":"journal-article","created":{"date-parts":[[2019,12,11]],"date-time":"2019-12-11T01:59:20Z","timestamp":1576029560000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":52,"title":["Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition"],"prefix":"10.1186","volume":"13","author":[{"given":"Hui-Ling","family":"Huang","sequence":"first","affiliation":[]},{"given":"Phasit","family":"Charoenkwan","sequence":"additional","affiliation":[]},{"given":"Te-Fen","family":"Kao","sequence":"additional","affiliation":[]},{"given":"Hua-Chin","family":"Lee","sequence":"additional","affiliation":[]},{"given":"Fang-Lin","family":"Chang","sequence":"additional","affiliation":[]},{"given":"Wen-Lin","family":"Huang","sequence":"additional","affiliation":[]},{"given":"Shinn-Jang","family":"Ho","sequence":"additional","affiliation":[]},{"given":"Li-Sun","family":"Shu","sequence":"additional","affiliation":[]},{"given":"Wen-Liang","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Shinn-Ying","family":"Ho","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2012,12,13]]},"reference":[{"issue":"9","key":"5467_CR1","doi-asserted-by":"publisher","first-page":"927","DOI":"10.1038\/nbt732","volume":"20","author":"JD Pedelacq","year":"2002","unstructured":"Pedelacq JD, Piltch E, Liong EC, Berendzen J, Kim CY, Rho BS, Park MS, Terwilliger TC, Waldo GS: Engineering soluble proteins for structural genomics. Nat Biotechnol. 2002, 20 (9): 927-932. 10.1038\/nbt732.","journal-title":"Nat Biotechnol"},{"issue":"2","key":"5467_CR2","doi-asserted-by":"publisher","first-page":"449","DOI":"10.1016\/j.jmb.2006.10.026","volume":"366","author":"SR Trevino","year":"2007","unstructured":"Trevino SR, Scholtz JM, Pace CN: Amino acid contribution to protein solubility: Asp, Glu, and Ser contribute more favorably than the other hydrophilic amino acids in RNase Sa. J Mol Biol. 2007, 366 (2): 449-460. 10.1016\/j.jmb.2006.10.026.","journal-title":"J Mol Biol"},{"issue":"3","key":"5467_CR3","doi-asserted-by":"publisher","first-page":"278","DOI":"10.1093\/bioinformatics\/bti810","volume":"22","author":"S Idicula-Thomas","year":"2006","unstructured":"Idicula-Thomas S, Kulkarni AJ, Kulkarni BD, Jayaraman VK, Balaji PV: A support vector machine-based method for predicting the propensity of a protein to be soluble or to form inclusion body on overexpression in Escherichia coli. Bioinformatics. 2006, 22 (3): 278-284. 10.1093\/bioinformatics\/bti810.","journal-title":"Bioinformatics"},{"issue":"7","key":"5467_CR4","doi-asserted-by":"publisher","first-page":"933","DOI":"10.1093\/protein\/7.7.933","volume":"7","author":"GE Dale","year":"1994","unstructured":"Dale GE, Broger C, Langen H, D'Arcy A, Stuber D: Improving protein solubility through rationally designed amino acid replacements: solubilization of the trimethoprim-resistant type S1 dihydrofolate reductase. Protein Eng. 1994, 7 (7): 933-939. 10.1093\/protein\/7.7.933.","journal-title":"Protein Eng"},{"issue":"13","key":"5467_CR5","doi-asserted-by":"publisher","first-page":"6057","DOI":"10.1073\/pnas.92.13.6057","volume":"92","author":"TM Jenkins","year":"1995","unstructured":"Jenkins TM, Hickman AB, Dyda F, Ghirlando R, Davies DR, Craigie R: Catalytic domain of human immunodeficiency virus type 1 integrase: identification of a soluble mutant by systematic replacement of hydrophobic residues. Proc Natl Acad Sci USA. 1995, 92 (13): 6057-6061. 10.1073\/pnas.92.13.6057.","journal-title":"Proc Natl Acad Sci USA"},{"issue":"1","key":"5467_CR6","doi-asserted-by":"publisher","first-page":"38","DOI":"10.1111\/j.1432-1033.1995.tb20531.x","volume":"230","author":"M Murby","year":"1995","unstructured":"Murby M, Samuelsson E, Nguyen TN, Mignard L, Power U, Binz H, Uhlen M, Stahl S: Hydrophobicity engineering to increase solubility and stability of a recombinant protein from respiratory syncytial virus. Eur J Biochem. 1995, 230 (1): 38-44. 10.1111\/j.1432-1033.1995.tb20531.x.","journal-title":"Eur J Biochem"},{"issue":"5","key":"5467_CR7","doi-asserted-by":"publisher","first-page":"443","DOI":"10.1038\/nbt0591-443","volume":"9","author":"DL Wilkinson","year":"1991","unstructured":"Wilkinson DL, Harrison RG: Predicting the solubility of recombinant proteins in Escherichia coli. Biotechnology (N Y). 1991, 9 (5): 443-448. 10.1038\/nbt0591-443.","journal-title":"Biotechnology (N Y)"},{"issue":"4","key":"5467_CR8","doi-asserted-by":"publisher","first-page":"382","DOI":"10.1002\/(SICI)1097-0290(19991120)65:4<382::AID-BIT2>3.0.CO;2-I","volume":"65","author":"GD Davis","year":"1999","unstructured":"Davis GD, Elisee C, Newham DM, Harrison RG: New fusion protein systems designed to give soluble expression in Escherichia coli. Biotechnol Bioeng. 1999, 65 (4): 382-388. 10.1002\/(SICI)1097-0290(19991120)65:4<382::AID-BIT2>3.0.CO;2-I.","journal-title":"Biotechnol Bioeng"},{"issue":"3","key":"5467_CR9","doi-asserted-by":"publisher","first-page":"582","DOI":"10.1110\/ps.041009005","volume":"14","author":"S Idicula-Thomas","year":"2005","unstructured":"Idicula-Thomas S, Balaji PV: Understanding the relationship between the primary structure of proteins and its propensity to be soluble on overexpression in Escherichia coli. Protein Sci. 2005, 14 (3): 582-592. 10.1110\/ps.041009005.","journal-title":"Protein Sci"},{"issue":"19","key":"5467_CR10","doi-asserted-by":"publisher","first-page":"2536","DOI":"10.1093\/bioinformatics\/btl623","volume":"23","author":"P Smialowski","year":"2007","unstructured":"Smialowski P, Martin-Galiano AJ, Mikolajka A, Girschick T, Holak TA, Frishman D: Protein solubility: sequence based prediction and experimental verification. Bioinformatics. 2007, 23 (19): 2536-2542. 10.1093\/bioinformatics\/btl623.","journal-title":"Bioinformatics"},{"issue":"17","key":"5467_CR11","doi-asserted-by":"publisher","first-page":"2200","DOI":"10.1093\/bioinformatics\/btp386","volume":"25","author":"CN Magnan","year":"2009","unstructured":"Magnan CN, Randall A, Baldi P: SOLpro: accurate sequence-based prediction of protein solubility. Bioinformatics. 2009, 25 (17): 2200-2207. 10.1093\/bioinformatics\/btp386.","journal-title":"Bioinformatics"},{"issue":"2","key":"5467_CR12","doi-asserted-by":"publisher","first-page":"374","DOI":"10.1002\/bit.22537","volume":"105","author":"AA Diaz","year":"2010","unstructured":"Diaz AA, Tomba E, Lennarson R, Richard R, Bagajewicz MJ, Harrison RG: Prediction of protein solubility in Escherichia coli using logistic regression. Biotechnol Bioeng. 2010, 105 (2): 374-383. 10.1002\/bit.22537.","journal-title":"Biotechnol Bioeng"},{"issue":"Suppl 1","key":"5467_CR13","doi-asserted-by":"publisher","first-page":"S21","DOI":"10.1186\/1471-2105-11-S1-S21","volume":"11","author":"WC Chan","year":"2010","unstructured":"Chan WC, Liang PH, Shih YP, Yang UC, Lin WC, Hsu CN: Learning to predict expression efficacy of vectors in recombinant protein production. BMC Bioinformatics. 2010, 11 (Suppl 1): S21-10.1186\/1471-2105-11-S1-S21.","journal-title":"BMC Bioinformatics"},{"issue":"12","key":"5467_CR14","doi-asserted-by":"publisher","first-page":"2192","DOI":"10.1111\/j.1742-4658.2012.08603.x","volume":"279","author":"P Smialowski","year":"2012","unstructured":"Smialowski P, Doose G, Torkler P, Kaufmann S, Frishman D: PROSO II - a new method for protein solubility prediction. FEBS J. 2012, 279 (12): 2192-2200. 10.1111\/j.1742-4658.2012.08603.x.","journal-title":"FEBS J"},{"issue":"6","key":"5467_CR15","doi-asserted-by":"publisher","first-page":"522","DOI":"10.1109\/TEVC.2004.835176","volume":"8","author":"SY Ho","year":"2004","unstructured":"Ho SY, Shu LS, Chen JH: Intelligent evolutionary algorithms for large parameter optimization problems. IEEE Transactions on Evolutionary Computation. 2004, 8 (6): 522-541. 10.1109\/TEVC.2004.835176.","journal-title":"IEEE Transactions on Evolutionary Computation"},{"issue":"Database","key":"5467_CR16","doi-asserted-by":"publisher","first-page":"D202","DOI":"10.1093\/nar\/gkm998","volume":"36","author":"S Kawashima","year":"2008","unstructured":"Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M: AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 2008, 36 (Database): D202-205.","journal-title":"Nucleic Acids Res"},{"key":"5467_CR17","volume-title":"The 6th International Conference on Bioinformatics and Biomedical Engineering (iCBBE 2012)","author":"H-C Lee","year":"2012","unstructured":"Lee H-C, Liou Y-F, Charoenkwan P, Ho S-J, Shu L-S, Ho S-Y, Huang H-L: Prediction of carbohydrate-binding proteins using a scoring card method. The 6th International Conference on Bioinformatics and Biomedical Engineering (iCBBE 2012). 2012"},{"key":"5467_CR18","doi-asserted-by":"crossref","unstructured":"Kurgan L, Razib AA, Aghakhani S, Dick S, Mizianty M, Jahandideh S: CRYSTALP2: sequence-based protein crystallization propensity prediction. BMC Structural Biology. 2009, 9 (50):","DOI":"10.1186\/1472-6807-9-50"},{"issue":"22","key":"5467_CR19","doi-asserted-by":"publisher","first-page":"23262","DOI":"10.1074\/jbc.M401932200","volume":"279","author":"M Bhasin","year":"2004","unstructured":"Bhasin M, Raghava GP: Classification of nuclear receptors based on amino acid composition and dipeptide composition. J Biol Chem. 2004, 279 (22): 23262-23266. 10.1074\/jbc.M401932200.","journal-title":"J Biol Chem"},{"issue":"1","key":"5467_CR20","first-page":"24","volume":"3","author":"SB Muley","year":"2011","unstructured":"Muley SB, Bastikar V, Bothe S, Meshram A, Roy N: Virulence prediction model (virprob) using amino acid and dipeptide composition for human pathogens. Journal of Biophysics and Structural Biology. 2011, 3 (1): 24-29.","journal-title":"Journal of Biophysics and Structural Biology"},{"issue":"10","key":"5467_CR21","doi-asserted-by":"publisher","first-page":"1596","DOI":"10.1002\/jcc.20918","volume":"29","author":"K Chen","year":"2008","unstructured":"Chen K, Kurgan LA, Ruan J: Prediction of protein structural class using novel evolutionary collocation-based sequence representation. J Comput Chem. 2008, 29 (10): 1596-1604. 10.1002\/jcc.20918.","journal-title":"J Comput Chem"},{"issue":"1","key":"5467_CR22","doi-asserted-by":"publisher","first-page":"64","DOI":"10.1016\/j.jtbi.2010.10.019","volume":"269","author":"H Lin","year":"2011","unstructured":"Lin H, Ding H: Predicting ion channels and their types by the dipeptide mode of pseudo amino acid composition. J Theor Biol. 2011, 269 (1): 64-69. 10.1016\/j.jtbi.2010.10.019.","journal-title":"J Theor Biol"},{"key":"5467_CR23","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1186\/1471-2105-6-59","volume":"6","author":"GP Raghava","year":"2005","unstructured":"Raghava GP, Han JH: Correlation and prediction of gene expression level from amino acid and dipeptide composition of its protein. BMC Bioinformatics. 2005, 6: 59-10.1186\/1471-2105-6-59.","journal-title":"BMC Bioinformatics"},{"issue":"5","key":"5467_CR24","doi-asserted-by":"publisher","first-page":"680","DOI":"10.1093\/bioinformatics\/btq003","volume":"26","author":"Y Huang","year":"2010","unstructured":"Huang Y, Niu B, Gao Y, Fu L, Li W: CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010, 26 (5): 680-682. 10.1093\/bioinformatics\/btq003.","journal-title":"Bioinformatics"},{"issue":"7","key":"5467_CR25","doi-asserted-by":"publisher","first-page":"1145","DOI":"10.1016\/S0031-3203(96)00142-2","volume":"30","author":"AP Bradley","year":"1997","unstructured":"Bradley AP: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition. 1997, 30 (7): 1145-1159. 10.1016\/S0031-3203(96)00142-2.","journal-title":"Pattern Recognition"},{"issue":"1","key":"5467_CR26","doi-asserted-by":"publisher","first-page":"609","DOI":"10.1109\/TSMCB.2003.817090","volume":"34","author":"SY Ho","year":"2004","unstructured":"Ho SY, Chen JH, Huang MH: Inheritable genetic algorithm for biobjective 0\/1 combinatorial optimization problems and its applications. IEEE Trans Syst Man Cybern B Cybern. 2004, 34 (1): 609-620. 10.1109\/TSMCB.2003.817090.","journal-title":"IEEE Trans Syst Man Cybern B Cybern"},{"issue":"8","key":"5467_CR27","doi-asserted-by":"publisher","first-page":"942","DOI":"10.1093\/bioinformatics\/btm061","volume":"23","author":"CW Tung","year":"2007","unstructured":"Tung CW, Ho SY: POPI: predicting immunogenicity of MHC class I binding peptides by mining informative physicochemical properties. Bioinformatics. 2007, 23 (8): 942-949. 10.1093\/bioinformatics\/btm061.","journal-title":"Bioinformatics"},{"issue":"3","key":"5467_CR28","doi-asserted-by":"publisher","first-page":"27:21","DOI":"10.1145\/1961189.1961199","volume":"2","author":"C-CaL Chang","year":"2011","unstructured":"Chang C-CaL, Chih-Jen : LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011, 2 (3): 27:21--27:27.","journal-title":"ACM Transactions on Intelligent Systems and Technology"},{"issue":"10","key":"5467_CR29","doi-asserted-by":"publisher","first-page":"903","DOI":"10.1038\/82823","volume":"7","author":"D Christendat","year":"2000","unstructured":"Christendat D, Yee A, Dharamsi A, Kluger Y, Savchenko A, Cort JR, Booth V, Mackereth CD, Saridakis V, Ekiel I: Structural proteomics of an archaeon. Nat Struct Biol. 2000, 7 (10): 903-909. 10.1038\/82823.","journal-title":"Nat Struct Biol"},{"issue":"5","key":"5467_CR30","doi-asserted-by":"publisher","first-page":"399","DOI":"10.1016\/j.ejps.2004.04.013","volume":"22","author":"SW Larsen","year":"2004","unstructured":"Larsen SW, Ankersen M, Larsen C: Kinetics of degradation and oil solubility of ester prodrugs of a model dipeptide (Gly-Phe). Eur J Pharm Sci. 2004, 22 (5): 399-408. 10.1016\/j.ejps.2004.04.013.","journal-title":"Eur J Pharm Sci"},{"issue":"2","key":"5467_CR31","doi-asserted-by":"publisher","first-page":"441","DOI":"10.1016\/j.bbrc.2006.01.159","volume":"342","author":"S Costantini","year":"2006","unstructured":"Costantini S, Colonna G, Facchiano AM: Amino acid propensities for secondary structures are influenced by the protein structural class. Biochem Biophys Res Commun. 2006, 342 (2): 441-451. 10.1016\/j.bbrc.2006.01.159.","journal-title":"Biochem Biophys Res Commun"},{"key":"5467_CR32","volume-title":"J Mol Biol","author":"SR Trevino","year":"2009","unstructured":"Trevino SR, Scholtz JM, Pace CN: Amino acid contribution to protein solubility. J Mol Biol. 2009"},{"issue":"6","key":"5467_CR33","first-page":"1895","volume":"88","author":"A Ikai","year":"1980","unstructured":"Ikai A: Thermostability and aliphatic index of globular proteins. J Biochem. 1980, 88 (6): 1895-1898.","journal-title":"J Biochem"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-13-S17-S3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1186\/1471-2105-13-S17-S3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-13-S17-S3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T21:06:28Z","timestamp":1630530388000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-13-S17-S3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,12]]},"references-count":33,"journal-issue":{"issue":"S17","published-print":{"date-parts":[[2012,12]]}},"alternative-id":["5467"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-13-s17-s3","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2012,12]]},"assertion":[{"value":"13 December 2012","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"S3"}}