{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,25]],"date-time":"2026-03-25T05:19:29Z","timestamp":1774415969582,"version":"3.50.1"},"reference-count":49,"publisher":"Oxford University Press (OUP)","issue":"23","license":[{"start":{"date-parts":[[2021,6,19]],"date-time":"2021-06-19T00:00:00Z","timestamp":1624060800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["2018YFC0910403"],"award-info":[{"award-number":["2018YFC0910403"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62072353"],"award-info":[{"award-number":["62072353"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61672406"],"award-info":[{"award-number":["61672406"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61532014"],"award-info":[{"award-number":["61532014"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,12,7]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>The heterologous expression of recombinant protein requires host cells, 
such as Escherichia coli, and the solubility of protein greatly affects the protein yield. A novel and highly accurate solubility predictor that concurrently improves the production yield and minimizes production cost, and that forecasts protein solubility in an E. coli expression system before the actual experimental work is highly sought.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>In this article, EPSOL, a novel deep learning architecture for the prediction of protein solubility in an E. coli expression system, which automatically obtains comprehensive protein feature representations using multidimensional embedding, is presented. EPSOL outperformed all existing sequence-based solubility predictors and achieved 0.79 in accuracy and 0.58 in Matthews correlation coefficient. The higher performance of EPSOL permits large-scale screening for sequence variants with enhanced manufacturability and predicts the solubility of new recombinant proteins in an E. coli expression system with greater reliability.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>EPSOL\u2019s best model and results can be downloaded from GitHub (https:\/\/github.com\/LiangYu-Xidian\/EPSOL).<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Supplementary information<\/jats:title>\n                  <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btab463","type":"journal-article","created":{"date-parts":[[2021,6,17]],"date-time":"2021-06-17T19:11:56Z","timestamp":1623957116000},"page":"4314-4320","source":"Crossref","is-referenced-by-count":43,"title":["EPSOL: sequence-based protein solubility prediction using multidimensional 
embedding"],"prefix":"10.1093","volume":"37","author":[{"given":"Xiang","family":"Wu","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology, Xidian University , Xi\u2019an, Shaanxi 710071, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8351-3332","authenticated-orcid":false,"given":"Liang","family":"Yu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Xidian University , Xi\u2019an, Shaanxi 710071, China"}]}],"member":"286","published-online":{"date-parts":[[2021,6,19]]},"reference":[{"key":"2023061310462682300_btab463-B1","doi-asserted-by":"crossref","first-page":"237","DOI":"10.1016\/j.jmb.2011.12.005","article-title":"Sequence-based prediction of protein solubility","volume":"421","author":"Agostini","year":"2012","journal-title":"J. Mol. Biol"},{"key":"2023061310462682300_btab463-B2","first-page":"1137","article-title":"A neural probabilistic language model","volume":"3","author":"Bengio","year":"2003","journal-title":"J. Mach. Learn. Res"},{"key":"2023061310462682300_btab463-B3","doi-asserted-by":"crossref","first-page":"2884","DOI":"10.1093\/nar\/29.13.2884","article-title":"SPINE: an integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics","volume":"29","author":"Bertone","year":"2001","journal-title":"Nucleic Acids Res"},{"key":"2023061310462682300_btab463-B4","doi-asserted-by":"crossref","first-page":"655","DOI":"10.2174\/1574893613666180726163429","article-title":"Predicting enhancers from multiple cell lines and tissues across different developmental stages based on SVM method","volume":"13","author":"Bu","year":"2018","journal-title":"Curr. 
Bioinf"},{"key":"2023061310462682300_btab463-B5","first-page":"535","author":"Bucilu\u01ce","year":"2006"},{"key":"2023061310462682300_btab463-B6","doi-asserted-by":"crossref","first-page":"953","DOI":"10.1093\/bib\/bbt057","article-title":"Bioinformatics approaches for improved recombinant protein production in Escherichia coli: protein solubility prediction","volume":"15","author":"Chang","year":"2014","journal-title":"Brief. Bioinf"},{"key":"2023061310462682300_btab463-B7","doi-asserted-by":"crossref","first-page":"e1900007","DOI":"10.1002\/pmic.201900007","article-title":"SecProMTB: a SVM-based classifier for secretory proteins of Mycobacterium tuberculosis with imbalanced data set","volume":"19","author":"Chao","year":"2019","journal-title":"Proteomics"},{"key":"2023061310462682300_btab463-B8","doi-asserted-by":"crossref","first-page":"W72","DOI":"10.1093\/nar\/gki396","article-title":"SCRATCH: a protein structure and structural feature prediction server","volume":"33","author":"Cheng","year":"2005","journal-title":"Nucleic Acids Res"},{"key":"2023061310462682300_btab463-B9","doi-asserted-by":"crossref","first-page":"903","DOI":"10.1038\/82823","article-title":"Structural proteomics of an archaeon","volume":"7","author":"Christendat","year":"2000","journal-title":"Nat. Struct. Biol"},{"key":"2023061310462682300_btab463-B10","doi-asserted-by":"crossref","first-page":"273","DOI":"10.1007\/BF00994018","article-title":"Support-vector networks","volume":"20","author":"Cortes","year":"1995","journal-title":"Mach. Learn"},{"key":"2023061310462682300_btab463-B11","first-page":"bbaa356","article-title":"DeepYY1: a deep learning approach to identify YY1-mediated chromatin loops","volume":"21","author":"Dao","year":"2020","journal-title":"Brief. 
Bioinf"},{"key":"2023061310462682300_btab463-B12","doi-asserted-by":"crossref","first-page":"382","DOI":"10.1002\/(SICI)1097-0290(19991120)65:4<382::AID-BIT2>3.0.CO;2-I","article-title":"New fusion protein systems designed to give soluble expression in Escherichia coli","volume":"65","author":"Davis","year":"1999","journal-title":"Biotechnol. Bioeng"},{"key":"2023061310462682300_btab463-B13","doi-asserted-by":"crossref","first-page":"1189","DOI":"10.1214\/aos\/1013203451","article-title":"Greedy function approximation: a gradient boosting machine","volume":"29","author":"Friedman","year":"2001","journal-title":"Ann. Stat"},{"key":"2023061310462682300_btab463-B14","doi-asserted-by":"crossref","first-page":"3150","DOI":"10.1093\/bioinformatics\/bts565","article-title":"CD-HIT: accelerated for clustering the next-generation sequencing data","volume":"28","author":"Fu","year":"2012","journal-title":"Bioinformatics"},{"key":"2023061310462682300_btab463-B15","volume-title":"Digital Design and Computer Architecture","author":"Harris","year":"2010"},{"key":"2023061310462682300_btab463-B16","author":"Hinton","year":"2015"},{"key":"2023061310462682300_btab463-B17","volume-title":"\u00a0","author":"Huang","year":"2012"},{"key":"2023061310462682300_btab463-B18","doi-asserted-by":"crossref","first-page":"582","DOI":"10.1110\/ps.041009005","article-title":"Understanding the relationship between the primary structure of proteins and its propensity to be soluble on overexpression in Escherichia coli","volume":"14","author":"Idicula-Thomas","year":"2005","journal-title":"Protein Sci"},{"key":"2023061310462682300_btab463-B19","doi-asserted-by":"crossref","first-page":"2605","DOI":"10.1093\/bioinformatics\/bty166","article-title":"DeepSol: a deep learning framework for sequence-based protein solubility 
prediction","volume":"34","author":"Khurana","year":"2018","journal-title":"Bioinformatics"},{"key":"2023061310462682300_btab463-B20","author":"Kim","year":"2014"},{"key":"2023061310462682300_btab463-B21","author":"Kingma","year":"2014"},{"key":"2023061310462682300_btab463-B22","article-title":"Convolutional networks for images, speech, and time series","author":"LeCun","year":"1995"},{"key":"2023061310462682300_btab463-B23","doi-asserted-by":"crossref","first-page":"1658","DOI":"10.1093\/bioinformatics\/btl158","article-title":"Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences","volume":"22","author":"Li","year":"2006","journal-title":"Bioinformatics"},{"key":"2023061310462682300_btab463-B24","doi-asserted-by":"crossref","first-page":"1043","DOI":"10.1016\/j.omtn.2020.07.035","article-title":"Predicting preference of transcription factors for methylated DNA using sequence information","volume":"22","author":"Liu","year":"2020","journal-title":"Mol. Ther. Nucleic Acids"},{"key":"2023061310462682300_btab463-B25","doi-asserted-by":"crossref","first-page":"788","DOI":"10.2174\/1574893615666200127124145","article-title":"Densely dilated spatial pooling convolutional network using benign loss functions for imbalanced volumetric prostate segmentation","volume":"15","author":"Liu","year":"2020","journal-title":"Curr. Bioinf"},{"key":"2023061310462682300_btab463-B26","first-page":"101","article-title":"Deep-Kcr: accurate detection of lysine crotonylation sites using deep learning method","volume":"11","author":"Lv","year":"2020","journal-title":"Brief. 
Bioinf"},{"key":"2023061310462682300_btab463-B27","doi-asserted-by":"crossref","first-page":"14851","DOI":"10.1109\/ACCESS.2020.2966576","article-title":"Escherichia coli DNA N-4-methycytosine site prediction accuracy improved by light gradient boosting machine feature selection technology","volume":"8","author":"Lv","year":"2020","journal-title":"IEEE Access"},{"key":"2023061310462682300_btab463-B28","doi-asserted-by":"crossref","first-page":"2592","DOI":"10.1093\/bioinformatics\/btu352","article-title":"SSpro\/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity","volume":"30","author":"Magnan","year":"2014","journal-title":"Bioinformatics"},{"key":"2023061310462682300_btab463-B29","doi-asserted-by":"crossref","first-page":"2200","DOI":"10.1093\/bioinformatics\/btp386","article-title":"SOLpro: accurate sequence-based prediction of protein solubility","volume":"25","author":"Magnan","year":"2009","journal-title":"Bioinformatics"},{"key":"2023061310462682300_btab463-B30","doi-asserted-by":"crossref","first-page":"50","DOI":"10.1214\/aoms\/1177730491","article-title":"On a test of whether one of two random variables is stochastically larger than the other","volume":"18","author":"Mann","year":"1947","journal-title":"Ann. Math. Stat"},{"key":"2023061310462682300_btab463-B31","doi-asserted-by":"crossref","first-page":"694","DOI":"10.1109\/TASLP.2016.2520371","article-title":"Deep sentence embedding using long short-term memory networks: analysis and application to information retrieval","volume":"24","author":"Palangi","year":"2016","journal-title":"IEEE-ACM Trans. 
Audio Speech Lang"},{"key":"2023061310462682300_btab463-B32","doi-asserted-by":"crossref","first-page":"1092","DOI":"10.1093\/bioinformatics\/btx662","article-title":"PaRSnIP: sequence-based protein solubility prediction using gradient boosting machine","volume":"34","author":"Rawi","year":"2018","journal-title":"Bioinformatics"},{"key":"2023061310462682300_btab463-B33","doi-asserted-by":"crossref","first-page":"181423","DOI":"10.1109\/ACCESS.2019.2920241","article-title":"Exploiting discriminative regions of brain slices based on 2D CNNs for Alzheimer\u2019s disease classification","volume":"7","author":"Ren","year":"2019","journal-title":"IEEE Access"},{"key":"2023061310462682300_btab463-B34","doi-asserted-by":"crossref","first-page":"2192","DOI":"10.1111\/j.1742-4658.2012.08603.x","article-title":"PROSO II \u2013 a new method for protein solubility prediction","volume":"279","author":"Smialowski","year":"2012","journal-title":"FEBS J"},{"key":"2023061310462682300_btab463-B35","doi-asserted-by":"crossref","first-page":"2536","DOI":"10.1093\/bioinformatics\/btl623","article-title":"Protein solubility: sequence based prediction and experimental verification","volume":"23","author":"Smialowski","year":"2007","journal-title":"Bioinformatics"},{"key":"2023061310462682300_btab463-B36","first-page":"1929","article-title":"Dropout: a simple way to prevent neural networks from overfitting","volume":"15","author":"Srivastava","year":"2014","journal-title":"J. Mach. Learn. Res"},{"key":"2023061310462682300_btab463-B37","doi-asserted-by":"crossref","DOI":"10.1142\/5089","article-title":"Least squares support vector machines","author":"Suykens","year":"2002"},{"key":"2023061310462682300_btab463-B38","doi-asserted-by":"crossref","first-page":"957","DOI":"10.7150\/ijbs.24174","article-title":"HBPred: a tool to identify growth hormone-binding proteins","volume":"14","author":"Tang","year":"2018","journal-title":"Int. J. Biol. 
Sci"},{"key":"2023061310462682300_btab463-B39","first-page":"493","article-title":"Predicting thermophilic proteins by machine learning","volume":"15","author":"Wang","year":"2020","journal-title":"Curr. Bioinf"},{"key":"2023061310462682300_btab463-B40","first-page":"443","article-title":"Predicting the solubility of recombinant proteins in Escherichia coli","volume":"9","author":"Wilkinson","year":"1991","journal-title":"Bio\/Technology (Nature Publishing Company)"},{"key":"2023061310462682300_btab463-B41","doi-asserted-by":"crossref","first-page":"37","DOI":"10.1016\/0169-7439(87)80084-9","article-title":"Principal component analysis","volume":"2","author":"Wold","year":"1987","journal-title":"Chemom. Intell. Lab. Syst"},{"key":"2023061310462682300_btab463-B42","author":"Xu","year":"2015"},{"key":"2023061310462682300_btab463-B43","first-page":"2335","author":"Zeng","year":"2014"},{"key":"2023061310462682300_btab463-B44","first-page":"1","article-title":"iBLP: an XGBoost-based predictor for identifying bioluminescent proteins","volume":"2021","author":"Zhang","year":"2021","journal-title":"Comput. Math. Methods Med"},{"key":"2023061310462682300_btab463-B45","doi-asserted-by":"crossref","first-page":"171","DOI":"10.1093\/bioinformatics\/btaa702","article-title":"iCarPS: a computational tool for identifying protein carbonylation sites by novel encoded features","volume":"37","author":"Zhang","year":"2021","journal-title":"Bioinformatics"},{"key":"2023061310462682300_btab463-B46","doi-asserted-by":"crossref","first-page":"526","DOI":"10.1093\/bib\/bbz177","article-title":"Design powerful predictor for mRNA subcellular location prediction in Homo sapiens","volume":"22","author":"Zhang","year":"2021","journal-title":"Brief. 
Bioinf"},{"key":"2023061310462682300_btab463-B47","doi-asserted-by":"crossref","first-page":"368","DOI":"10.2174\/1574893614666191105155713","article-title":"ConvsPPIS: identifying protein\u2013protein interaction sites by an ensemble convolutional neural network with feature graph","volume":"15","author":"Zhu","year":"2020","journal-title":"Curr. Bioinf"},{"key":"2023061310462682300_btab463-B48","doi-asserted-by":"crossref","first-page":"205","DOI":"10.1261\/rna.069112.118","article-title":"Gene2vec: gene subsequence embedding for prediction of mammalian n6-methyladenosine sites from mRNA","volume":"25","author":"Zou","year":"2019","journal-title":"RNA"},{"key":"2023061310462682300_btab463-B49","first-page":"1393","author":"Zou","year":"2013"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btab463\/38973704\/btab463.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/23\/4314\/50579717\/btab463.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/23\/4314\/50579717\/btab463.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,13]],"date-time":"2023-06-13T10:47:41Z","timestamp":1686653261000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/23\/4314\/6305822"}},"subtitle":[],"editor":[{"given":"Jinbo","family":"Xu","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2021,6,19]]},"references-count":49,"journal-issue":{"issue":"23","published-print":{"date-parts":[[2021,12,7]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btab463","
relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,12,1]]},"published":{"date-parts":[[2021,6,19]]}}}