{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,15]],"date-time":"2026-05-15T02:49:55Z","timestamp":1778813395125,"version":"3.51.4"},"reference-count":46,"publisher":"Oxford University Press (OUP)","issue":"5","license":[{"start":{"date-parts":[[2019,10,11]],"date-time":"2019-10-11T00:00:00Z","timestamp":1570752000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/100000002","name":"NIH","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,3,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>X-ray crystallography has facilitated the majority of protein structures determined to date. Sequence-based predictors that can accurately estimate protein crystallization propensities would be highly beneficial to overcome the high expenditure, large attrition rate, and to reduce the trial-and-error settings required for crystallization.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>In this study, we present a novel model, BCrystal, which uses an optimized gradient boosting machine (XGBoost) on sequence, structural and physio-chemical features extracted from the proteins of interest. BCrystal also provides explanations, highlighting the most important features for the predicted crystallization propensity of an individual protein using the SHAP algorithm. On three independent test sets, BCrystal outperforms state-of-the-art sequence-based methods by more than 12.5% in accuracy, 18% in recall and 0.253 in Matthew\u2019s correlation coefficient, with an average accuracy of 93.7%, recall of 96.63% and Matthew\u2019s correlation coefficient of 0.868. For relative solvent accessibility of exposed residues, we observed higher values to associate positively with protein crystallizability and the number of disordered regions, fraction of coils and tripeptide stretches that contain multiple histidines associate negatively with crystallizability. The higher accuracy of BCrystal enables it to accurately screen for sequence variants with enhanced crystallizability.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>Our BCrystal webserver is at https:\/\/machinelearning-protein.qcri.org\/ and source code is available at https:\/\/github.com\/raghvendra5688\/BCrystal.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Supplementary information<\/jats:title>\n                  <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btz762","type":"journal-article","created":{"date-parts":[[2019,10,8]],"date-time":"2019-10-08T08:19:25Z","timestamp":1570522765000},"page":"1429-1438","source":"Crossref","is-referenced-by-count":31,"title":["BCrystal: an interpretable sequence-based protein crystallization predictor"],"prefix":"10.1093","volume":"36","author":[{"given":"Abdurrahman","family":"Elbasir","sequence":"first","affiliation":[{"name":"ICT Division, College of Science and Engineering , Hamad Bin Khalifa University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1779-3150","authenticated-orcid":false,"given":"Raghvendra","family":"Mall","sequence":"additional","affiliation":[{"name":"Data Analytics, Qatar Computing Research Institute , Hamad Bin Khalifa University, Doha 34110, Qatar"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Khalid","family":"Kunji","sequence":"additional","affiliation":[{"name":"Data Analytics, Qatar Computing Research Institute , Hamad Bin Khalifa University, Doha 34110, Qatar"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Reda","family":"Rawi","sequence":"additional","affiliation":[{"name":"Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda , MD 20892, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zeyaul","family":"Islam","sequence":"additional","affiliation":[{"name":"Diabetes Research Center, Qatar Biomedical Research Institute, Hamad Bin Khalifa University , Doha 34100, Qatar"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Gwo-Yu","family":"Chuang","sequence":"additional","affiliation":[{"name":"Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda , MD 20892, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Prasanna R","family":"Kolatkar","sequence":"additional","affiliation":[{"name":"Diabetes Research Center, Qatar Biomedical Research Institute, Hamad Bin Khalifa University , Doha 34100, Qatar"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Halima","family":"Bensmail","sequence":"additional","affiliation":[{"name":"Data Analytics, Qatar Computing Research Institute , Hamad Bin Khalifa University, Doha 34110, Qatar"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2019,10,11]]},"reference":[{"key":"2023060910275740400_btz762-B1","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach. Learn"},{"key":"2023060910275740400_btz762-B2","doi-asserted-by":"crossref","first-page":"3333.","DOI":"10.1038\/srep03333","article-title":"Soluble expression of proteins correlates with a lack of positively-charged surface","volume":"3","author":"Chan","year":"2013","journal-title":"Sci. Rep"},{"key":"2023060910275740400_btz762-B3","doi-asserted-by":"crossref","first-page":"27.","DOI":"10.1145\/1961189.1961199","article-title":"LIBSVM: a library for support vector machines","volume":"2","author":"Chang","year":"2011","journal-title":"ACM Trans. Intell. Syst. Technol"},{"key":"2023060910275740400_btz762-B4","doi-asserted-by":"crossref","first-page":"e72368.","DOI":"10.1371\/journal.pone.0072368","article-title":"SCMCRYS: predicting protein crystallization using an ensemble scoring card method with estimating propensity scores of P-collocated amino acid pairs","volume":"8","author":"Charoenkwan","year":"2013","journal-title":"PLoS One"},{"key":"2023060910275740400_btz762-B5","doi-asserted-by":"crossref","first-page":"3193","DOI":"10.1093\/nar\/gki633","article-title":"Prediction of solvent accessibility and sites of deleterious mutations from protein sequence","volume":"33","author":"Chen","year":"2005","journal-title":"Nucleic Acids Res"},{"key":"2023060910275740400_btz762-B6","doi-asserted-by":"crossref","first-page":"785","DOI":"10.1145\/2939672.2939785","volume-title":"Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","author":"Chen","year":"2016"},{"key":"2023060910275740400_btz762-B7","doi-asserted-by":"crossref","first-page":"W72","DOI":"10.1093\/nar\/gki396","article-title":"Scratch: a protein structure and structural feature prediction server","volume":"33 (Suppl_2)","author":"Cheng","year":"2005","journal-title":"Nucleic Acids Res"},{"key":"2023060910275740400_btz762-B8","doi-asserted-by":"crossref","first-page":"598","DOI":"10.1109\/SP.2016.42","volume-title":"2016 IEEE Symposium on Security and Privacy (SP)","author":"Datta","year":"2016"},{"key":"2023060910275740400_btz762-B9","doi-asserted-by":"crossref","first-page":"72","DOI":"10.1107\/S2053230X15024619","article-title":"Protein stability: a crystallographer\u2019s perspective","volume":"72","author":"Deller","year":"2016","journal-title":"Acta Crystallogr. F"},{"key":"2023060910275740400_btz762-B10","first-page":"155","article-title":"Support vector regression machines","author":"Drucker","year":"1997","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2023060910275740400_btz762-B11","doi-asserted-by":"crossref","first-page":"2216","DOI":"10.1093\/bioinformatics\/bty953","article-title":"DeepCrystal: a deep learning framework for sequence-based protein crystallization prediction","volume":"35","author":"Elbasir","year":"2019","journal-title":"Bioinformatics"},{"key":"2023060910275740400_btz762-B12","volume-title":"Fundamentals of Neural Networks: Architectures, Algorithms, and Applications","author":"Fausett","year":"1994"},{"key":"2023060910275740400_btz762-B13","doi-asserted-by":"crossref","first-page":"1189","DOI":"10.1214\/aos\/1013203451","article-title":"Greedy function approximation: a gradient boosting machine","volume":"29","author":"Friedman","year":"2001","journal-title":"Ann. Stat"},{"key":"2023060910275740400_btz762-B14","doi-asserted-by":"crossref","first-page":"3150","DOI":"10.1093\/bioinformatics\/bts565","article-title":"CD-HIT: accelerated for clustering the next-generation sequencing data","volume":"28","author":"Fu","year":"2012","journal-title":"Bioinformatics"},{"key":"2023060910275740400_btz762-B15","doi-asserted-by":"crossref","first-page":"1295","DOI":"10.1093\/bioinformatics\/btx780","article-title":"DeepSF: deep convolutional neural network for mapping protein sequences to folds","volume":"34","author":"Hou","year":"2018","journal-title":"Bioinformatics"},{"key":"2023060910275740400_btz762-B16","doi-asserted-by":"crossref","first-page":"2533","DOI":"10.1007\/s00726-016-2274-4","article-title":"TargetCrys: protein crystallization prediction by fusing multi-view features with two-layered SVM","volume":"48","author":"Hu","year":"2016","journal-title":"Amino Acids"},{"key":"2023060910275740400_btz762-B17","doi-asserted-by":"crossref","first-page":"627","DOI":"10.1107\/S1399004713032070","article-title":"Improving the chances of successful protein structure determination with a random forest classifier","volume":"70","author":"Jahandideh","year":"2014","journal-title":"Acta Crystallogr. D"},{"key":"2023060910275740400_btz762-B18","first-page":"9.","article-title":"DeepSol: a deep learning framework for sequence-based protein solubility prediction","volume":"1","author":"Khurana","year":"2018","journal-title":"Bioinformatics"},{"key":"2023060910275740400_btz762-B19","first-page":"93","article-title":"Sequence-based protein crystallization propensity prediction for structural genomics: review and comparative analysis","volume":"1","author":"Kurgan","year":"2009","journal-title":"Nat. Sci"},{"key":"2023060910275740400_btz762-B20","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"LeCun","year":"1998","journal-title":"Proc. IEEE"},{"key":"2023060910275740400_btz762-B21","doi-asserted-by":"crossref","first-page":"319","DOI":"10.1002\/asmb.446","article-title":"Analysis of regression in game theory approach","volume":"17","author":"Lipovetsky","year":"2001","journal-title":"Appl. Stoch. Models Bus. Ind"},{"key":"2023060910275740400_btz762-B22","first-page":"4765","article-title":"A unified approach to interpreting model predictions","author":"Lundberg","year":"2017","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2023060910275740400_btz762-B23","doi-asserted-by":"crossref","first-page":"330","DOI":"10.1145\/3107411.3107418","volume-title":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","author":"Mall","year":"2017"},{"key":"2023060910275740400_btz762-B24","doi-asserted-by":"crossref","first-page":"378","DOI":"10.12688\/f1000research.14258.1","article-title":"An unsupervised disease module identification technique in biological networks using novel quality metric based on connectivity, conductance and modularity","volume":"7","author":"Mall","year":"2018","journal-title":"F1000Research"},{"key":"2023060910275740400_btz762-B25","doi-asserted-by":"crossref","first-page":"e39","DOI":"10.1093\/nar\/gky015","article-title":"RGBM: regularized gradient boosting machines for identification of the transcriptional regulators of discrete glioma subtypes","volume":"46","author":"Mall","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2023060910275740400_btz762-B26","doi-asserted-by":"crossref","first-page":"404","DOI":"10.1093\/bioinformatics\/16.4.404","article-title":"The PSIPRED protein structure prediction server","volume":"16","author":"McGuffin","year":"2000","journal-title":"Bioinformatics"},{"key":"2023060910275740400_btz762-B27","doi-asserted-by":"crossref","first-page":"580.","DOI":"10.1186\/s12859-017-1995-z","article-title":"fDETECT webserver: fast predictor of propensity for protein production, purification, and crystallization","volume":"18","author":"Meng","year":"2018","journal-title":"BMC Bioinformatics"},{"key":"2023060910275740400_btz762-B28","doi-asserted-by":"crossref","first-page":"1092","DOI":"10.1093\/bioinformatics\/btx662","article-title":"PaRSnIP: sequence-based protein solubility prediction using gradient boosting machine","volume":"34","author":"Rawi","year":"2017","journal-title":"Bioinformatics"},{"key":"2023060910275740400_btz762-B29","doi-asserted-by":"crossref","first-page":"1135","DOI":"10.1145\/2939672.2939778","volume-title":"Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","author":"Ribeiro","year":"2016"},{"key":"2023060910275740400_btz762-B30","doi-asserted-by":"crossref","first-page":"273","DOI":"10.1016\/0022-0248(88)90323-5","article-title":"Molecular factors stabilizing protein crystals","volume":"90","author":"Salemme","year":"1988","journal-title":"J. Cryst. Growth"},{"key":"2023060910275740400_btz762-B31","doi-asserted-by":"crossref","first-page":"149","DOI":"10.1007\/978-0-387-21579-2_9","volume-title":"Nonlinear Estimation and Classification","author":"Schapire","year":"2003"},{"key":"2023060910275740400_btz762-B32","doi-asserted-by":"crossref","first-page":"5857","DOI":"10.1073\/pnas.95.11.5857","article-title":"Smart, a simple modular architecture research tool: identification of signaling domains","volume":"95","author":"Schultz","year":"1998","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2023060910275740400_btz762-B33","doi-asserted-by":"crossref","first-page":"1554.","DOI":"10.1126\/science.307.5715.1554","article-title":"Structural biology. Structural genomics, round 2","volume":"307","author":"Service","year":"2005","journal-title":"Science"},{"key":"2023060910275740400_btz762-B34","first-page":"307","article-title":"A value for n-person games","volume":"2","author":"Shapley","year":"1953","journal-title":"Contributions to the Theory of Games"},{"key":"2023060910275740400_btz762-B35","doi-asserted-by":"crossref","first-page":"647","DOI":"10.1007\/s10115-013-0679-x","article-title":"Explaining prediction models and individual predictions with feature contributions","volume":"41","author":"\u0160trumbelj","year":"2014","journal-title":"Knowl. Inform. Syst"},{"key":"2023060910275740400_btz762-B36","doi-asserted-by":"crossref","first-page":"371","DOI":"10.1146\/annurev.biophys.050708.133740","article-title":"Lessons from structural genomics","volume":"38","author":"Terwilliger","year":"2009","journal-title":"Ann. Rev. Biophys"},{"key":"2023060910275740400_btz762-B37","doi-asserted-by":"crossref","first-page":"e80635.","DOI":"10.1371\/journal.pone.0080635","article-title":"Maximum allowed solvent accessibilities of residues in proteins","volume":"8","author":"Tien","year":"2013","journal-title":"PLoS One"},{"key":"2023060910275740400_btz762-B38","doi-asserted-by":"crossref","first-page":"3126","DOI":"10.1093\/bioinformatics\/bty342","article-title":"TMCrys: predict propensity of success for transmembrane protein crystallization","volume":"34","author":"Varga","year":"2018","journal-title":"Bioinformatics"},{"key":"2023060910275740400_btz762-B39","doi-asserted-by":"crossref","first-page":"e105902.","DOI":"10.1371\/journal.pone.0105902","article-title":"PredPPCrys: accurate prediction of sequence cloning, protein production, purification and crystallization propensity from protein sequences using multi-step heterogeneous feature fusion and selection","volume":"9","author":"Wang","year":"2014","journal-title":"PLoS One"},{"key":"2023060910275740400_btz762-B40","doi-asserted-by":"crossref","first-page":"21383.","DOI":"10.1038\/srep21383","article-title":"Crysalis: an integrated server for computational analysis and design of protein crystallization","volume":"6","author":"Wang","year":"2016","journal-title":"Sci. Rep"},{"key":"2023060910275740400_btz762-B41","doi-asserted-by":"crossref","first-page":"838","DOI":"10.1093\/bib\/bbx018","article-title":"Critical evaluation of bioinformatics tools for the prediction of protein crystallization propensity","volume":"19","author":"Wang","year":"2017","journal-title":"Brief. Bioinform"},{"key":"2023060910275740400_btz762-B42","doi-asserted-by":"crossref","first-page":"2138","DOI":"10.1093\/bioinformatics\/bth195","article-title":"The DISOPRED server for the prediction of protein disorder","volume":"20","author":"Ward","year":"2004","journal-title":"Bioinformatics"},{"key":"2023060910275740400_btz762-B43","doi-asserted-by":"crossref","first-page":"247","DOI":"10.1111\/j.1467-985X.2010.00678.x","article-title":"Towards more accessible conceptions of statistical inference","volume":"174","author":"Wild","year":"2011","journal-title":"J. Royal Stat. Soc"},{"key":"2023060910275740400_btz762-B44","doi-asserted-by":"crossref","first-page":"217","DOI":"10.1023\/B:jsfg.0000031965.37625.0e","article-title":"His tag effect on solubility of human proteins produced in Escherichia coli: a comparison between four expression vectors","volume":"5","author":"Woestenenk","year":"2004","journal-title":"J. Struct. Funct. Genomics"},{"key":"2023060910275740400_btz762-B45","doi-asserted-by":"crossref","first-page":"617","DOI":"10.1002\/prot.22375","article-title":"On the relation between residue flexibility and local solvent accessibility in proteins","volume":"76","author":"Zhang","year":"2009","journal-title":"Proteins"},{"key":"2023060910275740400_btz762-B46","first-page":"649","article-title":"Character-level convolutional networks for text classification","author":"Zhang","year":"2015","journal-title":"Advances in Neural Information Processing Systems"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btz762\/30296339\/btz762.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/5\/1429\/50553091\/bioinformatics_36_5_1429.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/5\/1429\/50553091\/bioinformatics_36_5_1429.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,9]],"date-time":"2023-06-09T10:30:39Z","timestamp":1686306639000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/36\/5\/1429\/5585746"}},"subtitle":[],"editor":[{"given":"Yann","family":"Ponty","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2019,10,11]]},"references-count":46,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2020,3,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btz762","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,3]]},"published":{"date-parts":[[2019,10,11]]}}}