{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:45Z","timestamp":1772138085280,"version":"3.50.1"},"reference-count":48,"publisher":"Oxford University Press (OUP)","issue":"9","license":[{"start":{"date-parts":[[2017,12,22]],"date-time":"2017-12-22T00:00:00Z","timestamp":1513900800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/about_us\/legal\/notices"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2018,5,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Classification by supervised machine learning greatly facilitates the annotation of protein characteristics from their primary sequence. However, the feature generation step in this process requires detailed knowledge of attributes used to classify the proteins. Lack of this knowledge risks the selection of irrelevant features, resulting in a faulty model. In this study, we introduce a supervised protein classification method with a novel means of automating the work-intensive feature generation step via a Natural Language Processing (NLP)-dependent model, using a modified combination of n-grams and skip-grams (m-NGSG).<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>A meta-comparison of cross-validation accuracy with twelve training datasets from nine different published studies demonstrates a consistent increase in accuracy of m-NGSG when compared to contemporary classification and feature generation models. 
We expect this model to accelerate the classification of proteins from primary sequence data and increase the accessibility of protein characteristic prediction to a broader range of scientists.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>m-NGSG is freely available at Bitbucket: https:\/\/bitbucket.org\/sm_islam\/mngsg\/src. A web server is available at watson.ecs.baylor.edu\/ngsg.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btx823","type":"journal-article","created":{"date-parts":[[2017,12,21]],"date-time":"2017-12-21T23:56:25Z","timestamp":1513900585000},"page":"1481-1487","source":"Crossref","is-referenced-by-count":31,"title":["Protein classification using modified\n                    <i>n-grams<\/i>\n                    and\n                    <i>skip-grams<\/i>"],"prefix":"10.1093","volume":"34","author":[{"given":"S M Ashiqul","family":"Islam","sequence":"first","affiliation":[{"name":"Institute of Biomedical Studies, Baylor University, Waco, TX, USA"}]},{"given":"Benjamin J","family":"Heil","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Baylor University, Waco, TX, USA"}]},{"given":"Christopher Michel","family":"Kearney","sequence":"additional","affiliation":[{"name":"Institute of Biomedical Studies, Baylor University, Waco, TX, USA"},{"name":"Department of Biology, Baylor University, Waco, TX, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7798-5704","authenticated-orcid":false,"given":"Erich J","family":"Baker","sequence":"additional","affiliation":[{"name":"Institute of Biomedical Studies, Baylor University, Waco, TX, 
USA"},{"name":"Department of Computer Science, Baylor University, Waco, TX, USA"}]}],"member":"286","published-online":{"date-parts":[[2017,12,22]]},"reference":[{"key":"2023012713020204400_btx823-B1","doi-asserted-by":"crossref","first-page":"e0141287.","DOI":"10.1371\/journal.pone.0141287","article-title":"Continuous distributed representation of biological sequences for deep proteomics and genomics","volume":"10","author":"Asgari","year":"2015","journal-title":"PloS One"},{"key":"2023012713020204400_btx823-B2","doi-asserted-by":"crossref","first-page":"783","DOI":"10.1016\/j.jmb.2004.05.028","article-title":"Improved prediction of signal peptides: SignalP 3.0","volume":"340","author":"Bendtsen","year":"2004","journal-title":"J. Mol. Biol"},{"key":"2023012713020204400_btx823-B3","doi-asserted-by":"crossref","first-page":"455","DOI":"10.1093\/bioinformatics\/17.5.455","article-title":"Predicting protein\u2013protein interactions from primary structure","volume":"17","author":"Bock","year":"2001","journal-title":"Bioinformatics"},{"key":"2023012713020204400_btx823-B4","doi-asserted-by":"crossref","first-page":"890","DOI":"10.1093\/bib\/bbt052","article-title":"Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis","volume":"15","author":"Bonham-Carter","year":"2014","journal-title":"Briefings in Bioinformatics"},{"key":"2023012713020204400_btx823-B5","doi-asserted-by":"crossref","first-page":"3692","DOI":"10.1093\/nar\/gkg600","article-title":"Svm-prot: web-based support vector machine software for functional classification of a protein from its primary sequence","volume":"31","author":"Cai","year":"2003","journal-title":"Nucleic Acids Res"},{"key":"2023012713020204400_btx823-B6","first-page":"1.","article-title":"Protein sequence classification with improved extreme learning machine algorithms","volume":"2014","author":"Cao","year":"2014","journal-title":"BioMed Res. 
Int"},{"key":"2023012713020204400_btx823-B7","first-page":"161","article-title":"N-gram-based text categorization","volume":"48113","author":"Cavnar","year":"1994","journal-title":"Ann Arbor MI"},{"key":"2023012713020204400_btx823-B8","doi-asserted-by":"crossref","DOI":"10.1038\/srep22843","article-title":"A web server and mobile app for computing hemolytic potency of peptides","volume":"6","author":"Chaudhary","year":"2016","journal-title":"Sci. Rep"},{"key":"2023012713020204400_btx823-B9","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1093\/bioinformatics\/bth466","article-title":"Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes","volume":"21","author":"Chou","year":"2005","journal-title":"Bioinformatics"},{"key":"2023012713020204400_btx823-B10","first-page":"316","article-title":"Vector quantization kernels for the classification of protein sequences and structures","volume":"2014","author":"Clark","year":"2014","journal-title":"Biocomputing"},{"key":"2023012713020204400_btx823-B11","first-page":"1265","article-title":"Comparative experiments on sentiment classification for online product reviews","volume":"6","author":"Cui","year":"2006","journal-title":"AAAI"},{"key":"2023012713020204400_btx823-B12","doi-asserted-by":"crossref","first-page":"2229","DOI":"10.1039\/C4MB00316K","article-title":"Identification of bacteriophage virion proteins by the anova feature selection and analysis","volume":"10","author":"Ding","year":"2014","journal-title":"Molecular BioSystems"},{"key":"2023012713020204400_btx823-B13","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1007\/s12539-013-0205-6","article-title":"Prediction of protein structural classes based on feature selection technique","volume":"6","author":"Ding","year":"2014","journal-title":"Interdisc. Sci. Comput. 
Life Sci"},{"key":"2023012713020204400_btx823-B14","doi-asserted-by":"crossref","first-page":"330","DOI":"10.1016\/j.jtbi.2009.08.004","article-title":"Subchlo: predicting protein subchloroplast locations with pseudo-amino acid composition and the evidence-theoretic k-nearest neighbor (et-knn) algorithm","volume":"261","author":"Du","year":"2009","journal-title":"J. Theor. Biol"},{"key":"2023012713020204400_btx823-B15","doi-asserted-by":"crossref","first-page":"3495","DOI":"10.3390\/ijms15033495","article-title":"PseAAC-general: fast building various modes of general form of Chou\u2019s pseudo-amino acid composition for large-scale protein datasets","volume":"15","author":"Du","year":"2014","journal-title":"Int. J. Mol. Sci"},{"key":"2023012713020204400_btx823-B16","doi-asserted-by":"crossref","first-page":"14427","DOI":"10.1074\/jbc.M411789200","article-title":"Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search","volume":"280","author":"Garg","year":"2005","journal-title":"J. Biol. Chem"},{"key":"2023012713020204400_btx823-B17","doi-asserted-by":"crossref","first-page":"6266","DOI":"10.1016\/j.eswa.2013.05.057","article-title":"Twitter brand sentiment analysis: a hybrid system using n-gram analysis and dynamic artificial neural network","volume":"40","author":"Ghiassi","year":"2013","journal-title":"Exp. Syst. 
Appl"},{"key":"2023012713020204400_btx823-B18","author":"Goldberg","year":"2014"},{"key":"2023012713020204400_btx823-B19","first-page":"1","author":"Guthrie","year":"2006"},{"key":"2023012713020204400_btx823-B20","doi-asserted-by":"crossref","first-page":"W7","DOI":"10.1093\/nar\/gku398","article-title":"Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches","volume":"42","author":"Horwege","year":"2014","journal-title":"Nucleic Acids Res"},{"key":"2023012713020204400_btx823-B21","first-page":"168","author":"Hu","year":"2004"},{"key":"2023012713020204400_btx823-B22","doi-asserted-by":"crossref","first-page":"210.","DOI":"10.1186\/s12859-015-0633-x","article-title":"PredSTP: a highly accurate SVM based model to predict sequential cystine stabilized peptides","volume":"16","author":"Islam","year":"2015","journal-title":"BMC Bioinformatics"},{"key":"2023012713020204400_btx823-B23","doi-asserted-by":"crossref","first-page":"47","DOI":"10.1016\/j.jtbi.2015.04.011","article-title":"ippi-esml: an ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into pseaac","volume":"377","author":"Jia","year":"2015","journal-title":"J. Theor. Biol"},{"key":"2023012713020204400_btx823-B25","doi-asserted-by":"crossref","first-page":"374\u2013374.","DOI":"10.1093\/nar\/28.1.374","article-title":"Aaindex: amino acid index database","volume":"28","author":"Kawashima","year":"2000","journal-title":"Nucleic Acids Res"},{"key":"2023012713020204400_btx823-B26","doi-asserted-by":"crossref","first-page":"181","DOI":"10.1016\/j.bbapap.2013.05.002","article-title":"Prediction and characterization of cyclic proteins from sequences in three domains of life","volume":"1844","author":"Kedarisetti","year":"2014","journal-title":"Biochim. Biophys. 
Acta (BBA) Proteins Proteomics"},{"key":"2023012713020204400_btx823-B27","first-page":"566","author":"Leslie","year":"2002"},{"key":"2023012713020204400_btx823-B28","doi-asserted-by":"crossref","first-page":"467","DOI":"10.1093\/bioinformatics\/btg431","article-title":"Mismatch string kernels for discriminative protein classification","volume":"20","author":"Leslie","year":"2004","journal-title":"Bioinformatics"},{"key":"2023012713020204400_btx823-B29","doi-asserted-by":"crossref","first-page":"739","DOI":"10.2174\/092986608785133681","article-title":"Predicting subcellular localization of mycobacterial proteins by using Chou\u2019s pseudo amino acid composition","volume":"15","author":"Lin","year":"2008","journal-title":"Protein Peptide Lett"},{"key":"2023012713020204400_btx823-B30","doi-asserted-by":"crossref","first-page":"S3.","DOI":"10.1186\/1471-2105-15-S16-S3","article-title":"Using distances between Top-n-gram and residue pairs for protein remote homology detection","volume":"15","author":"Liu","year":"2014","journal-title":"BMC Bioinformatics"},{"key":"2023012713020204400_btx823-B31","doi-asserted-by":"crossref","first-page":"133","DOI":"10.2174\/157340613804488341","article-title":"Prediction of allergenic proteins by means of the concept of Chou\u2019s pseudo amino acid composition and a machine learning approach","volume":"9","author":"Mohabatkar","year":"2013","journal-title":"Med. 
Chem"},{"key":"2023012713020204400_btx823-B32","first-page":"79","author":"Pang","year":"2002"},{"key":"2023012713020204400_btx823-B34","doi-asserted-by":"crossref","DOI":"10.1021\/bk-1979-0092","volume-title":"Functionality and Protein Structure: Based on a Symposium","author":"Pour-El","year":"1979"},{"key":"2023012713020204400_btx823-B35","doi-asserted-by":"crossref","first-page":"44310.","DOI":"10.18632\/oncotarget.10027","article-title":"iHyd-PseCp: Identify hydroxyproline and hydroxylysine in proteins by incorporating sequence-coupled effects into general PseAAC","volume":"7","author":"Qiu","year":"2016","journal-title":"Oncotarget"},{"key":"2023012713020204400_btx823-B36","doi-asserted-by":"crossref","first-page":"e0136990.","DOI":"10.1371\/journal.pone.0136990","article-title":"AntiAngioPred: a server for prediction of anti-angiogenic peptides","volume":"10","author":"Ramaprasad","year":"2015","journal-title":"Plos One"},{"key":"2023012713020204400_btx823-B38","doi-asserted-by":"crossref","first-page":"1607.","DOI":"10.1038\/srep01607","article-title":"Computational approach for designing tumor homing peptides","volume":"3","author":"Sharma","year":"2013","journal-title":"Sci. Rep"},{"key":"2023012713020204400_btx823-B39","doi-asserted-by":"crossref","first-page":"72.","DOI":"10.1186\/s13321-016-0185-8","article-title":"osfp: a web server for predicting the oligomeric states of fluorescent proteins","volume":"8","author":"Simeon","year":"2016","journal-title":"J. Cheminf"},{"key":"2023012713020204400_btx823-B40","first-page":"1642","author":"Socher","year":"2013"},{"key":"2023012713020204400_btx823-B41","doi-asserted-by":"crossref","first-page":"529","DOI":"10.1016\/S0306-4573(01)00045-0","article-title":"The use of bigrams to enhance text categorization","volume":"38","author":"Tan","year":"2002","journal-title":"Inf. Process. 
Manag"},{"key":"2023012713020204400_btx823-B42","doi-asserted-by":"crossref","first-page":"1269","DOI":"10.1039\/C5MB00883B","article-title":"Identification of immunoglobulins using Chou\u2019s pseudo amino acid composition with feature selection technique","volume":"12","author":"Tang","year":"2016","journal-title":"Mol. BioSystems"},{"key":"2023012713020204400_btx823-B43","doi-asserted-by":"crossref","first-page":"251.","DOI":"10.1186\/1471-2105-11-251","article-title":"High quality protein sequence alignment by combining structural profile prediction and profile alignment using SABERTOOTH","volume":"11","author":"Teichert","year":"2010","journal-title":"BMC Bioinformatics"},{"key":"2023012713020204400_btx823-B44","doi-asserted-by":"crossref","first-page":"197","DOI":"10.1016\/j.cmpb.2016.07.004","article-title":"Prediction of G-protein coupled receptors and their subfamilies by incorporating various sequence features into Chou\u2019s general PseAAC","volume":"134","author":"Tiwari","year":"2016","journal-title":"Comput. Methods Programs Biomed"},{"key":"2023012713020204400_btx823-B45","doi-asserted-by":"crossref","first-page":"S9","DOI":"10.1186\/1471-2105-13-S15-S9","article-title":"A Support Vector Machine based method to distinguish proteobacterial proteins from eukaryotic plant proteins","volume":"13","author":"Verma","year":"2012","journal-title":"BMC Bioinformatics"},{"key":"2023012713020204400_btx823-B46","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1093\/bioinformatics\/btg005","article-title":"Alignment-free sequence comparison\u2014a review","volume":"19","author":"Vinga","year":"2003","journal-title":"Bioinformatics"},{"key":"2023012713020204400_btx823-B47","doi-asserted-by":"crossref","first-page":"168","DOI":"10.1016\/j.ab.2013.01.019","article-title":"iamp-2l: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types","volume":"436","author":"Xiao","year":"2013","journal-title":"Anal. 
Biochem"},{"key":"2023012713020204400_btx823-B48","doi-asserted-by":"crossref","first-page":"e55844.","DOI":"10.1371\/journal.pone.0055844","article-title":"iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition","volume":"8","author":"Xu","year":"2013","journal-title":"Plos One"},{"key":"2023012713020204400_btx823-B49","first-page":"165","author":"Yu","year":"2008"},{"key":"2023012713020204400_btx823-B50","doi-asserted-by":"crossref","first-page":"1.","DOI":"10.1155\/2015\/674296","article-title":"Survey of natural language processing techniques in bioinformatics","volume":"2015","author":"Zeng","year":"2015","journal-title":"Comput. Math. Methods Med"},{"key":"2023012713020204400_btx823-B51","doi-asserted-by":"crossref","first-page":"492","DOI":"10.2174\/092986612800191080","article-title":"Predicting protein\u2013protein interactions by combing various sequence-derived features into the general form of Chou\u2019s pseudo amino acid composition","volume":"19","author":"Zhao","year":"2012","journal-title":"Protein Peptide 
Lett"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/9\/1481\/48914713\/bioinformatics_34_9_1481.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/9\/1481\/48914713\/bioinformatics_34_9_1481.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,27]],"date-time":"2023-01-27T08:53:48Z","timestamp":1674809628000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/34\/9\/1481\/4772682"}},"subtitle":[],"editor":[{"given":"Alfonso","family":"Valencia","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2017,12,22]]},"references-count":48,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2018,5,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btx823","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/170407","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2018,5,1]]},"published":{"date-parts":[[2017,12,22]]}}}