{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:07Z","timestamp":1772138047502,"version":"3.50.1"},"reference-count":42,"publisher":"Oxford University Press (OUP)","issue":"1","license":[{"start":{"date-parts":[[2021,8,20]],"date-time":"2021-08-20T00:00:00Z","timestamp":1629417600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/100000185","name":"Defense Advanced Research Projects Agency","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100006602","name":"Air Force Research Laboratory","doi-asserted-by":"publisher","award":["FA8750-17-C-0231"],"award-info":[{"award-number":["FA8750-17-C-0231"]}],"id":[{"id":"10.13039\/100006602","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000185","name":"Defense Advanced Research Projects Agency","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Department of Defense or the United States Government"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,12,22]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Accurate automatic annotation of protein function relies on both innovative models and robust datasets. Due to their importance in biological processes, the identification of DNA-binding proteins directly from protein sequence has been the focus of many studies. However, the datasets used to train and evaluate these methods have suffered from substantial flaws. We describe some of the weaknesses of the datasets used in previous DNA-binding protein literature and provide several new datasets addressing these problems. We suggest new evaluative benchmark tasks that more realistically assess real-world performance for protein annotation models. We propose a simple new model for the prediction of DNA-binding proteins and compare its performance on the improved datasets to two previously published models. In addition, we provide extensive tests showing how the best models predict across taxa.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>Our new gradient boosting model, which uses features derived from a published protein language model, outperforms the earlier models. Perhaps surprisingly, so does a baseline nearest neighbor model using BLAST percent identity. We evaluate the sensitivity of these models to perturbations of DNA-binding regions and control regions of protein sequences. The successful data-driven models learn to focus on DNA-binding regions. When predicting across taxa, the best models are highly accurate across species in the same kingdom and can provide some information when predicting across kingdoms.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and Implementation<\/jats:title>\n                    <jats:p>The data and results for this article can be found at https:\/\/doi.org\/10.5281\/zenodo.5153906. The code for this article can be found at https:\/\/doi.org\/10.5281\/zenodo.5153683. The code, data and results can also be found at https:\/\/github.com\/AZaitzeff\/tools_for_dna_binding_proteins.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btab603","type":"journal-article","created":{"date-parts":[[2021,8,18]],"date-time":"2021-08-18T15:18:12Z","timestamp":1629299892000},"page":"44-51","source":"Crossref","is-referenced-by-count":5,"title":["Improved datasets and evaluation methods for the automatic prediction of DNA-binding proteins"],"prefix":"10.1093","volume":"38","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9731-1873","authenticated-orcid":false,"given":"Alexander","family":"Zaitzeff","sequence":"first","affiliation":[{"name":"Two Six Research, Two Six Technologies , Arlington, VA 22203, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nicholas","family":"Leiby","sequence":"additional","affiliation":[{"name":"Two Six Research, Two Six Technologies , Arlington, VA 22203, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Francis C","family":"Motta","sequence":"additional","affiliation":[{"name":"Department of Mathematical Sciences, Florida Atlantic University , Boca Raton, FL 33431, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Steven B","family":"Haase","sequence":"additional","affiliation":[{"name":"Department of Biology, Duke University , Durham, NC 27708, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jedediah M","family":"Singer","sequence":"additional","affiliation":[{"name":"Two Six Research, Two Six Technologies , Arlington, VA 22203, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2021,8,20]]},"reference":[{"key":"2023020201173611900_btab603-B1","doi-asserted-by":"crossref","first-page":"64","DOI":"10.1016\/j.jtbi.2018.10.027","article-title":"Effective DNA binding protein prediction by using key features via Chou\u2019s general PseAAC","volume":"460","author":"Adilina","year":"2019","journal-title":"J. Theor. Biol"},{"key":"2023020201173611900_btab603-B2","doi-asserted-by":"crossref","first-page":"645","DOI":"10.1007\/s10822-019-00207-x","article-title":"DP-binder: machine learning model for prediction of DNA-binding proteins by fusing evolutionary and physicochemical information","volume":"33","author":"Ali","year":"2019","journal-title":"J. Comput. Aided Mol. Des"},{"key":"2023020201173611900_btab603-B3","doi-asserted-by":"crossref","first-page":"460","DOI":"10.1016\/S0076-6879(96)66029-7","article-title":"Local alignment statistics","volume":"266","author":"Altschul","year":"1996","journal-title":"Methods Enzymol"},{"key":"2023020201173611900_btab603-B4","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","article-title":"Basic local alignment search tool","volume":"215","author":"Altschul","year":"1990","journal-title":"J. Mol. Biol"},{"key":"2023020201173611900_btab603-B5","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1038\/75556","article-title":"Gene ontology: tool for the unification of biology","volume":"25","author":"Ashburner","year":"2000","journal-title":"Nat. Genet"},{"key":"2023020201173611900_btab603-B6","doi-asserted-by":"crossref","first-page":"3203","DOI":"10.1093\/bioinformatics\/bts608","article-title":"Assessing the relationship between conservation of function and conservation of sequence using photosynthetic proteins","volume":"28","author":"Ashkenazi","year":"2012","journal-title":"Bioinformatics"},{"key":"2023020201173611900_btab603-B7","doi-asserted-by":"crossref","first-page":"i305","DOI":"10.1093\/bioinformatics\/btz328","article-title":"Multifaceted protein\u2013protein interaction prediction based on Siamese residual RCNN","volume":"35","author":"Chen","year":"2019","journal-title":"Bioinformatics"},{"key":"2023020201173611900_btab603-B8","first-page":"785","author":"Chen","year":"2016"},{"key":"2023020201173611900_btab603-B9","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1186\/s13040-017-0155-3","article-title":"Ten quick tips for machine learning in computational biology","volume":"10","author":"Chicco","year":"2017","journal-title":"BioData Min"},{"key":"2023020201173611900_btab603-B10","doi-asserted-by":"crossref","first-page":"14938","DOI":"10.1038\/s41598-017-14945-1","article-title":"iDNAProt-ES: identification of DNA-binding proteins using evolutionary and structural features","volume":"7","author":"Chowdhury","year":"2017","journal-title":"Sci. Rep"},{"key":"2023020201173611900_btab603-B11","doi-asserted-by":"crossref","first-page":"3119","DOI":"10.1021\/acs.jproteome.9b00226","article-title":"Msdbp: exploring DNA-binding proteins by integrating multiscale sequence information via Chou\u2019s five-step rule","volume":"18","author":"Du","year":"2019","journal-title":"J. Proteome Res"},{"key":"2023020201173611900_btab603-B12","author":"Elnaggar","year":"2020"},{"key":"2023020201173611900_btab603-B13","doi-asserted-by":"crossref","first-page":"D1186","DOI":"10.1093\/nar\/gky1036","article-title":"Eco, the evidence & conclusion ontology: community standard for evidence information","volume":"47","author":"Giglio","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2023020201173611900_btab603-B14","doi-asserted-by":"crossref","first-page":"412","DOI":"10.1016\/S0955-0674(97)80015-4","article-title":"Nuclear protein import","volume":"9","author":"G\u00f6rlich","year":"1997","journal-title":"Curr. Opin. Cell Biol"},{"key":"2023020201173611900_btab603-B15","doi-asserted-by":"crossref","first-page":"i802","DOI":"10.1093\/bioinformatics\/bty573","article-title":"Predicting protein\u2013protein interactions through sequence-based deep learning","volume":"34","author":"Hashemifar","year":"2018","journal-title":"Bioinformatics"},{"key":"2023020201173611900_btab603-B16","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1186\/1471-2148-1-4","article-title":"A genomic timescale for the origin of eukaryotes","volume":"1","author":"Hedges","year":"2001","journal-title":"BMC Evol. Biol"},{"key":"2023020201173611900_btab603-B17","doi-asserted-by":"crossref","first-page":"e0225317","DOI":"10.1371\/journal.pone.0225317","article-title":"An improved deep learning method for predicting DNA-binding proteins based on contextual features in amino acid sequences","volume":"14","author":"Hu","year":"2019","journal-title":"PLoS One"},{"key":"2023020201173611900_btab603-B18","doi-asserted-by":"crossref","first-page":"345","DOI":"10.1016\/B978-0-12-374984-0.00439-3","volume-title":"Brenner\u2019s Encyclopedia of Genetics","author":"Jen","year":"2013","edition":"2nd edn"},{"key":"2023020201173611900_btab603-B19","doi-asserted-by":"crossref","first-page":"79","DOI":"10.1016\/0092-8674(87)90358-8","article-title":"A cellular DNA-binding protein that activates eukaryotic transcription and DNA replication","volume":"48","author":"Jones","year":"1987","journal-title":"Cell"},{"key":"2023020201173611900_btab603-B20","doi-asserted-by":"crossref","first-page":"463","DOI":"10.1186\/1471-2105-8-463","article-title":"Identification of DNA-binding proteins using support vector machines and evolutionary profiles","volume":"8","author":"Kumar","year":"2007","journal-title":"BMC Bioinformatics"},{"key":"2023020201173611900_btab603-B21","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1186\/s13062-020-00263-6","article-title":"Origin of the nuclear proteome on the basis of pre-existing nuclear localization signals in prokaryotic proteins","volume":"15","author":"Lisitsyna","year":"2020","journal-title":"Biol. Direct"},{"key":"2023020201173611900_btab603-B22","doi-asserted-by":"crossref","first-page":"e106691","DOI":"10.1371\/journal.pone.0106691","article-title":"iDNA-Prot|dis: identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition","volume":"9","author":"Liu","year":"2014","journal-title":"PLoS One"},{"key":"2023020201173611900_btab603-B23","doi-asserted-by":"crossref","first-page":"15479","DOI":"10.1038\/srep15479","article-title":"DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation","volume":"5","author":"Liu","year":"2015","journal-title":"Sci. Rep"},{"key":"2023020201173611900_btab603-B24","doi-asserted-by":"crossref","first-page":"328","DOI":"10.1109\/TNB.2016.2555951","article-title":"Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning","volume":"15","author":"Liu","year":"2016","journal-title":"IEEE Trans. Nanobiosci"},{"key":"2023020201173611900_btab603-B25","doi-asserted-by":"crossref","first-page":"e86703","DOI":"10.1371\/journal.pone.0086703","article-title":"Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naive Bayes","volume":"9","author":"Lou","year":"2014","journal-title":"PLoS One"},{"key":"2023020201173611900_btab603-B26","doi-asserted-by":"crossref","first-page":"e0167345","DOI":"10.1371\/journal.pone.0167345","article-title":"DNAbp: identification of DNA-binding proteins based on feature selection using a random forest and predicting binding residues","volume":"11","author":"Ma","year":"2016","journal-title":"PLoS One"},{"key":"2023020201173611900_btab603-B27","doi-asserted-by":"crossref","first-page":"433","DOI":"10.1093\/bioinformatics\/bty653","article-title":"Stackdppred: a stacking based prediction of DNA-binding protein from sequence","volume":"35","author":"Mishra","year":"2019","journal-title":"Bioinformatics"},{"key":"2023020201173611900_btab603-B28","doi-asserted-by":"crossref","first-page":"e158","DOI":"10.1093\/nar\/gkv805","article-title":"DNA-binding protein prediction using plant specific support vector machines: validation and application of a new genome annotation tool","volume":"43","author":"Motion","year":"2015","journal-title":"Nucleic Acids Res"},{"key":"2023020201173611900_btab603-B29","doi-asserted-by":"crossref","first-page":"13424","DOI":"10.1038\/ncomms13424","article-title":"De-novo protein function prediction using DNA binding and RNA binding proteins as a test case","volume":"7","author":"Peled","year":"2016","journal-title":"Nat. Commun"},{"key":"2023020201173611900_btab603-B30","doi-asserted-by":"crossref","first-page":"e0188129","DOI":"10.1371\/journal.pone.0188129","article-title":"On the prediction of DNA-binding proteins only from primary sequences: a deep learning approach","volume":"12","author":"Qu","year":"2017","journal-title":"PLoS One"},{"key":"2023020201173611900_btab603-B31","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1016\/j.jtbi.2018.05.006","article-title":"DPP-PseAAC: a DNA-binding protein prediction model using Chou\u2019s general PseAAC","volume":"452","author":"Rahman","year":"2018","journal-title":"J. Theor. Biol"},{"key":"2023020201173611900_btab603-B32","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","author":"Rives","year":"2019","journal-title":"bioRxiv, doi: 10.1101\/622803"},{"key":"2023020201173611900_btab603-B33","article-title":"Sequence-based prediction of protein\u2013protein interactions: a structure-aware interpretable deep learning model","author":"Sledzieski","year":"2021","journal-title":"bioRxiv, doi: 10.1101\/2021.01.22.427866"},{"key":"2023020201173611900_btab603-B34","first-page":"D506","article-title":"UniProt: a worldwide hub of protein knowledge","volume":"47","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2023020201173611900_btab603-B35","doi-asserted-by":"crossref","first-page":"D330","DOI":"10.1093\/nar\/gky1055","article-title":"The gene ontology resource: 20 years and still going strong","volume":"47","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2023020201173611900_btab603-B36","first-page":"7297631","article-title":"PredDBP-stack: prediction of DNA-binding proteins from HMM profiles using a stacked ensemble method","volume":"2020","author":"Wang","year":"2020","journal-title":"Biomed. Res. Int"},{"key":"2023020201173611900_btab603-B37","doi-asserted-by":"crossref","first-page":"e0185587","DOI":"10.1371\/journal.pone.0185587","article-title":"Improved detection of DNA-binding proteins via compression technology on PSSM information","volume":"12","author":"Wang","year":"2017","journal-title":"PLoS One"},{"key":"2023020201173611900_btab603-B38","doi-asserted-by":"crossref","first-page":"154","DOI":"10.1016\/j.neucom.2016.03.025","article-title":"Identification of DNA binding proteins using evolutionary profiles position specific scoring matrix","volume":"199","author":"Waris","year":"2016","journal-title":"Neurocomputing"},{"key":"2023020201173611900_btab603-B39","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1016\/j.ins.2016.06.026","article-title":"Local-DPP: an improved DNA-binding protein prediction method by exploring local evolutionary information","volume":"384","author":"Wei","year":"2017","journal-title":"Inf. Sci. (N.Y.)"},{"key":"2023020201173611900_btab603-B40","doi-asserted-by":"crossref","first-page":"294279","DOI":"10.1155\/2014\/294279","article-title":"enDNA-prot: identification of DNA-binding proteins by applying ensemble learning","volume":"2014","author":"Xu","year":"2014","journal-title":"Biomed. Res. Int"},{"key":"2023020201173611900_btab603-B41","doi-asserted-by":"crossref","first-page":"4590609","DOI":"10.1155\/2017\/4590609","article-title":"Hmmbinder: DNA-binding protein prediction using hmm profile based features","volume":"2017","author":"Zaman","year":"2017","journal-title":"Biomed. Res. Int"},{"key":"2023020201173611900_btab603-B42","doi-asserted-by":"crossref","first-page":"1856","DOI":"10.3390\/ijms18091856","article-title":"PSFM-DBT: identifying DNA-binding proteins by combing position specific frequency matrix and distance-bigram transformation","volume":"18","author":"Zhang","year":"2017","journal-title":"Int. J. Mol. Sci"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btab603\/40664185\/btab603.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/1\/44\/49006293\/btab603.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/1\/44\/49006293\/btab603.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,1]],"date-time":"2023-02-01T22:37:12Z","timestamp":1675291032000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/38\/1\/44\/6355576"}},"subtitle":[],"editor":[{"given":"Lenore","family":"Cowen","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2021,8,20]]},"references-count":42,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12,22]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btab603","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2021.04.09.439184","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,1,1]]},"published":{"date-parts":[[2021,8,20]]}}}