{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,17]],"date-time":"2026-02-17T12:14:09Z","timestamp":1771330449519,"version":"3.50.1"},"reference-count":44,"publisher":"Oxford University Press (OUP)","issue":"2","license":[{"start":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T00:00:00Z","timestamp":1769558400000},"content-version":"vor","delay-in-days":1,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Flemish Government under the \u201cOnderzoeksprogramma Artifici\u00eble Intelligentie (AI) Vlaanderen\u201d Programme"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2026,2,3]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Transcription factors (TFs) are key players in gene regulation and development, where they activate and repress gene expression through DNA binding. Predicting transcription factor binding sites (TFBSs) has long been an active area of research, with many deep learning methods developed to tackle this problem. These models are often trained on TF ChIP-seq data, which is generally seen as only providing positive samples. The choice of datasets and negative sampling techniques is a critical yet often overlooked aspect of this work.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>In this study, we investigate the impact of different negative sampling techniques on TFBS prediction performance. We create high-quality test datasets based on ChIP-seq and ATAC-seq data, where true negatives can be identified as positions that are accessible but not bound by the TF in question. 
We then train models using various negative sampling techniques, including genomic sampling, shuffling, dinucleotide shuffling, neighborhood sampling, and cell line specific sampling, simulating cases where matching ATAC-seq data is not available. Our results show that, generally, metrics calculated on training datasets give inflated performance scores. Of the tested techniques, genomic sampling of negatives based on similarity to the positives performed by far the best, although still not reaching the performance of baseline models trained on high-quality datasets. Models trained on dinucleotide shuffled negatives performed poorly, despite being a common practice in the field. Our findings highlight the importance of carefully selecting negative sampling techniques for TFBS prediction, as they can significantly impact model performance and the interpretation of results.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>The code used in this study is available at https:\/\/github.com\/NatanTourne\/TFBS-negatives (DOI: 10.5281\/zenodo.18007567).<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btag048","type":"journal-article","created":{"date-parts":[[2026,1,23]],"date-time":"2026-01-23T12:38:54Z","timestamp":1769171934000},"source":"Crossref","is-referenced-by-count":0,"title":["How negative sampling shapes the performance of transcription factor binding site prediction models"],"prefix":"10.1093","volume":"42","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-5046-010X","authenticated-orcid":false,"given":"Natan","family":"Tourne","sequence":"first","affiliation":[{"name":"Department of Data Analysis and Mathematical Modelling, Ghent University , Ghent 9000,","place":["Belgium"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0367-9699","authenticated-orcid":false,"given":"Gaetan","family":"De 
Waele","sequence":"additional","affiliation":[{"name":"Department of Data Analysis and Mathematical Modelling, Ghent University , Ghent 9000,","place":["Belgium"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1975-0712","authenticated-orcid":false,"given":"Vanessa","family":"Vermeirssen","sequence":"additional","affiliation":[{"name":"Lab for Computational Biology, Integromics and Gene Regulation (CBIGR), Cancer Research Institute Ghent (CRIG) , Ghent 9000,","place":["Belgium"]},{"name":"Department of Biomedical Molecular Biology, Ghent University , Ghent 9000,","place":["Belgium"]},{"name":"Department of Biomolecular Medicine, Ghent University , Ghent 9000,","place":["Belgium"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5950-3003","authenticated-orcid":false,"given":"Willem","family":"Waegeman","sequence":"additional","affiliation":[{"name":"Department of Data Analysis and Mathematical Modelling, Ghent University , Ghent 9000,","place":["Belgium"]}]}],"member":"286","published-online":{"date-parts":[[2026,1,27]]},"reference":[{"key":"2026021706212602600_btag048-B1","doi-asserted-by":"publisher","first-page":"831","DOI":"10.1038\/nbt.3300","article-title":"Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning","volume":"33","author":"Alipanahi","year":"2015","journal-title":"Nat Biotechnol"},{"key":"2026021706212602600_btag048-B2","doi-asserted-by":"publisher","first-page":"1196","DOI":"10.1038\/s41592-021-01252-x","article-title":"Effective gene expression prediction from sequence by integrating long-range interactions","volume":"18","author":"Avsec","year":"2021","journal-title":"Nat Methods"},{"key":"2026021706212602600_btag048-B3","doi-asserted-by":"publisher","first-page":"354","DOI":"10.1038\/s41588-021-00782-6","article-title":"Base-resolution models of transcription-factor binding reveal soft motif syntax","volume":"53","author":"Avsec","year":"2021","journal-title":"Nat 
Genet"},{"key":"2026021706212602600_btag048-B4","doi-asserted-by":"publisher","first-page":"393","DOI":"10.1038\/nprot.2008.195","article-title":"Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors","volume":"4","author":"Berger","year":"2009","journal-title":"Nat Protoc"},{"key":"2026021706212602600_btag048-B5","doi-asserted-by":"publisher","first-page":"3937","DOI":"10.1093\/bioinformatics\/btz194","article-title":"Neural networks with circular filters enable data efficient inference of sequence motifs","volume":"35","author":"Blum","year":"2019","journal-title":"Bioinformatics"},{"key":"2026021706212602600_btag048-B6","doi-asserted-by":"publisher","first-page":"253","DOI":"10.1034\/j.1399-0004.2000.570403.x","article-title":"Online Mendelian Inheritance in Man (OMIM) as a knowledgebase for human developmental disorders","volume":"57","author":"Boyadjiev","year":"2000","journal-title":"Clin Genet"},{"key":"2026021706212602600_btag048-B7","doi-asserted-by":"publisher","first-page":"497","DOI":"10.1007\/s00412-015-0543-8","article-title":"Homeodomain proteins: an update","volume":"125","author":"B\u00fcrglin","year":"2016","journal-title":"Chromosoma"},{"key":"2026021706212602600_btag048-B8","doi-asserted-by":"publisher","first-page":"D165","DOI":"10.1093\/nar\/gkab1113","article-title":"JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles","volume":"50","author":"Castro-Mondragon","year":"2022","journal-title":"Nucleic Acids Res"},{"key":"2026021706212602600_btag048-B9","doi-asserted-by":"publisher","first-page":"e1010863","DOI":"10.1371\/journal.pcbi.1010863","article-title":"maxATAC: Genome-scale transcription-factor binding prediction from ATAC-seq with deep neural networks","volume":"19","author":"Cazares","year":"2023","journal-title":"PLoS Comput 
Biol"},{"key":"2026021706212602600_btag048-B10","doi-asserted-by":"publisher","first-page":"1422","DOI":"10.1093\/bioinformatics\/btp163","article-title":"Biopython: freely available Python tools for computational molecular biology and bioinformatics","volume":"25","author":"Cock","year":"2009","journal-title":"Bioinformatics"},{"key":"2026021706212602600_btag048-B11","doi-asserted-by":"publisher","first-page":"1188","DOI":"10.1101\/gr.849004","article-title":"WebLogo: a sequence logo generator","volume":"14","author":"Crooks","year":"2004","journal-title":"Genome Res"},{"key":"2026021706212602600_btag048-B12","doi-asserted-by":"publisher","first-page":"208","DOI":"10.1186\/1471-2105-10-208","article-title":"OHMM: a Hidden Markov Model accurately predicting the occupancy of a transcription factor with a self-overlapping binding motif","volume":"10","author":"Drawid","year":"2009","journal-title":"BMC Bioinformatics"},{"key":"2026021706212602600_btag048-B13","doi-asserted-by":"publisher","first-page":"57","DOI":"10.1038\/nature11247","article-title":"An integrated encyclopedia of DNA elements in the human genome","volume":"489","author":"ENCODE Project Consortium","year":"2012","journal-title":"Nature"},{"key":"2026021706212602600_btag048-B14","doi-asserted-by":"publisher","first-page":"eaba9031","DOI":"10.1126\/sciadv.aba9031","article-title":"Predicting transcription factor binding in single cells through deep learning","volume":"6","author":"Fu","year":"2020","journal-title":"Sci Adv"},{"key":"2026021706212602600_btag048-B15","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1186\/1471-2164-7-3","article-title":"Structural and functional properties of genes involved in human cancer","volume":"7","author":"Furney","year":"2006","journal-title":"BMC Genomics"},{"key":"2026021706212602600_btag048-B16","doi-asserted-by":"publisher","first-page":"2205","DOI":"10.1093\/bioinformatics\/btw203","article-title":"gkmSVM: an R package for gapped-kmer 
SVM","volume":"32","author":"Ghandi","year":"2016","journal-title":"Bioinformatics"},{"key":"2026021706212602600_btag048-B17","doi-asserted-by":"publisher","first-page":"4990","DOI":"10.3390\/ijms25094990","article-title":"Predicting transcription factor binding sites with deep learning","volume":"25","author":"Ghosh","year":"2024","journal-title":"Int J Mol Sci"},{"key":"2026021706212602600_btag048-B18","doi-asserted-by":"publisher","author":"Hiranuma","DOI":"10.1101\/172767"},{"key":"2026021706212602600_btag048-B19","doi-asserted-by":"publisher","first-page":"226","DOI":"10.1186\/gb-2004-5-6-226","article-title":"An overview of the basic helix-loop-helix proteins","volume":"5","author":"Jones","year":"2004","journal-title":"Genome Biol"},{"key":"2026021706212602600_btag048-B20","doi-asserted-by":"publisher","first-page":"1607","DOI":"10.1093\/bioinformatics\/btaa928","article-title":"BiasAway: command-line and web server to generate nucleotide composition-matched DNA background sequences","volume":"37","author":"Khan","year":"2021","journal-title":"Bioinformatics"},{"key":"2026021706212602600_btag048-B21","doi-asserted-by":"publisher","author":"Kingma","year":"2017","DOI":"10.48550\/arXiv.1412.6980"},{"key":"2026021706212602600_btag048-B22","doi-asserted-by":"publisher","first-page":"77","DOI":"10.1016\/S0076-6879(10)70004-5","volume-title":"Methods in Enzymology, Volume 470 of Guide to Yeast Genetics: Functional Genomics, Proteomics, and Other Systems Analysis","author":"Lefran\u00e7ois","year":"2010"},{"key":"2026021706212602600_btag048-B23","doi-asserted-by":"publisher","first-page":"D882","DOI":"10.1093\/nar\/gkz1062","article-title":"New developments on the encyclopedia of DNA elements (ENCODE) data portal","volume":"48","author":"Luo","year":"2020","journal-title":"Nucleic Acids Res"},{"key":"2026021706212602600_btag048-B24","doi-asserted-by":"publisher","first-page":"e1003214","DOI":"10.1371\/journal.pcbi.1003214","article-title":"The next generation of 
transcription factor binding site prediction","volume":"9","author":"Mathelier","year":"2013","journal-title":"PLoS Comput Biol"},{"key":"2026021706212602600_btag048-B25","doi-asserted-by":"publisher","first-page":"1187","DOI":"10.1007\/s10955-010-0102-x","article-title":"Statistical mechanics of transcription-factor binding site discovery using Hidden Markov Models","volume":"142","author":"Mehta","year":"2011","journal-title":"J Stat Phys"},{"key":"2026021706212602600_btag048-B26","doi-asserted-by":"publisher","first-page":"1879","DOI":"10.1101\/gr.278205.123","article-title":"Characterization of human transcription factor function and patterns of gene regulation in HepG2 cells","volume":"33","author":"Moyers","year":"2023","journal-title":"Genome Res"},{"key":"2026021706212602600_btag048-B27","doi-asserted-by":"publisher","first-page":"e1005403","DOI":"10.1371\/journal.pcbi.1005403","article-title":"Imputation for transcription factor binding predictions based on deep learning","volume":"13","author":"Qin","year":"2017","journal-title":"PLoS Comput Biol"},{"key":"2026021706212602600_btag048-B28","doi-asserted-by":"publisher","first-page":"e107","DOI":"10.1093\/nar\/gkw226","article-title":"DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences","volume":"44","author":"Quang","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2026021706212602600_btag048-B29","doi-asserted-by":"publisher","first-page":"40","DOI":"10.1016\/j.ymeth.2019.03.020","article-title":"FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data","volume":"166","author":"Quang","year":"2019","journal-title":"Methods"},{"key":"2026021706212602600_btag048-B30","doi-asserted-by":"publisher","first-page":"217","DOI":"10.1134\/S0006297912030017","article-title":"Cys2His2 zinc finger protein family: classification, functions, and major 
members","volume":"77","author":"Razin","year":"2012","journal-title":"Biochemistry (Mosc)"},{"key":"2026021706212602600_btag048-B31","doi-asserted-by":"publisher","first-page":"804","DOI":"10.1101\/gad.1775509","article-title":"Nucleosome-binding affinity as a primary determinant of the nuclear mobility of the pioneer transcription factor FoxA","volume":"23","author":"Sekiya","year":"2009","journal-title":"Genes Dev"},{"key":"2026021706212602600_btag048-B32","doi-asserted-by":"publisher","first-page":"15270","DOI":"10.1038\/s41598-018-33321-1","article-title":"Recurrent neural network for predicting transcription factor binding sites","volume":"8","author":"Shen","year":"2018","journal-title":"Sci Rep"},{"key":"2026021706212602600_btag048-B33","doi-asserted-by":"publisher","first-page":"252","DOI":"10.1038\/nrg2538","article-title":"A census of human transcription factors: function, expression and evolution","volume":"10","author":"Vaquerizas","year":"2009","journal-title":"Nat Rev Genet"},{"key":"2026021706212602600_btag048-B34","doi-asserted-by":"publisher","first-page":"1067","DOI":"10.1016\/j.molcel.2017.11.026","article-title":"AP-1 transcription factors and the BAF complex mediate Signal-Dependent enhancer selection","volume":"68","author":"Vierbuchen","year":"2017","journal-title":"Mol Cell"},{"key":"2026021706212602600_btag048-B35","doi-asserted-by":"publisher","first-page":"7326","DOI":"10.1093\/nar\/gkac531","article-title":"SETDB1 acts as a topological accessory to Cohesin via an H3K9me3-independent, genomic shunt for regulating cell fates","volume":"50","author":"Warrier","year":"2022","journal-title":"Nucleic Acids Res"},{"key":"2026021706212602600_btag048-B36","doi-asserted-by":"publisher","first-page":"2227","DOI":"10.1101\/gad.176826.111","article-title":"Pioneer transcription factors: establishing competence for gene expression","volume":"25","author":"Zaret","year":"2011","journal-title":"Genes 
Dev"},{"key":"2026021706212602600_btag048-B37","doi-asserted-by":"publisher","first-page":"1184","DOI":"10.1109\/TCBB.2018.2819660","article-title":"High-Order convolutional neural network architecture for predicting DNA-protein binding sites","volume":"16","author":"Zhang","year":"2019","journal-title":"IEEE\/ACM Trans Comput Biol Bioinform"},{"key":"2026021706212602600_btag048-B38","doi-asserted-by":"publisher","first-page":"e1009941","DOI":"10.1371\/journal.pcbi.1009941","article-title":"Base-resolution prediction of transcription factor binding signals by a deep learning framework","volume":"18","author":"Zhang","year":"2022","journal-title":"PLoS Comput Biol"},{"key":"2026021706212602600_btag048-B39","doi-asserted-by":"publisher","author":"Zhang","year":"2022","DOI":"10.1101\/2022.05.02.490240"},{"key":"2026021706212602600_btag048-B40","doi-asserted-by":"publisher","first-page":"4636","DOI":"10.1093\/bioinformatics\/btac572","article-title":"MMGraph: a multiple motif predictor based on graph neural network and coexisting probability for ATAC-seq data","volume":"38","author":"Zhang","year":"2022","journal-title":"Bioinformatics"},{"key":"2026021706212602600_btag048-B41","doi-asserted-by":"publisher","first-page":"bbab273","DOI":"10.1093\/bib\/bbab273","article-title":"High-resolution transcription factor binding sites prediction improved performance and interpretability by deep learning method","volume":"22","author":"Zhang","year":"2021","journal-title":"Brief Bioinform"},{"key":"2026021706212602600_btag048-B42","doi-asserted-by":"publisher","first-page":"931","DOI":"10.1038\/nmeth.3547","article-title":"Predicting effects of noncoding variants with deep learning\u2013based sequence model","volume":"12","author":"Zhou","year":"2015","journal-title":"Nat Methods"},{"key":"2026021706212602600_btag048-B43","doi-asserted-by":"publisher","first-page":"4654","DOI":"10.1073\/pnas.1422023112","article-title":"Quantitative modeling of transcription factor binding 
specificities using DNA shape","volume":"112","author":"Zhou","year":"2015","journal-title":"Proc Natl Acad Sci USA"},{"key":"2026021706212602600_btag048-B44","doi-asserted-by":"publisher","first-page":"4322","DOI":"10.1021\/acs.jcim.3c02088","article-title":"MulTFBS: a spatial-temporal network with multichannels for predicting transcription factor binding sites","volume":"64","author":"Zhuang","year":"2024","journal-title":"J Chem Inf Model"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btag048\/66602862\/btag048.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/42\/2\/btag048\/66602862\/btag048.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/42\/2\/btag048\/66602862\/btag048.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,17]],"date-time":"2026-02-17T11:21:34Z","timestamp":1771327294000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btag048\/8442895"}},"subtitle":[],"editor":[{"given":"Pier Luigi","family":"Martelli","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2026,1,27]]},"references-count":44,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,2,3]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btag048","relation":{},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2026,2]]},"published":{"date-parts":[[2026,1,27]]},"article-number":"btag048"}}