{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T13:21:09Z","timestamp":1740144069736,"version":"3.37.3"},"reference-count":43,"publisher":"Oxford University Press (OUP)","issue":"5","license":[{"start":{"date-parts":[[2019,11,5]],"date-time":"2019-11-05T00:00:00Z","timestamp":1572912000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"HKSAR General Research Fund","award":["14170217"],"award-info":[{"award-number":["14170217"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,9,25]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Many DNA-binding proteins interact with partner proteins. Recently, based on the high-throughput consecutive affinity-purification systematic evolution of ligands by exponential enrichment (CAP-SELEX) method, many such protein pairs have been found to bind DNA with flexible spacing between their individual binding motifs. Most existing motif representations were not designed to capture such flexibly spaced regions. In order to computationally discover more co-binding events without prior knowledge about the identities of the co-binding proteins, a new representation is needed. We propose a new class of sequence patterns that flexibly model such variable regions and corresponding algorithms that identify co-bound sequences using these patterns. Based on both simulated and CAP-SELEX data, features derived from our sequence patterns lead to better classification performance than patterns that do not explicitly model the variable regions. We also show that even for standard ChIP-seq data, this new class of sequence patterns can help discover co-bound events in a subset of sequences in an unsupervised manner. The open-source software is available at https:\/\/github.com\/kevingroup\/glk-SVM.<\/jats:p>","DOI":"10.1093\/bib\/bbz101","type":"journal-article","created":{"date-parts":[[2019,7,17]],"date-time":"2019-07-17T11:21:32Z","timestamp":1563362492000},"page":"1787-1797","source":"Crossref","is-referenced-by-count":1,"title":["Flexible k-mers with variable-length indels for identifying binding sequences of protein dimers"],"prefix":"10.1093","volume":"21","author":[{"given":"Chenyang","family":"Hong","sequence":"first","affiliation":[{"name":"Department of Computer Science and Engineering at The Chinese University of Hong Kong"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5516-9944","authenticated-orcid":false,"given":"Kevin Y","family":"Yip","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering at The Chinese University of Hong Kong"}]}],"member":"286","published-online":{"date-parts":[[2019,11,5]]},"reference":[{"issue":"9","key":"2021031107441509700_ref1","doi-asserted-by":"crossref","first-page":"2997","DOI":"10.1093\/nar\/10.9.2997","article-title":"Use of the \u2018perceptron\u2019 algorithm to distinguish translational initiation sites in E. coli","volume":"10","author":"Stormo","year":"1982","journal-title":"Nucleic Acids Res"},{"issue":"5","key":"2021031107441509700_ref2","first-page":"499","article-title":"A weight array method for splicing signal analysis","volume":"9","author":"Zhang","year":"1993","journal-title":"Comput Appl Biosci"},{"key":"2021031107441509700_ref3","doi-asserted-by":"crossref","first-page":"S100","DOI":"10.1093\/bioinformatics\/18.suppl_2.S100","article-title":"Identifying transcription factor binding sites through markov chain optimization","volume":"18","author":"Ellrott","year":"2002","journal-title":"Bioinformatics"},{"key":"2021031107441509700_ref4","doi-asserted-by":"crossref","first-page":"2290","DOI":"10.1093\/nar\/gki519","article-title":"Computational technique for improvement of the position-weight matrices for the DNA\/protein binding sites","volume":"33","author":"Gershenzon","year":"2005","journal-title":"Nucleic Acids Res"},{"key":"2021031107441509700_ref5","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pone.0009722","article-title":"Dinucleotide weight matrices for predicting transcription factor binding sites: generalizing the position weight matrix","volume":"5","author":"Siddharthan","year":"2010","journal-title":"PLoS One"},{"key":"2021031107441509700_ref6","doi-asserted-by":"crossref","first-page":"933","DOI":"10.1093\/bioinformatics\/btm055","article-title":"Position dependencies in transcription factor binding sites","volume":"23","author":"Tomovic","year":"2007","journal-title":"Bioinformatics"},{"key":"2021031107441509700_ref7","first-page":"564","article-title":"The spectrum kernel: a string kernel for SVM protein classification","author":"Leslie","year":"2001","journal-title":"Pac Symp Biocomput"},{"key":"2021031107441509700_ref8","doi-asserted-by":"crossref","first-page":"467","DOI":"10.1093\/bioinformatics\/btg431","article-title":"Mismatch string kernels for discriminative protein classification","volume":"20","author":"Leslie","year":"2004","journal-title":"Bioinformatics"},{"key":"2021031107441509700_ref9","first-page":"1435","article-title":"Fast string kernels using inexact matching for protein sequences","volume":"5","author":"Leslie","year":"2004","journal-title":"J Mach Learn Res"},{"issue":"1","key":"2021031107441509700_ref10","doi-asserted-by":"crossref","first-page":"195","DOI":"10.1016\/0022-2836(81)90087-5","article-title":"Identification of common molecular subsequences","volume":"147","author":"Smith","year":"1981","journal-title":"J Mol Biol"},{"issue":"6","key":"2021031107441509700_ref11","doi-asserted-by":"crossref","first-page":"857","DOI":"10.1089\/106652703322756113","article-title":"Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships","volume":"10","author":"Liao","year":"2003","journal-title":"J Comput Biol"},{"key":"2021031107441509700_ref12","doi-asserted-by":"crossref","first-page":"2167","DOI":"10.1101\/gr.121905.111","article-title":"Discriminative prediction of mammalian enhancers from DNA sequences","volume":"21","author":"Lee","year":"2011","journal-title":"Genome Res"},{"key":"2021031107441509700_ref13","doi-asserted-by":"crossref","first-page":"W544","DOI":"10.1093\/nar\/gkt519","article-title":"kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets","volume":"41","author":"Fletez-Brant","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2021031107441509700_ref14","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pcbi.1003711","article-title":"Enhanced regulatory sequence prediction using gapped k-mer features","volume":"10","author":"Ghandi","year":"2014","journal-title":"PLoS Comput Biol"},{"key":"2021031107441509700_ref15","doi-asserted-by":"crossref","first-page":"2196","DOI":"10.1093\/bioinformatics\/btw142","article-title":"LS-GKM: a new gkm-SVM for large-scale datasets","volume":"32","author":"Lee","year":"2016","journal-title":"Bioinformatics"},{"key":"2021031107441509700_ref16","doi-asserted-by":"crossref","first-page":"744","DOI":"10.1016\/j.cell.2010.01.044","article-title":"An atlas of combinatorial transcriptional regulation in mouse and man","volume":"140","author":"Ravasi","year":"2010","journal-title":"Cell"},{"key":"2021031107441509700_ref17","doi-asserted-by":"crossref","first-page":"833","DOI":"10.1006\/jmbi.2000.3614","article-title":"Experimental analysis and computer prediction of CTF\/NFI transcription factor DNA binding sites","volume":"297","author":"Roulet","year":"2000","journal-title":"J Mol Biol"},{"key":"2021031107441509700_ref18","doi-asserted-by":"crossref","first-page":"2099","DOI":"10.1093\/nar\/gkt1112","article-title":"Protein\u2013DNA binding: complexities and multi-protein codes","volume":"42","author":"Siggers","year":"2014","journal-title":"Nucleic Acids Res"},{"issue":"5","key":"2021031107441509700_ref19","doi-asserted-by":"crossref","first-page":"402","DOI":"10.1038\/nrm2395","article-title":"Transcriptional control of human p53-regulated genes","volume":"9","author":"Riley","year":"2008","journal-title":"Nat Rev Mol Cell Biol"},{"issue":"8","key":"2021031107441509700_ref20","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pcbi.1002638","article-title":"High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints","volume":"8","author":"Guo","year":"2012","journal-title":"PLoS Comput Biol"},{"issue":"7578","key":"2021031107441509700_ref21","doi-asserted-by":"crossref","first-page":"384","DOI":"10.1038\/nature15518","article-title":"DNA-dependent formation of transcription factor pairs alters their binding specificity","volume":"527","author":"Jolma","year":"2015","journal-title":"Nature"},{"issue":"8","key":"2021031107441509700_ref22","doi-asserted-by":"crossref","first-page":"1307","DOI":"10.1101\/gr.154922.113","article-title":"Comprehensive prediction in 78 human cell lines reveals rigidity and compactness of transcription factor dimers","volume":"23","author":"Jankowski","year":"2013","journal-title":"Genome Res"},{"issue":"5","key":"2021031107441509700_ref23","doi-asserted-by":"crossref","first-page":"S2","DOI":"10.1186\/1752-0509-8-S5-S2","article-title":"Identifying cooperative transcription factors in yeast using multiple data sources","volume":"8","author":"Lai","year":"2014","journal-title":"BMC Syst Biol"},{"issue":"8","key":"2021031107441509700_ref24","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pcbi.1000154","article-title":"A feature-based approach to modeling protein\u2013dna interactions","volume":"4","author":"Sharon","year":"2008","journal-title":"PLoS Comput Biol"},{"issue":"14","key":"2021031107441509700_ref25","doi-asserted-by":"crossref","first-page":"3082","DOI":"10.1093\/bioinformatics\/bti477","article-title":"A multiple-feature framework for modelling and predicting transcription factor binding sites","volume":"21","author":"Pudimat","year":"2005","journal-title":"Bioinformatics"},{"issue":"1","key":"2021031107441509700_ref26","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1016\/S0169-2607(01)00198-5","article-title":"Finding subtle motifs with variable gaps in unaligned DNA sequences","volume":"70","author":"Hu","year":"2003","journal-title":"Comput Methods Programs Biomed"},{"issue":"13","key":"2021031107441509700_ref27","doi-asserted-by":"crossref","first-page":"5832","DOI":"10.1093\/nar\/gks206","article-title":"Efficient motif search in ranked lists and applications to variable gap motifs","volume":"40","author":"Leibovich","year":"2012","journal-title":"Nucleic Acids Res"},{"issue":"5","key":"2021031107441509700_ref28","doi-asserted-by":"crossref","first-page":"e1000071","DOI":"10.1371\/journal.pcbi.1000071","article-title":"Discovering sequence motifs with arbitrary insertions and deletions","volume":"4","author":"Frith","year":"2008","journal-title":"PLoS Comput Biol"},{"key":"2021031107441509700_ref29","first-page":"29","volume-title":"8th Int. Workshop on Data Mining in Bioinformatics","author":"Kuksa","year":"2008"},{"key":"2021031107441509700_ref30","doi-asserted-by":"crossref","first-page":"381","DOI":"10.1145\/2147805.2147855","article-title":"Kernel methods for calmodulin binding and binding site prediction","volume-title":"Proceedings of the 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine","author":"Hamilton","year":"2011"},{"key":"2021031107441509700_ref31","first-page":"1323","article-title":"Efficient computation of gapped substring kernels on large alphabets","volume":"6","author":"Rousu","year":"2005","journal-title":"J Mach Learn Res"},{"key":"2021031107441509700_ref32","doi-asserted-by":"crossref","first-page":"2205","DOI":"10.1093\/bioinformatics\/btw203","article-title":"GkmSVM: an R package for gapped-kmer SVM","volume":"32","author":"Ghandi","year":"2016","journal-title":"Bioinformatics"},{"issue":"15","key":"2021031107441509700_ref33","doi-asserted-by":"crossref","first-page":"2574","DOI":"10.1093\/bioinformatics\/btv176","article-title":"Kebabs: an r package for kernel-based analysis of biological sequences","volume":"31","author":"Palme","year":"2015","journal-title":"Bioinformatics"},{"key":"2021031107441509700_ref34","doi-asserted-by":"crossref","first-page":"D110","DOI":"10.1093\/nar\/gkv1176","article-title":"JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles","volume":"44","author":"Mathelier","year":"2016","journal-title":"Nucleic Acids Res"},{"issue":"7414","key":"2021031107441509700_ref35","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1038\/nature11247","article-title":"An integrated encyclopedia of DNA elements in the human genome","volume":"489","author":"The ENCODE Project Consortium","year":"2012","journal-title":"Nature"},{"key":"2021031107441509700_ref36","doi-asserted-by":"crossref","first-page":"R48","DOI":"10.1186\/gb-2012-13-9-r48","article-title":"Classification of human genomic regions based on experimentally-determined binding sites of more than 100 transcription-related factors","volume":"13","author":"Yip","year":"2012","journal-title":"Genome Biol"},{"issue":"8","key":"2021031107441509700_ref37","doi-asserted-by":"crossref","first-page":"831","DOI":"10.1038\/nbt.3300","article-title":"Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning","volume":"33","author":"Alipanahi","year":"2015","journal-title":"Nat Biotechnol"},{"issue":"Oct","key":"2021031107441509700_ref38","first-page":"2825","article-title":"Scikit-learn: machine learning in python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J Mach Learn Res"},{"issue":"4","key":"2021031107441509700_ref39","doi-asserted-by":"crossref","first-page":"576","DOI":"10.1016\/j.molcel.2010.05.004","article-title":"Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and b cell identities","volume":"38","author":"Heinz","year":"2010","journal-title":"Mol Cell"},{"issue":"10","key":"2021031107441509700_ref40","doi-asserted-by":"crossref","first-page":"1555","DOI":"10.1093\/bioinformatics\/btw024","article-title":"Tfbstools: an r\/bioconductor package for transcription factor binding site analysis","volume":"32","author":"Tan","year":"2016","journal-title":"Bioinformatics"},{"key":"2021031107441509700_ref41","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pone.0083737","article-title":"Structure of the catalytic domain of EZH2 reveals conformational plasticity in cofactor and substrate binding sites and explains oncogenic mutations","volume":"8","author":"Wu","year":"2013","journal-title":"PLoS One"},{"key":"2021031107441509700_ref42","doi-asserted-by":"crossref","DOI":"10.1038\/ncomms12514","article-title":"reChIP-seq reveals widespread bivalency of H3K4me3 and H3K27me3 in CD4+ memory T cells","volume":"7","author":"Kinkley","year":"2016","journal-title":"Nat Commun"},{"issue":"4","key":"2021031107441509700_ref43","doi-asserted-by":"crossref","first-page":"276","DOI":"10.1038\/nrg1315","article-title":"Applied bioinformatics for the identification of regulatory elements","volume":"5","author":"Wasserman","year":"2004","journal-title":"Nat Rev Genet"}],"container-title":["Briefings in Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bib\/article-pdf\/21\/5\/1787\/36529377\/bbz101.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/bib\/article-pdf\/21\/5\/1787\/36529377\/bbz101.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,3,11]],"date-time":"2021-03-11T09:54:32Z","timestamp":1615456472000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bib\/article\/21\/5\/1787\/5612164"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,11,5]]},"references-count":43,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2019,11,5]]},"published-print":{"date-parts":[[2020,9,25]]}},"URL":"https:\/\/doi.org\/10.1093\/bib\/bbz101","relation":{},"ISSN":["1467-5463","1477-4054"],"issn-type":[{"type":"print","value":"1467-5463"},{"type":"electronic","value":"1477-4054"}],"subject":[],"published-other":{"date-parts":[[2020,9]]},"published":{"date-parts":[[2019,11,5]]}}}