{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:06Z","timestamp":1772138046241,"version":"3.50.1"},"reference-count":36,"publisher":"Oxford University Press (OUP)","issue":"23","license":[{"start":{"date-parts":[[2021,6,22]],"date-time":"2021-06-22T00:00:00Z","timestamp":1624320000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["IIS-1955151"],"award-info":[{"award-number":["IIS-1955151"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["OAC-1934600"],"award-info":[{"award-number":["OAC-1934600"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["GM128636"],"award-info":[{"award-number":["GM128636"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["UL1TR003015"],"award-info":[{"award-number":["UL1TR003015"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,12,7]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Genomic region sets summarize functional genomics data and define locations of interest in the genome such as regulatory regions or transcription factor binding 
sites. The number of publicly available region sets has increased dramatically, leading to challenges in data analysis.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We propose a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. We compared our approach to two simpler methods based on interval unions or term frequency-inverse document frequency and evaluated the methods in three ways: First, by classifying the cell line, antibody or tissue type of the region set; second, by assessing whether similarity among embeddings can reflect simulated random perturbations of genomic regions; and third, by testing robustness of the proposed representations to different signal thresholds for calling peaks. Our word2vec-based region set embeddings reduce dimensionality from more than a hundred thousand to 100 without significant loss in classification performance. The vector representation could identify cell line, antibody and tissue type with over 90% accuracy. We also found that the vectors could quantitatively summarize simulated random perturbations to region sets and are more robust to subsampling the data derived from different peak calling thresholds. Our evaluations demonstrate that the vectors retain useful biological information in relatively lower-dimensional spaces. 
We propose that vector representation of region sets is a promising approach for efficient analysis of genomic region data.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>https:\/\/github.com\/databio\/regionset-embedding.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btab439","type":"journal-article","created":{"date-parts":[[2021,6,15]],"date-time":"2021-06-15T15:14:17Z","timestamp":1623770057000},"page":"4299-4306","source":"Crossref","is-referenced-by-count":24,"title":["Embeddings of genomic region sets capture rich biological associations in lower dimensions"],"prefix":"10.1093","volume":"37","author":[{"given":"Erfaneh","family":"Gharavi","sequence":"first","affiliation":[{"name":"Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22903, USA"},{"name":"School of Data Science, University of Virginia, Charlottesville, VA 22903, USA"}]},{"given":"Aaron","family":"Gu","sequence":"additional","affiliation":[{"name":"Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22903, USA"},{"name":"Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA"}]},{"given":"Guangtao","family":"Zheng","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA"}]},{"given":"Jason P","family":"Smith","sequence":"additional","affiliation":[{"name":"Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22903, USA"},{"name":"Department of Biochemistry and Molecular Genetics, University of Virginia, 
Charlottesville, VA 22903, USA"}]},{"given":"Hyun Jae","family":"Cho","sequence":"additional","affiliation":[{"name":"Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22903, USA"},{"name":"Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA"}]},{"given":"Aidong","family":"Zhang","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA"}]},{"given":"Donald E","family":"Brown","sequence":"additional","affiliation":[{"name":"School of Data Science, University of Virginia, Charlottesville, VA 22903, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5643-4068","authenticated-orcid":false,"given":"Nathan C","family":"Sheffield","sequence":"additional","affiliation":[{"name":"Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22903, USA"},{"name":"School of Data Science, University of Virginia, Charlottesville, VA 22903, USA"},{"name":"Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22903, USA"},{"name":"Department of Public Health Sciences, University of Virginia, Charlottesville, VA 22903, USA"},{"name":"Department of Biomedical Engineering, University of Virginia, Charlottesville, VA 22903, USA"}]}],"member":"286","published-online":{"date-parts":[[2021,6,22]]},"reference":[{"key":"2023061402422571700_btab439-B1","doi-asserted-by":"crossref","first-page":"433","DOI":"10.1002\/wics.101","article-title":"Principal component analysis","volume":"2","author":"Abdi","year":"2010","journal-title":"Wiley Interdiscip. Rev. Comput. Stat"},{"key":"2023061402422571700_btab439-B2","doi-asserted-by":"crossref","first-page":"38","DOI":"10.1038\/nbt.4314","article-title":"Dimensionality reduction for visualizing single-cell data using UMAP","volume":"37","author":"Becht","year":"2019","journal-title":"Nat. 
Biotechnol"},{"key":"2023061402422571700_btab439-B3","doi-asserted-by":"crossref","first-page":"1798","DOI":"10.1109\/TPAMI.2013.50","article-title":"Representation learning: a review and new perspectives","volume":"35","author":"Bengio","year":"2013","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell"},{"key":"2023061402422571700_btab439-B4","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1186\/1748-7188-5-21","article-title":"Sequence embedding for fast construction of guide trees for multiple sequence alignment","volume":"5","author":"Blackshields","year":"2010","journal-title":"Algorithms Mol. Biol"},{"key":"2023061402422571700_btab439-B5","doi-asserted-by":"crossref","first-page":"1213","DOI":"10.1038\/nmeth.2688","article-title":"Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position","volume":"10","author":"Buenrostro","year":"2013","journal-title":"Nat. Methods"},{"key":"2023061402422571700_btab439-B6","doi-asserted-by":"crossref","first-page":"241","DOI":"10.1186\/s13059-019-1854-5","article-title":"Assessment of computational methods for the analysis of single-cell ATAC-seq data","volume":"20","author":"Chen","year":"2019","journal-title":"Genome Biol"},{"key":"2023061402422571700_btab439-B7","doi-asserted-by":"crossref","first-page":"1193","DOI":"10.1038\/ng.3646","article-title":"Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution","volume":"48","author":"Corces","year":"2016","journal-title":"Nat. 
Genet"},{"key":"2023061402422571700_btab439-B8","doi-asserted-by":"crossref","first-page":"3575","DOI":"10.1093\/bioinformatics\/btx480","article-title":"Sequence2vec: a novel embedding approach for modeling transcription factor binding affinity landscape","volume":"33","author":"Dai","year":"2017","journal-title":"Bioinformatics"},{"key":"2023061402422571700_btab439-B9","doi-asserted-by":"crossref","first-page":"3323","DOI":"10.1093\/bioinformatics\/btx414","article-title":"Epigenomic annotation-based interpretation of genomic data: from enrichment analysis to machine learning","volume":"33","author":"Dozmorov","year":"2017","journal-title":"Bioinformatics"},{"key":"2023061402422571700_btab439-B10","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1186\/s12864-018-5370-x","article-title":"Gene2vec: distributed representation of genes based on co-expression","volume":"20","author":"Du","year":"2019","journal-title":"BMC Genomics"},{"key":"2023061402422571700_btab439-B11","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1038\/nature11247","article-title":"An integrated encyclopedia of DNA elements in the human genome","volume":"489","author":"Dunham","year":"2012","journal-title":"Nature"},{"key":"2023061402422571700_btab439-B12","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1038\/s41586-019-1049-y","article-title":"Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH","volume":"568","author":"Eng","year":"2019","journal-title":"Nature"},{"key":"2023061402422571700_btab439-B13","doi-asserted-by":"crossref","first-page":"840","DOI":"10.1038\/nrg3306","article-title":"ChIP-seq and beyond: new and improved methodologies to detect and characterize protein\u2013DNA interactions","volume":"13","author":"Furey","year":"2012","journal-title":"Nat. Rev. 
Genet"},{"key":"2023061402422571700_btab439-B14","author":"Gu","year":"2021"},{"key":"2023061402422571700_btab439-B15","doi-asserted-by":"crossref","first-page":"2008","DOI":"10.1109\/TKDE.2018.2871031","article-title":"Next generation indexing for genomic intervals","volume":"31","author":"Jalili","year":"2019","journal-title":"IEEE Trans. Knowledge Data Eng"},{"key":"2023061402422571700_btab439-B16","doi-asserted-by":"crossref","first-page":"1615","DOI":"10.1093\/bioinformatics\/bty835","article-title":"Colocalization analyses of genomic elements: approaches, recommendations and challenges","volume":"35","author":"Kanduri","year":"2019","journal-title":"Bioinformatics"},{"key":"2023061402422571700_btab439-B17","doi-asserted-by":"crossref","first-page":"123","DOI":"10.1038\/nmeth.4556","article-title":"GIGGLE: a search engine for large-scale integrated genome analysis","volume":"15","author":"Layer","year":"2018","journal-title":"Nat. Methods"},{"key":"2023061402422571700_btab439-B18","author":"Le","year":"2014"},{"key":"2023061402422571700_btab439-B19","doi-asserted-by":"crossref","first-page":"i96","DOI":"10.1093\/bioinformatics\/bty285","article-title":"Unsupervised embedding of single-cell Hi-C data","volume":"34","author":"Liu","year":"2018","journal-title":"Bioinformatics"},{"key":"2023061402422571700_btab439-B20","doi-asserted-by":"crossref","first-page":"733","DOI":"10.1016\/j.omtn.2019.04.019","article-title":"Meta-4mCpred: a sequence-based meta-predictor for accurate DNA 4mC site prediction using effective feature representation","volume":"16","author":"Manavalan","year":"2019","journal-title":"Mol. Ther. Nucleic Acids"},{"key":"2023061402422571700_btab439-B21","doi-asserted-by":"crossref","first-page":"861","DOI":"10.21105\/joss.00861","article-title":"UMAP: uniform manifold approximation and projection","volume":"3","author":"McInnes","year":"2018","journal-title":"J. 
Open Source Softw"},{"key":"2023061402422571700_btab439-B22","first-page":"3111","author":"Mikolov","year":"2013"},{"key":"2023061402422571700_btab439-B23","doi-asserted-by":"crossref","first-page":"111","DOI":"10.15252\/embr.201846255","article-title":"ChIP-Atlas: a data-mining suite powered by full integration of public ChIP-seq data","volume":"19","author":"Oki","year":"2018","journal-title":"EMBO Rep"},{"key":"2023061402422571700_btab439-B24","first-page":"1532","author":"Pennington","year":"2014"},{"key":"2023061402422571700_btab439-B25","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1016\/0306-4573(88)90021-0","article-title":"Term-weighting approaches in automatic text retrieval","volume":"24","author":"Salton","year":"1988","journal-title":"Inf. Process. Manag"},{"key":"2023061402422571700_btab439-B26","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13059-020-01977-6","article-title":"Avocado: a multi-scale deep tensor factorization method learns a latent representation of the human epigenome","volume":"21","author":"Schreiber","year":"2020","journal-title":"Genome Biol"},{"key":"2023061402422571700_btab439-B27","doi-asserted-by":"crossref","first-page":"587","DOI":"10.1093\/bioinformatics\/btv612","article-title":"LOLA: enrichment analysis for genomic region sets and regulatory elements in R and Bioconductor","volume":"32","author":"Sheffield","year":"2016","journal-title":"Bioinformatics"},{"key":"2023061402422571700_btab439-B28","doi-asserted-by":"crossref","first-page":"e101","DOI":"10.1002\/cphg.101","article-title":"Analytical approaches for ATAC-seq data analysis","volume":"106","author":"Smith","year":"2020","journal-title":"Curr. Protoc. Hum. 
Genet"},{"key":"2023061402422571700_btab439-B29","doi-asserted-by":"crossref","first-page":"i417","DOI":"10.1093\/bioinformatics\/btaa488","article-title":"Factorized embeddings learns rich and biologically meaningful embedding spaces using factorized tensor decomposition","volume":"36","author":"Trofimov","year":"2020","journal-title":"Bioinformatics"},{"key":"2023061402422571700_btab439-B30","first-page":"281","author":"Vapnik","year":"1997"},{"key":"2023061402422571700_btab439-B31","doi-asserted-by":"crossref","first-page":"174","DOI":"10.1186\/1471-2105-13-174","article-title":"A novel hierarchical clustering algorithm for gene sequences","volume":"13","author":"Wei","year":"2012","journal-title":"BMC Bioinformatics"},{"key":"2023061402422571700_btab439-B32","doi-asserted-by":"crossref","first-page":"e1006721","DOI":"10.1371\/journal.pcbi.1006721","article-title":"16S rRNA sequence embeddings: meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses","volume":"15","author":"Woloszynek","year":"2019","journal-title":"PLoS Comput. Biol"},{"key":"2023061402422571700_btab439-B33","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41467-019-12630-7","article-title":"SCALE method for single-cell ATAC-seq analysis via latent feature extraction","volume":"10","author":"Xiong","year":"2019","journal-title":"Nat. 
Commun"},{"key":"2023061402422571700_btab439-B34","doi-asserted-by":"crossref","first-page":"2642","DOI":"10.1093\/bioinformatics\/bty178","article-title":"Learned protein embeddings for machine learning","volume":"34","author":"Yang","year":"2018","journal-title":"Bioinformatics"},{"key":"2023061402422571700_btab439-B35","first-page":"42","author":"Yang","year":"1999"},{"key":"2023061402422571700_btab439-B36","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1109\/MCI.2018.2840738","article-title":"Recent trends in deep learning based natural language processing","volume":"13","author":"Young","year":"2018","journal-title":"IEEE Comput. Intell. Mag"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btab439\/39554503\/btab439.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/23\/4299\/50579077\/btab439.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/23\/4299\/50579077\/btab439.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,4]],"date-time":"2023-11-04T18:46:05Z","timestamp":1699123565000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/23\/4299\/6307720"}},"subtitle":[],"editor":[{"given":"Pier 
Luigi","family":"Martelli","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2021,6,22]]},"references-count":36,"journal-issue":{"issue":"23","published-print":{"date-parts":[[2021,12,7]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btab439","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2021.05.07.443166","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,12,1]]},"published":{"date-parts":[[2021,6,22]]}}}