{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,9]],"date-time":"2026-04-09T22:15:07Z","timestamp":1775772907660,"version":"3.50.1"},"reference-count":22,"publisher":"Oxford University Press (OUP)","issue":"5","license":[{"start":{"date-parts":[[2019,10,17]],"date-time":"2019-10-17T00:00:00Z","timestamp":1571270400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,9,25]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>The generation and systematic collection of genome-wide data is ever-increasing. This vast amount of data has enabled researchers to study relations between a variety of genomic and epigenomic features, including genetic variation, gene regulation and phenotypic traits. Such relations are typically investigated by comparatively assessing genomic co-occurrence. Technically, this corresponds to assessing the similarity of pairs of genome-wide binary vectors. A variety of similarity measures have been proposed for this problem in other fields like ecology. However, while several of these measures have been employed for assessing genomic co-occurrence, their appropriateness for the genomic setting has never been investigated. We show that the choice of similarity measure may strongly influence results and propose two alternative modelling assumptions that can be used to guide this choice. On both simulated and real genomic data, the Jaccard index is strongly altered by dataset size and should be used with caution. 
The Forbes coefficient (fold change) and tetrachoric correlation are less influenced by dataset size, but one should be aware of increased variance for small datasets.<\/jats:p>\n                  <jats:p>All results on simulated and real data can be inspected and reproduced at https:\/\/hyperbrowser.uio.no\/sim-measure.<\/jats:p>","DOI":"10.1093\/bib\/bbz083","type":"journal-article","created":{"date-parts":[[2019,6,18]],"date-time":"2019-06-18T07:09:18Z","timestamp":1560841758000},"page":"1523-1530","source":"Crossref","is-referenced-by-count":26,"title":["Beware the Jaccard: the choice of similarity measure is important and non-trivial in genomic colocalisation analysis"],"prefix":"10.1093","volume":"21","author":[{"given":"Stefania","family":"Salvatore","sequence":"first","affiliation":[{"name":"Department of Informatics, University of Oslo, Oslo, Norway"}]},{"given":"Knut","family":"Dagestad Rand","sequence":"additional","affiliation":[{"name":"Department of Mathematics, University of Oslo, Oslo, Norway"}]},{"given":"Ivar","family":"Grytten","sequence":"additional","affiliation":[{"name":"Department of Informatics, University of Oslo, Oslo, Norway"}]},{"given":"Egil","family":"Ferkingstad","sequence":"additional","affiliation":[{"name":"Science Institute, University of Iceland, Reykjavik, Iceland"}]},{"given":"Diana","family":"Domanska","sequence":"additional","affiliation":[{"name":"Department of Informatics, University of Oslo, Oslo, Norway"}]},{"given":"Lars","family":"Holden","sequence":"additional","affiliation":[{"name":"Statistics For Innovation, Norwegian Computing Center, Oslo, Norway"}]},{"given":"Marius","family":"Gheorghe","sequence":"additional","affiliation":[{"name":"Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, Oslo, Norway"}]},{"given":"Anthony","family":"Mathelier","sequence":"additional","affiliation":[{"name":"Centre for Molecular Medicine Norway 
(NCMM), Nordic EMBL Partnership, University of Oslo, Oslo, Norway"},{"name":"Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, Oslo, Norway"}]},{"given":"Ingrid","family":"Glad","sequence":"additional","affiliation":[{"name":"Department of Mathematics, University of Oslo, Oslo, Norway"}]},{"given":"Geir","family":"Kjetil Sandve","sequence":"additional","affiliation":[{"name":"Department of Informatics, University of Oslo, Oslo, Norway"}]}],"member":"286","published-online":{"date-parts":[[2019,10,17]]},"reference":[{"issue":"6","key":"2021031107443216900_ref1","doi-asserted-by":"crossref","first-page":"996","DOI":"10.1101\/gr.229102","article-title":"The human genome browser at ucsc","volume":"12","author":"Kent","year":"2002","journal-title":"Genome Res"},{"key":"2021031107443216900_ref2","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1038\/nature11247","article-title":"An integrated encyclopedia of dna elements in the human genome","volume":"489","author":"The ENCODE Project Consortium","year":"2012","journal-title":"Nature"},{"key":"2021031107443216900_ref3","doi-asserted-by":"crossref","first-page":"1045","DOI":"10.1038\/nbt1010-1045","article-title":"The nih roadmap epigenomics mapping consortium","volume":"28","author":"Bernstein","year":"2010","journal-title":"Nat Biotechnol"},{"key":"2021031107443216900_ref4","doi-asserted-by":"crossref","first-page":"580","DOI":"10.1038\/ng.2653","article-title":"The genotype-tissue expression (gtex) project","volume":"45","author":"Lonsdale","year":"2013","journal-title":"Nat Genet"},{"key":"2021031107443216900_ref5","first-page":"547","article-title":"\u00c9tude comparative de la distribution florale dans une portion des alpes et des jura","volume":"37","author":"Jaccard","year":"1901","journal-title":"Bull Soc Vaudoise Sci Nat"},{"key":"2021031107443216900_ref6","article-title":"On the Local Distribution of Certain Illinois Fishes: An Essay in Statistical 
Ecology","author":"Forbes","year":"1907"},{"issue":"2","key":"2021031107443216900_ref7","doi-asserted-by":"crossref","first-page":"148","DOI":"10.1111\/j.1461-0248.2004.00707.x","article-title":"A new statistical approach for assessing similarity of species composition with incidence and abundance data","volume":"8","author":"Chao","year":"2005","journal-title":"Ecol Lett"},{"issue":"3","key":"2021031107443216900_ref8","doi-asserted-by":"publisher","first-page":"296","DOI":"10.1007\/BF00344966","article-title":"Similarity indices, sample size and diversity","volume":"50","author":"Wolda","year":"1981","journal-title":"Oecologia"},{"issue":"1","key":"2021031107443216900_ref9","doi-asserted-by":"publisher","first-page":"11.12.1","DOI":"10.1002\/0471250953.bi1112s47","article-title":"Bedtools: the swissarmy tool for genome feature analysis","volume":"47","author":"Quinlan","year":"2014","journal-title":"Curr Protoc Bioinformatics"},{"key":"2021031107443216900_ref10","author":"Quinlan"},{"issue":"7","key":"2021031107443216900_ref11","doi-asserted-by":"publisher","first-page":"412","DOI":"10.1186\/s13059-014-0412-4","article-title":"Non-targeted transcription factors motifs are a systemic component of chip-seq datasets","volume":"15","author":"Worsley Hunt","year":"2014","journal-title":"Genome Biol"},{"issue":"9","key":"2021031107443216900_ref12","doi-asserted-by":"crossref","first-page":"913","DOI":"10.1080\/00029890.1975.11993976","article-title":"On the determination of the bivariate normal distribution from distributions of linear combinations of the variables","volume":"82","author":"Hamedani","year":"1975","journal-title":"Am Math Mon"},{"key":"2021031107443216900_ref13","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1098\/rsta.1900.0022","article-title":"I. mathematical contributions to the theory of evolution. \u2013vii. 
on the correlation of characters not quantitatively measurable","volume-title":"Philos Trans R Soc Lond A Math Phys Eng Sci","author":"Pearson","year":"1900"},{"key":"2021031107443216900_ref14","article-title":"R: A Language and Environment for Statistical Computing","volume-title":"R Foundation for Statistical Computing","author":"R Core Team","year":"2016"},{"key":"2021031107443216900_ref15","volume-title":"Polycor: polychoric and polyserial correlations. R package version 0.7\u20138","author":"Fox","year":"2010"},{"issue":"D1","key":"2021031107443216900_ref16","doi-asserted-by":"publisher","first-page":"D267","DOI":"10.1093\/nar\/gkx1092","article-title":"Remap 2018: an updated atlas of regulatory regions from an integrative analysis of dna-binding chip-seq experiments","volume":"46","author":"Gheorghe","year":"2018","journal-title":"Nucleic Acids Res"},{"issue":"3","key":"2021031107443216900_ref17","doi-asserted-by":"crossref","first-page":"297","DOI":"10.2307\/1932409","article-title":"Measures of the amount of ecologic association between species","volume":"26","author":"Dice","year":"1945","journal-title":"Ecology"},{"key":"2021031107443216900_ref18","first-page":"1","article-title":"A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on danish commons","volume":"5","author":"S\u00f8rensen","year":"1948","journal-title":"Biol Skr"},{"key":"2021031107443216900_ref19","first-page":"253","article-title":"Mathematical contributions to the theory of evolution. iii. 
Regression, heredity, and panmixia","volume-title":"Philos Trans R Soc Lond Ser A Cont Pap Math Phys Character","author":"Pearson","year":"1896"},{"key":"2021031107443216900_ref20","doi-asserted-by":"crossref","first-page":"1118","DOI":"10.1038\/ng.717","article-title":"Genome-wide meta-analysis increases to 71 the number of confirmed crohn\u2019s disease susceptibility loci","volume":"42","author":"Franke","year":"2010","journal-title":"Nat Genet"},{"issue":"6099","key":"2021031107443216900_ref21","doi-asserted-by":"publisher","first-page":"1190","DOI":"10.1126\/science.1222794","article-title":"Systematic localization of common disease-associated variation in regulatory dna","volume":"337","author":"Maurano","journal-title":"Science (New York, NY)"},{"key":"2021031107443216900_ref22","first-page":"1615","article-title":"Colocalization analyses of genomic elements: approaches, recommendations and challenges","volume-title":"Bioinformatics","author":"Kanduri","year":"2019"}],"container-title":["Briefings in 
Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bib\/article-pdf\/21\/5\/1523\/36529358\/bbz083.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/bib\/article-pdf\/21\/5\/1523\/36529358\/bbz083.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,3,11]],"date-time":"2021-03-11T04:50:23Z","timestamp":1615438223000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bib\/article\/21\/5\/1523\/5586919"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,10,17]]},"references-count":22,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2019,10,17]]},"published-print":{"date-parts":[[2020,9,25]]}},"URL":"https:\/\/doi.org\/10.1093\/bib\/bbz083","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/479253","asserted-by":"object"}]},"ISSN":["1467-5463","1477-4054"],"issn-type":[{"value":"1467-5463","type":"print"},{"value":"1477-4054","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,9]]},"published":{"date-parts":[[2019,10,17]]}}}