{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,22]],"date-time":"2025-11-22T11:20:56Z","timestamp":1763810456891,"version":"3.37.3"},"reference-count":43,"publisher":"Oxford University Press (OUP)","issue":"14","license":[{"start":{"date-parts":[[2021,1,30]],"date-time":"2021-01-30T00:00:00Z","timestamp":1611964800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/100000025","name":"National Institute of Mental Health","doi-asserted-by":"publisher","award":["MH106910","MH095797"],"award-info":[{"award-number":["MH106910","MH095797"]}],"id":[{"id":"10.13039\/100000025","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,8,4]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>Predicting regulatory effects of genetic variants is a challenging but important problem in functional genomics. Given the relatively low sensitivity of functional assays, and the pervasiveness of class imbalance in functional genomic data, popular statistical prediction models can sharply underestimate the probability of a regulatory effect. We describe here the presence-only model (PO-EN), a type of semisupervised model, to predict regulatory effects of genetic variants at sequence-level resolution in a context of interest by integrating a large number of epigenetic features and massively parallel reporter assays (MPRAs).<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>Using experimental data from a variety of MPRAs we show that the presence-only model produces better calibrated predicted probabilities and has increased accuracy relative to state-of-the-art prediction models. Furthermore, we show that the predictions based on pretrained PO-EN models are useful for prioritizing functional variants among candidate eQTLs and significant SNPs at GWAS loci. In particular, for the costimulatory locus, associated with multiple autoimmune diseases, we show evidence of a regulatory variant residing in an enhancer 24.4\u2009kb downstream of CTLA4, with evidence from capture Hi-C of interaction with CTLA4. Furthermore, the risk allele of the regulatory variant is on the same risk increasing haplotype as a functional coding variant in exon 1 of CTLA4, suggesting that the regulatory variant acts jointly with the coding variant leading to increased risk to disease.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>The presence-only model is implemented in the R package \u2018PO.EN\u2019, freely available on CRAN. A vignette describing a detailed demonstration of using the proposed PO-EN model can be found on github at https:\/\/github.com\/Iuliana-Ionita-Laza\/PO.EN\/<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information<\/jats:title><jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/btab040","type":"journal-article","created":{"date-parts":[[2021,1,25]],"date-time":"2021-01-25T16:04:03Z","timestamp":1611590643000},"page":"1953-1962","source":"Crossref","is-referenced-by-count":4,"title":["A semisupervised model to predict regulatory effects of genetic variants at single nucleotide resolution using massively parallel reporter assays"],"prefix":"10.1093","volume":"37","author":[{"given":"Zikun","family":"Yang","sequence":"first","affiliation":[{"name":"Department of Biostatistics, Columbia University , New York, NY 10032, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chen","family":"Wang","sequence":"additional","affiliation":[{"name":"Department of Biostatistics, Columbia University , New York, NY 10032, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Stephanie","family":"Erjavec","sequence":"additional","affiliation":[{"name":"Department of Genetics and Development, Columbia University , New York, NY 10032, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lynn","family":"Petukhova","sequence":"additional","affiliation":[{"name":"Department of Epidemiology, Columbia University , New York, NY 10032, USA"},{"name":"Department of Dermatology, Columbia University , New York, NY 10032, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Angela","family":"Christiano","sequence":"additional","affiliation":[{"name":"Department of Genetics and Development, Columbia University , New York, NY 10032, USA"},{"name":"Department of Dermatology, Columbia University , New York, NY 10032, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9001-2026","authenticated-orcid":false,"given":"Iuliana","family":"Ionita-Laza","sequence":"additional","affiliation":[{"name":"Department of Biostatistics, Columbia University , New York, NY 10032, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2021,1,30]]},"reference":[{"key":"2023061310294087300_btab040-B1","doi-asserted-by":"crossref","first-page":"1415","DOI":"10.1016\/j.cell.2016.10.042","article-title":"The allelic landscape of human blood cell trait variation and links to common complex disease","volume":"167","author":"Astle","year":"2016","journal-title":"Cell"},{"key":"2023061310294087300_btab040-B2","doi-asserted-by":"crossref","first-page":"920","DOI":"10.1016\/j.ajhg.2018.03.026","article-title":"FUN-LDA: a latent Dirichlet allocation model for predicting tissue-specific functional effects of noncoding variation: methods and applications","volume":"102","author":"Backenroth","year":"2018","journal-title":"Am. J. Hum. Genet"},{"key":"2023061310294087300_btab040-B3","doi-asserted-by":"crossref","first-page":"1045","DOI":"10.1038\/nbt1010-1045","article-title":"The NIH roadmap epigenomics mapping consortium","volume":"28","author":"Bernstein","year":"2010","journal-title":"Nat. Biotechnol"},{"key":"2023061310294087300_btab040-B4","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1038\/sj.gene.6364265","article-title":"Haplotypes in the ctla4 region are associated with coeliac disease in the Irish population","volume":"7","author":"Brophy","year":"2006","journal-title":"Genes Immun"},{"key":"2023061310294087300_btab040-B5","doi-asserted-by":"crossref","first-page":"570","DOI":"10.1073\/pnas.0610124104","article-title":"Signatures of strong population differentiation shape extended haplotypes across the human CD28, CTLA4, and ICOS costimulatory genes","volume":"104","author":"Butty","year":"2007","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2023061310294087300_btab040-B6","doi-asserted-by":"crossref","first-page":"1327","DOI":"10.1038\/s41588-018-0192-y","article-title":"Modified penetrance of coding variants by cis-regulatory variation contributes to disease risk","volume":"50","author":"Castel","year":"2018","journal-title":"Nat. Genet"},{"key":"2023061310294087300_btab040-B7","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/1007730.1007733","article-title":"Special issue on learning from imbalanced data sets","volume":"6","author":"Chawla","year":"2004","journal-title":"ACM SIGKDD Explor. Newsl"},{"key":"2023061310294087300_btab040-B8","doi-asserted-by":"crossref","first-page":"10074","DOI":"10.1038\/s41598-018-28423-9","article-title":"Ctla-4 +49 G\/A, a functional T1D risk SNP, affects CTLA-4 level in Treg subsets and IA-2A positivity, but not beta-cell function","volume":"8","author":"Chen","year":"2018","journal-title":"Sci. Rep"},{"key":"2023061310294087300_btab040-B9","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1038\/nature11247","article-title":"An integrated encyclopedia of DNA elements in the human genome","volume":"489","author":"Consortium","year":"2012","journal-title":"Nature"},{"key":"2023061310294087300_btab040-B10","doi-asserted-by":"crossref","first-page":"779","DOI":"10.1016\/j.ajhg.2013.10.012","article-title":"Beyond GWASs: illuminating the dark road from association to function","volume":"93","author":"Edwards","year":"2013","journal-title":"Am. J. Hum. Genet"},{"key":"2023061310294087300_btab040-B11","doi-asserted-by":"crossref","first-page":"bax028","DOI":"10.1093\/database\/bax028","article-title":"Genehancer: genome-wide integration of enhancers and target genes in genecards","volume":"2017","author":"Fishilevich","year":"2017","journal-title":"Database"},{"key":"2023061310294087300_btab040-B12","doi-asserted-by":"crossref","first-page":"1693","DOI":"10.1214\/14-AOS1220","article-title":"Local case-control sampling: efficient subsampling in imbalanced data sets","volume":"42","author":"Fithian","year":"2014","journal-title":"Ann. Stat"},{"key":"2023061310294087300_btab040-B13","doi-asserted-by":"crossref","first-page":"264","DOI":"10.1038\/nature09753","article-title":"9p21 DNA variants associated with coronary artery disease impair interferon-\u03b3 signalling response","volume":"470","author":"Harismendy","year":"2011","journal-title":"Nature"},{"key":"2023061310294087300_btab040-B14","doi-asserted-by":"crossref","first-page":"1263","DOI":"10.1109\/TKDE.2008.239","article-title":"Learning from imbalanced data","volume":"21","author":"He","year":"2009","journal-title":"IEEE Trans. Knowl. Data Eng"},{"key":"2023061310294087300_btab040-B15","doi-asserted-by":"crossref","first-page":"5199","DOI":"10.1038\/s41467-018-07349-w","article-title":"A semi-supervised approach for predicting cell-type specific functional consequences of non-coding variation using MPRAs","volume":"9","author":"He","year":"2018","journal-title":"Nat. Commun"},{"key":"2023061310294087300_btab040-B16","doi-asserted-by":"crossref","first-page":"38","DOI":"10.1101\/gr.212092.116","article-title":"A systematic comparison reveals substantial differences in chromosomal versus episomal encoding of enhancer activity","volume":"27","author":"Inoue","year":"2017","journal-title":"Genome Res"},{"key":"2023061310294087300_btab040-B17","doi-asserted-by":"crossref","first-page":"214","DOI":"10.1038\/ng.3477","article-title":"A spectral approach integrating functional genomic annotations for coding and noncoding variants","volume":"48","author":"Ionita-Laza","year":"2016","journal-title":"Nat. Genet"},{"key":"2023061310294087300_btab040-B18","doi-asserted-by":"crossref","first-page":"434","DOI":"10.1038\/s41586-020-2308-7","article-title":"The mutational constraint spectrum quantified from variation in 141,456 humans","volume":"581","author":"Karczewski","year":"2020","journal-title":"Nature"},{"key":"2023061310294087300_btab040-B19","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1093\/oxfordjournals.pan.a004868","article-title":"Logistic regression in rare events data","volume":"9","author":"King","year":"2001","journal-title":"Polit. Anal"},{"key":"2023061310294087300_btab040-B20","doi-asserted-by":"crossref","first-page":"99","DOI":"10.1146\/annurev-genom-083118-014845","article-title":"Massively parallel assays and quantitative sequence\u2013function relationships","volume":"20","author":"Kinney","year":"2019","journal-title":"Annu. Rev. Genomics Hum. Genet"},{"key":"2023061310294087300_btab040-B21","doi-asserted-by":"crossref","first-page":"310","DOI":"10.1038\/ng.2892","article-title":"A general framework for estimating the relative pathogenicity of human genetic variants","volume":"46","author":"Kircher","year":"2014","journal-title":"Nat. Genet"},{"key":"2023061310294087300_btab040-B22","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41467-019-11526-w","article-title":"Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution","volume":"10","author":"Kircher","year":"2019","journal-title":"Nat. Commun"},{"key":"2023061310294087300_btab040-B23","doi-asserted-by":"crossref","first-page":"955","DOI":"10.1038\/ng.3331","article-title":"A method to predict the impact of regulatory variants from DNA sequence","volume":"47","author":"Lee","year":"2015","journal-title":"Nat. Genet"},{"key":"2023061310294087300_btab040-B24","doi-asserted-by":"crossref","first-page":"145","DOI":"10.1038\/sj.gene.6363752","article-title":"CTLA-4 gene expression is influenced by promoter and exon 1 polymorphisms","volume":"2","author":"Ligers","year":"2001","journal-title":"Genes Immun"},{"key":"2023061310294087300_btab040-B25","doi-asserted-by":"crossref","first-page":"1323","DOI":"10.1109\/TIP.2017.2781298","article-title":"Cost-sensitive feature selection by optimizing f-measures","volume":"27","author":"Liu","year":"2018","journal-title":"IEEE Trans. Image Process"},{"key":"2023061310294087300_btab040-B26","first-page":"76","article-title":"Massively parallel reporter assays: defining functional psychiatric genetic variants across biological contexts","volume":"89","author":"Mulvey","year":"2020"},{"key":"2023061310294087300_btab040-B27","doi-asserted-by":"crossref","first-page":"714","DOI":"10.1038\/nature09266","article-title":"From noncoding variant to phenotype via sort1 at the 1p13 cholesterol locus","volume":"466","author":"Musunuru","year":"2010","journal-title":"Nature"},{"key":"2023061310294087300_btab040-B28","first-page":"413","article-title":"Obtaining calibrated probabilities from boosting","author":"Niculescu-Mizil","year":"2005"},{"key":"2023061310294087300_btab040-B29","first-page":"625","article-title":"Predicting good probabilities with supervised learning","author":"Niculescu-Mizil","year":"2005"},{"key":"2023061310294087300_btab040-B30","doi-asserted-by":"crossref","first-page":"1409","DOI":"10.1890\/12-1520.1","article-title":"On estimating probability of presence from use\u2013availability or presence\u2013background data","volume":"94","author":"Phillips","year":"2013","journal-title":"Ecology"},{"key":"2023061310294087300_btab040-B31","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1093\/biomet\/66.3.403","article-title":"Logistic disease incidence models and case-control studies","volume":"66","author":"Prentice","year":"1979","journal-title":"Biometrika"},{"key":"2023061310294087300_btab040-B32","doi-asserted-by":"crossref","first-page":"E3692","DOI":"10.1073\/pnas.1714376115","article-title":"Accurate and sensitive quantification of protein-DNA binding affinity","volume":"115","author":"Rastogi","year":"2018","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2023061310294087300_btab040-B33","doi-asserted-by":"crossref","first-page":"294","DOI":"10.1038\/nmeth.2832","article-title":"Functional annotation of noncoding sequence variants","volume":"11","author":"Ritchie","year":"2014","journal-title":"Nat. Methods"},{"key":"2023061310294087300_btab040-B34","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41598-018-26217-7","article-title":"Genome-wide association study identified new susceptible genetic variants in HLA class I region for hepatitis B virus-related hepatocellular carcinoma","volume":"8","author":"Sawai","year":"2018","journal-title":"Sci. Rep"},{"key":"2023061310294087300_btab040-B35","first-page":"1","article-title":"Pulasso: high-dimensional variable selection with presence-only data","volume":"115","author":"Song","year":"2019","journal-title":"J. Am. Stat. Assoc"},{"key":"2023061310294087300_btab040-B36","doi-asserted-by":"crossref","first-page":"1519","DOI":"10.1016\/j.cell.2016.04.027","article-title":"Direct identification of hundreds of expression-modulating variants using a multiplexed reporter assay","volume":"165","author":"Tewhey","year":"2016","journal-title":"Cell"},{"key":"2023061310294087300_btab040-B37","doi-asserted-by":"crossref","first-page":"267","DOI":"10.1111\/j.2517-6161.1996.tb02080.x","article-title":"Regression shrinkage and selection via the lasso","volume":"58","author":"Tibshirani","year":"1996","journal-title":"J. R. Stat. Soc. Ser. B"},{"key":"2023061310294087300_btab040-B38","doi-asserted-by":"crossref","first-page":"506","DOI":"10.1038\/nature01621","article-title":"Association of the t-cell regulatory gene CTLA4 with susceptibility to autoimmune disease","volume":"423","author":"Ueda","year":"2003","journal-title":"Nature"},{"key":"2023061310294087300_btab040-B39","doi-asserted-by":"crossref","first-page":"1160","DOI":"10.1038\/s41588-019-0455-2","article-title":"High-throughput identification of human SNPs affecting regulatory element activity","volume":"51","author":"van Arensbergen","year":"2019","journal-title":"Nat. Genet"},{"key":"2023061310294087300_btab040-B40","doi-asserted-by":"crossref","first-page":"554","DOI":"10.1111\/j.1541-0420.2008.01116.x","article-title":"Presence-only data and the EM algorithm","volume":"65","author":"Ward","year":"2009","journal-title":"Biometrics"},{"key":"2023061310294087300_btab040-B41","doi-asserted-by":"crossref","first-page":"1388","DOI":"10.1109\/TKDE.2009.187","article-title":"Combating the small sample class imbalance problem using feature selection","volume":"22","author":"Wasikowski","year":"2010","journal-title":"IEEE Trans. Knowl. Data Eng"},{"key":"2023061310294087300_btab040-B42","doi-asserted-by":"crossref","first-page":"931","DOI":"10.1038\/nmeth.3547","article-title":"Predicting effects of noncoding variants with deep learning-based sequence model","volume":"12","author":"Zhou","year":"2015","journal-title":"Nat. Methods"},{"key":"2023061310294087300_btab040-B43","doi-asserted-by":"crossref","first-page":"1733","DOI":"10.1214\/08-AOS625","article-title":"On the adaptive elastic-net with a diverging number of parameters","volume":"37","author":"Zou","year":"2009","journal-title":"Ann. Stat"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btab040\/36297314\/btab040.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/14\/1953\/50578493\/btab040.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/14\/1953\/50578493\/btab040.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,22]],"date-time":"2024-08-22T23:30:55Z","timestamp":1724369455000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/14\/1953\/6124410"}},"subtitle":[],"editor":[{"given":"Pier Luigi","family":"Martelli","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2021,1,30]]},"references-count":43,"journal-issue":{"issue":"14","published-print":{"date-parts":[[2021,8,4]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btab040","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2021,7,15]]},"published":{"date-parts":[[2021,1,30]]}}}