{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:28Z","timestamp":1772138068254,"version":"3.50.1"},"reference-count":36,"publisher":"Oxford University Press (OUP)","issue":"4","license":[{"start":{"date-parts":[[2024,4,11]],"date-time":"2024-04-11T00:00:00Z","timestamp":1712793600000},"content-version":"vor","delay-in-days":13,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100000289","name":"Cancer Research UK","doi-asserted-by":"publisher","award":["C18281\/A30905"],"award-info":[{"award-number":["C18281\/A30905"]}],"id":[{"id":"10.13039\/501100000289","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,3,29]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Recent advancements in sequencing technologies have led to the discovery of numerous variants in the human genome. However, understanding their precise roles in diseases remains challenging due to their complex functional mechanisms. Various methodologies have emerged to predict the pathogenic significance of these genetic variants. Typically, these methods employ an integrative approach, leveraging diverse data sources that provide important insights into genomic function. Despite the abundance of publicly available data sources and databases, the process of navigating, extracting, and pre-processing features for machine learning models can be highly challenging and time-consuming. Furthermore, researchers often invest substantial effort in feature extraction, only to later discover that these features lack informativeness.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>In this article, we introduce DrivR-Base, an innovative resource that efficiently extracts and integrates molecular information (features) related to single nucleotide variants. These features encompass information about the genomic positions and the associated protein positions of a variant. They are derived from a wide array of databases and tools, including structural properties obtained from AlphaFold, regulatory information sourced from ENCODE, and predicted variant consequences from Variant Effect Predictor. DrivR-Base is easily deployable via a Docker container to ensure reproducibility and ease of access across diverse computational environments. The resulting features can be used as input for machine learning models designed to predict the pathogenic impact of human genome variants in disease. Moreover, these feature sets have applications beyond this, including haploinsufficiency prediction and the development of drug repurposing tools. We describe the resource\u2019s development, practical applications, and potential for future expansion and enhancement.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>DrivR-Base source code is available at https:\/\/github.com\/amyfrancis97\/DrivR-Base.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btae197","type":"journal-article","created":{"date-parts":[[2024,4,9]],"date-time":"2024-04-09T19:34:33Z","timestamp":1712691273000},"source":"Crossref","is-referenced-by-count":1,"title":["DrivR-Base: a feature extraction toolkit for variant effect prediction model construction"],"prefix":"10.1093","volume":"40","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-4807-1454","authenticated-orcid":false,"given":"Amy","family":"Francis","sequence":"first","affiliation":[{"name":"MRC Integrative Epidemiology Unit, Bristol Medical School (PHS), University of Bristol , Bristol BS8 2BN, United Kingdom"}]},{"given":"Colin","family":"Campbell","sequence":"additional","affiliation":[{"name":"Intelligent Systems Laboratory, University of Bristol , Bristol BS1 5DD, United Kingdom"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0924-3247","authenticated-orcid":false,"given":"Tom R","family":"Gaunt","sequence":"additional","affiliation":[{"name":"MRC Integrative Epidemiology Unit, Bristol Medical School (PHS), University of Bristol , Bristol BS8 2BN, United Kingdom"}]}],"member":"286","published-online":{"date-parts":[[2024,4,11]]},"reference":[{"key":"2024042923514161600_btae197-B1","doi-asserted-by":"crossref","DOI":"10.1002\/0471142905.hg0720s76","article-title":"Predicting functional effect of human missense mutations using polyphen-2","author":"Adzhubei","year":"2013","journal-title":"Curr Protoc Hum Genet"},{"key":"2024042923514161600_btae197-B2","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1093\/nar\/28.1.235","article-title":"The protein data bank","volume":"28","author":"Berman","year":"2000","journal-title":"Nucleic Acids Res"},{"key":"2024042923514161600_btae197-B3","doi-asserted-by":"crossref","first-page":"555","DOI":"10.1038\/s41431-021-01034-1","article-title":"Variant pathogenic prediction by locus variability: the importance of the current picture of evolution","volume":"30","author":"Cabrera-Alarcon","year":"2022","journal-title":"Eur J Hum Genet"},{"key":"2024042923514161600_btae197-B4","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-031-01552-6","volume-title":"Learning with Support Vector Machines","author":"Campbell","year":"2011"},{"key":"2024042923514161600_btae197-B5","doi-asserted-by":"crossref","first-page":"eadg7492","DOI":"10.1126\/science.adg7492","article-title":"Accurate proteome-wide missense variant effect prediction with alphamissense","volume":"381","author":"Cheng","year":"2023","journal-title":"Science"},{"key":"2024042923514161600_btae197-B6","doi-asserted-by":"crossref","first-page":"1211","DOI":"10.1093\/bioinformatics\/btv735","article-title":"Dnashaper: an r\/bioconductor package for dna shape prediction and feature encoding","volume":"32","author":"Chiu","year":"2016","journal-title":"Bioinformatics"},{"key":"2024042923514161600_btae197-B7","doi-asserted-by":"crossref","first-page":"12565","DOI":"10.1093\/nar\/gkx915","article-title":"Genome-wide prediction of minor-groove electrostatic potential enables biophysical modeling of protein\u2013DNA binding","volume":"45","author":"Chiu","year":"2017","journal-title":"Nucleic Acids Res"},{"key":"2024042923514161600_btae197-B8","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1038\/nature11247","article-title":"An integrated encyclopedia of DNA elements in the human genome","volume":"489","author":"Dunham","year":"2012","journal-title":"Nature"},{"key":"2024042923514161600_btae197-B9","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1038\/s41586-021-04043-8","article-title":"Disease variant prediction with deep generative models of evolutionary data","volume":"599","author":"Frazer","year":"2021","journal-title":"Nature"},{"key":"2024042923514161600_btae197-B10","doi-asserted-by":"crossref","first-page":"D37","DOI":"10.1093\/nar\/gkn597","article-title":"Diprodb: a database for dinucleotide properties","volume":"37","author":"Friedel","year":"2009","journal-title":"Nucleic Acids Res"},{"key":"2024042923514161600_btae197-B11","doi-asserted-by":"crossref","first-page":"1443","DOI":"10.1126\/science.1604319","article-title":"Exhaustive matching of the entire protein sequence database","volume":"256","author":"Gonnet","year":"1992","journal-title":"Science"},{"key":"2024042923514161600_btae197-B12","doi-asserted-by":"crossref","first-page":"10915","DOI":"10.1073\/pnas.89.22.10915","article-title":"Amino acid substitution matrices from protein blocks","volume":"89","author":"Henikoff","year":"1992","journal-title":"Proc Natl Acad Sci U S A"},{"key":"2024042923514161600_btae197-B13","first-page":"101307","article-title":"The use of genomic variants to drive drug repurposing for chronic hepatitis b","volume":"31","author":"Irham","year":"2022","journal-title":"Biochem Biophys Rep"},{"key":"2024042923514161600_btae197-B14","first-page":"275","article-title":"The rapid generation of mutation data matrices from protein sequences","volume":"8","author":"Jones","year":"1992","journal-title":"Comput Appl Biosci"},{"key":"2024042923514161600_btae197-B15","doi-asserted-by":"crossref","first-page":"269","DOI":"10.1016\/0014-5793(94)80429-X","article-title":"A mutation data matrix for transmembrane proteins","volume":"339","author":"Jones","year":"1994","journal-title":"FEBS Lett"},{"key":"2024042923514161600_btae197-B16","doi-asserted-by":"crossref","first-page":"7189","DOI":"10.1093\/nar\/gkg922","article-title":"Using electrostatic potentials to predict dna-binding sites on dna-binding proteins","volume":"31","author":"Jones","year":"2003","journal-title":"Nucleic Acids Res"},{"key":"2024042923514161600_btae197-B17","doi-asserted-by":"crossref","first-page":"583","DOI":"10.1038\/s41586-021-03819-2","article-title":"Highly accurate protein structure prediction with alphafold","volume":"596","author":"Jumper","year":"2021","journal-title":"Nature"},{"key":"2024042923514161600_btae197-B18","first-page":"e120","article-title":"Umap and bismap: quantifying genome and methylome mappability","volume":"46","author":"Karimzadeh","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2024042923514161600_btae197-B19","doi-asserted-by":"crossref","first-page":"996","DOI":"10.1101\/gr.229102","article-title":"The human genome browser at ucsc","volume":"12","author":"Kent","year":"2002","journal-title":"Genome Res"},{"key":"2024042923514161600_btae197-B20","doi-asserted-by":"crossref","first-page":"103","DOI":"10.1186\/s13073-020-00803-9","article-title":"Dbnsfp v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site snvs","volume":"12","author":"Liu","year":"2020","journal-title":"Genome Med"},{"key":"2024042923514161600_btae197-B21","doi-asserted-by":"crossref","first-page":"122","DOI":"10.1186\/s13059-016-0974-4","article-title":"The ensembl variant effect predictor","volume":"17","author":"McLaren","year":"2016","journal-title":"Genome Biol"},{"key":"2024042923514161600_btae197-B22","doi-asserted-by":"crossref","first-page":"760","DOI":"10.1093\/bioinformatics\/16.9.760","article-title":"Phat: a transmembrane-specific substitution matrix. predicted hydrophobic and transmembrane","volume":"16","author":"Ng","year":"2000","journal-title":"Bioinformatics"},{"key":"2024042923514161600_btae197-B23","doi-asserted-by":"crossref","first-page":"133","DOI":"10.1186\/1471-2105-13-133","article-title":"Bios2mds: an r package for comparing orthologous protein families by metric multidimensional scaling","volume":"13","author":"Pel\u00e9","year":"2012","journal-title":"BMC Bioinformatics"},{"key":"2024042923514161600_btae197-B24","doi-asserted-by":"crossref","first-page":"110","DOI":"10.1101\/gr.097857.109","article-title":"Detection of non-neutral substitution rates on mammalian phylogenies","volume":"20","author":"Pollard","year":"2009","journal-title":"Genome Res"},{"key":"2024042923514161600_btae197-B25","doi-asserted-by":"crossref","first-page":"761","DOI":"10.1093\/bioinformatics\/btu703","article-title":"Dann: a deep learning approach for annotating the pathogenicity of genetic variants","volume":"31","author":"Quang","year":"2015","journal-title":"Bioinformatics"},{"key":"2024042923514161600_btae197-B26","author":"Reddy","year":"2019"},{"key":"2024042923514161600_btae197-B27","doi-asserted-by":"crossref","first-page":"D886","DOI":"10.1093\/nar\/gky1016","article-title":"Cadd: predicting the deleteriousness of variants throughout the human genome","volume":"47","author":"Rentzsch","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2024042923514161600_btae197-B28","doi-asserted-by":"crossref","first-page":"11597","DOI":"10.1038\/s41598-017-11746-4","article-title":"Cscape: a tool for predicting oncogenic single-point mutations in the cancer genome","volume":"7","author":"Rogers","year":"2017","journal-title":"Sci Rep"},{"key":"2024042923514161600_btae197-B29","doi-asserted-by":"crossref","first-page":"1248","DOI":"10.1038\/nature08473","article-title":"The role of dna shape in protein-dna recognition","volume":"461","author":"Rohs","year":"2009","journal-title":"Nature"},{"key":"2024042923514161600_btae197-B30","doi-asserted-by":"crossref","first-page":"1751","DOI":"10.1093\/bioinformatics\/btx028","article-title":"Hipred: an integrative approach to predicting haploinsufficient genes","volume":"33","author":"Shihab","year":"2017","journal-title":"Bioinformatics"},{"key":"2024042923514161600_btae197-B31","doi-asserted-by":"crossref","first-page":"1536","DOI":"10.1093\/bioinformatics\/btv009","article-title":"An integrative approach to predicting the functional effects of non-coding and coding sequence variation","volume":"31","author":"Shihab","year":"2015","journal-title":"Bioinformatics"},{"key":"2024042923514161600_btae197-B32","doi-asserted-by":"crossref","first-page":"1034","DOI":"10.1101\/gr.3715005","article-title":"Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes","volume":"15","author":"Siepel","year":"2005","journal-title":"Genome Res"},{"key":"2024042923514161600_btae197-B33","doi-asserted-by":"crossref","first-page":"1667","DOI":"10.1038\/s41598-018-38189-9","article-title":"New insights into the pathogenicity of non-synonymous variants through multi-level analysis","volume":"9","author":"Sun","year":"2019","journal-title":"Sci Rep"},{"key":"2024042923514161600_btae197-B34","doi-asserted-by":"crossref","first-page":"1838","DOI":"10.1093\/nar\/gkg296","article-title":"Dna helix: the importance of being gc-rich","volume":"31","author":"Vinogradov","year":"2003","journal-title":"Nucleic Acids Res"},{"key":"2024042923514161600_btae197-B35","doi-asserted-by":"crossref","first-page":"e164","DOI":"10.1093\/nar\/gkq603","article-title":"Annovar: functional annotation of genetic variants from high-throughput sequencing data","volume":"38","author":"Wang","year":"2010","journal-title":"Nucleic Acids Res"},{"key":"2024042923514161600_btae197-B36","doi-asserted-by":"crossref","first-page":"811","DOI":"10.1016\/B0-12-226865-2\/00355-2","article-title":"Populations, species, and conservation genetics","author":"Woodruff","year":"2001","journal-title":"Encyclopedia of Biodiversity"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btae197\/57218152\/btae197.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/4\/btae197\/57356554\/btae197.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/4\/btae197\/57356554\/btae197.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,4,29]],"date-time":"2024-04-29T19:51:56Z","timestamp":1714420316000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btae197\/7644281"}},"subtitle":[],"editor":[{"given":"Jonathan","family":"Wren","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2024,3,29]]},"references-count":36,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,3,29]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btae197","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2024.01.16.575859","asserted-by":"object"}]},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024,4,1]]},"published":{"date-parts":[[2024,3,29]]},"article-number":"btae197"}}