{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T18:08:09Z","timestamp":1757614089413,"version":"3.44.0"},"reference-count":34,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,7]]},"abstract":"<jats:p>The Variant Call Format (VCF) and its binary counterpart (BCF) are commonly used in bioinformatics for storing gene sequence data. While VCF files provide compact storage, they require specific tools and scripts for querying, thereby missing the rich functionality arsenal of database management systems and their potential for integration in multiomics pipelines. In this paper, we leverage Relational Database Management Systems (RDBMS) to enhance efficiency and flexibility in storing and querying large-scale genetic datasets. We map the VCF file structure to narrow, wide, and array-based data models that are further refined using JSON data structures, resulting in eight data models. 
Our experimental evaluation shows that RDBMS provide competitive performance in comparison with specialized state-of-the-art tools while making full-fledged database capabilities available for genetic data analysis.<\/jats:p>","DOI":"10.14778\/3749646.3749674","type":"journal-article","created":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T17:55:06Z","timestamp":1757008506000},"page":"4045-4053","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Relational Data Models for Genetic VCF data"],"prefix":"10.14778","volume":"18","author":[{"given":"Mohamed Sabri","family":"Hafidi","sequence":"first","affiliation":[{"name":"Free University of Bozen-Bolzano"}]},{"given":"Ozan","family":"Kahramano\u011fullar\u0131","sequence":"additional","affiliation":[{"name":"Free University of Bozen-Bolzano"}]},{"given":"Anton","family":"Dign\u00f6s","sequence":"additional","affiliation":[{"name":"Free University of Bozen-Bolzano"}]},{"given":"Johann","family":"Gamper","sequence":"additional","affiliation":[{"name":"Free University of Bozen-Bolzano"}]}],"member":"320","published-online":{"date-parts":[[2025,9,4]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Gihan Daw Elbait, Habiba Alsafar, and Andreas Henschel.","author":"Al-Aamri Amira","year":"2023","unstructured":"Amira Al-Aamri, Syafiq Kamarul Azman, Gihan Daw Elbait, Habiba Alsafar, and Andreas Henschel. 2023. Critical assessment of on-premise approaches to scalable genome analysis. 
BMC bioinformatics 24, 1 (2023), 354."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1093\/molbev\/msab032"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkae999"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1186\/s12929-024-01110-w"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41597-019-0258-4"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1038\/nature15393"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btx011"},{"key":"e_1_2_1_8_1","doi-asserted-by":"crossref","unstructured":"Petr Danecek James K Bonfield Jennifer Liddle John Marshall Valeriu Ohan Martin O Pollard Andrew Whitwham Thomas Keane Shane A McCarthy Robert M Davies et al. 2021. Twelve years of SAMtools and BCFtools. Gigascience 10 2 (2021) giab008.","DOI":"10.1093\/gigascience\/giab008"},{"key":"e_1_2_1_9_1","volume-title":"Ammar Husami, James Jones, Himanshu Khangar, Shubham Londhe, Frank Naeymi-Rad, Soujanya Rao, et al.","author":"Dolin Robert H","year":"2021","unstructured":"Robert H Dolin, Shaileshbhai R Gothi, Aziz Boxwala, Bret SE Heale, Ammar Husami, James Jones, Himanshu Khangar, Shubham Londhe, Frank Naeymi-Rad, Soujanya Rao, et al. 2021. vcf2fhir: a utility to convert VCF files into HL7 FHIR format for genomics-EHR integration. BMC bioinformatics 22 (2021), 1\u201311."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1186\/s13059-024-03239-1"},{"key":"e_1_2_1_11_1","doi-asserted-by":"crossref","unstructured":"Holger Fr\u00f6hlich Rudi Balling Niko Beerenwinkel Oliver Kohlbacher Santosh Kumar Thomas Lengauer Marloes H Maathuis Yves Moreau Susan A Murphy Teresa M Przytycka et al. 2018. From hype to reality: data science enabling personalized medicine. 
BMC medicine 16 (2018) 1\u201315.","DOI":"10.1186\/s12916-018-1122-7"},{"key":"e_1_2_1_12_1","volume-title":"A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2, hts-nim and slivar. PLoS computational biology 18, 5","author":"Garrison Erik","year":"2022","unstructured":"Erik Garrison, Zev N Kronenberg, Eric T Dawson, Brent S Pedersen, and Pjotr Prins. 2022. A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2, hts-nim and slivar. PLoS computational biology 18, 5 (2022), e1009123."},{"key":"e_1_2_1_13_1","volume-title":"Fast and accurate variant identification tool for sequencing-based studies. BMC biology 22, 1","author":"Gaston Jeffry M","year":"2024","unstructured":"Jeffry M Gaston, Eric J Alm, and An-Ni Zhang. 2024. Fast and accurate variant identification tool for sequencing-based studies. BMC biology 22, 1 (2024), 90."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0164043"},{"key":"e_1_2_1_15_1","unstructured":"Steffen Janetzki Magn\u00fas Rafn Tiedemann and Hardik Balar. 2015. Genome data management using RDBMSs. Technical Report. https:\/\/www.researchgate.net\/profile\/Hardik-Balar\/publication\/280232082_Genome_Data_Management_using_RDBMSs\/"},{"key":"e_1_2_1_16_1","volume-title":"D1","author":"Katz Kenneth","year":"2022","unstructured":"Kenneth Katz, Oleg Shutov, Richard Lapoint, Michael Kimelman, J Rodney Brister, and Christopher O'Sullivan. 2022. The Sequence Read Archive: a decade more of explosive growth. 
Nucleic acids research 50, D1 (2022), D387-D390."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btq671"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-66917-5_27"},{"key":"e_1_2_1_19_1","volume-title":"Jointly integrating VCF-based variants and OWL-based biomedical ontologies in MongoDB","author":"Liu Jian","year":"2019","unstructured":"Jian Liu, Zhi Qu, Mo Yang, Jialiang Sun, Shuhui Su, and Lei Zhang. 2019. Jointly integrating VCF-based variants and OWL-based biomedical ontologies in MongoDB. IEEE\/ACM transactions on computational biology and bioinformatics 17, 5 (2019), 1504\u20131515."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0240059"},{"key":"e_1_2_1_21_1","unstructured":"Nature Genetics Nature and Nature Reviews Genetics. 2021. Milestones in Genomic Sequencing. https:\/\/www.nature.com\/immersive\/d42859-020-00099-0\/"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.3390\/plants10020415"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.30574\/wjarr.2024.21.1.0016"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.14778\/3025111.3025117"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1186\/s12967-015-0704-9"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btx057"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/bty358"},{"key":"e_1_2_1_28_1","volume-title":"Genetics: A conceptual approach","author":"Pierce Benjamin A","year":"2017","unstructured":"Benjamin A Pierce. 2017. Genetics: A conceptual approach. Macmillan Higher Education, New York, NY."},{"key":"e_1_2_1_29_1","unstructured":"SAMtools. 2024. BCFtools by SAMtools. https:\/\/github.com\/samtools\/bcftools. 
Accessed: 2025-06-21."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41576-019-0156-9"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1038\/s10038-020-00862-1"},{"key":"e_1_2_1_32_1","unstructured":"TileDB-Inc. 2024. TileDB-VCF: Efficient variant-call data storage and retrieval. https:\/\/github.com\/TileDB-Inc\/TileDB-VCF. Accessed: 2025-06-21."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1136\/bmj.k1687"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1038\/s43586-021-00056-9"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3749646.3749674","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,5]],"date-time":"2025-09-05T03:38:58Z","timestamp":1757043538000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3749646.3749674"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7]]},"references-count":34,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2025,7]]}},"alternative-id":["10.14778\/3749646.3749674"],"URL":"https:\/\/doi.org\/10.14778\/3749646.3749674","relation":{},"ISSN":["2150-8097"],"issn-type":[{"type":"print","value":"2150-8097"}],"subject":[],"published":{"date-parts":[[2025,7]]},"assertion":[{"value":"2025-09-04","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}