{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,22]],"date-time":"2025-02-22T00:46:06Z","timestamp":1740185166062,"version":"3.37.3"},"reference-count":50,"publisher":"Oxford University Press (OUP)","issue":"19","license":[{"start":{"date-parts":[[2022,8,5]],"date-time":"2022-08-05T00:00:00Z","timestamp":1659657600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"Institute of Computing Science Statutory Funds"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,9,30]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>Whole-genome sequencing has revolutionized biosciences by providing tools for constructing complete DNA sequences of individuals. With entire genomes at hand, scientists can pinpoint DNA fragments responsible for oncogenesis and predict patient responses to cancer treatments. Machine learning plays a paramount role in this process. However, the sheer volume of whole-genome data makes it difficult to encode the characteristics of genomic variants as features for learning algorithms.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>In this article, we propose three feature extraction methods that facilitate classifier learning from sets of genomic variants. The core contributions of this work include: (i) strategies for determining features using variant length binning, clustering and density estimation; (ii) a programing library for automating distribution-based feature extraction in machine learning pipelines. The proposed methods have been validated on five real-world datasets using four different classification algorithms and a clustering approach. Experiments on genomes of 219 ovarian, 61 lung and 929 breast cancer patients show that the proposed approaches automatically identify genomic biomarkers associated with cancer subtypes and clinical response to oncological treatment. Finally, we show that the extracted features can be used alongside unsupervised learning methods to analyze genomic samples.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>The source code of the presented algorithms and reproducible experimental scripts are available on Github at https:\/\/github.com\/MNMdiagnostics\/dbfe.<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information<\/jats:title><jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/btac513","type":"journal-article","created":{"date-parts":[[2022,8,5]],"date-time":"2022-08-05T13:43:01Z","timestamp":1659706981000},"page":"4466-4473","source":"Crossref","is-referenced-by-count":1,"title":["DBFE: distribution-based feature extraction from structural variants in whole-genome data"],"prefix":"10.1093","volume":"38","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7281-284X","authenticated-orcid":false,"given":"Maciej","family":"Piernik","sequence":"first","affiliation":[{"name":"Institute of Computing Science, Faculty of Computing and Telecommunications, Poznan University of Technology , 60-965 Poznan, Poland"},{"name":"MNM Bioscience Inc. , Cambridge, MA 02142, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9723-525X","authenticated-orcid":false,"given":"Dariusz","family":"Brzezinski","sequence":"additional","affiliation":[{"name":"Institute of Computing Science, Faculty of Computing and Telecommunications, Poznan University of Technology , 60-965 Poznan, Poland"},{"name":"MNM Bioscience Inc. , Cambridge, MA 02142, USA"},{"name":"Institute of Bioorganic Chemistry of the Polish Academy of Sciences , 61-704 Poznan, Poland"}]},{"given":"Pawel","family":"Sztromwasser","sequence":"additional","affiliation":[{"name":"MNM Bioscience Inc. , Cambridge, MA 02142, USA"}]},{"given":"Klaudia","family":"Pacewicz","sequence":"additional","affiliation":[{"name":"MNM Bioscience Inc. , Cambridge, MA 02142, USA"}]},{"given":"Weronika","family":"Majer-Burman","sequence":"additional","affiliation":[{"name":"MNM Bioscience Inc. , Cambridge, MA 02142, USA"}]},{"given":"Michal","family":"Gniot","sequence":"additional","affiliation":[{"name":"MNM Bioscience Inc. , Cambridge, MA 02142, USA"},{"name":"Department of Hematology and Bone Marrow Transplantation, Poznan University of Medical Sciences , 60-569 Poznan, Poland"}]},{"given":"Dawid","family":"Sielski","sequence":"additional","affiliation":[{"name":"MNM Bioscience Inc. , Cambridge, MA 02142, USA"}]},{"given":"Oleksii","family":"Bryzghalov","sequence":"additional","affiliation":[{"name":"MNM Bioscience Inc. , Cambridge, MA 02142, USA"}]},{"given":"Alicja","family":"Wozna","sequence":"additional","affiliation":[{"name":"MNM Bioscience Inc. , Cambridge, MA 02142, USA"},{"name":"Faculty of Physics, Adam Mickiewicz University , 61-614 Poznan, Poland"}]},{"given":"Pawel","family":"Zawadzki","sequence":"additional","affiliation":[{"name":"MNM Bioscience Inc. , Cambridge, MA 02142, USA"},{"name":"Faculty of Physics, Adam Mickiewicz University , 61-614 Poznan, Poland"}]}],"member":"286","published-online":{"date-parts":[[2022,8,5]]},"reference":[{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"10001","DOI":"10.1038\/ncomms10001","article-title":"A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing","volume":"6","author":"Alioto","year":"2015","journal-title":"Nat. Commun"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"1677","DOI":"10.1093\/bioinformatics\/btab859","article-title":"DeepSVP: integration of genotype and phenotype for structural variant prioritization using deep learning","volume":"38","author":"Althagafi","year":"2022","journal-title":"Bioinformatics"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"1344","DOI":"10.1158\/1078-0432.CCR-17-2994","article-title":"Genomics-driven precision medicine for advanced pancreatic cancer: early results from the COMPASS trial","volume":"24","author":"Aung","year":"2018","journal-title":"Clin. Cancer Res"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"bbab434","DOI":"10.1093\/bib\/bbab434","article-title":"MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors","volume":"23","author":"Bonidia","year":"2022","journal-title":"Brief. Bioinform"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"194","DOI":"10.1016\/j.tig.2021.08.007","article-title":"Therapeutic and prognostic insights from the analysis of cancer mutational signatures","volume":"38","author":"Brady","year":"2022","journal-title":"Trends Genet"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach. Learn"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"202","DOI":"10.1186\/s13059-021-02423-x","article-title":"GRIDSS2: comprehensive characterisation of somatic structural variation using single breakend variants and structural variant phasing","volume":"22","author":"Cameron","year":"2021","journal-title":"Genome Biol"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"1220","DOI":"10.1093\/bioinformatics\/btv710","article-title":"Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications","volume":"32","author":"Chen","year":"2016","journal-title":"Bioinformatics"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"1669","DOI":"10.1214\/13-AOS1129","article-title":"Quantile and quantile-function estimations under density ratio model","volume":"41","author":"Chen","year":"2013","journal-title":"Ann. Statist"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"213","DOI":"10.1038\/nbt.2514","article-title":"Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples","volume":"31","author":"Cibulskis","year":"2013","journal-title":"Nat. Biotechnol"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"2156","DOI":"10.1093\/bioinformatics\/btr330","article-title":"The variant call format and VCFtools","volume":"27","author":"Danecek","year":"2011","journal-title":"Bioinformatics"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"giab008","DOI":"10.1093\/gigascience\/giab008","article-title":"Twelve years of SAMtools and BCFtools","volume":"10","author":"Danecek","year":"2021","journal-title":"Gigascience"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"517","DOI":"10.1038\/nm.4292","article-title":"HRDetect is a predictor of BRCA1 and BRCA2 deficiency based on mutational signatures","volume":"23","author":"Davies","year":"2017","journal-title":"Nat. Med"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"1035","DOI":"10.1001\/jama.2014.1717","article-title":"Clinical interpretation and implications of whole-genome sequencing","volume":"311","author":"Dewey","year":"2014","journal-title":"JAMA"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"267","DOI":"10.1214\/aos\/1176342662","article-title":"Empirical probability plots and statistical inference for nonlinear models in the two-sample case","volume":"2","author":"Doksum","year":"1974","journal-title":"Ann. Statist"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"421","DOI":"10.1093\/biomet\/63.3.421","article-title":"Plotting with confidence: graphical comparisons of two populations","volume":"63","author":"Doksum","year":"1976","journal-title":"Biometrika"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"238","DOI":"10.2307\/1403797","article-title":"Discriminatory analysis. Nonparametric discrimination: consistency properties","volume":"57","author":"Fix","year":"1989","journal-title":"Int. Stat. Rev"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"W21","DOI":"10.1093\/nar\/gkab402","article-title":"AnnotSV and knotAnnotSV: a web server for human structural variations annotations, ranking and analysis","volume":"49","author":"Geoffroy","year":"2021","journal-title":"Nucleic Acids Res"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"344","DOI":"10.1038\/nature13394","article-title":"Genome sequencing identifies major causes of severe intellectual disability","volume":"511","author":"Gilissen","year":"2014","journal-title":"Nature"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"2778","DOI":"10.1093\/bioinformatics\/btq524","article-title":"Ruffus: a lightweight Python library for computational pipelines","volume":"26","author":"Goodstadt","year":"2010","journal-title":"Bioinformatics"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","DOI":"10.1007\/978-0-387-84858-7","volume-title":"The Elements of Statistical Learning","author":"Hastie","year":"2009"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"3053","DOI":"10.1073\/pnas.1909378117","article-title":"Precision medicine integrating whole-genome sequencing, comprehensive metabolomics, and advanced imaging","volume":"117","author":"Hou","year":"2020","journal-title":"Proc. Natl. Acad. Sci. USA"},{"year":"2022","author":"Islam","key":"2023041408204285100_"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"1466","DOI":"10.1016\/j.csbj.2020.06.017","article-title":"Deep learning models in genomics; are we there yet?","volume":"18","author":"Koumakis","year":"2020","journal-title":"Comput. Struct. Biotechnol. J"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"R15","DOI":"10.1186\/bcr2824","article-title":"Quantification and clinical relevance of gene amplification at chromosome 17q12-q21 in human epidermal growth factor receptor 2-amplified breast cancers","volume":"13","author":"Lamy","year":"2011","journal-title":"Breast Cancer Res"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"435","DOI":"10.1038\/gim.2017.119","article-title":"Improved diagnostic yield compared with targeted gene sequencing panels suggests a role for whole-genome sequencing as a first-tier genetic test","volume":"20","author":"Lionel","year":"2018","journal-title":"Genet. Med"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"1469","DOI":"10.1038\/s41380-021-01418-1","article-title":"Application of deep learning algorithm on whole genome sequencing data uncovers structural variants associated with multiple mental disorders in African American patients","volume":"27","author":"Liu","year":"2022","journal-title":"Mol. Psychiatry"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"50","DOI":"10.1214\/aoms\/1177730491","article-title":"On a test of whether one of two random variables is stochastically larger than the other","volume":"18","author":"Mann","year":"1947","journal-title":"Ann. Math. Statist"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"10","DOI":"10.14806\/ej.17.1.200","article-title":"Cutadapt removes adapter sequences from high-throughput sequencing reads","volume":"17","author":"Martin","year":"2011","journal-title":"EMBnet J"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1080\/01621459.1951.10500769","article-title":"The Kolmogorov-Smirnov test for goodness of fit","volume":"46","author":"Massey","year":"1951","journal-title":"J. Am. Stat. Assoc"},{"year":"2020","author":"McInnes","key":"2023041408204285100_"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"47","DOI":"10.1109\/79.543975","article-title":"The expectation-maximization algorithm","volume":"13","author":"Moon","year":"1996","journal-title":"IEEE Signal Process. Mag"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"47","DOI":"10.1038\/nature17676","article-title":"Landscape of somatic mutations in 560 breast cancer whole-genome sequences","volume":"534","author":"Nik-Zainal","year":"2016","journal-title":"Nature"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"1160","DOI":"10.1200\/JCO.2008.18.1370","article-title":"Supervised risk predictor of breast cancer based on intrinsic subtypes","volume":"27","author":"Parker","year":"2009","journal-title":"JCO"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"94","DOI":"10.1038\/s41586-020-1943-3","article-title":"The repertoire of mutational signatures in human cancer","volume":"578","author":"PCAWG Mutational Signatures Working Group","year":"2020","journal-title":"Nature"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"112","DOI":"10.1038\/s41586-019-1913-9","article-title":"Patterns of somatic structural variation in human cancer genomes","volume":"578","author":"PCAWG Structural Variation Working Group","year":"2020","journal-title":"Nature"},{"key":"2023041408204285100_","first-page":"2825","article-title":"Scikit-learn: machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1038\/s41571-018-0114-z","article-title":"State-of-the-art strategies for targeting the DNA damage response in cancer","volume":"16","author":"Pili\u00e9","year":"2019","journal-title":"Nat. Rev. Clin. Oncol"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"e0123261","DOI":"10.1371\/journal.pone.0123261","article-title":"ClusTrack: feature extraction and similarity measures for clustering of genome-wide data sets","volume":"10","author":"Rydbeck","year":"2015","journal-title":"PLoS One"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"95","DOI":"10.1186\/s13073-018-0606-6","article-title":"Complex structural variants in Mendelian disorders: identification and breakpoint resolution using short- and long-read genome sequencing","volume":"10","author":"Sanchis-Juan","year":"2018","journal-title":"Genome Med"},{"year":"2021","author":"Sanger Institute","key":"2023041408204285100_"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"75","DOI":"10.1038\/nature15394","article-title":"An integrated map of structural variation in 2,504 human genomes","volume":"526","author":"Sudmant","year":"2015","journal-title":"Nature"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"540","DOI":"10.1093\/bioinformatics\/btab662","article-title":"Viola: a structural variant signature extractor with user-defined classifications","volume":"38","author":"Sugita","year":"2022","journal-title":"Bioinformatics"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1038\/s41586-020-1969-6","article-title":"Pan-cancer analysis of whole genomes","volume":"578","author":"The ICGC\/TCGA Pan-Cancer Analysis of Whole Genomes Consortium","year":"2020","journal-title":"Nature"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"784","DOI":"10.1137\/1118101","article-title":"Calculation of the Wasserstein distance between probability distributions on the line","volume":"18","author":"Vallender","year":"1974","journal-title":"Theory Probab. Appl"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"15","DOI":"10.1038\/s41698-021-00155-6","article-title":"Structural variant detection in cancer genomes: computational challenges and perspectives for precision oncology","volume":"5","author":"van Belzen","year":"2021","journal-title":"NPJ Precis. Oncol"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1016\/j.currproblcancer.2010.12.002","article-title":"PARP inhibitor treatment in ovarian and breast cancer","volume":"35","author":"Weil","year":"2011","journal-title":"Curr. Probl. Cancer"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"e6879","DOI":"10.1097\/MD.0000000000006879","article-title":"Effective and extensible feature extraction method using genetic algorithm-based frequency-domain feature search for epileptic EEG multiclassification","volume":"96","author":"Wen","year":"2017","journal-title":"Medicine"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"e1009303","DOI":"10.1371\/journal.pgen.1009303","article-title":"Creating artificial human genomes using generative neural networks","volume":"17","author":"Yelmen","year":"2021","journal-title":"PLoS Genet"},{"key":"2023041408204285100_","doi-asserted-by":"crossref","first-page":"giaa145","DOI":"10.1093\/gigascience\/giaa145","article-title":"Parliament2: accurate structural variant calling at scale","volume":"9","author":"Zarate","year":"2020","journal-title":"Gigascience"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btac513\/45296702\/btac513.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/19\/4466\/49884896\/btac513.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/19\/4466\/49884896\/btac513.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,25]],"date-time":"2023-11-25T10:10:46Z","timestamp":1700907046000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/38\/19\/4466\/6656344"}},"subtitle":[],"editor":[{"given":"Pier Luigi","family":"Martelli","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2022,8,5]]},"references-count":50,"journal-issue":{"issue":"19","published-print":{"date-parts":[[2022,9,30]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btac513","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2022,10,1]]},"published":{"date-parts":[[2022,8,5]]}}}