{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,2]],"date-time":"2025-08-02T18:27:33Z","timestamp":1754159253463,"version":"3.41.2"},"reference-count":45,"publisher":"Oxford University Press (OUP)","issue":"7","license":[{"start":{"date-parts":[[2025,7,8]],"date-time":"2025-07-08T00:00:00Z","timestamp":1751932800000},"content-version":"vor","delay-in-days":7,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100004359","name":"Swedish Research Council","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100004359","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Erik Philip-S\u00f6rensen Foundation","award":["G2023-029"],"award-info":[{"award-number":["G2023-029"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Efforts to address health disparities are often limited by the lack of robust computational tools for inferring genetic ancestry by calculating an individual\u2019s genetic similarity to continental groups. We have already shown that a preferred alternative to self-described race is using ancestry-informative markers (AIMs) that can be classified into ancestral components and used to estimate their similarity to those of known populations to identify continental groups. 
However, real-world genomic data can present challenges, including limited availability of germline DNA, a small number of AIMs for each sample, and the use of different variant calling software, limiting the application of existing solutions.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>Here, we describe a novel supervised machine-learning tool AncestryGeni, which infers genetic ancestry for samples with even a hundred markers and is applicable to any genomic data, including whole exome sequencing (WES) and RNA sequencing (RNA-Seq) data. Applying AncestryGeni to a real-world genomic dataset obtained from the Multiple Myeloma Research Foundation (MMRF) CoMMpass study, we show that it is more accurate than the commonly used FastNGSadmix when using nonstandard genomic material. We also demonstrate that when using AncestryGeni, the tumor-derived sequence obtained from WES and RNA-Seq can be a robust data source to accurately estimate an individual\u2019s genetic similarity to a continental group.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>AncestryGeni pipeline is available at https:\/\/github.com\/eelhaik\/AncestryGeni\/tree\/main.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaf391","type":"journal-article","created":{"date-parts":[[2025,7,8]],"date-time":"2025-07-08T11:33:18Z","timestamp":1751974398000},"source":"Crossref","is-referenced-by-count":0,"title":["AncestryGeni: a novel genetic ancestry classification pipeline for small and noisy sequence data"],"prefix":"10.1093","volume":"41","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4795-1084","authenticated-orcid":false,"given":"Eran","family":"Elhaik","sequence":"first","affiliation":[{"name":"Department of Biology, Lund University , Lund 
22362,","place":["Sweden"]}]},{"given":"Sara","family":"Behnamian","sequence":"additional","affiliation":[{"name":"Centre for GeoGenetics, Globe Institute, University of Copenhagen , Copenhagen, 1350,","place":["Denmark"]},{"name":"Pioneer Centre for AI, University of Copenhagen , Copenhagen, 1350,","place":["Denmark"]}]},{"given":"Michael","family":"Howe","sequence":"additional","affiliation":[{"name":"Division of Hematology, Department of Internal Medicine, Mayo Clinic , Rochester, MN, 55905,","place":["United States"]}]},{"given":"Hongwei","family":"Tang","sequence":"additional","affiliation":[{"name":"Division of Hematopathology, Department of Laboratory Medicine and Pathology, Mayo Clinic , Rochester, MN, 55905,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0756-2922","authenticated-orcid":false,"given":"Huihuang","family":"Yan","sequence":"additional","affiliation":[{"name":"Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic , Rochester, MN, 55905,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3348-7439","authenticated-orcid":false,"given":"Shulan","family":"Tian","sequence":"additional","affiliation":[{"name":"Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic , Rochester, MN, 55905,","place":["United States"]}]},{"given":"Suganti","family":"Shivaram","sequence":"additional","affiliation":[{"name":"Division of Hematopathology, Department of Laboratory Medicine and Pathology, Mayo Clinic , Rochester, MN, 55905,","place":["United States"]}]},{"given":"Cinthya","family":"Zepeda Mendoza","sequence":"additional","affiliation":[{"name":"Division of Hematopathology, Department of Laboratory Medicine and Pathology, Mayo Clinic , Rochester, MN, 55905,","place":["United States"]}]},{"given":"Kylee","family":"MacLachlan","sequence":"additional","affiliation":[{"name":"Myeloma Service, Department of Medicine, Memorial Sloan Kettering Cancer 
Center , New York, NY, 10065,","place":["United States"]}]},{"given":"Saad","family":"Usmani","sequence":"additional","affiliation":[{"name":"Myeloma Service, Department of Medicine, Memorial Sloan Kettering Cancer Center , New York, NY, 10065,","place":["United States"]}]},{"given":"Mehdi","family":"Pirooznia","sequence":"additional","affiliation":[{"name":"School of Medicine, Johns Hopkins University , Baltimore, MD, 21205,","place":["United States"]},{"name":"Interventional Oncology, Johnson & Johnson Enterprise R&D , Raritan, NJ, 08869,","place":["United States"]}]},{"given":"Gareth","family":"Morgan","sequence":"additional","affiliation":[{"name":"Multiple Myeloma Research Program, Perlmutter Cancer Center, NYU Langone Medical Center , New York, NY, 10016,","place":["United States"]}]},{"given":"Patrick","family":"Blaney","sequence":"additional","affiliation":[{"name":"Multiple Myeloma Research Program, Perlmutter Cancer Center, NYU Langone Medical Center , New York, NY, 10016,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5017-1620","authenticated-orcid":false,"given":"Francesco","family":"Maura","sequence":"additional","affiliation":[{"name":"Myeloma Service, Department of Medicine, Memorial Sloan Kettering Cancer Center , New York, NY, 10065,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5229-4897","authenticated-orcid":false,"given":"Linda B","family":"Baughn","sequence":"additional","affiliation":[{"name":"Division of Hematopathology, Department of Laboratory Medicine and Pathology, Mayo Clinic , Rochester, MN, 55905,","place":["United States"]}]}],"member":"286","published-online":{"date-parts":[[2025,7,8]]},"reference":[{"key":"2025072501120991400_btaf391-B1","doi-asserted-by":"crossref","first-page":"246","DOI":"10.1186\/1471-2105-12-246","article-title":"Enhancements to the ADMIXTURE algorithm for individual ancestry estimation","volume":"12","author":"Alexander","year":"2011","journal-title":"BMC 
Bioinformatics"},{"key":"2025072501120991400_btaf391-B2","doi-asserted-by":"crossref","first-page":"1359","DOI":"10.1093\/bioinformatics\/bts144","article-title":"Fast and accurate inference of local ancestry in Latino populations","volume":"28","author":"Baran","year":"2012","journal-title":"Bioinformatics"},{"key":"2025072501120991400_btaf391-B3","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1038\/s41408-020-0294-5","article-title":"The CCND1 c.870G risk allele is enriched in individuals of African ancestry with plasma cell dyscrasias","volume":"10","author":"Baughn","year":"2020","journal-title":"Blood Cancer J"},{"key":"2025072501120991400_btaf391-B4","doi-asserted-by":"crossref","first-page":"96","DOI":"10.1038\/s41408-018-0132-1","article-title":"Differences in genomic abnormalities among African individuals with monoclonal gammopathies using calculated ancestry","volume":"8","author":"Baughn","year":"2018","journal-title":"Blood Cancer J"},{"key":"2025072501120991400_btaf391-B5","doi-asserted-by":"crossref","first-page":"100270","DOI":"10.1016\/j.crmeth.2022.100270","article-title":"Temporal population structure, a genetic dating method for ancient Eurasian genomes from the past 10,000 years","volume":"2","author":"Behnamian","year":"2022","journal-title":"Cell Rep Methods"},{"key":"2025072501120991400_btaf391-B6","doi-asserted-by":"crossref","first-page":"2114","DOI":"10.1093\/bioinformatics\/btu170","article-title":"Trimmomatic: a flexible trimmer for Illumina sequence data","volume":"30","author":"Bolger","year":"2014","journal-title":"Bioinformatics"},{"key":"2025072501120991400_btaf391-B7","doi-asserted-by":"crossref","first-page":"37","DOI":"10.1016\/j.ajhg.2014.11.010","article-title":"The genetic ancestry of African Americans, Latinos, and European Americans across the United States","volume":"96","author":"Bryc","year":"2015","journal-title":"Am J Hum 
Genet"},{"key":"2025072501120991400_btaf391-B8","doi-asserted-by":"crossref","first-page":"351","DOI":"10.1186\/s12864-021-07618-x","article-title":"Population genetic considerations for using biobanks as international resources in the pandemic era and beyond","volume":"22","author":"Carress","year":"2021","journal-title":"BMC Genomics"},{"year":"2022","author":"Coop","key":"2025072501120991400_btaf391-B9","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2207.11595"},{"year":"2006","author":"Davis","key":"2025072501120991400_btaf391-B10"},{"key":"2025072501120991400_btaf391-B11","doi-asserted-by":"crossref","first-page":"e49837","DOI":"10.1371\/journal.pone.0049837","article-title":"Empirical distributions of F(ST) from large-scale human polymorphism data","volume":"7","author":"Elhaik","year":"2012","journal-title":"PLoS One"},{"key":"2025072501120991400_btaf391-B12","doi-asserted-by":"crossref","first-page":"14683","DOI":"10.1038\/s41598-022-14395-4","article-title":"Principal component analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated","volume":"12","author":"Elhaik","year":"2022","journal-title":"Sci Rep"},{"key":"2025072501120991400_btaf391-B13","doi-asserted-by":"crossref","first-page":"2243","DOI":"10.1093\/bioinformatics\/bty946","article-title":"Pair Matcher (PaM): fast model-based optimization of treatment\/case-control matches","volume":"35","author":"Elhaik","year":"2019","journal-title":"Bioinformatics"},{"key":"2025072501120991400_btaf391-B14","doi-asserted-by":"crossref","first-page":"3513","DOI":"10.1038\/ncomms4513","article-title":"Geographic population structure analysis of worldwide human populations infers their biogeographical origins","volume":"5","author":"Elhaik","year":"2014","journal-title":"Nat Commun"},{"key":"2025072501120991400_btaf391-B15","doi-asserted-by":"crossref","first-page":"1","DOI":"10.3390\/genes9120625","article-title":"Ancient ancestry informative markers for 
identifying fine-scale ancient population structure in Eurasians","volume":"9","author":"Esposito","year":"2018","journal-title":"Genes (Basel)"},{"key":"2025072501120991400_btaf391-B16","doi-asserted-by":"crossref","first-page":"1567","DOI":"10.1093\/genetics\/164.4.1567","article-title":"Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies","volume":"164","author":"Falush","year":"2003","journal-title":"Genetics"},{"key":"2025072501120991400_btaf391-B17","first-page":"1","article-title":"Linear discriminant analysis","volume":"392","author":"Fischer","year":"1936","journal-title":"Stat Discrete Methods Data Sci"},{"key":"2025072501120991400_btaf391-B18","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1038\/nature15393","article-title":"A global reference for human genetic variation","volume":"526","author":"Genomes Project","year":"2015","journal-title":"Nature"},{"key":"2025072501120991400_btaf391-B19","doi-asserted-by":"crossref","first-page":"578","DOI":"10.1093\/gbe\/evt028","article-title":"On the immortality of television sets: \"function\" in the human genome according to the evolution-free gospel of ENCODE","volume":"5","author":"Graur","year":"2013","journal-title":"Genome Biol Evol"},{"key":"2025072501120991400_btaf391-B20","first-page":"xi","volume-title":"Signal Detection Theory and Psychophysics","author":"Green","year":"1966"},{"key":"2025072501120991400_btaf391-B21","doi-asserted-by":"crossref","first-page":"609","DOI":"10.1038\/leu.2011.368","article-title":"Disparities in the prevalence, pathogenesis and progression of monoclonal gammopathy of undetermined significance and multiple myeloma between blacks and whites","volume":"26","author":"Greenberg","year":"2012","journal-title":"Leukemia"},{"key":"2025072501120991400_btaf391-B22","doi-asserted-by":"crossref","first-page":"986","DOI":"10.1002\/humu.24298","article-title":"Annotating and prioritizing genomic variants using the 
ensembl variant effect predictor \u2013 a tutorial","volume":"43","author":"Hunt","year":"2022","journal-title":"Hum Mutat"},{"key":"2025072501120991400_btaf391-B23","doi-asserted-by":"crossref","first-page":"3148","DOI":"10.1093\/bioinformatics\/btx474","article-title":"fastNGSadmix: admixture proportions and principal component analysis of a single NGS sample","volume":"33","author":"Jorsboe","year":"2017","journal-title":"Bioinformatics"},{"key":"2025072501120991400_btaf391-B24","doi-asserted-by":"crossref","first-page":"591","DOI":"10.1038\/s41592-018-0051-x","article-title":"Strelka2: fast and accurate calling of germline and somatic variants","volume":"15","author":"Kim","year":"2018","journal-title":"Nat Methods"},{"key":"2025072501120991400_btaf391-B25","doi-asserted-by":"crossref","first-page":"356","DOI":"10.1186\/s12859-014-0356-4","article-title":"ANGSD: analysis of next generation sequencing data","volume":"15","author":"Korneliussen","year":"2014","journal-title":"BMC Bioinformatics"},{"key":"2025072501120991400_btaf391-B26","doi-asserted-by":"crossref","first-page":"e108","DOI":"10.1093\/nar\/gkw227","article-title":"VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research","volume":"44","author":"Lai","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2025072501120991400_btaf391-B27","doi-asserted-by":"crossref","first-page":"1468","DOI":"10.1016\/S0025-6196(11)61089-6","article-title":"Prevalence of monoclonal gammopathy of undetermined significance among men in Ghana","volume":"82","author":"Landgren","year":"2007","journal-title":"Mayo Clin Proc"},{"key":"2025072501120991400_btaf391-B28","doi-asserted-by":"crossref","first-page":"791","DOI":"10.1182\/blood-2008-12-191676","article-title":"Risk of plasma cell and lymphoproliferative disorders among 14621 first-degree relatives of 4458 patients with monoclonal gammopathy of undetermined significance in 
Sweden","volume":"114","author":"Landgren","year":"2009","journal-title":"Blood"},{"key":"2025072501120991400_btaf391-B29","doi-asserted-by":"crossref","first-page":"5412","DOI":"10.1182\/blood-2008-12-194241","article-title":"Monoclonal gammopathy of undetermined significance (MGUS) consistently precedes multiple myeloma: a prospective study","volume":"113","author":"Landgren","year":"2009","journal-title":"Blood"},{"key":"2025072501120991400_btaf391-B30","doi-asserted-by":"crossref","first-page":"e1002453","DOI":"10.1371\/journal.pgen.1002453","article-title":"Inference of population structure using dense haplotype data","volume":"8","author":"Lawson","year":"2012","journal-title":"PLoS Genet"},{"key":"2025072501120991400_btaf391-B31","doi-asserted-by":"crossref","first-page":"589","DOI":"10.1093\/bioinformatics\/btp698","article-title":"Fast and accurate long-read alignment with Burrows\u2013Wheeler transform","volume":"26","author":"Li","year":"2010","journal-title":"Bioinformatics"},{"key":"2025072501120991400_btaf391-B32","doi-asserted-by":"crossref","first-page":"1483","DOI":"10.1126\/science.aab4082","article-title":"Somatic mutation in cancer and normal cells","volume":"349","author":"Martincorena","year":"2015","journal-title":"Science"},{"key":"2025072501120991400_btaf391-B33","doi-asserted-by":"crossref","first-page":"e1008624","DOI":"10.1371\/journal.pgen.1008624","article-title":"What is ancestry?","volume":"16","author":"Mathieson","year":"2020","journal-title":"PLoS Genet"},{"key":"2025072501120991400_btaf391-B34","doi-asserted-by":"crossref","first-page":"1229","DOI":"10.1200\/JCO.23.01277","article-title":"Genomic classification and individualized prognosis in multiple myeloma","volume":"42","author":"Maura","year":"2024","journal-title":"J Clin Oncol"},{"volume-title":"Using Population Descriptors in Genetics and Genomics Research: A New Framework for an Evolving Field","year":"2023","author":"National Academies of Sciences, Engineering, and 
Medicine","key":"2025072501120991400_btaf391-B35"},{"key":"2025072501120991400_btaf391-B36","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1038\/538161a","article-title":"Genomics is failing on diversity","volume":"538","author":"Popejoy","year":"2016","journal-title":"Nature"},{"key":"2025072501120991400_btaf391-B37","doi-asserted-by":"crossref","first-page":"983","DOI":"10.1038\/nbt.4235","article-title":"A universal SNP and small-indel variant caller using deep neural networks","volume":"36","author":"Poplin","year":"2018","journal-title":"Nat Biotechnol"},{"key":"2025072501120991400_btaf391-B38","doi-asserted-by":"crossref","first-page":"3107","DOI":"10.1200\/JCO.20.00461","article-title":"Genome-wide somatic alterations in multiple myeloma reveal a superior outcome group","volume":"38","author":"Samur","year":"2020","journal-title":"J Clin Oncol"},{"key":"2025072501120991400_btaf391-B39","first-page":"12","article-title":"Cancer statistics, 2024","volume":"74","author":"Siegel","year":"2024","journal-title":"CA: Cancer J Clin"},{"key":"2025072501120991400_btaf391-B40","doi-asserted-by":"crossref","first-page":"1080","DOI":"10.1016\/j.cell.2019.04.032","article-title":"The missing diversity in human genetic studies","volume":"177","author":"Sirugo","year":"2019","journal-title":"Cell"},{"key":"2025072501120991400_btaf391-B41","doi-asserted-by":"crossref","first-page":"1878","DOI":"10.1038\/s41588-024-01853-0","article-title":"Comprehensive molecular profiling of multiple myeloma identifies refined copy number and expression subtypes","volume":"56","author":"Skerget","year":"2024","journal-title":"Nat Genet"},{"key":"2025072501120991400_btaf391-B42","doi-asserted-by":"crossref","first-page":"75","DOI":"10.1038\/nature15394","article-title":"An integrated map of structural variation in 2,504 human 
genomes","volume":"526","author":"Sudmant","year":"2015","journal-title":"Nature"},{"key":"2025072501120991400_btaf391-B43","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1002\/gepi.20064","article-title":"Estimation of individual admixture: analytical and study design considerations","volume":"28","author":"Tang","year":"2005","journal-title":"Genet Epidemiol"},{"key":"2025072501120991400_btaf391-B44","doi-asserted-by":"crossref","first-page":"5418","DOI":"10.1182\/blood-2008-12-195008","article-title":"A monoclonal gammopathy precedes multiple myeloma in most patients","volume":"113","author":"Weiss","year":"2009","journal-title":"Blood"},{"key":"2025072501120991400_btaf391-B45","doi-asserted-by":"crossref","first-page":"2263","DOI":"10.1056\/NEJMra1510065","article-title":"Interpreting geographic variations in results of randomized, controlled trials","volume":"375","author":"Yusuf","year":"2016","journal-title":"N Engl J Med"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaf391\/63703925\/btaf391.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/7\/btaf391\/63703925\/btaf391.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/7\/btaf391\/63703925\/btaf391.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,25]],"date-time":"2025-07-25T05:12:22Z","timestamp":1753420342000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btaf391\/8193679"}},"subtitle":[],"editor":[{"given":"Russell","family":"Schwartz","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2025,7,1]]},"references-count":45,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2025,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaf391","relation":{},"ISSN":["1367-4811"],"issn-type":[{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2025,7]]},"published":{"date-parts":[[2025,7,1]]},"article-number":"btaf391"}}