{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,8]],"date-time":"2026-02-08T18:43:14Z","timestamp":1770576194580,"version":"3.49.0"},"reference-count":51,"publisher":"Oxford University Press (OUP)","issue":"2","license":[{"start":{"date-parts":[[2024,1,25]],"date-time":"2024-01-25T00:00:00Z","timestamp":1706140800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003977","name":"Israel Science Foundation","doi-asserted-by":"publisher","award":["2818\/21"],"award-info":[{"award-number":["2818\/21"]}],"id":[{"id":"10.13039\/501100003977","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,2,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Insertions and deletions (indels) of short DNA segments, along with substitutions, are the most frequent molecular evolutionary events. Indels were shown to affect numerous macro-evolutionary processes. Because indels may span multiple positions, their impact is a product of both their rate and their length distribution. An accurate inference of indel-length distribution is important for multiple evolutionary and bioinformatics applications, most notably for alignment software. Previous studies counted the number of continuous gap characters in alignments to determine the best-fitting length distribution. However, gap-counting methods are not statistically rigorous, as gap blocks are not synonymous with indels. Furthermore, such methods rely on alignments that regularly contain errors and are biased due to the assumption of alignment methods that indels lengths follow a geometric distribution.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>We aimed to determine which indel-length distribution best characterizes alignments using statistical rigorous methodologies. To this end, we reduced the alignment bias using a machine-learning algorithm and applied an Approximate Bayesian Computation methodology for model selection. Moreover, we developed a novel method to test if current indel models provide an adequate representation of the evolutionary process. We found that the best-fitting model varies among alignments, with a Zipf length distribution fitting the vast majority of them.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>The data underlying this article are available in Github, at https:\/\/github.com\/elyawy\/SpartaSim and https:\/\/github.com\/elyawy\/SpartaPipeline.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btae043","type":"journal-article","created":{"date-parts":[[2024,1,25]],"date-time":"2024-01-25T09:46:46Z","timestamp":1706176006000},"source":"Crossref","is-referenced-by-count":9,"title":["Statistical framework to determine indel-length distribution"],"prefix":"10.1093","volume":"40","author":[{"given":"Elya","family":"Wygoda","sequence":"first","affiliation":[{"name":"The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University , Tel Aviv 69978, Israel"}]},{"given":"Gil","family":"Loewenthal","sequence":"additional","affiliation":[{"name":"The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University , Tel Aviv 69978, Israel"}]},{"given":"Asher","family":"Moshe","sequence":"additional","affiliation":[{"name":"The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University , Tel Aviv 69978, Israel"}]},{"given":"Michael","family":"Alburquerque","sequence":"additional","affiliation":[{"name":"The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University , Tel Aviv 69978, Israel"}]},{"given":"Itay","family":"Mayrose","sequence":"additional","affiliation":[{"name":"School of Plant Sciences and Food Security, George S. Wise Faculty of Life Sciences, Tel Aviv University , Tel Aviv 69978, Israel"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9463-2575","authenticated-orcid":false,"given":"Tal","family":"Pupko","sequence":"additional","affiliation":[{"name":"The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University , Tel Aviv 69978, Israel"}]}],"member":"286","published-online":{"date-parts":[[2024,1,25]]},"reference":[{"key":"2024021007494372200_btae043-B1","doi-asserted-by":"crossref","first-page":"603","DOI":"10.1016\/S0092-8240(86)90010-8","article-title":"Optimal sequence alignment using affine gap costs","volume":"48","author":"Altschul","year":"1986","journal-title":"Bull Math Biol"},{"key":"2024021007494372200_btae043-B2","doi-asserted-by":"crossref","first-page":"7708","DOI":"10.1073\/pnas.1230533100","article-title":"Comparative sequencing of human and chimpanzee MHC class I regions unveils insertions\/deletions as the major path to genomic divergence","volume":"100","author":"Anzai","year":"2003","journal-title":"Proc Natl Acad Sci USA"},{"key":"2024021007494372200_btae043-B3","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1038\/nature15393","article-title":"A global reference for human genetic variation","volume":"526","author":"Auton","year":"2015","journal-title":"Nature"},{"key":"2024021007494372200_btae043-B51","doi-asserted-by":"crossref","first-page":"W580","DOI":"10.1093\/nar\/gks498","article-title":"FastML: a web server for probabilistic reconstruction of ancestral sequences","volume":"40","author":"Ashkenazy","year":"2012","journal-title":"Nucleic Acids Res"},{"key":"2024021007494372200_btae043-B4","doi-asserted-by":"crossref","first-page":"2025","DOI":"10.1093\/genetics\/162.4.2025","article-title":"Approximate Bayesian computation in population genetics","volume":"162","author":"Beaumont","year":"2002","journal-title":"Genetics"},{"key":"2024021007494372200_btae043-B5","doi-asserted-by":"crossref","first-page":"1065","DOI":"10.1006\/jmbi.1993.1105","article-title":"Empirical and structural models for insertions and deletions in the divergent evolution of proteins","volume":"229","author":"Benner","year":"1993","journal-title":"J Mol Biol"},{"key":"2024021007494372200_btae043-B6","doi-asserted-by":"crossref","first-page":"1160","DOI":"10.1073\/pnas.1220450110","article-title":"Evolutionary inference via the Poisson indel process","volume":"110","author":"Bouchard-C\u00f4t\u00e9","year":"2013","journal-title":"Proc Natl Acad Sci USA"},{"key":"2024021007494372200_btae043-B7","doi-asserted-by":"crossref","first-page":"13633","DOI":"10.1073\/pnas.172510699","article-title":"Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels","volume":"99","author":"Britten","year":"2002","journal-title":"Proc Natl Acad Sci USA"},{"key":"2024021007494372200_btae043-B8","doi-asserted-by":"crossref","first-page":"iii31","DOI":"10.1093\/bioinformatics\/bti1200","article-title":"DNA assembly with gaps (Dawg): simulating sequence evolution","volume":"21 (Suppl. 3)","author":"Cartwright","year":"2005","journal-title":"Bioinformatics"},{"key":"2024021007494372200_btae043-B9","doi-asserted-by":"crossref","first-page":"527","DOI":"10.1186\/1471-2105-7-527","article-title":"Logarithmic gap costs decrease alignment accuracy","volume":"7","author":"Cartwright","year":"2006","journal-title":"BMC Bioinformatics"},{"key":"2024021007494372200_btae043-B10","doi-asserted-by":"crossref","first-page":"473","DOI":"10.1093\/molbev\/msn275","article-title":"Problems and solutions for estimating indel rates and length distributions","volume":"26","author":"Cartwright","year":"2009","journal-title":"Mol Biol Evol"},{"key":"2024021007494372200_btae043-B11","doi-asserted-by":"crossref","first-page":"3903","DOI":"10.1098\/rstb.2008.0177","article-title":"A likelihood framework to analyse phyletic patterns","volume":"363","author":"Cohen","year":"2008","journal-title":"Philos Trans R Soc Lond B Biol Sci"},{"key":"2024021007494372200_btae043-B12","author":"Dotan","year":"2023"},{"key":"2024021007494372200_btae043-B13","doi-asserted-by":"crossref","first-page":"113","DOI":"10.1186\/1471-2105-5-113","article-title":"MUSCLE: a multiple sequence alignment method with reduced time and space complexity","volume":"5","author":"Edgar","year":"2004","journal-title":"BMC Bioinformatics"},{"key":"2024021007494372200_btae043-B14","doi-asserted-by":"crossref","first-page":"370","DOI":"10.2174\/138920207783406479","article-title":"Patterns of insertion and deletion in mammalian genomes","volume":"8","author":"Fan","year":"2007","journal-title":"Curr Genomics"},{"key":"2024021007494372200_btae043-B15","doi-asserted-by":"crossref","first-page":"1879","DOI":"10.1093\/molbev\/msp098","article-title":"INDELible: a flexible simulator of biological sequence evolution","volume":"26","author":"Fletcher","year":"2009","journal-title":"Mol Biol Evol"},{"key":"2024021007494372200_btae043-B16","doi-asserted-by":"crossref","first-page":"1","DOI":"10.2202\/1544-6115.1678","article-title":"Deviance information criteria for model selection in approximate Bayesian computation","volume":"10","author":"Francois","year":"2011","journal-title":"Stat Appl Genet Mol Biol"},{"key":"2024021007494372200_btae043-B17","doi-asserted-by":"crossref","first-page":"2340","DOI":"10.1021\/j100540a008","article-title":"Exact stochastic simulation of coupled chemical reactions","volume":"81","author":"Gillespie","year":"1977","journal-title":"J Phys Chem"},{"key":"2024021007494372200_btae043-B18","doi-asserted-by":"crossref","first-page":"52","DOI":"10.1006\/mpev.1993.1006","article-title":"Evolution of a noncoding region of the chloroplast genome","volume":"2","author":"Golenberg","year":"1993","journal-title":"Mol Phylogenet Evol"},{"key":"2024021007494372200_btae043-B19","doi-asserted-by":"crossref","first-page":"464","DOI":"10.1007\/BF00164032","article-title":"The size distribution of insertions and deletions in human and rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment","volume":"40","author":"Gu","year":"1995","journal-title":"J Mol Evol"},{"key":"2024021007494372200_btae043-B20","doi-asserted-by":"crossref","first-page":"D309","DOI":"10.1093\/nar\/gky1085","article-title":"eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses","volume":"47","author":"Huerta-Cepas","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2024021007494372200_btae043-B21","doi-asserted-by":"crossref","first-page":"329","DOI":"10.1534\/genetics.108.090431","article-title":"Multilocus patterns of nucleotide polymorphism and the demographic history of Populus tremula","volume":"180","author":"Ingvarsson","year":"2008","journal-title":"Genetics"},{"key":"2024021007494372200_btae043-B22","doi-asserted-by":"crossref","first-page":"7217","DOI":"10.1093\/nar\/gkv677","article-title":"The missing indels: an estimate of indel variation in a human genome and analysis of factors that impede detection","volume":"43","author":"Jiang","year":"2015","journal-title":"Nucleic Acids Res"},{"key":"2024021007494372200_btae043-B23","doi-asserted-by":"crossref","first-page":"1280","DOI":"10.1093\/gbe\/evx084","article-title":"Inferring rates and length-distributions of indels using approximate Bayesian computation","volume":"9","author":"Karin","year":"2017","journal-title":"Genome Biol Evol"},{"key":"2024021007494372200_btae043-B24","doi-asserted-by":"crossref","first-page":"772","DOI":"10.1093\/molbev\/mst010","article-title":"MAFFT multiple sequence alignment software version 7: improvements in performance and usability","volume":"30","author":"Katoh","year":"2013","journal-title":"Mol Biol Evol"},{"key":"2024021007494372200_btae043-B25","doi-asserted-by":"crossref","first-page":"111","DOI":"10.1007\/BF01731581","article-title":"A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences","volume":"16","author":"Kimura","year":"1980","journal-title":"J Mol Evol"},{"key":"2024021007494372200_btae043-B26","doi-asserted-by":"crossref","first-page":"957","DOI":"10.1038\/s41559-019-0881-7","article-title":"Ancient admixture from an extinct ape lineage into bonobos","volume":"3","author":"Kuhlwilm","year":"2019","journal-title":"Nat Ecol Evol"},{"key":"2024021007494372200_btae043-B27","doi-asserted-by":"crossref","DOI":"10.1093\/acprof:oso\/9780199299188.001.0001","volume-title":"Ancestral Sequence Reconstruction","author":"Liberles","year":"2007"},{"key":"2024021007494372200_btae043-B28","doi-asserted-by":"crossref","first-page":"5769","DOI":"10.1093\/molbev\/msab266","article-title":"A probabilistic model for indel evolution: differentiating insertions from deletions","volume":"38","author":"Loewenthal","year":"2021","journal-title":"Mol Biol Evol"},{"key":"2024021007494372200_btae043-B29","doi-asserted-by":"crossref","first-page":"220223","DOI":"10.1098\/rsob.220223","article-title":"The evolutionary dynamics that retain long neutral genomic sequences in face of indel deletion bias: a model and its application to human introns","volume":"12","author":"Loewenthal","year":"2022","journal-title":"Open Biol"},{"key":"2024021007494372200_btae043-B30","doi-asserted-by":"crossref","first-page":"85","DOI":"10.1093\/nar\/28.1.85","article-title":"YIDB: the yeast intron database","volume":"28","author":"Lopez","year":"2000","journal-title":"Nucleic Acids Res"},{"key":"2024021007494372200_btae043-B31","doi-asserted-by":"crossref","first-page":"155","DOI":"10.1007\/978-1-62703-646-7_10","volume-title":"Multiple Sequence Alignment Methods","author":"L\u00f6ytynoja","year":"2014"},{"key":"2024021007494372200_btae043-B32","doi-asserted-by":"crossref","first-page":"298","DOI":"10.1101\/gr.6725608","article-title":"Uncertainty in homology inferences: assessing and improving genomic sequence alignment","volume":"18","author":"Lunter","year":"2008","journal-title":"Genome Res"},{"key":"2024021007494372200_btae043-B33","first-page":"49","author":"Mahalanobis","year":"1936"},{"key":"2024021007494372200_btae043-B34","doi-asserted-by":"crossref","first-page":"lqaa092","DOI":"10.1093\/nargab\/lqaa092","article-title":"Accelerating phylogeny-aware alignment with indel evolution using short time Fourier transform","volume":"2","author":"Maiolo","year":"2020","journal-title":"NAR Genom Bioinform"},{"key":"2024021007494372200_btae043-B35","doi-asserted-by":"crossref","first-page":"331","DOI":"10.1186\/s12859-018-2357-1","article-title":"Progressive multiple sequence alignment with indel evolution","volume":"19","author":"Maiolo","year":"2018","journal-title":"BMC Bioinformatics"},{"key":"2024021007494372200_btae043-B36","doi-asserted-by":"crossref","first-page":"770","DOI":"10.1093\/oxfordjournals.molbev.a025980","article-title":"Genome size and intron size in drosophila","volume":"15","author":"Moriyama","year":"1998","journal-title":"Mol Biol Evol"},{"key":"2024021007494372200_btae043-B37","doi-asserted-by":"crossref","first-page":"msac231","DOI":"10.1093\/molbev\/msac231","article-title":"An approximate Bayesian computation approach for modeling genome rearrangements","volume":"39","author":"Moshe","year":"2022","journal-title":"Mol Biol Evol"},{"key":"2024021007494372200_btae043-B38","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1016\/0022-2836(70)90057-4","article-title":"A general method applicable to the search for similarities in the amino acid sequence of two proteins","volume":"48","author":"Needleman","year":"1970","journal-title":"J Mol Biol"},{"key":"2024021007494372200_btae043-B39","doi-asserted-by":"crossref","first-page":"99","DOI":"10.1016\/0014-5793(96)00636-9","article-title":"The size differences among mammalian introns are due to the accumulation of small deletions","volume":"390","author":"Ogata","year":"1996","journal-title":"FEBS Lett"},{"key":"2024021007494372200_btae043-B40","doi-asserted-by":"crossref","first-page":"461","DOI":"10.1016\/0022-2836(92)91008-D","article-title":"Analysis of insertions\/deletions in protein structures","volume":"224","author":"Pascarella","year":"1992","journal-title":"J Mol Biol"},{"key":"2024021007494372200_btae043-B41","doi-asserted-by":"crossref","first-page":"1667","DOI":"10.1093\/genetics\/164.4.1667","article-title":"Estimating the time since the fixation of a beneficial allele","volume":"164","author":"Przeworski","year":"2003","journal-title":"Genetics"},{"key":"2024021007494372200_btae043-B42","doi-asserted-by":"crossref","first-page":"102","DOI":"10.1002\/prot.1129","article-title":"Distribution of indel lengths","volume":"45","author":"Qian","year":"2001","journal-title":"Proteins"},{"key":"2024021007494372200_btae043-B43","first-page":"504","article-title":"Evolutionary rates of insertion and deletion in noncoding nucleotide sequences of primates","volume":"11","author":"Saitou","year":"1994","journal-title":"Mol Biol Evol"},{"key":"2024021007494372200_btae043-B44","doi-asserted-by":"crossref","first-page":"369","DOI":"10.1093\/sysbio\/49.2.369","article-title":"Gaps as characters in sequence-based phylogenetic analyses","volume":"49","author":"Simmons","year":"2000","journal-title":"Syst Biol"},{"key":"2024021007494372200_btae043-B45","doi-asserted-by":"crossref","first-page":"195","DOI":"10.1016\/0022-2836(81)90087-5","article-title":"Identification of common molecular subsequences","volume":"147","author":"Smith","year":"1981","journal-title":"J Mol Biol"},{"key":"2024021007494372200_btae043-B46","doi-asserted-by":"crossref","first-page":"299","DOI":"10.1111\/j.1471-8286.2007.01997.x","article-title":"Onesamp: a program to estimate effective population size using approximate Bayesian computation","volume":"8","author":"Tallmon","year":"2008","journal-title":"Mol Ecol Resour"},{"key":"2024021007494372200_btae043-B47","doi-asserted-by":"crossref","first-page":"R37","DOI":"10.1186\/gb-2008-9-2-r37","article-title":"Sequence context affects the rate of short insertions and deletions in flies and primates","volume":"9","author":"Tanay","year":"2008","journal-title":"Genome Biol"},{"key":"2024021007494372200_btae043-B48","doi-asserted-by":"crossref","first-page":"114","DOI":"10.1007\/BF02193625","article-title":"An evolutionary model for maximum likelihood alignment of DNA sequences","volume":"33","author":"Thorne","year":"1991","journal-title":"J Mol Evol"},{"issue":"7","key":"2024021007494372200_btae043-B52","doi-asserted-by":"crossref","first-page":"1783","DOI":"10.1093\/molbev\/msy055","article-title":"Alignment Modulates Ancestral Sequence Reconstruction Accuracy","volume":"35","author":"Vialle","year":"2018","journal-title":"Mol Biol Evol"},{"key":"2024021007494372200_btae043-B50","doi-asserted-by":"crossref","first-page":"682","DOI":"10.1007\/s00239-006-0045-7","article-title":"Comparative genomic analysis of human and chimpanzee indicates a key role for indels in primate evolution","volume":"63","author":"Wetterbom","year":"2006","journal-title":"J Mol Evol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btae043\/56412559\/btae043.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/2\/btae043\/56645496\/btae043.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/2\/btae043\/56645496\/btae043.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,10]],"date-time":"2024-02-10T07:50:30Z","timestamp":1707551430000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btae043\/7588892"}},"subtitle":[],"editor":[{"given":"Russell","family":"Schwartz","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2024,1,25]]},"references-count":51,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,2,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btae043","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024,2,1]]},"published":{"date-parts":[[2024,1,25]]},"article-number":"btae043"}}