{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T23:16:19Z","timestamp":1776122179110,"version":"3.50.1"},"reference-count":30,"publisher":"Oxford University Press (OUP)","issue":"6","license":[{"start":{"date-parts":[[2020,10,27]],"date-time":"2020-10-27T00:00:00Z","timestamp":1603756800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["DBI-1350041"],"award-info":[{"award-number":["DBI-1350041"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["IOS-1445025"],"award-info":[{"award-number":["IOS-1445025"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["IOS-1732253"],"award-info":[{"award-number":["IOS-1732253"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["IOS-1758800"],"award-info":[{"award-number":["IOS-1758800"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,5,5]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>As genomic data becomes more abundant, efficient algorithms and data structures for sequence alignment become increasingly important. The suffix array is a widely used data structure to accelerate alignment, but the binary search algorithm used to query, it requires widespread memory accesses, causing a large number of cache misses on large datasets.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>Here, we present Sapling, an algorithm for sequence alignment, which uses a learned data model to augment the suffix array and enable faster queries. We investigate different types of data models, providing an analysis of different neural network models as well as providing an open-source aligner with a compact, practical piecewise linear model. We show that Sapling outperforms both an optimized binary search approach and multiple widely used read aligners on a diverse collection of genomes, including human, bacteria and plants, speeding up the algorithm by more than a factor of two while adding &amp;lt;1% to the suffix array\u2019s memory footprint.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>The source code and tutorial are available open-source at https:\/\/github.com\/mkirsche\/sapling.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaa911","type":"journal-article","created":{"date-parts":[[2020,10,13]],"date-time":"2020-10-13T15:31:15Z","timestamp":1602603075000},"page":"744-749","source":"Crossref","is-referenced-by-count":19,"title":["Sapling: accelerating suffix array queries with learned data models"],"prefix":"10.1093","volume":"37","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6631-4761","authenticated-orcid":false,"given":"Melanie","family":"Kirsche","sequence":"first","affiliation":[{"name":"Department of Computer Science, Johns Hopkins University , Baltimore, MD 21218, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0332-1956","authenticated-orcid":false,"given":"Arun","family":"Das","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Johns Hopkins University , Baltimore, MD 21218, USA"}]},{"given":"Michael C","family":"Schatz","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Johns Hopkins University , Baltimore, MD 21218, USA"},{"name":"Department of Biology, Johns Hopkins University , Baltimore, MD 21218, USA"},{"name":"Cold Spring Harbor Laboratory , Cold Spring Harbor, NY 11724, USA"}]}],"member":"286","published-online":{"date-parts":[[2020,10,27]]},"reference":[{"key":"2023051800332687500_btaa911-B1"},{"key":"2023051800332687500_btaa911-B2","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","article-title":"Basic local alignment search tool","volume":"215","author":"Altschul","year":"1990","journal-title":"J. Mol. Biol"},{"key":"2023051800332687500_btaa911-B3","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1016\/0020-0190(96)00083-X","article-title":"Fast and practical approximate string matching","volume":"59","author":"Baeza-Yates","year":"1996","journal-title":"Inf. Process. Lett"},{"key":"2023051800332687500_btaa911-B4","author":"Brett","year":"2016"},{"key":"2023051800332687500_btaa911-B5","doi-asserted-by":"crossref","first-page":"238","DOI":"10.1186\/1471-2105-13-238","article-title":"Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory","volume":"13","author":"Chaisson","year":"2012","journal-title":"BMC Bioinformatics"},{"key":"2023051800332687500_btaa911-B6","volume-title":"Handbook of Exact String Matching Algorithms","author":"Charras","year":"2004"},{"key":"2023051800332687500_btaa911-B7","doi-asserted-by":"crossref","first-page":"303","DOI":"10.1007\/BF02551274","article-title":"Approximation by superpositions of a sigmoidal function","volume":"2","author":"Cybenko","year":"1989","journal-title":"Math Control Signals Syst"},{"key":"2023051800332687500_btaa911-B8","doi-asserted-by":"crossref","first-page":"15","DOI":"10.1093\/bioinformatics\/bts635","article-title":"STAR: ultrafast universal RNA-seq aligner","volume":"29","author":"Dobin","year":"2013","journal-title":"Bioinformatics"},{"key":"2023051800332687500_btaa911-B9","first-page":"390","author":"Ferragina","year":"2000"},{"key":"2023051800332687500_btaa911-B10","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1186\/1471-2105-9-167","article-title":"Efficient computation of absent words in genomic sequences","volume":"9","author":"Herold","year":"2008","journal-title":"BMC Bioinformatics"},{"key":"2023051800332687500_btaa911-B11","article-title":"LISA: towards learned DNA sequence search","author":"Ho","year":"2019"},{"key":"2023051800332687500_btaa911-B12","year":"2016"},{"key":"2023051800332687500_btaa911-B13","doi-asserted-by":"crossref","first-page":"249","DOI":"10.1147\/rd.312.0249","article-title":"Efficient randomized pattern-matching algorithms","volume":"31","author":"Karp","year":"1987","journal-title":"IBM J. Res. Dev"},{"key":"2023051800332687500_btaa911-B14","article-title":"The case for learned index structures","author":"Kraska","year":"2017","journal-title":"arXiv: 1712.01208 [cs.DB]"},{"key":"2023051800332687500_btaa911-B15","doi-asserted-by":"crossref","first-page":"R25","DOI":"10.1186\/gb-2009-10-3-r25","article-title":"Ultrafast and memory-efficient alignment of short DNA sequences to the human genome","volume":"10","author":"Langmead","year":"2009","journal-title":"Genome Biol"},{"key":"2023051800332687500_btaa911-B16","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1038\/nmeth.1923","article-title":"Fast gapped-read alignment with Bowtie 2","volume":"9","author":"Langmead","year":"2012","journal-title":"Nat. Methods"},{"key":"2023051800332687500_btaa911-B17","article-title":"Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM","author":"Li","year":"2013","journal-title":"arXiv: 1303.3997 [q-Bio.GN]"},{"key":"2023051800332687500_btaa911-B18","doi-asserted-by":"crossref","first-page":"2078","DOI":"10.1093\/bioinformatics\/btp352","article-title":"The Sequence Alignment\/Map format and SAMtools","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023051800332687500_btaa911-B19","doi-asserted-by":"crossref","first-page":"935","DOI":"10.1137\/0222058","article-title":"Suffix arrays: a new method for on-line string searches","volume":"22","author":"Manber","year":"1993","journal-title":"SIAM J. Comput"},{"key":"2023051800332687500_btaa911-B20","doi-asserted-by":"crossref","first-page":"e1005944","DOI":"10.1371\/journal.pcbi.1005944","article-title":"MUMmer4: a fast and versatile genome alignment system","volume":"14","author":"Mar\u00e7ais","year":"2018","journal-title":"PLoS Comput. Biol"},{"key":"2023051800332687500_btaa911-B21","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1038\/nrg2986","article-title":"Genotype and SNP calling from next-generation sequencing data","volume":"12","author":"Nielsen","year":"2011","journal-title":"Nat. Rev. Genet"},{"key":"2023051800332687500_btaa911-B22","doi-asserted-by":"crossref","first-page":"669","DOI":"10.1038\/nrg2641","article-title":"ChIP-seq: advantages and challenges of a maturing technology","volume":"10","author":"Park","year":"2009","journal-title":"Nat. Rev. Genet"},{"key":"2023051800332687500_btaa911-B23","first-page":"8024","volume-title":"Advances in Neural Information Processing Systems 32","author":"Paszke","year":"2019"},{"key":"2023051800332687500_btaa911-B24","article-title":"Searching for activation functions","author":"Ramachandran","year":"2017","journal-title":"arXiv: 1710.05941 [cs.NE]"},{"key":"2023051800332687500_btaa911-B25","doi-asserted-by":"crossref","first-page":"461","DOI":"10.1038\/s41592-018-0001-7","article-title":"Accurate detection of complex structural variations using single-molecule sequencing","volume":"15","author":"Sedlazeck","year":"2018","journal-title":"Nat. Methods"},{"key":"2023051800332687500_btaa911-B26","doi-asserted-by":"crossref","first-page":"640","DOI":"10.1038\/msb.2012.61","article-title":"High-throughput sequencing for biology and medicine","volume":"9","author":"Soon","year":"2013","journal-title":"Mol. Syst. Biol"},{"key":"2023051800332687500_btaa911-B27","doi-asserted-by":"crossref","first-page":"802","DOI":"10.1093\/bioinformatics\/btt042","article-title":"essaMEM: finding maximal exact matches using enhanced sparse suffix arrays","volume":"29","author":"Vyverman","year":"2013","journal-title":"Bioinformatics"},{"key":"2023051800332687500_btaa911-B28","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1038\/nrg2484","article-title":"RNA-Seq: a revolutionary tool for transcriptomics","volume":"10","author":"Wang","year":"2009","journal-title":"Nat. Rev. Genet"},{"key":"2023051800332687500_btaa911-B29","first-page":"1","author":"Weiner","year":"1973"},{"key":"2023051800332687500_btaa911-B30","doi-asserted-by":"crossref","first-page":"e82138","DOI":"10.1371\/journal.pone.0082138","article-title":"SSW library: an SIMD Smith-Waterman C\/C++ library for use in genomic applications","volume":"8","author":"Zhao","year":"2013","journal-title":"PLoS One"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaa911\/35204230\/btaa911.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/6\/744\/50356368\/btaa911.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/6\/744\/50356368\/btaa911.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,17]],"date-time":"2023-05-17T20:34:58Z","timestamp":1684355698000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/6\/744\/5941464"}},"subtitle":[],"editor":[{"given":"Robinson","family":"Peter","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2020,10,27]]},"references-count":30,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2021,5,5]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaa911","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2020.01.29.925768","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,3,15]]},"published":{"date-parts":[[2020,10,27]]}}}