{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:24Z","timestamp":1772138064029,"version":"3.50.1"},"reference-count":46,"publisher":"Oxford University Press (OUP)","issue":"4","license":[{"start":{"date-parts":[[2024,4,8]],"date-time":"2024-04-08T00:00:00Z","timestamp":1712534400000},"content-version":"vor","delay-in-days":10,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100000038","name":"Natural Sciences and Engineering Research Council of Canada","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100000038","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,3,29]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Supervised deep learning is used to model the complex relationship between genomic sequence and regulatory function. Understanding how these models make predictions can provide biological insight into regulatory functions. Given the complexity of the sequence to regulatory function mapping (the cis-regulatory code), it has been suggested that the genome contains insufficient sequence variation to train models with suitable complexity. Data augmentation is a widely used approach to increase the data variation available for model training, however current data augmentation methods for genomic sequence data are limited.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>Inspired by the success of comparative genomics, we show that augmenting genomic sequences with evolutionarily related sequences from other species, which we term phylogenetic augmentation, improves the performance of deep learning models trained on regulatory genomic sequences to predict high-throughput functional assay measurements. Additionally, we show that phylogenetic augmentation can rescue model performance when the training set is down-sampled and permits deep learning on a real-world small dataset, demonstrating that this approach improves data efficiency. Overall, this data augmentation method represents a solution for improving model performance that is applicable to many supervised deep-learning problems in genomics.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>The open-source GitHub repository agduncan94\/phylogenetic_augmentation_paper includes the code for rerunning the analyses here and recreating the figures.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btae190","type":"journal-article","created":{"date-parts":[[2024,4,5]],"date-time":"2024-04-05T20:31:17Z","timestamp":1712349077000},"source":"Crossref","is-referenced-by-count":10,"title":["Improving the performance of supervised deep learning for regulatory genomics using phylogenetic augmentation"],"prefix":"10.1093","volume":"40","author":[{"given":"Andrew G","family":"Duncan","sequence":"first","affiliation":[{"name":"Cell & Systems Biology, University of Toronto , Toronto, ON M5S 3G5, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7147-4604","authenticated-orcid":false,"given":"Jennifer A","family":"Mitchell","sequence":"additional","affiliation":[{"name":"Cell & Systems Biology, University of Toronto , Toronto, ON M5S 3G5, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3118-3121","authenticated-orcid":false,"given":"Alan M","family":"Moses","sequence":"additional","affiliation":[{"name":"Cell & Systems Biology, University of Toronto , Toronto, ON M5S 3G5, Canada"}]}],"member":"286","published-online":{"date-parts":[[2024,4,8]]},"reference":[{"key":"2024042423461348900_btae190-B1","first-page":"265","author":"Abadi"},{"key":"2024042423461348900_btae190-B2","author":"Alam","year":"2023"},{"key":"2024042423461348900_btae190-B3","doi-asserted-by":"crossref","first-page":"246","DOI":"10.1038\/s41586-020-2871-y","article-title":"Progressive cactus is a multiple-genome aligner for the thousand-genome era","volume":"587","author":"Armstrong","year":"2020","journal-title":"Nature"},{"key":"2024042423461348900_btae190-B4","doi-asserted-by":"crossref","first-page":"1196","DOI":"10.1038\/s41592-021-01252-x","article-title":"Effective gene expression prediction from sequence by integrating long-range interactions","volume":"18","author":"Avsec","year":"2021","journal-title":"Nat Methods"},{"key":"2024042423461348900_btae190-B5","doi-asserted-by":"crossref","first-page":"354","DOI":"10.1038\/s41588-021-00782-6","article-title":"Base-resolution models of transcription-factor binding reveal soft motif syntax","volume":"53","author":"Avsec","year":"2021","journal-title":"Nat Genet"},{"key":"2024042423461348900_btae190-B6","doi-asserted-by":"crossref","first-page":"1837","DOI":"10.1093\/bioinformatics\/bty893","article-title":"Simple tricks of convolutional neural network architectures improve DNA-protein binding prediction","volume":"35","author":"Cao","year":"2019","journal-title":"Bioinformatics"},{"key":"2024042423461348900_btae190-B7","doi-asserted-by":"crossref","first-page":"613","DOI":"10.1038\/s41588-022-01048-5","article-title":"DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers","volume":"54","author":"de Almeida","year":"2022","journal-title":"Nat Genet"},{"key":"2024042423461348900_btae190-B8","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1038\/s41586-023-06661-w","article-title":"Hold out the genome: a roadmap to solving the cis-regulatory code","volume":"625","author":"de Boer","year":"2024","journal-title":"Nature"},{"key":"2024042423461348900_btae190-B9","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1038\/s41587-019-0315-8","article-title":"Deciphering eukaryotic gene-regulatory logic with 100 million random promoters","volume":"38","author":"De Boer","year":"2020","journal-title":"Nat Biotechnol"},{"key":"2024042423461348900_btae190-B10","doi-asserted-by":"crossref","first-page":"252","DOI":"10.1038\/s41586-020-2873-9","article-title":"Dense sampling of bird diversity increases power of comparative genomics","volume":"587","author":"Feng","year":"2020","journal-title":"Nature"},{"key":"2024042423461348900_btae190-B11","doi-asserted-by":"crossref","first-page":"E79","DOI":"10.1371\/journal.pbio.0020079","article-title":"Extensive association of functionally and cytotopically related mRNAs with puf family RNA-binding proteins in yeast","volume":"2","author":"Gerber","year":"2004","journal-title":"PLoS Biol"},{"key":"2024042423461348900_btae190-B12","doi-asserted-by":"crossref","first-page":"baw035","DOI":"10.1093\/database\/baw035","article-title":"ATtRACT-a database of RNA-binding proteins and associated motifs","volume":"2016","author":"Giudice","year":"2016","journal-title":"Database (Oxford)"},{"key":"2024042423461348900_btae190-B13","doi-asserted-by":"crossref","first-page":"1341","DOI":"10.1093\/bioinformatics\/btt128","article-title":"HAL: a hierarchical format for storing and analyzing multiple genome alignments","volume":"29","author":"Hickey","year":"2013","journal-title":"Bioinformatics"},{"key":"2024042423461348900_btae190-B14","doi-asserted-by":"crossref","first-page":"e1002307","DOI":"10.1371\/journal.pbio.1002307","article-title":"Evolutionary conservation and diversification of puf RNA binding proteins and their mRNA targets","volume":"13","author":"Hogan","year":"2015","journal-title":"PLoS Biol"},{"key":"2024042423461348900_btae190-B15","doi-asserted-by":"crossref","first-page":"1635","DOI":"10.1093\/molbev\/msw046","article-title":"ETE 3: reconstruction, analysis, and visualization of phylogenomic data","volume":"33","author":"Huerta-Cepas","year":"2016","journal-title":"Mol Biol Evol"},{"key":"2024042423461348900_btae190-B16","doi-asserted-by":"crossref","first-page":"e1008050","DOI":"10.1371\/journal.pcbi.1008050","article-title":"Cross-species regulatory sequence activity prediction","volume":"16","author":"Kelley","year":"2020","journal-title":"PLoS Comput Biol"},{"key":"2024042423461348900_btae190-B17","doi-asserted-by":"crossref","first-page":"990","DOI":"10.1101\/gr.200535.115","article-title":"Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks","volume":"26","author":"Kelley","year":"2016","journal-title":"Genome Res"},{"key":"2024042423461348900_btae190-B18","author":"Kim","year":"2023"},{"key":"2024042423461348900_btae190-B19","author":"Kingma"},{"key":"2024042423461348900_btae190-B20","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1016\/j.coisb.2020.04.001","article-title":"Deep learning for inferring transcription factor binding sites","volume":"19","author":"Koo","year":"2020","journal-title":"Curr Opin Syst Biol"},{"key":"2024042423461348900_btae190-B21","doi-asserted-by":"crossref","first-page":"e1008925","DOI":"10.1371\/journal.pcbi.1008925","article-title":"Global importance analysis: an interpretability method to quantify importance of genomic features in deep neural networks","volume":"17","author":"Koo","year":"2021","journal-title":"PLoS Comput Biol"},{"key":"2024042423461348900_btae190-B22","doi-asserted-by":"crossref","first-page":"735","DOI":"10.1038\/s41586-023-06798-8","article-title":"Identification of constrained sequence elements across 239 primate genomes","volume":"625","author":"Kuderna","year":"2023","journal-title":"Nature"},{"key":"2024042423461348900_btae190-B23","doi-asserted-by":"crossref","first-page":"105","DOI":"10.1186\/s13059-023-02941-w","article-title":"EvoAug: improving generalization and interpretability of genomic deep neural networks with evolution-inspired data augmentations","volume":"24","author":"Lee","year":"2023","journal-title":"Genome Biol"},{"key":"2024042423461348900_btae190-B24","doi-asserted-by":"crossref","first-page":"4325","DOI":"10.1073\/pnas.1720115115","article-title":"Earth BioGenome project: sequencing life for the future of life","volume":"115","author":"Lewin","year":"2018","journal-title":"Proc Natl Acad Sci USA"},{"key":"2024042423461348900_btae190-B25","doi-asserted-by":"crossref","first-page":"71","DOI":"10.1016\/j.aiopen.2022.03.001","article-title":"Data augmentation approaches in natural language processing: a survey","volume":"3","author":"Li","year":"2022","journal-title":"AI Open"},{"key":"2024042423461348900_btae190-B26","author":"Lu"},{"key":"2024042423461348900_btae190-B27","doi-asserted-by":"crossref","first-page":"e1010238","DOI":"10.1371\/journal.pcbi.1010238","article-title":"Discovering molecular features of intrinsically disordered regions by using evolution for contrastive learning","volume":"18","author":"Lu","year":"2022","journal-title":"PLoS Comput Biol"},{"key":"2024042423461348900_btae190-B28","doi-asserted-by":"crossref","first-page":"25655","DOI":"10.1073\/pnas.2011795117","article-title":"Deep learning of immune cell differentiation","volume":"117","author":"Maslova","year":"2020","journal-title":"Proc Natl Acad Sci USA"},{"key":"2024042423461348900_btae190-B29","doi-asserted-by":"crossref","first-page":"1815","DOI":"10.1101\/gr.260844.120","article-title":"Cross-species analysis of enhancer logic using deep learning","volume":"30","author":"Minnoye","year":"2020","journal-title":"Genome Res"},{"key":"2024042423461348900_btae190-B30","author":"Mourad","year":"2024"},{"key":"2024042423461348900_btae190-B31","doi-asserted-by":"crossref","first-page":"154","DOI":"10.1186\/s13059-023-02985-y","article-title":"ExplaiNN: interpretable and transparent neural networks for genomics","volume":"24","author":"Novakovsky","year":"2023","journal-title":"Genome Biol"},{"key":"2024042423461348900_btae190-B32","author":"Paszke"},{"key":"2024042423461348900_btae190-B33","doi-asserted-by":"crossref","first-page":"1512","DOI":"10.1101\/gr.123356.111","article-title":"Cactus: algorithms for genome multiple sequence alignment","volume":"21","author":"Paten","year":"2011","journal-title":"Genome Res"},{"key":"2024042423461348900_btae190-B34","doi-asserted-by":"crossref","first-page":"D20","DOI":"10.1093\/nar\/gkab1112","article-title":"Database resources of the National Center for Biotechnology Information","volume":"50","author":"Sayers","year":"2022","journal-title":"Nucleic Acids Res"},{"key":"2024042423461348900_btae190-B35","doi-asserted-by":"crossref","first-page":"269","DOI":"10.1186\/1471-2164-10-269","article-title":"Definition, conservation and epigenetics of housekeeping and tissue-enriched genes","volume":"10","author":"She","year":"2009","journal-title":"BMC Genomics"},{"key":"2024042423461348900_btae190-B36","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1186\/s40537-019-0197-0","article-title":"A survey on image data augmentation for deep learning","volume":"6","author":"Shorten","year":"2019","journal-title":"J Big Data"},{"key":"2024042423461348900_btae190-B37","author":"Shrikumar","year":"2020"},{"key":"2024042423461348900_btae190-B38","author":"Tareen"},{"key":"2024042423461348900_btae190-B39","doi-asserted-by":"crossref","first-page":"1088","DOI":"10.1038\/s42256-022-00570-9","article-title":"Evaluating deep learning for predicting epigenomic profiles","volume":"4","author":"Toneyan","year":"2022","journal-title":"Nat Mach Intell"},{"key":"2024042423461348900_btae190-B40","doi-asserted-by":"crossref","first-page":"554","DOI":"10.1016\/j.cell.2015.01.006","article-title":"Enhancer evolution across 20 mammalian species","volume":"160","author":"Villar","year":"2015","journal-title":"Cell"},{"key":"2024042423461348900_btae190-B41","doi-asserted-by":"crossref","first-page":"D88","DOI":"10.1093\/nar\/gkl822","article-title":"VISTA enhancer browser\u2013a database of tissue-specific human enhancers","volume":"35","author":"Visel","year":"2007","journal-title":"Nucleic Acids Res"},{"key":"2024042423461348900_btae190-B42","doi-asserted-by":"crossref","first-page":"66","DOI":"10.1016\/j.tig.2009.12.002","article-title":"Conserved expression without conserved regulatory sequence: the more things change, the more they stay the same","volume":"26","author":"Weirauch","year":"2010","journal-title":"Trends Genet"},{"key":"2024042423461348900_btae190-B43","doi-asserted-by":"crossref","first-page":"143","DOI":"10.1093\/bioinformatics\/btu613","article-title":"The ensembl REST API: ensembl data for any language","volume":"31","author":"Yates","year":"2015","journal-title":"Bioinformatics"},{"key":"2024042423461348900_btae190-B44","doi-asserted-by":"crossref","first-page":"D754","DOI":"10.1093\/nar\/gkx1098","article-title":"Ensembl 2018","volume":"46","author":"Zerbino","year":"2017","journal-title":"Nucleic Acids Res"},{"key":"2024042423461348900_btae190-B45","doi-asserted-by":"crossref","first-page":"4339","DOI":"10.1093\/bioinformatics\/btaa493","article-title":"HALPER facilitates the identification of regulatory element orthologs across species","volume":"36","author":"Zhang","year":"2020","journal-title":"Bioinformatics"},{"key":"2024042423461348900_btae190-B46","doi-asserted-by":"crossref","first-page":"240","DOI":"10.1038\/s41586-020-2876-6","article-title":"A comparative genomics multitool for scientific discovery and conservation","volume":"587","author":"Zoonomia Consortium","year":"2020","journal-title":"Nature"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btae190\/57185715\/btae190.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/4\/btae190\/57322341\/btae190.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/4\/btae190\/57322341\/btae190.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,4,24]],"date-time":"2024-04-24T19:46:40Z","timestamp":1713988000000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btae190\/7642397"}},"subtitle":[],"editor":[{"given":"Russell","family":"Schwartz","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2024,3,29]]},"references-count":46,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,3,29]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btae190","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2023.09.15.558005","asserted-by":"object"}]},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024,4,1]]},"published":{"date-parts":[[2024,3,29]]},"article-number":"btae190"}}