{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,13]],"date-time":"2026-05-13T01:45:40Z","timestamp":1778636740981,"version":"3.51.4"},"reference-count":30,"publisher":"Oxford University Press (OUP)","license":[{"start":{"date-parts":[[2020,6,5]],"date-time":"2020-06-05T00:00:00Z","timestamp":1591315200000},"content-version":"vor","delay-in-days":156,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["R01 GM125878"],"award-info":[{"award-number":["R01 GM125878"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Intramural Research Program of the NIH"},{"DOI":"10.13039\/100000092","name":"National Library of Medicine","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000092","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,1,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>For optimal performance, machine learning methods for protein sequence\/structural analysis typically require as input a large multiple sequence alignment (MSA), which is often created using query-based iterative programs, such as PSI-BLAST or JackHMMER. However, because these programs align database sequences using a query sequence as a template, they may fail to detect or may tend to misalign sequences distantly related to the query. More generally, automated MSA programs often fail to align sequences correctly due to the unpredictable nature of protein evolution. Addressing this problem typically requires manual curation in the light of structural data. However, curated MSAs tend to contain too few sequences to serve as input for statistically based methods. We address these shortcomings by making publicly available a set of 252 curated hierarchical MSAs (hiMSAs), containing a total of 26\u2009212\u2009066 sequences, along with programs for generating from these extremely large MSAs. Each hiMSA consists of a set of hierarchically arranged MSAs representing individual subgroups within a superfamily along with template MSAs specifying how to align each subgroup MSA against MSAs higher up the hierarchy. Central to this approach is the MAPGAPS search program, which uses a hiMSA as a query to align (potentially vast numbers of) matching database sequences with accuracy comparable to that of the curated hiMSA. We illustrate this process for the exonuclease\u2013endonuclease\u2013phosphatase superfamily and for pleckstrin homology domains. A set of extremely large MSAs generated from the hiMSAs in this way is available as input for deep learning, big data analyses. MAPGAPS, auxiliary programs CDD2MGS, AddPhylum, PurgeMSA and ConvertMSA and links to National Center for Biotechnology Information data files are available at https:\/\/www.igs.umaryland.edu\/labs\/neuwald\/software\/mapgaps\/.<\/jats:p>","DOI":"10.1093\/database\/baaa042","type":"journal-article","created":{"date-parts":[[2020,5,12]],"date-time":"2020-05-12T03:12:20Z","timestamp":1589253140000},"source":"Crossref","is-referenced-by-count":5,"title":["Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments"],"prefix":"10.1093","volume":"2020","author":[{"given":"Andrew F","family":"Neuwald","sequence":"first","affiliation":[{"name":"Institute for Genome Sciences"},{"name":"Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, 670 W. Baltimore Street, Baltimore, MD 21201, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Christopher J","family":"Lanczycki","sequence":"first","affiliation":[{"name":"National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38 A, 8600 Rockville Pike, Bethesda, MD 20894, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Theresa K","family":"Hodges","sequence":"first","affiliation":[{"name":"Institute for Genome Sciences"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Aron","family":"Marchler-Bauer","sequence":"first","affiliation":[{"name":"National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38 A, 8600 Rockville Pike, Bethesda, MD 20894, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2020,6,8]]},"reference":[{"key":"2020061611132201800_ref1","doi-asserted-by":"crossref","first-page":"1607","DOI":"10.1016\/j.cell.2012.04.012","article-title":"Three-dimensional structures of membrane proteins from genomic sequencing","volume":"149","author":"Hopf","year":"2012","journal-title":"Cell"},{"key":"2020061611132201800_ref2","doi-asserted-by":"crossref","first-page":"184","DOI":"10.1093\/bioinformatics\/btr638","article-title":"PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments","volume":"28","author":"Jones","year":"2012","journal-title":"Bioinformatics"},{"key":"2020061611132201800_ref3","doi-asserted-by":"crossref","first-page":"15674","DOI":"10.1073\/pnas.1314045110","article-title":"Assessing the utility of coevolution-based residue\u2013residue contact predictions in a sequence- and structure-rich era","volume":"110","author":"Kamisetty","year":"2013","journal-title":"Proc. Natl. Acad. Sci. U. S. A."},{"key":"2020061611132201800_ref4","doi-asserted-by":"crossref","first-page":"e28766","DOI":"10.1371\/journal.pone.0028766","article-title":"Protein 3D structure computed from evolutionary sequence variation","volume":"6","author":"Marks","year":"2011","journal-title":"PLoS One"},{"key":"2020061611132201800_ref5","doi-asserted-by":"crossref","first-page":"1072","DOI":"10.1038\/nbt.2419","article-title":"Protein structure prediction from sequence variation","volume":"30","author":"Marks","year":"2012","journal-title":"Nat. Biotechnol."},{"key":"2020061611132201800_ref6","doi-asserted-by":"crossref","first-page":"E1293","DOI":"10.1073\/pnas.1111471108","article-title":"Direct-coupling analysis of residue coevolution captures native contacts across many protein families","volume":"108","author":"Morcos","year":"2011","journal-title":"Proc. Natl. Acad. Sci. U. S. A."},{"key":"2020061611132201800_ref7","doi-asserted-by":"crossref","first-page":"E1540","DOI":"10.1073\/pnas.1120036109","article-title":"Accurate de novo structure prediction of large transmembrane protein domains using fragment-assembly and correlated mutation analysis","volume":"109","author":"Nugent","year":"2012","journal-title":"Proc. Natl. Acad. Sci. U. S. A."},{"key":"2020061611132201800_ref8","doi-asserted-by":"crossref","first-page":"e29880","DOI":"10.7554\/eLife.29880","article-title":"Inferring joint sequence-structural determinants of protein functional specificity","volume":"7","author":"Neuwald","year":"2018","journal-title":"Elife"},{"key":"2020061611132201800_ref9","doi-asserted-by":"crossref","first-page":"e1006237","DOI":"10.1371\/journal.pcbi.1006237","article-title":"Statistical investigations of protein residue direct couplings","volume":"14","author":"Neuwald","year":"2018","journal-title":"PLoS Comput. Biol."},{"key":"2020061611132201800_ref10","doi-asserted-by":"crossref","first-page":"355","DOI":"10.1186\/1471-2105-8-355","article-title":"Accuracy of structure-based sequence alignment of automatic methods","volume":"8","author":"Kim","year":"2007","journal-title":"BMC Bioinformatics"},{"key":"2020061611132201800_ref11","doi-asserted-by":"crossref","first-page":"2257","DOI":"10.1093\/molbev\/msq115","article-title":"The effect of insertions, deletions, and alignment errors on the branch-site test of positive selection","volume":"27","author":"Fletcher","year":"2010","journal-title":"Mol. Biol. Evol."},{"key":"2020061611132201800_ref12","doi-asserted-by":"crossref","first-page":"e18093","DOI":"10.1371\/journal.pone.0018093","article-title":"A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives","volume":"6","author":"Thompson","year":"2011","journal-title":"PLoS One"},{"key":"2020061611132201800_ref13","doi-asserted-by":"crossref","first-page":"3057","DOI":"10.1093\/molbev\/msu231","article-title":"Alignment errors strongly impact likelihood-based tests for comparing topologies","volume":"31","author":"Levy Karin","year":"2014","journal-title":"Mol. Biol. Evol."},{"key":"2020061611132201800_ref14","doi-asserted-by":"crossref","first-page":"e1004936","DOI":"10.1371\/journal.pcbi.1004936","article-title":"Bayesian top-down protein sequence alignment with inferred position-specific gap penalties","volume":"12","author":"Neuwald","year":"2016","journal-title":"PLoS Comput. Biol."},{"key":"2020061611132201800_ref15","doi-asserted-by":"crossref","first-page":"673","DOI":"10.1101\/gr.862303","article-title":"Ran's C-terminal, basic patch, and nucleotide exchange mechanisms in light of a canonical structure for Rab, Rho, Ras, and Ran GTPases","volume":"13","author":"Neuwald","year":"2003","journal-title":"Genome Res."},{"key":"2020061611132201800_ref16","doi-asserted-by":"crossref","first-page":"1869","DOI":"10.1093\/bioinformatics\/btp342","article-title":"Rapid detection, classification and accurate alignment of up to a million or more related protein sequences","volume":"25","author":"Neuwald","year":"2009","journal-title":"Bioinformatics"},{"key":"2020061611132201800_ref17","doi-asserted-by":"crossref","first-page":"D200","DOI":"10.1093\/nar\/gkw1129","article-title":"CDD\/SPARCLE: functional classification of proteins via subfamily domain architectures","volume":"45","author":"Marchler-Bauer","year":"2017","journal-title":"Nucleic Acids Res."},{"key":"2020061611132201800_ref18","doi-asserted-by":"crossref","first-page":"D222","DOI":"10.1093\/nar\/gku1221","article-title":"CDD: NCBI's conserved domain database","volume":"43","author":"Marchler-Bauer","year":"2015","journal-title":"Nucleic Acids Res."},{"key":"2020061611132201800_ref19","doi-asserted-by":"crossref","first-page":"3939","DOI":"10.1093\/bioinformatics\/bty495","article-title":"PASTA for proteins","volume":"34","author":"Collins","year":"2018","journal-title":"Bioinformatics"},{"key":"2020061611132201800_ref20","doi-asserted-by":"crossref","first-page":"2490","DOI":"10.1093\/bioinformatics\/bty121","article-title":"Parallelization of MAFFT for large-scale multiple sequence alignments","volume":"34","author":"Nakamura","year":"2018","journal-title":"Bioinformatics"},{"key":"2020061611132201800_ref21","doi-asserted-by":"crossref","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","article-title":"Gapped BLAST and PSI-BLAST: a new generation of protein database search programs","volume":"25","author":"Altschul","year":"1997","journal-title":"Nucleic Acids Res."},{"key":"2020061611132201800_ref22","doi-asserted-by":"crossref","first-page":"D94","DOI":"10.1093\/nar\/gky989","article-title":"GenBank","volume":"47","author":"Sayers","year":"2019","journal-title":"Nucleic Acids Res."},{"key":"2020061611132201800_ref23","doi-asserted-by":"crossref","first-page":"D427","DOI":"10.1093\/nar\/gky995","article-title":"The Pfam protein families database in 2019","volume":"47","author":"El-Gebali","year":"2019","journal-title":"Nucleic Acids Res."},{"key":"2020061611132201800_ref24","doi-asserted-by":"crossref","first-page":"3128","DOI":"10.1093\/bioinformatics\/btu500","article-title":"CCMpred--fast and precise prediction of protein residue-residue contacts from correlated mutations","volume":"30","author":"Seemayer","year":"2014","journal-title":"Bioinformatics"},{"key":"2020061611132201800_ref25","doi-asserted-by":"crossref","first-page":"431","DOI":"10.1186\/1471-2105-11-431","article-title":"Hidden Markov model speed heuristic and iterative HMM search procedure","volume":"11","author":"Johnson","year":"2010","journal-title":"BMC Bioinformatics"},{"key":"2020061611132201800_ref26","doi-asserted-by":"crossref","first-page":"bax008","DOI":"10.1093\/database\/bax008","article-title":"Workflow and web application for annotating NCBI BioProject transcriptome data","volume":"2017","author":"Vera Alvarez","year":"2017","journal-title":"Database (Oxford)"},{"key":"2020061611132201800_ref27","doi-asserted-by":"crossref","first-page":"D666","DOI":"10.1093\/nar\/gky901","article-title":"IMG\/M v.5.0: an integrated data management and comparative analysis system for microbial genomes and microbiomes","volume":"47","author":"Chen","year":"2019","journal-title":"Nucleic Acids Res."},{"key":"2020061611132201800_ref28","doi-asserted-by":"crossref","first-page":"294","DOI":"10.1126\/science.aah4043","article-title":"Protein structure determination using metagenome sequence data","volume":"355","author":"Ovchinnikov","year":"2017","journal-title":"Science"},{"key":"2020061611132201800_ref29","doi-asserted-by":"crossref","first-page":"3308","DOI":"10.1093\/bioinformatics\/bty341","article-title":"High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features","volume":"34","author":"Jones","year":"2018","journal-title":"Bioinformatics"},{"key":"2020061611132201800_ref30","doi-asserted-by":"crossref","first-page":"e39397","DOI":"10.7554\/eLife.39397","article-title":"Learning protein constitutive motifs from sequence data","volume":"8","author":"Tubiana","year":"2019","journal-title":"Elife"}],"container-title":["Database"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baaa042\/33394075\/baaa042.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baaa042\/33394075\/baaa042.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,6,16]],"date-time":"2020-06-16T16:47:47Z","timestamp":1592326067000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/database\/article\/doi\/10.1093\/database\/baaa042\/5850901"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,1,1]]},"references-count":30,"URL":"https:\/\/doi.org\/10.1093\/database\/baaa042","relation":{},"ISSN":["1758-0463"],"issn-type":[{"value":"1758-0463","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020]]},"published":{"date-parts":[[2020,1,1]]},"article-number":"baaa042"}}