{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,19]],"date-time":"2026-05-19T22:15:04Z","timestamp":1779228904511,"version":"3.51.4"},"reference-count":24,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,2,28]],"date-time":"2024-02-28T00:00:00Z","timestamp":1709078400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,2,28]],"date-time":"2024-02-28T00:00:00Z","timestamp":1709078400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100000060","name":"National Institute of Allergy and Infectious Diseases","doi-asserted-by":"crossref","award":["1R01AI143254"],"award-info":[{"award-number":["1R01AI143254"]}],"id":[{"id":"10.13039\/100000060","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000060","name":"National Institute of Allergy and Infectious Diseases","doi-asserted-by":"crossref","award":["1R01AI143254"],"award-info":[{"award-number":["1R01AI143254"]}],"id":[{"id":"10.13039\/100000060","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Purpose<\/jats:title>\n                <jats:p>Despite the many progresses with alignment algorithms, aligning divergent protein sequences with less than 20\u201335% pairwise identity (so called \"twilight zone\") remains a difficult problem. Many alignment algorithms have been using substitution matrices since their creation in the 1970\u2019s to generate alignments, however, these matrices do not work well to score alignments within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Similar to the traditional Smith-Waterman algorithm, PEbA uses a dynamic programming algorithm but the matching score of amino acids is based on the similarity of their embeddings from a protein language model.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Methods<\/jats:title>\n                <jats:p>We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each one extracted from one of their multiple sequence alignments. Five different BAliBASE references were used, each with different sequence identities, motifs, and lengths, allowing PEbA to showcase how well it aligns under different circumstances.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignments, achieving different levels of improvements of the alignment quality for pairs of sequences with different levels of similarity (over four times as well for pairs of sequences with &lt;10% identity). We also compared PEbA with embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed protein language model embedding-based alignment methods.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Conclusion<\/jats:title>\n                <jats:p>Our results suggested that general purpose protein language models provide useful contextual information for generating more accurate protein alignments than typically used methods.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1186\/s12859-024-05699-5","type":"journal-article","created":{"date-parts":[[2024,2,28]],"date-time":"2024-02-28T03:02:28Z","timestamp":1709089348000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":27,"title":["Protein embedding based alignment"],"prefix":"10.1186","volume":"25","author":[{"given":"Benjamin Giovanni","family":"Iovino","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yuzhen","family":"Ye","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2024,2,28]]},"reference":[{"issue":"3","key":"5699_CR1","doi-asserted-by":"publisher","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","volume":"215","author":"Stephen F Altschul","year":"1990","unstructured":"Altschul Stephen F, Gish Warren, Miller Webb, Myers Eugene W, Lipman David J. Basic local alignment search tool. J Mol Biol. 1990;215(3):403\u201310.","journal-title":"J Mol Biol"},{"key":"5699_CR2","unstructured":"Devlin J, Chang M-W, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT 2019, 4171\u20134186"},{"key":"5699_CR3","volume-title":"ORFS A A primer on how to analyze derived amino acid sequences","author":"RF Doolittle","year":"1986","unstructured":"Doolittle RF. ORFS A A primer on how to analyze derived amino acid sequences. Sausalito: University Science Books; 1986."},{"issue":"10","key":"5699_CR4","doi-asserted-by":"publisher","first-page":"7112","DOI":"10.1109\/TPAMI.2021.3095381","volume":"44","author":"Ahmed Elnaggar","year":"2022","unstructured":"Elnaggar Ahmed, Heinzinger Michael, Dallago Christian, Ghalia Rehawi Yu, Wang Llion Jones, Gibbs Tom, Feher Tamas, Angerer Christoph, Steinegger Martin, Bhowmik Debsindhu, Rost Burkhard. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell. 2022;44(10):7112\u201327.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"22","key":"5699_CR5","doi-asserted-by":"publisher","first-page":"10915","DOI":"10.1073\/pnas.89.22.10915","volume":"89","author":"S Henikoff","year":"1992","unstructured":"Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci. 1992;89(22):10915\u20139.","journal-title":"Proc Natl Acad Sci"},{"issue":"3","key":"5699_CR6","doi-asserted-by":"publisher","first-page":"499","DOI":"10.1002\/prot.22458","volume":"77","author":"Kristoffer Illerg\u00e5rd","year":"2009","unstructured":"Illerg\u00e5rd Kristoffer, Ardell David H, Elofsson Arne. Structure is three to ten times more conserved than sequence a study of structural response in protein cores. Proteins Struct Funct Bioinf. 2009;77(3):499\u2013508.","journal-title":"Proteins Struct Funct Bioinf"},{"issue":"7873","key":"5699_CR7","doi-asserted-by":"publisher","first-page":"583","DOI":"10.1038\/s41586-021-03819-2","volume":"596","author":"John Jumper","year":"2021","unstructured":"...Jumper John, Evans Richard, Pritzel Alexander, Green Tim, Figurnov Michael, Ronneberger Olaf, Tunyasuvunakool Kathryn, Bates Russ, \u017d\u00eddek Augustin, Potapenko Anna, Bridgland Alex, Meyer Clemens, Kohl Simon A. A, Ballard Andrew J, Cowie Andrew, Romera-Paredes Bernardino, Nikolov Stanislav, Jain Rishub, Adler Jonas, Back Trevor, Petersen Stig, Reiman David, Clancy Ellen, Zielinski Michal, Steinegger Martin, Pacholska Michalina, Berghammer Tamas, Bodenstein Sebastian, Silver David, Vinyals Oriol, Senior Andrew W, Kavukcuoglu Koray, Kohli Pushmeet, Hassabis Demis. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583\u20139.","journal-title":"Nature"},{"issue":"1","key":"5699_CR8","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s12859-016-1414-x","volume":"18","author":"Keul Frank","year":"2017","unstructured":"Frank Keul, Martin Hess, Michael Goesele, Kay Hamacher. PFASUM: a substitution matrix from pfam structural alignments. BMC Bioinformatics. 2017;18(1):1\u201314.","journal-title":"BMC Bioinformatics"},{"issue":"W1","key":"5699_CR9","doi-asserted-by":"publisher","first-page":"W60","DOI":"10.1093\/nar\/gkaa443","volume":"48","author":"Zhanwen Li","year":"2020","unstructured":"Li Zhanwen, Jaroszewski Lukasz, Iyer Mallika, Sedova Mayya, Godzik Adam. FATCAT 2.0: towards a better understanding of the structural diversity of proteins. Nucleic Acids Res. 2020;48(W1):W60\u20134.","journal-title":"Nucleic Acids Res"},{"issue":"6637","key":"5699_CR10","doi-asserted-by":"publisher","first-page":"1123","DOI":"10.1126\/science.ade2574","volume":"379","author":"Zeming Lin","year":"2023","unstructured":"Lin Zeming, Akin Halil, Rao Roshan, Hie Brian, Zhu Zhongkai, Wenting Lu, Smetanin Nikita, Verkuil Robert, Kabeli Ori, Shmueli Yaniv, dos Santos Allan, Costa Maryam Fazel-Zarandi, Sercu Tom, Candido Salvatore, Rives Alexander. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123\u201330.","journal-title":"Science"},{"issue":"1","key":"5699_CR11","doi-asserted-by":"publisher","first-page":"104","DOI":"10.1038\/s41592-022-01700-2","volume":"20","author":"Felipe Llinares-L\u00f3pez","year":"2022","unstructured":"Llinares-L\u00f3pez Felipe, Berthet Quentin, Blondel Mathieu, Teboul Olivier, Vert Jean-Philippe. Deep embedding and alignment of protein sequences. Nat Methods. 2022;20(1):104\u201311.","journal-title":"Nat Methods"},{"issue":"7","key":"5699_CR12","first-page":"1145","volume":"33","author":"CD McWhite","year":"2023","unstructured":"McWhite CD, Armour-Garb I, Singh M. Leveraging protein language models for accurate multiple sequence alignments. Genome Res. 2023;33(7):1145\u201353.","journal-title":"Genome Res"},{"issue":"D1","key":"5699_CR13","doi-asserted-by":"publisher","first-page":"D412","DOI":"10.1093\/nar\/gkaa913","volume":"49","author":"Jaina Mistry","year":"2020","unstructured":"Mistry Jaina, Chuguransky Sara, Williams Lowri, Qureshi Matloob, Salazar Gustavo A, Sonnhammer Erik L L, Tosatto Silvio C E, Paladin Lisanna, Raj Shriya, Richardson Lorna J, Finn Robert D, Bateman Alex. Pfam: the protein families database in 2021. Nucleic Acids Res. 2020;49(D1):D412\u20139.","journal-title":"Nucleic Acids Res"},{"issue":"3","key":"5699_CR14","doi-asserted-by":"publisher","first-page":"443","DOI":"10.1016\/0022-2836(70)90057-4","volume":"48","author":"Saul B Needleman","year":"1970","unstructured":"Needleman Saul B, Wunsch Christian D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48(3):443\u201353.","journal-title":"J Mol Biol"},{"key":"5699_CR15","doi-asserted-by":"publisher","first-page":"1750","DOI":"10.1016\/j.csbj.2021.03.022","volume":"19","author":"Dan Ofer","year":"2021","unstructured":"Ofer Dan, Brandes Nadav, Linial Michal. The language of proteins: NLP, machine learning and protein sequences. Comput Struct Biotechnol J. 2021;19:1750\u20138.","journal-title":"Comput Struct Biotechnol J"},{"issue":"2","key":"5699_CR16","doi-asserted-by":"publisher","first-page":"85","DOI":"10.1093\/protein\/12.2.85","volume":"12","author":"Burkhard Rost","year":"1999","unstructured":"Rost Burkhard. Twilight zone of protein sequence alignments. Protein Eng Des Sel. 1999;12(2):85\u201394.","journal-title":"Protein Eng Des Sel"},{"key":"5699_CR17","doi-asserted-by":"publisher","first-page":"1033775","DOI":"10.3389\/fbinf.2022.1033775","volume":"2","author":"K Sch\u00fctze","year":"2022","unstructured":"Sch\u00fctze K, Heinzinger M, Steinegger M, Rost B. Nearest neighbor search on embeddings rapidly identifies distant protein relations. Front Bioinform. 2022;2:1033775.","journal-title":"Front Bioinform."},{"issue":"1","key":"5699_CR18","doi-asserted-by":"publisher","first-page":"539","DOI":"10.1038\/msb.2011.75","volume":"7","author":"Fabian Sievers","year":"2011","unstructured":"Sievers Fabian, Wilm Andreas, Dineen David, Gibson Toby J, Karplus Kevin, Li Weizhong, Lopez Rodrigo, McWilliam Hamish, Remmert Michael, S\u00f6ding Johannes, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using clustal omega. Mol Syst Biol. 2011;7(1):539.","journal-title":"Mol Syst Biol"},{"issue":"7","key":"5699_CR19","doi-asserted-by":"publisher","first-page":"603","DOI":"10.1038\/s41592-019-0437-4","volume":"16","author":"Martin Steinegger","year":"2019","unstructured":"Steinegger Martin, Mirdita Milot, S\u00f6ding Johannes. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat Methods. 2019;16(7):603\u20136.","journal-title":"Nat Methods"},{"issue":"1","key":"5699_CR20","doi-asserted-by":"publisher","first-page":"5242","DOI":"10.1038\/s41467-018-04964-5","volume":"9","author":"M Steinegger","year":"2018","unstructured":"Steinegger M, S\u00f6ding J. Clustering huge protein sequence sets in linear time. Nat Commun. 2018;9(1):5242.","journal-title":"Nat Commun"},{"issue":"6","key":"5699_CR21","doi-asserted-by":"publisher","first-page":"926","DOI":"10.1093\/bioinformatics\/btu739","volume":"31","author":"Baris E Suzek","year":"2014","unstructured":"Suzek Baris E, Wang Yuqi, Huang Hongzhan, McGarvey Peter B, Wu Cathy H. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2014;31(6):926\u201332.","journal-title":"Bioinformatics"},{"issue":"1","key":"5699_CR22","doi-asserted-by":"publisher","first-page":"127","DOI":"10.1002\/prot.20527","volume":"61","author":"Julie\u00a0D Thompson","year":"2005","unstructured":"Thompson Julie\u00a0D, Koehl Patrice, Ripp Raymond, Poch Olivier. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins: Struct Funct, Bioinf. 2005;61(1):127\u201336.","journal-title":"Proteins: Struct Funct, Bioinf"},{"key":"5699_CR23","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I. Attention is all you need. Adv Neural Inform Process Syst 2017;30"},{"issue":"suppl2","key":"5699_CR24","first-page":"246","volume":"19","author":"Ye Yuzhen","year":"2003","unstructured":"Yuzhen Ye, Adam Godzik. Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics. 2003;19(suppl2):246\u201355.","journal-title":"Bioinformatics"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-024-05699-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12859-024-05699-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-024-05699-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,28]],"date-time":"2024-02-28T03:02:56Z","timestamp":1709089376000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/s12859-024-05699-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,28]]},"references-count":24,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["5699"],"URL":"https:\/\/doi.org\/10.1186\/s12859-024-05699-5","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,2,28]]},"assertion":[{"value":"3 July 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 February 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 February 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate\u00a0"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"85"}}