{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,2]],"date-time":"2025-08-02T14:19:20Z","timestamp":1754144360882,"version":"3.41.2"},"reference-count":58,"publisher":"Oxford University Press (OUP)","issue":"Supplement_1","license":[{"start":{"date-parts":[[2025,7,15]],"date-time":"2025-07-15T00:00:00Z","timestamp":1752537600000},"content-version":"vor","delay-in-days":14,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Jetstream2","award":["CIS240916"],"award-info":[{"award-number":["CIS240916"]}]},{"name":"Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support"},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["#2138259","#2138286","#2138307","#2137603","#2138296"],"award-info":[{"award-number":["#2138259","#2138286","#2138307","#2137603","#2138296"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"NIH","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000057","name":"NIGMS","doi-asserted-by":"publisher","award":["R01GM132600","DOE BER DE-SC0021216"],"award-info":[{"award-number":["R01GM132600","DOE BER DE-SC0021216"]}],"id":[{"id":"10.13039\/100000057","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Summary<\/jats:title>\n                  <jats:p>Protein language models (PLMs) have recently demonstrated potential to supplant classical protein database search methods based on sequence alignment, but are slower than common alignment-based tools and appear to be prone to a high rate of false labeling. Here, we present Neural Embeddings for Amino acid Relationships (NEAR), a method based on neural representation learning that is designed to improve both speed and accuracy of search for likely homologs in a large protein sequence database. NEAR\u2019s ResNet embedding model is trained using contrastive learning guided by trusted sequence alignments. It computes per-residue embeddings for target and query protein sequences, and identifies alignment candidates with a pipeline consisting of residue-level k-NN search and a simple neighbor aggregation scheme. Tests on a benchmark consisting of trusted remote homologs and randomly shuffled decoy sequences reveal that NEAR substantially improves accuracy relative to state-of-the-art PLMs, with lower memory requirements and faster embedding and search speed. While these results suggest that the NEAR model may be useful for standalone homology detection with increased sensitivity over standard alignment-based methods, in this manuscript, we focus on a more straightforward analysis of the model\u2019s value as a high-speed pre-filter for sensitive annotation. In that context, NEAR is at least 5x faster than the pre-filter currently used in the widely used profile hidden Markov model (pHMM) search tool HMMER3, and also outperforms the pre-filter used in our fast pHMM tool, nail.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>NEAR is under an open-source license. Code and data curation instructions can be found at https:\/\/github.com\/TravisWheelerLab\/NEAR.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaf198","type":"journal-article","created":{"date-parts":[[2025,7,15]],"date-time":"2025-07-15T13:02:58Z","timestamp":1752584578000},"page":"i449-i457","source":"Crossref","is-referenced-by-count":0,"title":["NEAR: neural embeddings for amino acid relationships"],"prefix":"10.1093","volume":"41","author":[{"given":"Daniel","family":"Olson","sequence":"first","affiliation":[{"name":"Department of Computer Science, University of Montana , Missoula, MT 59812,","place":["United States"]}]},{"given":"Thomas","family":"Colligan","sequence":"additional","affiliation":[{"name":"College of Pharmacy, University of Arizona, Tucson, AZ 85721,","place":["United States"]}]},{"given":"Daphne","family":"Demekas","sequence":"additional","affiliation":[{"name":"College of Pharmacy, University of Arizona, Tucson, AZ 85721,","place":["United States"]}]},{"given":"Jack W","family":"Roddy","sequence":"additional","affiliation":[{"name":"College of Pharmacy, University of Arizona, Tucson, AZ 85721,","place":["United States"]}]},{"given":"Ken","family":"Youens-Clark","sequence":"additional","affiliation":[{"name":"College of Pharmacy, University of Arizona, Tucson, AZ 85721,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2004-1785","authenticated-orcid":false,"given":"Travis J","family":"Wheeler","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Montana , Missoula, MT 59812,","place":["United States"]},{"name":"College of Pharmacy, University of Arizona, Tucson, AZ 85721,","place":["United States"]}]}],"member":"286","published-online":{"date-parts":[[2025,7,15]]},"reference":[{"year":"2023","author":"Anderson","key":"2025071509025193100_btaf198-B1"},{"key":"2025071509025193100_btaf198-B2","doi-asserted-by":"crossref","first-page":"1798","DOI":"10.1109\/TPAMI.2013.50","article-title":"Representation learning: a review and new perspectives","volume":"35","author":"Bengio","year":"2013","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"2025071509025193100_btaf198-B3","doi-asserted-by":"publisher","first-page":"173","DOI":"10.1145\/3569951.3597559","volume-title":"Practice and Experience in Advanced Research Computing 2023: Computing for the Common Good, PEARC \u201923","author":"Boerner","year":"2023"},{"key":"2025071509025193100_btaf198-B4","doi-asserted-by":"crossref","first-page":"2102","DOI":"10.1093\/bioinformatics\/btac020","article-title":"ProteinBERT: a universal deep-learning model of protein sequence and function","volume":"38","author":"Brandes","year":"2022","journal-title":"Bioinformatics"},{"key":"2025071509025193100_btaf198-B5","doi-asserted-by":"crossref","first-page":"27","DOI":"10.21105\/joss.00027","article-title":"Sourmash: a library for MinHash sketching of DNA","volume":"1","author":"Brown","year":"2016","journal-title":"J Open Source Softw"},{"key":"2025071509025193100_btaf198-B6","doi-asserted-by":"crossref","first-page":"366","DOI":"10.1038\/s41592-021-01101-x","article-title":"Sensitive protein alignments at tree-of-life scale using DIAMOND","volume":"18","author":"Buchfink","year":"2021","journal-title":"Nat Methods"},{"key":"2025071509025193100_btaf198-B7","doi-asserted-by":"crossref","first-page":"421","DOI":"10.1186\/1471-2105-10-421","article-title":"BLAST+: architecture and applications","volume":"10","author":"Camacho","year":"2009","journal-title":"BMC Bioinformatics"},{"key":"2025071509025193100_btaf198-B8","doi-asserted-by":"crossref","first-page":"373","DOI":"10.1038\/s41467-017-02342-1","article-title":"A global ocean atlas of eukaryotic genes","volume":"9","author":"Carradec","year":"2018","journal-title":"Nature Commun"},{"year":"2021","author":"Chen","key":"2025071509025193100_btaf198-B9"},{"year":"2020","author":"Clevert","key":"2025071509025193100_btaf198-B10"},{"year":"2019","author":"Devlin","key":"2025071509025193100_btaf198-B11","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1423"},{"key":"2025071509025193100_btaf198-B12","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511790492","volume-title":"Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids","author":"Durbin","year":"1998"},{"key":"2025071509025193100_btaf198-B13","doi-asserted-by":"crossref","first-page":"755","DOI":"10.1093\/bioinformatics\/14.9.755","article-title":"Profile Hidden Markov models","volume":"14","author":"Eddy","year":"1998","journal-title":"Bioinformatics"},{"key":"2025071509025193100_btaf198-B14","doi-asserted-by":"crossref","first-page":"e1002195","DOI":"10.1371\/journal.pcbi.1002195","article-title":"Accelerated profile HMM searches","volume":"7","author":"Eddy","year":"2011","journal-title":"PLoS Comput Biol"},{"key":"2025071509025193100_btaf198-B15","doi-asserted-by":"crossref","first-page":"142","DOI":"10.1038\/s41586-021-04332-2","article-title":"Petabase-scale sequence alignment catalyses viral discovery","volume":"602","author":"Edgar","year":"2022","journal-title":"Nature"},{"key":"2025071509025193100_btaf198-B16","doi-asserted-by":"crossref","first-page":"7112","DOI":"10.1109\/TPAMI.2021.3095381","article-title":"ProtTrans: toward understanding the language of life through self-supervised learning","volume":"44","author":"Elnaggar","year":"2022","journal-title":"IEEE Transact Pattern Anal Mach Intel"},{"year":"2024","author":"ESM Team","key":"2025071509025193100_btaf198-B17"},{"key":"2025071509025193100_btaf198-B18","doi-asserted-by":"crossref","first-page":"e23","DOI":"10.1093\/nar\/gkq1212","article-title":"A new repeat-masking method enables specific detection of homologous sequences","volume":"39","author":"Frith","year":"2011","journal-title":"Nucleic Acids Res"},{"key":"2025071509025193100_btaf198-B19","doi-asserted-by":"crossref","first-page":"1165","DOI":"10.1101\/gr.279464.124","article-title":"A simple method for finding related sequences by adding probabilities of alternative alignments","volume":"34","author":"Frith","year":"2024","journal-title":"Genome Res"},{"key":"2025071509025193100_btaf198-B20","doi-asserted-by":"crossref","first-page":"80","DOI":"10.1186\/1471-2105-11-80","article-title":"Parameters for accurate genome alignment","volume":"11","author":"Frith","year":"2010","journal-title":"BMC Bioinformatics"},{"year":"2017","author":"Fu","key":"2025071509025193100_btaf198-B21"},{"key":"2025071509025193100_btaf198-B22","doi-asserted-by":"crossref","first-page":"4355","DOI":"10.1073\/pnas.84.13.4355","article-title":"Profile analysis: detection of distantly related proteins","volume":"84","author":"Gribskov","year":"1987","journal-title":"Proc Natl Acad Sci U S A"},{"year":"2020","author":"Guo","key":"2025071509025193100_btaf198-B23"},{"key":"2025071509025193100_btaf198-B24","doi-asserted-by":"crossref","first-page":"975","DOI":"10.1038\/s41587-023-01917-2","article-title":"Protein remote homology detection and structural alignment using deep learning","volume":"42","author":"Hamamsy","year":"2024","journal-title":"Nat Biotechnol"},{"key":"2025071509025193100_btaf198-B25","doi-asserted-by":"publisher","DOI":"10.1145\/3437359.3465565","volume-title":"Practice and Experience in Advanced Research Computing 2021: Evolution Across All Dimensions, PEARC\u201921","author":"Hancock","year":"2021"},{"year":"2016","author":"He","key":"2025071509025193100_btaf198-B26"},{"year":"2024","author":"Iovino","key":"2025071509025193100_btaf198-B27"},{"year":"2018","author":"Iwasaki","key":"2025071509025193100_btaf198-B28"},{"key":"2025071509025193100_btaf198-B29","doi-asserted-by":"crossref","first-page":"535","DOI":"10.1109\/TBDATA.2019.2921572","article-title":"Billion-scale similarity search with GPUs","volume":"7","author":"Johnson","year":"2021","journal-title":"IEEE Trans Big Data"},{"key":"2025071509025193100_btaf198-B30","doi-asserted-by":"crossref","first-page":"846","DOI":"10.1093\/bioinformatics\/14.10.846","article-title":"Hidden markov models for detecting remote protein homologies","volume":"14","author":"Karplus","year":"1998","journal-title":"Bioinformatics"},{"key":"2025071509025193100_btaf198-B31","doi-asserted-by":"crossref","first-page":"487","DOI":"10.1101\/gr.113985.110","article-title":"Adaptive seeds tame genomic sequence comparison","volume":"21","author":"Kie\u0142basa","year":"2011","journal-title":"Genome Res"},{"year":"2016","author":"Kimothi","key":"2025071509025193100_btaf198-B32"},{"year":"2017","author":"Kingma","key":"2025071509025193100_btaf198-B33"},{"year":"2024","author":"Krause","key":"2025071509025193100_btaf198-B34"},{"key":"2025071509025193100_btaf198-B35","doi-asserted-by":"crossref","first-page":"1501","DOI":"10.1006\/jmbi.1994.1104","article-title":"Hidden markov models in computational biology: applications to protein modeling","volume":"235","author":"Krogh","year":"1994","journal-title":"J Mol Biol"},{"key":"2025071509025193100_btaf198-B36","doi-asserted-by":"crossref","first-page":"220","DOI":"10.1038\/s42256-024-00795-w","article-title":"Protein function prediction as approximate semantic entailment","volume":"6","author":"Kulmanov","year":"2024","journal-title":"Nat Mach Intell"},{"year":"2023","author":"Lee","key":"2025071509025193100_btaf198-B37"},{"key":"2025071509025193100_btaf198-B38","doi-asserted-by":"crossref","first-page":"15","DOI":"10.1186\/s40168-020-00808-x","article-title":"MetaEuk\u2014sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics","volume":"8","author":"Levy Karin","year":"2020","journal-title":"Microbiome"},{"key":"2025071509025193100_btaf198-B39","doi-asserted-by":"crossref","first-page":"1123","DOI":"10.1126\/science.ade2574","article-title":"Evolutionary-scale prediction of atomic-level protein structure with a language model","volume":"379","author":"Lin","year":"2023","journal-title":"Science"},{"key":"2025071509025193100_btaf198-B40","doi-asserted-by":"crossref","first-page":"2775","DOI":"10.1038\/s41467-024-46808-5","article-title":"Plmsearch: protein language model powers accurate and fast sequence search for remote homology","volume":"15","author":"Liu","year":"2024","journal-title":"Nat Commun"},{"key":"2025071509025193100_btaf198-B41","doi-asserted-by":"crossref","first-page":"824","DOI":"10.1109\/TPAMI.2018.2889473","article-title":"Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs","volume":"42","author":"Malkov","year":"2020","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"2025071509025193100_btaf198-B42","first-page":"1145","article-title":"Leveraging protein language models for accurate multiple sequence alignments","volume":"33","author":"McWhite","year":"2023","journal-title":"Genome Res"},{"year":"2013","author":"Mikolov","key":"2025071509025193100_btaf198-B43"},{"key":"2025071509025193100_btaf198-B44","doi-asserted-by":"crossref","first-page":"D170","DOI":"10.1093\/nar\/gkw1081","article-title":"Uniclust databases of clustered and deeply annotated protein sequences and alignments","volume":"45","author":"Mirdita","year":"2017","journal-title":"Nucleic Acids Res"},{"key":"2025071509025193100_btaf198-B45","doi-asserted-by":"crossref","first-page":"e01468\u201321","DOI":"10.1128\/msystems.01468-21","article-title":"Quantifying and cataloguing unknown sequences within human microbiomes","volume":"7","author":"Modha","year":"2022","journal-title":"Msystems"},{"key":"2025071509025193100_btaf198-B46","doi-asserted-by":"crossref","first-page":"vbae149","DOI":"10.1093\/bioadv\/vbae149","article-title":"Ultra-effective labeling of tandem repeats in genomic sequence","volume":"4","author":"Olson","year":"2024","journal-title":"Bioinform Adv"},{"key":"2025071509025193100_btaf198-B47","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1109\/5.18626","article-title":"A tutorial on hidden markov models and selected applications in speech recognition","volume":"77","author":"Rabiner","year":"1989","journal-title":"Proc IEEE"},{"year":"2018","author":"Radford","key":"2025071509025193100_btaf198-B48"},{"key":"2025071509025193100_btaf198-B49","first-page":"5485","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"J Mach Learn Res"},{"key":"2025071509025193100_btaf198-B50","doi-asserted-by":"crossref","first-page":"e2016239118","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","volume":"118","author":"Rives","year":"2021","journal-title":"Proceedings of the National Academy of Sciences"},{"year":"2024","author":"Roddy","key":"2025071509025193100_btaf198-B51"},{"key":"2025071509025193100_btaf198-B52","doi-asserted-by":"crossref","first-page":"2080","DOI":"10.1101\/gr.275648.121","article-title":"Effective sequence similarity detection with strobemers","volume":"31","author":"Sahlin","year":"2021","journal-title":"Genome Res"},{"key":"2025071509025193100_btaf198-B53","doi-asserted-by":"crossref","first-page":"1033775","DOI":"10.3389\/fbinf.2022.1033775","article-title":"Nearest neighbor search on embeddings rapidly identifies distant protein relations","volume":"2","author":"Sch\u00fctze","year":"2022","journal-title":"Front Bioinform"},{"key":"2025071509025193100_btaf198-B54","first-page":"1857","article-title":"Improved deep metric learning with multi-class N-pair loss objective","volume":"29","author":"Sohn","year":"2016","journal-title":"Adv Neural Info Process Syst"},{"key":"2025071509025193100_btaf198-B55","doi-asserted-by":"crossref","first-page":"1026","DOI":"10.1038\/nbt.3988","article-title":"MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets","volume":"35","author":"Steinegger","year":"2017","journal-title":"Nat Biotechnol"},{"key":"2025071509025193100_btaf198-B56","doi-asserted-by":"crossref","first-page":"926","DOI":"10.1093\/bioinformatics\/btu739","article-title":"UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches","volume":"31","author":"Suzek","year":"2015","journal-title":"Bioinformatics"},{"key":"2025071509025193100_btaf198-B57","first-page":"5998","article-title":"Attention is all you need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025071509025193100_btaf198-B58","doi-asserted-by":"crossref","first-page":"702","DOI":"10.1002\/prot.20264","article-title":"Scoring function for automated assessment of protein structure template quality","volume":"57","author":"Zhang","year":"2004","journal-title":"Proteins: Structure, Function, and Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/Supplement_1\/i449\/63745255\/btaf198.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/Supplement_1\/i449\/63745255\/btaf198.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,15]],"date-time":"2025-07-15T13:03:05Z","timestamp":1752584585000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/41\/Supplement_1\/i449\/8199346"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,1]]},"references-count":58,"journal-issue":{"issue":"Supplement_1","published-print":{"date-parts":[[2025,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaf198","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2025,7]]},"published":{"date-parts":[[2025,7,1]]}}}