{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,26]],"date-time":"2026-03-26T15:41:58Z","timestamp":1774539718904,"version":"3.50.1"},"reference-count":40,"publisher":"Oxford University Press (OUP)","issue":"Supplement_1","license":[{"start":{"date-parts":[[2025,7,15]],"date-time":"2025-07-15T00:00:00Z","timestamp":1752537600000},"content-version":"vor","delay-in-days":14,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["R01-GM076275"],"award-info":[{"award-number":["R01-GM076275"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Princeton Laboratory for Artificial Intelligence"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Protein language models (PLMs) are amongst the most exciting recent advances for characterizing protein sequences, and have enabled a diverse set of applications, including structure determination, functional property prediction, and mutation impact assessment, all from single protein sequences alone. State-of-the-art PLMs leverage transformer architectures originally developed for natural language processing, and are pre-trained on large protein databases to generate contextualized representations of individual amino acids. To harness the power of these PLMs to predict protein-level properties, these per-residue embeddings are typically \u201cpooled\u201d to fixed-size vectors that are further utilized in downstream prediction networks. Common pooling strategies include Cls-Pooling and Avg-Pooling, but neither of these approaches can capture the local substructures and long-range interactions observed in proteins.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>We propose the use of attention pooling, which can naturally capture these important features of proteins. To make the expensive attention operator (quadratic in the length of the input protein) feasible in practice, we introduce bag-of-mer pooling, or BoM-Pooling, a locality-aware hierarchical pooling technique that combines windowed average pooling with attention pooling. We empirically demonstrate that both full attention pooling and BoM-Pooling outperform previous pooling strategies on three important, diverse tasks: (i) predicting the activities of two proteins as they are varied; (ii) detecting remote homologs; and (iii) predicting signaling protein interactions with peptides. Overall, our work highlights the advantages of biologically inspired pooling techniques in protein sequence modeling and is a step toward more effective adaptations of language models in biological settings.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>https:\/\/github.com\/Singh-Lab\/bom-pooling.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaf178","type":"journal-article","created":{"date-parts":[[2025,7,15]],"date-time":"2025-07-15T13:02:17Z","timestamp":1752584537000},"page":"i217-i226","source":"Crossref","is-referenced-by-count":3,"title":["Locality-aware pooling enhances protein language model performance across varied applications"],"prefix":"10.1093","volume":"41","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1053-6744","authenticated-orcid":false,"given":"Minh","family":"Hoang","sequence":"first","affiliation":[{"name":"Lewis-Sigler Institute of Integrative Genomics, Princeton University , Princeton, NJ 08540,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8271-6026","authenticated-orcid":false,"given":"Mona","family":"Singh","sequence":"additional","affiliation":[{"name":"Lewis-Sigler Institute of Integrative Genomics, Princeton University , Princeton, NJ 08540,","place":["United States"]},{"name":"Department of Computer Science, Princeton University , Princeton, NJ 08540,","place":["United States"]}]}],"member":"286","published-online":{"date-parts":[[2025,7,15]]},"reference":[{"key":"2025071509021094000_btaf178-B1","doi-asserted-by":"crossref","first-page":"1315","DOI":"10.1038\/s41592-019-0598-1","article-title":"Unified rational protein engineering with sequence-based deep representation learning","volume":"16","author":"Alley","year":"2019","journal-title":"Nat Methods"},{"key":"2025071509021094000_btaf178-B2","author":"Bepler","year":"2019"},{"key":"2025071509021094000_btaf178-B3","doi-asserted-by":"crossref","first-page":"654","DOI":"10.1016\/j.cels.2021.05.017","article-title":"Learning the protein language: evolution, structure, and function","volume":"12","author":"Bepler","year":"2021","journal-title":"Cell Syst"},{"key":"2025071509021094000_btaf178-B4","doi-asserted-by":"crossref","first-page":"1512","DOI":"10.1038\/s41588-023-01465-0","article-title":"Genome-wide prediction of disease variant effects with a deep protein language model","volume":"55","author":"Brandes","year":"2023","journal-title":"Nat Genet"},{"key":"2025071509021094000_btaf178-B5","doi-asserted-by":"crossref","first-page":"1617","DOI":"10.1038\/s41587-022-01432-w","article-title":"Single-sequence protein structure prediction using a language model and deep learning","volume":"40","author":"Chowdhury","year":"2022","journal-title":"Nat Biotechnol"},{"key":"2025071509021094000_btaf178-B6","first-page":"197","volume-title":"ACM-SIAM Symposium on Discrete Algorithms","author":"Cormode"},{"key":"2025071509021094000_btaf178-B7","doi-asserted-by":"crossref","first-page":"175","DOI":"10.1038\/s41592-019-0687-1","article-title":"Biophysical prediction of protein-peptide interactions and signaling networks using machine learning","volume":"17","author":"Cunningham","year":"2020","journal-title":"Nat Methods"},{"key":"2025071509021094000_btaf178-B8","volume-title":"Annual Conference of the North American Chapter of the ACL","author":"Devlin","year":"2019"},{"key":"2025071509021094000_btaf178-B9","first-page":"7112","author":"Elnaggar","year":"2021"},{"key":"2025071509021094000_btaf178-B10","doi-asserted-by":"crossref","first-page":"D304","DOI":"10.1093\/nar\/gkt1240","article-title":"SCOPe: structural classification of proteins\u2014extended, integrating SCOP and ASTRAL data and classification of new structures","volume":"42","author":"Fox","year":"2014","journal-title":"Nucleic Acids Res"},{"key":"2025071509021094000_btaf178-B11","doi-asserted-by":"crossref","first-page":"116","DOI":"10.1016\/j.cels.2017.11.003","article-title":"Quantitative missense variant effect prediction using large-scale mutagenesis data","volume":"6","author":"Gray","year":"2018","journal-title":"Cell Syst"},{"key":"2025071509021094000_btaf178-B12","doi-asserted-by":"crossref","first-page":"975","DOI":"10.1038\/s41587-023-01917-2","article-title":"Protein remote homology detection and structural alignment using deep learning","volume":"42","author":"Hamamsy","year":"2024","journal-title":"Nat Biotechnol"},{"key":"2025071509021094000_btaf178-B13","first-page":"770","author":"He","year":"2016"},{"key":"2025071509021094000_btaf178-B14","doi-asserted-by":"crossref","first-page":"723","DOI":"10.1186\/s12859-019-3220-8","article-title":"Modeling aspects of the language of life through transfer-learning protein sequences","volume":"20","author":"Heinzinger","year":"2019","journal-title":"BMC Bioinformatics"},{"key":"2025071509021094000_btaf178-B15","doi-asserted-by":"crossref","first-page":"2","DOI":"10.1089\/cmb.2023.0212","article-title":"Density and conservation optimization of the generalized masked-minimizer sketching scheme","volume":"31","author":"Hoang","year":"2024","journal-title":"J Comput Biol"},{"key":"2025071509021094000_btaf178-B16","first-page":"52","author":"Hoang","year":"2022"},{"key":"2025071509021094000_btaf178-B17","doi-asserted-by":"crossref","first-page":"1288","DOI":"10.1089\/cmb.2022.0275","article-title":"Differentiable learning of sequence-specific minimizer schemes with DeepMinimizer","volume":"29","author":"Hoang","year":"2022","journal-title":"J Comput Biol"},{"key":"2025071509021094000_btaf178-B18","first-page":"709","volume-title":"European Conference on Computer Vision","author":"Jia","year":"2022"},{"key":"2025071509021094000_btaf178-B19","doi-asserted-by":"crossref","first-page":"e2308788121","DOI":"10.1073\/pnas.2308788121","article-title":"Single-sequence protein structure prediction by integrating protein language models","volume":"121","author":"Jing","year":"2024","journal-title":"Proc Natl Acad Sci USA"},{"key":"2025071509021094000_btaf178-B20","doi-asserted-by":"crossref","first-page":"431","DOI":"10.1186\/1471-2105-11-431","article-title":"Hidden Markov model speed heuristic and iterative HMM search procedure","volume":"11","author":"Johnson","year":"2010","journal-title":"BMC Bioinformatics"},{"key":"2025071509021094000_btaf178-B21","author":"Kingma","year":"2014"},{"key":"2025071509021094000_btaf178-B22","first-page":"3087","article-title":"An attention pooling based representation learning method for speech emotion recognition","volume":"19","author":"Li","year":"2018","journal-title":"Int Speech Commun Assoc"},{"key":"2025071509021094000_btaf178-B23","doi-asserted-by":"crossref","first-page":"1123","DOI":"10.1126\/science.ade2574","article-title":"Evolutionary-scale prediction of atomic-level protein structure with a language model","volume":"379","author":"Lin","year":"2023","journal-title":"Science"},{"key":"2025071509021094000_btaf178-B24","author":"Loshchilov"},{"key":"2025071509021094000_btaf178-B25","first-page":"29287","article-title":"Language models enable zero-shot prediction of the effects of mutations on protein function","volume":"34","author":"Meier","year":"2021","journal-title":"Adv Neural Inf Process Syst"},{"key":"2025071509021094000_btaf178-B26","doi-asserted-by":"crossref","first-page":"ra2","DOI":"10.1126\/scisignal.1159433","article-title":"Linear motif atlas for phosphorylation-dependent signaling","volume":"1","author":"Miller","year":"2008","journal-title":"Sci Signal"},{"key":"2025071509021094000_btaf178-B27","doi-asserted-by":"crossref","article-title":"Aggregating residue-level protein language model embeddings with optimal transport","author":"NaderiAlizadeh","DOI":"10.1093\/bioadv\/vbaf060"},{"key":"2025071509021094000_btaf178-B28","doi-asserted-by":"crossref","first-page":"5237","DOI":"10.1128\/jb.174.16.5237-5243.1992","article-title":"Identification of amino acid substitutions that alter the substrate specificity of TEM-1 \u03b2-lactamase","volume":"174","author":"Palzkill","year":"1992","journal-title":"J Bacteriol"},{"key":"2025071509021094000_btaf178-B29","first-page":"2435","author":"Radiya-Dixit","year":"2020"},{"key":"2025071509021094000_btaf178-B30","volume-title":"Advances in Neural Information Processing Systems","author":"Rao","year":"2019"},{"key":"2025071509021094000_btaf178-B31","doi-asserted-by":"crossref","first-page":"e2016239118","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","volume":"118","author":"Rives","year":"2021","journal-title":"Proc Natl Acad Sci USA"},{"key":"2025071509021094000_btaf178-B32","doi-asserted-by":"crossref","first-page":"397","DOI":"10.1038\/nature17995","article-title":"Local fitness landscape of the green fluorescent protein","volume":"533","author":"Sarkisyan","year":"2016","journal-title":"Nature"},{"key":"2025071509021094000_btaf178-B33","doi-asserted-by":"crossref","first-page":"7407","DOI":"10.1038\/s41467-024-51844-2","article-title":"Fine-tuning protein language models boosts predictions across diverse tasks","volume":"15","author":"Schmirler","year":"2024","journal-title":"Nat Commun"},{"key":"2025071509021094000_btaf178-B34","first-page":"815","author":"Schroff","year":"2015"},{"key":"2025071509021094000_btaf178-B35","doi-asserted-by":"crossref","DOI":"10.1073\/pnas.2405840121","article-title":"Democratizing protein language models with parameter-efficient fine-tuning","volume":"121","author":"Sledzieski","year":"2024","journal-title":"Proc Natl Acad Sci USA"},{"key":"2025071509021094000_btaf178-B36","doi-asserted-by":"crossref","first-page":"1026","DOI":"10.1038\/nbt.3988","article-title":"MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets","volume":"35","author":"Steinegger","year":"2017","journal-title":"Nat Biotechnol"},{"key":"2025071509021094000_btaf178-B37","doi-asserted-by":"crossref","first-page":"2997","DOI":"10.1093\/nar\/10.9.2997","article-title":"Use of the \u2018perceptron\u2019 algorithm to distinguish translational initiation sites in E. coli","volume":"10","author":"Stormo","year":"1982","journal-title":"Nucleic Acids Res"},{"key":"2025071509021094000_btaf178-B38","article-title":"Attention is all you need","author":"Vaswani","journal-title":"Advances in Neural Information Processing Syst"},{"key":"2025071509021094000_btaf178-B39","author":"Xie","year":"2024"},{"key":"2025071509021094000_btaf178-B40","first-page":"12437","author":"Zhang","year":"2021"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/Supplement_1\/i217\/63745434\/btaf178.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/Supplement_1\/i217\/63745434\/btaf178.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,15]],"date-time":"2025-07-15T13:02:21Z","timestamp":1752584541000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/41\/Supplement_1\/i217\/8199370"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,1]]},"references-count":40,"journal-issue":{"issue":"Supplement_1","published-print":{"date-parts":[[2025,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaf178","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025,7]]},"published":{"date-parts":[[2025,7,1]]}}}