{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,14]],"date-time":"2026-04-14T16:57:03Z","timestamp":1776185823907,"version":"3.50.1"},"reference-count":47,"publisher":"Oxford University Press (OUP)","issue":"11","license":[{"start":{"date-parts":[[2024,10,26]],"date-time":"2024-10-26T00:00:00Z","timestamp":1729900800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research Council","doi-asserted-by":"publisher","award":["EP\/S024093\/1"],"award-info":[{"award-number":["EP\/S024093\/1"]}],"id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,11,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>The versatile binding properties of antibodies have made them an extremely important class of biotherapeutics. However, therapeutic antibody development is a complex, expensive, and time-consuming task, with the final antibody needing to not only have strong and specific binding but also be minimally impacted by developability issues. The success of transformer-based language models in protein sequence space and the availability of vast amounts of antibody sequences, has led to the development of many antibody-specific language models to help guide antibody design. Antibody diversity primarily arises from V(D)J recombination, mutations within the CDRs, and\/or from a few nongermline mutations outside the CDRs. Consequently, a significant portion of the variable domain of all natural antibody sequences remains germline. 
This affects the pre-training of antibody-specific language models, where this facet of the sequence data introduces a prevailing bias toward germline residues. This poses a challenge, as mutations away from the germline are often vital for generating specific and potent binding to a target, meaning that language models need to be able to suggest key mutations away from germline.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>In this study, we explore the implications of the germline bias, examining its impact on both general-protein and antibody-specific language models. We develop and train a series of new antibody-specific language models optimized for predicting nongermline residues. We then compare our final model, AbLang-2, with current models and show how it suggests a diverse set of valid mutations with high cumulative probability.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>AbLang-2 is trained on both unpaired and paired data, and is freely available at https:\/\/github.com\/oxpig\/AbLang2.git.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btae618","type":"journal-article","created":{"date-parts":[[2024,10,23]],"date-time":"2024-10-23T23:25:14Z","timestamp":1729725914000},"source":"Crossref","is-referenced-by-count":50,"title":["Addressing the antibody germline bias and its effect on language models for improved antibody design"],"prefix":"10.1093","volume":"40","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6348-4650","authenticated-orcid":false,"given":"Tobias H","family":"Olsen","sequence":"first","affiliation":[{"name":"Department of Statistics, University of Oxford , Oxford OX1 3LB,","place":["United Kingdom"]},{"name":"GSK Medicines Research Centre, GSK , Stevenage SG1 
2NY,","place":["United Kingdom"]}]},{"given":"Iain H","family":"Moal","sequence":"additional","affiliation":[{"name":"GSK Medicines Research Centre, GSK , Stevenage SG1 2NY,","place":["United Kingdom"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1388-2252","authenticated-orcid":false,"given":"Charlotte M","family":"Deane","sequence":"additional","affiliation":[{"name":"Department of Statistics, University of Oxford , Oxford OX1 3LB,","place":["United Kingdom"]}]}],"member":"286","published-online":{"date-parts":[[2024,10,26]]},"reference":[{"key":"2024110805205355600_btae618-B1","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2907070","volume":"49","author":"Branco","year":"2016"},{"key":"2024110805205355600_btae618-B2","doi-asserted-by":"publisher","first-page":"393","DOI":"10.1038\/s41586-019-0879-y","article-title":"Commonality despite exceptional diversity in the baseline human antibody repertoire","volume":"566","author":"Briney","year":"2019","journal-title":"Nature"},{"key":"2024110805205355600_btae618-B3","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2005.14165","article-title":"Language models are few-shot learners","author":"Brown","year":"2020"},{"key":"2024110805205355600_btae618-B4","doi-asserted-by":"publisher","first-page":"100967","DOI":"10.1016\/j.patter.2024.100967","article-title":"Improving antibody language models with native pairing","volume":"5","author":"Burbach","year":"2024","journal-title":"Patterns"},{"key":"2024110805205355600_btae618-B5","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1810.04805","article-title":"BERT: pre-training of deep bidirectional transformers for language understanding","author":"Devlin","year":"2018"},{"key":"2024110805205355600_btae618-B6","doi-asserted-by":"publisher","first-page":"298","DOI":"10.1093\/bioinformatics\/btv552","article-title":"ANARCI: antigen receptor numbering and receptor 
classification","volume":"32","author":"Dunbar","year":"2016","journal-title":"Bioinformatics"},{"key":"2024110805205355600_btae618-B7","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3095381","article-title":"ProtTrans: toward understanding the language of life through self-supervised learning","author":"Elnaggar","year":"2021","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"2024110805205355600_btae618-B8","doi-asserted-by":"publisher","first-page":"293","DOI":"10.1093\/ajcp\/aqaa112","article-title":"Review of current advances in serologic testing for COVID-19","volume":"154","author":"Espejo","year":"2020","journal-title":"Am J Clin Pathol"},{"key":"2024110805205355600_btae618-B9","author":"Falcon","year":"2019"},{"key":"2024110805205355600_btae618-B10","doi-asserted-by":"publisher","first-page":"4348","DOI":"10.1038\/s41467-022-32007-7","article-title":"ProtGPT2 is a deep unsupervised language model for protein design","volume":"13","author":"Ferruz","year":"2022","journal-title":"Nat Commun"},{"key":"2024110805205355600_btae618-B11","doi-asserted-by":"publisher","first-page":"59","DOI":"10.18653\/v1\/2022.ltedi-1.8","author":"Gira","year":"2022"},{"key":"2024110805205355600_btae618-B12","doi-asserted-by":"publisher","first-page":"275","DOI":"10.1038\/s41587-023-01763-2","article-title":"Efficient evolution of human antibodies from general protein language models","volume":"42","author":"Hie","year":"2024","journal-title":"Nat Biotechnol"},{"key":"2024110805205355600_btae618-B13","doi-asserted-by":"publisher","first-page":"352","DOI":"10.1038\/s41586-022-05371-z","article-title":"Functional antibodies exhibit light chain coherence","volume":"611","author":"Jaffe","year":"2022","journal-title":"Nature"},{"key":"2024110805205355600_btae618-B14","doi-asserted-by":"publisher","first-page":"2153410","DOI":"10.1080\/19420862.2022.2153410","article-title":"Antibodies to watch in 
2023","volume":"15","author":"Kaplon","year":"2023","journal-title":"MAbs"},{"key":"2024110805205355600_btae618-B15","doi-asserted-by":"publisher","first-page":"540","DOI":"10.1038\/s41587-020-0512-5","article-title":"Developing therapeutic monoclonal antibodies at pandemic pace","volume":"38","author":"Kelley","year":"2020","journal-title":"Nat Biotechnol"},{"key":"2024110805205355600_btae618-B16","doi-asserted-by":"publisher","first-page":"540","DOI":"10.5483\/BMBRep.2019.52.9.192","article-title":"Deep sequencing of B cell receptor repertoire","volume":"52","author":"Kim","year":"2019","journal-title":"BMB Rep"},{"key":"2024110805205355600_btae618-B17","doi-asserted-by":"publisher","first-page":"389","DOI":"10.3389\/fimmu.2017.00389","article-title":"Different somatic hypermutation levels among antibody subclasses disclosed by a new next-generation sequencing-based antibody repertoire analysis","volume":"8","author":"Kitaura","year":"2017","journal-title":"Front Immunol"},{"key":"2024110805205355600_btae618-B18","doi-asserted-by":"publisher","first-page":"100513","DOI":"10.1016\/j.patter.2022.100513","article-title":"Deciphering the language of antibodies using self-supervised learning","volume":"3","author":"Leem","year":"2022","journal-title":"Patterns (N Y)"},{"key":"2024110805205355600_btae618-B19","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2858826","article-title":"Focal loss for dense object detection","volume-title":"IEEE Trans Pattern Anal Mach Intell","author":"Lin","year":"2020"},{"key":"2024110805205355600_btae618-B20","doi-asserted-by":"publisher","first-page":"1123","DOI":"10.1126\/science.ade2574","article-title":"Evolutionary-scale prediction of atomic-level protein structure with a language 
model","volume":"379","author":"Lin","year":"2023","journal-title":"Science"},{"key":"2024110805205355600_btae618-B21","doi-asserted-by":"publisher","author":"Liu","year":"2019","DOI":"10.48550\/arXiv.1907.11692"},{"key":"2024110805205355600_btae618-B22","doi-asserted-by":"publisher","first-page":"46","DOI":"10.1038\/nri.2017.106","article-title":"Beyond binding: antibody effector functions in infectious diseases","volume":"18","author":"Lu","year":"2018","journal-title":"Nat Rev Immunol"},{"key":"2024110805205355600_btae618-B23","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s12929-019-0592-z","article-title":"Development of therapeutic antibodies for the treatment of diseases","volume":"27","author":"Lu","year":"2020","journal-title":"J Biomed Sci"},{"key":"2024110805205355600_btae618-B24","doi-asserted-by":"publisher","first-page":"9823","DOI":"10.1074\/jbc.REV120.010181","article-title":"How repertoire data are changing antibody science","volume":"295","author":"Marks","year":"2020","journal-title":"J Biol Chem"},{"key":"2024110805205355600_btae618-B25","doi-asserted-by":"publisher","author":"Meier","year":"2021","DOI":"10.1101\/2021.07.09.450648"},{"key":"2024110805205355600_btae618-B26","doi-asserted-by":"publisher","first-page":"968","DOI":"10.1016\/j.cels.2023.10.002","article-title":"ProGen2: exploring the boundaries of protein language models","volume":"14","author":"Nijkamp","year":"2023","journal-title":"Cell Syst"},{"key":"2024110805205355600_btae618-B27","doi-asserted-by":"publisher","first-page":"1549","DOI":"10.1093\/bib\/bbz095","article-title":"Computational approaches to therapeutic antibody design: established methods and emerging trends","volume":"21","author":"Norman","year":"2020","journal-title":"Brief Bioinform"},{"key":"2024110805205355600_btae618-B28","doi-asserted-by":"publisher","first-page":"141","DOI":"10.1002\/pro.4205","article-title":"Observed antibody space: a diverse database of cleaned, annotated, and translated 
unpaired and paired antibody sequences","volume":"31","author":"Olsen","year":"2022","journal-title":"Protein Sci"},{"key":"2024110805205355600_btae618-B29","doi-asserted-by":"publisher","first-page":"vbac046","DOI":"10.1093\/bioadv\/vbac046","article-title":"AbLang: an antibody language model for completing antibody sequences","volume":"2","author":"Olsen","year":"2022","journal-title":"Bioinform Adv"},{"key":"2024110805205355600_btae618-B30","first-page":"8024","volume-title":"Advances in Neural Information Processing Systems 32","author":"Paszke","year":"2019"},{"key":"2024110805205355600_btae618-B31","doi-asserted-by":"publisher","first-page":"2020203","DOI":"10.1080\/19420862.2021.2020203","article-title":"BioPhi: a platform for antibody design, humanization, and humanness evaluation based on natural antibody repertoires and deep learning","volume":"14","author":"Prihoda","year":"2022","journal-title":"MAbs"},{"key":"2024110805205355600_btae618-B32","author":"Radford","year":"2019"},{"key":"2024110805205355600_btae618-B33","doi-asserted-by":"publisher","first-page":"4025","DOI":"10.1073\/pnas.1810576116","article-title":"Five computational developability guidelines for therapeutic antibody profiling","volume":"116","author":"Raybould","year":"2019","journal-title":"Proc Natl Acad Sci USA"},{"key":"2024110805205355600_btae618-B34","doi-asserted-by":"publisher","first-page":"D383","DOI":"10.1093\/nar\/gkz827","article-title":"Thera-SAbDab: the therapeutic structural antibody database","volume":"48","author":"Raybould","year":"2020","journal-title":"Nucleic Acids Res"},{"key":"2024110805205355600_btae618-B35","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","volume":"118","author":"Rives","year":"2021","journal-title":"Proc Natl Acad Sci 
USA"},{"key":"2024110805205355600_btae618-B36","doi-asserted-by":"publisher","author":"Ruffolo","year":"2021","DOI":"10.48550\/arXiv.2112.07782"},{"key":"2024110805205355600_btae618-B37","doi-asserted-by":"publisher","author":"Salazar","year":"2019","DOI":"10.48550\/arXiv.1910.14659"},{"key":"2024110805205355600_btae618-B38","doi-asserted-by":"publisher","author":"Shaw","year":"2023","DOI":"10.1101\/2023.09.28.560044"},{"key":"2024110805205355600_btae618-B39","doi-asserted-by":"publisher","author":"Shazeer","year":"2020","DOI":"10.48550\/arXiv.2002.05202"},{"key":"2024110805205355600_btae618-B40","doi-asserted-by":"publisher","first-page":"2542","DOI":"10.1038\/s41467-018-04964-5","article-title":"Clustering huge protein sequence sets in linear time","volume":"9","author":"Steinegger","year":"2018","journal-title":"Nat Commun"},{"key":"2024110805205355600_btae618-B41","doi-asserted-by":"publisher","author":"Sun","DOI":"10.18653\/v1\/P19-1159"},{"key":"2024110805205355600_btae618-B42","doi-asserted-by":"publisher","first-page":"926","DOI":"10.1093\/bioinformatics\/btu739","article-title":"UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches","volume":"31","author":"Suzek","year":"2015","journal-title":"Bioinformatics"},{"key":"2024110805205355600_btae618-B43","doi-asserted-by":"publisher","author":"Tay","year":"2023","DOI":"10.48550\/arXiv.2205.05131"},{"key":"2024110805205355600_btae618-B44","doi-asserted-by":"publisher","first-page":"1244","DOI":"10.1016\/j.jmb.2017.03.014","article-title":"Prediction and reduction of the aggregation of monoclonal antibodies","volume":"429","author":"van der Kant","year":"2017","journal-title":"J Mol Biol"},{"key":"2024110805205355600_btae618-B45","doi-asserted-by":"publisher","first-page":"2023938","DOI":"10.1080\/19420862.2021.2023938","article-title":"In silico prediction of post-translational modifications in therapeutic 
antibodies","volume":"14","author":"Vatsa","year":"2022","journal-title":"MAbs"},{"key":"2024110805205355600_btae618-B46","doi-asserted-by":"publisher","first-page":"W34","DOI":"10.1093\/nar\/gkt382","article-title":"IgBLAST: an immunoglobulin variable domain sequence analysis tool","volume":"41","author":"Ye","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2024110805205355600_btae618-B47","doi-asserted-by":"publisher","author":"Zheng","DOI":"10.18653\/v1\/2021.emnlp-main.257"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btae618\/60129228\/btae618.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/11\/btae618\/60530393\/btae618.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/11\/btae618\/60530393\/btae618.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,8]],"date-time":"2024-11-08T00:21:06Z","timestamp":1731025266000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btae618\/7845256"}},"subtitle":[],"editor":[{"given":"Pier 
Luigi","family":"Martelli","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2024,10,26]]},"references-count":47,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2024,11,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btae618","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2024.02.02.578678","asserted-by":"object"}]},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024,11]]},"published":{"date-parts":[[2024,10,26]]},"article-number":"btae618"}}