{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,2]],"date-time":"2025-10-02T06:15:05Z","timestamp":1759385705687,"version":"build-2065373602"},"reference-count":36,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,10,2]],"date-time":"2025-10-02T00:00:00Z","timestamp":1759363200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Bioinform."],"abstract":"<jats:p>T-cell receptor (TCR) sequencing has emerged as a powerful tool for understanding adaptive immune responses, yet challenges persist in deciphering the immense diversity of Complementarity-Determining Region 3 (CDR3) sequences. This study presents a novel natural language processing (NLP)-based pipeline to cluster CDR3 sequences from TCR \u03b2-chain repertoires using Word2Vec embeddings, principal component analysis (PCA), and KMeans clustering. Focusing on Acute Respiratory Distress Syndrome (ARDS), a life-threatening inflammatory lung condition, we trained Word2Vec models on healthy controls and applied unsupervised clustering across ARDS, non-ARDS, and control datasets. Dimensionality-reduced embeddings revealed clear distinctions in repertoire structure: control samples exhibited tight, low-diversity clusters; ARDS patients showed high dispersion and numerous diffuse clusters indicative of repertoire disruption; and non-ARDS samples displayed intermediate organization. These differences suggest that immune activation states are embedded in the structural topology of the CDR3 space. Our framework successfully captured these latent patterns, offering a scalable approach to biomarker discovery. 
This study not only reinforces the utility of NLP in immunological analysis but also paves the way for data-driven immune monitoring in critical care and personalized diagnostics.<\/jats:p>","DOI":"10.3389\/fbinf.2025.1623488","type":"journal-article","created":{"date-parts":[[2025,10,2]],"date-time":"2025-10-02T05:31:45Z","timestamp":1759383105000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Optimizing clustering of CDR3 sequences using natural language processing, Word2Vec, and KMeans"],"prefix":"10.3389","volume":"5","author":[{"given":"Sanskriti","family":"Baranwal","sequence":"first","affiliation":[]},{"given":"Ricardo Avila","family":"Sanchez","sequence":"additional","affiliation":[]},{"given":"Clement-Andi","family":"Edet","sequence":"additional","affiliation":[]},{"given":"Erick","family":"Chastain","sequence":"additional","affiliation":[]},{"given":"Inimary","family":"Toby","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2025,10,2]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","first-page":"1315","DOI":"10.1038\/s41592-019-0598-1","article-title":"Unified rational protein engineering with sequence-based deep representation learning","volume":"16","author":"Alley","year":"2019","journal-title":"Nat. Methods"},{"key":"B2","doi-asserted-by":"publisher","first-page":"788","DOI":"10.1001\/jama.2016.0291","article-title":"Epidemiology, patterns of care, and mortality for patients with acute respiratory distress syndrome in intensive care units in 50 countries","volume":"315","author":"Bellani","year":"2016","journal-title":"JAMA"},{"key":"B3","doi-asserted-by":"publisher","first-page":"380","DOI":"10.1038\/nmeth.3364","article-title":"MiXCR: software for comprehensive adaptive immunity profiling","volume":"12","author":"Bolotin","year":"2015","journal-title":"Nat. 
Methods"},{"key":"B4","doi-asserted-by":"publisher","first-page":"798","DOI":"10.1177\/10815589241270612","article-title":"Acute respiratory distress syndrome: a review of ARDS across the life course","volume":"72","author":"Cave","year":"2024","journal-title":"J. Investigative Med."},{"key":"B5","doi-asserted-by":"publisher","first-page":"640725","DOI":"10.3389\/fimmu.2021.640725","article-title":"TCRMatch: predicting T-cell receptor specificity based on sequence similarity to previously characterized receptors","volume":"12","author":"Chronister","year":"2021","journal-title":"Front. Immunol."},{"key":"B6","doi-asserted-by":"publisher","first-page":"89","DOI":"10.1038\/nature22383","article-title":"Quantifiable predictive features define epitope-specific T cell receptor repertoires","volume":"547","author":"Dash","year":"2017","journal-title":"Nature"},{"key":"B27","doi-asserted-by":"publisher","first-page":"1213706","DOI":"10.3389\/fgene.2023.1213706","article-title":"Erratum: 10 Years of toxicogenomics section in frontiers in Genetics: past discoveries and future Perspectives","volume":"14","author":"ElSayed","year":"2023","journal-title":"Front. Genet."},{"key":"B7","doi-asserted-by":"publisher","first-page":"659","DOI":"10.1038\/ng.3822","article-title":"Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire","volume":"49","author":"Emerson","year":"2017","journal-title":"Nat. 
Genet."},{"key":"B8","doi-asserted-by":"publisher","first-page":"698","DOI":"10.1001\/jama.2017.21907","article-title":"ARDS: advances in diagnosis and treatment","volume":"319","author":"Fan","year":"2018","journal-title":"JAMA"},{"key":"B9","doi-asserted-by":"publisher","first-page":"94","DOI":"10.1038\/nature22976","article-title":"Identifying specificity groups in the T cell receptor repertoire","volume":"547","author":"Glanville","year":"2017","journal-title":"Nature"},{"key":"B10","doi-asserted-by":"publisher","first-page":"1467","DOI":"10.1016\/j.celrep.2017.04.054","article-title":"Systems analysis reveals high genetic and antigen-driven predetermination of antibody repertoires throughout B cell development","volume":"19","author":"Greiff","year":"2017","journal-title":"Cell Rep."},{"key":"B11","volume-title":"Modeling the language of life","author":"Heinzinger","year":"2019"},{"key":"B12","doi-asserted-by":"publisher","first-page":"825","DOI":"10.3390\/biom13050825","article-title":"Analysis of CDR3 sequences from T-cell receptor \u03b2 in acute respiratory distress syndrome","volume":"13","author":"Hey","year":"2023","journal-title":"Biomolecules"},{"key":"B21","doi-asserted-by":"publisher","first-page":"308","DOI":"10.1016\/j.ijid.2021.10.033","article-title":"T-cell receptor repertoires as potential diagnostic markers for patients with COVID-19","volume":"113","author":"Hou","year":"2021","journal-title":"Int. J. Infect. Dis."},{"key":"B13","doi-asserted-by":"publisher","first-page":"013011","DOI":"10.1103\/PRXLife.2.013011","article-title":"Local and global variability in developing human T-cell repertoires","volume":"2","author":"Isacchini","year":"2024","journal-title":"PRX Life"},{"key":"B36","doi-asserted-by":"publisher","DOI":"10.3389\/fimmu.2022.858057","article-title":"Machine learning approaches to TCR repertoire analysis","volume":"13","author":"Katayama","year":"2022","journal-title":"Front. 
Immunol."},{"key":"B14","doi-asserted-by":"publisher","first-page":"1623","DOI":"10.1038\/s41591-020-1038-6","article-title":"A dynamic COVID-19 immune signature includes associations with poor prognosis","volume":"26","author":"Laing","year":"2020","journal-title":"Nat. Med."},{"key":"B15","doi-asserted-by":"publisher","first-page":"535","DOI":"10.1038\/nbt.1856","article-title":"Autoantigen discovery with a synthetic human peptidome","volume":"29","author":"Larman","year":"2011","journal-title":"Nat. Biotechnol."},{"key":"B17","doi-asserted-by":"publisher","first-page":"75","DOI":"10.1038\/s41392-025-02127-9","article-title":"Advances in acute respiratory distress syndrome: focusing on heterogeneity, pathophysiology, and therapeutic strategies","volume":"10","author":"Ma","year":"2025","journal-title":"Signal Transduct. Target. Ther."},{"key":"B18","doi-asserted-by":"publisher","first-page":"18","DOI":"10.1038\/s41572-019-0069-0","article-title":"Acute respiratory distress syndrome","volume":"5","author":"Matthay","year":"2019","journal-title":"Nat. Rev. Dis. Prim."},{"key":"B19","article-title":"TCR repertoire sequencing","author":"Mazzotti","year":"2022","journal-title":"Encyclopedia"},{"key":"B23","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1164\/rccm.202303-0558WS","article-title":"A new global definition of ARDS","volume":"207","author":"Matthay","year":"2023","journal-title":"AJRCCM"},{"key":"B20","volume-title":"Efficient estimation of Word representations in vector space","author":"Mikolov","year":"2013"},{"key":"B22","doi-asserted-by":"publisher","first-page":"76","DOI":"10.1038\/s42003-023-04447-4","article-title":"Machine learning identifies T cell receptor repertoire signatures associated with COVID-19 severity","volume":"6","author":"Park","year":"2023","journal-title":"Commun. 
Biol."},{"key":"B24","doi-asserted-by":"publisher","first-page":"691","DOI":"10.1016\/S2213-2600(18)30177-2","article-title":"Acute respiratory distress syndrome subphenotypes and differential response to simvastatin: secondary analysis of a randomised controlled trial","volume":"6","author":"Reilly","year":"2018","journal-title":"Lancet Respir. Med."},{"key":"B25","doi-asserted-by":"publisher","first-page":"4099","DOI":"10.1182\/blood-2009-04-217604","article-title":"Comprehensive assessment of T-cell receptor \u03b2-chain diversity in \u03b1\u03b2 T cells","volume":"114","author":"Robins","year":"2009","journal-title":"Blood"},{"key":"B26","doi-asserted-by":"publisher","first-page":"61","DOI":"10.1186\/s12896-017-0379-9","article-title":"Overview of methodologies for T-cell receptor repertoire analysis","volume":"17","author":"Rosati","year":"2017","journal-title":"BMC Biotechnol."},{"key":"B28","doi-asserted-by":"publisher","first-page":"e1008394","DOI":"10.1371\/journal.pcbi.1008394","article-title":"Population variability in the generation and selection of T-cell repertoires","volume":"16","author":"Sethna","year":"2020","journal-title":"PLOS Comput. Biol."},{"key":"B29","article-title":"Acute respiratory distress syndrome","author":"Sharma","year":"2023","journal-title":"StatPearls"},{"key":"B30","doi-asserted-by":"publisher","first-page":"653","DOI":"10.1038\/nmeth.2960","article-title":"Towards error-free profiling of immune repertoires","volume":"11","author":"Shugay","year":"2014","journal-title":"Nat. Methods"},{"key":"B31","doi-asserted-by":"publisher","first-page":"1605","DOI":"10.1038\/s41467-021-21879-w","article-title":"DeepTCR: structural concepts in TCRs","volume":"12","author":"Sidhom","year":"2021","journal-title":"Nat. 
Commun."},{"key":"B32","doi-asserted-by":"publisher","first-page":"1059","DOI":"10.1016\/j.cels.2023.11.004","article-title":"Machine learning analysis of the T cell receptor repertoire identifies sequence features of self-reactivity","volume":"14","author":"Textor","year":"2023","journal-title":"Cell Syst."},{"key":"B33","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2006.15222","article-title":"BERTology meets Biology","author":"Vig","year":"2021","journal-title":"ICLR"},{"key":"B34","doi-asserted-by":"publisher","first-page":"e076612","DOI":"10.1136\/bmj-2023-076612","article-title":"Acute respiratory distress syndrome","volume":"387","author":"Wick","year":"2024","journal-title":"BMJ"},{"key":"B35","doi-asserted-by":"publisher","DOI":"10.3389\/fimmu.2021.680687","article-title":"Immune2vec","volume":"12","author":"Wolock","year":"2022","journal-title":"Front. Immunol."},{"key":"B16","doi-asserted-by":"publisher","DOI":"10.1101\/2023.04.12.536635","article-title":"Context-aware amino acid embedding","author":"Zhang","year":"2023","journal-title":"eLife"}],"container-title":["Frontiers in Bioinformatics"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fbinf.2025.1623488\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,2]],"date-time":"2025-10-02T05:31:49Z","timestamp":1759383109000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fbinf.2025.1623488\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,2]]},"references-count":36,"alternative-id":["10.3389\/fbinf.2025.1623488"],"URL":"https:\/\/doi.org\/10.3389\/fbinf.2025.1623488","relation":{},"ISSN":["2673-7647"],"issn-type":[{"value":"2673-7647","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,2]]},"article-number":"1623488"}}