{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T19:12:49Z","timestamp":1770750769209,"version":"3.50.0"},"reference-count":51,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,9,17]],"date-time":"2025-09-17T00:00:00Z","timestamp":1758067200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100004359","name":"Vetenskapsr\u00e5det","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100004359","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100003170","name":"Stiftelsen f\u00f6r Kunskaps- och Kompetensutveckling","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100003170","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Artif. Intell."],"abstract":"<jats:p>Transformer models pre-trained on self-supervised tasks and fine-tuned on downstream objectives have achieved remarkable results across a variety of domains. However, fine-tuning these models for clinical predictions from longitudinal medical data, such as electronic health records (EHR), remains challenging due to limited labeled data and the complex, event-driven nature of medical sequences. While self-attention mechanisms are powerful for capturing relationships within sequences, they may underperform when modeling subtle dependencies between sparse clinical events under limited supervision. We introduce a simple yet effective fine-tuning technique, Adaptive Noise-Augmented Attention (ANAA), which injects adaptive noise directly into the self-attention weights and applies a 2D Gaussian kernel to smooth the resulting attention maps. This mechanism broadens the attention distribution across tokens while refining it to emphasize more informative events. Unlike prior approaches that require expensive modifications to the architecture and pre-training phase, ANAA operates entirely during fine-tuning. Empirical results across multiple clinical prediction tasks demonstrate consistent performance improvements. Furthermore, we analyze how ANAA shapes the learned attention behavior, offering interpretable insights into the model's handling of temporal dependencies in EHR data.<\/jats:p>","DOI":"10.3389\/frai.2025.1663484","type":"journal-article","created":{"date-parts":[[2025,9,17]],"date-time":"2025-09-17T05:40:11Z","timestamp":1758087611000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Adaptive noise-augmented attention for enhancing Transformer fine-tuning on longitudinal medical data"],"prefix":"10.3389","volume":"8","author":[{"given":"Ali","family":"Amirahmadi","sequence":"first","affiliation":[]},{"given":"Farzaneh","family":"Etminani","sequence":"additional","affiliation":[]},{"given":"Mattias","family":"Ohlsson","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2025,9,17]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","first-page":"e68138","DOI":"10.2196\/68138","article-title":"Trajectory-ordered objectives for self-supervised representation learning of temporal healthcare data using transformers: Model development and evaluation study","volume":"13","author":"Amirahmadi","year":"2025","journal-title":"JMIR Med. 
Inform"},{"key":"B2","doi-asserted-by":"publisher","first-page":"104430","DOI":"10.1016\/j.jbi.2023.104430","article-title":"Deep learning prediction models based on ehr trajectories: a systematic review","volume":"144","author":"Amirahmadi","year":"2023","journal-title":"J. Biomed. Inform"},{"key":"B3","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2310.02980","article-title":"Never train from scratch: fair comparison of long-sequence models requires data-driven priors","author":"Amos","year":"2023","journal-title":"arXiv preprint arXiv:2310.02980"},{"key":"B4","doi-asserted-by":"publisher","first-page":"45","DOI":"10.1111\/j.1365-2796.1993.tb00647.x","article-title":"The malmo diet and cancer study. Design and feasibility","volume":"233","author":"Berglund","year":"1993","journal-title":"J. Intern. Med"},{"key":"B5","doi-asserted-by":"publisher","first-page":"104616","DOI":"10.1016\/j.jbi.2024.104616","article-title":"Graph neural networks for clinical risk prediction based on electronic health records: a survey","volume":"151","author":"Boll","year":"2024","journal-title":"J. Biomed. Inform"},{"key":"B6","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2108.07258","article-title":"On the opportunities and risks of foundation models","author":"Bommasani","year":"2021","journal-title":"arXiv preprint arXiv:2108.07258"},{"key":"B7","doi-asserted-by":"publisher","first-page":"1877","DOI":"10.48550\/arXiv.2005.14165","article-title":"Language models are few-shot learners","volume":"33","author":"Brown","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst"},{"key":"B8","doi-asserted-by":"publisher","first-page":"16603","DOI":"10.48550\/arXiv.2007.07368","article-title":"Explicit regularisation in gaussian noise injections","volume":"33","author":"Camuto","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst"},{"key":"B9","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.194","article-title":"Mixtext: linguistically-informed interpolation of hidden space for semi-supervised text classification","author":"Chen","year":"2020","journal-title":"arXiv preprint arXiv:2004.12239"},{"key":"B10","first-page":"9640","article-title":"\u201cAn empirical study of training self-supervised vision transformers,\u201d","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Chen","year":"2021"},{"key":"B11","doi-asserted-by":"publisher","first-page":"606","DOI":"10.1609\/aaai.v34i01.5400","article-title":"Learning the graphical structure of electronic health records with graph convolutional transformer","volume":"34","author":"Choi","year":"2020","journal-title":"Proc. AAAI Conf. Artif. Intell"},{"key":"B12","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-4828","article-title":"What does bert look at? 
an analysis of bert's attention","author":"Clark","year":"2019","journal-title":"arXiv preprint arXiv:1906.04341"},{"key":"B13","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1810.04805","article-title":"Bert: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin","year":"2018","journal-title":"arXiv preprint arXiv:1810.04805"},{"key":"B14","doi-asserted-by":"publisher","DOI":"10.14218\/JCTH.2022.00006S","article-title":"Longnet: scaling transformers to 1,000,000,000 tokens","author":"Ding","year":"2023","journal-title":"arXiv preprint arXiv:2307.02486"},{"key":"B15","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2010.11929","article-title":"An image is worth 16 \u00d7 16 words: transformers for image recognition at scale","author":"Dosovitskiy","year":"2020","journal-title":"arXiv preprint arXiv"},{"key":"B16","doi-asserted-by":"publisher","first-page":"21271","DOI":"10.48550\/arXiv.2006.07733","article-title":"Bootstrap your own latent-a new approach to self-supervised learning","volume":"33","author":"Grill","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst"},{"key":"B17","doi-asserted-by":"publisher","first-page":"3767","DOI":"10.1038\/s41598-023-30820-8","article-title":"Ehr foundation models improve robustness in the presence of temporal distribution shift","volume":"13","author":"Guo","year":"2023","journal-title":"Sci. Rep"},{"key":"B18","doi-asserted-by":"publisher","first-page":"12963","DOI":"10.1609\/aaai.v35i14.17533","article-title":"Self-attention attribution: interpreting information interactions inside transformer","volume":"35","author":"Hao","year":"2021","journal-title":"Proc. AAAI Conf. Artif. Intell"},{"key":"B19","first-page":"6185","article-title":"\u201cNeighborhood attention transformer,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Hassani","year":"2023"},{"key":"B20","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2104.05704","article-title":"Escaping the big data paradigm with compact transformers","author":"Hassani","year":"2021","journal-title":"arXiv preprint arXiv:2104.05704"},{"key":"B21","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2310.05914","article-title":"Neftune: Noisy embeddings improve instruction finetuning","author":"Jain","year":"2023","journal-title":"arXiv preprint arXiv:2310.05914"},{"key":"B22","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1902.10186","article-title":"Attention is not explanation","author":"Jain","year":"2019","journal-title":"arXiv preprint arXiv:1902.10186"},{"key":"B23","unstructured":"Johnson\n              A.\n            \n            \n              Bulgarelli\n              L.\n            \n            \n              Pollard\n              T.\n            \n            \n              Horng\n              S.\n            \n            \n              Celi\n              L. A.\n            \n            \n              Mark\n              R.\n            \n          \n          Mimic-iv. PhysioNet\n          \n          2020"},{"key":"B24","doi-asserted-by":"publisher","first-page":"1397298","DOI":"10.3389\/frai.2024.1397298","article-title":"Self-attention with temporal prior: can we learn more from the arrow of time?","volume":"7","author":"Kim","year":"2024","journal-title":"Front. Artif. 
Intell"},{"key":"B25","first-page":"60","article-title":"\u201cRobust optimization as data augmentation for large-scale graphs,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Kong","year":"2022"},{"key":"B26","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1445","article-title":"Revealing the dark secrets of bert","author":"Kovaleva","year":"2019","journal-title":"arXiv preprint arXiv"},{"key":"B27","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1909.11942","article-title":"Albert: a lite bert for self-supervised learning of language representations","author":"Lan","year":"2019","journal-title":"arXiv preprint arXiv:1909.11942"},{"key":"B28","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.703","article-title":"Bart: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension","author":"Lewis","year":"2019","journal-title":"arXiv preprint arXiv:1910.13461"},{"key":"B29","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2405.03066","article-title":"A scoping review of using large language models (LLMS) to investigate electronic health records (EHRS)","author":"Li","year":"2024","journal-title":"arXiv preprint arXiv"},{"key":"B30","doi-asserted-by":"publisher","first-page":"1106","DOI":"10.1109\/JBHI.2022.3224727","article-title":"Hi-behrt: hierarchical transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records","volume":"27","author":"Li","year":"2022","journal-title":"IEEE J. Biomed. Health Inform"},{"key":"B31","doi-asserted-by":"publisher","first-page":"7155","DOI":"10.1038\/s41598-020-62922-y","article-title":"Behrt: transformer for electronic health records","volume":"10","author":"Li","year":"2020","journal-title":"Sci. Rep"},{"key":"B32","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3560815","article-title":"Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing","volume":"55","author":"Liu","year":"2023","journal-title":"ACM Comput. 
Surv"},{"key":"B33","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1907.11692","article-title":"Roberta: a robustly optimized bert pretraining approach","author":"Liu","year":"2019","journal-title":"arXiv preprint arXiv:1907.11692"},{"key":"B34","first-page":"239","article-title":"\u201cCehr-bert: incorporating temporal information from structured ehr data to improve prediction tasks,\u201d","volume-title":"Machine Learning for Health","author":"Pang","year":"2021"},{"key":"B35","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2108.12409","article-title":"Train short, test long: attention with linear biases enables input length extrapolation","year":"2021","journal-title":"arXiv preprint arXiv:2108.12409"},{"key":"B36","unstructured":"Radford\n              A.\n            \n            \n              Wu\n              J.\n            \n            \n              Child\n              R.\n            \n            \n              Luan\n              D.\n            \n            \n              Amodei\n              D.\n            \n            \n              Sutskever\n              I.\n            \n          \n          35637722\n          Language Models Are Unsupervised Multitask Learners\n          \n          2019"},{"key":"B37","doi-asserted-by":"publisher","first-page":"86","DOI":"10.1038\/s41746-021-00455-y","article-title":"Med-bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction","volume":"4","author":"Rasmy","year":"2021","journal-title":"NPJ Digit. Med"},{"key":"B38","doi-asserted-by":"crossref","first-page":"3503","DOI":"10.1145\/3447548.3467069","article-title":"\u201cRapt: pre-training of time-aware transformer for learning robust healthcare representation,\u201d","volume-title":"Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining","author":"Ren","year":"2021"},{"key":"B39","doi-asserted-by":"publisher","author":"Serrano","year":"2019","DOI":"10.48550\/arXiv.1906.03731"},{"key":"B40","doi-asserted-by":"publisher","first-page":"127063","DOI":"10.1016\/j.neucom.2023.127063","article-title":"Roformer: enhanced transformer with rotary position embedding","volume":"568","author":"Su","year":"2024","journal-title":"Neurocomputing"},{"key":"B41","first-page":"3319","article-title":"\u201cAxiomatic attribution for deep networks,\u201d","volume-title":"International Conference on Machine Learning","author":"Sundararajan","year":"2017"},{"key":"B42","first-page":"10347","article-title":"\u201cTraining data-efficient image transformers and distillation through attention,\u201d","volume-title":"International Conference on Machine Learning","author":"Touvron","year":"2021"},{"key":"B43","doi-asserted-by":"publisher","first-page":"5999","DOI":"10.48550\/arXiv.1706.03762","article-title":"Attention is all you need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst"},{"key":"B44","doi-asserted-by":"publisher","first-page":"135","DOI":"10.1038\/s41746-023-00879-8","article-title":"The shaky foundations of large language models and foundation models for electronic health records","volume":"6","author":"Wornow","year":"2023","journal-title":"NPJ Digit. Med"},{"key":"B45","doi-asserted-by":"publisher","first-page":"13727","DOI":"10.1609\/aaai.v37i11.26608","article-title":"Adversarial self-attention for language understanding","volume":"37","author":"Wu","year":"2023","journal-title":"Proc. AAAI Conf. Artif. 
Intell"},{"key":"B46","doi-asserted-by":"publisher","first-page":"1419","DOI":"10.1093\/jamia\/ocy068","article-title":"Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review","volume":"25","author":"Xiao","year":"2018","journal-title":"J. Am. Med. Inform. Assoc"},{"key":"B47","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.182","article-title":"Hype: better pre-trained language model fine-tuning with hidden representation perturbation","author":"Yuan","year":"2022","journal-title":"arXiv preprint arXiv:2212.08853"},{"key":"B48","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41746-025-01692-1","article-title":"A scoping review of self-supervised representation learning for clinical decision making using ehr categorical data","volume":"8","author":"Yuanyuan","year":"2025","journal-title":"NPJ Digit. Med"},{"key":"B49","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1907.11065","article-title":"Dropattention: a regularization method for fully-connected self-attention networks","author":"Zehui","year":"2019","journal-title":"arXiv preprint arXiv"},{"key":"B50","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1909.11764","article-title":"Freelb: enhanced adversarial training for natural language understanding","author":"Zhu","year":"2019","journal-title":"arXiv preprint arXiv:1909.11764"},{"key":"B51","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3450439.3451855","article-title":"\u201cVariationally regularized graph-based representation learning for electronic health records,\u201d","volume-title":"Proceedings of the Conference on Health, Inference, and Learning","author":"Zhu","year":"2021"}],"container-title":["Frontiers in Artificial Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2025.1663484\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,17]],"date-time":"2025-09-17T05:40:17Z","timestamp":1758087617000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2025.1663484\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,17]]},"references-count":51,"alternative-id":["10.3389\/frai.2025.1663484"],"URL":"https:\/\/doi.org\/10.3389\/frai.2025.1663484","relation":{},"ISSN":["2624-8212"],"issn-type":[{"value":"2624-8212","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,17]]},"article-number":"1663484"}}