{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,1]],"date-time":"2025-10-01T15:36:38Z","timestamp":1759332998948,"version":"3.40.5"},"reference-count":58,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2025,4,7]],"date-time":"2025-04-07T00:00:00Z","timestamp":1743984000000},"content-version":"vor","delay-in-days":96,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,3,19]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Enforcing representation smoothness in pre-trained language models (PLMs) through Jacobian and Hessian regularization provides an effective approach for enhancing both robustness and generalization. Although such regularization methods have proven effective in computer vision, their application in natural language processing, where PLM inputs are derived from a discrete domain, poses unique challenges. We introduce JacHess, a regularization approach for PLMs that minimizes the norms of the Jacobian and Hessian matrices in intermediate representations, using embeddings as substitutes for discrete token inputs. JacHess supports dual-mode regularization, alternating between fine-tuning with labeled data and regularization with unlabeled data. We evaluate JacHess on the GLUE benchmark and demonstrate that it consistently and significantly improves in-distribution generalization and enhances performance under domain shift. Across diverse PLMs, JacHess outperforms comparable representation-based regularization methods and unregularized fine-tuning, while also improving model calibration. Our findings, coupled with a computationally efficient estimator for the Jacobian and Hessian norms, position JacHess as a robust and widely applicable solution for enhancing PLM performance.<\/jats:p>","DOI":"10.1162\/tacl_a_00739","type":"journal-article","created":{"date-parts":[[2025,4,7]],"date-time":"2025-04-07T18:52:20Z","timestamp":1744051940000},"page":"264-280","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":2,"title":["From Robustness to Improved Generalization and Calibration in Pre-trained Language Models"],"prefix":"10.1162","volume":"13","author":[{"given":"Josip","family":"Juki\u0107","sequence":"first","affiliation":[{"name":"TakeLab, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia. josip.jukic@fer.hr"}]},{"given":"Jan","family":"\u0160najder","sequence":"additional","affiliation":[{"name":"TakeLab, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia. 
jan.snajder@fer.hr"}]}],"member":"281","published-online":{"date-parts":[[2025,3,19]]},"reference":[{"key":"2025051914251492200_bib1","doi-asserted-by":"publisher","first-page":"243","DOI":"10.1016\/j.inffus.2021.05.008","article-title":"A review of uncertainty quantification in deep learning: Techniques, applications and challenges","volume":"76","author":"Abdar","year":"2021","journal-title":"Information Fusion"},{"key":"2025051914251492200_bib2","article-title":"Better fine-tuning by reducing representational collapse","volume-title":"International Conference on Learning Representations","author":"Aghajanyan","year":"2021"},{"key":"2025051914251492200_bib3","doi-asserted-by":"publisher","first-page":"7360","DOI":"10.18653\/v1\/2022.acl-long.508","article-title":"Sharpness-aware minimization improves language model generalization","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Bahri","year":"2022"},{"key":"2025051914251492200_bib4","article-title":"Spectrally-normalized margin bounds for neural networks","volume-title":"Advances in Neural Information Processing Systems","author":"Bartlett","year":"2017"},{"key":"2025051914251492200_bib5","first-page":"28811","article-title":"A universal law of robustness via isoperimetry","volume":"34","author":"Bubeck","year":"2021","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025051914251492200_bib6","article-title":"Sobolev training for neural networks","volume":"30","author":"Czarnecki","year":"2017","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"1\u20132","key":"2025051914251492200_bib7","doi-asserted-by":"publisher","first-page":"12","DOI":"10.2307\/2987588","article-title":"The comparison and evaluation of forecasters","volume":"32","author":"DeGroot","year":"1983","journal-title":"Journal of the Royal Statistical Society: Series D (The Statistician)"},{"key":"2025051914251492200_bib8","first-page":"2590","article-title":"Toward better generalization bounds with locally elastic stability","volume-title":"International Conference on Machine Learning","author":"Deng","year":"2021"},{"key":"2025051914251492200_bib9","doi-asserted-by":"publisher","first-page":"295","DOI":"10.18653\/v1\/2020.emnlp-main.21","article-title":"Calibration of pre-trained transformers","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Desai","year":"2020"},{"key":"2025051914251492200_bib10","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2025051914251492200_bib11","first-page":"2333","article-title":"Why neural networks find simple solutions: The many regularizers of geometric complexity","volume":"35","author":"Dherin","year":"2022","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"6","key":"2025051914251492200_bib12","doi-asserted-by":"publisher","first-page":"991","DOI":"10.1109\/72.165600","article-title":"Improving generalization performance using double backpropagation","volume":"3","author":"Drucker","year":"1992","journal-title":"IEEE Transactions on Neural 
Networks"},{"key":"2025051914251492200_bib13","article-title":"Sharpness-aware minimization for efficiently improving generalization","volume-title":"International Conference on Learning Representations","author":"Foret","year":"2021"},{"issue":"11","key":"2025051914251492200_bib14","doi-asserted-by":"publisher","first-page":"139","DOI":"10.1145\/3422622","article-title":"Generative adversarial networks","volume":"63","author":"Goodfellow","year":"2020","journal-title":"Communications of the ACM"},{"key":"2025051914251492200_bib15","doi-asserted-by":"publisher","first-page":"393","DOI":"10.1007\/s10994-020-05929-w","article-title":"Regularisation of neural networks by enforcing Lipschitz continuity","volume":"110","author":"Gouk","year":"2021","journal-title":"Machine Learning"},{"key":"2025051914251492200_bib16","first-page":"1321","article-title":"On calibration of modern neural networks","volume-title":"International Conference on Machine Learning","author":"Guo","year":"2017"},{"key":"2025051914251492200_bib17","doi-asserted-by":"publisher","first-page":"8342","DOI":"10.18653\/v1\/2020.acl-main.740","article-title":"Don\u2019t stop pretraining: Adapt language models to domains and tasks","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Gururangan","year":"2020"},{"key":"2025051914251492200_bib18","article-title":"Robust learning with Jacobian regularization","author":"Hoffman","year":"2019","journal-title":"arXiv preprint arXiv:1908.02729v1"},{"key":"2025051914251492200_bib19","first-page":"2790","article-title":"Parameter-efficient transfer learning for NLP","volume-title":"International Conference on Machine Learning","author":"Houlsby","year":"2019"},{"key":"2025051914251492200_bib20","article-title":"LoRA: Low-rank adaptation of large language models","volume-title":"International Conference on Learning Representations","author":"Edward","year":"2021"},{"issue":"10","key":"2025051914251492200_bib21","doi-asserted-by":"publisher","first-page":"1161","DOI":"10.1038\/s42256-023-00729-y","article-title":"A taxonomy and review of generalization research in NLP","volume":"5","author":"Hupkes","year":"2023","journal-title":"Nature Machine Intelligence"},{"issue":"3","key":"2025051914251492200_bib22","doi-asserted-by":"publisher","first-page":"1059","DOI":"10.1080\/03610918908812806","article-title":"A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines","volume":"18","author":"Hutchinson","year":"1989","journal-title":"Communications in Statistics-Simulation and Computation"},{"key":"2025051914251492200_bib23","doi-asserted-by":"publisher","first-page":"2177","DOI":"10.18653\/v1\/2020.acl-main.197","article-title":"SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Jiang","year":"2020"},{"key":"2025051914251492200_bib24","first-page":"10866","article-title":"Robustness implies generalization via data-dependent generalization bounds","volume-title":"International Conference on Machine Learning","author":"Kawaguchi","year":"2022"},{"key":"2025051914251492200_bib25","article-title":"Some intriguing aspects about Lipschitz continuity of neural networks","volume-title":"International Conference on Learning 
Representations","author":"Khromov","year":"2024"},{"key":"2025051914251492200_bib26","article-title":"Simple and scalable predictive uncertainty estimation using deep ensembles","volume":"30","author":"Lakshminarayanan","year":"2017","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025051914251492200_bib27","article-title":"Lipschitz constant estimation of neural networks via sparse polynomial optimization","volume-title":"International Conference on Learning Representations","author":"Latorre","year":"2020"},{"key":"2025051914251492200_bib28","first-page":"4370","article-title":"Why robust generalization in deep learning is difficult: Perspective of expressive power","volume":"35","author":"Li","year":"2022","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025051914251492200_bib29","doi-asserted-by":"publisher","first-page":"12178","DOI":"10.18653\/v1\/2023.emnlp-main.748","article-title":"PAC-tuning: Fine-tuning pre-trained language models with PAC-driven perturbed gradient descent","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Liu","year":"2023"},{"key":"2025051914251492200_bib30","article-title":"Decoupled weight decay regularization","volume-title":"International Conference on Learning Representations","author":"Loshchilov","year":"2018"},{"key":"2025051914251492200_bib31","first-page":"142","article-title":"Learning word vectors for sentiment analysis","volume-title":"Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies","author":"Maas","year":"2011"},{"key":"2025051914251492200_bib32","article-title":"Input Hessian regularization of neural networks","volume-title":"Workshop on \u201cBeyond first-order methods in ML systems\u201d at the 37th International Conference on Machine Learning","author":"Mustafa","year":"2020"},{"key":"2025051914251492200_bib33","volume-title":"Introductory Lectures on Convex Optimization: A Basic Course","author":"Nesterov","year":"2014","edition":"1"},{"key":"2025051914251492200_bib34","doi-asserted-by":"publisher","first-page":"88","DOI":"10.18653\/v1\/2022.insights-1.12","article-title":"On the impact of data augmentation on downstream performance in natural language processing","volume-title":"Proceedings of the Third Workshop on Insights from Negative Results in NLP","author":"Okimura","year":"2022"},{"issue":"4","key":"2025051914251492200_bib35","doi-asserted-by":"publisher","first-page":"867","DOI":"10.1162\/NECO_a_00928","article-title":"Unifying adversarial training algorithms with data gradient regularization","volume":"29","author":"Ororbia","year":"2017","journal-title":"Neural Computation"},{"key":"2025051914251492200_bib36","first-page":"21","article-title":"A case for new neural network smoothness constraints","volume-title":"Proceedings on \u201cI Can\u2019t Believe It\u2019s Not Better!\u201d at NeurIPS Workshops","author":"Rosca","year":"2020"},{"volume-title":"Principles of Mathematical Analysis","year":"1964","author":"Rudin","key":"2025051914251492200_bib37"},{"key":"2025051914251492200_bib38","article-title":"Adversarially robust generalization requires more data","volume":"31","author":"Schmidt","year":"2018","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025051914251492200_bib39","article-title":"TRAM: Bridging trust regions and sharpness aware minimization","volume-title":"International Conference on Learning 
Representations","author":"Sherborne","year":"2024"},{"issue":"16","key":"2025051914251492200_bib40","doi-asserted-by":"publisher","first-page":"4265","DOI":"10.1109\/TSP.2017.2708039","article-title":"Robust large margin deep neural networks","volume":"65","author":"Sokoli\u0107","year":"2017","journal-title":"IEEE Transactions on Signal Processing"},{"issue":"1","key":"2025051914251492200_bib41","first-page":"1929","article-title":"Dropout: A simple way to prevent neural networks from overfitting","volume":"15","author":"Srivastava","year":"2014","journal-title":"The Journal of Machine Learning Research"},{"key":"2025051914251492200_bib42","article-title":"Llama 2: Open foundation and fine-tuned chat models","author":"Touvron","year":"2023","journal-title":"arXiv preprint arXiv:2307.09288v2"},{"key":"2025051914251492200_bib43","first-page":"9690","article-title":"Uncertainty estimation using a single deep deterministic neural network","volume-title":"International Conference on Machine Learning","author":"Van Amersfoort","year":"2020"},{"key":"2025051914251492200_bib44","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4757-2440-0","volume-title":"The Nature of Statistical Learning Theory","author":"Vapnik","year":"1995"},{"key":"2025051914251492200_bib45","doi-asserted-by":"publisher","first-page":"31","DOI":"10.4467\/20838476SI.18.003.10408","article-title":"Gradient regularization improves accuracy of discriminative models","volume":"27","author":"Varga","year":"2018","journal-title":"Schedae Informaticae"},{"key":"2025051914251492200_bib46","article-title":"Attention is all you need","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani","year":"2017"},{"key":"2025051914251492200_bib47","article-title":"Lipschitz regularity of deep neural networks: Analysis and efficient estimation","volume":"31","author":"Virmaux","year":"2018","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025051914251492200_bib48","doi-asserted-by":"publisher","first-page":"353","DOI":"10.18653\/v1\/W18-5446","article-title":"GLUE: A multi-task benchmark and analysis platform for natural language understanding","volume-title":"Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP","author":"Wang","year":"2018"},{"key":"2025051914251492200_bib49","first-page":"36093","article-title":"Direct parameterization of Lipschitz-bounded deep networks","volume-title":"International Conference on Machine Learning","author":"Wang","year":"2023"},{"key":"2025051914251492200_bib50","first-page":"871","article-title":"Text smoothing: Enhance various data augmentation methods on text classification tasks","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Xing","year":"2022"},{"key":"2025051914251492200_bib51","doi-asserted-by":"publisher","first-page":"7273","DOI":"10.18653\/v1\/2022.findings-emnlp.538","article-title":"Uncertainty quantification with pre-trained language models: A large-scale empirical analysis","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2022","author":"Xiao","year":"2022"},{"key":"2025051914251492200_bib52","doi-asserted-by":"publisher","first-page":"7322","DOI":"10.1609\/aaai.v33i01.33017322","article-title":"Quantifying uncertainties in natural language processing tasks","volume-title":"Proceedings of the AAAI Conference on Artificial 
Intelligence","author":"Xiao","year":"2019"},{"key":"2025051914251492200_bib53","doi-asserted-by":"publisher","first-page":"391","DOI":"10.1007\/s10994-011-5268-1","article-title":"Robustness and generalization","volume":"86","author":"Huan","year":"2012","journal-title":"Machine Learning"},{"key":"2025051914251492200_bib54","doi-asserted-by":"publisher","first-page":"1063","DOI":"10.18653\/v1\/2021.naacl-main.84","article-title":"Fine-tuning pre-trained language model with weak supervision: A contrastive-regularized self-training approach","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Yue","year":"2021"},{"issue":"3","key":"2025051914251492200_bib55","doi-asserted-by":"publisher","first-page":"107","DOI":"10.1145\/3446776","article-title":"Understanding deep learning (still) requires rethinking generalization","volume":"64","author":"Zhang","year":"2021","journal-title":"Communications of the ACM"},{"key":"2025051914251492200_bib56","article-title":"OPT: Open pre-trained transformer language models","author":"Zhang","year":"2022","journal-title":"arXiv preprint arXiv:2205.01068v4"},{"key":"2025051914251492200_bib57","doi-asserted-by":"publisher","first-page":"8646","DOI":"10.18653\/v1\/2022.acl-long.592","article-title":"FlipDA: Effective and robust data augmentation for few-shot learning","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Zhou","year":"2022"},{"key":"2025051914251492200_bib58","article-title":"FreeLB: Enhanced adversarial training for natural language understanding","volume-title":"International Conference on Learning Representations","author":"Zhu","year":"2020"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00739\/2511922\/tacl_a_00739.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00739\/2511922\/tacl_a_00739.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,19]],"date-time":"2025-05-19T18:25:29Z","timestamp":1747679129000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00739\/128711\/From-Robustness-to-Improved-Generalization-and"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025]]},"references-count":58,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00739","relation":{},"ISSN":["2307-387X"],"issn-type":[{"type":"electronic","value":"2307-387X"}],"subject":[],"published-other":{"date-parts":[[2025]]},"published":{"date-parts":[[2025]]}}}