{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,21]],"date-time":"2026-04-21T22:46:15Z","timestamp":1776811575886,"version":"3.51.2"},"reference-count":19,"publisher":"European Society of Computational Methods in Sciences and Engineering","issue":"6","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["JCM"],"published-print":{"date-parts":[[2022,12,19]]},"abstract":"<jats:p>Recently, AWD-LSTM (ASGD Weight-Dropped LSTM) has achieved good results in language modeling, and many AWD-LSTM-based models have obtained state-of-the-art perplexities. However, large-scale neural language models have been shown to be prone to overfitting. In the original AWD-LSTM paper, the authors adopted a retraining stage, called fine-tuning, to get a better result. In this paper, we present a simple yet effective parameter rollback mechanism for neural language models. We introduce parameter rollback averaged stochastic gradient descent (PR-ASGD), wherein the parameter \u201cstep\u201d in ASGD decreases with a certain probability. 
Using this strategy, we achieve better word-level perplexities on Penn Treebank: 56.26 with the AWD-LSTM model and 53.57 with the AWD-LSTM-MoS (AWD-LSTM Mixture of Softmaxes) model.<\/jats:p>","DOI":"10.3233\/jcm-226215","type":"journal-article","created":{"date-parts":[[2022,8,23]],"date-time":"2022-08-23T11:27:16Z","timestamp":1661254036000},"page":"2375-2385","source":"Crossref","is-referenced-by-count":1,"title":["Parameter rollback averaged stochastic gradient descent for language model"],"prefix":"10.66113","volume":"22","author":[{"given":"Zhao","family":"Cheng","sequence":"first","affiliation":[{"name":"School of Computing Science and Artificial Intelligence, Changzhou University, Changzhou, Jiangsu, China"},{"name":"School of Computer and Computing Science, Zhejiang University City College, Hangzhou, Zhejiang, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Guanlin","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Computing Science and Artificial Intelligence, Changzhou University, Changzhou, Jiangsu, China"},{"name":"School of Computer and Computing Science, Zhejiang University City College, Hangzhou, Zhejiang, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wenyong","family":"Weng","sequence":"additional","affiliation":[{"name":"School of Computer and Computing Science, Zhejiang University City College, Hangzhou, Zhejiang, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Qi","family":"Lu","sequence":"additional","affiliation":[{"name":"China National Air Separation Engineering Co., Ltd, Hangzhou, Zhejiang, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wujian","family":"Yang","sequence":"additional","affiliation":[{"name":"School of Computer and Computing Science, Zhejiang University City College, Hangzhou, Zhejiang, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"55691","reference":[{"key":"10.3233\/JCM-226215_ref1","first-page":"1137","article-title":"A neural probabilistic language model","volume":"3","author":"Bengio","year":"2003","journal-title":"Journal of Machine Learning Research."},{"key":"10.3233\/JCM-226215_ref2","doi-asserted-by":"crossref","unstructured":"Mnih A, Hinton GE. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning. 2007. pp. 641-648.","DOI":"10.1145\/1273496.1273577"},{"key":"10.3233\/JCM-226215_ref3","doi-asserted-by":"crossref","unstructured":"Kombrink S, Mikolov T, Karafi\u00e1t M, Burget L. Recurrent neural network based language modeling in meeting recognition. In Twelfth Annual Conference of the International Speech Communication Association. 2011.","DOI":"10.21437\/Interspeech.2011-720"},{"key":"10.3233\/JCM-226215_ref6","first-page":"02182","article-title":"Regularizing and optimizing lstm language models","volume":"1708","author":"Merity","year":"2017","journal-title":"arXiv preprint arXiv."},{"key":"10.3233\/JCM-226215_ref7","doi-asserted-by":"publisher","first-page":"03953","DOI":"10.48550\/arXiv.1711.03953","article-title":"Breaking the softmax bottleneck: A high-rank rnn language model","volume":"1711","author":"Zhilin","year":"2017","journal-title":"arXiv preprint arXiv."},{"key":"10.3233\/JCM-226215_ref8","first-page":"03805","article-title":"Improving neural language modeling via adversarial training","volume":"1906","author":"Dilin","year":"2019","journal-title":"arXiv preprint arXiv."},{"issue":"4","key":"10.3233\/JCM-226215_ref9","doi-asserted-by":"crossref","first-page":"838","DOI":"10.1137\/0330046","article-title":"Acceleration of stochastic approximation by averaging","volume":"30","author":"Polyak","year":"1992","journal-title":"SIAM Journal on Control and 
Optimization."},{"issue":"8","key":"10.3233\/JCM-226215_ref10","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Computation."},{"key":"10.3233\/JCM-226215_ref11","doi-asserted-by":"publisher","first-page":"3555","DOI":"10.48550\/arXiv.1412.3555","article-title":"Empirical evaluation of gated recurrent neural networks on sequence modeling","volume":"1412","author":"Junyoung","year":"2014","journal-title":"arXiv preprint arXiv."},{"issue":"2","key":"10.3233\/JCM-226215_ref12","doi-asserted-by":"crossref","first-page":"223","DOI":"10.1137\/16M1080173","article-title":"Optimization methods for large-scale machine learning","volume":"60","author":"Bottou","year":"2018","journal-title":"Siam Review."},{"key":"10.3233\/JCM-226215_ref13","unstructured":"Hardt M, Recht B, Singer Y. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning. 2016. pp. 1225-1234."},{"key":"10.3233\/JCM-226215_ref14","unstructured":"Sutskever I, Martens J, Dahl G, Hinton G. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning. 2013. pp. 
1139-1147."},{"key":"10.3233\/JCM-226215_ref15","doi-asserted-by":"publisher","first-page":"6980","DOI":"10.48550\/arXiv.1412.6980","article-title":"A method for stochastic optimization","volume":"1412","author":"Kingma","year":"2014","journal-title":"arXiv."},{"issue":"7","key":"10.3233\/JCM-226215_ref16","first-page":"2121","article-title":"Adaptive subgradient methods for online learning and stochastic optimization","volume":"12","author":"Duchi","year":"2011","journal-title":"Journal of Machine Learning Research."},{"issue":"2","key":"10.3233\/JCM-226215_ref17","first-page":"26","article-title":"Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude","volume":"4","author":"Tieleman","year":"2012","journal-title":"COURSERA: Neural Networks for Machine Learning."},{"issue":"1","key":"10.3233\/JCM-226215_ref18","doi-asserted-by":"publisher","first-page":"4873","DOI":"10.48550\/arXiv.1704.04289","article-title":"Stochastic gradient descent as approximate bayesian inference","volume":"18","author":"Mandt","year":"2017","journal-title":"The Journal of Machine Learning Research."},{"key":"10.3233\/JCM-226215_ref20","doi-asserted-by":"publisher","first-page":"04351","DOI":"10.48550\/arXiv.2001.04351","article-title":"Cluener2020: Fine-grained name entity recognition for Chinese","volume":"2001","author":"Xu","year":"2020","journal-title":"arXiv preprint arXiv."},{"key":"10.3233\/JCM-226215_ref22","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-2025"},{"issue":"3","key":"10.3233\/JCM-226215_ref23","doi-asserted-by":"publisher","first-page":"137","DOI":"10.1504\/IJSNET.2020.111233","article-title":"Automated realtime anomaly detection of temperature sensors through machine-learning","volume":"34","author":"Nayak","year":"2020","journal-title":"International Journal of Sensor Networks."}],"container-title":["Journal of Computational Methods in Sciences and 
Engineering"],"original-title":[],"link":[{"URL":"https:\/\/content.iospress.com\/download?id=10.3233\/JCM-226215","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,21]],"date-time":"2026-04-21T22:06:36Z","timestamp":1776809196000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/full\/10.3233\/JCM-226215"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,19]]},"references-count":19,"journal-issue":{"issue":"6"},"URL":"https:\/\/doi.org\/10.3233\/jcm-226215","relation":{},"ISSN":["1472-7978","1875-8983"],"issn-type":[{"value":"1472-7978","type":"print"},{"value":"1875-8983","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,12,19]]}}}