{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,16]],"date-time":"2026-07-16T17:49:57Z","timestamp":1784224197244,"version":"3.55.0"},"reference-count":36,"publisher":"Cambridge University Press (CUP)","issue":"2","license":[{"start":{"date-parts":[[2022,2,9]],"date-time":"2022-02-09T00:00:00Z","timestamp":1644364800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2023,3]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In training deep learning networks, the optimizer and related learning rate are often used without much thought or with minimal tuning, even though it is crucial in ensuring a fast convergence to a good quality minimum of the loss function that can also generalize well on the test dataset. Drawing inspiration from the successful application of cyclical learning rate policy to computer vision tasks, we explore how cyclical learning rate can be applied to train transformer-based neural networks for neural machine translation. From our carefully designed experiments, we show that the choice of optimizers and the associated cyclical learning rate policy can have a significant impact on the performance. 
In addition, we establish guidelines when applying cyclical learning rates to neural machine translation tasks.<\/jats:p>","DOI":"10.1017\/s135132492200002x","type":"journal-article","created":{"date-parts":[[2022,2,9]],"date-time":"2022-02-09T08:58:37Z","timestamp":1644397117000},"page":"316-336","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":8,"title":["An empirical study of cyclical learning rate on neural machine translation"],"prefix":"10.1017","volume":"29","author":[{"given":"Weixuan","family":"Wang","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Choon Meng","family":"Lee","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jianfeng","family":"Liu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Talha","family":"Colakoglu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Wei","family":"Peng","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"56","published-online":{"date-parts":[[2022,2,9]]},"reference":[{"key":"S135132492200002X_ref12","doi-asserted-by":"crossref","unstructured":"Hoang, C.D.V. , Haffari, G. and Cohn, T. (2017). Decoding as continuous optimization in neural machine translation. arXiv preprint arXiv:1701.02854.","DOI":"10.18653\/v1\/D17-1014"},{"key":"S135132492200002X_ref31","doi-asserted-by":"publisher","DOI":"10.1109\/WACV.2017.58"},{"key":"S135132492200002X_ref30","unstructured":"Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs\/1409.1556."},{"key":"S135132492200002X_ref4","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-7908-2604-3_16"},{"key":"S135132492200002X_ref8","unstructured":"Devlin, J. , Chang, M.-W. , Lee, K. and Toutanova, K. (2019). 
Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2\u20137, 2019, Volume 1 (Long and Short Papers), pp. 4171\u20134186."},{"key":"S135132492200002X_ref14","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"S135132492200002X_ref1","unstructured":"Aarts, E.H.L. and Korst, J.H.M. (2003). Simulated annealing and boltzmann machines. In Michael A. Arbib (ed), Handbook of Brain Theory and Neural Networks (2nd ed). Cambridge, Massachusetts: MIT Press, pp. 1039\u20131044."},{"key":"S135132492200002X_ref3","doi-asserted-by":"publisher","DOI":"10.1145\/1553374.1553380"},{"key":"S135132492200002X_ref17","unstructured":"Kingma, D.P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs\/1412.6980."},{"key":"S135132492200002X_ref23","unstructured":"Luo, L. , Xiong, Y. , Liu, Y. and Sun, X. (2019). Adaptive gradient methods with dynamic bound of learning rate. ArXiv, abs\/1902.09843."},{"key":"S135132492200002X_ref9","unstructured":"Dinh, L. , Pascanu, R. , Bengio, S. and Bengio, Y. (2017). Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017, Volume 70, pp. 1019\u20131028."},{"key":"S135132492200002X_ref16","unstructured":"Keskar, N.S. , Mudigere, D. , Nocedal, J. , Smelyanskiy, M. and Tang, P.T.P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. ArXiv, abs\/1609.04836."},{"key":"S135132492200002X_ref32","unstructured":"Smith, L.N. and Topin, N. (2017). Super-convergence: Very fast training of residual networks using large learning rates. ArXiv, abs\/1708.07120."},{"key":"S135132492200002X_ref34","unstructured":"Vaswani, A. , Shazeer, N. , Parmar, N. , Uszkoreit, J. , Jones, L. 
, Gomez, A.N. , Kaiser, L.U. and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., pp. 5998\u20136008."},{"key":"S135132492200002X_ref25","doi-asserted-by":"crossref","unstructured":"Ott, M. , Edunov, S. , Baevski, A. , Fan, A. , Gross, S. , Ng, N. , Grangier, D. and Auli, M. (2019). fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2\u20137, 2019, Demonstrations, pp. 48\u201353.","DOI":"10.18653\/v1\/N19-4009"},{"key":"S135132492200002X_ref26","doi-asserted-by":"publisher","DOI":"10.2478\/pralin-2018-0002"},{"key":"S135132492200002X_ref27","unstructured":"Reddi, S.J. , Kale, S. and Kumar, S. (2018). On the convergence of adam and beyond. ArXiv, abs\/1904.09237."},{"key":"S135132492200002X_ref15","first-page":"2261","article-title":"Densely connected convolutional networks","author":"Huang","year":"2016","journal-title":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)"},{"key":"S135132492200002X_ref24","unstructured":"McCandlish, S. , Kaplan, J. , Amodei, D. and Team, O.D. (2018). An empirical model of large-batch training. ArXiv, abs\/1812.06162."},{"key":"S135132492200002X_ref29","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1162"},{"key":"S135132492200002X_ref33","unstructured":"Smith, S.L. , Kindermans, P. , Ying, C. and Le, Q.V. (2018). Don\u2019t decay the learning rate, increase the batch size. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30\u2013May 3, 2018, Conference Track Proceedings. 
OpenReview.net."},{"key":"S135132492200002X_ref35","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1137"},{"key":"S135132492200002X_ref10","first-page":"2121","article-title":"Adaptive subgradient methods for online learning and stochastic optimization","volume":"12","author":"Duchi","year":"2010","journal-title":"Journal of Machine Learning Research"},{"key":"S135132492200002X_ref21","unstructured":"Liu, L. , Jiang, H. , He, P. , Chen, W. , Liu, X. , Gao, J. and Han, J. (2019). On the variance of the adaptive learning rate and beyond. ArXiv, abs\/1908.03265."},{"key":"S135132492200002X_ref11","first-page":"770","article-title":"Deep residual learning for image recognition","author":"He","year":"2015","journal-title":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)"},{"key":"S135132492200002X_ref28","unstructured":"Ruder, S. (2016). An overview of gradient descent optimization algorithms. ArXiv, abs\/1609.04747."},{"key":"S135132492200002X_ref7","unstructured":"Dauphin, Y. , de Vries, H. and Bengio, Y. (2015). Equilibrated adaptive learning rates for non-convex optimization. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7\u201312, 2015, Montreal, Quebec, Canada, pp. 1504\u20131512."},{"key":"S135132492200002X_ref19","doi-asserted-by":"publisher","DOI":"10.1145\/3065386"},{"key":"S135132492200002X_ref5","unstructured":"Cettolo, M. , Girardi, C. and Federico, M. (2012). Wit $^3$ : Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), Trento, Italy, pp. 261\u2013268."},{"key":"S135132492200002X_ref6","unstructured":"Chung, J. , G\u00fcl\u00e7ehre, \u00c7. , Cho, K. and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. ArXiv, abs\/1412.3555."},{"key":"S135132492200002X_ref22","unstructured":"Loshchilov, I. 
and Hutter, F. (2016). Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983."},{"key":"S135132492200002X_ref36","unstructured":"Zhang, M.R. , Lucas, J. , Hinton, G.E. and Ba, J. (2019). Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8\u201314 December 2019, Vancouver, BC, Canada, pp. 9593\u20139604."},{"key":"S135132492200002X_ref20","unstructured":"Li, H. , Xu, Z. , Taylor, G. and Goldstein, T. (2018). Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3\u20138 December 2018, Montr\u00e9al, Canada, pp. 6391\u20136401."},{"key":"S135132492200002X_ref13","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.1.1"},{"key":"S135132492200002X_ref18","doi-asserted-by":"publisher","DOI":"10.3115\/1557769.1557821"},{"key":"S135132492200002X_ref2","unstructured":"Bahdanau, D. , Cho, K. and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Bengio Y. and LeCun Y. 
(eds), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7\u20139, 2015, Conference Track Proceedings."}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S135132492200002X","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,3,13]],"date-time":"2023-03-13T04:19:51Z","timestamp":1678681191000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S135132492200002X\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,2,9]]},"references-count":36,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,3]]}},"alternative-id":["S135132492200002X"],"URL":"https:\/\/doi.org\/10.1017\/s135132492200002x","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,2,9]]},"assertion":[{"value":"\u00a9 The Author(s), 2022. Published by Cambridge University Press","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}}]}}