{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,2]],"date-time":"2025-11-02T05:59:33Z","timestamp":1762063173951,"version":"build-2065373602"},"reference-count":35,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2022,8,9]],"date-time":"2022-08-09T00:00:00Z","timestamp":1660003200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"the Deutsche Forschungsgemeinschaft (DFG)","award":["318763901\u2014SFB1294","01IS18025A","01IS18037A"],"award-info":[{"award-number":["318763901\u2014SFB1294","01IS18025A","01IS18037A"]}]},{"name":"the BIFOLD-Berlin Institute for the Foundations of Learning and Data","award":["318763901\u2014SFB1294","01IS18025A","01IS18037A"],"award-info":[{"award-number":["318763901\u2014SFB1294","01IS18025A","01IS18037A"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>In this paper, we propose to leverage the Bayesian uncertainty information encoded in parameter distributions to inform the learning procedure for Bayesian models. We derive a first principle stochastic differential equation for the training dynamics of the mean and uncertainty parameter in the variational distributions. On the basis of the derived Bayesian stochastic differential equation, we apply the methodology of stochastic optimal control on the variational parameters to obtain individually controlled learning rates. We show that the resulting optimizer, StochControlSGD, is significantly more robust to large learning rates and can adaptively and individually control the learning rates of the variational parameters. 
The evolution of the control suggests separate and distinct dynamical behaviours in the training regimes for the mean and uncertainty parameters in Bayesian neural networks.<\/jats:p>","DOI":"10.3390\/e24081097","type":"journal-article","created":{"date-parts":[[2022,8,10]],"date-time":"2022-08-10T02:42:53Z","timestamp":1660099373000},"page":"1097","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Stochastic Control for Bayesian Neural Network Training"],"prefix":"10.3390","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1354-4715","authenticated-orcid":false,"given":"Ludwig","family":"Winkler","sequence":"first","affiliation":[{"name":"Machine Learning Group, Technische Universit\u00e4t Berlin, 10623 Berlin, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4042-8920","authenticated-orcid":false,"given":"C\u00e9sar","family":"Ojeda","sequence":"additional","affiliation":[{"name":"Artificial Intelligence Group, Technische Universit\u00e4t Berlin, 10623 Berlin, Germany"}]},{"given":"Manfred","family":"Opper","sequence":"additional","affiliation":[{"name":"Artificial Intelligence Group, Technische Universit\u00e4t Berlin, 10623 Berlin, Germany"},{"name":"Centre for Systems Modelling and Quantitative Biomedicine, University of Birmingham, Birmingham B15 2TT, UK"}]}],"member":"1968","published-online":{"date-parts":[[2022,8,9]]},"reference":[{"key":"ref_1","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012, January 3\u20136). Imagenet classification with deep convolutional neural networks. Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA."},{"key":"ref_2","unstructured":"Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., and Coates, A. (2014). Deep speech: Scaling up end-to-end speech recognition. 
arXiv."},{"key":"ref_3","unstructured":"Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., and Askell, A. (2020). Language models are few-shot learners. arXiv."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1020281327116","article-title":"An introduction to MCMC for machine learning","volume":"50","author":"Andrieu","year":"2003","journal-title":"Mach. Learn."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Wainwright, M.J., and Jordan, M.I. (2008). Graphical Models, Exponential Families, and Variational Inference, Now Publishers Inc.","DOI":"10.1561\/9781601981851"},{"key":"ref_6","first-page":"1303","article-title":"Stochastic variational inference","volume":"14","author":"Hoffman","year":"2013","journal-title":"J. Mach. Learn. Res."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Bottou, L. (2010, January 22\u201327). Large-scale machine learning with stochastic gradient descent. Proceedings of the COMPSTAT\u20192010: 19th International Conference on Computational Statistics, Paris, France.","DOI":"10.1007\/978-3-7908-2604-3_16"},{"key":"ref_8","unstructured":"Liu, G.H., and Theodorou, E.A. (2019). Deep learning theory review: An optimal control and dynamical systems perspective. arXiv."},{"key":"ref_9","unstructured":"Orvieto, A., Kohler, J., and Lucchi, A. (2020, January 3\u20136). The role of memory in stochastic optimization. Proceedings of the Uncertainty in Artificial Intelligence (PMLR), Virtual."},{"key":"ref_10","unstructured":"Mandt, S., Hoffman, M.D., and Blei, D.M. (2017). Stochastic gradient descent as approximate Bayesian inference. arXiv."},{"key":"ref_11","unstructured":"Yaida, S. (2018). Fluctuation-dissipation relations for stochastic gradient descent. arXiv."},{"key":"ref_12","unstructured":"Oksendal, B. (2013). 
Stochastic Differential Equations: An Introduction with Applications, Springer Science & Business Media."},{"key":"ref_13","unstructured":"Depeweg, S., Hernandez-Lobato, J.M., Doshi-Velez, F., and Udluft, S. (2018, January 10\u201315). Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. Proceedings of the International Conference on Machine Learning (PMLR), Stockholm, Sweden."},{"key":"ref_14","unstructured":"Kingma, D.P., and Ba, J. (2015, January 7\u20139). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA."},{"key":"ref_15","unstructured":"Li, Q., Tai, C., and Weinan, E. (2017, January 6\u201311). Stochastic modified equations and adaptive stochastic gradient algorithms. Proceedings of the International Conference on Machine Learning (PMLR), Sydney, Australia."},{"key":"ref_16","unstructured":"Stengel, R.F. (1994). Optimal Control and Estimation, Courier Corporation."},{"key":"ref_17","unstructured":"LeCun, Y. (2022, March 04). The MNIST Database of Handwritten Digits. Available online: http:\/\/yann.lecun.com\/exdb\/mnist\/."},{"key":"ref_18","unstructured":"Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. arXiv."},{"key":"ref_19","unstructured":"Krizhevsky, A., and Hinton, G. (2022, March 04). Convolutional Deep Belief Networks on Cifar-10. Available online: https:\/\/www.cs.toronto.edu\/~kriz\/conv-cifar10-aug2010.pdf."},{"key":"ref_20","unstructured":"Wenzel, F., Roth, K., Veeling, B.S., Swiatkowski, J., Tran, L., Mandt, S., Snoek, J., Salimans, T., Jenatton, R., and Nowozin, S. (2020). How good is the Bayes posterior in deep neural networks really? arXiv."},{"key":"ref_21","unstructured":"Ioffe, S., and Szegedy, C. (2015, January 7\u20139). Batch normalization: Accelerating deep network training by reducing internal covariate shift. 
Proceedings of the International Conference on Machine Learning (PMLR), Lille, France."},{"key":"ref_22","unstructured":"Frankle, J., and Carbin, M. (2018). The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv."},{"key":"ref_23","unstructured":"Smith, S.L., Kindermans, P.J., Ying, C., and Le, Q.V. (2017). Don\u2019t decay the learning rate, increase the batch size. arXiv."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"743","DOI":"10.1016\/S0893-6080(02)00060-6","article-title":"On-line learning in changing environments with applications in supervised and unsupervised learning","volume":"15","author":"Murata","year":"2002","journal-title":"Neural Netw."},{"key":"ref_25","unstructured":"Murata, N., M\u00fcller, K.R., Ziehe, A., and Amari, S.-I. (1997, January 1\u20136). Adaptive on-line learning in changing environments. Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA."},{"key":"ref_26","unstructured":"Loshchilov, I., and Hutter, F. (2016). SGDR: Stochastic gradient descent with warm restarts. arXiv."},{"key":"ref_27","unstructured":"Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015, January 7\u20139). Weight uncertainty in neural network. Proceedings of the International Conference on Machine Learning (PMLR), Lille, France."},{"key":"ref_28","unstructured":"Gal, Y., and Ghahramani, Z. (2016, January 20\u201322). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. Proceedings of the International Conference on Machine Learning (PMLR), New York, NY, USA."},{"key":"ref_29","first-page":"1929","article-title":"Dropout: A simple way to prevent neural networks from overfitting","volume":"15","author":"Srivastava","year":"2014","journal-title":"J. Mach. Learn. Res."},{"key":"ref_30","unstructured":"Kingma, D.P., Salimans, T., and Welling, M. (2015). Variational dropout and the local reparameterization trick. 
arXiv."},{"key":"ref_31","unstructured":"Ranganath, R., Gerrish, S., and Blei, D. (2014, January 22\u201324). Black box variational inference. Proceedings of the Artificial Intelligence and Statistics (PMLR), Beijing, China."},{"key":"ref_32","first-page":"1","article-title":"Automatic differentiation in machine learning: A survey","volume":"18","author":"Baydin","year":"2018","journal-title":"J. Mach. Learn. Res."},{"key":"ref_33","first-page":"430","article-title":"Automatic differentiation variational inference","volume":"18","author":"Kucukelbir","year":"2017","journal-title":"J. Mach. Learn. Res."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"400","DOI":"10.1214\/aoms\/1177729586","article-title":"A stochastic approximation method","volume":"22","author":"Robbins","year":"1951","journal-title":"Ann. Math. Stat."},{"key":"ref_35","unstructured":"Welling, M., and Teh, Y.W. (2011, June 28\u2013July 2). Bayesian learning via stochastic gradient Langevin dynamics. Proceedings of the 28th International Conference on Machine Learning (ICML-11), Bellevue, WA, USA."}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/24\/8\/1097\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T00:06:20Z","timestamp":1760141180000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/24\/8\/1097"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,8,9]]},"references-count":35,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2022,8]]}},"alternative-id":["e24081097"],"URL":"https:\/\/doi.org\/10.3390\/e24081097","relation":{},"ISSN":["1099-4300"],"issn-type":[{"type":"electronic","value":"1099-4300"}],"subject":[],"published":{"date-parts":[[2022,8,9]]}}}