{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,25]],"date-time":"2025-11-25T06:59:00Z","timestamp":1764053940369,"version":"3.33.0"},"reference-count":42,"publisher":"MIT Press","issue":"2","content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,1,21]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Significant progress has been made recently in understanding the generalization of neural networks (NNs) trained by gradient descent (GD) using the algorithmic stability approach. However, most of the existing research has focused on one-hidden-layer NNs and has not addressed the impact of different network scaling. Here, network scaling corresponds to the normalization of the layers. In this article, we greatly extend the previous work (Lei et al., 2022; Richards &amp; Kuzborskij, 2021) by conducting a comprehensive stability and generalization analysis of GD for two-layer and three-layer NNs. For two-layer NNs, our results are established under general network scaling, relaxing previous conditions. In the case of three-layer NNs, our technical contribution lies in demonstrating its nearly co-coercive property by utilizing a novel induction strategy that thoroughly explores the effects of overparameterization. As a direct application of our general findings, we derive the excess risk rate of O(1\/√n) for GD in both two-layer and three-layer NNs. This sheds light on sufficient or necessary conditions for underparameterized and overparameterized NNs trained by GD to attain the desired risk rate of O(1\/√n). Moreover, we demonstrate that as the scaling factor increases or the network complexity decreases, less overparameterization is required for GD to achieve the desired error rates. 
Additionally, under a low-noise condition, we obtain a fast risk rate of O(1\/n) for GD in both two-layer and three-layer NNs.<\/jats:p>","DOI":"10.1162\/neco_a_01725","type":"journal-article","created":{"date-parts":[[2024,11,18]],"date-time":"2024-11-18T17:58:32Z","timestamp":1731952712000},"page":"344-402","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":2,"title":["Generalization Guarantees of Gradient Descent for Shallow Neural Networks"],"prefix":"10.1162","volume":"37","author":[{"given":"Puyu","family":"Wang","sequence":"first","affiliation":[{"name":"Hong Kong Baptist University, Hong Kong wangpuyu1026@gmail.com"}]},{"given":"Yunwen","family":"Lei","sequence":"additional","affiliation":[{"name":"University of Hong Kong, Hong Kong leiyw@hku.hk"}]},{"given":"Di","family":"Wang","sequence":"additional","affiliation":[{"name":"King Abdullah University of Science and Technology 23955, Saudi Arabia di.wang@kaust.edu.sa"}]},{"given":"Yiming","family":"Ying","sequence":"additional","affiliation":[{"name":"University of Sydney, Camperdown, NSW 2050, Australia yiming.ying@sydney.edu.au"}]},{"given":"Ding-Xuan","family":"Zhou","sequence":"additional","affiliation":[{"name":"University of Sydney, Camperdown, NSW 2050, Australia dingxuan.zhou@sydney.edu.au"}]}],"member":"281","published-online":{"date-parts":[[2025,1,21]]},"reference":[{"key":"2025012818241866000_bib1","article-title":"Learning and generalization in overparameterized neural networks, going beyond two layers","volume-title":"Advances in neural information processing systems","author":"Allen-Zhu","year":"2019"},{"key":"2025012818241866000_bib2","first-page":"242","article-title":"A convergence theory for deep learning via over-parameterization","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Allen-Zhu","year":"2019"},{"key":"2025012818241866000_bib3","article-title":"On exact 
computation with an infinitely wide neural net","volume-title":"Advances in neural information processing systems","author":"Arora","year":"2019"},{"key":"2025012818241866000_bib4","first-page":"322","article-title":"Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Arora","year":"2019"},{"journal-title":"Neural machine translation by jointly learning to align and translate.","year":"2014","author":"Bahdanau","key":"2025012818241866000_bib5"},{"key":"2025012818241866000_bib6","article-title":"Spectrally-normalized margin bounds for neural networks","volume-title":"Advances in neural information processing systems","author":"Bartlett","year":"2017"},{"key":"2025012818241866000_bib7","doi-asserted-by":"publisher","first-page":"87","DOI":"10.1017\/S0962492921000027","article-title":"Deep learning: A statistical viewpoint","volume":"30","author":"Bartlett","year":"2021","journal-title":"Acta Numerica"},{"key":"2025012818241866000_bib8","first-page":"499","article-title":"Stability and generalization","volume":"2","author":"Bousquet","year":"2002","journal-title":"Journal of Machine Learning Research"},{"key":"2025012818241866000_bib9","article-title":"SGD learns over-parameterized networks that provably generalize on linearly separable data","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Brutzkus","year":"2018"},{"key":"2025012818241866000_bib10","article-title":"Generalization bounds of stochastic gradient descent for wide and deep neural networks","volume-title":"Advances in neural information processing systems","author":"Cao","year":"2019"},{"key":"2025012818241866000_bib11","first-page":"745","article-title":"Stability and generalization of learning algorithms that converge to global optima","volume-title":"Proceedings of the International Conference on Machine 
Learning","author":"Charles","year":"2018"},{"key":"2025012818241866000_bib12","article-title":"On the global convergence of gradient descent for over-parameterized models using optimal transport","volume-title":"Advances in neural information processing systems","author":"Chizat","year":"2018"},{"key":"2025012818241866000_bib13","article-title":"On lazy training in differentiable programming","volume-title":"Advances in neural information processing systems","author":"Chizat","year":"2019"},{"key":"2025012818241866000_bib14","first-page":"1675","article-title":"Gradient descent finds global minima of deep neural networks","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Du","year":"2019"},{"key":"2025012818241866000_bib15","article-title":"Gradient descent provably optimizes over-parameterized neural networks","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Du","year":"2018"},{"key":"2025012818241866000_bib16","first-page":"297","article-title":"Size-independent sample complexity of neural networks","volume-title":"Proceedings of the Conference on Learning Theory","author":"Golowich","year":"2018"},{"key":"2025012818241866000_bib17","first-page":"1225","article-title":"Train faster, generalize better: Stability of stochastic gradient descent","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Hardt","year":"2016"},{"issue":"6","key":"2025012818241866000_bib18","doi-asserted-by":"publisher","first-page":"82","DOI":"10.1109\/MSP.2012.2205597","article-title":"Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups","volume":"29","author":"Hinton","year":"2012","journal-title":"IEEE Signal Processing Magazine"},{"key":"2025012818241866000_bib19","article-title":"Neural tangent kernel: Convergence and generalization in neural networks","volume-title":"Advances in neural information 
processing systems","author":"Jacot","year":"2018"},{"key":"2025012818241866000_bib20","first-page":"26135","article-title":"On the generalization power of the overfitted three-layer neural tangent kernel model","volume-title":"Advances in neural information processing systems","author":"Ju","year":"2022"},{"issue":"6","key":"2025012818241866000_bib21","doi-asserted-by":"publisher","first-page":"84","DOI":"10.1145\/3065386","article-title":"ImageNet classification with deep convolutional neural networks","volume":"60","author":"Krizhevsky","year":"2017","journal-title":"Communications of the ACM"},{"key":"2025012818241866000_bib22","first-page":"2820","article-title":"Data-dependent stability of stochastic gradient descent","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Kuzborskij","year":"2018"},{"journal-title":"Learning Lipschitz functions by GD-trained shallow overparameterized ReLU neural networks.","year":"2022","author":"Kuzborskij","key":"2025012818241866000_bib23"},{"key":"2025012818241866000_bib24","article-title":"Stability and generalization analysis of gradient methods for shallow neural networks","volume-title":"Advances in neural information processing systems","author":"Lei","year":"2022"},{"key":"2025012818241866000_bib25","first-page":"5809","article-title":"Fine-grained analysis of stability and generalization for stochastic gradient descent","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Lei","year":"2020"},{"key":"2025012818241866000_bib26","article-title":"Sharper generalization bounds for learning with gradient-dominated objective functions","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Lei","year":"2020"},{"key":"2025012818241866000_bib27","article-title":"Learning overparameterized neural networks via stochastic gradient descent on structured data","volume-title":"Advances in neural information 
processing systems","author":"Li","year":"2018"},{"journal-title":"Generalization bounds for deep convolutional neural networks.","year":"2019","author":"Long","key":"2025012818241866000_bib28"},{"key":"2025012818241866000_bib29","first-page":"2388","article-title":"Mean-field theory of two-layers neural networks: Dimension-free bounds and kernel limit","volume-title":"Proceedings of the Conference on Learning Theory","author":"Mei","year":"2019"},{"issue":"33","key":"2025012818241866000_bib30","doi-asserted-by":"crossref","first-page":"E7665","DOI":"10.1073\/pnas.1806579115","article-title":"A mean field view of the landscape of two-layer neural networks","volume":"115","author":"Mei","year":"2018","journal-title":"Proceedings of the National Academy of Sciences"},{"key":"2025012818241866000_bib31","first-page":"605","article-title":"Generalization bounds of SGLD for non-convex learning: Two theoretical viewpoints","volume-title":"Proceedings of the Conference on Learning Theory","author":"Mou","year":"2018"},{"key":"2025012818241866000_bib32","article-title":"A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Neyshabur","year":"2018"},{"key":"2025012818241866000_bib33","first-page":"1376","article-title":"Norm-based capacity control in neural networks","volume-title":"Proceedings of the Conference on Learning Theory","author":"Neyshabur","year":"2015"},{"journal-title":"How many neurons do we need? 
A refined analysis for shallow networks trained with gradient descent","year":"2023","author":"Nguyen","key":"2025012818241866000_bib34"},{"journal-title":"Gradient descent can learn less overparameterized two-layer neural networks on classification problems.","year":"2019","author":"Nitanda","key":"2025012818241866000_bib35"},{"key":"2025012818241866000_bib36","article-title":"Stability and generalisation of gradient descent for shallow neural networks without the neural tangent kernel","volume-title":"Advances in neural information processing systems","author":"Richards","year":"2021"},{"key":"2025012818241866000_bib37","first-page":"1990","article-title":"Learning with gradient descent and weakly convex losses","volume-title":"Proceedings of the International Conference on Artificial Intelligence and Statistics","author":"Richards","year":"2021"},{"issue":"7587","key":"2025012818241866000_bib38","doi-asserted-by":"publisher","first-page":"484","DOI":"10.1038\/nature16961","article-title":"Mastering the game of Go with deep neural networks and tree search","volume":"529","author":"Silver","year":"2016","journal-title":"Nature"},{"key":"2025012818241866000_bib39","first-page":"2199","article-title":"Smoothness, low noise and fast rates","volume-title":"Advances in neural information processing systems","author":"Srebro","year":"2010"},{"issue":"156","key":"2025012818241866000_bib40","first-page":"1","article-title":"Generalization and stability of interpolating neural networks with minimal width","volume":"25","author":"Taheri","year":"2024","journal-title":"Journal of Machine Learning Research"},{"issue":"3","key":"2025012818241866000_bib41","doi-asserted-by":"publisher","first-page":"107","DOI":"10.1145\/3446776","article-title":"Understanding deep learning (still) requires rethinking generalization","volume":"64","author":"Zhang","year":"2021","journal-title":"Communications of the 
ACM"},{"issue":"1","key":"2025012818241866000_bib42","doi-asserted-by":"publisher","first-page":"345","DOI":"10.1007\/s10994-021-06056-w","article-title":"Understanding generalization error of SGD in nonconvex optimization","volume":"111","author":"Zhou","year":"2022","journal-title":"Machine Learning"}],"container-title":["Neural Computation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/neco\/article-pdf\/37\/2\/344\/2479727\/neco_a_01725.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/neco\/article-pdf\/37\/2\/344\/2479727\/neco_a_01725.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,28]],"date-time":"2025-01-28T18:24:46Z","timestamp":1738088686000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/neco\/article\/37\/2\/344\/125265\/Generalization-Guarantees-of-Gradient-Descent-for"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,21]]},"references-count":42,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2025,1,21]]},"published-print":{"date-parts":[[2025,1,21]]}},"URL":"https:\/\/doi.org\/10.1162\/neco_a_01725","relation":{},"ISSN":["0899-7667","1530-888X"],"issn-type":[{"type":"print","value":"0899-7667"},{"type":"electronic","value":"1530-888X"}],"subject":[],"published-other":{"date-parts":[[2025,2]]},"published":{"date-parts":[[2025,1,21]]}}}