{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,15]],"date-time":"2025-08-15T01:02:17Z","timestamp":1755219737193,"version":"3.43.0"},"reference-count":46,"publisher":"IOP Publishing","issue":"3","license":[{"start":{"date-parts":[[2025,8,7]],"date-time":"2025-08-07T00:00:00Z","timestamp":1754524800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"},{"start":{"date-parts":[[2025,8,7]],"date-time":"2025-08-07T00:00:00Z","timestamp":1754524800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/iopscience.iop.org\/info\/page\/text-and-data-mining"}],"funder":[{"DOI":"10.13039\/100006208","name":"High Energy Physics","doi-asserted-by":"crossref","award":["DE-SC0023704"],"award-info":[{"award-number":["DE-SC0023704"]}],"id":[{"id":"10.13039\/100006208","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["iopscience.iop.org"],"crossmark-restriction":false},"short-container-title":["Mach. Learn.: Sci. Technol."],"published-print":{"date-parts":[[2025,9,30]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Fully-connected deep neural networks with weights initialized from independent Gaussian distributions can be tuned to criticality, which prevents the exponential growth or decay of signals propagating through the network. However, such networks still exhibit fluctuations that grow linearly with the depth of the network, which may impair the training of networks with width comparable to depth. We show analytically that rectangular networks with tanh activations and weights initialized from the ensemble of orthogonal matrices have corresponding preactivation fluctuations which are independent of depth, to leading order in inverse width. 
Moreover, we demonstrate numerically that, at initialization, all correlators involving the neural tangent kernel (NTK) and its descendants at leading order in inverse width\u2014which govern the evolution of observables during training\u2014saturate at a depth of <jats:inline-formula>\n                     <jats:tex-math\/>\n                     <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\" overflow=\"scroll\">\n                        <mml:mrow>\n                           <mml:mrow>\n                              <mml:mo>\u223c<\/mml:mo>\n                           <\/mml:mrow>\n                           <mml:mn>20<\/mml:mn>\n                        <\/mml:mrow>\n                     <\/mml:math>\n                  <\/jats:inline-formula>, rather than growing without bound as in the case of Gaussian initializations. We speculate that this structure preserves finite-width feature learning while reducing overall noise, thus improving both generalization and training speed in deep networks with depth comparable to width. 
We provide some experimental justification by relating empirical measurements of the NTK to the superior performance of deep non-linear orthogonal networks trained under full-batch gradient descent on the MNIST and CIFAR-10 classification tasks.<\/jats:p>","DOI":"10.1088\/2632-2153\/adf278","type":"journal-article","created":{"date-parts":[[2025,7,21]],"date-time":"2025-07-21T22:55:48Z","timestamp":1753138548000},"page":"035027","update-policy":"https:\/\/doi.org\/10.1088\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Feature learning and generalization in deep networks with orthogonal weights"],"prefix":"10.1088","volume":"6","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4496-5600","authenticated-orcid":true,"given":"Hannah","family":"Day","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9379-1838","authenticated-orcid":true,"given":"Yonatan","family":"Kahn","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5755-2274","authenticated-orcid":true,"given":"Daniel A","family":"Roberts","sequence":"additional","affiliation":[]}],"member":"266","published-online":{"date-parts":[[2025,8,7]]},"reference":[{"key":"mlstadf278bib1","article-title":"Exponential expressivity in deep neural networks through transient chaos","volume":"vol 29","author":"Poole","year":"2016"},{"article-title":"Deep information propagation","year":"2017","author":"Schoenholz","key":"mlstadf278bib2"},{"key":"mlstadf278bib3","article-title":"Which neural net architectures give rise to exploding and vanishing gradients?","volume":"vol 31","author":"Hanin","year":"2018"},{"key":"mlstadf278bib4","article-title":"How to start training: the effect of initialization and architecture","volume":"vol 31","author":"Hanin","year":"2018"},{"article-title":"Finite depth and width corrections to the neural tangent kernel","year":"2020","author":"Hanin","key":"mlstadf278bib5"},{"key":"mlstadf278bib6","first-page":"165","article-title":"Non-Gaussian processes and neural networks at finite widths","author":"Yaida","year":"2020"},{"key":"mlstadf278bib7","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1017\/9781009023405","author":"Roberts","year":"2022"},{"key":"mlstadf278bib8","first-page":"5042","article-title":"The effect of network width on stochastic gradient descent and generalization: an empirical study","author":"Park","year":"2019"},{"key":"mlstadf278bib9","first-page":"11727","article-title":"Tensor programs IV: feature learning in infinite-width neural networks","author":"Yang","year":"2021"},{"article-title":"Meta-principled family of hyperparameter scaling strategies","year":"2022","author":"Yaida","key":"mlstadf278bib10"},{"article-title":"Adam: a method for stochastic optimization","year":"2015","author":"Kingma","key":"mlstadf278bib11"},{"article-title":"Decoupled weight decay regularization","year":"2019","author":"Loshchilov","key":"mlstadf278bib12"},{"article-title":"Layer normalization","year":"2016","author":"Ba","key":"mlstadf278bib13"},{"key":"mlstadf278bib14","first-page":"40054","article-title":"Critical initialization of wide and deep neural networks using partial Jacobians: general theory and applications","volume":"vol 36","author":"Doshi","year":"2023"},{"article-title":"Autoinit: automatic initialization via Jacobian tuning","year":"2022","author":"He","key":"mlstadf278bib15"},{"key":"mlstadf278bib16","article-title":"Neural tangent kernel: convergence and generalization in neural networks","volume":"vol 31","author":"Jacot","year":"2018"},{"key":"mlstadf278bib17","article-title":"On exact computation with an infinitely wide neural net","volume":"vol 32","author":"Arora","year":"2019"},{"key":"mlstadf278bib18","article-title":"Wide neural networks of any depth evolve as linear models under gradient descent","volume":"vol 32","author":"Lee","year":"2019"},{"article-title":"Exact solutions to the nonlinear dynamics of learning in 
deep linear neural networks","year":"2013","author":"Saxe","key":"mlstadf278bib19"},{"key":"mlstadf278bib20","article-title":"Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice","volume":"vol 30","author":"Pennington","year":"2017"},{"key":"mlstadf278bib21","first-page":"1924","article-title":"The emergence of spectral universality in deep networks","author":"Pennington","year":"2018"},{"key":"mlstadf278bib22","first-page":"5393","article-title":"Dynamical isometry and a mean field theory of CNNs: how to train 10,000-layer vanilla convolutional neural networks","author":"Xiao","year":"2018"},{"article-title":"Provable benefit of orthogonal initialization in optimizing deep linear networks","year":"2020","author":"Hu","key":"mlstadf278bib23"},{"key":"mlstadf278bib24","doi-asserted-by":"crossref","DOI":"10.24963\/ijcai.2021\/355","article-title":"On the neural tangent kernel of deep networks with orthogonal initialization","author":"Huang","year":"2021"},{"article-title":"Applications of statistical field theory in deep learning","year":"2025","author":"Ringel","key":"mlstadf278bib25"},{"key":"mlstadf278bib26","article-title":"ImageNet classification with deep convolutional neural networks","volume":"vol 25","author":"Krizhevsky","year":"2012"},{"key":"mlstadf278bib27","article-title":"Attention is all you need","volume":"vol 30","author":"Vaswani","year":"2017"},{"article-title":"Effective theory of transformers at initialization","year":"2023","author":"Dinan","key":"mlstadf278bib28"},{"article-title":"Differential learning kinetics govern the transition from memorization to generalization during in-context learning","year":"2024","author":"Nguyen","key":"mlstadf278bib29"},{"key":"mlstadf278bib30","doi-asserted-by":"publisher","first-page":"JHEP12(2020)085","DOI":"10.1007\/JHEP12(2020)085","article-title":"Inclusive search for highly boosted Higgs bosons decaying to bottom quark-antiquark pairs in proton-proton 
collisions at \u221as = 13 TeV","author":"CMS collaboration","year":"2020","journal-title":"J. High Energy Phys."},{"key":"mlstadf278bib31","doi-asserted-by":"publisher","first-page":"1078","DOI":"10.1038\/s41550-020-1131-2","article-title":"Evidence for a vast prograde stellar stream in the solar vicinity","volume":"4","author":"Necib","year":"2020","journal-title":"Nat. Astron."},{"key":"mlstadf278bib32","doi-asserted-by":"publisher","first-page":"999","DOI":"10.1063\/1.523807","article-title":"Asymptotic behavior of group integrals in the limit of infinite rank","volume":"19","author":"Weingarten","year":"1978","journal-title":"J. Math. Phys."},{"key":"mlstadf278bib33","doi-asserted-by":"publisher","first-page":"773","DOI":"10.1007\/s00220-006-1554-3","article-title":"Integration with respect to the Haar measure on unitary, orthogonal and symplectic group","volume":"264","author":"Collins","year":"2006","journal-title":"Commun. Math. Phys."},{"key":"mlstadf278bib34","doi-asserted-by":"publisher","DOI":"10.1063\/1.3251304","article-title":"On some properties of orthogonal Weingarten functions","volume":"50","author":"Collins","year":"2009","journal-title":"J. Math. Phys."},{"article-title":"Moments of random matrices and Weingarten functions","year":"2013","author":"Gu","key":"mlstadf278bib35"},{"key":"mlstadf278bib36","doi-asserted-by":"publisher","first-page":"734","DOI":"10.1090\/noti2474","article-title":"The Weingarten calculus","volume":"69","author":"Collins","year":"2022","journal-title":"Not. Am. Math. Soc."},{"article-title":"Asymptotics of wide networks from Feynman diagrams","year":"2019","author":"Dyer","key":"mlstadf278bib37"},{"key":"mlstadf278bib38","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevD.109.105007","article-title":"Structures of neural network effective theories","volume":"109","author":"Banta","year":"2024","journal-title":"Phys. Rev. 
D"},{"key":"mlstadf278bib39","first-page":"pp 3364","article-title":"Exact marginal prior distributions of finite bayesian neural networks","volume":"vol 34","author":"Zavatone-Veth","year":"2021"},{"article-title":"Keras","year":"2015","author":"Chollet","key":"mlstadf278bib40"},{"key":"mlstadf278bib41","first-page":"592","article-title":"How to generate random matrices from the classical compact groups","volume":"54","author":"Mezzadri","year":"2007","journal-title":"Not. AMS"},{"article-title":"Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks","year":"2021","author":"Hui","key":"mlstadf278bib42"},{"article-title":"Gradient descent on neural networks typically occurs at the edge of stability","year":"2021","author":"Cohen","key":"mlstadf278bib43"},{"article-title":"Sparser, better, deeper, stronger: Improving sparse training with exact orthogonal initialization","year":"2025","author":"Nowak","key":"mlstadf278bib44"},{"article-title":"Interpretable uncertainty quantification in AI for HEP, Snowmass","year":"2022","author":"Chen","key":"mlstadf278bib45"},{"key":"mlstadf278bib46","first-page":"pp 41","article-title":"Hal: computer system for scalable deep learning","author":"Kindratenko","year":"2020"}],"container-title":["Machine Learning: Science and 
Technology"],"original-title":[],"link":[{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adf278","content-type":"text\/html","content-version":"am","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adf278\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adf278","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adf278\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adf278\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adf278\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adf278\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"similarity-checking"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adf278\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,7]],"date-time":"2025-08-07T08:55:50Z","timestamp":1754556950000},"score":1,"resource":{"primary":{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adf278"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,7]]},"references-count":46,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2025,8,7]]},"published-print":{"date-parts":[[2025,9,30]]}},"URL":"https:\/\/doi.org\/10.1088\/2632-2153\/adf278","relation":{},"ISSN":["2632-2153"],"issn-type":[{"type":"electronic","value":"2632
-2153"}],"subject":[],"published":{"date-parts":[[2025,8,7]]},"assertion":[{"value":"Feature learning and generalization in deep networks with orthogonal weights","name":"article_title","label":"Article Title"},{"value":"Machine Learning: Science and Technology","name":"journal_title","label":"Journal Title"},{"value":"paper","name":"article_type","label":"Article Type"},{"value":"\u00a9 2025 The Author(s). Published by IOP Publishing Ltd","name":"copyright_information","label":"Copyright Information"},{"value":"2025-03-25","name":"date_received","label":"Date Received","group":{"name":"publication_dates","label":"Publication dates"}},{"value":"2025-07-21","name":"date_accepted","label":"Date Accepted","group":{"name":"publication_dates","label":"Publication dates"}},{"value":"2025-08-07","name":"date_epub","label":"Online publication date","group":{"name":"publication_dates","label":"Publication dates"}}]}}