{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,25]],"date-time":"2026-01-25T02:13:51Z","timestamp":1769307231054,"version":"3.49.0"},"reference-count":37,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2024,3,29]],"date-time":"2024-03-29T00:00:00Z","timestamp":1711670400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100004358","name":"Samsung Electronics Co., Ltd","doi-asserted-by":"crossref","award":["IO201210-08019-01"],"award-info":[{"award-number":["IO201210-08019-01"]}],"id":[{"id":"10.13039\/100004358","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Intell. Syst. Technol."],"published-print":{"date-parts":[[2024,6,30]]},"abstract":"<jats:p>\n            <jats:italic>L<\/jats:italic>\n            <jats:sub>2<\/jats:sub>\n            regularization for weights in neural networks is widely used as a standard training trick. In addition to weights, the use of batch normalization involves an additional trainable parameter \u03b3, which acts as a scaling factor. However,\n            <jats:italic>L<\/jats:italic>\n            <jats:sub>2<\/jats:sub>\n            regularization for \u03b3 remains an undiscussed mystery and is applied in different ways depending on the library and practitioner. In this article, we study whether\n            <jats:italic>L<\/jats:italic>\n            <jats:sub>2<\/jats:sub>\n            regularization for \u03b3 is valid. To explore this issue, we consider two approaches: (1) variance control to make the residual network behave like an identity mapping and (2) stable optimization through the improvement of effective learning rate. 
Through two analyses, we specify the desirable and undesirable \u03b3 for applying\n            <jats:italic>L<\/jats:italic>\n            <jats:sub>2<\/jats:sub>\n            regularization and propose four guidelines for managing them. In several experiments, we observed that applying\n            <jats:italic>L<\/jats:italic>\n            <jats:sub>2<\/jats:sub>\n            regularization to applicable \u03b3 increased classification accuracy by 1% to 4%, whereas applying\n            <jats:italic>L<\/jats:italic>\n            <jats:sub>2<\/jats:sub>\n            regularization to inapplicable \u03b3 decreased classification accuracy by 1% to 3%, which is consistent with our four guidelines. Our proposed guidelines were further validated through various tasks and architectures, including variants of residual networks and transformers.\n          <\/jats:p>","DOI":"10.1145\/3643860","type":"journal-article","created":{"date-parts":[[2024,2,1]],"date-time":"2024-02-01T11:56:39Z","timestamp":1706788599000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Guidelines for the Regularization of Gammas in Batch Normalization for Deep Residual Networks"],"prefix":"10.1145","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4155-9225","authenticated-orcid":false,"given":"Bum Jun","family":"Kim","sequence":"first","affiliation":[{"name":"Pohang University of Science and Technology, Pohang, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8221-2338","authenticated-orcid":false,"given":"Hyeyeon","family":"Choi","sequence":"additional","affiliation":[{"name":"Pohang University of Science and Technology, Pohang, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5344-8114","authenticated-orcid":false,"given":"Hyeonah","family":"Jang","sequence":"additional","affiliation":[{"name":"Pohang University of Science and Technology, Pohang, Republic of 
Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6023-1837","authenticated-orcid":false,"given":"Sang Woo","family":"Kim","sequence":"additional","affiliation":[{"name":"Pohang University of Science and Technology, Pohang, Republic of Korea"}]}],"member":"320","published-online":{"date-parts":[[2024,3,29]]},"reference":[{"key":"e_1_3_2_2_2","article-title":"Layer normalization","volume":"1607","author":"Ba Lei Jimmy","year":"2016","unstructured":"Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR abs\/1607.06450 (2016).","journal-title":"CoRR"},{"key":"e_1_3_2_3_2","first-page":"446","volume-title":"ECCV (6)","author":"Bossard Lukas","year":"2014","unstructured":"Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2014. Food-101 \u2014 Mining discriminative components with random forests. In ECCV (6), Vol. 8694. 446\u2013461. https:\/\/data.vision.ee.ethz.ch\/cvl\/datasets_extra\/food-101\/"},{"key":"e_1_3_2_4_2","volume-title":"ICLR","author":"Brock Andrew","year":"2021","unstructured":"Andrew Brock, Soham De, and Samuel L. Smith. 2021. Characterizing signal propagation to close the performance gap in unnormalized ResNets. In ICLR."},{"key":"e_1_3_2_5_2","first-page":"2","volume-title":"Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign","author":"Cettolo Mauro","year":"2014","unstructured":"Mauro Cettolo, Jan Niehues, Sebastian St\u00fcker, Luisa Bentivogli, and Marcello Federico. 2014. Report on the 11th IWSLT evaluation campaign. In Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign. 2\u201317. https:\/\/workshop2014.iwslt.org\/"},{"key":"e_1_3_2_6_2","volume-title":"NeurIPS","author":"De Soham","year":"2020","unstructured":"Soham De and Samuel L. Smith. 2020. Batch normalization biases residual blocks towards the identity function in deep networks. 
In NeurIPS."},{"key":"e_1_3_2_7_2","first-page":"248","volume-title":"CVPR","author":"Deng Jia","year":"2009","unstructured":"Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR. 248\u2013255. https:\/\/www.image-net.org\/"},{"key":"e_1_3_2_8_2","first-page":"4171","volume-title":"NAACL-HLT (1)","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1). 4171\u20134186."},{"key":"e_1_3_2_9_2","volume-title":"ICLR","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR."},{"key":"e_1_3_2_10_2","article-title":"Accurate, large minibatch SGD: Training ImageNet in 1 hour","volume":"1706","author":"Goyal Priya","year":"2017","unstructured":"Priya Goyal, Piotr Doll\u00e1r, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR abs\/1706.02677 (2017).","journal-title":"CoRR"},{"key":"e_1_3_2_11_2","first-page":"1026","volume-title":"ICCV","author":"He Kaiming","year":"2015","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV. 1026\u20131034."},{"key":"e_1_3_2_12_2","first-page":"770","volume-title":"CVPR","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 
770\u2013778."},{"key":"e_1_3_2_13_2","first-page":"630","volume-title":"ECCV (4)","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. In ECCV (4), Vol. 9908. 630\u2013645."},{"key":"e_1_3_2_14_2","first-page":"558","volume-title":"CVPR","author":"He Tong","year":"2019","unstructured":"Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. 2019. Bag of tricks for image classification with convolutional neural networks. In CVPR. 558\u2013567."},{"key":"e_1_3_2_15_2","first-page":"2164","volume-title":"NeurIPS","author":"Hoffer Elad","year":"2018","unstructured":"Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. 2018. Norm matters: Efficient and accurate normalization schemes in deep networks. In NeurIPS. 2164\u20132174."},{"key":"e_1_3_2_16_2","first-page":"595","volume-title":"CVPR","author":"Horn Grant Van","year":"2015","unstructured":"Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge J. Belongie. 2015. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In CVPR. 595\u2013604. https:\/\/dl.allaboutbirds.org\/nabirds"},{"key":"e_1_3_2_17_2","first-page":"448","volume-title":"ICML","author":"Ioffe Sergey","year":"2015","unstructured":"Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, Vol. 37. 448\u2013456."},{"key":"e_1_3_2_18_2","article-title":"Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes","volume":"1807","author":"Jia Xianyan","year":"2018","unstructured":"Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, Tiegang Chen, Guangxiao Hu, Shaohuai Shi, and Xiaowen Chu. 2018. 
Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. CoRR abs\/1807.11205 (2018).","journal-title":"CoRR"},{"key":"e_1_3_2_19_2","first-page":"1106","volume-title":"NIPS","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS. 1106\u20131114."},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/3473464"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3200489"},{"key":"e_1_3_2_22_2","first-page":"311","volume-title":"ACL","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In ACL. 311\u2013318."},{"key":"e_1_3_2_23_2","first-page":"3498","volume-title":"CVPR","author":"Parkhi Omkar M.","year":"2012","unstructured":"Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. 2012. Cats and dogs. In CVPR. 3498\u20133505. https:\/\/www.robots.ox.ac.uk\/vgg\/data\/pets\/"},{"key":"e_1_3_2_24_2","volume-title":"ICLR (Workshop)","author":"Springenberg Jost Tobias","year":"2015","unstructured":"Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. 2015. Striving for simplicity: The all convolutional net. In ICLR (Workshop)."},{"key":"e_1_3_2_25_2","volume-title":"ICLR","author":"Summers Cecilia","year":"2020","unstructured":"Cecilia Summers and Michael J. Dinneen. 2020. Four things everyone should know to improve batch normalization. In ICLR."},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3570510"},{"key":"e_1_3_2_27_2","article-title":"L2 regularization versus batch and weight normalization","volume":"1706","author":"Laarhoven Twan van","year":"2017","unstructured":"Twan van Laarhoven. 2017. L2 regularization versus batch and weight normalization. 
CoRR abs\/1706.05350 (2017).","journal-title":"CoRR"},{"key":"e_1_3_2_28_2","first-page":"5998","volume-title":"NIPS","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS. 5998\u20136008."},{"key":"e_1_3_2_29_2","volume-title":"ICLR","author":"Wang Alex","year":"2019","unstructured":"Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR. https:\/\/gluebenchmark.com\/"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3360309"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-019-01198-w"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3570508"},{"key":"e_1_3_2_33_2","first-page":"5987","volume-title":"CVPR","author":"Xie Saining","year":"2017","unstructured":"Saining Xie, Ross B. Girshick, Piotr Doll\u00e1r, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In CVPR. 5987\u20135995."},{"key":"e_1_3_2_34_2","volume-title":"ICLR","author":"Yan Junjie","year":"2020","unstructured":"Junjie Yan, Ruosi Wan, Xiangyu Zhang, Wei Zhang, Yichen Wei, and Jian Sun. 2020. Towards stabilizing batch statistics in backward propagation of batch normalization. In ICLR."},{"key":"e_1_3_2_35_2","volume-title":"BMVC","author":"Zagoruyko Sergey","year":"2016","unstructured":"Sergey Zagoruyko and Nikos Komodakis. 2016. Wide residual networks. In BMVC."},{"key":"e_1_3_2_36_2","first-page":"818","volume-title":"ECCV (1)","author":"Zeiler Matthew D.","year":"2014","unstructured":"Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In ECCV (1), Vol. 8689. 
818\u2013833."},{"key":"e_1_3_2_37_2","volume-title":"ICLR","author":"Zhang Guodong","year":"2019","unstructured":"Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger B. Grosse. 2019. Three mechanisms of weight decay regularization. In ICLR."},{"key":"e_1_3_2_38_2","volume-title":"ICLR","author":"Zhang Hongyi","year":"2019","unstructured":"Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. 2019. Fixup initialization: Residual learning without normalization. In ICLR."}],"container-title":["ACM Transactions on Intelligent Systems and Technology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3643860","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3643860","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T23:57:34Z","timestamp":1750291054000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3643860"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,29]]},"references-count":37,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,6,30]]}},"alternative-id":["10.1145\/3643860"],"URL":"https:\/\/doi.org\/10.1145\/3643860","relation":{},"ISSN":["2157-6904","2157-6912"],"issn-type":[{"value":"2157-6904","type":"print"},{"value":"2157-6912","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,29]]},"assertion":[{"value":"2023-07-19","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-29","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-03-29","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}