{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,21]],"date-time":"2026-01-21T07:18:07Z","timestamp":1768979887343,"version":"3.49.0"},"reference-count":87,"publisher":"Association for Computing Machinery (ACM)","license":[{"start":{"date-parts":[[2022,6,22]],"date-time":"2022-06-22T00:00:00Z","timestamp":1655856000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Knowl. Discov. Data"],"abstract":"<jats:p>\n            This paper\n            <jats:sup>1<\/jats:sup>\n            studies how to schedule hyperparameters to improve generalization of both\n            <jats:italic>centralized<\/jats:italic>\n            single-machine stochastic gradient descent (SGD) and\n            <jats:italic>distributed<\/jats:italic>\n            asynchronous SGD (ASGD). SGD augmented with momentum variants (e.g., heavy ball momentum (SHB) and Nesterov\u2019s accelerated gradient (NAG)) has been the default optimizer for many tasks, in both centralized and distributed environments. However, many advanced momentum variants, despite empirical advantage over classical SHB\/NAG, introduce extra hyperparameters to tune. The error-prone tuning is the main barrier for AutoML.\n          <\/jats:p>\n          <jats:p>\n            <jats:italic>Centralized<\/jats:italic>\n            <jats:bold>SGD<\/jats:bold>\n            : We first focus on\n            <jats:italic>centralized<\/jats:italic>\n            single-machine SGD and show how to efficiently schedule the hyperparameters of a large class of momentum variants to improve generalization. We propose a unified framework called multistage quasi-hyperbolic momentum (Multistage QHM), which covers a large family of momentum variants as its special cases (e.g. vanilla SGD\/SHB\/NAG). 
Existing works mainly focus on scheduling the decay of the learning rate\n            <jats:italic>\u03b1<\/jats:italic>\n            , while multistage QHM additionally allows other hyperparameters (e.g., the momentum factor) to vary and demonstrates better generalization than tuning\n            <jats:italic>\u03b1<\/jats:italic>\n            alone. We show the convergence of multistage QHM for general nonconvex objectives.\n          <\/jats:p>\n          <jats:p>\n            <jats:italic>Distributed<\/jats:italic>\n            <jats:bold>SGD<\/jats:bold>\n            : We then extend our theory to distributed asynchronous SGD (ASGD), in which a parameter server distributes data batches to several worker machines and updates parameters by aggregating batch gradients from the workers. We quantify the asynchrony between different workers (i.e., gradient staleness), model the dynamics of asynchronous iterations via a stochastic differential equation (SDE), and then derive a PAC-Bayesian generalization bound for ASGD. As a byproduct, we show how a moderately large learning rate helps ASGD generalize better.\n          <\/jats:p>\n          <jats:p>Our tuning strategies rest on rigorous justification rather than blind trial and error: in both cases, we theoretically prove that they decrease the derived generalization errors. Our strategies simplify the tuning process and empirically outperform competitive optimizers in test accuracy. 
Our code is publicly available at https:\/\/github.com\/jsycsjh\/centralized-asynchronous-tuning.<\/jats:p>","DOI":"10.1145\/3544782","type":"journal-article","created":{"date-parts":[[2022,6,22]],"date-time":"2022-06-22T13:25:09Z","timestamp":1655904309000},"update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Scheduling Hyperparameters to Improve Generalization: From Centralized SGD to Asynchronous SGD"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0032-3646","authenticated-orcid":false,"given":"Jianhui","family":"Sun","sequence":"first","affiliation":[{"name":"University of Virginia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1602-1212","authenticated-orcid":false,"given":"Ying","family":"Yang","sequence":"additional","affiliation":[{"name":"University of Michigan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7657-4305","authenticated-orcid":false,"given":"Guangxu","family":"Xun","sequence":"additional","affiliation":[{"name":"Baidu Research"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9723-3246","authenticated-orcid":false,"given":"Aidong","family":"Zhang","sequence":"additional","affiliation":[{"name":"University of Virginia"}]}],"member":"320","published-online":{"date-parts":[[2022,6,22]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16)","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi , Paul Barham , Jianmin Chen , Zhifeng Chen , Andy Davis , Jeffrey Dean , Matthieu Devin , Sanjay Ghemawat , Geoffrey Irving , Michael Isard , Manjunath Kudlur , Josh Levenberg , Rajat Monga , Sherry Moore , Derek\u00a0 G. Murray , Benoit Steiner , Paul Tucker , Vijay Vasudevan , Pete Warden , Martin Wicke , Yuan Yu , and Xiaoqiang Zheng . 2016 . TensorFlow: A System for Large-Scale Machine Learning. 
In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) . USENIX Association, Savannah, GA, 265\u2013283. https:\/\/www.usenix.org\/conference\/osdi16\/technical-sessions\/presentation\/abadi Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek\u00a0G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, Savannah, GA, 265\u2013283. https:\/\/www.usenix.org\/conference\/osdi16\/technical-sessions\/presentation\/abadi"},{"key":"e_1_2_1_2_1","unstructured":"Takuya Akiba Shuji Suzuki and Keisuke Fukuda. 2017. Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes. ArXiv abs\/1711.04325(2017).  Takuya Akiba Shuji Suzuki and Keisuke Fukuda. 2017. Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes. ArXiv abs\/1711.04325(2017)."},{"key":"e_1_2_1_3_1","unstructured":"Jing An J. Lu and Lexing Ying. 2018. Stochastic modified equations for the asynchronous stochastic gradient descent. ArXiv abs\/1805.08244(2018).  Jing An J. Lu and Lexing Ying. 2018. Stochastic modified equations for the asynchronous stochastic gradient descent. ArXiv abs\/1805.08244(2018)."},{"key":"e_1_2_1_4_1","volume-title":"2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 8522\u20138531","author":"An W.","unstructured":"W. An , H. Wang , Q. Sun , J. Xu , Q. Dai , and L. Zhang . 2018. A PID Controller Approach for Stochastic Optimization of Deep Networks . In 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 8522\u20138531 . W. An, H. Wang, Q. Sun, J. Xu, Q. Dai, and L. Zhang. 2018. 
A PID Controller Approach for Stochastic Optimization of Deep Networks. In 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 8522\u20138531."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2020.3026619"},{"key":"e_1_2_1_6_1","unstructured":"Mahmoud Assran Nicolas Loizou Nicolas Ballas and Michael\u00a0G. Rabbat. 2019. Stochastic Gradient Push for Distributed Deep Learning. In ICML.  Mahmoud Assran Nicolas Loizou Nicolas Ballas and Michael\u00a0G. Rabbat. 2019. Stochastic Gradient Push for Distributed Deep Learning. In ICML."},{"key":"e_1_2_1_7_1","series-title":"SIAM Journal on Optimization 30 (01","volume-title":"Robust Accelerated Gradient Methods for Smooth Strongly Convex Functions","author":"Aybat Necdet","year":"2020","unstructured":"Necdet Aybat , Alireza Fallah , Mert G\u00fcrb\u00fczbalaban , and Asuman Ozdaglar . 2020. Robust Accelerated Gradient Methods for Smooth Strongly Convex Functions . SIAM Journal on Optimization 30 (01 2020 ), 717\u2013751. https:\/\/doi.org\/10.1137\/19M1244925 Necdet Aybat, Alireza Fallah, Mert G\u00fcrb\u00fczbalaban, and Asuman Ozdaglar. 2020. Robust Accelerated Gradient Methods for Smooth Strongly Convex Functions. SIAM Journal on Optimization 30 (01 2020), 717\u2013751. https:\/\/doi.org\/10.1137\/19M1244925"},{"key":"e_1_2_1_8_1","doi-asserted-by":"crossref","unstructured":"Yoshua Bengio. 2012. Practical Recommendations for Gradient-Based Training of Deep Architectures. In Neural Networks: Tricks of the Trade.  Yoshua Bengio. 2012. Practical Recommendations for Gradient-Based Training of Deep Architectures. In Neural Networks: Tricks of the Trade.","DOI":"10.1007\/978-3-642-35289-8_26"},{"key":"e_1_2_1_9_1","article-title":"Random Search for Hyper-Parameter Optimization","author":"Bergstra James","year":"2012","unstructured":"James Bergstra and Yoshua Bengio . 2012 . Random Search for Hyper-Parameter Optimization . J. Mach. Learn. Res. 13, null ( Feb. 2012), 281\u2013305. 
James Bergstra and Yoshua Bengio. 2012. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res. 13, null (Feb. 2012), 281\u2013305.","journal-title":"J. Mach. Learn. Res. 13, null"},{"key":"e_1_2_1_10_1","unstructured":"L. Bottou Frank\u00a0E. Curtis and J. Nocedal. 2018. Optimization Methods for Large-Scale Machine Learning. ArXiv abs\/1606.04838(2018).  L. Bottou Frank\u00a0E. Curtis and J. Nocedal. 2018. Optimization Methods for Large-Scale Machine Learning. ArXiv abs\/1606.04838(2018)."},{"key":"e_1_2_1_11_1","unstructured":"B. Can Mert G\u00fcrb\u00fczbalaban and Lingjiong Zhu. 2019. Accelerated Linear Convergence of Stochastic Momentum Methods in Wasserstein Distances. In ICML.  B. Can Mert G\u00fcrb\u00fczbalaban and Lingjiong Zhu. 2019. Accelerated Linear Convergence of Stochastic Momentum Methods in Wasserstein Distances. In ICML."},{"key":"e_1_2_1_12_1","volume-title":"11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14)","author":"Chilimbi Trishul","year":"2014","unstructured":"Trishul Chilimbi , Yutaka Suzue , Johnson Apacible , and Karthik Kalyanaraman . 2014 . Project Adam: Building an Efficient and Scalable Deep Learning Training System . In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) . USENIX Association, Broomfield, CO, 571\u2013582. https:\/\/www.usenix.org\/conference\/osdi14\/technical-sessions\/presentation\/chilimbi Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). USENIX Association, Broomfield, CO, 571\u2013582. 
https:\/\/www.usenix.org\/conference\/osdi14\/technical-sessions\/presentation\/chilimbi"},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the 25th International Conference on Neural Information Processing Systems -","volume":"1","author":"Dean Jeffrey","year":"2012","unstructured":"Jeffrey Dean , Greg\u00a0 S. Corrado , Rajat Monga , Kai Chen , Matthieu Devin , Quoc\u00a0 V. Le , Mark\u00a0 Z. Mao , Marc\u2019Aurelio Ranzato , Andrew Senior , Paul Tucker , Ke Yang , and Andrew\u00a0 Y. Ng . 2012 . Large Scale Distributed Deep Networks . In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (Lake Tahoe, Nevada) (NIPS\u201912). Curran Associates Inc., Red Hook, NY, USA, 1223\u20131231. Jeffrey Dean, Greg\u00a0S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc\u00a0V. Le, Mark\u00a0Z. Mao, Marc\u2019Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew\u00a0Y. Ng. 2012. Large Scale Distributed Deep Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (Lake Tahoe, Nevada) (NIPS\u201912). Curran Associates Inc., Red Hook, NY, USA, 1223\u20131231."},{"key":"e_1_2_1_14_1","volume-title":"Advances in Neural Information Processing Systems 32. Curran Associates","author":"Defazio Aaron","unstructured":"Aaron Defazio . 2019. On the Curved Geometry of Accelerated Optimization . In Advances in Neural Information Processing Systems 32. Curran Associates , Inc ., 1766\u20131775. http:\/\/papers.nips.cc\/paper\/8453-on-the-curved-geometry-of-accelerated-optimization.pdf Aaron Defazio. 2019. On the Curved Geometry of Accelerated Optimization. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 1766\u20131775. 
http:\/\/papers.nips.cc\/paper\/8453-on-the-curved-geometry-of-accelerated-optimization.pdf"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_2_1_16_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs\/1810.04805(2018). arxiv:1810.04805 http:\/\/arxiv.org\/abs\/1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs\/1810.04805(2018). arxiv:1810.04805 http:\/\/arxiv.org\/abs\/1810.04805 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs\/1810.04805(2018). arxiv:1810.04805 http:\/\/arxiv.org\/abs\/1810.04805"},{"key":"e_1_2_1_17_1","volume-title":"Du and Wei Hu","author":"S.","year":"2019","unstructured":"Simon\u00a0 S. Du and Wei Hu . 2019 . Width Provably Matters in Optimization for Deep Linear Neural Networks. CoRR abs\/1901.08572(2019). arxiv:1901.08572 http:\/\/arxiv.org\/abs\/1901.08572 Simon\u00a0S. Du and Wei Hu. 2019. Width Provably Matters in Optimization for Deep Linear Neural Networks. CoRR abs\/1901.08572(2019). arxiv:1901.08572 http:\/\/arxiv.org\/abs\/1901.08572"},{"key":"e_1_2_1_18_1","unstructured":"R. Ge Sham\u00a0M. Kakade R. Kidambi and Praneeth Netrapalli. 2019. The Step Decay Schedule: A Near Optimal Geometrically Decaying Learning Rate Procedure. In NeurIPS.  R. Ge Sham\u00a0M. Kakade R. Kidambi and Praneeth Netrapalli. 2019. The Step Decay Schedule: A Near Optimal Geometrically Decaying Learning Rate Procedure. In NeurIPS."},{"key":"e_1_2_1_19_1","volume-title":"International Conference on Learning Representations. 
https:\/\/openreview.net\/forum?id=Bkeb7lHtvH","author":"Giladi Niv","year":"2020","unstructured":"Niv Giladi , Mor\u00a0Shpigel Nacson , Elad Hoffer , and Daniel Soudry . 2020 . At Stability\u2019s Edge: How to Adjust Hyperparameters to Preserve Minima Selection in Asynchronous Training of Neural Networks? . In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=Bkeb7lHtvH Niv Giladi, Mor\u00a0Shpigel Nacson, Elad Hoffer, and Daniel Soudry. 2020. At Stability\u2019s Edge: How to Adjust Hyperparameters to Preserve Minima Selection in Asynchronous Training of Neural Networks?. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=Bkeb7lHtvH"},{"key":"e_1_2_1_20_1","volume-title":"Advances in Neural Information Processing Systems 32. Curran Associates","author":"Gitman Igor","unstructured":"Igor Gitman , Hunter Lang , Pengchuan Zhang , and Lin Xiao . 2019. Understanding the Role of Momentum in Stochastic Gradient Methods . In Advances in Neural Information Processing Systems 32. Curran Associates , Inc ., 9633\u20139643. http:\/\/papers.nips.cc\/paper\/9158-understanding-the-role-of-momentum-in-stochastic-gradient-methods.pdf Igor Gitman, Hunter Lang, Pengchuan Zhang, and Lin Xiao. 2019. Understanding the Role of Momentum in Stochastic Gradient Methods. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 9633\u20139643. http:\/\/papers.nips.cc\/paper\/9158-understanding-the-role-of-momentum-in-stochastic-gradient-methods.pdf"},{"key":"e_1_2_1_21_1","unstructured":"Priya Goyal Piotr Doll\u00e1r Ross\u00a0B. Girshick Pieter Noordhuis Lukasz Wesolowski Aapo Kyrola Andrew Tulloch Yangqing Jia and Kaiming He. 2017. Accurate Large Minibatch SGD: Training ImageNet in 1 Hour. CoRR abs\/1706.02677(2017). arxiv:1706.02677 http:\/\/arxiv.org\/abs\/1706.02677  Priya Goyal Piotr Doll\u00e1r Ross\u00a0B. 
Girshick Pieter Noordhuis Lukasz Wesolowski Aapo Kyrola Andrew Tulloch Yangqing Jia and Kaiming He. 2017. Accurate Large Minibatch SGD: Training ImageNet in 1 Hour. CoRR abs\/1706.02677(2017). arxiv:1706.02677 http:\/\/arxiv.org\/abs\/1706.02677"},{"key":"e_1_2_1_22_1","unstructured":"Benjamin Guedj. 2019. A Primer on PAC-Bayesian Learning. ArXiv abs\/1901.05353(2019).  Benjamin Guedj. 2019. A Primer on PAC-Bayesian Learning. ArXiv abs\/1901.05353(2019)."},{"key":"e_1_2_1_23_1","unstructured":"Ido Hakimi Saar Barkai Moshe Gabel and A. Schuster. 2019. Taming Momentum in a Distributed Asynchronous Environment. ArXiv abs\/1907.11612(2019).  Ido Hakimi Saar Barkai Moshe Gabel and A. Schuster. 2019. Taming Momentum in a Distributed Asynchronous Environment. ArXiv abs\/1907.11612(2019)."},{"key":"e_1_2_1_24_1","volume-title":"Advances in Neural Information Processing Systems 32. Curran Associates","author":"He Fengxiang","unstructured":"Fengxiang He , Tongliang Liu , and Dacheng Tao . 2019. Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence . In Advances in Neural Information Processing Systems 32. Curran Associates , Inc ., 1143\u20131152. http:\/\/papers.nips.cc\/paper\/8398-control-batch-size-and-learning-rate-to-generalize-well-theoretical-and-empirical-evidence.pdf Fengxiang He, Tongliang Liu, and Dacheng Tao. 2019. Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 1143\u20131152. http:\/\/papers.nips.cc\/paper\/8398-control-batch-size-and-learning-rate-to-generalize-well-theoretical-and-empirical-evidence.pdf"},{"key":"e_1_2_1_25_1","volume-title":"Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770\u2013778","author":"He K.","unstructured":"K. He , X. Zhang , S. Ren , and J. Sun . 2016 . 
Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770\u2013778 . K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770\u2013778."},{"key":"e_1_2_1_26_1","volume-title":"Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","author":"He Kaiming","year":"2016","unstructured":"Kaiming He , X. Zhang , Shaoqing Ren , and Jian Sun . 2016 . Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 770\u2013778. Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 770\u2013778."},{"key":"e_1_2_1_27_1","unstructured":"Kaiming He X. Zhang Shaoqing Ren and Jian Sun. 2016. Identity Mappings in Deep Residual Networks. ArXiv abs\/1603.05027(2016).  Kaiming He X. Zhang Shaoqing Ren and Jian Sun. 2016. Identity Mappings in Deep Residual Networks. ArXiv abs\/1603.05027(2016)."},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the 31st International Conference on Neural Information Processing Systems","author":"Hoffer Elad","year":"2017","unstructured":"Elad Hoffer , Itay Hubara , and Daniel Soudry . 2017 . Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks . In Proceedings of the 31st International Conference on Neural Information Processing Systems ( Long Beach, California, USA) (NIPS\u201917). Curran Associates Inc., Red Hook, NY, USA, 1729\u20131739. Elad Hoffer, Itay Hubara, and Daniel Soudry. 2017. Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks. 
In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS\u201917). Curran Associates Inc., Red Hook, NY, USA, 1729\u20131739."},{"key":"e_1_2_1_29_1","volume-title":"Squeeze-and-Excitation Networks. In 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 7132\u20137141","author":"Hu Jie","year":"2018","unstructured":"Jie Hu , Li Shen , and Gang Sun . 2018 . Squeeze-and-Excitation Networks. In 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 7132\u20137141 . https:\/\/doi.org\/10.1109\/CVPR.2018.00745 Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-Excitation Networks. In 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 7132\u20137141. https:\/\/doi.org\/10.1109\/CVPR.2018.00745"},{"key":"e_1_2_1_30_1","unstructured":"Wenqing Hu Chris\u00a0Junchi Li Lei Li and Jian-Guo Liu. 2018. On the diffusion approximation of nonconvex stochastic gradient descent. arxiv:1705.07562 \u00a0[stat.ML]  Wenqing Hu Chris\u00a0Junchi Li Lei Li and Jian-Guo Liu. 2018. On the diffusion approximation of nonconvex stochastic gradient descent. arxiv:1705.07562 \u00a0[stat.ML]"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403089"},{"key":"e_1_2_1_32_1","volume-title":"Snapshot Ensembles: Train 1, get M for free. CoRR abs\/1704.00109(2017). arxiv:1704.00109 http:\/\/arxiv.org\/abs\/1704.00109","author":"Huang Gao","year":"2017","unstructured":"Gao Huang , Yixuan Li , Geoff Pleiss , Zhuang Liu , John\u00a0 E. Hopcroft , and Kilian\u00a0 Q. Weinberger . 2017 . Snapshot Ensembles: Train 1, get M for free. CoRR abs\/1704.00109(2017). arxiv:1704.00109 http:\/\/arxiv.org\/abs\/1704.00109 Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John\u00a0E. Hopcroft, and Kilian\u00a0Q. Weinberger. 2017. Snapshot Ensembles: Train 1, get M for free. CoRR abs\/1704.00109(2017). 
arxiv:1704.00109 http:\/\/arxiv.org\/abs\/1704.00109"},{"key":"e_1_2_1_33_1","volume-title":"Densely Connected Convolutional Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2261\u20132269","author":"Huang G.","unstructured":"G. Huang , Z. Liu , L. Van Der Maaten, and K.\u00a0Q. Weinberger. 2017 . Densely Connected Convolutional Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2261\u20132269 . G. Huang, Z. Liu, L. Van Der Maaten, and K.\u00a0Q. Weinberger. 2017. Densely Connected Convolutional Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2261\u20132269."},{"key":"e_1_2_1_34_1","unstructured":"Stanis\u0142aw Jastrz\u0119bski Zac Kenton Devansh Arpit Nicolas Ballas Asja Fischer Amos Storkey and Yoshua Bengio. 2018. Three factors influencing minima in SGD. https:\/\/openreview.net\/forum?id=rJma2bZCW  Stanis\u0142aw Jastrz\u0119bski Zac Kenton Devansh Arpit Nicolas Ballas Asja Fischer Amos Storkey and Yoshua Bengio. 2018. Three factors influencing minima in SGD. https:\/\/openreview.net\/forum?id=rJma2bZCW"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330977"},{"key":"e_1_2_1_36_1","unstructured":"Nitish\u00a0Shirish Keskar Dheevatsa Mudigere Jorge Nocedal Mikhail Smelyanskiy and Ping Tak\u00a0Peter Tang. 2016. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. CoRR abs\/1609.04836(2016). arxiv:1609.04836 http:\/\/arxiv.org\/abs\/1609.04836  Nitish\u00a0Shirish Keskar Dheevatsa Mudigere Jorge Nocedal Mikhail Smelyanskiy and Ping Tak\u00a0Peter Tang. 2016. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. CoRR abs\/1609.04836(2016). arxiv:1609.04836 http:\/\/arxiv.org\/abs\/1609.04836"},{"key":"e_1_2_1_37_1","unstructured":"Nitish\u00a0Shirish Keskar and Richard Socher. 2017. Improving Generalization Performance by Switching from Adam to SGD. 
CoRR abs\/1712.07628(2017). arxiv:1712.07628 http:\/\/arxiv.org\/abs\/1712.07628  Nitish\u00a0Shirish Keskar and Richard Socher. 2017. Improving Generalization Performance by Switching from Adam to SGD. CoRR abs\/1712.07628(2017). arxiv:1712.07628 http:\/\/arxiv.org\/abs\/1712.07628"},{"key":"e_1_2_1_38_1","unstructured":"Rahul Kidambi Praneeth Netrapalli Prateek Jain and Sham\u00a0M. Kakade. 2018. On the insufficiency of existing momentum schemes for Stochastic Optimization. CoRR abs\/1803.05591(2018). arxiv:1803.05591 http:\/\/arxiv.org\/abs\/1803.05591  Rahul Kidambi Praneeth Netrapalli Prateek Jain and Sham\u00a0M. Kakade. 2018. On the insufficiency of existing momentum schemes for Stochastic Optimization. CoRR abs\/1803.05591(2018). arxiv:1803.05591 http:\/\/arxiv.org\/abs\/1803.05591"},{"key":"e_1_2_1_39_1","volume-title":"Kingma and Jimmy Ba","author":"P.","year":"2015","unstructured":"Diederik\u00a0 P. Kingma and Jimmy Ba . 2015 . Adam : A Method for Stochastic Optimization. CoRR abs\/1412.6980(2015). Diederik\u00a0P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. CoRR abs\/1412.6980(2015)."},{"key":"e_1_2_1_40_1","unstructured":"Alex Krizhevsky Ilya Sutskever and Geoffrey\u00a0E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. (2012) 1097\u20131105. http:\/\/papers.nips.cc\/paper\/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf  Alex Krizhevsky Ilya Sutskever and Geoffrey\u00a0E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. (2012) 1097\u20131105. http:\/\/papers.nips.cc\/paper\/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf"},{"key":"e_1_2_1_41_1","unstructured":"A. Kulunchakov and J. Mairal. 2019. Estimate Sequences for Variance-Reduced Stochastic Composite Optimization. In ICML.  A. Kulunchakov and J. Mairal. 2019. Estimate Sequences for Variance-Reduced Stochastic Composite Optimization. 
In ICML."},{"key":"e_1_2_1_42_1","unstructured":"M. Laborde and Adam\u00a0M. Oberman. 2020. A Lyapunov analysis for accelerated gradient methods: from deterministic to stochastic case. In AISTATS.  M. Laborde and Adam\u00a0M. Oberman. 2020. A Lyapunov analysis for accelerated gradient methods: from deterministic to stochastic case. In AISTATS."},{"key":"e_1_2_1_43_1","volume-title":"International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=rsogjAnYs4z","author":"Lee Namhoon","year":"2021","unstructured":"Namhoon Lee , Thalaiyasingam Ajanthan , Philip Torr , and Martin Jaggi . 2021 . Understanding the effects of data parallelism and sparsity on neural network training . In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=rsogjAnYs4z Namhoon Lee, Thalaiyasingam Ajanthan, Philip Torr, and Martin Jaggi. 2021. Understanding the effects of data parallelism and sparsity on neural network training. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=rsogjAnYs4z"},{"key":"e_1_2_1_44_1","series-title":"SIAM Journal on Optimization 26 (08","volume-title":"Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints","author":"Lessard Laurent","year":"2014","unstructured":"Laurent Lessard , Benjamin Recht , and Andrew Packard . 2014. Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints . SIAM Journal on Optimization 26 (08 2014 ). https:\/\/doi.org\/10.1137\/15M1009597 Laurent Lessard, Benjamin Recht, and Andrew Packard. 2014. Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints. SIAM Journal on Optimization 26 (08 2014). 
https:\/\/doi.org\/10.1137\/15M1009597"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.5555\/3327345.3327535"},
{"key":"e_1_2_1_46_1","volume-title":"International Convention Centre","author":"Li Qianxiao","unstructured":"Qianxiao Li, Cheng Tai, and Weinan E. 2017. Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms (Proceedings of Machine Learning Research, Vol. 70). PMLR, International Convention Centre, Sydney, Australia, 2101\u20132110. http:\/\/proceedings.mlr.press\/v70\/li17f.html"},
{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295285"},
{"key":"e_1_2_1_48_1","unstructured":"Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. 2018. Asynchronous Decentralized Parallel Stochastic Gradient Descent. In ICML. 3049\u20133058. http:\/\/proceedings.mlr.press\/v80\/lian18a.html"},
{"key":"e_1_2_1_49_1","unstructured":"Chaoyue Liu, Libin Zhu, and Mikhail Belkin. 2020. Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning. arXiv:2003.00307 [cs.LG]"},
{"key":"e_1_2_1_50_1","unstructured":"Yanli Liu, Yuan Gao, and Wotao Yin. 2020. An Improved Analysis of Stochastic Gradient Descent with Momentum. arXiv:2007.07989 [math.OC]"},
{"key":"e_1_2_1_51_1","unstructured":"Ben London. 2017. A PAC-Bayesian Analysis of Randomized Learning with Application to Stochastic Gradient Descent. In NIPS."},
{"key":"e_1_2_1_52_1","volume-title":"SGDR: Stochastic Gradient Descent with Restarts. CoRR abs\/1608.03983(2016). arxiv:1608.03983 http:\/\/arxiv.org\/abs\/1608.03983","author":"Loshchilov Ilya","year":"2016","unstructured":"Ilya Loshchilov and Frank Hutter. 2016. SGDR: Stochastic Gradient Descent with Restarts. CoRR abs\/1608.03983 (2016). arXiv:1608.03983 http:\/\/arxiv.org\/abs\/1608.03983"},
{"key":"e_1_2_1_53_1","volume-title":"International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=S1fUpoR5FQ","author":"Ma Jerry","year":"2019","unstructured":"Jerry Ma and Denis Yarats. 2019. Quasi-hyperbolic momentum and Adam for deep learning. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=S1fUpoR5FQ"},
{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33011069"},
{"key":"e_1_2_1_55_1","volume-title":"International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=rJzIBfZAb","author":"Madry Aleksander","year":"2018","unstructured":"Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=rJzIBfZAb"},
{"key":"e_1_2_1_56_1","unstructured":"Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew B. Blaschko, and Andrea Vedaldi. 2013. Fine-Grained Visual Classification of Aircraft. CoRR abs\/1306.5151 (2013). arXiv:1306.5151 http:\/\/arxiv.org\/abs\/1306.5151"},
{"key":"e_1_2_1_57_1","first-page":"1","article-title":"Stochastic Gradient Descent as Approximate Bayesian Inference","volume":"18","author":"Mandt Stephan","year":"2017","unstructured":"Stephan Mandt, Matthew D. Hoffman, and David M. Blei. 2017. Stochastic Gradient Descent as Approximate Bayesian Inference. J. Mach. Learn. Res. 18, 1 (Jan. 2017), 4873\u20134907.","journal-title":"J. Mach. Learn. Res."},
{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/279943.279989"},
{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/ALLERTON.2016.7852343"},
{"key":"e_1_2_1_60_1","unstructured":"Y. Nesterov. 1983. A method for solving the convex programming problem with convergence rate \\(O(1\/k^{2})\\)."},
{"key":"e_1_2_1_61_1","volume-title":"Proceedings of the 24th International Conference on Neural Information Processing Systems","author":"Niu Feng","year":"2011","unstructured":"Feng Niu, Benjamin Recht, Christopher Re, and Stephen J. Wright. 2011. HOGWILD! A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In Proceedings of the 24th International Conference on Neural Information Processing Systems (Granada, Spain) (NIPS\u201911). Curran Associates Inc., Red Hook, NY, USA, 693\u2013701."},
{"key":"e_1_2_1_62_1","volume-title":"PyTorch: An Imperative Style","author":"Paszke Adam","unstructured":"Adam Paszke, S. Gross, Francisco Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, Alban Desmaison, Andreas K\u00f6pf, E. Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, B. Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS."},
{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1016\/0041-5553(64)90137-5"},
{"key":"e_1_2_1_64_1","unstructured":"Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. ArXiv abs\/1609.04747 (2016)."},
{"key":"e_1_2_1_65_1","unstructured":"Christopher J. Shallue, Jaehoon Lee, Joseph M. Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. 2018. Measuring the Effects of Data Parallelism on Neural Network Training. CoRR abs\/1811.03600 (2018). http:\/\/arxiv.org\/abs\/1811.03600"},
{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/267460.267466"},
{"key":"e_1_2_1_67_1","volume-title":"3rd International Conference on Learning Representations, ICLR","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http:\/\/arxiv.org\/abs\/1409.1556"},
{"key":"e_1_2_1_68_1","volume-title":"Cyclical Learning Rates for Training Neural Networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). 464\u2013472","author":"Smith N.","year":"2017","unstructured":"L. N. Smith. 2017. Cyclical Learning Rates for Training Neural Networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). 464\u2013472. https:\/\/doi.org\/10.1109\/WACV.2017.58"},
{"key":"e_1_2_1_69_1","volume-title":"Smith and Nicholay Topin","author":"N.","year":"2017","unstructured":"Leslie N. Smith and Nicholay Topin. 2017. Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates. CoRR abs\/1708.07120 (2017). arXiv:1708.07120 http:\/\/arxiv.org\/abs\/1708.07120"},
{"key":"e_1_2_1_70_1","unstructured":"Sam Smith and Quoc V. Le. 2018. A Bayesian Perspective on Generalization and Stochastic Gradient Descent. https:\/\/openreview.net\/pdf?id=BJij4yg0Z"},
{"key":"e_1_2_1_71_1","volume-title":"Increase the Batch Size. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=B1Yy1BxCZ","author":"Smith L.","year":"2018","unstructured":"Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. 2018. Don\u2019t Decay the Learning Rate, Increase the Batch Size. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=B1Yy1BxCZ"},
{"key":"e_1_2_1_72_1","volume-title":"Proceedings of the 25th International Conference on Neural Information Processing Systems -","volume":"2","author":"Snoek Jasper","year":"2012","unstructured":"Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian Optimization of Machine Learning Algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2 (Lake Tahoe, Nevada) (NIPS\u201912). Curran Associates Inc., Red Hook, NY, USA, 2951\u20132959."},
{"key":"e_1_2_1_73_1","volume-title":"Proceedings of the 28th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","author":"Sun Jianhui","year":"2022","unstructured":"Jianhui Sun, Mengdi Huai, Kishlay Jha, and Aidong Zhang. 2022. Demystify Hyperparameters for Stochastic Optimization with Transferable Representations. In Proceedings of the 28th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Washington, DC, USA) (KDD \u201922). Association for Computing Machinery, New York, NY, USA. https:\/\/doi.org\/10.1145\/3534678.3539298"},
{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447548.3467287"},
{"key":"e_1_2_1_75_1","volume-title":"Recurrent Imputation for Multivariate Time Series with Missing Values. In 2019 IEEE International Conference on Healthcare Informatics, ICHI 2019","author":"Suo Qiuling","year":"2019","unstructured":"Qiuling Suo, Liuyi Yao, Guangxu Xun, Jianhui Sun, and Aidong Zhang. 2019. Recurrent Imputation for Multivariate Time Series with Missing Values. In 2019 IEEE International Conference on Healthcare Informatics, ICHI 2019, Xi\u2019an, China, June 10-13, 2019. IEEE, 1\u20133. https:\/\/doi.org\/10.1109\/ICHI.2019.8904638"},
{"key":"e_1_2_1_76_1","volume-title":"GLIMA: Global and Local Time Series Imputation with Multi-directional Attention Learning. In IEEE International Conference on Big Data, Big Data 2020","author":"Suo Qiuling","year":"2020","unstructured":"Qiuling Suo, Weida Zhong, Guangxu Xun, Jianhui Sun, Changyou Chen, and Aidong Zhang. 2020. GLIMA: Global and Local Time Series Imputation with Multi-directional Attention Learning. In IEEE International Conference on Big Data, Big Data 2020, Atlanta, GA, USA, December 10-13, 2020. IEEE, 798\u2013807. https:\/\/doi.org\/10.1109\/BigData50022.2020.9378408"},
{"key":"e_1_2_1_77_1","volume-title":"Proceedings of the 30th International Conference on International Conference on Machine Learning -","volume":"28","author":"Sutskever Ilya","year":"2013","unstructured":"Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the Importance of Initialization and Momentum in Deep Learning. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (Atlanta, GA, USA) (ICML\u201913). III\u20131139\u2013III\u20131147."},
{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1109\/LCSYS.2017.2722406"},
{"key":"e_1_2_1_79_1","unstructured":"Sharan Vaswani, F. Bach, and M. Schmidt. 2019. Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron. ArXiv abs\/1810.07288 (2019)."},
{"key":"e_1_2_1_80_1","volume-title":"Advances in Neural Information Processing Systems 30. Curran Associates","author":"Wilson C","unstructured":"Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. 2017. The Marginal Value of Adaptive Gradient Methods in Machine Learning. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 4148\u20134158. http:\/\/papers.nips.cc\/paper\/7003-the-marginal-value-of-adaptive-gradient-methods-in-machine-learning.pdf"},
{"key":"e_1_2_1_81_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning(Proceedings of Machine Learning Research, Vol.\u00a0 139)","author":"Xie Zeke","year":"2021","unstructured":"Zeke Xie, Li Yuan, Zhanxing Zhu, and Masashi Sugiyama. 2021. Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 11448\u201311458."},
{"key":"e_1_2_1_82_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403151"},
{"key":"e_1_2_1_83_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btz142"},
{"key":"e_1_2_1_84_1","doi-asserted-by":"crossref","unstructured":"Yan Yan, Tianbao Yang, Zhe Li, Qihang Lin, and Yi Yang. 2018. A Unified Analysis of Stochastic Momentum Methods for Deep Learning. In IJCAI. 2955\u20132961. https:\/\/doi.org\/10.24963\/ijcai.2018\/410","DOI":"10.24963\/ijcai.2018\/410"},
{"key":"e_1_2_1_85_1","unstructured":"Jian Zhang and Ioannis Mitliagkas. 2018. YellowFin and the Art of Momentum Tuning. arXiv:1706.03471 [stat.ML]"},
{"key":"e_1_2_1_86_1","volume-title":"Staleness-Aware Async-SGD for Distributed Deep Learning(IJCAI\u201916)","author":"Zhang Wei","unstructured":"Wei Zhang, Suyog Gupta, Xiangru Lian, and Ji Liu. 2016. Staleness-Aware Async-SGD for Distributed Deep Learning (IJCAI\u201916). AAAI Press, 2350\u20132356."},
{"key":"e_1_2_1_87_1","volume-title":"Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. In ITCS.","author":"Zhu Z.","year":"2017","unstructured":"Z. Zhu and L. Orecchia. 2017. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. In ITCS."}],
"container-title":["ACM Transactions on Knowledge Discovery from Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3544782","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3544782","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:46:25Z","timestamp":1750178785000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3544782"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,22]]},"references-count":87,"alternative-id":["10.1145\/3544782"],"URL":"https:\/\/doi.org\/10.1145\/3544782","relation":{},"ISSN":["1556-4681","1556-472X"],"issn-type":[{"value":"1556-4681","type":"print"},{"value":"1556-472X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,6,22]]},"assertion":[{"value":"2021-09-06","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-05-13","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-06-22","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}],"article-number":"3544782"}}