{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,23]],"date-time":"2025-12-23T10:03:05Z","timestamp":1766484185305,"version":"3.41.0"},"reference-count":48,"publisher":"Association for Computing Machinery (ACM)","issue":"5","license":[{"start":{"date-parts":[[2021,9,23]],"date-time":"2021-09-23T00:00:00Z","timestamp":1632355200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Intell. Syst. Technol."],"published-print":{"date-parts":[[2021,10,31]]},"abstract":"<jats:p>In this article, we present a distributed variant of an adaptive stochastic gradient method for training deep neural networks in the parameter-server model. To reduce the communication cost among the workers and server, we incorporate two types of quantization schemes, i.e., gradient quantization and weight quantization, into the proposed distributed Adam. In addition, to reduce the bias introduced by quantization operations, we propose an error-feedback technique to compensate for the quantized gradient. Theoretically, in the stochastic nonconvex setting, we show that the distributed adaptive gradient method with gradient quantization and error feedback converges to the first-order stationary point, and that the distributed adaptive gradient method with weight quantization and error feedback converges to the point related to the quantized level under both the single-worker and multi-worker modes. Last, we apply the proposed distributed adaptive gradient methods to train deep neural networks. Experimental results demonstrate the efficacy of our methods.<\/jats:p>","DOI":"10.1145\/3470890","type":"journal-article","created":{"date-parts":[[2021,9,24]],"date-time":"2021-09-24T14:48:38Z","timestamp":1632494918000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":18,"title":["Quantized Adam with Error Feedback"],"prefix":"10.1145","volume":"12","author":[{"given":"Congliang","family":"Chen","sequence":"first","affiliation":[{"name":"The Chinese University of Hong Kong, Shenzhen, Guangdong, China"}]},{"given":"Li","family":"Shen","sequence":"additional","affiliation":[{"name":"JD Explore Academy, Beijing, China"}]},{"given":"Haozhi","family":"Huang","sequence":"additional","affiliation":[{"name":"Tencent AI Lab, Guangdong, China"}]},{"given":"Wei","family":"Liu","sequence":"additional","affiliation":[{"name":"Tencent, Guangdong, China"}]}],"member":"320","published-online":{"date-parts":[[2021,9,23]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.5555\/3045390.3045410"},{"key":"e_1_2_1_2_1","unstructured":"Amitabh Basu Soham De Anirbit Mukherjee and Enayat Ullah. 2018. Convergence guarantees for RMSProp and Adam in non-convex optimization and an empirical comparison to Nesterov acceleration. arXiv:1807.06766.  Amitabh Basu Soham De Anirbit Mukherjee and Enayat Ullah. 2018. Convergence guarantees for RMSProp and Adam in non-convex optimization and an empirical comparison to Nesterov acceleration. arXiv:1807.06766."},{"key":"e_1_2_1_3_1","unstructured":"Xiangyi Chen Sijia Liu Ruoyu Sun and Mingyi Hong. 2018. On the convergence of a class of Adam-type algorithms for non-convex optimization. arXiv:1808.02941.  Xiangyi Chen Sijia Liu Ruoyu Sun and Mingyi Hong. 2018. 
On the convergence of a class of Adam-type algorithms for non-convex optimization. arXiv:1808.02941."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.5555\/2999134.2999271"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_2_1_6_1","volume-title":"Bert: Pre-training of deep bidirectionaltransformers for language understanding. arXiv:1810.04805.","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . Bert: Pre-training of deep bidirectionaltransformers for language understanding. arXiv:1810.04805. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectionaltransformers for language understanding. arXiv:1810.04805."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.5555\/1953048.2021068"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.5555\/3086952"},{"key":"e_1_2_1_9_1","volume-title":"Dally","author":"Han Song","year":"2015","unstructured":"Song Han , Huizi Mao , and William J . Dally . 2015 . Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv:1510.00149. Song Han, Huizi Mao, and William J. Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv:1510.00149."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_11_1","first-page":"249","article-title":"Neural networks for machine learning Lecture 6A: Overview of mini-batch gradient descent","volume":"14","author":"Hinton Geoffrey","year":"2012","unstructured":"Geoffrey Hinton , Nitish Srivastava , and Kevin Swersky . 2012 . Neural networks for machine learning Lecture 6A: Overview of mini-batch gradient descent . Cited 14 , 8 (2012), 249 . Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. 2012. Neural networks for machine learning Lecture 6A: Overview of mini-batch gradient descent. Cited 14, 8 (2012), 249.","journal-title":"Cited"},{"volume-title":"International Conference on Learning Representations.","author":"Hou Lu","key":"e_1_2_1_12_1","unstructured":"Lu Hou , Ruiliang Zhang , and James T. Kwok . 2018. Analysis of quantized models . In International Conference on Learning Representations. Lu Hou, Ruiliang Zhang, and James T. Kwok. 2018. Analysis of quantized models. In International Conference on Learning Representations."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.5555\/3327144.3327178"},{"key":"e_1_2_1_14_1","volume-title":"Proceedings of the International Conference on Machine Learning. 3252\u20133261","author":"Karimireddy Sai Praneeth","year":"2019","unstructured":"Sai Praneeth Karimireddy , Quentin Rebjock , Sebastian Stich , and Martin Jaggi . 2019 . Error feedback fixes signSGD and other gradient compression schemes . In Proceedings of the International Conference on Machine Learning. 3252\u20133261 . Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. 2019. Error feedback fixes signSGD and other gradient compression schemes. In Proceedings of the International Conference on Machine Learning. 3252\u20133261."},{"key":"e_1_2_1_15_1","unstructured":"Ahmed Khaled and Peter Richt\u00e1rik. 2019. Gradient descent with compressed iterates. arXiv:1909.04716.  Ahmed Khaled and Peter Richt\u00e1rik. 2019. Gradient descent with compressed iterates. 
arXiv:1909.04716."},{"key":"e_1_2_1_16_1","volume-title":"Kingma and Jimmy Ba","author":"Diederik","year":"2014","unstructured":"Diederik P. Kingma and Jimmy Ba . 2014 . Adam : A method for stochastic optimization. arXiv:1412.6980. Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980."},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the International Conference on Learning Representations.","author":"Koloskova Anastasia","year":"2019","unstructured":"Anastasia Koloskova , Tao Lin , Sebastian U. Stich , and Martin Jaggi . 2019 . Decentralized deep learning with arbitrary communication compression . In Proceedings of the International Conference on Learning Representations. Anastasia Koloskova, Tao Lin, Sebastian U. Stich, and Martin Jaggi. 2019. Decentralized deep learning with arbitrary communication compression. In Proceedings of the International Conference on Learning Representations."},{"volume-title":"Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR\u201913)","author":"Kraska Tim","key":"e_1_2_1_18_1","unstructured":"Tim Kraska , Ameet S. Talwalkar , John C. Duchi , R. Griffith , M. Franklin , and Michael I. Jordan . 2013. MLbase: A distributed machine-learning system . In Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR\u201913) . Tim Kraska, Ameet S. Talwalkar, John C. Duchi, R. Griffith, M. Franklin, and Michael I. Jordan. 2013. MLbase: A distributed machine-learning system. In Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR\u201913)."},{"volume-title":"Learning Multiple Layers of Features from Tiny Images. Technical report","author":"Krizhevsky Alex","key":"e_1_2_1_19_1","unstructured":"Alex Krizhevsky . 2009. Learning Multiple Layers of Features from Tiny Images. Technical report . University of Toronto . Alex Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images. Technical report. University of Toronto."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.5555\/2999134.2999257"},{"key":"e_1_2_1_21_1","volume-title":"Deep learning. Nature 521, 7553","author":"LeCun Yann","year":"2015","unstructured":"Yann LeCun , Yoshua Bengio , and Geoffrey Hinton . 2015. Deep learning. Nature 521, 7553 ( 2015 ), 436\u2013444. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436\u2013444."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/2685048.2685095"},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics. 983\u2013992","author":"Li Xiaoyu","year":"2019","unstructured":"Xiaoyu Li and Francesco Orabona . 2019 . On the convergence of stochastic gradient descent with adaptive step sizes . In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics. 983\u2013992 . Xiaoyu Li and Francesco Orabona. 2019. On the convergence of stochastic gradient descent with adaptive step sizes. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics. 983\u2013992."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295285"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3041021.3051099"},{"key":"e_1_2_1_27_1","doi-asserted-by":"crossref","unstructured":"Peter Kairouz H. 
Brendan McMahan Brendan Avent Aurelien Bellet Mehdi Bennis Arjun Nitin Bhagoji Kallista Bonawitz et\u00a0al. Advances and open problems in federated learning. Foundations and Trends\u00ae in Machine Learning 14 1-2(2021) 1\u2013210.  Peter Kairouz H. Brendan McMahan Brendan Avent Aurelien Bellet Mehdi Bennis Arjun Nitin Bhagoji Kallista Bonawitz et\u00a0al. Advances and open problems in federated learning. Foundations and Trends\u00ae in Machine Learning 14 1-2(2021) 1\u2013210.","DOI":"10.1561\/2200000083"},{"key":"e_1_2_1_28_1","unstructured":"H. Brendan McMahan and Matthew Streeter. 2010. Adaptive bound optimization for online convex optimization. arXiv:1002.4908.  H. Brendan McMahan and Matthew Streeter. 2010. Adaptive bound optimization for online convex optimization. arXiv:1002.4908."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1038\/nature14236"},{"key":"e_1_2_1_30_1","volume-title":"Davoud Ataee Tarzanagh, and George Michailidis","author":"Nazari Parvin","year":"2019","unstructured":"Parvin Nazari , Davoud Ataee Tarzanagh, and George Michailidis . 2019 . Dadam : A consensus-based distributed adaptive gradient method for online optimization. arXiv:1901.09109. Parvin Nazari, Davoud Ataee Tarzanagh, and George Michailidis. 2019. Dadam: A consensus-based distributed adaptive gradient method for online optimization. arXiv:1901.09109."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46493-0_32"},{"key":"e_1_2_1_32_1","unstructured":"Sashank Reddi Zachary Charles Manzil Zaheer Zachary Garrett Keith Rush Jakub Kone\u010dn\u1ef3 Sanjiv Kumar and H. Brendan McMahan. 2020. Adaptive federated optimization. arXiv:2003.00295.  Sashank Reddi Zachary Charles Manzil Zaheer Zachary Garrett Keith Rush Jakub Kone\u010dn\u1ef3 Sanjiv Kumar and H. Brendan McMahan. 2020. Adaptive federated optimization. arXiv:2003.00295."},{"key":"e_1_2_1_33_1","unstructured":"Sashank J. Reddi Satyen Kale and Sanjiv Kumar. 2019. On the convergence of Adam and beyond. arXiv:1904.09237.  Sashank J. Reddi Satyen Kale and Sanjiv Kumar. 2019. On the convergence of Adam and beyond. arXiv:1904.09237."},{"key":"e_1_2_1_34_1","volume-title":"Julian Schrittwieser, et\u00a0al.","author":"Silver David","year":"2016","unstructured":"David Silver , Aja Huang , Chris J. Maddison , Arthur Guez , Laurent Sifre , George Van Den Driessche , Julian Schrittwieser, et\u00a0al. 2016 . Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484. David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, et\u00a0al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484."},{"key":"e_1_2_1_35_1","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.  Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920931"},{"key":"e_1_2_1_37_1","unstructured":"H. Tang X. Lian S. Qiu L. Yuan C. Zhang T. Zhang and J. Liu. 2019. DeepSqueeze: Decentralized meets error-compensated compression. arXiv:1907.07346.  H. Tang X. Lian S. Qiu L. Yuan C. Zhang T. Zhang and J. Liu. 2019. DeepSqueeze: Decentralized meets error-compensated compression. 
arXiv:1907.07346."},{"key":"e_1_2_1_38_1","volume-title":"Proceedings of the International Conference on Machine Learning. 6677\u20136686","author":"Ward Rachel","year":"2019","unstructured":"Rachel Ward , Xiaoxia Wu , and Leon Bottou . 2019 . AdaGrad stepsizes: Sharp convergence over nonconvex landscapes . In Proceedings of the International Conference on Machine Learning. 6677\u20136686 . Rachel Ward, Xiaoxia Wu, and Leon Bottou. 2019. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. In Proceedings of the International Conference on Machine Learning. 6677\u20136686."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.5555\/3294771.3294915"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2956775"},{"key":"e_1_2_1_41_1","unstructured":"Shuang Wu Guoqi Li Feng Chen and Luping Shi. 2018. Training and inference with integers in deep neural networks. arXiv:1802.04680.  Shuang Wu Guoqi Li Feng Chen and Luping Shi. 2018. Training and inference with integers in deep neural networks. arXiv:1802.04680."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/TBDATA.2015.2472014"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3298981"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.5555\/3454287.3455314"},{"key":"e_1_2_1_45_1","unstructured":"Shuchang Zhou Zekun Ni Xinyu Zhou He Wen Yuxin Wu and Yuheng Zou. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160.  Shuchang Zhou Zekun Ni Xinyu Zhou He Wen Yuxin Wu and Yuheng Zou. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160."},{"key":"e_1_2_1_46_1","volume-title":"Proceedings of the International Conference on Learning Representations.","author":"Zhou Zhiming","year":"2018","unstructured":"Zhiming Zhou , QingruZhang, Guansong Lu , Hongwei Wang , Weinan Zhang , and Yong Yu . 2018 . AdaShift: Decorrelation and convergence of adaptive learning rate methods . In Proceedings of the International Conference on Learning Representations. Zhiming Zhou, QingruZhang, Guansong Lu, Hongwei Wang, Weinan Zhang, and Yong Yu. 2018. AdaShift: Decorrelation and convergence of adaptive learning rate methods. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_2_1_47_1","unstructured":"Fangyu Zou Li Shen Zequn Jie Ju Sun and Wei Liu. 2018. Weighted AdaGrad with unified momentum. arXiv:1808.03408.  Fangyu Zou Li Shen Zequn Jie Ju Sun and Wei Liu. 2018. Weighted AdaGrad with unified momentum. 
arXiv:1808.03408."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01138"}],"container-title":["ACM Transactions on Intelligent Systems and Technology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3470890","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3470890","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:18:55Z","timestamp":1750191535000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3470890"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,9,23]]},"references-count":48,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2021,10,31]]}},"alternative-id":["10.1145\/3470890"],"URL":"https:\/\/doi.org\/10.1145\/3470890","relation":{},"ISSN":["2157-6904","2157-6912"],"issn-type":[{"type":"print","value":"2157-6904"},{"type":"electronic","value":"2157-6912"}],"subject":[],"published":{"date-parts":[[2021,9,23]]},"assertion":[{"value":"2021-01-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-06-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-09-23","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
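
Note: the abstract above describes gradient quantization with error feedback in the parameter-server model. As a rough, non-authoritative illustration of that idea (not the paper's actual algorithm, whose quantizers, weight-quantization variant, and convergence guarantees are specified in the article itself), here is a minimal NumPy sketch of one communication round: each worker adds its carried-over quantization residual to the fresh gradient, sends a stochastically quantized message, stores the new residual, and the server applies a standard Adam update (hyperparameters per Kingma & Ba, 2014) to the averaged messages. The names quantize, ErrorFeedbackWorker, and AdamServer, and the uniform stochastic quantizer, are illustrative assumptions.

import numpy as np

def quantize(v, levels=256):
    # Uniform stochastic quantization to a fixed number of levels.
    # (Illustrative; the paper's exact quantizer may differ.)
    scale = np.max(np.abs(v)) + 1e-12
    normalized = v / scale * (levels // 2)
    floor = np.floor(normalized)
    # Round up with probability equal to the fractional part, so the
    # quantizer is unbiased in expectation.
    q = floor + (np.random.rand(*v.shape) < (normalized - floor))
    return q * scale / (levels // 2)

class ErrorFeedbackWorker:
    # One worker: sends quantized gradients and carries the quantization
    # error forward into the next round (error feedback).
    def __init__(self, dim):
        self.error = np.zeros(dim)  # residual from previous rounds

    def compress_gradient(self, grad):
        corrected = grad + self.error        # add back stored residual
        compressed = quantize(corrected)     # low-bit message to server
        self.error = corrected - compressed  # keep the new residual
        return compressed

class AdamServer:
    # Parameter server running an Adam-style update on the averaged
    # quantized gradients received from all workers.
    def __init__(self, dim, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.w = np.zeros(dim)
        self.m = np.zeros(dim)
        self.v = np.zeros(dim)
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.t = 0

    def step(self, grads):
        g = np.mean(grads, axis=0)  # average the workers' messages
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * g
        self.v = self.beta2 * self.v + (1 - self.beta2) * g * g
        m_hat = self.m / (1 - self.beta1 ** self.t)   # bias correction
        v_hat = self.v / (1 - self.beta2 ** self.t)
        self.w -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        return self.w

# Toy usage: two workers sending quantized gradients for a 4-dim model.
workers = [ErrorFeedbackWorker(4) for _ in range(2)]
server = AdamServer(4)
grads = [np.array([0.1, -0.2, 0.3, 0.05]), np.array([0.12, -0.18, 0.25, 0.0])]
w = server.step([wk.compress_gradient(g) for wk, g in zip(workers, grads)])

Because the residual is re-added before each quantization, the compression error does not accumulate as systematic bias, which is the intuition behind the error-feedback convergence results the abstract summarizes.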