{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,6,4]],"date-time":"2024-06-04T23:10:01Z","timestamp":1717542601450},"reference-count":86,"publisher":"Association for Computing Machinery (ACM)","issue":"8","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2022,4]]},"abstract":"Distributed machine learning (ML) can bring more computational resources to bear than single-machine learning, thus enabling reductions in training time. Distributed learning partitions models and data over many machines, allowing model and dataset sizes beyond the available compute power and memory of a single machine. In practice though, distributed ML is challenging when distribution is mandatory, rather than chosen by the practitioner. In such scenarios, data could unavoidably be separated among workers due to limited memory capacity per worker or even because of data privacy issues. There, existing distributed methods will utterly fail due to dominant transfer costs across workers, or do not even apply.<\/jats:p>\n We propose a new approach to distributed fully connected neural network learning, called independent subnet training (IST), to handle these cases. In IST, the original network is decomposed into a set of narrow subnetworks with the same depth. These subnetworks are then trained locally before parameters are exchanged to produce new subnets and the training cycle repeats. Such a naturally \"model parallel\" approach limits memory usage by storing only a portion of network parameters on each device. Additionally, no requirements exist for sharing data between workers (i.e., subnet training is local and independent) and communication volume and frequency are reduced by decomposing the original network into independent subnets. 
These properties of IST can cope with issues due to distributed data, slow interconnects, or limited device memory, making IST a suitable approach for cases of mandatory distribution. We show experimentally that IST results in training times that are much lower than common distributed learning approaches.<\/jats:p>","DOI":"10.14778\/3529337.3529343","type":"journal-article","created":{"date-parts":[[2022,6,22]],"date-time":"2022-06-22T22:23:05Z","timestamp":1655936585000},"page":"1581-1590","source":"Crossref","is-referenced-by-count":11,"title":["Distributed learning of fully connected neural networks using independent subnet training"],"prefix":"10.14778","volume":"15","author":[{"given":"Binhang","family":"Yuan","sequence":"first","affiliation":[{"name":"Rice University"}]},{"given":"Cameron R.","family":"Wolfe","sequence":"additional","affiliation":[{"name":"Rice University"}]},{"given":"Chen","family":"Dun","sequence":"additional","affiliation":[{"name":"Rice University"}]},{"given":"Yuxin","family":"Tang","sequence":"additional","affiliation":[{"name":"Rice University"}]},{"given":"Anastasios","family":"Kyrillidis","sequence":"additional","affiliation":[{"name":"Rice University"}]},{"given":"Chris","family":"Jermaine","sequence":"additional","affiliation":[{"name":"Rice University"}]}],"member":"320","published-online":{"date-parts":[[2022,6,22]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Tensorflow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16). 265--283.","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi , Paul Barham , Jianmin Chen , Zhifeng Chen , Andy Davis , Jeffrey Dean , Matthieu Devin , Sanjay Ghemawat , Geoffrey Irving , Michael Isard , 2016 . Tensorflow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16). 265--283. 
"},{"key":"e_1_2_1_2_1","volume-title":"Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021","author":"Aji Alham Fikri","year":"2017","unstructured":"Alham Fikri Aji and Kenneth Heafield. 2017. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021 (2017)."},{"key":"e_1_2_1_3_1","volume-title":"QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems. 1709--1720.","author":"Alistarh Dan","year":"2017","unstructured":"Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2017. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems. 1709--1720."},{"key":"e_1_2_1_4_1","unstructured":"A. Berahas L. Cao K. Choromanski and K. Scheinberg. 2019. A theoretical and empirical comparison of gradient approximations in derivative-free optimization. 
arXiv preprint arXiv:1905.01332 (2019)."},{"key":"e_1_2_1_5_1","unstructured":"Kush Bhatia Kunal Dahiya Himanshu Jain Yashoteja Prabhu and Manik Varma. [n.d.]. The Extreme Classification Repository: Multi-label Datasets and Code. http:\/\/manikvarma.org\/downloads\/XC\/XMLRepository.html."},{"key":"e_1_2_1_6_1","unstructured":"Tom B Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1137\/0728014"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3369583.3392686"},{"key":"e_1_2_1_9_1","unstructured":"Zheng Chai Hannan Fayyaz Zeshan Fayyaz Ali Anwar Yi Zhou Nathalie Baracaldo Heiko Ludwig and Yue Cheng. 2019. Towards taming the resource and data heterogeneity in federated learning. In 2019 {USENIX} Conference on Operational Machine Learning (OpML 19). 19--21."},{"key":"e_1_2_1_10_1","volume-title":"Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981","author":"Chen Jianmin","year":"2016","unstructured":"
Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. 2016. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981 (2016)."},{"key":"e_1_2_1_11_1","volume-title":"Federated learning of out-of-vocabulary words. arXiv preprint arXiv:1903.10635","author":"Chen Mingqing","year":"2019","unstructured":"Mingqing Chen, Rajiv Mathews, Tom Ouyang, and Fran\u00e7oise Beaufays. 2019. Federated learning of out-of-vocabulary words. arXiv preprint arXiv:1903.10635 (2019)."},{"key":"e_1_2_1_12_1","first-page":"571","article-title":"Project Adam: Building an Efficient and Scalable Deep Learning Training System","volume":"14","author":"Chilimbi Trishul M","year":"2014","unstructured":"Trishul M Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In OSDI, Vol. 14. 571--582.","journal-title":"OSDI"},{"key":"e_1_2_1_13_1","volume-title":"On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems 31","author":"Chizat Lenaic","year":"2018","unstructured":"Lenaic Chizat and Francis Bach. 2018. On the global convergence of gradient descent for over-parameterized models using optimal transport. 
Advances in neural information processing systems 31 (2018)."},{"key":"e_1_2_1_14_1","volume-title":"Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv preprint arXiv:1711.04291","author":"Codreanu Valeriu","year":"2017","unstructured":"Valeriu Codreanu, Damian Podareanu, and Vikram Saletore. 2017. Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv preprint arXiv:1711.04291 (2017)."},{"key":"e_1_2_1_15_1","unstructured":"E Cordis. 2019. Machine learning ledger orchestration for drug discovery."},{"key":"e_1_2_1_16_1","unstructured":"Matthieu Courbariaux Yoshua Bengio and Jean-Pierre David. 2015. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems. 3123--3131."},{"key":"e_1_2_1_17_1","volume-title":"Nolwenn Le Stang, et al","author":"Courtiol Pierre","year":"2019","unstructured":"
Pierre Courtiol, Charles Maussion, Matahi Moarii, Elodie Pronier, Samuel Pilcer, Meriem Sefta, Pierre Manceron, Sylvain Toldo, Mikhail Zaslavskiy, Nolwenn Le Stang, et al. 2019. Deep learning-based classification of mesothelioma improves prediction of patient outcome. Nature medicine 25, 10 (2019), 1519--1525."},{"key":"e_1_2_1_18_1","unstructured":"Walter de Brouwer. 2019. The federated future is ready for shipping."},{"key":"e_1_2_1_19_1","unstructured":"Jeffrey Dean Greg Corrado Rajat Monga Kai Chen Matthieu Devin Mark Mao Andrew Senior Paul Tucker Ke Yang Quoc V Le et al. 2012. Large scale distributed deep networks. In Advances in neural information processing systems. 1223--1231."},{"key":"e_1_2_1_20_1","volume-title":"On the Ineffectiveness of Variance Reduced Optimization for Deep Learning. arXiv preprint arXiv:1812.04529","author":"Defazio Aaron","year":"2018","unstructured":"Aaron Defazio and Leon Bottou. 2018. On the Ineffectiveness of Variance Reduced Optimization for Deep Learning. arXiv preprint arXiv:1812.04529 (2018)."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_2_1_22_1","volume-title":"8-bit approximations for parallelism in deep learning. arXiv preprint arXiv:1511.04561","author":"Dettmers Tim","year":"2015","unstructured":"Tim Dettmers. 2015. 8-bit approximations for parallelism in deep learning. 
arXiv preprint arXiv:1511.04561 (2015)."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/3305381.3305487"},{"key":"e_1_2_1_24_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1137\/S0097539704442684"},{"key":"e_1_2_1_26_1","first-page":"2121","article-title":"Adaptive subgradient methods for online learning and stochastic optimization","author":"Duchi John","year":"2011","unstructured":"John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, Jul (2011), 2121--2159.","journal-title":"Journal of Machine Learning Research 12"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICAPP.1997.651531"},{"key":"e_1_2_1_28_1","volume-title":"On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent. arXiv preprint arXiv:1811.12941","author":"Golmant Noah","year":"2018","unstructured":"
Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W Mahoney, and Joseph Gonzalez. 2018. On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent. arXiv preprint arXiv:1811.12941 (2018)."},{"key":"e_1_2_1_29_1","volume-title":"large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677","author":"Goyal Priya","year":"2017","unstructured":"Priya Goyal, Piotr Doll\u00e1r, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)."},{"key":"e_1_2_1_30_1","volume-title":"International Conference on Machine Learning. 1737--1746","author":"Gupta Suyog","year":"2015","unstructured":"Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited numerical precision. In International Conference on Machine Learning. 1737--1746."},{"key":"e_1_2_1_31_1","volume-title":"Omnivore: An optimizer for multi-device deep learning on CPUs and GPUs. arXiv preprint arXiv:1606.04487","author":"Hadjis Stefan","year":"2016","unstructured":"
Stefan Hadjis, Ce Zhang, Ioannis Mitliagkas, Dan Iter, and Christopher R\u00e9. 2016. Omnivore: An optimizer for multi-device deep learning on CPUs and GPUs. arXiv preprint arXiv:1606.04487 (2016)."},{"key":"e_1_2_1_32_1","volume-title":"Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604","author":"Hard Andrew","year":"2018","unstructured":"Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, Fran\u00e7oise Beaufays, Sean Augenstein, Hubert Eichner, Chlo\u00e9 Kiddon, and Daniel Ramage. 2018. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604 (2018)."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_34_1","volume-title":"Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen.","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. 
arXiv:1811.06965 [cs.CV]"},{"key":"e_1_2_1_35_1","first-page":"1","article-title":"Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations","volume":"18","author":"Hubara Itay","year":"2017","unstructured":"Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2017. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. Journal of Machine Learning Research 18, 187 (2017), 1--30.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_1_36_1","volume-title":"International Conference on Machine Learning. 448--456","author":"Ioffe Sergey","year":"2015","unstructured":"Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning. 448--456."},{"key":"e_1_2_1_37_1","unstructured":"Nikita Ivkin Daniel Rothchild Enayat Ullah Ion Stoica Raman Arora et al. 2019. Communication-efficient distributed sgd with sketching. In Advances in Neural Information Processing Systems. 13144--13154."},{"key":"e_1_2_1_38_1","volume-title":"Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 795--811","author":"Karimi H.","unstructured":"H. Karimi , J. 
Nutini , and M. Schmidt . 2016. Linear convergence of gradient and proximal-gradient methods under the Polyak-\u0141ojasiewicz condition . In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 795--811 ."},{"key":"e_1_2_1_39_1","volume-title":"On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836","author":"Keskar Nitish Shirish","year":"2016","unstructured":"Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 (2016)."},{"key":"e_1_2_1_40_1","unstructured":"A. Khaled and P. Richt\u00e1rik. 2019. Gradient descent with compressed iterates. arXiv preprint arXiv:1909.04716 (2019)."},{"key":"e_1_2_1_41_1","volume-title":"Federated learning for internet of things: Recent advances, taxonomy, and open challenges","author":"Khan Latif U","year":"2021","unstructured":"Latif U Khan, Walid Saad, Zhu Han, Ekram Hossain, and Choong Seon Hong. 2021. 
Federated learning for internet of things: Recent advances, taxonomy, and open challenges. IEEE Communications Surveys & Tutorials (2021)."},{"key":"e_1_2_1_42_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_2_1_43_1","volume-title":"Proceedings, Part V 16","author":"Kolesnikov Alexander","year":"2020","unstructured":"Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. 2020. Big transfer (bit): General visual representation learning. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part V 16. Springer, 491--507."},{"key":"e_1_2_1_44_1","unstructured":"A. Krizhevsky I. Sutskever and G. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 
1097--1105."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8683546"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.5555\/2685048.2685095"},{"key":"e_1_2_1_47_1","volume-title":"Kumar Kshitij Patel, and Martin Jaggi","author":"Lin Tao","year":"2018","unstructured":"Tao Lin , Sebastian U Stich , Kumar Kshitij Patel, and Martin Jaggi . 2018 . Don't use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217 (2018). Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. 2018. Don't use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217 (2018)."},{"key":"e_1_2_1_48_1","volume-title":"Inefficiency of K-FAC for Large Batch Size Training. arXiv preprint arXiv:1903.06237","author":"Ma Linjian","year":"2019","unstructured":"Linjian Ma , Gabe Montague , Jiayu Ye , Zhewei Yao , Amir Gholami , Kurt Keutzer , and Michael W Mahoney . 2019. Inefficiency of K-FAC for Large Batch Size Training. arXiv preprint arXiv:1903.06237 ( 2019 ). Linjian Ma, Gabe Montague, Jiayu Ye, Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael W Mahoney. 2019. Inefficiency of K-FAC for Large Batch Size Training. arXiv preprint arXiv:1903.06237 (2019)."},{"key":"e_1_2_1_49_1","unstructured":"Ryan Mcdonald Mehryar Mohri Nathan Silberman Dan Walker and Gideon S Mann. 2009. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems. 1231--1239. Ryan Mcdonald Mehryar Mohri Nathan Silberman Dan Walker and Gideon S Mann. 2009. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems. 1231--1239."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/2287036.2287045"},{"key":"e_1_2_1_51_1","volume-title":"GPU asynchronous stochastic gradient descent to speed up neural network training. 
arXiv preprint arXiv:1312.6186","author":"Paine Thomas","year":"2013","unstructured":"Thomas Paine, Hailin Jin, Jianchao Yang, Zhe Lin, and Thomas Huang. 2013. GPU asynchronous stochastic gradient descent to speed up neural network training. arXiv preprint arXiv:1312.6186 (2013)."},{"key":"e_1_2_1_52_1","unstructured":"Adam Paszke Sam Gross Soumith Chintala Gregory Chanan Edward Yang Zachary DeVito Zeming Lin Alban Desmaison Luca Antiga and Adam Lerer. 2017. Automatic differentiation in pytorch. (2017)."},{"key":"e_1_2_1_53_1","volume-title":"Privacy should not be a luxury good. The New York Times","author":"Pichai Sundar","year":"2019","unstructured":"Sundar Pichai. 2019. Privacy should not be a luxury good. The New York Times (2019)."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/1553374.1553486"},{"key":"e_1_2_1_55_1","volume-title":"Federated learning for emoji prediction in a mobile keyboard. arXiv preprint arXiv:1906.04329","author":"Ramaswamy Swaroop","year":"2019","unstructured":"Swaroop Ramaswamy, Rajiv Mathews, Kanishka Rao, and Fran\u00e7oise Beaufays. 2019. Federated learning for emoji prediction in a mobile keyboard. 
arXiv preprint arXiv:1906.04329 (2019)."},{"key":"e_1_2_1_56_1","unstructured":"Alexander Ratner Dan Alistarh Gustavo Alonso Peter Bailis Sarah Bird Nicholas Carlini Bryan Catanzaro Eric Chung Bill Dally Jeff Dean et al. 2019. SysML: The New Frontier of Machine Learning Systems. arXiv preprint arXiv:1904.03257 (2019)."},{"key":"e_1_2_1_57_1","volume-title":"Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems. 693--701.","author":"Recht Benjamin","year":"2011","unstructured":"Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems. 693--701."},{"key":"e_1_2_1_58_1","volume-title":"Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972","author":"Ridnik Tal","year":"2021","unstructured":"Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. 2021. Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972 (2021)."},{"key":"e_1_2_1_59_1","doi-asserted-by":"crossref","unstructured":"
Nicola Rieke Jonny Hancox Wenqi Li Fausto Milletari Holger R Roth Shadi Albarqouni Spyridon Bakas Mathieu N Galtier Bennett A Landman Klaus Maier-Hein et al. 2020. The future of digital health with federated learning. NPJ digital medicine 3 1 (2020) 1--7.","DOI":"10.1038\/s41746-020-00323-1"},{"key":"e_1_2_1_60_1","volume-title":"An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747","author":"Ruder Sebastian","year":"2016","unstructured":"Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016)."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2014-274"},{"key":"e_1_2_1_62_1","volume-title":"Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799","author":"Sergeev Alexander","year":"2018","unstructured":"Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018)."},{"key":"e_1_2_1_63_1","unstructured":"K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)."},{"key":"e_1_2_1_64_1","volume-title":"Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489","author":"Smith Samuel L","year":"2017","unstructured":"Samuel L Smith , Pieter-Jan Kindermans , Chris Ying , and Quoc V Le. 2017. Don't decay the learning rate, increase the batch size. 
arXiv preprint arXiv:1711.00489 ( 2017 ). Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. 2017. Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489 (2017)."},{"key":"e_1_2_1_65_1","volume-title":"Deep neural network models for computational histopathology: A survey. Medical Image Analysis","author":"Srinidhi Chetan L","year":"2020","unstructured":"Chetan L Srinidhi , Ozan Ciga , and Anne L Martel . 2020. Deep neural network models for computational histopathology: A survey. Medical Image Analysis ( 2020 ), 101813. Chetan L Srinidhi, Ozan Ciga, and Anne L Martel. 2020. Deep neural network models for computational histopathology: A survey. Medical Image Analysis (2020), 101813."},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.5555\/2627435.2670313"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1121\/1.1915893"},{"key":"e_1_2_1_68_1","volume-title":"Sai Praneeth Karimireddy, and Martin Jaggi","author":"Vogels Thijs","year":"2019","unstructured":"Thijs Vogels , Sai Praneeth Karimireddy, and Martin Jaggi . 2019 . PowerSGD: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems . 14236--14245. Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. 2019. PowerSGD: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems. 14236--14245."},{"key":"e_1_2_1_69_1","volume-title":"Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209","author":"Warden Pete","year":"2018","unstructured":"Pete Warden . 2018. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209 ( 2018 ). Pete Warden. 2018. Speech commands: A dataset for limited-vocabulary speech recognition. 
arXiv preprint arXiv:1804.03209 (2018)."},{"key":"e_1_2_1_70_1","volume-title":"Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems. 1509--1519.","author":"Wen Wei","year":"2017","unstructured":"Wei Wen , Cong Xu , Feng Yan , Chunpeng Wu , Yandan Wang , Yiran Chen , and Hai Li . 2017 . Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems. 1509--1519. Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2017. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems. 1509--1519."},{"key":"e_1_2_1_71_1","first-page":"67","article-title":"Numerical optimization","volume":"35","author":"Wright Stephen","year":"1999","unstructured":"Stephen Wright and Jorge Nocedal . 1999 . Numerical optimization . Springer Science 35 , 67 -- 68 (1999), 7. Stephen Wright and Jorge Nocedal. 1999. Numerical optimization. Springer Science 35, 67--68 (1999), 7.","journal-title":"Springer Science"},{"key":"e_1_2_1_72_1","unstructured":"Alfred Xu. 2018. NCCL BASED MULTI-GPU TRAINING. http:\/\/on-demand.gputechconf.com\/gtc-cn\/2018\/pdf\/CH8209.pdf. (2018). Accessed: 2020-02-06. Alfred Xu. 2018. NCCL BASED MULTI-GPU TRAINING. http:\/\/on-demand.gputechconf.com\/gtc-cn\/2018\/pdf\/CH8209.pdf. (2018). Accessed: 2020-02-06."},{"key":"e_1_2_1_73_1","volume-title":"Multi-GPU training of convnets. arXiv preprint arXiv:1312.5853","author":"Yadan Omry","year":"2013","unstructured":"Omry Yadan , Keith Adams , Yaniv Taigman , and Marc'Aurelio Ranzato . 2013. Multi-GPU training of convnets. arXiv preprint arXiv:1312.5853 ( 2013 ). Omry Yadan, Keith Adams, Yaniv Taigman, and Marc'Aurelio Ranzato. 2013. Multi-GPU training of convnets. 
arXiv preprint arXiv:1312.5853 (2013)."},{"key":"e_1_2_1_74_1","volume-title":"Applied federated learning: Improving google keyboard query suggestions. arXiv preprint arXiv:1812.02903","author":"Yang Timothy","year":"2018","unstructured":"Timothy Yang , Galen Andrew , Hubert Eichner , Haicheng Sun , Wei Li , Nicholas Kong , Daniel Ramage , and Fran\u00e7oise Beaufays . 2018. Applied federated learning: Improving google keyboard query suggestions. arXiv preprint arXiv:1812.02903 ( 2018 ). Timothy Yang, Galen Andrew, Hubert Eichner, Haicheng Sun, Wei Li, Nicholas Kong, Daniel Ramage, and Fran\u00e7oise Beaufays. 2018. Applied federated learning: Improving google keyboard query suggestions. arXiv preprint arXiv:1812.02903 (2018)."},{"key":"e_1_2_1_75_1","unstructured":"Zhewei Yao Amir Gholami Qi Lei Kurt Keutzer and Michael W Mahoney. 2018. Hessian-based analysis of large batch training and robustness to adversaries. In Advances in Neural Information Processing Systems. 4949--4959. Zhewei Yao Amir Gholami Qi Lei Kurt Keutzer and Michael W Mahoney. 2018. Hessian-based analysis of large batch training and robustness to adversaries. In Advances in Neural Information Processing Systems. 4949--4959."},{"key":"e_1_2_1_76_1","volume-title":"Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888","author":"You Yang","year":"2017","unstructured":"Yang You , Igor Gitman , and Boris Ginsburg . 2017. Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888 ( 2017 ). Yang You, Igor Gitman, and Boris Ginsburg. 2017. Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888 (2017)."},{"key":"e_1_2_1_77_1","volume-title":"Large-Batch Training for LSTM and Beyond. arXiv preprint arXiv:1901.08256","author":"You Yang","year":"2019","unstructured":"Yang You , Jonathan Hseu , Chris Ying , James Demmel , Kurt Keutzer , and Cho-Jui Hsieh . 2019. Large-Batch Training for LSTM and Beyond. 
arXiv preprint arXiv:1901.08256 ( 2019 ). Yang You, Jonathan Hseu, Chris Ying, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2019. Large-Batch Training for LSTM and Beyond. arXiv preprint arXiv:1901.08256 (2019)."},{"key":"e_1_2_1_78_1","volume-title":"Reducing BERT Pre-Training Time from 3 Days to 76 Minutes. arXiv preprint arXiv:1904.00962","author":"You Yang","year":"2019","unstructured":"Yang You , Jing Li , Jonathan Hseu , Xiaodan Song , James Demmel , and Cho-Jui Hsieh . 2019. Reducing BERT Pre-Training Time from 3 Days to 76 Minutes. arXiv preprint arXiv:1904.00962 ( 2019 ). Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. 2019. Reducing BERT Pre-Training Time from 3 Days to 76 Minutes. arXiv preprint arXiv:1904.00962 (2019)."},{"key":"e_1_2_1_79_1","volume-title":"Distributed learning of deep neural networks using independent subnet training. arXiv preprint arXiv:1910.02120","author":"Yuan Binhang","year":"2019","unstructured":"Binhang Yuan , Anastasios Kyrillidis , and Christopher M Jermaine . 2019. Distributed learning of deep neural networks using independent subnet training. arXiv preprint arXiv:1910.02120 ( 2019 ). Binhang Yuan, Anastasios Kyrillidis, and Christopher M Jermaine. 2019. Distributed learning of deep neural networks using independent subnet training. arXiv preprint arXiv:1910.02120 (2019)."},{"key":"e_1_2_1_80_1","volume-title":"ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701","author":"Zeiler Matthew D","year":"2012","unstructured":"Matthew D Zeiler . 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 ( 2012 ). Matthew D Zeiler. 2012. ADADELTA: an adaptive learning rate method. 
arXiv preprint arXiv:1212.5701 (2012)."},{"key":"e_1_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732977.2733001"},{"key":"e_1_2_1_82_1","volume-title":"Ioannis Mitliagkas, and Christopher R\u00e9.","author":"Zhang Jian","year":"2016","unstructured":"Jian Zhang , Christopher De Sa , Ioannis Mitliagkas, and Christopher R\u00e9. 2016 . Parallel SGD : When does averaging help? arXiv preprint arXiv:1606.07365 (2016). Jian Zhang, Christopher De Sa, Ioannis Mitliagkas, and Christopher R\u00e9. 2016. Parallel SGD: When does averaging help? arXiv preprint arXiv:1606.07365 (2016)."},{"key":"e_1_2_1_83_1","volume-title":"Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 6660--6663","author":"Zhang Shanshan","year":"2013","unstructured":"Shanshan Zhang , Ce Zhang , Zhao You , Rong Zheng , and Bo Xu . 2013 . Asynchronous stochastic gradient descent for DNN training. In Acoustics , Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 6660--6663 . Shanshan Zhang, Ce Zhang, Zhao You, Rong Zheng, and Bo Xu. 2013. Asynchronous stochastic gradient descent for DNN training. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 6660--6663."},{"key":"e_1_2_1_84_1","unstructured":"Xiru Zhang Michael Mckenna Jill P Mesirov and David L Waltz. 1990. An efficient implementation of the back-propagation algorithm on the connection machine CM-2. In Advances in neural information processing systems. 801--809. Xiru Zhang Michael Mckenna Jill P Mesirov and David L Waltz. 1990. An efficient implementation of the back-propagation algorithm on the connection machine CM-2. In Advances in neural information processing systems. 801--809."},{"key":"e_1_2_1_85_1","doi-asserted-by":"publisher","DOI":"10.3390\/cancers12030603"},{"key":"e_1_2_1_86_1","unstructured":"Martin Zinkevich Markus Weimer Lihong Li and Alex J Smola. 2010. Parallelized stochastic gradient descent. 
In Advances in neural information processing systems. 2595--2603. Martin Zinkevich Markus Weimer Lihong Li and Alex J Smola. 2010. Parallelized stochastic gradient descent. In Advances in neural information processing systems. 2595--2603."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3529337.3529343","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T09:45:13Z","timestamp":1672220713000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3529337.3529343"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,4]]},"references-count":86,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2022,4]]}},"alternative-id":["10.14778\/3529337.3529343"],"URL":"http:\/\/dx.doi.org\/10.14778\/3529337.3529343","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2022,4]]}}}