{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,2]],"date-time":"2025-11-02T06:57:10Z","timestamp":1762066630385,"version":"build-2065373602"},"reference-count":47,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2022,9,12]],"date-time":"2022-09-12T00:00:00Z","timestamp":1662940800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100004359","name":"Swedish Research Council (VR)","doi-asserted-by":"publisher","award":["2021-04772"],"award-info":[{"award-number":["2021-04772"]}],"id":[{"id":"10.13039\/501100004359","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>This article aims to give a comprehensive and rigorous review of the principles and recent developments of coding for large-scale distributed machine learning (DML). With increasing data volumes and the pervasive deployment of sensors and computing machines, machine learning has become more distributed. Moreover, the computing nodes and data volumes involved in learning tasks have also increased significantly. For large-scale distributed learning systems, significant challenges have appeared in terms of delay, errors, efficiency, etc. To address these problems, various error-control or performance-boosting schemes have recently been proposed for different aspects, such as the duplication of computing nodes. More recently, error-control coding has been investigated for DML to improve reliability and efficiency. The benefits of coding for DML include high efficiency, low complexity, etc. Despite the benefits and recent progress, however, there is still a lack of a comprehensive survey on this topic, especially for large-scale learning. This paper seeks to introduce the theories and algorithms of coding for DML. For primal-based DML schemes, we first discuss gradient coding with the optimal code distance. Then, we introduce random coding for gradient-based DML. For primal\u2013dual-based DML, i.e., ADMM (alternating direction method of multipliers), we propose a separate coding method for the two steps of distributed optimization. Coding schemes for the different steps are then discussed. Finally, a few potential directions for future work are also given.<\/jats:p>","DOI":"10.3390\/e24091284","type":"journal-article","created":{"date-parts":[[2022,9,12]],"date-time":"2022-09-12T22:53:46Z","timestamp":1663023226000},"page":"1284","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Coding for Large-Scale Distributed Machine Learning"],"prefix":"10.3390","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5407-0835","authenticated-orcid":false,"given":"Ming","family":"Xiao","sequence":"first","affiliation":[{"name":"Division of Information Science and Engineering, Royal Institute of Technology, Malvinas Vag 10, KTH, 100-44 Stockholm, Sweden"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7926-5081","authenticated-orcid":false,"given":"Mikael","family":"Skoglund","sequence":"additional","affiliation":[{"name":"Division of Information Science and Engineering, Royal Institute of Technology, Malvinas Vag 10, KTH, 100-44 Stockholm, Sweden"}]}],"member":"1968","published-online":{"date-parts":[[2022,9,12]]},"reference":[{"key":"ref_1","first-page":"2574","article-title":"A Survey on Large-scale Machine Learning","volume":"34","author":"Wang","year":"2021","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_2","unstructured":"Dean, J., and Ghemawat, S. (2004, January 10\u201312). MapReduce: Simplified data processing on large clusters. Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation, Santa Clara, CA, USA."},{"key":"ref_3","unstructured":"Li, M., Andersen, D., Park, J., Smola, A.J., Ahmed, A., Josifovski, V., Long, J., Shekita, E.J., and Su, B.-Y. (2014, January 6\u20138). Scaling Distributed Machine Learning with the Parameter Server. Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), Broomfield, CO, USA. Available online: http:\/\/www.usenix.org\/events\/osdi04\/tech\/dean.html."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1514","DOI":"10.1109\/TIT.2017.2736066","article-title":"Speeding up distributed machine learning using codes","volume":"64","author":"Lee","year":"2018","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_5","unstructured":"Konecny, J., McMahan, H., Ramage, D., and Richtarik, P. (2016). Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv."},{"key":"ref_6","unstructured":"McMahan, B., and Ramage, D. (2022, September 07). Federated Learning: Collaborative Machine Learning without Centralized Training Data. Available online: https:\/\/ai.googleblog.com\/2017\/04\/federated-learning-collaborative.html."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"34","DOI":"10.1109\/MCOM.2017.1600894","article-title":"Coding for distributed fog computing","volume":"55","author":"Li","year":"2017","journal-title":"IEEE Commun. Mag."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Park, H., Lee, K., Sohn, J., Suh, C., and Moon, J. (2018, January 17\u201322). Hierarchical coding for distributed computing. Proceedings of the IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA.","DOI":"10.1109\/ISIT.2018.8437669"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Kiani, S., Ferdinand, N., and Draper, S. (2018, January 17\u201322). Exploitation of stragglers in coded computation. Proceedings of the IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA.","DOI":"10.1109\/ISIT.2018.8437871"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Yu, Q., Maddah-Ali, M.A., and Avestimehr, A.S. (2018). Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding. arXiv.","DOI":"10.1109\/ISIT.2018.8437563"},{"key":"ref_11","unstructured":"Yu, Q., Li, S., Raviv, N., Kalan, S., Soltanolkotabi, M., and Avestimehr, A. (2018). Lagrange Coded Computing: Optimal Design for Resiliency, Security and Privacy. arXiv."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"4683","DOI":"10.1109\/TII.2018.2857203","article-title":"Distributed Fog Computing Based on Batched Sparse Codes for Industrial Control","volume":"14","author":"Yue","year":"2018","journal-title":"IEEE Trans. Ind. Inform."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"362","DOI":"10.1109\/LCOMM.2019.2930513","article-title":"Coded Decentralized Learning with Gradient Descent for Big Data Analytics","volume":"24","author":"Yue","year":"2020","journal-title":"IEEE Commun. Lett."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"1337","DOI":"10.1109\/TMC.2019.2963668","article-title":"Coding for Distributed Fog Computing in Internet of Mobile Things","volume":"20","author":"Yue","year":"2021","journal-title":"IEEE Trans. Mob. Comput."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"5360","DOI":"10.1109\/JIOT.2021.3058116","article-title":"Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing","volume":"8","author":"Chen","year":"2021","journal-title":"IEEE Internet Things J."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1561\/2200000016","article-title":"Distributed optimization and statistical learning via the alternating direction method of multipliers","volume":"3","author":"Boyd","year":"2011","journal-title":"Found. Trends Mach. Learn."},{"key":"ref_17","unstructured":"Pereira, F., Burges, C.J.C., Bottou, L., and Weinberger, K.Q. (2012). Large Scale Distributed Deep Networks. Advances in Neural Information Processing Systems 25, Curran Associates, Inc."},{"key":"ref_18","unstructured":"Grubic, D., Tam, L., Alistarh, D., and Zhang, C. (2018, January 26\u201329). Synchronous Multi-GPU Training for Deep Learning with Low-Precision Communications: An Empirical Study. Proceedings of the 21st International Conference on Extending Database Technology, Vienna, Austria."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"74","DOI":"10.1145\/2408776.2408794","article-title":"The tail at scale","volume":"56","author":"Dean","year":"2013","journal-title":"Commun. ACM"},{"key":"ref_20","unstructured":"Ananthanarayanan, G., Ghodsi, A., Shenker, S., and Stoica, I. (2013, January 2\u20135). Effective straggler mitigation: Attack of the clones. Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI \u201913), Lombard, IL, USA."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1145\/2847220.2847223","article-title":"Using straggler replication to reduce latency in large-scale parallel computing","volume":"3","author":"Wang","year":"2015","journal-title":"ACM Sigmetrics Perform. Eval. Rev."},{"key":"ref_22","unstructured":"Yadwadkar, N.J., and Choi, W. (2012). Proactive straggler avoidance using machine learning. White Paper, University of Berkeley."},{"key":"ref_23","first-page":"2619","article-title":"Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning","volume":"20","author":"Karakus","year":"2019","journal-title":"J. Mach. Learn. Res."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"109","DOI":"10.1109\/TIT.2017.2756959","article-title":"A Fundamental Tradeoff Between Computation and Communication in Distributed Computing","volume":"64","author":"Li","year":"2018","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Halbawi, W., Azizan, N., Salehi, F., and Hassibi, B. (2018, January 17\u201322). Improving Distributed Gradient Descent Using Reed\u2013Solomon Codes. Proceedings of the IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA.","DOI":"10.1109\/ISIT.2018.8437467"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Reisizadeh, A., Prakash, S., Pedarsani, R., and Avestimedhr, S. (2017, January 25\u201330). Coded Computation over Heterogeneous Clusters. Proceedings of the IEEE International Symposium on Information Theory (ISIT), Aachen, Germany.","DOI":"10.1109\/ISIT.2017.8006961"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Fan, X., Soto, P., Zhong, X., Xi, D., Wang, Y., and Li, J. (2020, January 15\u201317). Leveraging Stragglers in Coded Computing with Heterogeneous Servers. Proceedings of the IEEE\/ACM 28th International Symposium on Quality of Service (IWQoS), Hang Zhou, China.","DOI":"10.1109\/IWQoS49365.2020.9213028"},{"key":"ref_28","unstructured":"Wang, S., Liu, J., and Shroff, N. (2018, January 10\u201315). Coded Sparse Matrix Multiplication. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"1204","DOI":"10.1109\/18.850663","article-title":"Network information flow","volume":"46","author":"Ahlswede","year":"2000","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"782","DOI":"10.1109\/TNET.2003.818197","article-title":"An Algebraic Approach to Network Coding","volume":"11","author":"Koetter","year":"2003","journal-title":"IEEE\/ACM Trans. Netw. (TON)"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"37","DOI":"10.4310\/CIS.2006.v6.n1.a3","article-title":"Network Error Correction, Part I, Part II","volume":"6","author":"Yeung","year":"2006","journal-title":"Commun. Inf. Syst."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"5322","DOI":"10.1109\/TIT.2014.2334315","article-title":"Batched sparse codes","volume":"60","author":"Yang","year":"2014","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Boyd, S., and Vandenberghe, L. (2004). Convex Optimization, Cambridge University Press.","DOI":"10.1017\/CBO9780511804441"},{"key":"ref_34","first-page":"3368","article-title":"Gradient Coding: Avoiding Stragglers in Distributed Learning","volume":"Volume 70","author":"Precup","year":"2017","journal-title":"Proceedings of the 34th International Conference on Machine Learning"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1145\/285243.285258","article-title":"A digital fountain approach to reliable distribution of bulk data","volume":"28","author":"Byers","year":"1998","journal-title":"ACM SIGCOMM Comput. Commun. Rev."},{"key":"ref_36","unstructured":"Luby, M. (2002, January 16\u201319). LT codes. Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science, Vancouver, BC, Canada."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"2551","DOI":"10.1109\/TIT.2006.874390","article-title":"Raptor codes","volume":"52","author":"Shokrollahi","year":"2006","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"3725","DOI":"10.1109\/TCOMM.2014.2362111","article-title":"Buffer-based Distributed LT Codes","volume":"62","author":"Hussain","year":"2014","journal-title":"IEEE Trans. Commun."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"2844","DOI":"10.1109\/TSP.2021.3078625","article-title":"Randomized Neural Networks based Decentralized Multi-Task Learning via Hybrid Multi-Block ADMM","volume":"69","author":"Ye","year":"2021","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"5842","DOI":"10.1109\/TSP.2020.3027917","article-title":"Privacy-preserving Incremental ADMM for Decentralized Consensus Optimization","volume":"68","author":"Ye","year":"2020","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"2048","DOI":"10.1109\/JIOT.2018.2875057","article-title":"Walk Proximal Gradient: An Energy-Efficient Algorithm for Consensus Optimization","volume":"6","author":"Mao","year":"2018","journal-title":"IEEE Internet Things J."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1109\/MSP.2020.3003845","article-title":"Communication-Censored Linearized ADMM for Decentralized Consensus Optimization","volume":"6","author":"Li","year":"2020","journal-title":"IEEE Trans. Signal Inf. Process. Netw."},{"key":"ref_43","first-page":"1594","article-title":"Beyond convexity: Stochastic quasi-convex optimization","volume":"28","author":"Hazan","year":"2015","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Cebe, M., Kaplan, B., and Akkaya, K. (2018, January 5\u20137). A Network Coding Based Information Spreading Approach for Permissioned Blockchain in IoT Settings. Proceedings of the 15th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, New York, NY, USA.","DOI":"10.1145\/3286978.3286984"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Braun, M., Wiesmaier, A., Alnahawi, N., and Geibler, J. (2021, January 6\u20138). On Message-based Consensus and Network Coding. Proceedings of the 12th International Conference on Network of the Future (NoF), Coimbra, Portugal.","DOI":"10.1109\/NoF52522.2021.9609913"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"1361","DOI":"10.1109\/TIT.2011.2173631","article-title":"Secure network coding for wiretap networks of type II","volume":"58","author":"Rouayheb","year":"2012","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"424","DOI":"10.1109\/TIT.2010.2090197","article-title":"Secure Network Coding on a Wiretap Network","volume":"57","author":"Cai","year":"2011","journal-title":"IEEE Trans. Inf. Theory"}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/24\/9\/1284\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T00:29:52Z","timestamp":1760142592000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/24\/9\/1284"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,9,12]]},"references-count":47,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2022,9]]}},"alternative-id":["e24091284"],"URL":"https:\/\/doi.org\/10.3390\/e24091284","relation":{},"ISSN":["1099-4300"],"issn-type":[{"type":"electronic","value":"1099-4300"}],"subject":[],"published":{"date-parts":[[2022,9,12]]}}}