{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T16:43:25Z","timestamp":1740156205256,"version":"3.37.3"},"reference-count":44,"publisher":"Wiley","license":[{"start":{"date-parts":[[2021,9,21]],"date-time":"2021-09-21T00:00:00Z","timestamp":1632182400000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100000038","name":"Natural Sciences and Engineering Research Council of Canada","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100000038","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Huawei Research"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Applied Computational Intelligence and Soft Computing"],"published-print":{"date-parts":[[2021,9,21]]},"abstract":"<jats:p>The growth in size and complexity of convolutional neural networks (CNNs) is forcing the partitioning of a network across multiple accelerators during training and pipelining of backpropagation computations over these accelerators. Pipelining results in the use of stale weights. Existing approaches to pipelined training avoid or limit the use of stale weights with techniques that either underutilize accelerators or increase training memory footprint. This paper contributes a pipelined backpropagation scheme that uses stale weights to maximize accelerator utilization and keep memory overhead modest. It explores the impact of stale weights on the statistical efficiency and performance using 4 CNNs (LeNet-5, AlexNet, VGG, and ResNet) and shows that when pipelining is introduced in early layers, training with stale weights converges and results in models with comparable inference accuracies to those resulting from nonpipelined training (a drop in accuracy of 0.4%, 4%, 0.83%, and 1.45% for the 4 networks, respectively). However, when pipelining is deeper in the network, inference accuracies drop significantly (up to 12% for VGG and 8.5% for ResNet-20). The paper also contributes a hybrid training scheme that combines pipelined with nonpipelined training to address this drop. The potential for performance improvement of the proposed scheme is demonstrated with a proof-of-concept pipelined backpropagation implementation in PyTorch on 2 GPUs using ResNet-56\/110\/224\/362, achieving speedups of up to 1.8X over a 1-GPU baseline.<\/jats:p>","DOI":"10.1155\/2021\/3839543","type":"journal-article","created":{"date-parts":[[2021,9,22]],"date-time":"2021-09-22T18:28:45Z","timestamp":1632335325000},"page":"1-16","source":"Crossref","is-referenced-by-count":0,"title":["Pipelined Training with Stale Weights in Deep Convolutional Neural Networks"],"prefix":"10.1155","volume":"2021","author":[{"given":"Lifu","family":"Zhang","sequence":"first","affiliation":[{"name":"Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2985-4873","authenticated-orcid":true,"given":"Tarek S.","family":"Abdelrahman","sequence":"additional","affiliation":[{"name":"Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON, Canada"}]}],"member":"311","reference":[{"key":"1","doi-asserted-by":"publisher","DOI":"10.1109\/cvpr.2016.90"},{"first-page":"429","article-title":"Smiles-bert: large scale unsupervised pre-training for molecular property prediction","author":"S. 
Wang","key":"2"},{"key":"3","doi-asserted-by":"publisher","DOI":"10.1145\/3206025.3206062"},{"article-title":"Randomly initialized convolutional neural network for the recognition of Covid-19 using X-ray images","year":"2021","author":"S. Ben Atitallah","key":"4"},{"key":"5","doi-asserted-by":"publisher","DOI":"10.1016\/j.compag.2021.106014"},{"key":"6","doi-asserted-by":"publisher","DOI":"10.1016\/j.ecoinf.2021.101325"},{"key":"7","doi-asserted-by":"publisher","DOI":"10.21437\/interspeech.2012-7"},{"article-title":"Gpipe: efficient training of giant neural networks using pipeline parallelism","year":"2019","author":"Y. Huang","key":"8"},{"article-title":"Pipedream: generalized pipeline parallelism for DNN training","author":"D. Narayanan","key":"9","doi-asserted-by":"crossref","DOI":"10.1145\/3341301.3359646"},{"key":"10","doi-asserted-by":"publisher","DOI":"10.1038\/323533a0"},{"article-title":"Efficient and robust parallel DNN training through model parallelism on multi-GPU platform","year":"2019","author":"H.-Y. Cheng","key":"11"},{"first-page":"2103","article-title":"Decoupled parallel backpropagation with convergence guarantee","author":"Z. Huo","key":"12"},{"key":"13","doi-asserted-by":"publisher","DOI":"10.3389\/fnins.2017.00496"},{"author":"J. Chen","key":"14","article-title":"Revisiting distributed synchronous SGD"},{"key":"15","doi-asserted-by":"publisher","DOI":"10.1145\/2901318.2901323"},{"author":"J. Dean","key":"16","article-title":"Large scale distributed deep networks"},{"article-title":"Accurate, large Minibatch SGD: training ImageNet in 1 hour","year":"2017","author":"P. Goyal","key":"17"},{"key":"18","first-page":"172","article-title":"Blink: fast and generic collectives for distributed ml","volume":"2","author":"G. Wang","year":"2020","journal-title":"Proceedings of Machine Learning and Systems"},{"first-page":"181","article-title":"An efficient communication architecture for distributed deep learning on GPU clusters","author":"H. Zhang","key":"19"},{"first-page":"571","article-title":"Project adam: building an efficient and scalable deep learning training system","author":"T. Chilimbi","key":"20"},{"key":"21","doi-asserted-by":"publisher","DOI":"10.1145\/2901318.2901331"},{"key":"22","first-page":"2834","article-title":"On model parallelization and scheduling strategies for distributed machine learning","volume":"27","author":"S. Lee","year":"2014","journal-title":"Advances in Neural Information Processing Systems"},{"key":"23","doi-asserted-by":"publisher","DOI":"10.14778\/2212351.2212354"},{"key":"24","doi-asserted-by":"publisher","DOI":"10.1109\/72.286892"},{"key":"25","doi-asserted-by":"publisher","DOI":"10.5555\/3327757.3327772"},{"article-title":"Xpipe: efficient pipeline model parallelism for multi-GPU DNN training","year":"2020","author":"L. Guan","key":"26"},{"article-title":"Pipelined backpropagation at scale: training large models without batches","year":"2021","author":"A. Kosson","key":"27"},{"first-page":"307","article-title":"HetPipe: enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism","author":"J. H. Park","key":"28"},{"key":"29","first-page":"1","article-title":"Beyond data and model parallelism for deep neural networks","volume":"1","author":"Z. 
Jia","year":"2019","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"30","article-title":"Pipe-SGD: a decentralized pipelined SGD framework for distributed deep net training","volume-title":"Advances in Neural Information Processing Systems","author":"Y. Li","year":"2018"},{"key":"31","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654889"},{"author":"P. Adam","key":"32","article-title":"Automatic differentiation in PyTorch"},{"issue":"11","key":"33","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"Y. LeCun","year":"1998","journal-title":"Proceedings of the IEEE"},{"article-title":"MNIST handwritten digit database","year":"1996","author":"Y. Le Cun","key":"34"},{"key":"35","doi-asserted-by":"publisher","DOI":"10.1145\/3065386"},{"author":"K. Simonyan","key":"36","article-title":"Very deep convolutional networks for large-scale image recognition"},{"article-title":"CIFAR-10 (Canadian institute for advanced research)","year":"2018","author":"A. Krizhevsky","key":"37"},{"article-title":"CIFAR-100 (Canadian institute for advanced research)","year":"2018","author":"A. Krizhevsky","key":"38"},{"first-page":"4171","article-title":"BERT: pre-training of deep bidirectional transformers for language understanding","author":"J. Devlin","key":"39"},{"article-title":"Deep learning recommendation model for personalization and recommendation systems CoRR, abs\/1906","year":"2019","author":"M. Naumov","key":"40"},{"article-title":"The xilinx machine learning (Ml) suite, xdnn","year":"2019","author":"Xilinx","key":"41"},{"key":"42","doi-asserted-by":"publisher","DOI":"10.1137\/16m1080173"},{"key":"43","doi-asserted-by":"publisher","DOI":"10.1214\/aoms\/1177729586"},{"article-title":"ImageNet: a large-scale hierarchical image database","author":"J. Deng","key":"44","doi-asserted-by":"crossref","DOI":"10.1109\/CVPR.2009.5206848"}],"container-title":["Applied Computational Intelligence and Soft Computing"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/downloads.hindawi.com\/journals\/acisc\/2021\/3839543.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/downloads.hindawi.com\/journals\/acisc\/2021\/3839543.xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/downloads.hindawi.com\/journals\/acisc\/2021\/3839543.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,22]],"date-time":"2021-09-22T18:29:02Z","timestamp":1632335342000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.hindawi.com\/journals\/acisc\/2021\/3839543\/"}},"subtitle":[],"editor":[{"given":"Mehdi","family":"Keshavarz-Ghorabaee","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2021,9,21]]},"references-count":44,"alternative-id":["3839543","3839543"],"URL":"https:\/\/doi.org\/10.1155\/2021\/3839543","relation":{},"ISSN":["1687-9732","1687-9724"],"issn-type":[{"type":"electronic","value":"1687-9732"},{"type":"print","value":"1687-9724"}],"subject":[],"published":{"date-parts":[[2021,9,21]]}}}