{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,10]],"date-time":"2026-01-10T07:45:41Z","timestamp":1768031141748,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":62,"publisher":"ACM","license":[{"start":{"date-parts":[[2019,11,17]],"date-time":"2019-11-17T00:00:00Z","timestamp":1573948800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"LDRD"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2019,11,17]]},"DOI":"10.1145\/3295500.3356207","type":"proceedings-article","created":{"date-parts":[[2019,11,7]],"date-time":"2019-11-07T19:43:22Z","timestamp":1573155802000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":29,"title":["Channel and filter parallelism for large-scale CNN training"],"prefix":"10.1145","author":[{"given":"Nikoli","family":"Dryden","sequence":"first","affiliation":[{"name":"University of Illinois at Urbana-Champaign and Lawrence Livermore National Laboratory"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Naoya","family":"Maruyama","sequence":"additional","affiliation":[{"name":"Lawrence Livermore National Laboratory"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tim","family":"Moon","sequence":"additional","affiliation":[{"name":"Lawrence Livermore National Laboratory"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tom","family":"Benson","sequence":"additional","affiliation":[{"name":"Lawrence Livermore National Laboratory"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Marc","family":"Snir","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Brian","family":"Van Essen","sequence":"additional","affiliation":[{"name":"Lawrence Livermore National Laboratory"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2019,11,17]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Mart\u00edn Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg S. Corrado Andy Davis Jeffrey Dean Matthieu Devin Sanjay Ghemawat Ian Goodfellow Andrew Harp Geoffrey Irving Michael Isard Yangqing Jia RafalJozefowicz Lukasz Kaiser Manjunath Kudlur Josh Levenberg Dandelion Man\u00e9 Rajat Monga Sherry Moore Derek Murray Chris Olah Mike Schuster Jonathon Shlens Benoit Steiner Ilya Sutskever Kunal Talwar Paul Tucker Vincent Vanhoucke Vijay Vasudevan Fernanda Vi\u00e9gas Oriol Vinyals Pete Warden Martin Wattenberg Martin Wicke Yuan Yu and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https:\/\/www.tensorflow.org\/  Mart\u00edn Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg S. Corrado Andy Davis Jeffrey Dean Matthieu Devin Sanjay Ghemawat Ian Goodfellow Andrew Harp Geoffrey Irving Michael Isard Yangqing Jia RafalJozefowicz Lukasz Kaiser Manjunath Kudlur Josh Levenberg Dandelion Man\u00e9 Rajat Monga Sherry Moore Derek Murray Chris Olah Mike Schuster Jonathon Shlens Benoit Steiner Ilya Sutskever Kunal Talwar Paul Tucker Vincent Vanhoucke Vijay Vasudevan Fernanda Vi\u00e9gas Oriol Vinyals Pete Warden Martin Wattenberg Martin Wicke Yuan Yu and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https:\/\/www.tensorflow.org\/"},{"key":"e_1_3_2_1_2_1","volume-title":"NeurIPS 2017 Workshop: Deep Learning at Supercomputer Scale.","author":"Akiba Takuya","year":"2017","unstructured":"Takuya Akiba , Shuji Suzuki , and Keisuke Fukuda . 2017 . Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes . In NeurIPS 2017 Workshop: Deep Learning at Supercomputer Scale. Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. 2017. Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes. In NeurIPS 2017 Workshop: Deep Learning at Supercomputer Scale."},{"key":"e_1_3_2_1_3_1","volume-title":"A linear algebra framework for static High Performance Fortran code distribution. Scientific Programming 6, 1","author":"Ancourt Corinne","year":"1997","unstructured":"Corinne Ancourt , Fabien Coelho , Fran\u00e7ois Irigoin , and Ronan Keryell . 1997. A linear algebra framework for static High Performance Fortran code distribution. Scientific Programming 6, 1 ( 1997 ). Corinne Ancourt, Fabien Coelho, Fran\u00e7ois Irigoin, and Ronan Keryell. 1997. A linear algebra framework for static High Performance Fortran code distribution. Scientific Programming 6, 1 (1997)."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3317550.3321441"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2019.00018"},{"key":"e_1_3_2_1_6_1","volume-title":"Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. arXiv preprint arXiv:1802.09941","author":"Ben-Nun Tal","year":"2018","unstructured":"Tal Ben-Nun and Torsten Hoefler . 2018. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. arXiv preprint arXiv:1802.09941 ( 2018 ). Tal Ben-Nun and Torsten Hoefler. 2018. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. arXiv preprint arXiv:1802.09941 (2018)."},{"key":"e_1_3_2_1_7_1","volume-title":"Aedan Pope, et al.","author":"Buchlovsky Peter","year":"2019","unstructured":"Peter Buchlovsky , David Budden , Dominik Grewe , Chris Jones , John Aslanides , Frederic Besse , Andy Brock , Aidan Clark , Sergio G\u00f3mez Colmenarejo , Aedan Pope, et al. 2019 . TF-Replicator: Distributed Machine Learning for Researchers . arXiv preprint arXiv:1902.00465 (2019). Peter Buchlovsky, David Budden, Dominik Grewe, Chris Jones, John Aslanides, Frederic Besse, Andy Brock, Aidan Clark, Sergio G\u00f3mez Colmenarejo, Aedan Pope, et al. 2019. TF-Replicator: Distributed Machine Learning for Researchers. arXiv preprint arXiv:1902.00465 (2019)."},{"key":"e_1_3_2_1_8_1","volume-title":"Tenth International Workshop on Frontiers in Handwriting Recognition.","author":"Chellapilla Kumar","year":"2006","unstructured":"Kumar Chellapilla , Sidd Puri , and Patrice Simard . 2006 . High performance convolutional neural networks for document processing . In Tenth International Workshop on Frontiers in Handwriting Recognition. Kumar Chellapilla, Sidd Puri, and Patrice Simard. 2006. High performance convolutional neural networks for document processing. In Tenth International Workshop on Frontiers in Handwriting Recognition."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_49"},{"key":"e_1_3_2_1_10_1","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen , Thierry Moreau , Ziheng Jiang , Lianmin Zheng , Eddie Yan , Haichen Shen , Meghan Cowan , Leyuan Wang , Yuwei Hu , Luis Ceze , 2018 . TVM: An automated end-to-end optimizing compiler for deep learning . In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) . Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)."},{"key":"e_1_3_2_1_11_1","volume-title":"cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759","author":"Chetlur Sharan","year":"2014","unstructured":"Sharan Chetlur , Cliff Woolley , Philippe Vandermersch , Jonathan Cohen , John Tran , Bryan Catanzaro , and Evan Shelhamer . 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 ( 2014 ). Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014)."},{"key":"e_1_3_2_1_12_1","volume-title":"11th USENIX Symposium on Operating Systems Design and Implementation (OSDI).","author":"Chilimbi Trishul","year":"2014","unstructured":"Trishul Chilimbi , Yutaka Suzue , Johnson Apacible , and Karthik Kalyanaraman . 2014 . Project Adam: Building an efficient and scalable deep learning training system . In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI). Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI)."},{"key":"e_1_3_2_1_13_1","volume-title":"International Conference on Machine Learning (ICML).","author":"Coates Adam","year":"2013","unstructured":"Adam Coates , Brody Huval , Tao Wang , David Wu , Bryan Catanzaro , and Ng Andrew . 2013 . Deep learning with COTS HPC systems . In International Conference on Machine Learning (ICML). Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Ng Andrew. 2013. Deep learning with COTS HPC systems. In International Conference on Machine Learning (ICML)."},{"key":"e_1_3_2_1_14_1","volume-title":"NeurIPS ML Systems Workshop.","author":"Coleman Cody","year":"2017","unstructured":"Cody Coleman , Deepak Narayanan , Daniel Kang , Tian Zhao , Jian Zhang , Luigi Nardi , Peter Bailis , Kunle Olukotun , Chris R\u00e9 , and Matei Zaharia . 2017 . DAWN-Bench: An end-to-end deep learning benchmark and competition . In NeurIPS ML Systems Workshop. Cody Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Chris R\u00e9, and Matei Zaharia. 2017. DAWN-Bench: An end-to-end deep learning benchmark and competition. In NeurIPS ML Systems Workshop."},{"key":"e_1_3_2_1_15_1","unstructured":"Jeffrey Dean Greg Corrado Rajat Monga Kai Chen Matthieu Devin Mark Mao Andrew Senior Paul Tucker Ke Yang Quoc V Le etal 2012. Large scale distributed deep networks. In Advances in Neural Information Processing Systems (NeurIPS).  Jeffrey Dean Greg Corrado Rajat Monga Kai Chen Matthieu Devin Mark Mao Andrew Senior Paul Tucker Ke Yang Quoc V Le et al. 2012. Large scale distributed deep networks. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_3_2_1_16_1","volume-title":"Communication-optimal convolutional neural nets. arXiv preprint arXiv:1802.06905","author":"Demmel James","year":"2018","unstructured":"James Demmel and Grace Dinh . 2018. Communication-optimal convolutional neural nets. arXiv preprint arXiv:1802.06905 ( 2018 ). James Demmel and Grace Dinh. 2018. Communication-optimal convolutional neural nets. arXiv preprint arXiv:1802.06905 (2018)."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2019.00031"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/MLHPC.2018.8638639"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3210377.3210394"},{"key":"e_1_3_2_1_20_1","volume-title":"large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677","author":"Goyal Priya","year":"2017","unstructured":"Priya Goyal , Piotr Doll\u00e1r , Ross Girshick , Pieter Noordhuis , Lukasz Wesolowski , Aapo Kyrola , Andrew Tulloch , Yangqing Jia , and Kaiming He. 2017. Accurate , large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 ( 2017 ). Priya Goyal, Piotr Doll\u00e1r, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00059"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00745"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.243"},{"key":"e_1_3_2_1_25_1","volume-title":"GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. arXiv preprint arXiv:1811.06965","author":"Huang Yanping","year":"2018","unstructured":"Yanping Huang , Yonglong Cheng , Dehao Chen , HyoukJoong Lee , Jiquan Ngiam , Quoc V Le , and Zhifeng Chen . 2018. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. arXiv preprint arXiv:1811.06965 ( 2018 ). Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, and Zhifeng Chen. 2018. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. arXiv preprint arXiv:1811.06965 (2018)."},{"key":"e_1_3_2_1_26_1","volume-title":"Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167","author":"Ioffe Sergey","year":"2015","unstructured":"Sergey Ioffe and Christian Szegedy . 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 ( 2015 ). Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)."},{"key":"e_1_3_2_1_27_1","volume-title":"Proceedings of the Fifth International Conference on Learning Representations (ICLR).","author":"Keskar Nitish Shirish","year":"2017","unstructured":"Nitish Shirish Keskar , Dheevatsa Mudigere , Jorge Nocedal , Mikhail Smelyanskiy , and Ping Tak Peter Tang . 2017 . On large-batch training for deep learning: Generalization gap and sharp minima . In Proceedings of the Fifth International Conference on Learning Representations (ICLR). Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2017. On large-batch training for deep learning: Generalization gap and sharp minima. In Proceedings of the Fifth International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/MLHPC.2016.006"},{"key":"e_1_3_2_1_29_1","unstructured":"Alex Krizhevsky Ilya Sutskever and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS).  Alex Krizhevsky Ilya Sutskever and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00054"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126916"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.435"},{"key":"e_1_3_2_1_33_1","volume-title":"Deep learning. Nature 521, 7553","author":"LeCun Yann","year":"2015","unstructured":"Yann LeCun , Yoshua Bengio , and Geoffrey Hinton . 2015. Deep learning. Nature 521, 7553 ( 2015 ). Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015)."},{"key":"e_1_3_2_1_35_1","volume-title":"Nam Sung Kim, and Alexander Schwing","author":"Li Youjie","year":"2018","unstructured":"Youjie Li , Mingchao Yu , Songze Li , Salman Avestimehr , Nam Sung Kim, and Alexander Schwing . 2018 . Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training. In Advances in Neural Information Processing Systems (NeurIPS) . Youjie Li, Mingchao Yu, Songze Li, Salman Avestimehr, Nam Sung Kim, and Alexander Schwing. 2018. Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_3_2_1_36_1","unstructured":"LLNL. 2018. Lassen. https:\/\/hpc.llnl.gov\/hardware\/platforms\/lassen.  LLNL. 2018. Lassen. https:\/\/hpc.llnl.gov\/hardware\/platforms\/lassen."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"e_1_3_2_1_38_1","volume-title":"Inefficiency of K-FAC for Large Batch Size Training. arXiv preprint arXiv:1903.06237","author":"Ma Linjian","year":"2019","unstructured":"Linjian Ma , Gabe Montague , Jiayu Ye , Zhewei Yao , Amir Gholami , Kurt Keutzer , and Michael W Mahoney . 2019. Inefficiency of K-FAC for Large Batch Size Training. arXiv preprint arXiv:1903.06237 ( 2019 ). Linjian Ma, Gabe Montague, Jiayu Ye, Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael W Mahoney. 2019. Inefficiency of K-FAC for Large Batch Size Training. arXiv preprint arXiv:1903.06237 (2019)."},{"key":"e_1_3_2_1_39_1","volume-title":"Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851","author":"Mathieu Michael","year":"2013","unstructured":"Michael Mathieu , Mikael Henaff , and Yann LeCun . 2013. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851 ( 2013 ). Michael Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851 (2013)."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00068"},{"key":"e_1_3_2_1_41_1","volume-title":"Massively distributed SGD: ImageNet\/ResNet-50 Training in a Flash. arXiv preprint arXiv:1811.05233","author":"Mikami Hiroaki","year":"2018","unstructured":"Hiroaki Mikami , Hisahiro Suganuma , Pongsakorn U-chupala, Yoshiki Tanaka , and Yuichi Kageyama . 2018. Massively distributed SGD: ImageNet\/ResNet-50 Training in a Flash. arXiv preprint arXiv:1811.05233 ( 2018 ). Hiroaki Mikami, Hisahiro Suganuma, Pongsakorn U-chupala, Yoshiki Tanaka, and Yuichi Kageyama. 2018. Massively distributed SGD: ImageNet\/ResNet-50 Training in a Flash. arXiv preprint arXiv:1811.05233 (2018)."},{"key":"e_1_3_2_1_42_1","unstructured":"MLPerf Collaboration. 2019. MLPerf. https:\/\/mlperf.org\/.  MLPerf Collaboration. 2019. MLPerf. https:\/\/mlperf.org\/."},{"key":"e_1_3_2_1_43_1","unstructured":"NVIDIA. 2019. NVIDIA Collective Communications Library. https:\/\/developer.nvidia.com\/nccl.  NVIDIA. 2019. NVIDIA Collective Communications Library. https:\/\/developer.nvidia.com\/nccl."},{"key":"e_1_3_2_1_44_1","unstructured":"NVIDIA Research. 2019. CUB. https:\/\/nvlabs.github.io\/cub\/.  NVIDIA Research. 2019. CUB. https:\/\/nvlabs.github.io\/cub\/."},{"key":"e_1_3_2_1_45_1","volume-title":"The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study. arXiv preprint arXiv:1905.03776","author":"Park Daniel S","year":"2019","unstructured":"Daniel S Park , Jascha Sohl-Dickstein , Quoc V Le , and Samuel L Smith . 2019. The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study. arXiv preprint arXiv:1905.03776 ( 2019 ). Daniel S Park, Jascha Sohl-Dickstein, Quoc V Le, and Samuel L Smith. 2019. The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study. arXiv preprint arXiv:1905.03776 (2019)."},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/1553374.1553486"},{"key":"e_1_3_2_1_47_1","volume-title":"Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548","author":"Real Esteban","year":"2018","unstructured":"Esteban Real , Alok Aggarwal , Yanping Huang , and Quoc V Le. 2018. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548 ( 2018 ). Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. 2018. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548 (2018)."},{"key":"e_1_3_2_1_48_1","unstructured":"Shaoqing Ren Kaiming He Ross Girshick and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS).  Shaoqing Ren Kaiming He Ross Girshick and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1137\/140993478"},{"key":"e_1_3_2_1_52_1","volume-title":"Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600","author":"Shallue Christopher J","year":"2018","unstructured":"Christopher J Shallue , Jaehoon Lee , Joe Antognini , Jascha Sohl-Dickstein , Roy Frostig , and George E Dahl . 2018. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600 ( 2018 ). Christopher J Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E Dahl. 2018. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600 (2018)."},{"key":"e_1_3_2_1_53_1","unstructured":"Noam Shazeer Youlong Cheng Niki Parmar Dustin Tran Ashish Vaswani Penporn Koanantakool Peter Hawkins HyoukJoong Lee Mingsheng Hong Cliff Young etal 2018. Mesh-TensorFlow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems (NeurIPS).  Noam Shazeer Youlong Cheng Niki Parmar Dustin Tran Ashish Vaswani Penporn Koanantakool Peter Hawkins HyoukJoong Lee Mingsheng Hong Cliff Young et al. 2018. Mesh-TensorFlow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.5555\/3298023.3298188"},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342005051521"},{"key":"e_1_3_2_1_57_1","volume-title":"SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience 9, 4","author":"Van De Geijn Robert A","year":"1997","unstructured":"Robert A Van De Geijn and Jerrell Watts . 1997 . SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience 9, 4 (1997). Robert A Van De Geijn and Jerrell Watts. 1997. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience 9, 4 (1997)."},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/2834892.2834897"},{"key":"e_1_3_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.634"},{"key":"e_1_3_2_1_60_1","volume-title":"NeurIPS Systems for ML Workshop.","author":"Ying Chris","year":"2018","unstructured":"Chris Ying , Sameer Kumar , Dehao Chen , Tao Wang , and Youlong Cheng . 2018 . Image Classification at Supercomputer Scale . In NeurIPS Systems for ML Workshop. Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, and Youlong Cheng. 2018. Image Classification at Supercomputer Scale. In NeurIPS Systems for ML Workshop."},{"key":"e_1_3_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3225058.3225069"},{"key":"e_1_3_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.5244\/C.30.87"},{"key":"e_1_3_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00391"},{"key":"e_1_3_2_1_64_1","volume-title":"mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412","author":"Zhang Hongyi","year":"2017","unstructured":"Hongyi Zhang , Moustapha Cisse , Yann N Dauphin , and David Lopez-Paz . 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 ( 2017 ). Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)."}],"event":{"name":"SC '19: The International Conference for High Performance Computing, Networking, Storage, and Analysis","location":"Denver Colorado","acronym":"SC '19","sponsor":["SIGHPC ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing","IEEE CS"]},"container-title":["Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3295500.3356207","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3295500.3356207","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3295500.3356207","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T01:02:13Z","timestamp":1750208533000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3295500.3356207"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,11,17]]},"references-count":62,"alternative-id":["10.1145\/3295500.3356207","10.1145\/3295500"],"URL":"https:\/\/doi.org\/10.1145\/3295500.3356207","relation":{},"subject":[],"published":{"date-parts":[[2019,11,17]]},"assertion":[{"value":"2019-11-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}