{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,14]],"date-time":"2026-02-14T06:15:56Z","timestamp":1771049756668,"version":"3.50.1"},"reference-count":63,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2018,6,8]],"date-time":"2018-06-08T00:00:00Z","timestamp":1528416000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Machine Learning Initiative at the Pacific Northwest National Laboratory"},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["1618620"],"award-info":[{"award-number":["1618620"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2018,6,30]]},"abstract":"<jats:p>\n            Convolution Neural Networks (CNNs), a special subcategory of Deep Learning Neural Networks (DNNs), have become increasingly popular in industry and academia for their powerful capability in pattern classification, image processing, and speech recognition. Recently, they have been widely adopted in High Performance Computing (HPC) environments for solving complex problems related to modeling, runtime prediction, and big data analysis. Current state-of-the-art designs for DNNs on modern multi- and many-core CPU architectures, such as variants of Caffe, have reported promising performance in speedup and scalability, comparable with the GPU implementations. However, modern CPU architectures employ\n            <jats:italic>Non-Uniform Memory Access<\/jats:italic>\n            (NUMA) technique to integrate multiple sockets, which incurs unique challenges for designing highly efficient CNN frameworks. 
Without a careful design, DNN frameworks can easily suffer from long memory latency due to a large number of memory accesses to remote NUMA domains, resulting in poor scalability. To address this challenge, we propose NUMA-aware multi-solver-based CNN design, named\n            <jats:italic>NUMA-Caffe<\/jats:italic>\n            , for accelerating deep learning neural networks on multi- and many-core CPU architectures. NUMA-Caffe is independent of DNN topology, does not impact network convergence rates, and provides superior scalability to the existing Caffe variants. Through a thorough empirical study on four contemporary NUMA-based multi- and many-core architectures, our experimental results demonstrate that NUMA-Caffe significantly outperforms the state-of-the-art Caffe designs in terms of both throughput and scalability.\n          <\/jats:p>","DOI":"10.1145\/3199605","type":"journal-article","created":{"date-parts":[[2018,6,11]],"date-time":"2018-06-11T12:20:54Z","timestamp":1528719654000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":17,"title":["NUMA-Caffe"],"prefix":"10.1145","volume":"15","author":[{"given":"Probir","family":"Roy","sequence":"first","affiliation":[{"name":"College of William and Mary, Williamsburg, VA"}]},{"given":"Shuaiwen Leon","family":"Song","sequence":"additional","affiliation":[{"name":"Pacific Northwest National Laboratory and College of William and Mary, Richland,WA"}]},{"given":"Sriram","family":"Krishnamoorthy","sequence":"additional","affiliation":[{"name":"Pacific Northwest National Laboratory and College of William and Mary, Richland,WA"}]},{"given":"Abhinav","family":"Vishnu","sequence":"additional","affiliation":[{"name":"Pacific Northwest National Laboratory and College of William and Mary, Richland,WA"}]},{"given":"Dipanjan","family":"Sengupta","sequence":"additional","affiliation":[{"name":"Intel Labs, Mission College Blvd., Santa Clara, 
CA"}]},{"given":"Xu","family":"Liu","sequence":"additional","affiliation":[{"name":"College of William and Mary, Williamsburg, VA"}]}],"member":"320","published-online":{"date-parts":[[2018,6,8]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Retrieved","year":"2018","unstructured":"2013. smem memory reporting tool . Retrieved January 21, 2018 from https:\/\/www.selenic.com\/smem. 2013. smem memory reporting tool. Retrieved January 21, 2018 from https:\/\/www.selenic.com\/smem."},{"key":"e_1_2_1_2_1","volume-title":"Retrieved","year":"2018","unstructured":"2016. Amazon DSSTNE github project . Retrieved January 21, 2018 from https:\/\/github.com\/amzn\/amazon-dsstne. 2016. Amazon DSSTNE github project. Retrieved January 21, 2018 from https:\/\/github.com\/amzn\/amazon-dsstne."},{"key":"e_1_2_1_3_1","volume-title":"Retrieved","year":"2018","unstructured":"2016. CaffeOnSpark github project . Retrieved January 21, 2018 from https:\/\/github.com\/yahoo\/CaffeOnSpark. 2016. CaffeOnSpark github project. Retrieved January 21, 2018 from https:\/\/github.com\/yahoo\/CaffeOnSpark."},{"key":"e_1_2_1_4_1","unstructured":"2016. PaddlePaddle. Retrieved January 21 2018 from https:\/\/github.com\/paddlepaddle\/paddle.  2016. PaddlePaddle. Retrieved January 21 2018 from https:\/\/github.com\/paddlepaddle\/paddle."},{"key":"e_1_2_1_5_1","volume-title":"Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467.","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi , Ashish Agarwal , Paul Barham , Eugene Brevdo , Zhifeng Chen , Craig Citro , Greg S. Corrado , Andy Davis , Jeffrey Dean , Matthieu Devin , Sanjay Ghemawat , Ian Goodfellow , Andrew Harp , Geoffrey Irving , Michael Isard , Yangqing Jia , Rafal Jozefowicz , Lukasz Kaiser , Manjunath Kudlur , Josh Levenberg , Dandelion Man\u00e9 , Rajat Monga , Sherry Moore , Derek Murray , Chris Olah , Mike Schuster , Jonathon Shlens , Benoit Steiner , Ilya Sutskever . 2016 . 
Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467. Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Man\u00e9, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.v22:6"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018769"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/PDP.2010.67"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-010-0136-3"},{"key":"e_1_2_1_10_1","volume-title":"Riccardo Zecchina, Stefano Soatto, and Ameet Talwalkar.","author":"Chaudhari Pratik","year":"2017","unstructured":"Pratik Chaudhari , Carlo Baldassi , Riccardo Zecchina, Stefano Soatto, and Ameet Talwalkar. 2017 . Parle : Parallelizing stochastic gradient descent. arXiv:1707.00424. Pratik Chaudhari, Carlo Baldassi, Riccardo Zecchina, Stefano Soatto, and Ameet Talwalkar. 2017. Parle: Parallelizing stochastic gradient descent. arXiv:1707.00424."},{"key":"e_1_2_1_11_1","unstructured":"Jianmin Chen Xinghao Pan Rajat Monga Samy Bengio and Rafal Jozefowicz. 2016. Revisiting distributed synchronous SGD. arXiv:1604.00981.  Jianmin Chen Xinghao Pan Rajat Monga Samy Bengio and Rafal Jozefowicz. 2016. Revisiting distributed synchronous SGD. arXiv:1604.00981."},{"key":"e_1_2_1_12_1","volume-title":"Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. 
arXiv:1512.01274","author":"Chen Tianqi","year":"2015","unstructured":"Tianqi Chen , Mu Li , Yutian Li , Min Lin , Naiyan Wang , Minjie Wang , Tianjun Xiao , Bing Xu , Chiyuan Zhang , and Zheng Zhang . 2015 . Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274 (2015). Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274 (2015)."},{"key":"e_1_2_1_13_1","volume-title":"CUDNN: Efficient primitives for deep learning. arXiv:1410.0759.","author":"Chetlur Sharan","year":"2014","unstructured":"Sharan Chetlur , Cliff Woolley , Philippe Vandermersch , Jonathan Cohen , John Tran , Bryan Catanzaro , and Evan Shelhamer . 2014 . CUDNN: Efficient primitives for deep learning. arXiv:1410.0759. Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. CUDNN: Efficient primitives for deep learning. arXiv:1410.0759."},{"key":"e_1_2_1_14_1","volume-title":"11th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201914)","author":"Chilimbi Trishul","year":"2014","unstructured":"Trishul Chilimbi , Yutaka Suzue , Johnson Apacible , and Karthik Kalyanaraman . 2014 . Project Adam: Building an efficient and scalable deep learning training system . In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201914) . USENIX Association, Broomfield, CO, 571--582. Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201914). 
USENIX Association, Broomfield, CO, 571--582."},{"key":"e_1_2_1_15_1","unstructured":"Minsik Cho Ulrich Finkler Sameer Kumar David Kung Vaibhav Saxena and Dheeraj Sreedhar. 2017. PowerAI DDL. arXiv:1708.02188.  Minsik Cho Ulrich Finkler Sameer Kumar David Kung Vaibhav Saxena and Dheeraj Sreedhar. 2017. PowerAI DDL. arXiv:1708.02188."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1162\/NECO_a_00052"},{"key":"e_1_2_1_17_1","volume-title":"Retrieved","author":"Intel Corporation","year":"2009","unstructured":"Intel Corporation . 2009 . Intel QuickPath Interconnect . Retrieved January 21, 2018 from http:\/\/www.intel.com\/content\/www\/us\/en\/io\/quickpath-technology\/quickpath-technology-general.html. Intel Corporation. 2009. Intel QuickPath Interconnect. Retrieved January 21, 2018 from http:\/\/www.intel.com\/content\/www\/us\/en\/io\/quickpath-technology\/quickpath-technology-general.html."},{"key":"e_1_2_1_18_1","volume-title":"Retrieved","author":"Intel Corporation","year":"2010","unstructured":"Intel Corporation . 2010 . Intel VTune Performance Analyzer . Retrieved January 21, 2018 from http:\/\/software.intel.com\/en-us\/intel-vtune. Intel Corporation. 2010. Intel VTune Performance Analyzer. Retrieved January 21, 2018 from http:\/\/software.intel.com\/en-us\/intel-vtune."},{"key":"e_1_2_1_19_1","unstructured":"Intel Corporation. 2017. Intel distribution of Caffe*. https:\/\/github.com\/intel\/caffe.git.  Intel Corporation. 2017. Intel distribution of Caffe*. https:\/\/github.com\/intel\/caffe.git."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2901318.2901323"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2011.2134090"},{"key":"e_1_2_1_22_1","volume-title":"Proceedings of the 25th International Conference on Neural Information Processing Systems\u2014Volume 1 (NIPS\u201912)","author":"Dean Jeffrey","unstructured":"Jeffrey Dean , Greg S. Corrado , Rajat Monga , Kai Chen , Matthieu Devin , Quoc V. 
Le , Mark Z. Mao , Marc\u2019Aurelio Ranzato , Andrew Senior , Paul Tucker , Ke Yang , and Andrew Y. Ng . 2012. Large scale distributed deep networks . In Proceedings of the 25th International Conference on Neural Information Processing Systems\u2014Volume 1 (NIPS\u201912) . Curran Associates Inc., 1223--1231. Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc\u2019Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. 2012. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems\u2014Volume 1 (NIPS\u201912). Curran Associates Inc., 1223--1231."},{"key":"e_1_2_1_23_1","volume-title":"Retrieved","author":"Devices Advanced Micro","year":"2009","unstructured":"Advanced Micro Devices . 2009 . AMD HyperTransport Technology . Retrieved January 21, 2018 from http:\/\/www.amd.com\/en-us\/innovations\/software-technologies\/hypertransport. Advanced Micro Devices. 2009. AMD HyperTransport Technology. Retrieved January 21, 2018 from http:\/\/www.amd.com\/en-us\/innovations\/software-technologies\/hypertransport."},{"key":"e_1_2_1_24_1","volume-title":"Retrieved","author":"Enterprise Hewlett Packard","year":"2017","unstructured":"Hewlett Packard Enterprise . 2017 . HP Integrity Superdome X . Retrieved January 21, 2018 from http:\/\/www8.hp.com\/h20195\/v2\/GetPDF.aspx\/c04383189.pdf. Hewlett Packard Enterprise. 2017. HP Integrity Superdome X. Retrieved January 21, 2018 from http:\/\/www8.hp.com\/h20195\/v2\/GetPDF.aspx\/c04383189.pdf."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/646668.700651"},{"key":"e_1_2_1_26_1","unstructured":"Priya Goyal Piotr Doll\u00e1r Ross Girshick Pieter Noordhuis Lukasz Wesolowski Aapo Kyrola Andrew Tulloch Yangqing Jia and Kaiming He. 2017. Accurate large minibatch SGD: Training imagenet in 1 hour. 
arXiv:1706.02677  Priya Goyal Piotr Doll\u00e1r Ross Girshick Pieter Noordhuis Lukasz Wesolowski Aapo Kyrola Andrew Tulloch Yangqing Jia and Kaiming He. 2017. Accurate large minibatch SGD: Training imagenet in 1 hour. arXiv:1706.02677"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1016\/0167-8191(96)00024-5"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2012.2205597"},{"key":"e_1_2_1_29_1","volume-title":"Proceedings of the 26th International Conference on Neural Information Processing Systems\u2014Volume 1 (NIPS\u201913)","author":"Ho Qirong","unstructured":"Qirong Ho , James Cipar , Henggang Cui , Jin Kyu Kim , Seunghak Lee , Phillip B. Gibbons , Garth A. Gibson , Gregory R. Ganger , and Eric P. Xing . 2013. More effective distributed ML via a stale synchronous parallel parameter server . In Proceedings of the 26th International Conference on Neural Information Processing Systems\u2014Volume 1 (NIPS\u201913) . Curran Associates Inc., 1223--1231. Qirong Ho, James Cipar, Henggang Cui, Jin Kyu Kim, Seunghak Lee, Phillip B. Gibbons, Garth A. Gibson, Gregory R. Ganger, and Eric P. Xing. 2013. More effective distributed ML via a stale synchronous parallel parameter server. In Proceedings of the 26th International Conference on Neural Information Processing Systems\u2014Volume 1 (NIPS\u201913). Curran Associates Inc., 1223--1231."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.284"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3022227.3022323"},{"key":"e_1_2_1_32_1","volume-title":"Retrieved","year":"2017","unstructured":"Intel. 2017 . Intel Math Kernel Library 2017 . Retrieved January 21, 2018 from https:\/\/software.intel.com\/en-us\/articles\/intel-math-kernel-library-intel-mkl-2017-getting-started. Intel. 2017. Intel Math Kernel Library 2017. 
Retrieved January 21, 2018 from https:\/\/software.intel.com\/en-us\/articles\/intel-math-kernel-library-intel-mkl-2017-getting-started."},{"key":"e_1_2_1_33_1","volume-title":"Intel 64 and IA-32 Architectures Software Developer\u2019s Manual","author":"Intel Corporation","unstructured":"Intel Corporation . 2010. Intel 64 and IA-32 Architectures Software Developer\u2019s Manual , Vol. 3B: System Programming Guide, Part 2 , No. 253669-032 (June 2010). Intel Corporation. 2010. Intel 64 and IA-32 Architectures Software Developer\u2019s Manual, Vol. 3B: System Programming Guide, Part 2, No. 253669-032 (June 2010)."},{"key":"e_1_2_1_34_1","volume-title":"Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093.","author":"Jia Yangqing","year":"2014","unstructured":"Yangqing Jia , Evan Shelhamer , Jeff Donahue , Sergey Karayev , Jonathan Long , Ross Girshick , Sergio Guadarrama , and Trevor Darrell . 2014 . Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093. Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093."},{"key":"e_1_2_1_35_1","volume-title":"One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997","author":"Krizhevsky Alex","year":"2014","unstructured":"Alex Krizhevsky . 2014. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997 ( 2014 ). Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997 (2014)."},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the 25th International Conference on Neural Information Processing Systems\u2014Volume 1 (NIPS\u201912)","author":"Krizhevsky Alex","unstructured":"Alex Krizhevsky , Ilya Sutskever , and Geoffrey E. Hinton . 2012. ImageNet classification with deep convolutional neural networks . 
In Proceedings of the 25th International Conference on Neural Information Processing Systems\u2014Volume 1 (NIPS\u201912) . Curran Associates Inc., 1097--1105. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems\u2014Volume 1 (NIPS\u201912). Curran Associates Inc., 1097--1105."},{"key":"e_1_2_1_38_1","unstructured":"Renaud Lachaize Baptiste Lepers and Vivien Qu\u00e9ma. 2012. MemProf: A memory profiler for NUMA multicore systems. In Presented at 2012 USENIX Annual Technical Conference (USENIX ATC\u201912). 53--64.   Renaud Lachaize Baptiste Lepers and Vivien Qu\u00e9ma. 2012. MemProf: A memory profiler for NUMA multicore systems. In Presented at 2012 USENIX Annual Technical Conference (USENIX ATC\u201912). 53--64."},{"key":"e_1_2_1_39_1","unstructured":"John Langford Alexander Smola and Martin Zinkevich. 2009. Slow learners are fast. arXiv:0911.0491.   John Langford Alexander Smola and Martin Zinkevich. 2009. Slow learners are fast. arXiv:0911.0491."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.5555\/2968826.2968829"},{"key":"e_1_2_1_41_1","volume-title":"Retrieved","year":"2015","unstructured":"Linux. 2015 . Perf: Linux profiling with performance counters . Retrieved January 21, 2018 from https:\/\/perf.wiki.kernel.org\/index.php\/Main_Page. Linux. 2015. Perf: Linux profiling with performance counters. Retrieved January 21, 2018 from https:\/\/perf.wiki.kernel.org\/index.php\/Main_Page."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2555243.2555271"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2016.7482086"},{"key":"e_1_2_1_44_1","volume-title":"Proceedings of the ACM SIGPLAN Workshop on Memory System Performance and Correctness.","author":"Luo Hao","year":"2014","unstructured":"Hao Luo , Chen Ding , and Pengcheng Li . 2014 . 
Optimal thread-to-core mapping for pipeline programs . In Proceedings of the ACM SIGPLAN Workshop on Memory System Performance and Correctness. Hao Luo, Chen Ding, and Pengcheng Li. 2014. Optimal thread-to-core mapping for pipeline programs. In Proceedings of the ACM SIGPLAN Workshop on Memory System Performance and Correctness."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018759"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/2259016.2259046"},{"key":"e_1_2_1_48_1","volume-title":"Proceedings of the 22nd International Conference on Neural Information Processing Systems (NIPS\u201909)","author":"Mann Gideon","unstructured":"Gideon Mann , Ryan McDonald , Mehryar Mohri , Nathan Silberman , and Daniel D. Walker . 2009. Efficient large-scale distributed training of conditional maximum entropy models . In Proceedings of the 22nd International Conference on Neural Information Processing Systems (NIPS\u201909) . Curran Associates Inc., 1231--1239. Gideon Mann, Ryan McDonald, Mehryar Mohri, Nathan Silberman, and Daniel D. Walker. 2009. Efficient large-scale distributed training of conditional maximum entropy models. In Proceedings of the 22nd International Conference on Neural Information Processing Systems (NIPS\u201909). Curran Associates Inc., 1231--1239."},{"key":"e_1_2_1_49_1","volume-title":"Jordan","author":"Moritz Philipp","year":"2015","unstructured":"Philipp Moritz , Robert Nishihara , Ion Stoica , and Michael I . Jordan . 2015 . SparkNet : Training deep networks in spark. arXiv:1511.06051. Philipp Moritz, Robert Nishihara, Ion Stoica, and Michael I. Jordan. 2015. SparkNet: Training deep networks in spark. arXiv:1511.06051."},{"key":"e_1_2_1_50_1","volume-title":"Proceedings of the 24th International Conference on Neural Information Processing Systems (NIPS\u201911)","author":"Niu Feng","unstructured":"Feng Niu , Benjamin Recht , Christopher Re , and Stephen J. Wright . 2011. 
HOGWILD! A lock-free approach to parallelizing stochastic gradient descent . In Proceedings of the 24th International Conference on Neural Information Processing Systems (NIPS\u201911) . Curran Associates Inc., 693--701. Feng Niu, Benjamin Recht, Christopher Re, and Stephen J. Wright. 2011. HOGWILD! A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the 24th International Conference on Neural Information Processing Systems (NIPS\u201911). Curran Associates Inc., 693--701."},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/1553374.1553486"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3037697.3037745"},{"key":"e_1_2_1_53_1","volume-title":"Retrieved","author":"SGI.","year":"2015","unstructured":"SGI. 2015 . SGI UV The World\u2019s Most Powerful In-Memory Supercomputers . Retrieved August 01, 2014 from https:\/\/www.sgi.com\/products\/servers\/uv. SGI. 2015. SGI UV The World\u2019s Most Powerful In-Memory Supercomputers. Retrieved August 01, 2014 from https:\/\/www.sgi.com\/products\/servers\/uv."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2851141.2851158"},{"key":"e_1_2_1_56_1","volume-title":"Proceedings of Workshop on Machine Learning Systems (LearningSys) in the 29th Annual Conference on Neural Information Processing Systems (NIPS\u201915)","volume":"5","author":"Tokui Seiya","year":"2015","unstructured":"Seiya Tokui , Kenta Oono , Shohei Hido , and Justin Clayton . 2015 . Chainer: A next-generation open source framework for deep learning . In Proceedings of Workshop on Machine Learning Systems (LearningSys) in the 29th Annual Conference on Neural Information Processing Systems (NIPS\u201915) , Vol. 5 . Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. 2015. Chainer: A next-generation open source framework for deep learning. 
In Proceedings of Workshop on Machine Learning Systems (LearningSys) in the 29th Annual Conference on Neural Information Processing Systems (NIPS\u201915), Vol. 5."},{"key":"e_1_2_1_57_1","volume-title":"Distributed tensorflow with MPI. arXiv:1603.02339","author":"Vishnu Abhinav","year":"2016","unstructured":"Abhinav Vishnu , Charles Siegel , and Jeffrey Daily . 2016. Distributed tensorflow with MPI. arXiv:1603.02339 ( 2016 ). Abhinav Vishnu, Charles Siegel, and Jeffrey Daily. 2016. Distributed tensorflow with MPI. arXiv:1603.02339 (2016)."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/2783258.2783323"},{"key":"e_1_2_1_59_1","unstructured":"Omry Yadan Keith Adams Yaniv Taigman and Marc\u2019Aurelio Ranzato. 2013. Multi-GPU training of convnets. arXiv:1312.5853 .  Omry Yadan Keith Adams Yaniv Taigman and Marc\u2019Aurelio Ranzato. 2013. Multi-GPU training of convnets. arXiv:1312.5853 ."},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306783"},{"key":"e_1_2_1_61_1","unstructured":"Yang You Igor Gitman and Boris Ginsburg. 2017. Scaling SGD batch size to 32k for ImageNet training. arXiv:1708.03888.  Yang You Igor Gitman and Boris Ginsburg. 2017. Scaling SGD batch size to 32k for ImageNet training. arXiv:1708.03888."},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732977.2733001"},{"key":"e_1_2_1_64_1","volume-title":"Poseidon: A system architecture for efficient GPU-based deep learning on multiple machines. arXiv:1512.06216.","author":"Zhang Hao","year":"2015","unstructured":"Hao Zhang , Zhiting Hu , Jinliang Wei , Pengtao Xie , Gunhee Kim , Qirong Ho , and Eric Xing . 2015 . Poseidon: A system architecture for efficient GPU-based deep learning on multiple machines. arXiv:1512.06216. Hao Zhang, Zhiting Hu, Jinliang Wei, Pengtao Xie, Gunhee Kim, Qirong Ho, and Eric Xing. 2015. Poseidon: A system architecture for efficient GPU-based deep learning on multiple machines. 
arXiv:1512.06216."},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/2688500.2688507"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/CCGrid.2015.131"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3199605","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3199605","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3199605","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T19:07:18Z","timestamp":1750273638000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3199605"}},"subtitle":["NUMA-Aware Deep Learning Neural Networks"],"short-title":[],"issued":{"date-parts":[[2018,6,8]]},"references-count":63,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2018,6,30]]}},"alternative-id":["10.1145\/3199605"],"URL":"https:\/\/doi.org\/10.1145\/3199605","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,6,8]]},"assertion":[{"value":"2017-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-03-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-06-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}