{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,21]],"date-time":"2026-01-21T15:13:27Z","timestamp":1769008407632,"version":"3.49.0"},"reference-count":44,"publisher":"SAGE Publications","issue":"6","license":[{"start":{"date-parts":[[2019,8,4]],"date-time":"2019-08-04T00:00:00Z","timestamp":1564876800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"funder":[{"name":"The Exascale Computing Project, a joint project of the U.S. Department of Energy's Office of Science and National Nuclear Security Administration.","award":["17-SC-20-SC"],"award-info":[{"award-number":["17-SC-20-SC"]}]}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2019,11]]},"abstract":"<jats:p> Deep neural networks (DNNs) have demonstrated effectiveness in many domains including object recognition, speech recognition, natural language processing, and health care. Typically, the computations involved in DNN training and inferencing are time consuming and require efficient implementations. Existing frameworks such as TensorFlow, Theano, Torch, Cognitive Tool Kit (CNTK), and Caffe enable Graphics Processing Unit (GPUs) as the status quo devices for DNN execution, leaving Central Processing Unit (CPUs) behind. Moreover, existing frameworks forgo or limit cross layer optimization opportunities that have the potential to improve performance by significantly reducing data movement through the memory hierarchy. In this article, we describe an alternative approach called SWIRL, a compiler that provides high-performance CPU implementations for DNNs. SWIRL is built on top of the existing domain-specific language (DSL) for DNNs called LATTE. 
SWIRL separates the DNN specification from its schedule using predefined transformation recipes for tensors and layers commonly found in DNNs. These recipes synergize with DSL constructs to generate high-quality fused, vectorized, and parallelized code for CPUs. On an Intel Xeon Platinum 8180M CPU, SWIRL achieves performance comparable with TensorFlow integrated with MKL-DNN; on average 1.00\u00d7 of TensorFlow inference and 0.99\u00d7 of TensorFlow training. It also outperforms the original LATTE compiler on average by 1.22\u00d7 and 1.30\u00d7 on inference and training, respectively. <\/jats:p>","DOI":"10.1177\/1094342019866247","type":"journal-article","created":{"date-parts":[[2019,8,5]],"date-time":"2019-08-05T02:57:09Z","timestamp":1564973829000},"page":"1275-1289","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":21,"title":["SWIRL: High-performance many-core CPU code generation for deep neural networks"],"prefix":"10.1177","volume":"33","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4167-4525","authenticated-orcid":false,"given":"Anand","family":"Venkat","sequence":"first","affiliation":[{"name":"Parallel Computing Laboratory, Intel Labs, Santa Clara, CA, USA"}]},{"given":"Tharindu","family":"Rusira","sequence":"additional","affiliation":[{"name":"School of Computing, University of Utah, Salt Lake City, UT, USA"}]},{"given":"Raj","family":"Barik","sequence":"additional","affiliation":[{"name":"Uber Technologies Inc, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3058-7573","authenticated-orcid":false,"given":"Mary","family":"Hall","sequence":"additional","affiliation":[{"name":"School of Computing, University of Utah, Salt Lake City, UT, USA"}]},{"given":"Leonard","family":"Truong","sequence":"additional","affiliation":[{"name":"Computer Science Department, Stanford University, CA, 
USA"}]}],"member":"179","published-online":{"date-parts":[[2019,8,4]]},"reference":[{"key":"bibr1-1094342019866247","unstructured":"Abadi M, Agarwal A, Barham P, et al. (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Available at: http:\/\/tensorflow.org\/ (accessed 6 January 2019)."},{"key":"bibr2-1094342019866247","unstructured":"Agarwal A, Akchurin E, Basoglu C, et al. (2014) An Introduction to Computational Networks and the Computational Network Toolkit. Technical Report MSR-TR-2014-112. Available at: http:\/\/research.microsoft.com\/apps\/pubs\/default.aspx?id=226641."},{"key":"bibr3-1094342019866247","volume-title":"8th Biennial Conference on Innovative Data Systems Research (CIDR), CIDR \u201917","author":"Alkar S","year":"2017"},{"key":"bibr4-1094342019866247","volume-title":"Proceedings of the Python for Scientific Computing Conference (SciPy)","author":"Bergstra J","year":"2010"},{"issue":"1","key":"bibr5-1094342019866247","first-page":"1","volume":"1","author":"Catanzaro B","year":"2009","journal-title":"Programming Models for Emerging Architectures"},{"key":"bibr6-1094342019866247","doi-asserted-by":"publisher","DOI":"10.1145\/2038037.1941561"},{"key":"bibr7-1094342019866247","unstructured":"Chen T, Moreau T, Jiang Z, et al. (2018) Tvm: end-to-end optimization stack for deep learning. arXiv preprint arXiv:1802.04799."},{"key":"bibr8-1094342019866247","unstructured":"Chetlur S, Woolley C, Vandermersch P, et al. (2014) cuDNN: efficient primitives for deep learning. CoRR abs\/1410.0759. Available at: http:\/\/arxiv.org\/abs\/1410.0759."},{"key":"bibr9-1094342019866247","unstructured":"Chintala S (2015) Convnet Benchmarks. Available at: https:\/\/github.com\/soumith\/convnet-benchmarks (accessed 14 March 2019)."},{"key":"bibr10-1094342019866247","unstructured":"Collobert R, Kavukcuoglu K, Farabet C (2011) Torch7: a MATLAB-like environment for machine learning. 
In: BigLearn, NIPS Workshop, EPFL-CONF-192376, 2011."},{"key":"bibr11-1094342019866247","first-page":"136","volume-title":"Workshop on Languages and Compilers for Parallel Computing (LCPC)","author":"Donadio S","year":"2005"},{"key":"bibr12-1094342019866247","unstructured":"Dukhan M (2016) NNPACK. Available at: https:\/\/github.com\/Maratyszcza\/NNPACK (accessed 14 March 2019)."},{"key":"bibr13-1094342019866247","unstructured":"Google (2011) Improving the speed of neural networks on CPUs. Available at: https:\/\/research.google.com\/pubs\/pub37631.html (accessed 14 March 2019)."},{"key":"bibr14-1094342019866247","unstructured":"Google (2016) TensorFlow XLA. Available at: https:\/\/www.tensorflow.org\/versions\/master\/experimental\/xla\/ (accessed 14 March 2019)."},{"key":"bibr15-1094342019866247","first-page":"50","volume-title":"Proceedings of the 22nd International Workshop on Languages and Compilers for Parallel Computing","author":"Hall MW","year":"2009"},{"key":"bibr16-1094342019866247","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2009.5161004"},{"key":"bibr17-1094342019866247","doi-asserted-by":"publisher","DOI":"10.1038\/nature23463"},{"key":"bibr18-1094342019866247","unstructured":"Intel (2018) Intel mkl-dnn. Available at: https:\/\/github.com\/01org\/mkl-dnn (accessed 14 March 2019)."},{"key":"bibr19-1094342019866247","doi-asserted-by":"crossref","unstructured":"Jia Y, Shelhamer E, Donahue J, et al. (2014) Caffe: convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.","DOI":"10.1145\/2647868.2654889"},{"key":"bibr20-1094342019866247","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2014.194"},{"key":"bibr21-1094342019866247","first-page":"1097","volume":"1","author":"Krizhevsky A","year":"2012","journal-title":"Advances in Neural Information Processing Systems"},{"key":"bibr22-1094342019866247","doi-asserted-by":"crossref","unstructured":"Kurth T, Zhang J, Satish N, et al. 
(2017) Deep learning at 15PF: supervised and semi-supervised classification for scientific data. In: Proceedings of the international conference for high performance computing, networking, storage and analysis, SC \u201917, New York, NY, USA, pp. 7:1\u20137:11. ACM. ISBN 978-1-4503-5114-0, DOI:10.1145\/3126908.3126916. Available at: http:\/\/doi.acm.org\/10.1145\/3126908.3126916.","DOI":"10.1145\/3126908.3126916"},{"key":"bibr23-1094342019866247","unstructured":"Latte (2016) Latte. Available at: https:\/\/github.com\/IntelLabs\/Latte.jl (accessed 14 March 2019)."},{"key":"bibr24-1094342019866247","unstructured":"Lavin A, Gray S (2015) Fast algorithms for convolutional neural networks. CoRR abs\/1509.09308. Available at: http:\/\/arxiv.org\/abs\/1509.09308."},{"key":"bibr25-1094342019866247","unstructured":"Liu Y, Racah E, Prabhat, et al. (2016) Application of deep convolutional neural networks for detecting extreme weather in climate datasets. CoRR abs\/1605.01156. Available at: http:\/\/arxiv.org\/abs\/1605.01156."},{"key":"bibr26-1094342019866247","unstructured":"Mathieu M, Henaff M, LeCun Y (2013) Fast training of convolutional networks through FFTs. CoRR abs\/1312.5851. Available at: http:\/\/arxiv.org\/abs\/1312.5851."},{"key":"bibr27-1094342019866247","unstructured":"MathWorks (2018) im2col in MATLAB. Available at: https:\/\/www.mathworks.com\/help\/images\/ref\/im2col.html (accessed 14 March 2019)."},{"key":"bibr28-1094342019866247","doi-asserted-by":"publisher","DOI":"10.1109\/HPCSim.2016.7568443"},{"key":"bibr29-1094342019866247","unstructured":"Nervana (2016) NEON. Available at: https:\/\/github.com\/NervanaSystems\/neon (accessed 14 March 2019)."},{"key":"bibr30-1094342019866247","unstructured":"NVIDIA (2016) NVIDIA GPU Inference Engine. 
Available at: www.devblogs.nvidia.com\/production-deep-learning-nvidia-gpu-inference-engine (accessed 14 March 2019)."},{"key":"bibr31-1094342019866247","doi-asserted-by":"publisher","DOI":"10.1145\/2499370.2462176"},{"key":"bibr32-1094342019866247","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"bibr33-1094342019866247","unstructured":"Sermanet P, Eigen D, Zhang X, et al. (2013) OverFeat: integrated recognition, localization and detection using convolutional networks. CoRR abs\/1312.6229. Available at: http:\/\/arxiv.org\/abs\/1312.6229."},{"key":"bibr34-1094342019866247","doi-asserted-by":"publisher","DOI":"10.1016\/j.matchar.2018.05.053"},{"key":"bibr35-1094342019866247","doi-asserted-by":"publisher","DOI":"10.1038\/nature16961"},{"key":"bibr36-1094342019866247","unstructured":"Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs\/1409.1556."},{"key":"bibr37-1094342019866247","unstructured":"Szegedy C, Liu W, Jia Y, et al. (2014) Going deeper with convolutions. CoRR abs\/1409.4842. Available at: http:\/\/arxiv.org\/abs\/1409.4842."},{"key":"bibr38-1094342019866247","doi-asserted-by":"crossref","unstructured":"Teixeira TSFX, Ancourt C, Padua D, et al. (2019) Locus: a system and a language for program optimization. In: Proceedings of the 2019 IEEE\/ACM international symposium on code generation and optimization, CGO 2019, Piscataway, NJ, USA, pp. 217\u2013228. Available at: http:\/\/dl.acm.org\/citation.cfm?id=3314872.3314898.","DOI":"10.1109\/CGO.2019.8661203"},{"key":"bibr39-1094342019866247","doi-asserted-by":"crossref","unstructured":"Truong L, Barik R, Totoni E, et al. (2016) Latte: a language, compiler, and runtime for elegant and efficient deep neural networks. In: Proceedings of the 37th ACM SIGPLAN conference on programming language design and implementation, PLDI \u201816, New York, NY, USA, pp. 209\u2013223. ACM. ISBN 978-1-4503-4261-2, DOI:10.1145\/2908080.2908105. 
Available at: http:\/\/doi.acm.org\/10.1145\/2908080.2908105.","DOI":"10.1145\/2908080.2908105"},{"key":"bibr40-1094342019866247","unstructured":"UCB-SEJITS (2017) Ctree. Available at: https:\/\/github.com\/ucb-sejits\/ctree (accessed 14 March 2019)."},{"key":"bibr41-1094342019866247","unstructured":"Vasilache N, Zinenko O, Theodoridis T, et al. (2018) Tensor comprehensions: framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730."},{"key":"bibr42-1094342019866247","unstructured":"Warden P (2015) Why GEMM is at the heart of deep learning. Available at: https:\/\/petewarden.com\/2015\/04\/20\/why-gemm-is-at-the-heart-of-deep-learning\/ (accessed 14 March 2019)."},{"key":"bibr43-1094342019866247","unstructured":"Zhang C (2015) Mocha.jl. Available at: https:\/\/github.com\/pluskid\/Mocha.jl (accessed 14 March 2019)."},{"key":"bibr44-1094342019866247","doi-asserted-by":"crossref","unstructured":"Zlateski A, Lee K, Seung HS (2015) ZNN - a fast and scalable algorithm for training 3d convolutional networks on multi-core and many-core shared memory machines. CoRR abs\/1510.06706. 
Available at: http:\/\/arxiv.org\/abs\/1510.06706.","DOI":"10.1109\/IPDPS.2016.119"}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342019866247","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/1094342019866247","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342019866247","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,1]],"date-time":"2025-03-01T10:14:36Z","timestamp":1740824076000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1094342019866247"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,8,4]]},"references-count":44,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2019,11]]}},"alternative-id":["10.1177\/1094342019866247"],"URL":"https:\/\/doi.org\/10.1177\/1094342019866247","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,8,4]]}}}