{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,5]],"date-time":"2026-06-05T15:34:32Z","timestamp":1780673672926,"version":"3.54.1"},"reference-count":76,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2019,10,11]],"date-time":"2019-10-11T00:00:00Z","timestamp":1570752000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Facebook to ETH Z\u00fcrich"},{"name":"European Commission through the MNEMOSENE project","award":["780215"],"award-info":[{"award-number":["780215"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2019,12,31]]},"abstract":"<jats:p>Deep learning frameworks automate the deployment, distribution, synchronization, memory allocation, and hardware acceleration of models represented as graphs of computational operators. These operators wrap high-performance libraries such as cuDNN or NNPACK. When the computation does not match any predefined library call, custom operators must be implemented, often at high engineering cost and performance penalty, limiting the pace of innovation. To address this productivity gap, we propose and evaluate: (1) a domain-specific language with a tensor notation close to the mathematics of deep learning; (2) a Just-In-Time optimizing compiler based on the polyhedral framework; (3) carefully coordinated linear optimization and evolutionary algorithms to synthesize high-performance CUDA kernels; (4) the transparent integration of our flow into PyTorch and Caffe2, providing the fully automatic synthesis of high-performance GPU kernels from simple tensor algebra. 
The performance is comparable to, and often exceeds the performance of, highly tuned libraries.<\/jats:p>","DOI":"10.1145\/3355606","type":"journal-article","created":{"date-parts":[[2019,10,11]],"date-time":"2019-10-11T14:53:33Z","timestamp":1570805613000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":38,"title":["The Next 700 Accelerated Layers"],"prefix":"10.1145","volume":"16","author":[{"given":"Nicolas","family":"Vasilache","sequence":"first","affiliation":[{"name":"Facebook AI Research, NY, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1978-0222","authenticated-orcid":false,"given":"Oleksandr","family":"Zinenko","sequence":"additional","affiliation":[{"name":"Inria and ENS, Paris, France"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Theodoros","family":"Theodoridis","sequence":"additional","affiliation":[{"name":"ETH Z\u00fcrich, Z\u00fcrich, Switzerland"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Priya","family":"Goyal","sequence":"additional","affiliation":[{"name":"Facebook AI Research, New York City, NY, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zachary","family":"Devito","sequence":"additional","affiliation":[{"name":"Facebook AI Research, Menlo Park, CA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"William S.","family":"Moses","sequence":"additional","affiliation":[{"name":"MIT CSAIL, Cambridge, MA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Sven","family":"Verdoolaege","sequence":"additional","affiliation":[{"name":"Polly Labs 8 Facebook AI Research, Leuven, Belgium"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Andrew","family":"Adams","sequence":"additional","affiliation":[{"name":"Facebook AI Research, Menlo Park, CA, 
USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8866-5343","authenticated-orcid":false,"given":"Albert","family":"Cohen","sequence":"additional","affiliation":[{"name":"Inria, ENS and Facebook AI Research, Paris, France"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2019,10,11]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916)","volume":"16","author":"Abadi Mart\u00edn","unstructured":"Mart\u00edn Abadi , Paul Barham , Jianmin Chen , Zhifeng Chen , Andy Davis , Jeffrey Dean , Matthieu Devin , Sanjay Ghemawat , Geoffrey Irving , Michael Isard et al. 2016. TensorFlow: A system for large-scale machine learning . In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916) , Vol. 16 . 265--283. Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard et al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916), Vol. 16. 265--283."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2006.37"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/109625.109631"},{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the Conference on Machine Learning on HPC Environments (MLHPC\u201917)","author":"Awan Ammar Ahmad","unstructured":"Ammar Ahmad Awan , Hari Subramoni , and Dhabaleswar K. Panda . 2017. An in-depth performance characterization of CPU- and GPU-based DNN training on modern architectures . In Proceedings of the Conference on Machine Learning on HPC Environments (MLHPC\u201917) . ACM, New York, NY, Article 8, 8 pages. 
DOI:https:\/\/doi.org\/10.1145\/3146347.3146356 10.1145\/3146347.3146356 Ammar Ahmad Awan, Hari Subramoni, and Dhabaleswar K. Panda. 2017. An in-depth performance characterization of CPU- and GPU-based DNN training on modern architectures. In Proceedings of the Conference on Machine Learning on HPC Environments (MLHPC\u201917). ACM, New York, NY, Article 8, 8 pages. DOI:https:\/\/doi.org\/10.1145\/3146347.3146356"},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the International Conference on Parallel Architecture and Compilation (PACT\u201915)","author":"Baghdadi R.","year":"2015","unstructured":"R. Baghdadi , U. Beaugnon , A. Cohen , T. Grosser , M. Kruse , C. Reddy , S. Verdoolaege , A. Betts , A. F. Donaldson , J. Ketema , J. Absar , S. V. Haastregt , A. Kravets , A. Lokhmotov , R. David , and E. Hajiyev . 2015. PENCIL: A platform-neutral compute intermediate language for accelerator programming . In Proceedings of the International Conference on Parallel Architecture and Compilation (PACT\u201915) . 138--149. DOI:https:\/\/doi.org\/10.1109\/PACT. 2015 .17 10.1109\/PACT.2015.17 R. Baghdadi, U. Beaugnon, A. Cohen, T. Grosser, M. Kruse, C. Reddy, S. Verdoolaege, A. Betts, A. F. Donaldson, J. Ketema, J. Absar, S. V. Haastregt, A. Kravets, A. Lokhmotov, R. David, and E. Hajiyev. 2015. PENCIL: A platform-neutral compute intermediate language for accelerator programming. In Proceedings of the International Conference on Parallel Architecture and Compilation (PACT\u201915). 138--149. DOI:https:\/\/doi.org\/10.1109\/PACT.2015.17"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2854038.2854048"},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the 22nd International Conference on Supercomputing (ICS\u201908)","author":"Baskaran Muthu Manikandan","unstructured":"Muthu Manikandan Baskaran , Uday Bondhugula , Sriram Krishnamoorthy , J. Ramanujam , Atanas Rountev , and P. Sadayappan . 2008. 
A compiler framework for optimization of affine loop nests for GPGPUs . In Proceedings of the 22nd International Conference on Supercomputing (ICS\u201908) . ACM, New York, NY, 225--234. DOI:https:\/\/doi.org\/10.1145\/1375527.1375562 10.1145\/1375527.1375562 Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan. 2008. A compiler framework for optimization of affine loop nests for GPGPUs. In Proceedings of the 22nd International Conference on Supercomputing (ICS\u201908). ACM, New York, NY, 225--234. DOI:https:\/\/doi.org\/10.1145\/1375527.1375562"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.5555\/1025127.1025992"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2597809.2597818"},{"key":"e_1_2_1_10_1","volume-title":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC\u201909)","author":"Belter Geoffrey","unstructured":"Geoffrey Belter , E. R. Jessup , Ian Karlin , and Jeremy G. Siek . 2009. Automating the generation of composed linear algebra kernels . In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC\u201909) . ACM, New York, NY, Article 59, 12 pages. DOI:https:\/\/doi.org\/10.1145\/1654059.1654119 10.1145\/1654059.1654119 Geoffrey Belter, E. R. Jessup, Ian Karlin, and Jeremy G. Siek. 2009. Automating the generation of composed linear algebra kernels. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC\u201909). ACM, New York, NY, Article 59, 12 pages. DOI:https:\/\/doi.org\/10.1145\/1654059.1654119"},{"key":"e_1_2_1_11_1","series-title":"Lecture Notes in Computer Science","volume-title":"The polyhedral model is more widely applicable than you think","author":"Benabderrahmane Mohamed-Walid","unstructured":"Mohamed-Walid Benabderrahmane , Louis-No\u00ebl Pouchet , Albert Cohen , and C\u00e9dric Bastoul . 2010. 
The polyhedral model is more widely applicable than you think . In Compiler Construction, Rajiv Gupta (Ed.), Vol. 6011 , Lecture Notes in Computer Science .Springer, 283--303. Mohamed-Walid Benabderrahmane, Louis-No\u00ebl Pouchet, Albert Cohen, and C\u00e9dric Bastoul. 2010. The polyhedral model is more widely applicable than you think. In Compiler Construction, Rajiv Gupta (Ed.), Vol. 6011, Lecture Notes in Computer Science.Springer, 283--303."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2896389"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1854273.1854317"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1375581.1375595"},{"key":"e_1_2_1_16_1","unstructured":"Tianqi Chen Mu Li Yutian Li Min Lin Naiyan Wang Minjie Wang Tianjun Xiao Bing Xu Chiyuan Zhang and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. Retrieved from: http:\/\/arxiv.org\/abs\/1512.01274.  Tianqi Chen Mu Li Yutian Li Min Lin Naiyan Wang Minjie Wang Tianjun Xiao Bing Xu Chiyuan Zhang and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. Retrieved from: http:\/\/arxiv.org\/abs\/1512.01274."},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918)","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen , Thierry Moreau , Ziheng Jiang , Lianmin Zheng , Eddie Yan , Haichen Shen , Meghan Cowan , Leyuan Wang , Yuwei Hu , Luis Ceze , Carlos Guestrin , and Arvind Krishnamurthy . 2018 . TVM: An automated end-to-end optimizing compiler for deep learning . In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918) . USENIX Association, 578--594. Retrieved from https:\/\/www.usenix.org\/conference\/osdi18\/presentation\/chen. 
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918). USENIX Association, 578--594. Retrieved from https:\/\/www.usenix.org\/conference\/osdi18\/presentation\/chen."},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the Conference on Advances in Neural Information Processing Systems","author":"Chen Tianqi","unstructured":"Tianqi Chen , Lianmin Zheng , Eddie Yan , Ziheng Jiang , Thierry Moreau , Luis Ceze , Carlos Guestrin , and Arvind Krishnamurthy . 2018. Learning to optimize tensor programs . In Proceedings of the Conference on Advances in Neural Information Processing Systems , S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc. , 3389--3400. Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to optimize tensor programs. In Proceedings of the Conference on Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 3389--3400."},{"key":"e_1_2_1_19_1","doi-asserted-by":"crossref","unstructured":"R. Collobert K. Kavukcuoglu and C. Farabet. 2012. Implementing neural networks efficiently. In Neural Networks: Tricks of the Trade G. Montavon G. Orr and K.-R. Muller (Eds.). Springer.  R. Collobert K. Kavukcuoglu and C. Farabet. 2012. Implementing neural networks efficiently. In Neural Networks: Tricks of the Trade G. Montavon G. Orr and K.-R. Muller (Eds.). 
Springer.","DOI":"10.1007\/978-3-642-35289-8_28"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3211346.3211354"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000108"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01379404"},{"key":"e_1_2_1_23_1","volume-title":"Encyclopedia of Parallel Computing","author":"Feautrier Paul","unstructured":"Paul Feautrier and Christian Lengauer . 2011. Polyhedron model . In Encyclopedia of Parallel Computing , David Padua (Ed.). Springer , 1581--1592. Paul Feautrier and Christian Lengauer. 2011. Polyhedron model. In Encyclopedia of Parallel Computing, David Padua (Ed.). Springer, 1581--1592."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2012.05.002"},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing","volume":"3","author":"Frigo Matteo","unstructured":"Matteo Frigo and Steven G. Johnson . 1998. FFTW: An adaptive software architecture for the FFT . In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing , Vol. 3 . IEEE, 1381--1384. Matteo Frigo and Steven G. Johnson. 1998. FFTW: An adaptive software architecture for the FFT. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 3. IEEE, 1381--1384."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-006-0012-3"},{"key":"e_1_2_1_27_1","volume-title":"Genetic Algorithms in Search, Optimization and Machine Learning","author":"Goldberg David E.","unstructured":"David E. Goldberg . 1989. Genetic Algorithms in Search, Optimization and Machine Learning ( 1 st ed.). Addison-Wesley Longman Publishing Co., Inc. , Boston, MA . David E. Goldberg. 1989. Genetic Algorithms in Search, Optimization and Machine Learning (1st ed.). 
Addison-Wesley Longman Publishing Co., Inc., Boston, MA.","edition":"1"},{"key":"e_1_2_1_28_1","unstructured":"Google 2017. XLA: Domain-Specific Compiler for Linear Algebra to Optimize TensorFlow Computations. Retrieved from https:\/\/www.tensorflow.org\/performance\/xla.  Google 2017. XLA: Domain-Specific Compiler for Linear Algebra to Optimize TensorFlow Computations. Retrieved from https:\/\/www.tensorflow.org\/performance\/xla."},{"key":"e_1_2_1_29_1","unstructured":"Priya Goyal Piotr Doll\u00e1r Ross B. Girshick Pieter Noordhuis Lukasz Wesolowski Aapo Kyrola Andrew Tulloch Yangqing Jia and Kaiming He. 2017. Accurate large minibatch SGD: Training ImageNet in 1 hour. Retrieved from http:\/\/arxiv.org\/abs\/1706.02677.  Priya Goyal Piotr Doll\u00e1r Ross B. Girshick Pieter Noordhuis Lukasz Wesolowski Aapo Kyrola Andrew Tulloch Yangqing Jia and Kaiming He. 2017. Accurate large minibatch SGD: Training ImageNet in 1 hour. Retrieved from http:\/\/arxiv.org\/abs\/1706.02677."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1142\/S0129626412500107"},{"key":"e_1_2_1_31_1","volume-title":"The Future of Computing. Google I\/O presentation. Retrieved on","author":"Hennessy John","year":"2018","unstructured":"John Hennessy . 2018. The Future of Computing. Google I\/O presentation. Retrieved on May 2018 from https:\/\/www.youtube.com\/watch?v=Azt8Nc-mtKM. John Hennessy. 2018. The Future of Computing. Google I\/O presentation. Retrieved on May 2018 from https:\/\/www.youtube.com\/watch?v=Azt8Nc-mtKM."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/73560.73588"},{"key":"e_1_2_1_33_1","unstructured":"Cijo Jose Moustapha Cisse and Fran\u00e7ois Fleuret. 2017. Kronecker recurrent units. Retrieved from http:\/\/arxiv.org\/abs\/1705.10142.  Cijo Jose Moustapha Cisse and Fran\u00e7ois Fleuret. 2017. Kronecker recurrent units. 
Retrieved from http:\/\/arxiv.org\/abs\/1705.10142."},{"key":"e_1_2_1_34_1","volume-title":"Proceedings of the 44th International Symposium on Computer Architecture (ISCA\u201917)","author":"Norman","unstructured":"Norman P. Jouppi et al. 2017. In-datacenter performance analysis of a tensor processing unit . In Proceedings of the 44th International Symposium on Computer Architecture (ISCA\u201917) . 1--12. DOI:https:\/\/doi.org\/10.1145\/3079856.3080246 10.1145\/3079856.3080246 Norman P. Jouppi et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th International Symposium on Computer Architecture (ISCA\u201917). 1--12. DOI:https:\/\/doi.org\/10.1145\/3079856.3080246"},{"key":"e_1_2_1_35_1","volume-title":"Allen","author":"Kennedy Ken","year":"2002","unstructured":"Ken Kennedy and John R . Allen . 2002 . Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers , Inc., San Francisco, CA. Ken Kennedy and John R. Allen. 2002. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers, Inc., San Francisco, CA."},{"key":"e_1_2_1_36_1","volume-title":"CUTLASS: Fast Linear Algebra in CUDA C++.","author":"Kerr Andrew","year":"2017","unstructured":"Andrew Kerr , Duane Merrill , Julien Demouth , and John Tran . 2017 . CUTLASS: Fast Linear Algebra in CUDA C++. Retrieved from https:\/\/devblogs.nvidia.com\/cutlass-linear-algebra-cuda\/. Andrew Kerr, Duane Merrill, Julien Demouth, and John Tran. 2017. CUTLASS: Fast Linear Algebra in CUDA C++. Retrieved from https:\/\/devblogs.nvidia.com\/cutlass-linear-algebra-cuda\/."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3133901"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2866569"},{"key":"e_1_2_1_39_1","unstructured":"Martin Kong and Louis-No\u00ebl Pouchet. 2018. A performance vocabulary for affine loop transformations. 
Retrieved from: http:\/\/arxiv.org\/abs\/1811.06043.  Martin Kong and Louis-No\u00ebl Pouchet. 2018. A performance vocabulary for affine loop transformations. Retrieved from: http:\/\/arxiv.org\/abs\/1811.06043."},{"key":"e_1_2_1_40_1","volume-title":"Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI\u201913)","author":"Kong Martin","unstructured":"Martin Kong , Richard Veras , Kevin Stock , Franz Franchetti , Louis-No\u00ebl Pouchet , and P. Sadayappan . 2013. When polyhedral transformations meet SIMD code generation . In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI\u201913) . 127--138. DOI:https:\/\/doi.org\/10.1145\/2462156.2462187 10.1145\/2462156.2462187 Martin Kong, Richard Veras, Kevin Stock, Franz Franchetti, Louis-No\u00ebl Pouchet, and P. Sadayappan. 2013. When polyhedral transformations meet SIMD code generation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI\u201913). 127--138. DOI:https:\/\/doi.org\/10.1145\/2462156.2462187"},{"key":"e_1_2_1_41_1","volume-title":"Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS\u201989)","author":"LeCun Yann","unstructured":"Yann LeCun , Bernhard E. Boser , John S. Denker , Donnie Henderson , Richard E. Howard , Wayne E. Hubbard , and Lawrence D. Jackel . 1989. Handwritten digit recognition with a back-propagation network . In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS\u201989) . 396--404. Retrieved from http:\/\/papers.nips.cc\/paper\/293-handwritten-digit-recognition-with-a-back-propagation-network. Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. 1989. Handwritten digit recognition with a back-propagation network. 
In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS\u201989). 396--404. Retrieved from http:\/\/papers.nips.cc\/paper\/293-handwritten-digit-recognition-with-a-back-propagation-network."},{"key":"e_1_2_1_42_1","volume-title":"Proceedings of the 15th ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications (OOPSLA\u201900)","author":"Luj\u00e1n Mikel","unstructured":"Mikel Luj\u00e1n , T. L. Freeman , and John R. Gurd . 2000. OoLALA: An object oriented analysis and design of numerical linear algebra . In Proceedings of the 15th ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications (OOPSLA\u201900) . ACM, New York, NY, 229--252. DOI:https:\/\/doi.org\/10.1145\/353171.353187 10.1145\/353171.353187 Mikel Luj\u00e1n, T. L. Freeman, and John R. Gurd. 2000. OoLALA: An object oriented analysis and design of numerical linear algebra. In Proceedings of the 15th ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications (OOPSLA\u201900). ACM, New York, NY, 229--252. DOI:https:\/\/doi.org\/10.1145\/353171.353187"},{"key":"e_1_2_1_43_1","volume-title":"Allen Leung, and Richard Lethin.","author":"Meister Benoit","year":"2011","unstructured":"Benoit Meister , Nicolas Vasilache , David Wohlford , Muthu Manikandan Baskaran , Allen Leung, and Richard Lethin. 2011 . R-Stream Compiler. Springer , Boston, MA, 1756--1765. DOI:https:\/\/doi.org\/10.1007\/978-0-387-09766-4_515 10.1007\/978-0-387-09766-4_515 Benoit Meister, Nicolas Vasilache, David Wohlford, Muthu Manikandan Baskaran, Allen Leung, and Richard Lethin. 2011. R-Stream Compiler. Springer, Boston, MA, 1756--1765. DOI:https:\/\/doi.org\/10.1007\/978-0-387-09766-4_515"},{"key":"e_1_2_1_44_1","unstructured":"Microsoft 2017. Microsoft Unveils Project Brainwave for Real-time AI. Retrieved from https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-unveils-project-brainwave.  
Microsoft 2017. Microsoft Unveils Project Brainwave for Real-time AI. Retrieved from https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-unveils-project-brainwave."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897824.2925952"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/2694344.2694364"},{"key":"e_1_2_1_47_1","unstructured":"Nvidia 2017. Deploying Deep Neural Networks with Nvidia TensorRT. Retrieved from https:\/\/devblogs.nvidia.com\/parallelforall\/deploying-deep-learning-nvidia-tensorrt.  Nvidia 2017. Deploying Deep Neural Networks with Nvidia TensorRT. Retrieved from https:\/\/devblogs.nvidia.com\/parallelforall\/deploying-deep-learning-nvidia-tensorrt."},{"key":"e_1_2_1_48_1","volume-title":"NIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques","author":"Paszke Adam","year":"2017","unstructured":"Adam Paszke , Sam Gross , Soumith Chintala , Gregory Chanan , Edward Yang , Zachary DeVito , Zeming Lin , Alban Desmaison , Luca Antiga , and Adam Lerer . 2017 . Automatic differentiation in PyTorch . In NIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques , Long Beach, CA. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques, Long Beach, CA."},{"key":"e_1_2_1_49_1","unstructured":"PlaidML 2018. PlaidML. Retrieved from https:\/\/www.intel.ai\/plaidml\/#gs.bBu0cF8W.  PlaidML 2018. PlaidML. 
Retrieved from https:\/\/www.intel.ai\/plaidml\/#gs.bBu0cF8W."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/1926385.1926449"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/2435264.2435273"},{"key":"e_1_2_1_52_1","volume-title":"Proceedings of the 6th Workshop on Extreme-scale Programming Tools (ESPT\u201917","author":"Pradelle Benoit","year":"2017","unstructured":"Benoit Pradelle , Benoit Meister , Muthu Baskaran , Jonathan Springer , and Richard Lethin . 2017 . Polyhedral optimization of TensorFlow computation graphs . In Proceedings of the 6th Workshop on Extreme-scale Programming Tools (ESPT\u201917 , associated with SC\u201917). Benoit Pradelle, Benoit Meister, Muthu Baskaran, Jonathan Springer, and Richard Lethin. 2017. Polyhedral optimization of TensorFlow computation graphs. In Proceedings of the 6th Workshop on Extreme-scale Programming Tools (ESPT\u201917, associated with SC\u201917)."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/183432.183525"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342004041291"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2491956.2462176"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2010.127"},{"key":"e_1_2_1_57_1","unstructured":"Nvidia Research. [n.d.]. CUB Documentation. Version 1.8.0. Retrieved from: https:\/\/nvlabs.github.io\/cub.  Nvidia Research. [n.d.]. CUB Documentation. Version 1.8.0. Retrieved from: https:\/\/nvlabs.github.io\/cub."},{"key":"e_1_2_1_58_1","volume-title":"Glow: Graph lowering compiler techniques for neural networks.","author":"Rotem Nadav","year":"2018","unstructured":"Nadav Rotem , Jordan Fix , Saleem Abdulrasool , Summer Deng , Roman Dzhabarov , James Hegeman , Roman Levenstein , Bert Maher , Nadathur Satish , Jakob Olesen , Jongsoo Park , Artem Rakhov , and Misha Smelyanskiy . 2018 . Glow: Graph lowering compiler techniques for neural networks. 
Retrieved from http:\/\/arxiv.org\/abs\/1805.00907. Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov, James Hegeman, Roman Levenstein, Bert Maher, Nadathur Satish, Jakob Olesen, Jongsoo Park, Artem Rakhov, and Misha Smelyanskiy. 2018. Glow: Graph lowering compiler techniques for neural networks. Retrieved from http:\/\/arxiv.org\/abs\/1805.00907."},{"key":"e_1_2_1_59_1","volume-title":"Facebook f8 presentation at McEnery Convention Center","author":"Schroepfer Mike","year":"2018","unstructured":"Mike Schroepfer . 2018. Day 2 Keynote . Facebook f8 presentation at McEnery Convention Center , San Jose, CA . Retrieved from https:\/\/developers.facebook.com\/videos\/f8- 2018 \/f8-2018-day-2-keynote\/. Mike Schroepfer. 2018. Day 2 Keynote. Facebook f8 presentation at McEnery Convention Center, San Jose, CA. Retrieved from https:\/\/developers.facebook.com\/videos\/f8-2018\/f8-2018-day-2-keynote\/."},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/351397.351408"},{"key":"e_1_2_1_61_1","volume-title":"Theano: A Python framework for fast computation of mathematical expressions.","author":"Team Theano Development","year":"2016","unstructured":"Theano Development Team . 2016 . Theano: A Python framework for fast computation of mathematical expressions. Retrieved from http:\/\/arxiv.org\/abs\/1605.02688. Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. Retrieved from http:\/\/arxiv.org\/abs\/1605.02688."},{"key":"e_1_2_1_62_1","volume-title":"Proceedings of the GCC Research Opportunities Workshop (GROW\u201910)","author":"Trifunovic Konrad","year":"2010","unstructured":"Konrad Trifunovic , Albert Cohen , David Edelsohn , Feng Li , Tobias Grosser , Harsha Jagasia , Razya Ladelsky , Sebastian Pop , Jan Sj\u00f6din , and Ramakrishna Upadrasta . 2010 . GRAPHITE two years after: First lessons learned from real-world polyhedral compilation . 
In Proceedings of the GCC Research Opportunities Workshop (GROW\u201910) . Konrad Trifunovic, Albert Cohen, David Edelsohn, Feng Li, Tobias Grosser, Harsha Jagasia, Razya Ladelsky, Sebastian Pop, Jan Sj\u00f6din, and Ramakrishna Upadrasta. 2010. GRAPHITE two years after: First lessons learned from real-world polyhedral compilation. In Proceedings of the GCC Research Opportunities Workshop (GROW\u201910)."},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/2908080.2908105"},{"key":"e_1_2_1_64_1","unstructured":"A\u00e4ron van den Oord Sander Dieleman Heiga Zen Karen Simonyan Oriol Vinyals Alex Graves Nal Kalchbrenner Andrew W. Senior and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. Retrieved from http:\/\/arxiv.org\/abs\/1609.03499.  A\u00e4ron van den Oord Sander Dieleman Heiga Zen Karen Simonyan Oriol Vinyals Alex Graves Nal Kalchbrenner Andrew W. Senior and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. Retrieved from http:\/\/arxiv.org\/abs\/1609.03499."},{"key":"e_1_2_1_65_1","unstructured":"Nicolas Vasilache Jeff Johnson Micha\u00ebl Mathieu Soumith Chintala Serkan Piantino and Yann LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. Retrieved from http:\/\/arxiv.org\/abs\/1412.7580.  Nicolas Vasilache Jeff Johnson Micha\u00ebl Mathieu Soumith Chintala Serkan Piantino and Yann LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. Retrieved from http:\/\/arxiv.org\/abs\/1412.7580."},{"key":"e_1_2_1_66_1","volume-title":"Proceedings of the 2nd International Workshop on Polyhedral Compilation Techniques.","author":"Vasilache Nicolas","year":"2012","unstructured":"Nicolas Vasilache , Beno\u00eet Meister , Muthu Baskaran , and Richard Lethin . 2012 . Joint scheduling and layout optimization to enable multi-level vectorization . In Proceedings of the 2nd International Workshop on Polyhedral Compilation Techniques. 
Nicolas Vasilache, Beno\u00eet Meister, Muthu Baskaran, and Richard Lethin. 2012. Joint scheduling and layout optimization to enable multi-level vectorization. In Proceedings of the 2nd International Workshop on Polyhedral Compilation Techniques."},{"key":"e_1_2_1_67_1","unstructured":"T. Veldhuizen and E. Gannon. 1998. Active libraries: Rethinking the roles of compilers and libraries. In Proceedings of the SIAM Workshop: Object Oriented Methods for Interoperable Scientific and Engineering Computing Michael E. Henderson Christopher R. Anderson and Stephen L. Lyons (Eds.). SIAM Press 286--295.  T. Veldhuizen and E. Gannon. 1998. Active libraries: Rethinking the roles of compilers and libraries. In Proceedings of the SIAM Workshop: Object Oriented Methods for Interoperable Scientific and Engineering Computing Michael E. Henderson Christopher R. Anderson and Stephen L. Lyons (Eds.). SIAM Press 286--295."},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-15582-6_49"},{"key":"e_1_2_1_69_1","volume-title":"Proceedings of the 1st International Workshop on Polyhedral Compilation Techniques (IMPACT\u201911)","author":"Verdoolaege Sven","year":"2011","unstructured":"Sven Verdoolaege . 2011 . Counting affine calculator and applications . In Proceedings of the 1st International Workshop on Polyhedral Compilation Techniques (IMPACT\u201911) . DOI:https:\/\/doi.org\/10.13140\/RG.2.1.2959.5601 10.13140\/RG.2.1.2959.5601 Sven Verdoolaege. 2011. Counting affine calculator and applications. In Proceedings of the 1st International Workshop on Polyhedral Compilation Techniques (IMPACT\u201911). 
DOI:https:\/\/doi.org\/10.13140\/RG.2.1.2959.5601"},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1145\/2400682.2400713"},{"key":"e_1_2_1_71_1","volume-title":"Proceedings of the 4th Workshop on Polyhedral Compilation Techniques (IMPACT\u201914","author":"Verdoolaege Sven","year":"2014","unstructured":"Sven Verdoolaege , Serge Guelton , Tobias Grosser , and Albert Cohen . 2014 . Schedule trees . In Proceedings of the 4th Workshop on Polyhedral Compilation Techniques (IMPACT\u201914 , Associated with HiPEAC\u201914). Sven Verdoolaege, Serge Guelton, Tobias Grosser, and Albert Cohen. 2014. Schedule trees. In Proceedings of the 4th Workshop on Polyhedral Compilation Techniques (IMPACT\u201914, Associated with HiPEAC\u201914)."},{"key":"#cr-split#-e_1_2_1_72_1.1","unstructured":"Sven Verdoolaege and Gerda Janssens. 2017. Scheduling for PPCG. Report CW 706. Department of Computer Science KU Leuven Leuven Belgium. DOI:https:\/\/doi.org\/10.13140\/RG.2.2.28998.68169 10.13140\/RG.2.2.28998.68169"},{"key":"#cr-split#-e_1_2_1_72_1.2","unstructured":"Sven Verdoolaege and Gerda Janssens. 2017. Scheduling for PPCG. Report CW 706. Department of Computer Science KU Leuven Leuven Belgium. DOI:https:\/\/doi.org\/10.13140\/RG.2.2.28998.68169"},{"key":"e_1_2_1_73_1","volume-title":"Proceedings of the ACM\/IEEE Conference on Supercomputing (SC\u201998)","author":"Clint Whaley R.","unstructured":"R. Clint Whaley and Jack J. Dongarra . 1998. Automatically tuned linear algebra software . In Proceedings of the ACM\/IEEE Conference on Supercomputing (SC\u201998) . IEEE Computer Society, Washington, DC, 1--27. Retrieved from http:\/\/dl.acm.org\/citation.cfm?id&equals;509058.509096. R. Clint Whaley and Jack J. Dongarra. 1998. Automatically tuned linear algebra software. In Proceedings of the ACM\/IEEE Conference on Supercomputing (SC\u201998). IEEE Computer Society, Washington, DC, 1--27. 
Retrieved from http:\/\/dl.acm.org\/citation.cfm?id=509058.509096."},{"key":"e_1_2_1_74_1","unstructured":"Yuxin Wu and Kaiming He. 2018. Group normalization. Retrieved from http:\/\/arxiv.org\/abs\/1803.08494."},{"key":"e_1_2_1_75_1","unstructured":"Saining Xie, Ross B. Girshick, Piotr Doll\u00e1r, Zhuowen Tu, and Kaiming He. 2016. Aggregated residual transformations for deep neural networks. Retrieved from http:\/\/arxiv.org\/abs\/1611.05431."},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178372.3179507"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3355606","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3355606","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:13:29Z","timestamp":1750202009000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3355606"}},"subtitle":["From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels, 
Automatically"],"short-title":[],"issued":{"date-parts":[[2019,10,11]]},"references-count":76,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2019,12,31]]}},"alternative-id":["10.1145\/3355606"],"URL":"https:\/\/doi.org\/10.1145\/3355606","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,10,11]]},"assertion":[{"value":"2019-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-10-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}