{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,24]],"date-time":"2026-03-24T22:49:13Z","timestamp":1774392553669,"version":"3.50.1"},"reference-count":61,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2018,9,4]],"date-time":"2018-09-04T00:00:00Z","timestamp":1536019200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2018,9,30]]},"abstract":"<jats:p>The efficiency of tensor contraction is of great importance. Compilers cannot optimize it well enough to come close to the performance of expert-tuned implementations. All existing approaches that provide competitive performance require optimized external code. We introduce a compiler optimization that reaches the performance of optimized BLAS libraries without the need for an external implementation or automatic tuning. Our approach provides competitive performance across hardware architectures and can be generalized to deliver the same benefits for algebraic path problems. 
By making fast linear algebra kernels available to everyone, we expect productivity increases when optimized libraries are not available.<\/jats:p>","DOI":"10.1145\/3235029","type":"journal-article","created":{"date-parts":[[2018,9,4]],"date-time":"2018-09-04T12:37:30Z","timestamp":1536064650000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":24,"title":["High-Performance Generalized Tensor Operations"],"prefix":"10.1145","volume":"15","author":[{"given":"Roman","family":"Gareev","sequence":"first","affiliation":[{"name":"Ural Federal University, Ekaterinburg, Russia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tobias","family":"Grosser","sequence":"additional","affiliation":[{"name":"ETH Zurich, Z\u00fcrich, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Michael","family":"Kruse","sequence":"additional","affiliation":[{"name":"INRIA, \u00c9cole Normale Sup\u00e9rieur, and Polly Labs, Paris, France"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2018,9,4]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Eugene Brevdo et al","author":"Abadi Mart\u00edn","year":"2015","unstructured":"Mart\u00edn Abadi , Ashish Agarwal , Paul Barham , Eugene Brevdo et al . 2015 . TensorFlow: Large-scale machine learning on heterogeneous systems. Retrieved from https:\/\/www.tensorflow.org\/. Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Retrieved from https:\/\/www.tensorflow.org\/."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.60"},{"key":"e_1_2_1_3_1","unstructured":"ARM. 2015. ARM Performance Libraries Reference Manual. ARM.  ARM. 2015. ARM Performance Libraries Reference Manual. ARM."},{"key":"e_1_2_1_4_1","unstructured":"ARM. 2016. Cortex-A57 Software Optimization Guide. 
ARM.  ARM. 2016. Cortex-A57 Software Optimization Guide. ARM."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1103\/RevModPhys.79.291"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2004.840311"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1654059.1654119"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2591635.2667174"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1379022.1375595"},{"key":"e_1_2_1_10_1","volume-title":"Minimizing Computation in Convolutional Neural Networks","author":"Cong Jason","unstructured":"Jason Cong and Bingjun Xiao . 2014. Minimizing Computation in Convolutional Neural Networks . Springer , Cham , 281--290. Jason Cong and Bingjun Xiao. 2014. Minimizing Computation in Convolutional Neural Networks. Springer, Cham, 281--290."},{"key":"e_1_2_1_11_1","unstructured":"Romain Dolbeau. 2016. Theoretical peak FLOPS per instruction set on less conventional hardware. https:\/\/www.researchgate.net\/publication\/308804090_Theoretical_Peak_FLOPS_per_instruction_set_on_less_conventional_hardwar. (Accessed: July 10 2018).  Romain Dolbeau. 2016. Theoretical peak FLOPS per instruction set on less conventional hardware. https:\/\/www.researchgate.net\/publication\/308804090_Theoretical_Peak_FLOPS_per_instruction_set_on_less_conventional_hardwar. (Accessed: July 10 2018)."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/77626.79170"},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201914)","author":"Facchinei F.","unstructured":"F. Facchinei , S. Sagratella , and G. Scutari . 2014. Flexible parallel algorithms for big data optimization . In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201914) . 7208--7212. F. Facchinei, S. Sagratella, and G. Scutari. 2014. 
Flexible parallel algorithms for big data optimization. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201914). 7208--7212."},{"key":"e_1_2_1_14_1","unstructured":"Paul Feautrier and Christian Lengauer. 2011. Polyhedron Model. Springer US Boston MA 1581--1592.  Paul Feautrier and Christian Lengauer. 2011. Polyhedron Model. Springer US Boston MA 1581--1592."},{"key":"e_1_2_1_15_1","unstructured":"Agner Fog. 2017. Instruction Tables. Technical University of Denmark.  Agner Fog. 2017. Instruction Tables. Technical University of Denmark."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-006-0012-3"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1356052.1356053"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1142\/S0129626412500107"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the 1st International Workshop on Polyhedral Compilation Techniques (IMPACT\u201911)","author":"Grosser Tobias","year":"2011","unstructured":"Tobias Grosser , Hongbin Zheng , Ragesh Aloor , Andreas Simb\u00fcrger , Armin Gr\u00f6\u00dflinger , and Louis-No\u00ebl Pouchet . 2011 . Polly\u2014Polyhedral optimization in LLVM . In Proceedings of the 1st International Workshop on Polyhedral Compilation Techniques (IMPACT\u201911) , C. Alias and C. Bastoul (Eds.). Chamonix, France. Tobias Grosser, Hongbin Zheng, Ragesh Aloor, Andreas Simb\u00fcrger, Armin Gr\u00f6\u00dflinger, and Louis-No\u00ebl Pouchet. 2011. Polly\u2014Polyhedral optimization in LLVM. In Proceedings of the 1st International Workshop on Polyhedral Compilation Techniques (IMPACT\u201911), C. Alias and C. Bastoul (Eds.). 
Chamonix, France."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1137\/15M1026171"},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the International Conference for High-Performance Computing, Networking, Storage and Analysis (SC\u201915)","author":"Heinecke Alexander","year":"2015","unstructured":"Alexander Heinecke , Hans Pabst , and Greg Henry . 2015 . LIBXSMM: A high-performance library for small matrix multiplications . In Proceedings of the International Conference for High-Performance Computing, Networking, Storage and Analysis (SC\u201915) . Alexander Heinecke, Hans Pabst, and Greg Henry. 2015. LIBXSMM: A high-performance library for small matrix multiplications. In Proceedings of the International Conference for High-Performance Computing, Networking, Storage and Analysis (SC\u201915)."},{"key":"e_1_2_1_22_1","volume-title":"Proceedings of the 20th International Conference on Compiler Construction: Part of the Joint European Conferences on Theory and Practice of Software (CC\u201911\/ETAPS\u201911)","author":"Henretty Tom","unstructured":"Tom Henretty , Kevin Stock , Louis-No\u00ebl Pouchet , Franz Franchetti , J. Ramanujam , and P. Sadayappan . 2011. Data layout transformation for stencil computations on short-vector SIMD architectures . In Proceedings of the 20th International Conference on Compiler Construction: Part of the Joint European Conferences on Theory and Practice of Software (CC\u201911\/ETAPS\u201911) . Springer-Verlag, Berlin, 225--245. Tom Henretty, Kevin Stock, Louis-No\u00ebl Pouchet, Franz Franchetti, J. Ramanujam, and P. Sadayappan. 2011. Data layout transformation for stencil computations on short-vector SIMD architectures. In Proceedings of the 20th International Conference on Compiler Construction: Part of the Joint European Conferences on Theory and Practice of Software (CC\u201911\/ETAPS\u201911). 
Springer-Verlag, Berlin, 225--245."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1021\/jp034596z"},{"key":"e_1_2_1_24_1","unstructured":"IBM 2012. XL C\/C++: Compiler Reference\u2014IBM. IBM."},{"key":"e_1_2_1_25_1","unstructured":"Intel. n.d. Intel Math Kernel Library (Intel MKL). Retrieved from https:\/\/software.intel.com\/ru-ru\/intel-mkl\/?cid=sem43700011401059448&intel_term===intel+mkl&gclid===CIjbtvqaqM8CFSoNcwodDPUAbw&gclsrc===aw.ds."},{"key":"e_1_2_1_26_1","unstructured":"Intel. 2015. Intel C++ Compiler 16.0 Update 4 User and Reference Guide. Intel."},{"key":"e_1_2_1_27_1","unstructured":"Intel. 2018. Intel Intrinsics Guide. Intel."},{"key":"e_1_2_1_28_1","first-page":"190","article-title":"A secure scheme for privacy preserving data mining using matrix encoding","volume":"7","author":"Jayachandran Shana","year":"2016","unstructured":"Shana Jayachandran and T. Venkatachalam. 2016. A secure scheme for privacy preserving data mining using matrix encoding. World Eng. Appl. Sci. J. 7, 3 (2016), 190--193.","journal-title":"World Eng. Appl. Sci. J."},{"key":"e_1_2_1_29_1","volume-title":"LLVM: An Infrastructure for Multi-Stage Optimization. Master\u2019s thesis. 
Computer Science Department","author":"Lattner Chris","year":"2002","unstructured":"Chris Lattner. 2002. LLVM: An Infrastructure for Multi-Stage Optimization. Master\u2019s thesis. Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, IL. Retrieved from http:\/\/llvm.cs.uiuc.edu."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/355841.355847"},{"key":"e_1_2_1_31_1","volume-title":"GEMM: From Pure C to SSE Optimized Micro Kernels.","author":"Lehn Michael","year":"2014","unstructured":"Michael Lehn. 2014. GEMM: From Pure C to SSE Optimized Micro Kernels. Retrieved from http:\/\/apfel.mathematik.uni-ulm.de\/~lehn\/sghpc\/gemm\/index.html."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1025117523902"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2925987"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1021\/ct1007247"},{"key":"e_1_2_1_35_1","volume-title":"Sedukhin","author":"Matsumoto Kazuya","year":"2009","unstructured":"Kazuya Matsumoto and Stanislav G. Sedukhin. 2009. A solution of the all-pairs shortest paths problem on the cell broadband engine processor. IEICE Trans. 92-D, 6 (2009), 1225--1231."},{"key":"e_1_2_1_36_1","unstructured":"Devin Matthews. 2016. High-performance tensor contraction without BLAS. CoRR abs\/1607.00291.  Devin Matthews. 2016. 
High-performance tensor contraction without BLAS. CoRR abs\/1607.00291."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/305138.305230"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.amc.2014.02.051"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.5120\/17450-8341"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1137\/11082748X"},{"key":"e_1_2_1_41_1","unstructured":"Louis-No\u00ebl Pouchet. 2011. PolyBench\/C the Polyhedral Benchmark suite. Retrieved from http:\/\/web.cse.ohio-state.edu\/~pouchet\/software\/polybench\/."},{"key":"e_1_2_1_42_1","volume-title":"An Exact Method for Analysis of Value-Based Array Data Dependences","author":"Pugh William","unstructured":"William Pugh and David Wonnacott. 1994a. An Exact Method for Analysis of Value-Based Array Data Dependences. Springer, Berlin, 546--566."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/183432.183525"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-31464-3_23"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2014.110"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/2581122.2544155"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3157733"},{"key":"e_1_2_1_48_1","doi-asserted-by":"crossref","unstructured":"Y. N. Srikant and P. Shankar. 2007. The Compiler Design Handbook: Optimizations and Machine Code Generation Second Edition. CRC Press.   Y. N. Srikant and P. Shankar. 2007. The Compiler Design Handbook: Optimizations and Machine Code Generation Second Edition. 
CRC Press.","DOI":"10.1201\/9781420043839"},{"key":"e_1_2_1_49_1","volume-title":"Using and Porting the GNU Compiler Collection: For Gcc-2.95","author":"Stallman R.","unstructured":"R. Stallman . 1999. Using and Porting the GNU Compiler Collection: For Gcc-2.95 . Free Software Foundation . R. Stallman. 1999. Using and Porting the GNU Compiler Collection: For Gcc-2.95. Free Software Foundation."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.101"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/2086696.2086729"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2017.7863734"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1007\/11557654_89"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/331532.331599"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2764454"},{"key":"e_1_2_1_56_1","volume-title":"Fast convolutional nets with fbfft: A GPU performance evaluation. CoRR","author":"Vasilache Nicolas","year":"2014","unstructured":"Nicolas Vasilache , Jeff Johnson , Micha\u00ebl Mathieu , Soumith Chintala , Serkan Piantino , and Yann LeCun . 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. CoRR ( 2014 ). Retrieved from http:\/\/arxiv.org\/abs\/1412.7580. Nicolas Vasilache, Jeff Johnson, Micha\u00ebl Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. CoRR (2014). 
Retrieved from http:\/\/arxiv.org\/abs\/1412.7580."},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2503219"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2010.29"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-8191(00)00085-5"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPADS.2012.97"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.14"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3235029","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3235029","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T02:08:17Z","timestamp":1750212497000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3235029"}},"subtitle":["A Compiler-Oriented Approach"],"short-title":[],"issued":{"date-parts":[[2018,9,4]]},"references-count":61,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2018,9,30]]}},"alternative-id":["10.1145\/3235029"],"URL":"https:\/\/doi.org\/10.1145\/3235029","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,9,4]]},"assertion":[{"value":"2017-10-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-06-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-09-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication 
History"}}]}}