{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:22:02Z","timestamp":1750220522353,"version":"3.41.0"},"reference-count":56,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2021,6,8]],"date-time":"2021-06-08T00:00:00Z","timestamp":1623110400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2021,9,30]]},"abstract":"<jats:p>A large portion of the recent performance increase in the High Performance Computing (HPC) and Machine Learning (ML) domains is fueled by accelerator cards. Many popular ML frameworks support accelerators by organizing computations as a computational graph over a set of highly optimized, batched general-purpose kernels. While this approach simplifies the kernels\u2019 implementation for each individual accelerator, the increasing heterogeneity among accelerator architectures for HPC complicates the creation of portable and extensible libraries of such kernels. Therefore, using a generalization of the CUDA community\u2019s warp register cache programming idiom, we propose a new programming idiom (CoRe) and a virtual architecture model (PIRCH), abstracting over SIMD and SIMT paradigms. We define and automate the mapping process from a single source to PIRCH\u2019s intermediate representation and develop backends that issue code for three different architectures: Intel AVX512, NVIDIA GPUs, and NEC SX-Aurora. Code generated by our source-to-source compiler for batched kernels, borG, competes favorably with vendor-tuned libraries and is up to 2\u00d7 faster than hand-tuned kernels across architectures.<\/jats:p>","DOI":"10.1145\/3458357","type":"journal-article","created":{"date-parts":[[2021,6,8]],"date-time":"2021-06-08T16:21:19Z","timestamp":1623169279000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Flynn\u2019s Reconciliation"],"prefix":"10.1145","volume":"18","author":[{"given":"Daniel","family":"Thuerck","sequence":"first","affiliation":[{"name":"NEC Laboratories Europe and TU Darmstadt, Heidelberg, Germany"}]},{"given":"Nicolas","family":"Weber","sequence":"additional","affiliation":[{"name":"NEC Laboratories Europe, Heidelberg, Germany"}]},{"given":"Roberto","family":"Bifulco","sequence":"additional","affiliation":[{"name":"NEC Laboratories Europe, Heidelberg, Germany"}]}],"member":"320","published-online":{"date-parts":[[2021,6,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-016-0485-7"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ScalA49573.2019.00006"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2838735"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.4460"},{"volume-title":"Proceedings of the 46th International Conference on Parallel Processing (ICPP\u201917)","author":"Anzt Hartwig","key":"e_1_2_1_5_1","unstructured":"Hartwig Anzt , Jack Dongarra , Goran Flegar , and Enrique S . Quintana-Orti. 2017. Variable-size batched LU for small matrices and its integration into block-Jacobi preconditioning . In Proceedings of the 46th International Conference on Parallel Processing (ICPP\u201917) . IEEE, 91\u2013100. Hartwig Anzt, Jack Dongarra, Goran Flegar, and Enrique S. Quintana-Orti. 2017. Variable-size batched LU for small matrices and its integration into block-Jacobi preconditioning. In Proceedings of the 46th International Conference on Parallel Processing (ICPP\u201917). IEEE, 91\u2013100."},{"volume-title":"Computational Combinatorial Optimization","author":"Applegate David","key":"e_1_2_1_6_1","unstructured":"David Applegate , Robert Bixby , Va\u0161ek Chv\u00e1tal , and William Cook . 2001. TSP cuts which do not conform to the template paradigm . In Computational Combinatorial Optimization . Springer , 261\u2013303. David Applegate, Robert Bixby, Va\u0161ek Chv\u00e1tal, and William Cook. 2001. TSP cuts which do not conform to the template paradigm. In Computational Combinatorial Optimization. Springer, 261\u2013303."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3317550.3321441"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2555243.2555258"},{"volume-title":"Proceedings of the IEEE\/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC\u201919)","author":"Beckingsale David A.","key":"e_1_2_1_9_1","unstructured":"David A. Beckingsale , Jason Burmark , Rich Hornung , Holger Jones , William Killian , Adam J. Kunen , Olga Pearce , Peter Robinson , Brian S. Ryujin , and Thomas R. W. Scogland . 2019. RAJA: Portable performance for large-scale scientific applications . In Proceedings of the IEEE\/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC\u201919) . IEEE, 71\u201381. David A. Beckingsale, Jason Burmark, Rich Hornung, Holger Jones, William Killian, Adam J. Kunen, Olga Pearce, Peter Robinson, Brian S. Ryujin, and Thomas R. W. Scogland. 2019. RAJA: Portable performance for large-scale scientific applications. In Proceedings of the IEEE\/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC\u201919). IEEE, 71\u201381."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356173"},{"key":"e_1_2_1_11_1","unstructured":"Eli Bendersky. 2010. PyCParser. Retrieved from https:\/\/github.com\/eliben\/pycparser.  Eli Bendersky. 2010. PyCParser. Retrieved from https:\/\/github.com\/eliben\/pycparser."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACCPD.2016.010"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.orl.2008.01.004"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178433.3178435"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356162"},{"key":"e_1_2_1_16_1","volume-title":"cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759","author":"Chetlur Sharan","year":"2014","unstructured":"Sharan Chetlur , Cliff Woolley , Philippe Vandermersch , Jonathan Cohen , John Tran , Bryan Catanzaro , and Evan Shelhamer . 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 ( 2014 ). Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014)."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2007.346199"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2011.308"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2011.10.002"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1093\/comjnl\/7.2.149"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1543753.1543756"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.1998.727179"},{"key":"e_1_2_1_23_1","unstructured":"Nicolai H\u00e4hnle. 2019. D68994 [RFC]Redefine \u201cconvergent\u201d in Terms of Dynamic Instances. Retrieved from https:\/\/reviews.llvm.org\/D68994.  Nicolai H\u00e4hnle. 2019. D68994 [RFC]Redefine \u201cconvergent\u201d in Terms of Dynamic Instances. Retrieved from https:\/\/reviews.llvm.org\/D68994."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3148173.3148185"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3368826.3377928"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3282307"},{"key":"e_1_2_1_27_1","first-page":"4","article-title":"Open source computer algebra systems: SymPy","volume":"45","author":"Joyner David","year":"2012","unstructured":"David Joyner , Ond\u0159ej \u010cert\u00edk , Aaron Meurer , and Brian E. Granger . 2012 . Open source computer algebra systems: SymPy . ACM Commun. Comput. Algeb. 45 , 3\/ 4 (Jan. 2012), 225\u2013234. David Joyner, Ond\u0159ej \u010cert\u00edk, Aaron Meurer, and Brian E. Granger. 2012. Open source computer algebra systems: SymPy. ACM Commun. Comput. Algeb. 45, 3\/4 (Jan. 2012), 225\u2013234.","journal-title":"ACM Commun. Comput. Algeb."},{"volume-title":"Automatic SIMD Vectorization of SSA-Based Control Flow Graphs","author":"Karrenberg Ralf","key":"e_1_2_1_28_1","unstructured":"Ralf Karrenberg . 2015. Whole-function vectorization . In Automatic SIMD Vectorization of SSA-Based Control Flow Graphs . Springer , 85\u2013125. Ralf Karrenberg. 2015. Whole-function vectorization. In Automatic SIMD Vectorization of SSA-Based Control Flow Graphs. Springer, 85\u2013125."},{"key":"e_1_2_1_29_1","volume-title":"SPIR-V specification","author":"Kessenich John","year":"2018","unstructured":"John Kessenich , Boaz Ouriel , and Raun Krisch . 2018. SPIR-V specification . Khronos Group 3 ( 2018 ). John Kessenich, Boaz Ouriel, and Raun Krisch. 2018. SPIR-V specification. Khronos Group 3 (2018)."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.5555\/977395.977673"},{"key":"e_1_2_1_31_1","volume-title":"MLIR: A compiler infrastructure for the end of Moore\u2019s law. arXiv:2002.11054 [cs] (Feb.","author":"Lattner Chris","year":"2020","unstructured":"Chris Lattner , Mehdi Amini , Uday Bondhugula , Albert Cohen , Andy Davis , Jacques Pienaar , River Riddle , Tatiana Shpeisman , Nicolas Vasilache , and Oleksandr Zinenko . 2020 . MLIR: A compiler infrastructure for the end of Moore\u2019s law. arXiv:2002.11054 [cs] (Feb. 2020). Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2020. MLIR: A compiler infrastructure for the end of Moore\u2019s law. arXiv:2002.11054 [cs] (Feb. 2020)."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2370036.2145825"},{"key":"e_1_2_1_33_1","volume-title":"Proceedings of the Workshop on Programming Models for SIMD\/Vector Processing. 17\u201324","author":"Lei\u00dfa Roland","year":"2014","unstructured":"Roland Lei\u00dfa , Immanuel Haffner , and Sebastian Hack . 2014 . Sierra: A SIMD extension for C++ . In Proceedings of the Workshop on Programming Models for SIMD\/Vector Processing. 17\u201324 . Roland Lei\u00dfa, Immanuel Haffner, and Sebastian Hack. 2014. Sierra: A SIMD extension for C++. In Proceedings of the Workshop on Programming Models for SIMD\/Vector Processing. 17\u201324."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178433.3178434"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3205289.3205294"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACCPD.2016.006"},{"key":"e_1_2_1_37_1","volume-title":"Pad\u00e9 activation units: End-to-end learning of flexible activation functions in deep networks. arXiv preprint arXiv:1907.06732","author":"Molina Alejandro","year":"2019","unstructured":"Alejandro Molina , Patrick Schramowski , and Kristian Kersting . 2019. Pad\u00e9 activation units: End-to-end learning of flexible activation functions in deep networks. arXiv preprint arXiv:1907.06732 ( 2019 ). Alejandro Molina, Patrick Schramowski, and Kristian Kersting. 2019. Pad\u00e9 activation units: End-to-end learning of flexible activation functions in deep networks. arXiv preprint arXiv:1907.06732 (2019)."},{"key":"e_1_2_1_38_1","unstructured":"Simon Moll. 2019. D57504 [RFC]: Prototype & Roadmap for Vector Predication in LLVM. Retrieved from https:\/\/reviews.llvm.org\/D57504.  Simon Moll. 2019. D57504 [RFC]: Prototype & Roadmap for Vector Predication in LLVM. Retrieved from https:\/\/reviews.llvm.org\/D57504."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3303117.3306172"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/951710.951714"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.5555\/2190025.2190062"},{"key":"e_1_2_1_42_1","unstructured":"NVIDIA. 2020. CUDA C++ Programming Guide. Retrieved from http:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html.  NVIDIA. 2020. CUDA C++ Programming Guide. Retrieved from http:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html."},{"volume-title":"Proceedings of the Innovative Parallel Computing Conference (InPar\u201912)","author":"Pharr Matt","key":"e_1_2_1_43_1","unstructured":"Matt Pharr and William R. Mark . 2012. Ispc: A SPMD compiler for high-performance CPU programming . In Proceedings of the Innovative Parallel Computing Conference (InPar\u201912) . IEEE, 1\u201313. Matt Pharr and William R. Mark. 2012. Ispc: A SPMD compiler for high-performance CPU programming. In Proceedings of the Innovative Parallel Computing Conference (InPar\u201912). IEEE, 1\u201313."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178433.3178436"},{"volume-title":"Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. 45\u201355","author":"Shin Jaewook","key":"e_1_2_1_45_1","unstructured":"Jaewook Shin , J. Chame , and M. W. Hall . 2002. Compiler-controlled caching in superword register files for multimedia extension architectures . In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. 45\u201355 . Jaewook Shin, J. Chame, and M. W. Hall. 2002. Compiler-controlled caching in superword register files for multimedia extension architectures. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. 45\u201355."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2017.7863730"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2010.69"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/IA349570.2019.00014"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/IA351965.2020.00010"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/IA3.2018.00008"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3392032"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3355606"},{"volume-title":"High-Performance Computing on the Intel\u00ae Xeon Phi\u2122","author":"Wang Endong","key":"e_1_2_1_53_1","unstructured":"Endong Wang , Qing Zhang , Bo Shen , Guangyong Zhang , Xiaowei Lu , Qing Wu , and Yajuan Wang . 2014. Intel math kernel library . In High-Performance Computing on the Intel\u00ae Xeon Phi\u2122 . Springer , 167\u2013188. Endong Wang, Qing Zhang, Bo Shen, Guangyong Zhang, Xiaowei Lu, Qing Wu, and Yajuan Wang. 2014. Intel math kernel library. In High-Performance Computing on the Intel\u00ae Xeon Phi\u2122. Springer, 167\u2013188."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3106341"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2854038.2854041"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356210"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3458357","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3458357","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:24:42Z","timestamp":1750195482000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3458357"}},"subtitle":["Automating the Register Cache Idiom for Cross-accelerator Programming"],"short-title":[],"issued":{"date-parts":[[2021,6,8]]},"references-count":56,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,9,30]]}},"alternative-id":["10.1145\/3458357"],"URL":"https:\/\/doi.org\/10.1145\/3458357","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2021,6,8]]},"assertion":[{"value":"2020-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-03-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-06-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}