{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,15]],"date-time":"2026-01-15T11:09:56Z","timestamp":1768475396626,"version":"3.49.0"},"reference-count":25,"publisher":"SAGE Publications","issue":"6","license":[{"start":{"date-parts":[[2020,7,31]],"date-time":"2020-07-31T00:00:00Z","timestamp":1596153600000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research Council","doi-asserted-by":"publisher","award":["EP\/L016796\/1"],"award-info":[{"award-number":["EP\/L016796\/1"]}],"id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2020,11]]},"abstract":"<jats:p> Vectorization is increasingly important to achieve high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over unstructured meshes, poses difficulties to effective vectorization. Maintaining a user-friendly high-level interface with a suitable degree of abstraction while generating efficient, vectorized code for the finite element method is a challenge for numerical software systems and libraries. In this work, we study cross-element vectorization in the finite element framework Firedrake via code transformation and demonstrate the efficacy of such an approach by evaluating a wide range of matrix-free operators spanning different polynomial degrees and discretizations on two recent CPUs using three mainstream compilers. 
Our experiments show that our approaches for cross-element vectorization achieve 30% of theoretical peak performance for many examples of practical significance, and exceed 50% for cases with high arithmetic intensities, with consistent speed-up over (intra-element) vectorization restricted to the local assembly kernels. <\/jats:p>","DOI":"10.1177\/1094342020945005","type":"journal-article","created":{"date-parts":[[2020,7,31]],"date-time":"2020-07-31T10:33:46Z","timestamp":1596191626000},"page":"629-644","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":17,"title":["A study of vectorization for matrix-free finite element methods"],"prefix":"10.1177","volume":"34","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4223-6700","authenticated-orcid":false,"given":"Tianjiao","family":"Sun","sequence":"first","affiliation":[{"name":"Department of Computer Science, Imperial College, London, UK"}]},{"given":"Lawrence","family":"Mitchell","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Durham University, Durham, UK"}]},{"given":"Kaushik","family":"Kulkarni","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Illinois at Urbana-Champaign, IL, USA"}]},{"given":"Andreas","family":"Kl\u00f6ckner","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Illinois at Urbana-Champaign, IL, USA"}]},{"given":"David A","family":"Ham","sequence":"additional","affiliation":[{"name":"Department of Mathematics, Imperial College, London, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5905-1804","authenticated-orcid":false,"given":"Paul HJ","family":"Kelly","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Imperial College, London, 
UK"}]}],"member":"179","published-online":{"date-parts":[[2020,7,31]]},"reference":[{"key":"bibr1-1094342020945005","doi-asserted-by":"publisher","DOI":"10.1145\/2566630"},{"key":"bibr2-1094342020945005","doi-asserted-by":"publisher","DOI":"10.2172\/1409218"},{"key":"bibr3-1094342020945005","unstructured":"Fog A (2017) VCL\u2014A C++ Vector Class Library. Available at: https:\/\/www.agner.org\/optimize\/vectorclass.pdf (accessed 19 March 2019)."},{"key":"bibr4-1094342020945005","doi-asserted-by":"publisher","DOI":"10.1515\/jnum-2012-0013"},{"key":"bibr5-1094342020945005","unstructured":"Homolya M, Kirby RC, Ham DA (2017) Exposing and exploiting structure: optimal code generation for high-order finite element methods. ArXiv: 1711.02473 [cs.MS]."},{"key":"bibr6-1094342020945005","doi-asserted-by":"publisher","DOI":"10.1137\/17M1130642"},{"key":"bibr7-1094342020945005","unstructured":"Kempf D, He\u00df R, M\u00fcthing S, et al. (2018) Automatic code generation for high-performance discontinuous Galerkin methods on modern architectures. arXiv preprint arXiv:1812.08075."},{"key":"bibr8-1094342020945005","doi-asserted-by":"publisher","DOI":"10.1137\/17M1133208"},{"key":"bibr9-1094342020945005","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1145\/2627373.2627387","volume-title":"Proceedings of ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming","author":"Kl\u00f6ckner A","year":"2014"},{"key":"bibr10-1094342020945005","doi-asserted-by":"publisher","DOI":"10.1145\/2427023.2427027"},{"key":"bibr11-1094342020945005","unstructured":"Kronbichler M, Kormann K (2017) Fast matrix-free evaluation of discontinuous Galerkin finite element operators. 
arXiv preprint arXiv:1711.03590."},{"key":"bibr12-1094342020945005","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-23099-8"},{"key":"bibr13-1094342020945005","doi-asserted-by":"publisher","DOI":"10.1145\/3054944"},{"issue":"4","key":"bibr14-1094342020945005","first-page":"57","volume":"11","author":"Luporini F","year":"2015","journal-title":"ACM Transactions on Architecture and Code Optimization (TACO)"},{"key":"bibr15-1094342020945005","doi-asserted-by":"publisher","DOI":"10.1137\/15M1021167"},{"key":"bibr16-1094342020945005","unstructured":"M\u00fcthing S, Piatkowski M, Bastian P (2017) High-performance implementation of matrix-free high-order discontinuous Galerkin methods. arXiv preprint arXiv:1711.10885."},{"key":"bibr17-1094342020945005","unstructured":"OpenMP Architecture Review Board (2018) OpenMP Application Programming Interface Version 5.0. Available at: https:\/\/www.openmp.org\/wp-content\/uploads\/OpenMP-API-Specification-5.0.pdf (accessed 5 November 2020)."},{"key":"bibr18-1094342020945005","doi-asserted-by":"publisher","DOI":"10.1145\/2998441"},{"key":"bibr19-1094342020945005","doi-asserted-by":"publisher","DOI":"10.1109\/SC.Companion.2012.134"},{"key":"bibr20-1094342020945005","unstructured":"Sun T (2019a) Cross-element vectorization in Firedrake. Available at: https:\/\/www.codeocean.com\/. DOI: https:\/\/doi.org\/10.24433\/CO.8386435.v2 (accessed 5 November 2020)."},{"key":"bibr21-1094342020945005","unstructured":"Sun T (2019b) tj-sun\/firedrake-vectorization: scripts for experimental evaluation for the manuscript on cross-element vectorization. DOI: 10.5281\/zenodo.3365432 (accessed 5 November 2020)."},{"key":"bibr22-1094342020945005","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-15582-6_49"},{"key":"bibr23-1094342020945005","doi-asserted-by":"crossref","unstructured":"Williams S, Waterman A, Patterson D (2009) Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM 65\u201376. 
Available at: http:\/\/dl.acm.org\/citation.cfm?id=1498785 (accessed 5 November 2020).","DOI":"10.1145\/1498765.1498785"},{"key":"bibr24-1094342020945005","unstructured":"Zenodo\/Firedrake (2019) Software used in \u2018A study of vectorization for matrix-free finite element methods\u2019. DOI: 10.5281\/zenodo.3362177 (accessed 5 November 2020)."},{"key":"bibr25-1094342020945005","unstructured":"Zhang B (2016) Guide to automatic vectorization with Intel AVX-512 instructions in Knights Landing processors. Available at: https:\/\/colfaxresearch.com\/knl-avx512 (accessed 19 March 2019)."}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342020945005","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/1094342020945005","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342020945005","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,2]],"date-time":"2025-03-02T06:31:49Z","timestamp":1740897109000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1094342020945005"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,7,31]]},"references-count":25,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2020,11]]}},"alternative-id":["10.1177\/1094342020945005"],"URL":"https:\/\/doi.org\/10.1177\/1094342020945005","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,7,31]]}}}