{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,5]],"date-time":"2026-02-05T06:58:17Z","timestamp":1770274697417,"version":"3.49.0"},"reference-count":32,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2015,1,9]],"date-time":"2015-01-09T00:00:00Z","timestamp":1420761600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"U.S. National Science Foundation","award":["0811457, 0926687, and 1059417"],"award-info":[{"award-number":["0811457, 0926687, and 1059417"]}]},{"name":"MAPDES project"},{"name":"Department of Computing at Imperial College London"},{"name":"EPSRC","award":["EP\/I00677X\/1, EP\/I006761\/1, and EP\/L000407\/1"],"award-info":[{"award-number":["EP\/I00677X\/1, EP\/I006761\/1, and EP\/L000407\/1"]}]},{"name":"Louisiana State University"},{"name":"HiPEAC collaboration grant"},{"name":"NERC","award":["NE\/K008951\/1 and NE\/K006789\/1"],"award-info":[{"award-number":["NE\/K008951\/1 and NE\/K006789\/1"]}]},{"name":"U.S. Army through contract W911NF-10-1-000"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2015,1,9]]},"abstract":"<jats:p>We study and systematically evaluate a class of composable code transformations that improve arithmetic intensity in local assembly operations, which represent a significant fraction of the execution time in finite element methods. Their performance optimization is indeed a challenging issue. Even though affine loop nests are generally present, the short trip counts and the complexity of mathematical expressions, which vary among different problems, make it hard to determine an optimal sequence of successful transformations. 
Our investigation has resulted in the implementation of a compiler (called COFFEE) for local assembly kernels, fully integrated with a framework for developing finite element methods. The compiler manipulates abstract syntax trees generated from a domain-specific language by introducing domain-aware optimizations for instruction-level parallelism and register locality. Eventually, it produces C code including vector SIMD intrinsics. Experiments using a range of real-world finite element problems of increasing complexity show that significant performance improvement is achieved. The generality of the approach and the applicability of the proposed code transformations to other domains is also discussed.<\/jats:p>","DOI":"10.1145\/2687415","type":"journal-article","created":{"date-parts":[[2015,1,12]],"date-time":"2015-01-12T20:02:10Z","timestamp":1421092930000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":32,"title":["Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly"],"prefix":"10.1145","volume":"11","author":[{"given":"Fabio","family":"Luporini","sequence":"first","affiliation":[{"name":"Imperial College London"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ana Lucia","family":"Varbanescu","sequence":"additional","affiliation":[{"name":"University of Amsterdam"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Florian","family":"Rathgeber","sequence":"additional","affiliation":[{"name":"Imperial College London"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Gheorghe-Teodor","family":"Bercea","sequence":"additional","affiliation":[{"name":"Imperial College London"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"J.","family":"Ramanujam","sequence":"additional","affiliation":[{"name":"Louisiana State University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"David 
A.","family":"Ham","sequence":"additional","affiliation":[{"name":"Imperial College London"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Paul H. J.","family":"Kelly","sequence":"additional","affiliation":[{"name":"Imperial College London"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2015,1,9]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Compilers: Principles, Techniques, and Tools","author":"Aho A. V.","year":"2007","unstructured":"A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman (Eds.). 2007. Compilers: Principles, Techniques, and Tools (2nd ed.). Pearson\/Addison Wesley, Boston, MA. http:\/\/www.loc.gov\/catdir\/toc\/ecip0618\/2006024333.html.","edition":"2"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/2566630"},{"key":"e_1_2_1_3_1","volume-title":"Department of Earth Science and Engineering, South Kensington Campus","author":"Fluidity CG.","unstructured":"AMCG. 2010. Fluidity Manual (version 4.0-release ed.). Applied Modelling and Computation Group, Department of Earth Science and Engineering, South Kensington Campus, Imperial College London, London, SW7 2AZ, UK. Available at http:\/\/hdl.handle.net\/10044\/1\/7086."},{"key":"e_1_2_1_4_1","series-title":"Lecture Notes in Computer Science","volume-title":"Languages and Compilers for Parallel Computing, Hironori Kasahara and Keiji Kimura (Eds.)","author":"Bertolli C.","unstructured":"C. Bertolli, A. Betts, N. Loriant, G. R. Mudalige, D. Radford, D. A. Ham, M. B. Giles, and P. H. J. Kelly. 2013. Compiler optimizations for industrial unstructured mesh CFD applications on GPUs. In Languages and Compilers for Parallel Computing, Hironori Kasahara and Keiji Kimura (Eds.). Lecture Notes in Computer Science, Vol. 7760. Springer, 112--126. DOI: http:\/\/dx.doi.org\/10.1007\/978-3-642-37658-0_8"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/1375581.1375595"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063396"},{"key":"e_1_2_1_7_1","volume-title":"Retrieved","year":"2014","unstructured":"Firedrake. 2014. The Firedrake Project. Retrieved November 16, 2014, from http:\/\/www.firedrakeproject.org."},{"key":"e_1_2_1_8_1","volume-title":"Retrieved","author":"Fischer P. F.","year":"2014","unstructured":"P. F. Fischer, J. W. Lottes, and S. G. Kerkemeier. 2008. Nek5000 Web Page. 
Retrieved November 16, 2014, from http:\/\/nek5000.mcs.anl.gov."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2004.840301"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1021\/jp9051215"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2464996.2467268"},{"key":"e_1_2_1_12_1","volume-title":"Retrieved","author":"Intel Corporation","year":"2012","unstructured":"Intel Corporation. 2012. Intel Architecture Code Analyzer (IACA). Retrieved November 16, 2014, from http:\/\/software.intel.com\/en-us\/articles\/intel-architecture-code-analyzer\/."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1137\/040607824"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1163641.1163644"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2427023.2427027"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.camwa.2013.08.026"},{"key":"e_1_2_1_17_1","volume-title":"Lecture Notes in Computational Science and Engineering","volume":"84","author":"Logg A.","year":"2012","unstructured":"A. Logg, K.-A. Mardal, and G. N. Wells (Eds.). 2012. Automated Solution of Differential Equations by the Finite Element Method. Lecture Notes in Computational Science and Engineering, Vol. 84. Springer. DOI: http:\/\/dx.doi.org\/10.1007\/978-3-642-23099-8"},{"key":"e_1_2_1_18_1","volume-title":"Retrieved","author":"Luporini F.","year":"2014","unstructured":"F. Luporini. 2014a. Helmholtz, Advection-Diffusion, and Burgers UFL Code. Retrieved November 16, 2014, from https:\/\/github.com\/firedrakeproject\/firedrake\/tree\/pyop2-ir-perf-eval\/tests\/perf-eval."},{"key":"e_1_2_1_19_1","volume-title":"Retrieved","author":"Luporini F.","year":"2014","unstructured":"F. Luporini. 2014b. Static Linear Elasticity Code. Retrieved November 16, 2014, from https:\/\/github.com\/firedrakeproject\/firedrake-bench\/tree\/experiments\/elasticity."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2010.04.203"},{"key":"e_1_2_1_21_1","series-title":"Lecture Notes in Computer Science","volume-title":"Supercomputing","author":"Markall G. R.","unstructured":"G. R. Markall, F. Rathgeber, L. Mitchell, N. Loriant, C. Bertolli, D. A. Ham, and P. H. J. Kelly. 2013. Performance portable finite element assembly using PyOP2 and FEniCS. In Supercomputing. Lecture Notes in Computer Science, Vol. 7905. 
Springer, 279--289."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1644001.1644009"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2004.840306"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.Companion.2012.134"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2504565"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2491491.2491496"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/1810085.1810120"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2581122.2544155"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.101"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989493.1989508"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2010.03.031"},{"key":"e_1_2_1_32_1","volume-title":"Proceedings of the 1998 ACM\/IEEE Conference on Supercomputing (Supercomputing\u201998)","author":"Whaley R. C.","unstructured":"R. C. Whaley and J. J. Dongarra. 1998. Automatically tuned linear algebra software. In Proceedings of the 1998 ACM\/IEEE Conference on Supercomputing (Supercomputing\u201998). IEEE, Los Alamitos, CA, 1--27. 
http:\/\/dl.acm.org\/citation.cfm?id=509058.509096."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2687415","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2687415","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T06:12:14Z","timestamp":1750227134000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2687415"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,1,9]]},"references-count":32,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2015,1,9]]}},"alternative-id":["10.1145\/2687415"],"URL":"https:\/\/doi.org\/10.1145\/2687415","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2015,1,9]]},"assertion":[{"value":"2014-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2014-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-01-09","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}