{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,30]],"date-time":"2026-03-30T07:25:23Z","timestamp":1774855523541,"version":"3.50.1"},"reference-count":74,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2019,3,31]],"date-time":"2019-03-31T00:00:00Z","timestamp":1553990400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001659","name":"Deutsche Forschungsgemeinschaft","doi-asserted-by":"publisher","award":["KR4661\/2-1"],"award-info":[{"award-number":["KR4661\/2-1"]}],"id":[{"id":"10.13039\/501100001659","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2019,3,31]]},"abstract":"<jats:p>This article presents matrix-free finite-element techniques for efficiently solving partial differential equations on modern many-core processors, such as graphics cards. We develop a GPU parallelization of a matrix-free geometric multigrid iterative solver targeting moderate and high polynomial degrees, with support for general curved and adaptively refined hexahedral meshes with hanging nodes. The central algorithmic component is the matrix-free operator evaluation with sum factorization. We compare the node-level performance of our implementation running on an Nvidia Pascal P100 GPU to a highly optimized multicore implementation running on comparable Intel Broadwell CPUs and an Intel Xeon Phi. Our experiments show that the GPU implementation is approximately 1.5 to 2 times faster across four different scenarios of the Poisson equation and a variety of element degrees in 2D and 3D. The lowest time to solution per degree of freedom is recorded for moderate polynomial degrees between 3 and 5. 
A detailed performance analysis highlights the capabilities of the GPU architecture and the chosen execution model with threading within the element, particularly with respect to the evaluation of the matrix-vector product. Atomic intrinsics are shown to provide a fast way for avoiding the possible race conditions in summing the elemental residuals into the global vector associated to shared vertices, edges, and surfaces. In addition, the solver infrastructure allows for using mixed-precision arithmetic that performs the multigrid V-cycle in single precision with an outer correction in double precision, increasing throughput by up to 83%.<\/jats:p>","DOI":"10.1145\/3322813","type":"journal-article","created":{"date-parts":[[2019,5,8]],"date-time":"2019-05-08T14:11:11Z","timestamp":1557324671000},"page":"1-32","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":62,"title":["Multigrid for Matrix-Free High-Order Finite Element Computations on Graphics Processors"],"prefix":"10.1145","volume":"6","author":[{"given":"Martin","family":"Kronbichler","sequence":"first","affiliation":[{"name":"Technical University of Munich, Garching, Germany"}]},{"given":"Karl","family":"Ljungkvist","sequence":"additional","affiliation":[{"name":"Uppsala University, Uppsala, Sweden"}]}],"member":"320","published-online":{"date-parts":[[2019,5,7]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0021-9991(03)00194-3"},{"key":"e_1_2_1_2_1","volume-title":"Retrieved","author":"Adams Mark","year":"2019"},{"key":"e_1_2_1_3_1","unstructured":"Mark Adams Phillip Colella Daniel T. Graves Jeff N. Johnson Hans S. 
Johansen Noel D. Keen Terry J. Ligocki et al. 2015. Chombo Software Package for AMR Applications Design Document. Technical Report. Lawrence Berkeley National Laboratory. https:\/\/crd.lbl.gov\/assets\/pubs_presos\/chomboDesign.pdf."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2010.04.024"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1515\/jnma-2018-0054"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2049673.2049678"},{"key":"e_1_2_1_7_1","volume-title":"Parallel Computational Fluid Dynamics. North Holland","author":"Berger Marsha J."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1090\/S0025-5718-1977-0431719-X"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10915-010-9396-8"},{"issue":"2","key":"e_1_2_1_10_1","first-page":"6","article-title":"ECP Milestone Report: Propose High-Order Mesh\/Data Format","volume":"2","author":"Brown Jed","year":"2018","journal-title":"WBS"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1137\/110834512"},{"key":"e_1_2_1_12_1","volume-title":"CEED Bake-Off Problems (Benchmarks). Retrieved","author":"CEED.","year":"2019"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1137\/18M1172260"},{"key":"e_1_2_1_14_1","unstructured":"Thomas C. Clevenger Timo Heister Guido Kanschat and Martin Kronbichler. 2019. A flexible parallel adaptive geometric multigrid method for FEM. 
arXiv:1904.03317 preprint."},{"key":"e_1_2_1_15_1","volume-title":"Mund","author":"Deville Michel O.","year":"2002"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1137\/17M1128903"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1002\/fld.4511"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1002\/fld.4683"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.4208\/aamm.2013.m87"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2005.01.005"},{"key":"e_1_2_1_21_1","unstructured":"Paul F. Fischer Thilina Rathnayake Som Dutta Veselin Dobrev Tzanio Kolev Jean-Sylvain Camier Martin Kronbichler et al. 2019. Running faster in HPC applications. In Preparation."},{"key":"e_1_2_1_22_1","volume-title":"Towards a complete FEM-based simulation toolkit on GPUs: Unstructured grid finite element geometric multigrid solvers with strong smoothers based on sparse approximate inverses. Computers & Fluids 80","author":"Geveler Markus","year":"2013"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1137\/15M1010798"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1137\/130941353"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1080\/17445760601122076"},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of Parallel CFD\u201999","author":"Gropp William D."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.5555\/370049.370405"},{"key":"e_1_2_1_28_1","doi-asserted-by":"crossref","unstructured":"Mark Harris. 2007. Optimizing CUDA. In SC\u201907: High Performance Computing With CUDA. 
Nvidia.","DOI":"10.1145\/1281500.1281650"},{"key":"e_1_2_1_29_1","volume-title":"Proceedings of the 16th AIAA Computational Fluid Dynamics Conference.","author":"Helenbrook Brian"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1089014.1089021"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1137\/090778523"},{"key":"e_1_2_1_32_1","volume-title":"Knights Landing Edition. Morgan Kaufmann","author":"Jeffers James"},{"key":"e_1_2_1_33_1","volume-title":"Dissecting the Volta GPU Architecture Through Microbenchmarking: GTC 2018","author":"Jia Zhe","year":"2019"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1002\/fld.845"},{"key":"e_1_2_1_35_1","volume-title":"Multilevel methods for discontinuous Galerkin FEM on locally refined meshes. Computers & Structures 82, 28","author":"Kanschat Guido","year":"2004"},{"key":"e_1_2_1_36_1","volume-title":"Sherwin","author":"Karniadakis George E.","year":"2013"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1002\/nme.1620191103"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2627373.2627387"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2009.06.041"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/2427023.2427027"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2010.06.024"},{"key":"e_1_2_1_42_1","volume-title":"Implementing Spectral Methods for Partial Differential Equations: Algorithms for Scientists and Engineers","author":"Kopriva David A."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/eScience.2011.53"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1137\/130930352"},{"key":"e_1_2_1_45_1","volume-title":"A generic interface for parallel cell-based finite element operator application. 
Computers & Fluids 63","author":"Kronbichler Martin","year":"2012"},{"key":"e_1_2_1_46_1","volume-title":"Fast matrix-free evaluation of discontinuous Galerkin finite element operators. ACM Transactions on Mathematical Software, in press","author":"Kronbichler Martin","year":"2019"},{"key":"e_1_2_1_47_1","series-title":"Lecture Notes in Computer Science","volume-title":"ISC High Performance","author":"Kronbichler Martin"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1137\/16M110455X"},{"key":"e_1_2_1_49_1","series-title":"Lecture Notes in Computer Science","volume-title":"Euro-Par 2014: Parallel Processing Workshops","author":"Ljungkvist Karl"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.5555\/3108096.3108097"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10915-004-4787-3"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2005.06.019"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01386067"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.28"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.2307\/2007986"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00791-014-0223-x"},{"key":"e_1_2_1_57_1","unstructured":"Nvidia Corporation. 2013. CUDA cuSPARSE Library. Version 8.0. Nvidia."},{"key":"e_1_2_1_59_1","unstructured":"Nvidia Corporation. 2016. Kepler Tuning Guide. 
Nvidia."},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1016\/0021-9991(80)90005-4"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1016\/0021-9991(84)90128-1"},{"key":"e_1_2_1_62_1","volume-title":"An Introduction to Partial Differential Equations","author":"Pinchover Yehuda"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2016.08.005"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/2807591.2807675"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10915-016-0345-z"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.5555\/2388996.2389055"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1002\/nla.1979"},{"key":"e_1_2_1_68_1","volume-title":"International Journal for High Performance Computing Applications","author":"\u015awirydowicz Kasia"},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1146\/annurev.fluid.35.101101.161209"},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPPW.2010.38"},{"key":"e_1_2_1_71_1","volume-title":"Elsevier Academic Press","author":"Trottenberg Ulrich"},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1145\/331532.331599"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1145\/2851488"},{"key":"e_1_2_1_74_1","volume-title":"Proceedings of the GPU Technology Conference (GTC\u201910)","volume":"10","author":"Volkov Vasily","year":"2010"},{"key":"e_1_2_1_75_1","unstructured":"Nicholas Wilt. 2013. The CUDA Handbook: A Comprehensive Guide to GPU Programming. 
Pearson Education."}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3322813","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3322813","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T01:02:26Z","timestamp":1750208546000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3322813"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,3,31]]},"references-count":74,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2019,3,31]]}},"alternative-id":["10.1145\/3322813"],"URL":"https:\/\/doi.org\/10.1145\/3322813","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"value":"2329-4949","type":"print"},{"value":"2329-4957","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,3,31]]},"assertion":[{"value":"2018-01-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-02-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-05-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}