{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,24]],"date-time":"2025-09-24T08:19:43Z","timestamp":1758701983185,"version":"3.41.0"},"reference-count":48,"publisher":"Association for Computing Machinery (ACM)","issue":"OOPSLA2","license":[{"start":{"date-parts":[[2023,10,16]],"date-time":"2023-10-16T00:00:00Z","timestamp":1697414400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Program. Lang."],"published-print":{"date-parts":[[2023,10,16]]},"abstract":"<jats:p>Partial differential equation (PDE) solvers are extensively utilized across numerous scientific and engineering fields. However, achieving high performance and scalability often necessitates intricate and low-level programming, particularly when leveraging deterministic sparsity patterns in structured grids.<\/jats:p>\n          <jats:p>In this paper, we propose an innovative domain-specific language (DSL), Mat2Stencil, with its compiler, for PDE solvers on structured grids. Mat2Stencil introduces a structured sparse matrix abstraction, facilitating modular, flexible, and easy-to-use expression of solvers across a broad spectrum, encompassing components such as Jacobi or Gauss-Seidel preconditioners, incomplete LU or Cholesky decompositions, and multigrid methods built upon them. Our DSL compiler subsequently generates matrix-free code consisting of generalized stencils through multi-stage programming. The code allows spatial loop-carried dependence in the form of quasi-affine loops, in addition to the Jacobi-style stencil\u2019s embarrassingly parallel on spatial dimensions. We further propose a novel automatic parallelization technique for the spatially dependent loops, which offers a compile-time deterministic task partitioning for threading, calculates necessary inter-thread synchronization automatically, and generates an efficient multi-threaded implementation with fine-grained synchronization.<\/jats:p>\n          <jats:p>Implementing 4 benchmarking programs, 3 of them being the pseudo-applications in NAS Parallel Benchmarks with 6.3% lines of code and 1 being matrix-free High Performance Conjugate Gradients with 16.4% lines of code, we achieve up to 1.67\u00d7 and on average 1.03\u00d7 performance compared to manual implementations.<\/jats:p>","DOI":"10.1145\/3622822","type":"journal-article","created":{"date-parts":[[2023,10,16]],"date-time":"2023-10-16T15:41:29Z","timestamp":1697470889000},"page":"686-715","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Mat2Stencil: A Modular Matrix-Based DSL for Explicit and Implicit Matrix-Free PDE Solvers on Structured Grid"],"prefix":"10.1145","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3870-106X","authenticated-orcid":false,"given":"Huanqi","family":"Cao","sequence":"first","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6543-0859","authenticated-orcid":false,"given":"Shizhi","family":"Tang","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-5021-2912","authenticated-orcid":false,"given":"Qianchao","family":"Zhu","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5537-8244","authenticated-orcid":false,"given":"Bowen","family":"Yu","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4281-1018","authenticated-orcid":false,"given":"Wenguang","family":"Chen","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China \/ Pengcheng Laboratory, Shenzhen, China"}]}],"member":"320","published-online":{"date-parts":[[2023,10,16]]},"reference":[{"volume-title":"Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann","author":"Allen Randy","key":"e_1_2_1_1_1","unstructured":"Randy Allen and Ken Kennedy . 2001. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann , San Francisco, CA, USA . isbn:1-55860-286-0 Randy Allen and Ken Kennedy. 2001. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann, San Francisco, CA, USA. isbn:1-55860-286-0"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3314221.3314615"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2019.8661197"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1177\/109434209100500306"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2896389"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2016.2615094"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1375581.1375595"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.8149701"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1137\/140968896"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2008.5222004"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-78800-3_24"},{"key":"e_1_2_1_12_1","unstructured":"James Decker. 2019. Implementation of Lightweight Modular Staging (LMS) in Python. https:\/\/github.com\/jmd1011\/snek-LMS \t\t\t\t  James Decker. 2019. Implementation of Lightweight Modular Staging (LMS) in Python. https:\/\/github.com\/jmd1011\/snek-LMS"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342015593158"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3579990.3580006"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01407835"},{"key":"e_1_2_1_16_1","unstructured":"Johannes Habich T. Zeiser Georg Hager and Gerhard Wellein. 2009. Enabling temporal blocking for a lattice Boltzmann flow solver through multicore-aware wavefront parallelization. \t\t\t\t  Johannes Habich T. Zeiser Georg Hager and Gerhard Wellein. 2009. Enabling temporal blocking for a lattice Boltzmann flow solver through multicore-aware wavefront parallelization."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.5194\/gmd-12-4729-2019"},{"key":"e_1_2_1_18_1","unstructured":"Intel. 2023. Intel oneAPI Math Kernel Library. https:\/\/www.intel.com\/content\/www\/us\/en\/developer\/tools\/oneapi\/onemkl.html \t\t\t\t  Intel. 2023. Intel oneAPI Math Kernel Library. https:\/\/www.intel.com\/content\/www\/us\/en\/developer\/tools\/oneapi\/onemkl.html"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1178597.1178605"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1250734.1250761"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-47956-5_14"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1916461.1916467"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.5194\/gmd-12-1165-2019"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1111\/j.1365-2478.1983.tb01060.x"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2016.57"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2010.2"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2517208.2517228"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/7902.7904"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342020959423"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/331532.331562"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2184319.2184345"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9780898718003"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183653"},{"key":"e_1_2_1_35_1","volume-title":"Jimy Dudhia, O. Gill, Zhiquan Liu, Judith Berner, Wei Wang, G. Powers, Greg Duda, Dale M. Barker, and Xiangyu Huang.","author":"Skamarock C.","year":"2019","unstructured":"C. Skamarock , Bogumi\u0141 a Klemp , Jimy Dudhia, O. Gill, Zhiquan Liu, Judith Berner, Wei Wang, G. Powers, Greg Duda, Dale M. Barker, and Xiangyu Huang. 2019 . A Description of the Advanced Research WRF Model Version 4. C. Skamarock, Bogumi\u0141 a Klemp, Jimy Dudhia, O. Gill, Zhiquan Liu, Judith Berner, Wei Wang, G. Powers, Greg Duda, Dale M. Barker, and Xiangyu Huang. 2019. A Description of the Advanced Research WRF Model Version 4."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9780898717938"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3278122.3278139"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2584665"},{"volume-title":"Multi-Stage Programming: Its Theory and Applications. Ph. D. Dissertation","author":"Taha Walid","key":"e_1_2_1_39_1","unstructured":"Walid Taha . 1999. Multi-Stage Programming: Its Theory and Applications. Ph. D. Dissertation . Halmstad University , Sweden . https:\/\/urn.kb.se\/resolve?urn=urn:nbn:se:hh:diva-15052 Walid Taha. 1999. Multi-Stage Programming: Its Theory and Applications. Ph. D. Dissertation. Halmstad University, Sweden. https:\/\/urn.kb.se\/resolve?urn=urn:nbn:se:hh:diva-15052"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3519939.3523448"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989493.1989508"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-15582-6_49"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.13140\/RG.2.2.28998.68169"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41592-019-0686-2"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1201\/b10376-8"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-007-0034-5"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2016.5"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2017.89"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476158"}],"container-title":["Proceedings of the ACM on Programming Languages"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3622822","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3622822","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:37:04Z","timestamp":1750178224000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3622822"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,16]]},"references-count":48,"journal-issue":{"issue":"OOPSLA2","published-print":{"date-parts":[[2023,10,16]]}},"alternative-id":["10.1145\/3622822"],"URL":"https:\/\/doi.org\/10.1145\/3622822","relation":{},"ISSN":["2475-1421"],"issn-type":[{"type":"electronic","value":"2475-1421"}],"subject":[],"published":{"date-parts":[[2023,10,16]]},"assertion":[{"value":"2023-10-16","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}