{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:11:47Z","timestamp":1760033507295,"version":"build-2065373602"},"reference-count":43,"publisher":"Association for Computing Machinery (ACM)","issue":"OOPSLA2","license":[{"start":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T00:00:00Z","timestamp":1759968000000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["CCF-2216964"],"award-info":[{"award-number":["CCF-2216964"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Program. Lang."],"published-print":{"date-parts":[[2025,10,9]]},"abstract":"<jats:p>We introduce REPTILE, a compiler that performs tiling optimizations for programs expressed as mathematical recurrence equations. REPTILE recursively decomposes a recurrence program into a set of unique tiles and then simplifies each into a different set of recurrences. Given declarative user specifications of recurrence equations, optimizations, and optional mappings of recurrence subexpressions to external libraries calls, REPTILE generates C code that composes compiler-generated loops with calls to external hand-optimized libraries. We show that for direct linear solvers expressible as recurrence equations, the generated C code matches and often exceed the performance of standard hand-optimized libraries. We evaluate REPTILE's generated C code against hand-optimized implementations of linear solvers in Intel MKL, as well as two non-solver recurrences from bioinformatics: Needleman-Wunsch and Smith-Waterman. When the user provides good tiling specifications, REPTILE achieves parity with MKL, achieving between 0.79--1.27x speedup for the LU decomposition, 0.97--1.21x speedup for the Cholesky decomposition, 1.61x--2.72x for lower triangular matrix inversion, 1.01--1.14x speedup for triangular solve with multiple right-hand sides, and 1.14--1.73x speedup over handwritten implementations of the bioinformatics recurrences.<\/jats:p>","DOI":"10.1145\/3763074","type":"journal-article","created":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T08:49:50Z","timestamp":1759999790000},"page":"670-696","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["REPTILE: Performant Tiling of Recurrences"],"prefix":"10.1145","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-1588-0271","authenticated-orcid":false,"given":"Muhammad Usman","family":"Tariq","sequence":"first","affiliation":[{"name":"Stanford University, Stanford, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5228-1295","authenticated-orcid":false,"given":"Shiv","family":"Sundram","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2267-903X","authenticated-orcid":false,"given":"Fredrik","family":"Kjolstad","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,10,9]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Accessed 2024. OpenBLAS: An Optimized BLAS Library. https:\/\/www.openblas.net\/ Accessed: 2024-11-15"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/180\/1\/012037"},{"key":"e_1_2_1_3_1","volume-title":"Anne Greenbaum, Sven Hammarling, Alan McKenney, et al.","author":"Anderson Edward","year":"1999","unstructured":"Edward Anderson, Zhaojun Bai, Christian Bischof, L Susan Blackford, James Demmel, Jack Dongarra, Jeremy Du Croz, Anne Greenbaum, Sven Hammarling, Alan McKenney, et al. 1999. LAPACK users\u2019 guide. SIAM."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3591236"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/1055531.1055532"},{"key":"e_1_2_1_6_1","unstructured":"Aart J.C. Bik Bixia Zheng Fredrik Kjolstad Nicolas Vasilache Penporn Koanantakool and Tatiana Shpeisman. 2022. Compiler Support for Sparse Tensor Computations in MLIR. ACM Transactions on Architecture and Code Optimization."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.299"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2013.98"},{"key":"e_1_2_1_9_1","volume-title":"Evaluating PaRSEC Through Matrix Computations in Scientific Applications. In Workshop on Asynchronous Many-Task Systems and Applications. 22\u201333","author":"Cao Qinglei","year":"2024","unstructured":"Qinglei Cao, Thomas Herault, Aurelien Bouteiller, Joseph Schuchart, and George Bosilca. 2024. Evaluating PaRSEC Through Matrix Computations in Scientific Applications. In Workshop on Asynchronous Many-Task Systems and Applications. 22\u201333."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/197320.197366"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126936"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3386569.3392486"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","unstructured":"J. Choi J.J. Dongarra R. Pozo and D.W. Walker. 1992. ScaLAPACK: a scalable linear algebra library for distributed memory concurrent computers. In [Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation. 120\u2013127. https:\/\/doi.org\/10.1109\/FMPC.1992.234898 10.1109\/FMPC.1992.234898","DOI":"10.1109\/FMPC.1992.234898"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3276493"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1186\/s12859-016-0930-z"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/77626.77627"},{"key":"e_1_2_1_17_1","volume-title":"The LINPACK benchmark: past, present and future. Concurrency and Computation: practice and experience, 15, 9","author":"Dongarra Jack J","year":"2003","unstructured":"Jack J Dongarra, Piotr Luszczek, and Antoine Petitet. 2003. The LINPACK benchmark: past, present and future. Concurrency and Computation: practice and experience, 15, 9 (2003), 803\u2013820."},{"key":"e_1_2_1_18_1","volume-title":"James R Bunch, and Gilbert W Stewart.","author":"Dongarra Jack J","year":"1979","unstructured":"Jack J Dongarra, Cleve Barry Moler, James R Bunch, and Gilbert W Stewart. 1979. LINPACK users\u2019 guide. SIAM."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.3115\/1219044.1219076"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01407931"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/504210.504213"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3519939.3523446"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/321406.321418"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2019.8661185"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3133901"},{"key":"e_1_2_1_26_1","volume-title":"Sparse tensor algebra compilation. Ph. D. Dissertation","author":"Kj\u00f8lstad Fredrik Berg","unstructured":"Fredrik Berg Kj\u00f8lstad. 2020. Sparse tensor algebra compilation. Ph. D. Dissertation. Massachusetts Institute of Technology."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476167"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/360827.360844"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/355841.355847"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/945885.945888"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1016\/0167-8191(88)90035-X"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1016\/0167-8191(88)90036-1"},{"key":"e_1_2_1_33_1","unstructured":"E Peise and P Bientinesi. 2016. Recursive algorithms for dense linear algebra: the ReLAPACK collection. ArXiv e-prints (Feb. arXiv preprint cs.MS\/1602.06763."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3428226"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3649820"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","unstructured":"Muhammad Usman Tariq Shiv Sundram and Fredrik Kjolstad. 2025. REPTILE artifact. https:\/\/doi.org\/10.5281\/zenodo.15761691 10.5281\/zenodo.15761691","DOI":"10.5281\/zenodo.15761691"},{"key":"e_1_2_1_37_1","doi-asserted-by":"crossref","unstructured":"Ruiqin Tian Luanzheng Guo Jiajia Li Bin Ren and Gokcen Kestor. 2021. A High Performance Sparse Tensor Algebra Compiler in MLIR. 12.","DOI":"10.1109\/LLVMHPC54804.2021.00009"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2755561"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/2764454"},{"key":"e_1_2_1_40_1","doi-asserted-by":"crossref","unstructured":"Endong Wang Qing Zhang Bo Shen Guangyong Zhang Xiaowei Lu Qing Wu Yajuan Wang Endong Wang Qing Zhang Bo Shen et al. 2014. Intel math kernel library. High-Performance Computing on the Intel\u00ae Xeon Phi\u2122: How to Fully Exploit MIC Architectures 167\u2013188.","DOI":"10.1007\/978-3-319-06486-4_7"},{"key":"e_1_2_1_41_1","volume-title":"Automated empirical optimizations of software and the ATLAS project. Parallel computing, 27, 1-2","author":"Whaley R Clint","year":"2001","unstructured":"R Clint Whaley, Antoine Petitet, and Jack J Dongarra. 2001. Automated empirical optimizations of software and the ATLAS project. Parallel computing, 27, 1-2 (2001), 3\u201335."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2017.104"},{"key":"e_1_2_1_43_1","unstructured":"Qing Yi and Ken Kennedy. 2002. Transforming complex loop nests for locality. Ph. D. Dissertation. USA. isbn:049361673X AAI3047379"}],"container-title":["Proceedings of the ACM on Programming Languages"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3763074","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3763074","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:48:08Z","timestamp":1760032088000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3763074"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,9]]},"references-count":43,"journal-issue":{"issue":"OOPSLA2","published-print":{"date-parts":[[2025,10,9]]}},"alternative-id":["10.1145\/3763074"],"URL":"https:\/\/doi.org\/10.1145\/3763074","relation":{},"ISSN":["2475-1421"],"issn-type":[{"value":"2475-1421","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,9]]},"assertion":[{"value":"2025-03-25","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-08-12","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}