{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,20]],"date-time":"2025-12-20T08:39:43Z","timestamp":1766219983920,"version":"3.48.0"},"publisher-location":"New York, NY, USA","reference-count":43,"publisher":"ACM","funder":[{"name":"National Key R\\\\&D Program of China","award":["2023YFB3002000"],"award-info":[{"award-number":["2023YFB3002000"]}]},{"name":"National Natural Science Foundation of China","award":["U23A6007"],"award-info":[{"award-number":["U23A6007"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,9,8]]},"DOI":"10.1145\/3754598.3754604","type":"proceedings-article","created":{"date-parts":[[2025,12,20]],"date-time":"2025-12-20T08:34:32Z","timestamp":1766219672000},"page":"1-10","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Auto-Stencil: Performance-Driven Stencil Optimization with Hardware Feedback for LLMs"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7664-6559","authenticated-orcid":false,"given":"Quan","family":"Deng","sequence":"first","affiliation":[{"name":"Department of Computer Science and Technology, Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1297-4462","authenticated-orcid":false,"given":"Lin","family":"Gan","sequence":"additional","affiliation":[{"name":"Yau Mathematical Sciences Center, Tsinghua University, Beijing, China and Hetao Institute of Mathematics and Interdisciplinary Sciences, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2591-6328","authenticated-orcid":false,"given":"Hongkun","family":"Yu","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Technology, Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1036-4732","authenticated-orcid":false,"given":"Wenlai","family":"Zhao","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Technology, Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8673-8254","authenticated-orcid":false,"given":"Guangwen","family":"Yang","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Technology, Tsinghua University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,12,20]]},"reference":[{"key":"e_1_3_3_2_2_2","doi-asserted-by":"crossref","unstructured":"Chanyoung Oh Zhen Zheng Xipeng Shen Jidong Zhai and Youngmin Yi. Gopipe: A granularity-oblivious programming framework for pipelined stencil executions on gpu. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques pages 43\u201354 2020.","DOI":"10.1145\/3410463.3414656"},{"key":"e_1_3_3_2_3_2","doi-asserted-by":"crossref","unstructured":"Uday Bondhugula Vinayaka Bandishti and Irshad Pananilath. Diamond tiling: Tiling techniques to maximize parallelism for stencil computations. IEEE Transactions on Parallel and Distributed Systems 28(5):1285\u20131298 2016.","DOI":"10.1109\/TPDS.2016.2615094"},{"key":"e_1_3_3_2_4_2","doi-asserted-by":"crossref","unstructured":"Kazuaki Matsumura Hamid\u00a0Reza Zohouri Mohamed Wahib Toshio Endo and Satoshi Matsuoka. An5d: automated stencil framework for high-degree temporal blocking on gpus. In Proceedings of the 18th ACM\/IEEE International Symposium on Code Generation and Optimization pages 199\u2013211 2020.","DOI":"10.1145\/3368826.3377904"},{"key":"e_1_3_3_2_5_2","doi-asserted-by":"crossref","unstructured":"Prashant\u00a0Singh Rawat Miheer Vaidya Aravind Sukumaran-Rajam Mahesh Ravishankar Vinod Grover Atanas Rountev Louis-No\u00ebl Pouchet and Ponnuswamy Sadayappan. Domain-specific optimization and generation of high-performance gpu code for stencil computations. Proceedings of the IEEE 106(11):1902\u20131920 2018.","DOI":"10.1109\/JPROC.2018.2862896"},{"key":"e_1_3_3_2_6_2","doi-asserted-by":"crossref","unstructured":"Charles Yount. Vector folding: improving stencil performance via multi-dimensional simd-vector representation. In 2015 IEEE 17th international conference on high performance computing and communications 2015 IEEE 7th international symposium on cyberspace safety and security and 2015 IEEE 12th international conference on embedded software and systems pages 865\u2013870. IEEE 2015.","DOI":"10.1109\/HPCC-CSS-ICESS.2015.27"},{"key":"e_1_3_3_2_7_2","doi-asserted-by":"crossref","unstructured":"Sven Verdoolaege Juan Carlos\u00a0Juega Albert Cohen Jose Ignacio\u00a0Gomez Christian Tenllado and Francky Catthoor. Polyhedral parallel code generation for cuda. ACM Transactions on Architecture and Code Optimization (TACO) 9(4):1\u201323 2013.","DOI":"10.1145\/2400682.2400713"},{"key":"e_1_3_3_2_8_2","doi-asserted-by":"crossref","unstructured":"Xin You Hailong Yang Zhonghui Jiang Zhongzhi Luan and Depei Qian. Drstencil: Exploiting data reuse within low-order stencil on gpu. In 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor Cloud & Big Data Systems & Application (HPCC\/DSS\/SmartCity\/DependSys) pages 63\u201370. IEEE 2021.","DOI":"10.1109\/HPCC-DSS-SmartCity-DependSys53884.2021.00036"},{"key":"e_1_3_3_2_9_2","unstructured":"Zheng Yuan Hongyi Yuan Chuanqi Tan Wei Wang Songfang Huang and Fei Huang. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2304.05302 2023."},{"key":"e_1_3_3_2_10_2","unstructured":"John Schulman Filip Wolski Prafulla Dhariwal Alec Radford and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1707.06347 2017."},{"key":"e_1_3_3_2_11_2","unstructured":"Shukai Duan Nikos Kanakaris Xiongye Xiao Heng Ping Chenyu Zhou Nesreen\u00a0K Ahmed Guixiang Ma Mihai Capota Theodore\u00a0L Willke Shahin Nazarian et\u00a0al. Leveraging reinforcement learning and large language models for code optimization. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2312.05657 2023."},{"key":"e_1_3_3_2_12_2","doi-asserted-by":"crossref","unstructured":"Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics pages 311\u2013318 2002.","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_3_3_2_13_2","unstructured":"Shuo Ren Daya Guo Shuai Lu Long Zhou Shujie Liu Duyu Tang Neel Sundaresan Ming Zhou Ambrosio Blanco and Shuai Ma. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2009.10297 2020."},{"key":"e_1_3_3_2_14_2","unstructured":"Yuanbo Wen Qi\u00a0Guo Qiang Fu Xiaqing Li Jianxing Xu Yanlin Tang Yongwei Zhao Xing Hu Zidong Du Ling Li et\u00a0al. Babeltower: Learning to auto-parallelized program translation. In International Conference on Machine Learning pages 23685\u201323700. PMLR 2022."},{"key":"e_1_3_3_2_15_2","doi-asserted-by":"crossref","unstructured":"Yuetao Chen Kun Li Yuhao Wang Donglin Bai Lei Wang Lingxiao Ma Liang Yuan Yunquan Zhang Ting Cao and Mao Yang. Convstencil: Transform stencil computation to matrix multiplication on tensor cores. In Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming pages 333\u2013347 2024.","DOI":"10.1145\/3627535.3638476"},{"key":"e_1_3_3_2_16_2","doi-asserted-by":"crossref","unstructured":"Yiwei Zhang Kun Li Liang Yuan Jiawen Cheng Yunquan Zhang Ting Cao and Mao Yang. Lorastencil: Low-rank adaptation of stencil computation on tensor cores. In SC24: International Conference for High Performance Computing Networking Storage and Analysis pages 1\u201317. IEEE 2024.","DOI":"10.1109\/SC41406.2024.00059"},{"key":"e_1_3_3_2_17_2","unstructured":"Josh Achiam Steven Adler Sandhini Agarwal Lama Ahmad Ilge Akkaya Florencia\u00a0Leoni Aleman Diogo Almeida Janko Altenschmidt Sam Altman Shyamal Anadkat et\u00a0al. Gpt-4 technical report. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2303.08774 2023."},{"key":"e_1_3_3_2_18_2","unstructured":"Daya Guo Dejian Yang Haowei Zhang Junxiao Song Ruoyu Zhang Runxin Xu Qihao Zhu Shirong Ma Peiyi Wang Xiao Bi et\u00a0al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2501.12948 2025."},{"key":"e_1_3_3_2_19_2","unstructured":"Binyuan Hui Jian Yang Zeyu Cui Jiaxi Yang Dayiheng Liu Lei Zhang Tianyu Liu Jiajun Zhang Bowen Yu Keming Lu et\u00a0al. Qwen2. 5-coder technical report. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2409.12186 2024."},{"key":"e_1_3_3_2_20_2","doi-asserted-by":"crossref","unstructured":"Mahesh Ravishankar Justin Holewinski and Vinod Grover. Forma: A dsl for image processing applications to target gpus and multi-core cpus. In Proceedings of the 8th Workshop on General Purpose Processing using GPUs pages 109\u2013120 2015.","DOI":"10.1145\/2716282.2716290"},{"key":"e_1_3_3_2_21_2","doi-asserted-by":"crossref","unstructured":"Bastian Hagedorn Larisa Stoltzfus Michel Steuwer Sergei Gorlatch and Christophe Dubach. High performance stencil code generation with lift. In Proceedings of the 2018 International Symposium on Code Generation and Optimization pages 100\u2013112 2018.","DOI":"10.1145\/3168824"},{"key":"e_1_3_3_2_22_2","doi-asserted-by":"crossref","unstructured":"Prashant\u00a0Singh Rawat Miheer Vaidya Aravind Sukumaran-Rajam Atanas Rountev Louis-No\u00ebl Pouchet and P\u00a0Sadayappan. On optimizing complex stencils on gpus. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) pages 641\u2013652. IEEE 2019.","DOI":"10.1109\/IPDPS.2019.00073"},{"key":"e_1_3_3_2_23_2","doi-asserted-by":"crossref","unstructured":"Naoya Maruyama Tatsuo Nomura Kento Sato and Satoshi Matsuoka. Physis: an implicitly parallel programming model for stencil computations on large-scale gpu-accelerated supercomputers. In Proceedings of 2011 International Conference for High Performance Computing Networking Storage and Analysis pages 1\u201312 2011.","DOI":"10.1145\/2063384.2063398"},{"key":"e_1_3_3_2_24_2","doi-asserted-by":"crossref","unstructured":"Johannes de\u00a0Fine\u00a0Licht Andreas Kuster Tiziano De\u00a0Matteis Tal Ben-Nun Dominic Hofer and Torsten Hoefler. Stencilflow: Mapping large stencil programs to distributed spatial computing systems. In 2021 IEEE\/ACM International Symposium on Code Generation and Optimization (CGO) pages 315\u2013326. IEEE 2021.","DOI":"10.1109\/CGO51591.2021.9370315"},{"key":"e_1_3_3_2_25_2","doi-asserted-by":"crossref","unstructured":"Mingzhen Li Yi\u00a0Liu Hailong Yang Yongmin Hu Qingxiao Sun Bangduo Chen Xin You Xiaoyan Liu Zhongzhi Luan and Depei Qian. Automatic code generation and optimization of large-scale stencil computation on many-core processors. In Proceedings of the 50th International Conference on Parallel Processing pages 1\u201312 2021.","DOI":"10.1145\/3472456.3473517"},{"key":"e_1_3_3_2_26_2","doi-asserted-by":"crossref","unstructured":"Xiaoyan Liu Yi\u00a0Liu Hailong Yang Jianjin Liao Mingzhen Li Zhongzhi Luan and Depei Qian. Toward accelerated stencil computation by adapting tensor core unit on gpu. In Proceedings of the 36th ACM International Conference on Supercomputing pages 1\u201312 2022.","DOI":"10.1145\/3524059.3532392"},{"key":"e_1_3_3_2_27_2","doi-asserted-by":"crossref","unstructured":"Kun Li Liang Yuan Yunquan Zhang and Yue Yue. Reducing redundancy in data organization and arithmetic calculation for stencil computations. In Proceedings of the International Conference for High Performance Computing Networking Storage and Analysis pages 1\u201315 2021.","DOI":"10.1145\/3458817.3476154"},{"key":"e_1_3_3_2_28_2","doi-asserted-by":"crossref","unstructured":"Zafar Ahmad Rezaul Chowdhury Rathish Das Pramod Ganapathi Aaron Gregory and Yimin Zhu. Brief announcement: Faster stencil computations using gaussian approximations. In Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures pages 291\u2013293 2022.","DOI":"10.1145\/3490148.3538558"},{"key":"e_1_3_3_2_29_2","doi-asserted-by":"crossref","unstructured":"Jason Ansel Shoaib Kamil Kalyan Veeramachaneni Jonathan Ragan-Kelley Jeffrey Bosboom Una-May O\u2019Reilly and Saman Amarasinghe. Opentuner: An extensible framework for program autotuning. In Proceedings of the 23rd international conference on Parallel architectures and compilation pages 303\u2013316 2014.","DOI":"10.1145\/2628071.2628092"},{"key":"e_1_3_3_2_30_2","doi-asserted-by":"crossref","unstructured":"Joseph\u00a0D Garvey and Tarek\u00a0S Abdelrahman. Automatic performance tuning of stencil computations on gpus. In 2015 44th International Conference on Parallel Processing pages 300\u2013309. IEEE 2015.","DOI":"10.1109\/ICPP.2015.39"},{"key":"e_1_3_3_2_31_2","doi-asserted-by":"crossref","unstructured":"Qingxiao Sun Yi\u00a0Liu Hailong Yang Zhonghui Jiang Zhongzhi Luan and Depei Qian. Adaptive auto-tuning framework for global exploration of stencil optimization on gpus. IEEE Transactions on Parallel and Distributed Systems 35(1):20\u201333 2023.","DOI":"10.1109\/TPDS.2023.3325630"},{"key":"e_1_3_3_2_32_2","doi-asserted-by":"crossref","unstructured":"Mohamed-Walid Benabderrahmane Louis-No\u00ebl Pouchet Albert Cohen and C\u00e9dric Bastoul. The polyhedral model is more widely applicable than you think. In International Conference on Compiler Construction pages 283\u2013303. Springer 2010.","DOI":"10.1007\/978-3-642-11970-5_16"},{"key":"e_1_3_3_2_33_2","doi-asserted-by":"crossref","unstructured":"Junyi Liu John Wickerson Samuel Bayliss and George\u00a0A Constantinides. Polyhedral-based dynamic loop pipelining for high-level synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37(9):1802\u20131815 2017.","DOI":"10.1109\/TCAD.2017.2783363"},{"key":"e_1_3_3_2_34_2","doi-asserted-by":"crossref","unstructured":"Louis-Noel Pouchet Peng Zhang Ponnuswamy Sadayappan and Jason Cong. Polyhedral-based data reuse optimization for configurable computing. In Proceedings of the ACM\/SIGDA international symposium on Field programmable gate arrays pages 29\u201338 2013.","DOI":"10.1145\/2435264.2435273"},{"key":"e_1_3_3_2_35_2","doi-asserted-by":"crossref","unstructured":"Sven Verdoolaege. isl: An integer set library for the polyhedral model. In International Congress on Mathematical Software pages 299\u2013302. Springer 2010.","DOI":"10.1007\/978-3-642-15582-6_49"},{"key":"e_1_3_3_2_36_2","doi-asserted-by":"crossref","unstructured":"Zhen Li Ali Jannesari and Felix Wolf. Discovery of potential parallelism in sequential programs. In 2013 42nd International Conference on Parallel Processing pages 1004\u20131013. IEEE 2013.","DOI":"10.1109\/ICPP.2013.119"},{"key":"e_1_3_3_2_37_2","doi-asserted-by":"crossref","unstructured":"Muthu\u00a0Manikandan Baskaran Uday Bondhugula Sriram Krishnamoorthy Jagannathan Ramanujam Atanas Rountev and Ponnuswamy Sadayappan. A compiler framework for optimization of affine loop nests for gpgpus. In Proceedings of the 22nd annual international conference on Supercomputing pages 225\u2013234 2008.","DOI":"10.1145\/1375527.1375562"},{"key":"e_1_3_3_2_38_2","doi-asserted-by":"crossref","unstructured":"Gleison Mendon\u00e7a Breno Guimar\u00e3es P\u00e9ricles Alves M\u00e1rcio Pereira Guido Ara\u00fajo and Fernando Magno\u00a0Quint\u00e3o Pereira. Dawncc: automatic annotation for data parallelism and offloading. ACM Transactions on Architecture and Code Optimization (TACO) 14(2):1\u201325 2017.","DOI":"10.1145\/3084540"},{"key":"e_1_3_3_2_39_2","doi-asserted-by":"crossref","unstructured":"Cedric Nugteren and Henk Corporaal. Bones: An automatic skeleton-based c-to-cuda compiler for gpus. ACM Transactions on Architecture and Code Optimization (TACO) 11(4):1\u201325 2014.","DOI":"10.1145\/2665079"},{"key":"e_1_3_3_2_40_2","unstructured":"Alexander Shypula Aman Madaan Yimeng Zeng Uri Alon Jacob Gardner Milad Hashemi Graham Neubig Parthasarathy Ranganathan Osbert Bastani and Amir Yazdanbakhsh. Learning performance-improving code edits. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2302.07867 2023."},{"key":"e_1_3_3_2_41_2","unstructured":"Jiate Liu Yiqin Zhu Kaiwen Xiao Qiang Fu Xiao Han Wei Yang and Deheng Ye. Rltf: Reinforcement learning from unit test feedback. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2307.04349 2023."},{"key":"e_1_3_3_2_42_2","unstructured":"Hung Le Yue Wang Akhilesh\u00a0Deepak Gotmare Silvio Savarese and Steven Chu\u00a0Hong Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems 35:21314\u201321328 2022."},{"key":"e_1_3_3_2_43_2","unstructured":"Parshin Shojaee Aneesh Jain Sindhu Tipirneni and Chandan\u00a0K Reddy. Execution-based code generation using deep reinforcement learning. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2301.13816 2023."},{"key":"e_1_3_3_2_44_2","unstructured":"Bo\u00a0Shen Jiaxin Zhang Taihong Chen Daoguang Zan Bing Geng An\u00a0Fu Muhan Zeng Ailun Yu Jichuan Ji Jingyang Zhao et\u00a0al. Pangu-coder2: Boosting large language models for code with ranking feedback. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2307.14936 2023."}],"event":{"name":"ICPP '25: 54th International Conference on Parallel Processing","location":"San Diego CA USA","acronym":"ICPP '25"},"container-title":["Proceedings of the 54th International Conference on Parallel Processing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3754598.3754604","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,20]],"date-time":"2025-12-20T08:36:25Z","timestamp":1766219785000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3754598.3754604"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,8]]},"references-count":43,"alternative-id":["10.1145\/3754598.3754604","10.1145\/3754598"],"URL":"https:\/\/doi.org\/10.1145\/3754598.3754604","relation":{},"subject":[],"published":{"date-parts":[[2025,9,8]]},"assertion":[{"value":"2025-12-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}