{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,19]],"date-time":"2026-06-19T22:42:25Z","timestamp":1781908945825,"version":"3.54.5"},"reference-count":48,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2026,5,27]],"date-time":"2026-05-27T00:00:00Z","timestamp":1779840000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/legalcode"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2026,6,30]]},"abstract":"<jats:p>\n                    General-purpose compilers abstract away parallelism, locality, and synchronization, limiting their effectiveness on modern spatial architectures. As modern computing architectures increasingly rely on fine-grained control over data movement, execution order, and compute placement for performance, compiler infrastructure must provide explicit mechanisms for orchestrating compute and data to fully exploit such architectures. We introduce MLIR-AIR, a novel, open source compiler stack built on MLIR that bridges the semantic gap between high-level workloads and fine-grained spatial architectures such as AMD\u2019s NPUs. MLIR-AIR defines the AIR dialect, which provides structured representations for asynchronous and hierarchical operations across compute and memory resources. AIR primitives allow the compiler to orchestrate spatial scheduling, distribute computation across hardware regions, and overlap communication with computation without relying on\n                    <jats:italic toggle=\"yes\">ad hoc<\/jats:italic>\n                    runtime coordination or manual scheduling. We demonstrate MLIR-AIR\u2019s capabilities through two case studies: matrix multiplication and the multi-head attention block from the LLaMA 2 model. 
For matrix multiplication, MLIR-AIR achieves up to 78.7% compute efficiency and generates implementations with performance almost identical to state-of-the-art, hand-optimized matrix multiplication written using the lower-level, close-to-metal MLIR-AIE framework. For multi-head attention, we demonstrate that the AIR interface supports fused implementations using approximately 150 lines of code, enabling tractable expression of complex workloads with efficient mapping to spatial hardware. MLIR-AIR transforms high-level structured control flow into spatial programs that efficiently utilize the compute fabric and memory hierarchy of an NPU, leveraging asynchronous execution, tiling, and communication overlap through compiler-managed scheduling.\n                  <\/jats:p>","DOI":"10.1145\/3785670","type":"journal-article","created":{"date-parts":[[2026,1,19]],"date-time":"2026-01-19T14:06:03Z","timestamp":1768831563000},"page":"1-36","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3603-6852","authenticated-orcid":false,"given":"Erwei","family":"Wang","sequence":"first","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-4821-9634","authenticated-orcid":false,"given":"Samuel","family":"Bayliss","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-6639-1865","authenticated-orcid":false,"given":"Andra","family":"Bisca","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, 
USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-0621-6032","authenticated-orcid":false,"given":"Zachary","family":"Blair","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3053-230X","authenticated-orcid":false,"given":"Sangeeta","family":"Chowdhary","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6668-4562","authenticated-orcid":false,"given":"Kristof","family":"Denolf","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-8696-6890","authenticated-orcid":false,"given":"Jeff","family":"Fifield","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-3055-4751","authenticated-orcid":false,"given":"Brandon","family":"Freiberger","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5499-8871","authenticated-orcid":false,"given":"Erika","family":"Hunhoff","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-5503-3511","authenticated-orcid":false,"given":"Phil","family":"James-Roxby","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, 
California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4918-478X","authenticated-orcid":false,"given":"Jack","family":"Lo","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9519-0502","authenticated-orcid":false,"given":"Joseph","family":"Melber","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2956-8428","authenticated-orcid":false,"given":"Stephen","family":"Neuendorffer","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4585-3459","authenticated-orcid":false,"given":"Eddie","family":"Richter","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-5971-8584","authenticated-orcid":false,"given":"Andre","family":"Rosti","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9701-2352","authenticated-orcid":false,"given":"Javier","family":"Setoain","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3502-7401","authenticated-orcid":false,"given":"Gagandeep","family":"Singh","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, 
California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-5136-7580","authenticated-orcid":false,"given":"Endri","family":"Taka","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4352-3816","authenticated-orcid":false,"given":"Pranathi","family":"Vasireddy","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4526-6024","authenticated-orcid":false,"given":"Zhewen","family":"Yu","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2850-0176","authenticated-orcid":false,"given":"Niansong","family":"Zhang","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3659-339X","authenticated-orcid":false,"given":"Jinming","family":"Zhuang","sequence":"additional","affiliation":[{"name":"Research and Advanced Development, AMD, San Jose, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2026,5,27]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/HCS55958.2022.9895630"},{"key":"e_1_3_2_3_2","volume-title":"IEEE\/ACM International Conference on Computer-Aided Design","author":"Bohm Agostini Nicolas","year":"2022","unstructured":"Nicolas Bohm Agostini, Serena Curzel, Vinay Amatya, Cheng Tan, Marco Minutoli, Vito Giovanni Castellana, Joseph Manzano, David Kaeli, and Antonino Tumeo. 2022. 
An MLIR-based compiler flow for system-level design and hardware acceleration. In IEEE\/ACM International Conference on Computer-Aided Design."},{"key":"e_1_3_2_4_2","unstructured":"The Chromium Authors. 2025. Perfetto. Retrieved from https:\/\/perfetto.dev\/docs\/"},{"key":"e_1_3_2_5_2","unstructured":"The IREE Authors. 2019. IREE. Retrieved from https:\/\/iree.dev\/"},{"key":"e_1_3_2_6_2","volume-title":"International Conference on Supercomputing","author":"Manikandan Baskaran Muthu","year":"2008","unstructured":"Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan. 2008. A compiler framework for optimization of affine loop nests for GPGPUs. In International Conference on Supercomputing."},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-11970-5_16"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/1375581.1375595"},{"key":"e_1_3_2_9_2","unstructured":"Cerebras. 2025. Cerebras Wafer Scale Engine. Retrieved from https:\/\/www.cerebras.ai\/chip"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.5555\/355074"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3485137"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2011.2110592"},{"key":"e_1_3_2_13_2","unstructured":"Microsoft Corporation. 2025. Triton-Shared. Retrieved from https:\/\/github.com\/microsoft\/triton-shared"},{"key":"e_1_3_2_14_2","unstructured":"Advanced Micro Devices. 2025. AI Engine. Retrieved from https:\/\/www.amd.com\/en\/products\/adaptive-socs-and-fpgas\/technologies\/ai-engine.html"},{"key":"e_1_3_2_15_2","unstructured":"Advanced Micro Devices. 2025. MLIR-AIE. Retrieved from https:\/\/xilinx.github.io\/mlir-aie\/"},{"key":"e_1_3_2_16_2","unstructured":"Advanced Micro Devices. 2025. ROCr. Retrieved from https:\/\/rocm.docs.amd.com\/projects\/ROCR-Runtime\/en\/latest\/"},{"key":"e_1_3_2_17_2","unstructured":"Advanced Micro Devices. 
2025. Ryzen 7 7840U. Retrieved from https:\/\/www.amd.com\/en\/products\/processors\/laptop\/ryzen\/7000-series\/amd-ryzen-7-7840u.html"},{"key":"e_1_3_2_18_2","unstructured":"Advanced Micro Devices. 2025. XRT. Retrieved from https:\/\/xilinx.github.io\/XRT\/master\/html\/index.html"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3211346.3211354"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICOEI.2017.8300883"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM62733.2025.00043"},{"key":"e_1_3_2_22_2","unstructured":"INTEL. 2025. Intel NPU. Retrieved from https:\/\/edc.intel.com\/content\/www\/us\/en\/design\/products\/platforms\/details\/arrow-lake-s\/core-ultra-200s-series-processors-datasheet-volume-1-of-2\/intel-neural-processing-unit-intel-npu\/"},{"key":"e_1_3_2_23_2","volume-title":"International Conference on Parallel Architectures and Compilation Techniques (PACT)","author":"Jeong Geonhwa","year":"2021","unstructured":"Geonhwa Jeong, Gokcen Kestor, Prasanth Chatarasi, Angshuman Parashar, Po-An Tsai, Sivasankaran Rajamanickam, Roberto Gioiosa, and Tushar Krishna. 2021. Union: A unified HW-SW co-design ecosystem in MLIR for evaluating tensor operations on spatial accelerators. In International Conference on Parallel Architectures and Compilation Techniques (PACT)."},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589350"},{"key":"e_1_3_2_25_2","unstructured":"Michele Lacchia. 2025. Radon. Retrieved from https:\/\/github.com\/rubik\/radon\/tree\/master"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/CGO51591.2021.9370308"},{"key":"e_1_3_2_27_2","unstructured":"Chris Lattner and Jacques Pienaar. 2019. MLIR Primer: A compiler infrastructure for the end of Moore\u2019s law. In Compilers for Machine Learning. 
C4ML workshop at CGO 2019."},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACT52795.2021.00011"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-95953-1_7"},{"key":"e_1_3_2_30_2","unstructured":"NVIDIA. 2020. CUDA Release: 10.2.89. Retrieved from https:\/\/developer.nvidia.com\/cuda-toolkit"},{"key":"e_1_3_2_31_2","unstructured":"OpenAI. 2025. TRITON. Retrieved from https:\/\/triton-lang.org\/main\/index.html"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO61859.2024.00100"},{"key":"e_1_3_2_33_2","unstructured":"Qualcomm. 2025. Qualcomm AI Engine. Retrieved from https:\/\/www.qualcomm.com\/products\/technology\/processors\/ai-engine"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2024.3423692"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM62733.2025.00031"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA57654.2024.00016"},{"key":"e_1_3_2_37_2","unstructured":"Amit Sabne. 2020. XLA: Compiling Machine Learning for Peak Performance. Retrieved from https:\/\/research.google\/pubs\/xla-compiling-machine-learning-for-peak-performance\/"},{"key":"e_1_3_2_38_2","unstructured":"Gagandeep Singh. 2022. Designing, modeling, and optimizing data-intensive computing systems. arXiv:2208.08886. 
Retrieved from https:\/\/arxiv.org\/abs\/2208.08886"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2021.3088396"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1186\/s13059-024-03181-2"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL50879.2020.00014"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3577193.3593719"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD56317.2022.00080"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2023.127063"},{"key":"e_1_3_2_45_2","unstructured":"Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288. Retrieved from https:\/\/arxiv.org\/abs\/2307.09288"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439292"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3617232.3624850"},{"key":"e_1_3_2_48_2","volume-title":"Annual International Symposium on Computer Architecture","author":"Zheng Size","year":"2022","unstructured":"Size Zheng, Renze Chen, Anjiang Wei, Yicheng Jin, Qin Han, Liqiang Lu, Bingyang Wu, Xiuhong Li, Shengen Yan, and Yun Liang. 2022. AMOS: Enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction. 
In Annual International Symposium on Computer Architecture."},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3706628.3708870"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3785670","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,27]],"date-time":"2026-05-27T14:04:43Z","timestamp":1779890683000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3785670"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,5,27]]},"references-count":48,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,6,30]]}},"alternative-id":["10.1145\/3785670"],"URL":"https:\/\/doi.org\/10.1145\/3785670","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,5,27]]},"assertion":[{"value":"2025-06-04","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-12-15","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-05-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}