{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T01:03:59Z","timestamp":1773277439639,"version":"3.50.1"},"reference-count":48,"publisher":"Association for Computing Machinery (ACM)","issue":"PLDI","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Program. Lang."],"published-print":{"date-parts":[[2025,6,10]]},"abstract":"<jats:p>\n            Domain-specific, fixed-function units are becoming increasingly common in modern processors. As the computational demands of applications evolve, the capabilities and programming interfaces of these fixed-function units continue to change. NVIDIA\u2019s Hopper GPU architecture contains multiple fixed-function units per compute unit, including an asynchronous data movement unit (TMA) and an asynchronous matrix multiplication unit (Tensor Core). Efficiently utilizing these units requires a fundamentally different programming style than previous architectures; programmers must now develop warp-specialized kernels that orchestrate producer-consumer pipelines between the asynchronous units. To manage the complexity of programming these new architectures, we introduce Cypress, a task-based programming model with sequential semantics. Cypress programs are a set of designated functions called\n            <jats:italic toggle=\"yes\">tasks<\/jats:italic>\n            that operate on\n            <jats:italic toggle=\"yes\">tensors<\/jats:italic>\n            and are free of communication and synchronization. Cypress programs are bound to the target machine through a\n            <jats:italic toggle=\"yes\">mapping<\/jats:italic>\n            specification that describes where tasks should run and in which memories tensors should be materialized. We present a compiler architecture that lowers Cypress programs into CUDA programs that perform competitively with expert-written codes. 
Cypress achieves 0.88x-1.06x the performance of cuBLAS on GEMM, and between 0.80x and 0.98x the performance of the currently best-known Flash Attention implementation while eliminating all aspects of explicit data movement and asynchronous computation from application code.\n          <\/jats:p>","DOI":"10.1145\/3729262","type":"journal-article","created":{"date-parts":[[2025,6,13]],"date-time":"2025-06-13T16:02:27Z","timestamp":1749830547000},"page":"396-420","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Task-Based Tensor Computations on Modern GPUs"],"prefix":"10.1145","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0746-066X","authenticated-orcid":false,"given":"Rohan","family":"Yadav","sequence":"first","affiliation":[{"name":"Stanford University, Stanford, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6093-7602","authenticated-orcid":false,"given":"Michael","family":"Garland","sequence":"additional","affiliation":[{"name":"NVIDIA, Santa Clara, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3372-9036","authenticated-orcid":false,"given":"Alex","family":"Aiken","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8928-3032","authenticated-orcid":false,"given":"Michael","family":"Bauer","sequence":"additional","affiliation":[{"name":"NVIDIA, Santa Clara, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,6,13]]},"reference":[{"key":"e_1_2_2_1_1","volume-title":"Compilers: Principles, Techniques, and Tools (2nd Edition)","author":"Aho Alfred V.","year":"2006","unstructured":"Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. 2006. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison-Wesley Longman Publishing Co., Inc., USA. isbn:0321486811"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.1631"},{"key":"e_1_2_2_3_1","unstructured":"JAX Authors. 2024. Pallas Documentation. 
https:\/\/jax.readthedocs.io\/en\/latest\/pallas\/index.html"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063400"},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2555243.2555258"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2012.71"},{"key":"e_1_2_2_7_1","unstructured":"Ganesh Bikshandi and Jay Shah. 2023. A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library. arxiv:2312.11918. arxiv:2312.11918"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1640089.1640097"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.281"},{"key":"e_1_2_2_10_1","volume-title":"TVM: End-to-End Optimization Stack for Deep Learning. CoRR, abs\/1802.04799","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q. Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: End-to-End Optimization Stack for Deep Learning. CoRR, abs\/1802.04799 (2018), arXiv:1802.04799. arxiv:1802.04799"},{"key":"e_1_2_2_11_1","unstructured":"Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arxiv:2307.08691. arxiv:2307.08691"},{"key":"e_1_2_2_12_1","volume-title":"Language Modeling with Gated Convolutional Networks. CoRR, abs\/1612.08083","author":"Dauphin Yann N.","year":"2016","unstructured":"Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language Modeling with Gated Convolutional Networks. CoRR, abs\/1612.08083 (2016), arXiv:1612.08083. 
arxiv:1612.08083"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2012.44"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2006.55"},{"key":"e_1_2_2_15_1","volume-title":"Fireiron: A Scheduling Language for High-Performance Linear Algebra on GPUs. CoRR, abs\/2003.06324","author":"Hagedorn Bastian","year":"2020","unstructured":"Bastian Hagedorn, Archibald Samuel Elliott, Henrik Barthels, Rastislav Bod\u00edk, and Vinod Grover. 2020. Fireiron: A Scheduling Language for High-Performance Linear Algebra on GPUs. CoRR, abs\/2003.06324 (2020), arXiv:2003.06324. arxiv:2003.06324"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3582016.3582018"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3408974"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-020-2649-2"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1345206.1345229"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3519939.3523446"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3133901"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1229428.1229477"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO51591.2021.9370308"},{"key":"e_1_2_2_24_1","volume-title":"Ray: A Distributed Framework for Emerging AI Applications. CoRR, abs\/1712.05889","author":"Moritz Philipp","year":"2017","unstructured":"Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, William Paul, Michael I. Jordan, and Ion Stoica. 2017. Ray: A Distributed Framework for Emerging AI Applications. CoRR, abs\/1712.05889 (2017), arXiv:1712.05889. arxiv:1712.05889"},{"key":"e_1_2_2_25_1","unstructured":"NVIDIA. 2017. Volta Architecture Whitepaper. https:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf"},{"key":"e_1_2_2_26_1","unstructured":"NVIDIA. 2018. Turing Architecture Whitepaper. 
https:\/\/images.nvidia.com\/aem-dam\/en-zz\/Solutions\/design-visualization\/technologies\/turing-architecture\/NVIDIA-Turing-Architecture-Whitepaper.pdf"},{"key":"e_1_2_2_27_1","unstructured":"NVIDIA. 2021. Ampere Architecture Whitepaper. https:\/\/images.nvidia.com\/aem-dam\/en-zz\/Solutions\/data-center\/nvidia-ampere-architecture-whitepaper.pdf"},{"key":"e_1_2_2_28_1","unstructured":"NVIDIA. 2022. CUTLASS - CuTe Documentation. https:\/\/github.com\/NVIDIA\/cutlass\/tree\/main\/media\/docs\/cute"},{"key":"e_1_2_2_29_1","unstructured":"NVIDIA. 2023. Hopper Architecture Whitepaper. https:\/\/resources.nvidia.com\/en-us-tensor-core\/gtc22-whitepaper-hopper"},{"key":"e_1_2_2_30_1","unstructured":"NVIDIA. 2023. NVIDIA CUTLASS. https:\/\/github.com\/NVIDIA\/cutlass"},{"key":"e_1_2_2_31_1","unstructured":"NVIDIA. 2024. CUTLASS Ampere GEMM. https:\/\/github.com\/NVIDIA\/cutlass\/blob\/53668799b2e38d3bb4d8245e949301476344fc2c\/include\/cutlass\/gemm\/collective\/sm80_mma_multistage.hpp"},{"key":"e_1_2_2_32_1","unstructured":"NVIDIA. 2024. CUTLASS Hopper GEMM. https:\/\/github.com\/NVIDIA\/cutlass\/blob\/53668799b2e38d3bb4d8245e949301476344fc2c\/include\/cutlass\/gemm\/collective\/sm90_mma_tma_gmma_rs_warpspecialized.hpp"},{"key":"e_1_2_2_33_1","unstructured":"NVIDIA. 2024. Efficient GEMM In CUDA. https:\/\/github.com\/NVIDIA\/cutlass\/blob\/main\/media\/docs\/efficient_gemm.md"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2499370.2462176"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3665643"},{"key":"e_1_2_2_36_1","volume-title":"Dask: Parallel Computation with Blocked algorithms and Task Scheduling. In SciPy, Kathryn Huff and James Bergstra (Eds.). scipy.org, 126\u2013132","author":"Rocklin Matthew","year":"2015","unstructured":"Matthew Rocklin. 2015. Dask: Parallel Computation with Blocked algorithms and Task Scheduling. In SciPy, Kathryn Huff and James Bergstra (Eds.). scipy.org, 126\u2013132. 
http:\/\/dblp.uni-trier.de\/db\/conf\/scipy\/scipy2015.html#Rocklin15"},{"key":"e_1_2_2_37_1","unstructured":"Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. arxiv:2407.08608. arxiv:2407.08608"},{"key":"e_1_2_2_38_1","unstructured":"Noam Shazeer. 2020. GLU Variants Improve Transformer. arxiv:2002.05202. arxiv:2002.05202"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3530390.3532733"},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/2807591.2807629"},{"key":"e_1_2_2_41_1","unstructured":"Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, and Christopher R\u00e9. 2024. ThunderKittens: Simple, Fast, and Adorable AI Kernels. arxiv:2410.20399. arxiv:2410.20399"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2017.7863730"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3315508.3329973"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3022671.2984016"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/2764454"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.1998.10004"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3434304"},{"key":"e_1_2_2_48_1","unstructured":"Rohan Yadav, Shiv Sundram, Wonchan Lee, Michael Garland, Michael Bauer, Alex Aiken, and Fredrik Kjolstad. 2024. Composing Distributed Computations Through Task and Kernel Fusion. arxiv:2406.18109. 
arxiv:2406.18109"}],"container-title":["Proceedings of the ACM on Programming Languages"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3729262","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,16]],"date-time":"2025-07-16T06:03:31Z","timestamp":1752645811000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3729262"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,10]]},"references-count":48,"journal-issue":{"issue":"PLDI","published-print":{"date-parts":[[2025,6,10]]}},"alternative-id":["10.1145\/3729262"],"URL":"https:\/\/doi.org\/10.1145\/3729262","relation":{},"ISSN":["2475-1421"],"issn-type":[{"value":"2475-1421","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,10]]},"assertion":[{"value":"2024-11-12","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-03-06","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-06-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}