{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T15:31:13Z","timestamp":1773588673312,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":58,"publisher":"ACM","funder":[{"name":"NSF &#x28;National Science Foundation&#x29;","award":["NSF Graduate Research Fellowship Program &#x28;NSF GRFP&#x29;"],"award-info":[{"award-number":["NSF Graduate Research Fellowship Program &#x28;NSF GRFP&#x29;"]}]},{"name":"Stanford University","award":["Stanford Graduate Fellowship &#x28;SGF&#x29;"],"award-info":[{"award-number":["Stanford Graduate Fellowship &#x28;SGF&#x29;"]}]},{"name":"Defense Advanced Research Projects Agency","award":["HR00112520038"],"award-info":[{"award-number":["HR00112520038"]}]},{"name":"Naval Surface Warfare Center","award":["N00164-23-9-G057-01"],"award-info":[{"award-number":["N00164-23-9-G057-01"]}]},{"name":"Stanford University","award":["Stanford Data Analytics for What&rsquo;s Next &#x28;DAWN&#x29; Affiliate Program"],"award-info":[{"award-number":["Stanford Data Analytics for What&rsquo;s Next &#x28;DAWN&#x29; Affiliate Program"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,3,22]]},"DOI":"10.1145\/3779212.3790229","type":"proceedings-article","created":{"date-parts":[[2026,3,10]],"date-time":"2026-03-10T13:55:26Z","timestamp":1773150926000},"page":"1912-1932","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Streaming Tensor Programs: A Streaming Abstraction for Dynamic Parallelism"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-1899-1043","authenticated-orcid":false,"given":"Gina","family":"Sohn","sequence":"first","affiliation":[{"name":"Stanford University, Stanford, CA, 
USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3866-8167","authenticated-orcid":false,"given":"Genghan","family":"Zhang","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-9542-3317","authenticated-orcid":false,"given":"Konstantin","family":"Hossfeld","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4773-9251","authenticated-orcid":false,"given":"Jungwoo","family":"Kim","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-7994-4875","authenticated-orcid":false,"given":"Nathan","family":"Sobotka","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9668-902X","authenticated-orcid":false,"given":"Nathan","family":"Zhang","sequence":"additional","affiliation":[{"name":"SambaNova Systems, Inc, Palo Alto, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4195-8106","authenticated-orcid":false,"given":"Olivia","family":"Hsu","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA, USA and Carnegie Mellon University, Pittsburgh, PA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8779-0636","authenticated-orcid":false,"given":"Kunle","family":"Olukotun","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA, USA"}]}],"member":"320","published-online":{"date-parts":[[2026,3,22]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Sandhini Agarwal Lama Ahmad Jason Ai Sam Altman Andy Applebaum Edwin Arbus Rahul K Arora Yu Bai Bowen Baker Haiming Bao et al. 2025. gpt-oss-120b & gpt-oss-20b model card. 
arXiv preprint arXiv:2508.10925 (2025)."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640366"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3385412.3385965"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3582016.3582020"},{"key":"e_1_3_2_1_5_1","first-page":"1362","article-title":"Llm-inference-bench: Inference benchmarking of large language models on ai accelerators. In SC24-W: Workshops of the International Conference for High Performance Computing","author":"Chitty-Venkata Krishna Teja","year":"2024","unstructured":"Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus, Aditya Tanikanti, Ken Raffenetti, Valerie Taylor, Murali Emani, and Venkatram Vishwanath. 2024. Llm-inference-bench: Inference benchmarking of large language models on ai accelerators. In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1362-1379.","journal-title":"Networking, Storage and Analysis. IEEE"},{"key":"e_1_3_2_1_6_1","unstructured":"Gheorghe Comanici Eric Bieber Mike Schaekermann Ice Pasupat Noveen Sachdeva Inderjit Dhillon Marcel Blistein Ori Ram Dan Zhang Evan Rosen et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning multimodality long context and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3503222.3507706"},{"key":"e_1_3_2_1_8_1","volume-title":"Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066","author":"Dai Damai","year":"2024","unstructured":"Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al., 2024. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. 
arXiv preprint arXiv:2401.06066 (2024)."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2020.3012750"},{"key":"e_1_3_2_1_10_1","unstructured":"Deep Ganguli Liane Lovitt Jackson Kernion Amanda Askell Yuntao Bai Saurav Kadavath Ben Mann Ethan Perez Nicholas Schiefer Kamal Ndousse et al. 2022. Red teaming language models to reduce harms: Methods scaling behaviors and lessons learned. arXiv preprint arXiv:2209.07858 (2022)."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3729256"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO56248.2022.00046"},{"key":"e_1_3_2_1_13_1","unstructured":"Daya Guo Dejian Yang Haowei Zhang Junxiao Song Ruoyu Zhang Runxin Xu Qihao Zhu Shirong Ma Peiyi Wang Xiao Bi et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3696443.3708918"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3582016.3582051"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3104255"},{"key":"e_1_3_2_1_17_1","volume-title":"Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al.","author":"Jiang Albert Q","year":"2024","unstructured":"Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al., 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)."},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3192366.3192379"},{"key":"e_1_3_2_1_19_1","volume-title":"Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance. 
arXiv preprint arXiv:2410.23668","author":"Koeplinger David","year":"2024","unstructured":"David Koeplinger, Darshan Gandhi, Pushkar Nandkar, Nathan Sheeley, Matheen Musaddiq, Leon Zhang, Reid Goodbar, Matthew Shaffer, Han Wang, Angela Wang, et al., 2024. Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance. arXiv preprint arXiv:2410.23668 (2024)."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/VLSITechnologyandCir46783.2024.10631383"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_2_1_22_1","volume-title":"Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv preprint arXiv:2309.06180","author":"Kwon Woosuk","year":"2023","unstructured":"Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023b. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv preprint arXiv:2309.06180 (2023). https:\/\/arxiv.org\/abs\/2309.06180"},{"key":"e_1_3_2_1_23_1","unstructured":"Aixin Liu Bei Feng Bing Xue Bingxuan Wang Bochao Wu Chengda Lu Chenggang Zhao Chengqi Deng Chenyu Zhang Chong Ruan et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)."},{"key":"e_1_3_2_1_24_1","volume-title":"LM Arena Leaderboard. https:\/\/lmarena.ai\/leaderboard. Accessed","author":"Arena LM","year":"2025","unstructured":"LM Arena. 2025. LM Arena Leaderboard. https:\/\/lmarena.ai\/leaderboard. Accessed: Aug. 2025."},{"key":"e_1_3_2_1_25_1","volume-title":"F. Nisa Bostanc\u0131, Ataberk Olgun, A. Giray Ya\u011flik\u00e7i, and Onur Mutlu.","author":"Luo Haocong","year":"2023","unstructured":"Haocong Luo, Yahya Can Tu\u011frul, F. Nisa Bostanc\u0131, Ataberk Olgun, A. Giray Ya\u011flik\u00e7i, and Onur Mutlu. 2023. 
Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator."},{"key":"e_1_3_2_1_26_1","volume-title":"Automation & Test in Europe Conference & Exhibition (DATE). 918-923","author":"Meng Pingfan","year":"2016","unstructured":"Pingfan Meng, Alric Althoff, Quentin Gautier, and Ryan Kastner. 2016. Adaptive Threshold Non-Pareto Elimination: Re-thinking machine learning for system level design space exploration on FPGAs. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE). 918-923."},{"key":"e_1_3_2_1_27_1","unstructured":"Meta AI. 2025. The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal Models. https:\/\/ai.meta.com\/blog\/llama-4-multimodal-intelligence\/"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.7717\/peerj-cs.103"},{"key":"e_1_3_2_1_29_1","volume-title":"Proceedings. Second ACM and IEEE International Conference on Formal Methods and Models for Co-Design, 2004. MEMOCODE'04., IEEE, 69-70","author":"Nikhil Rishiyur","year":"2004","unstructured":"Rishiyur Nikhil. 2004. Bluespec System Verilog: efficient, correct RTL from high level specifications. In Proceedings. Second ACM and IEEE International Conference on Formal Methods and Models for Co-Design, 2004. MEMOCODE'04., IEEE, 69-70."},{"key":"e_1_3_2_1_30_1","unstructured":"NVIDIA Corporation. 2024. CUDA C Programming Guide. NVIDIA. https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/ Release 12.6."},{"key":"e_1_3_2_1_31_1","unstructured":"Bowen Pang Kai Li and Feifan Wang. 2025. Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching. 
arXiv:2503.05248 [cs.DC] https:\/\/arxiv.org\/abs\/2503.05248"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA59077.2024.00019"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/HCS61935.2024.10664717"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC42614.2022.9731612"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080256"},{"key":"e_1_3_2_1_36_1","unstructured":"PyTorch Contributors. 2025. CUDAGraph Trees. https:\/\/docs.pytorch.org\/docs\/stable\/torch.compiler_cudagraph_trees.html. PyTorch 2.9 documentation section ''Reasons for Skipping CUDAGraph''. Accessed: 2025-12-03."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406703"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA57654.2024.00016"},{"key":"e_1_3_2_1_39_1","unstructured":"Jon Saad-Falcon Avanika Narayan Hakki Orhun Akengin J. Wes Griffin Herumb Shandilya Adrian Gamarra Lafuente Medhya Goel Rebecca Joseph Shlok Natarajan Etash Kumar Guha Shang Zhu Ben Athiwaratkun John Hennessy Azalia Mirhoseini and Christopher R\u00e9. 2025. Intelligence per Watt: Measuring Intelligence Efficiency of Local AI. arXiv:2511.07885 [cs.DC] https:\/\/arxiv.org\/abs\/2511.07885"},{"key":"e_1_3_2_1_40_1","unstructured":"Noam Shazeer. 2020. GLU Variants Improve Transformer. arXiv:2002.05202 [cs.LG] https:\/\/arxiv.org\/abs\/2002.05202"},{"key":"e_1_3_2_1_41_1","volume-title":"Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. 
arXiv preprint arXiv:1909.08053 (2019)."},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"crossref","unstructured":"Jovan Stojkovic Chaojie Zhang \u00cd\u00f1igo Goiri Josep Torrellas and Esha Choukse. 2024. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. arXiv:2408.00741 [cs.AI] https:\/\/arxiv.org\/abs\/2408.00741","DOI":"10.1109\/HPCA61900.2025.00102"},{"key":"e_1_3_2_1_43_1","volume-title":"Agentic, Reasoning, and Coding","unstructured":"GLM-4.5 Team. 2025. GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models. https:\/\/arxiv.org\/abs\/2508.06471"},{"key":"e_1_3_2_1_44_1","unstructured":"Kimi Team Yifan Bai Yiping Bao Guanduo Chen Jiahao Chen Ningxin Chen Ruijue Chen Yanru Chen Yuankun Chen Yutian Chen et al. 2025a. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534 (2025)."},{"key":"e_1_3_2_1_45_1","unstructured":"Tencent Hunyuan Team Ao Liu Botong Zhou Can Xu Chayse Zhou ChenChen Zhang Chengcheng Xu Chenhao Wang Decheng Wu Dengpeng Wu et al. 2025b. Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought. arXiv preprint arXiv:2505.15431 (2025)."},{"key":"e_1_3_2_1_46_1","volume-title":"Proceedings of the 11th International Conference on Compiler Construction (CC '02)","author":"Thies William","unstructured":"William Thies, Michal Karczmarek, and Saman P. Amarasinghe. 2002. StreamIt: A Language for Streaming Applications. In Proceedings of the 11th International Conference on Compiler Construction (CC '02). Springer-Verlag, Berlin, Heidelberg, 179\u2013196."},{"key":"e_1_3_2_1_47_1","volume-title":"Attention is all you need. Advances in neural information processing systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, Vol. 
30 (2017)."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00039"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00035"},{"key":"e_1_3_2_1_50_1","unstructured":"Peng Wang Shuai Bai Sinan Tan Shijie Wang Zhihao Fan Jinze Bai Keqin Chen Xuejing Liu Jialin Wang Wenbin Ge Yang Fan Kai Dang Mengfei Du Xuancheng Ren Rui Men Dayiheng Liu Chang Zhou Jingren Zhou and Junyang Lin. 2024. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv:2409.12191 [cs.CV] https:\/\/arxiv.org\/abs\/2409.12191"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASP-DAC47756.2020.9045201"},{"key":"e_1_3_2_1_52_1","unstructured":"An Yang Anfeng Li Baosong Yang Beichen Zhang Binyuan Hui Bo Zheng Bowen Yu Chang Gao Chengen Huang Chenxu Lv et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)."},{"key":"e_1_3_2_1_53_1","volume-title":"Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, et al., 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). https:\/\/www.usenix.org\/conference\/osdi22\/presentation\/yu"},{"key":"e_1_3_2_1_54_1","volume-title":"Jongmin Kim, Hyungyo Kim, et al.","author":"Yun Sungmin","year":"2025","unstructured":"Sungmin Yun, Seonyong Park, Hwayong Nam, Younjoo Lee, Gunjun Lee, Kwanhee Kyung, Sangpyo Kim, Nam Sung Kim, Jongmin Kim, Hyungyo Kim, et al., 2025. The new LLM bottleneck: A systems perspective on latent attention and mixture-of-experts. 
arXiv preprint arXiv:2507.15465 (2025)."},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA59077.2024.00046"},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00085"},{"key":"e_1_3_2_1_57_1","volume-title":"Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al.","author":"Zheng Lianmin","year":"2024","unstructured":"Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al., 2024. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems, Vol. 37 (2024), 62557-62583."},{"key":"e_1_3_2_1_58_1","volume-title":"Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation","author":"Zhong Yinmin","year":"2024","unstructured":"Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: disaggregating prefill and decoding for goodput-optimized large language model serving. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (Santa Clara, CA, USA) (OSDI'24). 
USENIX Association, USA, Article 11, 18 pages."}],"event":{"name":"ASPLOS '26: 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems","location":"Pittsburgh PA USA","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems","SIGPLAN ACM Special Interest Group on Programming Languages","SIGARCH ACM Special Interest Group on Computer Architecture","SIGBED ACM Special Interest Group on Embedded Systems"]},"container-title":["Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2"],"original-title":[],"deposited":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T14:05:01Z","timestamp":1773583501000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3779212.3790229"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,22]]},"references-count":58,"alternative-id":["10.1145\/3779212.3790229","10.1145\/3779212"],"URL":"https:\/\/doi.org\/10.1145\/3779212.3790229","relation":{},"subject":[],"published":{"date-parts":[[2026,3,22]]},"assertion":[{"value":"2026-03-22","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}