{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T15:30:33Z","timestamp":1773588633970,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":74,"publisher":"ACM","funder":[{"name":"National Natural Science Foundation of China","award":["92473205"],"award-info":[{"award-number":["92473205"]}]},{"name":"National Key Research and Development Program of China","award":["2023YFB4404400"],"award-info":[{"award-number":["2023YFB4404400"]}]},{"name":"National Natural Science Foundation of China","award":["62222411"],"award-info":[{"award-number":["62222411"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,3,22]]},"DOI":"10.1145\/3779212.3790197","type":"proceedings-article","created":{"date-parts":[[2026,3,10]],"date-time":"2026-03-10T13:55:26Z","timestamp":1773150926000},"page":"1349-1365","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Ouroboros: Wafer-Scale SRAM CIM with Token-Grained Pipelining for Large Language Model Inference"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-8717-068X","authenticated-orcid":false,"given":"Yiqi","family":"Liu","sequence":"first","affiliation":[{"name":"SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China and University of Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-0012-4113","authenticated-orcid":false,"given":"Yudong","family":"Pan","sequence":"additional","affiliation":[{"name":"SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China and University of Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7012-2308","authenticated-orcid":false,"given":"Mengdi","family":"Wang","sequence":"additional","affiliation":[{"name":"SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5175-7025","authenticated-orcid":false,"given":"Shixin","family":"Zhao","sequence":"additional","affiliation":[{"name":"SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China and University of Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5644-3105","authenticated-orcid":false,"given":"Haonan","family":"Zhu","sequence":"additional","affiliation":[{"name":"SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China and Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0904-6681","authenticated-orcid":false,"given":"Yinhe","family":"Han","sequence":"additional","affiliation":[{"name":"SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9711-8758","authenticated-orcid":false,"given":"Lei","family":"Zhang","sequence":"additional","affiliation":[{"name":"SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5172-4736","authenticated-orcid":false,"given":"Ying","family":"Wang","sequence":"additional","affiliation":[{"name":"SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2026,3,22]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-38189-8_18"},{"key":"e_1_3_2_1_2_1","volume-title":"Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369","author":"Agrawal Amey","year":"2023","unstructured":"Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachandran Ramjee. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369, 2023."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41404.2022.00051"},{"key":"e_1_3_2_1_4_1","first-page":"100022","article-title":"A survey on processing-in-memory techniques: Advances and challenges. Memories-Materials, Devices","volume":"4","author":"Asifuzzaman Kazi","year":"2023","unstructured":"Kazi Asifuzzaman, Narasinga Rao Miniskar, Aaron R Young, Frank Liu, and Jeffrey S Vetter. A survey on processing-in-memory techniques: Advances and challenges. Memories-Materials, Devices, Circuits and Systems, 4:100022, 2023.","journal-title":"Circuits and Systems"},{"key":"e_1_3_2_1_5_1","volume-title":"Cacti 7: New tools for interconnect exploration in innovative off-chip memories. ACM Transactions on Architecture and Code Optimization (TACO), 14(2):1-25","author":"Balasubramonian Rajeev","year":"2017","unstructured":"Rajeev Balasubramonian, Andrew B Kahng, Naveen Muralimanohar, Ali Shafiee, and Vaishnav Srinivas. Cacti 7: New tools for interconnect exploration in innovative off-chip memories. ACM Transactions on Architecture and Code Optimization (TACO), 14(2):1-25, 2017."},{"key":"e_1_3_2_1_6_1","volume-title":"Language models are few-shot learners. Advances in neural information processing systems, 33:1877-1901","author":"Brown Tom B.","year":"2020","unstructured":"Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. Advances in neural information processing systems, 33:1877-1901, 2020."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/276304.276336"},{"issue":"240","key":"e_1_3_2_1_8_1","first-page":"1","article-title":"Scaling language modeling with pathways","volume":"24","author":"Chowdhery Aakanksha","year":"2023","unstructured":"Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1-113, 2023.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ECTC32862.2020.00013"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.mejo.2016.04.006"},{"key":"e_1_3_2_1_11_1","volume-title":"Nvidia a100 tensor core gpu architecture","author":"NVIDIA Corporation","year":"2020","unstructured":"NVIDIA Corporation. Nvidia a100 tensor core gpu architecture, 2020. Accessed: 2023-07-25."},{"key":"e_1_3_2_1_12_1","volume-title":"Nvidia h100 tensor core gpu architecture","author":"NVIDIA Corporation","year":"2022","unstructured":"NVIDIA Corporation. Nvidia h100 tensor core gpu architecture, 2022. Accessed: 2023-07-25."},{"key":"e_1_3_2_1_13_1","volume-title":"Nvidia nvlink","author":"NVIDIA Corporation","year":"2024","unstructured":"NVIDIA Corporation. Nvidia nvlink, 2024. Accessed: 2024-07-25."},{"key":"e_1_3_2_1_14_1","unstructured":"Dr. Ian Cutress. 'Better Yield on 5nm than 7nm': TSMC Update on Defect Rates for N5 -- anandtech.com. https:\/\/www.anandtech.com\/show\/16028\/better-yield-on-5nm-than-7nm-tsmc-update-on-defect-rates-for-n5. [Accessed 01-03-2025]."},{"key":"e_1_3_2_1_15_1","volume-title":"Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344-16359","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344-16359, 2022."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3711920"},{"key":"e_1_3_2_1_17_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018."},{"key":"e_1_3_2_1_18_1","volume-title":"Switch-less dragonfly on wafers: A scalable interconnection architecture based on wafer-scale integration. arXiv preprint arXiv:2407.10290","author":"Feng Yinxiao","year":"2024","unstructured":"Yinxiao Feng and Kaisheng Ma. Switch-less dragonfly on wafers: A scalable interconnection architecture based on wafer-scale integration. arXiv preprint arXiv:2407.10290, 2024."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC42614.2022.9731754"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2022.3174101"},{"key":"e_1_3_2_1_21_1","volume-title":"Onnxim: A fast, cycle-level multi-core npu simulator. arXiv preprint arXiv:2406.08051","author":"Ham Hyungkyu","year":"2024","unstructured":"Hyungkyu Ham, Wonhyuk Yang, Yunseon Shin, Okkyun Woo, Guseul Heo, Sangyeop Lee, Jongse Park, and Gwangsun Kim. Onnxim: A fast, cycle-level multi-core npu simulator. arXiv preprint arXiv:2406.08051, 2024."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00035"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00060"},{"key":"e_1_3_2_1_24_1","volume-title":"How nvlink will enable faster, easier multi-gpu computing","author":"Harris Mark","year":"2014","unstructured":"Mark Harris. How nvlink will enable faster, easier multi-gpu computing, 2014. Accessed: 2023-07-25."},{"key":"e_1_3_2_1_25_1","volume-title":"Waferllm: A wafer-scale llm inference system. arXiv e-prints","author":"He Congjie","year":"2025","unstructured":"Congjie He, Yeqi Huang, Pei Mu, Ziming Miao, Jilong Xue, Lingxiao Ma, Fan Yang, and Luo Mai. Waferllm: A wafer-scale llm inference system. arXiv e-prints, pages arXiv-2502, 2025."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO56248.2022.00051"},{"key":"e_1_3_2_1_27_1","volume-title":"devices, and structures","author":"ITRS.","year":"2007","unstructured":"ITRS. International technology roadmap for semiconductors 2007 edition process integration, devices, and structures, 2007. Accessed: 2025-07-21."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3372780.3380846"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSI.2021.3064189"},{"key":"e_1_3_2_1_30_1","first-page":"1","volume-title":"Proceedings of the 39th International Conference on Computer-Aided Design","author":"Jiang Bentian","year":"2020","unstructured":"Bentian Jiang, Jingsong Chen, Jinwei Liu, Lixin Liu, Fangzhou Wang, Xiaopeng Zhang, and Evangeline FY Young. Cu. poker: placing dnns on wafer-scale ai accelerator with optimal kernel sizing. In Proceedings of the 39th International Conference on Computer-Aided Design, pages 1-9, 2020."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2013.6557149"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589350"},{"key":"e_1_3_2_1_33_1","volume-title":"Functional Simulations of Deep Neural Network Accelerators","author":"Kim B.","year":"2021","unstructured":"B. Kim, C. park, T. Lim, and W. Song. NPUsim: Full-System, Cycle-Accurate, Functional Simulations of Deep Neural Network Accelerators, Oct. 2021."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC42613.2021.9365862"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/VLSITechnologyandCir46769.2022.9830438"},{"key":"e_1_3_2_1_37_1","first-page":"155","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Lee Wonbeom","year":"2024","unstructured":"Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 155-172, Santa Clara, CA, July 2024. USENIX Association."},{"key":"e_1_3_2_1_38_1","volume-title":"iserve: An intent-based serving system for llms. arXiv preprint arXiv:2501.13111","author":"Liakopoulos Dimitrios","year":"2025","unstructured":"Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, and Neeraja J Yadwadkar. iserve: An intent-based serving system for llms. arXiv preprint arXiv:2501.13111, 2025."},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/HCS55958.2022.9895479"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3603269.3604869"},{"key":"e_1_3_2_1_41_1","first-page":"36","article-title":"Exploiting the persistence of importance hypothesis for llm kv cache compression at test time","author":"Liu Zichang","year":"2024","unstructured":"Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3466752.3480125"},{"key":"e_1_3_2_1_43_1","volume-title":"Pointer sentinel mixture models","author":"Merity Stephen","year":"2016","unstructured":"Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016."},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/PROC.1964.3442"},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/DAC18074.2021.9586194"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640422"},{"key":"e_1_3_2_1_47_1","volume-title":"Shuaiwen Leon Song, and Michael Taylor. Chiplet cloud: Building ai supercomputers for serving large generative language models. arXiv preprint arXiv:2307.02666","author":"Peng Huwan","year":"2023","unstructured":"Huwan Peng, Scott Davidson, Richard Shi, Shuaiwen Leon Song, and Michael Taylor. Chiplet cloud: Building ai supercomputers for serving large generative language models. arXiv preprint arXiv:2307.02666, 2023."},{"key":"e_1_3_2_1_48_1","first-page":"606","article-title":"Efficiently scaling transformer inference","volume":"5","author":"Pope Reiner","year":"2023","unstructured":"Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5:606-624, 2023.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_49_1","volume-title":"Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1-67","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1-67, 2020."},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3695053.3731055"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.vlsi.2017.02.002"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2021.3108344"},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3716368.3735205"},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/HCS55958.2022.9895534"},{"key":"e_1_3_2_1_55_1","volume-title":"Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, Aur\u00e9lien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023."},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2022.3172600"},{"key":"e_1_3_2_1_57_1","article-title":"Full-digital bitline-transpose cim-based sparse transformer accelerator with pipeline\/parallel reconfigurable modes","author":"Tu Fengbin","year":"2022","unstructured":"Fengbin Tu, Zihan Wu, Yiqi Wang, Ling Liang, Liu Liu, Yufei Ding, Leibo Liu, Shaojun Wei, Yuan Xie, and Shouyi Yin. Trancim: Full-digital bitline-transpose cim-based sparse transformer accelerator with pipeline\/parallel reconfigurable modes. IEEE Journal of Solid-State Circuits, 2022.","journal-title":"IEEE Journal of Solid-State Circuits"},{"key":"e_1_3_2_1_58_1","article-title":"Digital computing-in-memory-based multimodal transformer accelerator with attention-token-bit hybrid sparsity","author":"Tu Fengbin","year":"2023","unstructured":"Fengbin Tu, Zihan Wu, Yiqi Wang, Weiwei Wu, Leibo Liu, Yang Hu, Shaojun Wei, and Shouyi Yin. Multcim: Digital computing-in-memory-based multimodal transformer accelerator with attention-token-bit hybrid sparsity. IEEE Journal of Solid-State Circuits, 2023.","journal-title":"IEEE Journal of Solid-State Circuits"},{"key":"e_1_3_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1002\/(SICI)1096-9128(199704)9:4<255::AID-CPE250>3.0.CO;2-2"},{"key":"e_1_3_2_1_60_1","volume-title":"Attention is all you need. Advances in neural information processing systems, 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017."},{"key":"e_1_3_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00018"},{"key":"e_1_3_2_1_62_1","first-page":"1","volume-title":"Wiley Encyclopedia of Computer Science and Engineering","author":"Wolsey Laurence A","year":"2007","unstructured":"Laurence A Wolsey. Mixed integer programming. Wiley Encyclopedia of Computer Science and Engineering, pages 1-10, 2007."},{"key":"e_1_3_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3695053.3731101"},{"key":"e_1_3_2_1_64_1","volume-title":"Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305","author":"Yang Aiyuan","year":"2023","unstructured":"Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen, Xin Men, Xin Yu, Xuehai Pan, Yanjun Shen, Yiding Wang, Yiyu Li, Youxin Jiang, Yuchen Gao, Yupeng Zhang, Zenan Zhou, and Zhiying Wu. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023."},{"key":"e_1_3_2_1_65_1","volume-title":"5 technical report. arXiv preprint arXiv:2412.15115","author":"Yang An","year":"2024","unstructured":"An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024."},{"key":"e_1_3_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/3695053.3731045"},{"key":"e_1_3_2_1_67_1","first-page":"521","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for {Transformer-Based} generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521-538, 2022."},{"key":"e_1_3_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/3695053.3731016"},{"key":"e_1_3_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626202.3637562"},{"key":"e_1_3_2_1_70_1","first-page":"559","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In Marcos K. Aguilera and Hakim Weatherspoon, editors, 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022, pages 559-578. USENIX Association, 2022."},{"key":"e_1_3_2_1_71_1","first-page":"193","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Zhong Yinmin","year":"2024","unstructured":"Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193-210, 2024."},{"key":"e_1_3_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2022.3170848"},{"key":"e_1_3_2_1_73_1","volume-title":"Theseus: Exploring efficient wafer-scale chip design for large language models. arXiv preprint arXiv:2407.02079","author":"Zhu Jingchen","year":"2024","unstructured":"Jingchen Zhu, Chenhao Xue, Yiqi Chen, Zhao Wang, Chen Zhang, Yu Shen, Yifan Chen, Zekang Cheng, Yu Jiang, Tianqi Wang, Yibo Lin, Wei Hu, Bin Cui, Runsheng Wang, Yun Liang, and Guangyu Sun. Theseus: Exploring efficient wafer-scale chip design for large language models. arXiv preprint arXiv:2407.02079, 2024."},{"key":"e_1_3_2_1_74_1","volume-title":"Yu Cao, Yuan Xie, Huazhong Yang, and Yu Wang. Mnsim 2.0: A behavior-level modeling tool for processing-in-memory architectures","author":"Zhu Zhenhua","year":"2023","unstructured":"Zhenhua Zhu, Hanbo Sun, Tongxin Xie, Yu Zhu, Guohao Dai, Lixue Xia, Dimin Niu, Xiaoming Chen, Xiaobo Sharon Hu, Yu Cao, Yuan Xie, Huazhong Yang, and Yu Wang. Mnsim 2.0: A behavior-level modeling tool for processing-in-memory architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 42(11):4112-4125, 2023."}],"event":{"name":"ASPLOS '26: 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems","location":"Pittsburgh PA USA","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems","SIGPLAN ACM Special Interest Group on Programming Languages","SIGARCH ACM Special Interest Group on Computer Architecture","SIGBED ACM Special Interest Group on Embedded Systems"]},"container-title":["Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2"],"original-title":[],"deposited":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T13:59:36Z","timestamp":1773583176000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3779212.3790197"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,22]]},"references-count":74,"alternative-id":["10.1145\/3779212.3790197","10.1145\/3779212"],"URL":"https:\/\/doi.org\/10.1145\/3779212.3790197","relation":{},"subject":[],"published":{"date-parts":[[2026,3,22]]},"assertion":[{"value":"2026-03-22","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}