{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T16:06:28Z","timestamp":1775837188367,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":80,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,6,21]]},"DOI":"10.1145\/3695053.3731412","type":"proceedings-article","created":{"date-parts":[[2025,6,20]],"date-time":"2025-06-20T16:43:11Z","timestamp":1750437791000},"page":"1731-1745","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":22,"title":["Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-2297-0790","authenticated-orcid":false,"given":"Chenggang","family":"Zhao","sequence":"first","affiliation":[{"name":"DeepSeek-AI, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-4777-2448","authenticated-orcid":false,"given":"Chengqi","family":"Deng","sequence":"additional","affiliation":[{"name":"DeepSeek-AI, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-7896-2558","authenticated-orcid":false,"given":"Chong","family":"Ruan","sequence":"additional","affiliation":[{"name":"DeepSeek-AI, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-9714-7902","authenticated-orcid":false,"given":"Damai","family":"Dai","sequence":"additional","affiliation":[{"name":"DeepSeek-AI, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-8238-440X","authenticated-orcid":false,"given":"Huazuo","family":"Gao","sequence":"additional","affiliation":[{"name":"DeepSeek-AI, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-0709-7692","authenticated-orcid":false,"given":"Jiashi","family":"Li","sequence":"additional","affiliation":[{"name":"DeepSeek-AI, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-8250-1249","authenticated-orcid":false,"given":"Liyue","family":"Zhang","sequence":"additional","affiliation":[{"name":"DeepSeek-AI, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-2875-0184","authenticated-orcid":false,"given":"Panpan","family":"Huang","sequence":"additional","affiliation":[{"name":"DeepSeek-AI, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-9696-1801","authenticated-orcid":false,"given":"Shangyan","family":"Zhou","sequence":"additional","affiliation":[{"name":"DeepSeek-AI, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-1686-407X","authenticated-orcid":false,"given":"Shirong","family":"Ma","sequence":"additional","affiliation":[{"name":"DeepSeek-AI, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-2177-6864","authenticated-orcid":false,"given":"Wenfeng","family":"Liang","sequence":"additional","affiliation":[{"name":"DeepSeek-AI, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-2906-7751","authenticated-orcid":false,"given":"Ying","family":"He","sequence":"additional","affiliation":[{"name":"DeepSeek-AI, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-9282-0408","authenticated-orcid":false,"given":"Yuqing","family":"Wang","sequence":"additional","affiliation":[{"name":"DeepSeek-AI, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-2770-7926","authenticated-orcid":false,"given":"Yuxuan","family":"Liu","sequence":"additional","affiliation":[{"name":"DeepSeek-AI, Beijing, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-3926-923X","authenticated-orcid":false,"given":"Y.X.","family":"Wei","sequence":"additional","affiliation":[{"name":"DeepSeek-AI, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,6,20]]},"reference":[{"key":"e_1_3_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/CCGRID.2017.29"},{"key":"e_1_3_3_1_3_2","doi-asserted-by":"publisher","unstructured":"E. Agostini D. Rossetti and S. Potluri. 2018. GPUDirect Async: Exploring GPU synchronous communication techniques for InfiniBand clusters. J. Parallel and Distrib. Comput. 114 (2018) 28\u201345. 10.1016\/j.jpdc.2017.12.007","DOI":"10.1016\/j.jpdc.2017.12.007"},{"key":"e_1_3_3_1_4_2","unstructured":"AI@Meta. 2024. Llama 3 Model Card. https:\/\/github.com\/meta-llama\/llama3\/blob\/main\/MODEL_CARD.md"},{"key":"e_1_3_3_1_5_2","unstructured":"AI@Meta. 2024. Llama 3.1 Model Card. https:\/\/github.com\/meta-llama\/llama-models\/blob\/main\/models\/llama3_1\/MODEL_CARD.md"},{"key":"e_1_3_3_1_6_2","doi-asserted-by":"crossref","unstructured":"Joshua Ainslie James Lee-Thorp Michiel de Jong Yury Zemlyanskiy Federico Lebr\u00f3n and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2305.13245 (2023).","DOI":"10.18653\/v1\/2023.emnlp-main.298"},{"key":"e_1_3_3_1_7_2","unstructured":"AMD. 2025. AMD Ryzen AI Max+ PRO 395: Designed to power a new generation of compact Copilot+ PC workstations. https:\/\/www.amd.com\/en\/products\/processors\/laptop\/ryzen-pro\/ai-max-pro-300-series\/amd-ryzen-ai-max-plus-pro-395.html"},{"key":"e_1_3_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41406.2024.00089"},{"key":"e_1_3_3_1_9_2","unstructured":"Anthropic. 2024. Claude 3.5 Sonnet. https:\/\/www.anthropic.com\/news\/claude-3-5-sonnet"},{"key":"e_1_3_3_1_10_2","unstructured":"Anthropic. 2025. Claude 3.7 Sonnet and Claude Code. https:\/\/www.anthropic.com\/news\/claude-3-7-sonnet"},{"key":"e_1_3_3_1_11_2","unstructured":"Apple. 2024. Apple introduces M4 Pro and M4 Max. https:\/\/www.apple.com\/newsroom\/2024\/10\/apple-introduces-m4-pro-and-m4-max\/"},{"key":"e_1_3_3_1_12_2","unstructured":"Iz Beltagy Matthew\u00a0E. Peters and Arman Cohan. 2020. Longformer: The Long-Document Transformer. arXiv:https:\/\/arXiv.org\/abs\/2004.05150 (2020)."},{"key":"e_1_3_3_1_13_2","series-title":"(NSDI\u201924)","volume-title":"Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation","author":"Blach Nils","year":"2025","unstructured":"Nils Blach, Maciej Besta, Daniele De\u00a0Sensi, Jens Domke, Hussein Harake, Shigang Li, Patrick Iff, Marek Konieczny, Kartik Lakhotia, Ales Kubicek, Marcel Ferrari, Fabrizio Petrini, and Torsten Hoefler. 2025. A high-performance design, implementation, deployment, and evaluation of the slim fly network. In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (Santa Clara, CA, USA) (NSDI\u201924). USENIX Association, USA, Article 57, 20\u00a0pages."},{"key":"e_1_3_3_1_14_2","unstructured":"Broadcom. 2025. Scale Up Ethernet Framework. https:\/\/docs.broadcom.com\/doc\/scale-up-ethernet-framework"},{"key":"e_1_3_3_1_15_2","volume-title":"Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024","author":"Cai Tianle","year":"2024","unstructured":"Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason\u00a0D. Lee, Deming Chen, and Tri Dao. 2024. 
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net. https:\/\/openreview.net\/forum?id=PEpbUobfJv"},{"key":"e_1_3_3_1_16_2","unstructured":"Shaoyuan Chen Wencong Xiao Yutong Lin Mingxing Zhang Yingdi Shan Jinlei Jiang Kang Chen and Yongwei Wu. 2025. Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation. arxiv:https:\/\/arXiv.org\/abs\/2405.01814\u00a0[cs.LG] https:\/\/arxiv.org\/abs\/2405.01814"},{"key":"e_1_3_3_1_17_2","unstructured":"ULTRA ACCELERATOR\u00a0LINK CONSORTIUM. 2025. Introducing UALink 200G 1.0 Specification. https:\/\/ualinkconsortium.org\/wp-content\/uploads\/2025\/04\/UALink-1.0-White_Paper_FINAL.pdf"},{"key":"e_1_3_3_1_18_2","unstructured":"Ultra\u00a0Ethernet Consortium. 2023. Overview of and Motivation for the Forthcoming Ultra Ethernet Consortium Specification. https:\/\/ultraethernet.org\/wp-content\/uploads\/sites\/20\/2023\/10\/23.07.12-UEC-1.0-Overview-FINAL-WITH-LOGO.pdf"},{"key":"e_1_3_3_1_19_2","unstructured":"Ultra\u00a0Ethernet Consortium. 2024. UEC Progresses Towards v1.0 Set of Specifications. https:\/\/ultraethernet.org\/uec-progresses-towards-v1-0-set-of-specifications\/"},{"key":"e_1_3_3_1_20_2","unstructured":"Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning."},{"key":"e_1_3_3_1_21_2","volume-title":"Advances in Neural Information Processing Systems","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Daniel\u00a0Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems."},{"key":"e_1_3_3_1_22_2","series-title":"(ICML\u201924)","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"Dao Tri","year":"2024","unstructured":"Tri Dao and Albert Gu. 2024. Transformers are SSMs: generalized models and efficient algorithms through structured state space duality. In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria) (ICML\u201924). JMLR.org, Article 399, 31\u00a0pages."},{"key":"e_1_3_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00039"},{"key":"e_1_3_3_1_24_2","doi-asserted-by":"publisher","unstructured":"DeepSeek-AI. 2024. DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence. CoRR abs\/2406.11931 (2024). 10.48550\/arXiv.2406.11931","DOI":"10.48550\/arXiv.2406.11931"},{"key":"e_1_3_3_1_25_2","doi-asserted-by":"publisher","unstructured":"DeepSeek-AI. 2024. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. CoRR abs\/2401.02954 (2024). 10.48550\/arXiv.2401.02954","DOI":"10.48550\/arXiv.2401.02954"},{"key":"e_1_3_3_1_26_2","doi-asserted-by":"publisher","unstructured":"DeepSeek-AI. 2024. DeepSeek-V2: A Strong Economical and Efficient Mixture-of-Experts Language Model. CoRR abs\/2405.04434 (2024). 10.48550\/arXiv.2405.04434","DOI":"10.48550\/arXiv.2405.04434"},{"key":"e_1_3_3_1_27_2","unstructured":"DeepSeek-AI. 2024. DeepSeek-V3 Technical Report. (2024). arxiv:https:\/\/arXiv.org\/abs\/2412.19437\u00a0[cs.CL] https:\/\/arxiv.org\/abs\/2412.19437"},{"key":"e_1_3_3_1_28_2","doi-asserted-by":"publisher","unstructured":"DeepSeek-AI. 2024. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. 
CoRR abs\/2401.06066 (2024). 10.48550\/arXiv.2401.06066","DOI":"10.48550\/arXiv.2401.06066"},{"key":"e_1_3_3_1_29_2","unstructured":"DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arxiv:https:\/\/arXiv.org\/abs\/2501.12948\u00a0[cs.CL] https:\/\/arxiv.org\/abs\/2501.12948"},{"key":"e_1_3_3_1_30_2","unstructured":"DeepSeek-AI. 2025. DualPipe: A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3\/R1 training. https:\/\/github.com\/deepseek-ai\/dualpipe."},{"key":"e_1_3_3_1_31_2","unstructured":"DeepSeek-AI. 2025. Fire-Flyer File System. https:\/\/github.com\/deepseek-ai\/3FS"},{"key":"e_1_3_3_1_32_2","unstructured":"DeepSeek-AI. 2025. Profiling Data in DeepSeek Infra. https:\/\/github.com\/deepseek-ai\/profile-data?tab=readme-ov-file#inference"},{"key":"e_1_3_3_1_33_2","unstructured":"Elias Frantar Saleh Ashkboos Torsten Hoefler and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2210.17323 (2022)."},{"key":"e_1_3_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3651890.3672233"},{"key":"e_1_3_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/HOTI.2008.21"},{"key":"e_1_3_3_1_36_2","doi-asserted-by":"publisher","unstructured":"Amir Gholami Zhewei Yao Sehoon Kim Coleman Hooper Michael\u00a0W. Mahoney and Kurt Keutzer. 2024. AI and Memory Wall. IEEE Micro 44 03 (May 2024) 33\u201339. 10.1109\/MM.2024.3373763","DOI":"10.1109\/MM.2024.3373763"},{"key":"e_1_3_3_1_37_2","volume-title":"Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024","author":"Gloeckle Fabian","year":"2024","unstructured":"Fabian Gloeckle, Badr\u00a0Youbi Idrissi, Baptiste Rozi\u00e8re, David Lopez-Paz, and Gabriel Synnaeve. 2024. Better & Faster Large Language Models via Multi-token Prediction. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net. https:\/\/openreview.net\/forum?id=pEWAcejiU2"},{"key":"e_1_3_3_1_38_2","unstructured":"Google. 2024. Introducing Gemini 2.0: our new AI model for the agentic era. https:\/\/blog.google\/technology\/google-deepmind\/google-gemini-ai-update-december-2024"},{"key":"e_1_3_3_1_39_2","unstructured":"Google. 2025. Gemini 2.5: Our most intelligent AI model. https:\/\/blog.google\/technology\/google-deepmind\/gemini-model-thinking-updates-march-2025\/"},{"key":"e_1_3_3_1_40_2","unstructured":"MADSys group and Approaching.AI. 2025. A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations. https:\/\/github.com\/kvcache-ai\/ktransformers"},{"key":"e_1_3_3_1_41_2","unstructured":"Coleman Hooper Sehoon Kim Hiva Mohammadzadeh Michael\u00a0W Mahoney Yakun\u00a0Sophia Shao Kurt Keutzer and Amir Gholami. 2024. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2401.18079 (2024)."},{"key":"e_1_3_3_1_42_2","unstructured":"Albert\u00a0Q Jiang Alexandre Sablayrolles Arthur Mensch Chris Bamford Devendra\u00a0Singh Chaplot Diego de\u00a0las Casas Florian Bressand Gianna Lengyel Guillaume Lample Lucile Saulnier et\u00a0al. 2023. Mistral 7B. 
arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2310.06825 (2023)."},{"key":"e_1_3_3_1_43_2","unstructured":"Ziheng Jiang Haibin Lin Yinmin Zhong Qi Huang Yangrui Chen Zhi Zhang Yanghua Peng Xiang Li Cong Xie Shibiao Nong Yulu Jia Sun He Hongmin Chen Zhihao Bai Qi Hou Shipeng Yan Ding Zhou Yiyao Sheng Zhuo Jiang Haohan Xu Haoran Wei Zhang Zhang Pengfei Nie Leqi Zou Sida Zhao Liang Xiang Zherui Liu Zhe Li Xiaoying Jia Jianxi Ye Xin Jin and Xin Liu. 2024. MegaScale: Scaling Large Language Model Training to More Than 10 000 GPUs. http:\/\/arxiv.org\/abs\/2402.15627 arXiv:https:\/\/arXiv.org\/abs\/2402.15627 [cs]."},{"key":"e_1_3_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589350"},{"key":"e_1_3_3_1_45_2","unstructured":"Hao Kang Qingru Zhang Souvik Kundu Geonhwa Jeong Zaoxing Liu Tushar Krishna and Tuo Zhao. 2024. GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM. arxiv:https:\/\/arXiv.org\/abs\/2403.05527\u00a0[cs.LG]"},{"key":"e_1_3_3_1_46_2","unstructured":"Jared Kaplan Sam McCandlish Tom Henighan Tom\u00a0B. Brown Benjamin Chess Rewon Child Scott Gray Alec Radford Jeffrey Wu and Dario Amodei. 2020. Scaling Laws for Neural Language Models. CoRR abs\/2001.08361 (2020). arXiv:https:\/\/arXiv.org\/abs\/2001.08361https:\/\/arxiv.org\/abs\/2001.08361"},{"key":"e_1_3_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2008.19"},{"key":"e_1_3_3_1_48_2","unstructured":"Vijay\u00a0Anand Korthikanti Jared Casper Sangkug Lym Lawrence McAfee Michael Andersch Mohammad Shoeybi and Bryan Catanzaro. 2023. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems 5 (2023)."},{"key":"e_1_3_3_1_49_2","volume-title":"Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024","author":"Li Yuhui","year":"2024","unstructured":"Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net. https:\/\/openreview.net\/forum?id=1NdN7eXyb4"},{"key":"e_1_3_3_1_50_2","unstructured":"Heng Liao Bingyang Liu Xianping Chen Zhigang Guo Chuanning Cheng Jianbing Wang Xiangyu Chen Peng Dong Rui Meng Wenjie Liu Zhe Zhou Ziyang Zhang Yuhang Gai Cunle Qian Yi Xiong Zhongwu Cheng Jing Xia Yuli Ma Xi Chen Wenhua Du Shizhong Xiao Chungang Li Yong Qin Liudong Xiong Zhou Yu Lv Chen Lei Chen Buyun Wang Pei Wu Junen Gao Xiaochu Li Jian He Shizhuan Yan and Bill McColl. 2025. UB-Mesh: a Hierarchically Localized nD-FullMesh Datacenter Network Architecture. arxiv:https:\/\/arXiv.org\/abs\/2503.20377\u00a0[cs.AR] https:\/\/arxiv.org\/abs\/2503.20377"},{"key":"e_1_3_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/HCS55958.2022.9895479"},{"key":"e_1_3_3_1_52_2","volume-title":"MLSys","author":"Lin Ji","year":"2024","unstructured":"Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. In MLSys."},{"key":"e_1_3_3_1_53_2","unstructured":"Zirui Liu Jiayi Yuan Hongye Jin Shaochen Zhong Zhaozhuo Xu Vladimir Braverman Beidi Chen and Xia Hu. 2024. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. 
arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2402.02750 (2024)."},{"key":"e_1_3_3_1_54_2","unstructured":"Junyu Luo Weizhi Zhang Ye Yuan Yusheng Zhao Junwei Yang Yiyang Gu Bohan Wu Binqi Chen Ziyue Qiao Qingqing Long Rongcheng Tu Xiao Luo Wei Ju Zhiping Xiao Yifan Wang Meng Xiao Chenwu Liu Jingyang Yuan Shichang Zhang Yiqiao Jin Fan Zhang Xian Wu Hanqing Zhao Dacheng Tao Philip\u00a0S. Yu and Ming Zhang. 2025. Large Language Model Agent: A Survey on Methodology Applications and Challenges. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2503.21460 (2025)."},{"key":"e_1_3_3_1_55_2","unstructured":"Karthik Mandakolathur and Sylvain Jeaugey. 2022. Doubling all2all Performance with NVIDIA Collective Communication Library 2.12. https:\/\/developer.nvidia.com\/blog\/doubling-all2all-performance-with-nvidia-collective-communication-library-2-12\/"},{"key":"e_1_3_3_1_56_2","unstructured":"Mistral. 2024. Cheaper Better Faster Stronger: Continuing to push the frontier of AI and making it accessible to all. https:\/\/mistral.ai\/news\/mixtral-8x22b"},{"key":"e_1_3_3_1_57_2","unstructured":"Dheevatsa Mudigere Yuchen Hao Jianyu Huang Zhihao Jia Andrew Tulloch Srinivas Sridharan Xing Liu Mustafa Ozdal Jade Nie Jongsoo Park Liang Luo Jie\u00a0Amy Yang Leon Gao Dmytro Ivchenko Aarti Basant Yuxi Hu Jiyan Yang Ehsan\u00a0K. Ardestani Xiaodong Wang Rakesh Komuravelli Ching-Hsiang Chu Serhat Yilmaz Huayu Li Jiyuan Qian Zhuobo Feng Yinbin Ma Junjie Yang Ellie Wen Hong Li Lin Yang Chonglin Sun Whitney Zhao Dimitry Melts Krishna Dhulipala K.\u00a0R. Kishore Tyler Graf Assaf Eisenman Kiran\u00a0Kumar Matam Adi Gangidi Guoqiang\u00a0Jerry Chen Manoj Krishnan Avinash Nayak Krishnakumar Nair Bharath Muthiah Mahmoud khorashadi Pallab Bhattacharya Petr Lapukhov Maxim Naumov Ajit Mathews Lin Qiao Mikhail Smelyanskiy Bill Jia and Vijay Rao. 2023. Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models. http:\/\/arxiv.org\/abs\/2104.05158 arXiv:https:\/\/arXiv.org\/abs\/2104.05158 [cs]."},{"key":"e_1_3_3_1_58_2","unstructured":"NVIDIA. 2022. Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async. https:\/\/developer.nvidia.com\/blog\/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async\/"},{"key":"e_1_3_3_1_59_2","unstructured":"NVIDIA. 2025. NVIDIA DGX Spark: A Grace Blackwell AI supercomputer on your desk.https:\/\/www.nvidia.com\/en-us\/products\/workstations\/dgx-spark\/"},{"key":"e_1_3_3_1_60_2","unstructured":"OpenAI. 2024. Hello GPT-4o. https:\/\/openai.com\/index\/hello-gpt-4o\/"},{"key":"e_1_3_3_1_61_2","unstructured":"OpenAI. 2024. Introducing OpenAI o1. https:\/\/openai.com\/o1\/"},{"key":"e_1_3_3_1_62_2","unstructured":"OpenAI. 2025. Introducing OpenAI o3 and o4-mini. https:\/\/openai.com\/index\/introducing-o3-and-o4-mini\/."},{"key":"e_1_3_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1145\/3651890.3672265"},{"key":"e_1_3_3_1_64_2","series-title":"(ICML\u201924)","volume-title":"Proceedings of the 41st International Conference on Machine Learning","author":"Qin Zhen","year":"2024","unstructured":"Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. 2024. Various lengths, constant speed: efficient language modeling with lightning attention. In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria) (ICML\u201924). 
JMLR.org, Article 1688, 19\u00a0pages."},{"key":"e_1_3_3_1_65_2","unstructured":"Rafael Rafailov Archit Sharma Eric Mitchell Stefano Ermon Christopher\u00a0D. Manning and Chelsea Finn. 2024. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arxiv:https:\/\/arXiv.org\/abs\/2305.18290\u00a0[cs.LG] https:\/\/arxiv.org\/abs\/2305.18290"},{"key":"e_1_3_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356208"},{"key":"e_1_3_3_1_67_2","unstructured":"Bita\u00a0Darvish Rouhani Ritchie Zhao Ankit More Mathew Hall Alireza Khodamoradi Summer Deng Dhruv Choudhary Marius Cornea Eric Dellinger Kristof Denolf Stosic Dusan Venmugil Elango Maximilian Golub Alexander Heinecke Phil James-Roxby Dharmesh Jani Gaurav Kolhe Martin Langhammer Ada Li Levi Melnick Maral Mesmakhosroshahi Andres Rodriguez Michael Schulte Rasoul Shafipour Lei Shao Michael Siu Pradeep Dubey Paulius Micikevicius Maxim Naumov Colin Verrilli Ralph Wittig Doug Burger and Eric Chung. 2023. Microscaling Data Formats for Deep Learning. arxiv:https:\/\/arXiv.org\/abs\/2310.10537\u00a0[cs.LG] https:\/\/arxiv.org\/abs\/2310.10537"},{"key":"e_1_3_3_1_68_2","unstructured":"John Schulman Filip Wolski Prafulla Dhariwal Alec Radford and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. arxiv:https:\/\/arXiv.org\/abs\/1707.06347\u00a0[cs.LG] https:\/\/arxiv.org\/abs\/1707.06347"},{"key":"e_1_3_3_1_69_2","unstructured":"ByteDance Seed. 2025. Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning. arxiv:https:\/\/arXiv.org\/abs\/2504.13914\u00a0[cs.CL] https:\/\/arxiv.org\/abs\/2504.13914"},{"key":"e_1_3_3_1_70_2","unstructured":"Zhihong Shao Peiyi Wang Qihao Zhu Runxin Xu Junxiao Song Xiao Bi Haowei Zhang Mingchuan Zhang Y.\u00a0K. Li Y. Wu and Daya Guo. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arxiv:https:\/\/arXiv.org\/abs\/2402.03300\u00a0[cs.CL] https:\/\/arxiv.org\/abs\/2402.03300"},{"key":"e_1_3_3_1_71_2","unstructured":"Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head is All You Need. CoRR abs\/1911.02150 (2019). http:\/\/arxiv.org\/abs\/1911.02150"},{"key":"e_1_3_3_1_72_2","unstructured":"Qwen Team. 2025. Qwen3: Think Deeper Act Faster. https:\/\/github.com\/QwenLM\/Qwen3"},{"key":"e_1_3_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.23919\/VLSITechnologyandCir57934.2023.10185427"},{"key":"e_1_3_3_1_74_2","unstructured":"xAI. 2024. Grok-2 Beta Release. https:\/\/x.ai\/news\/grok-2."},{"key":"e_1_3_3_1_75_2","unstructured":"xAI. 2024. Our Gigafactory of Compute:Colossus. https:\/\/x.ai\/colossus."},{"key":"e_1_3_3_1_76_2","unstructured":"An Yang Baosong Yang Beichen Zhang Binyuan Hui Bo Zheng Bowen Yu Chengyuan Li Dayiheng Liu Fei Huang Haoran Wei Huan Lin Jian Yang Jianhong Tu Jianwei Zhang Jianxin Yang Jiaxi Yang Jingren Zhou Junyang Lin Kai Dang Keming Lu Keqin Bao Kexin Yang Le Yu Mei Li Mingfeng Xue Pei Zhang Qin Zhu Rui Men Runji Lin Tianhao Li Tianyi Tang Tingyu Xia Xingzhang Ren Xuancheng Ren Yang Fan Yang Su Yichang Zhang Yu Wan Yuqiong Liu Zeyu Cui Zhenru Zhang and Zihan Qiu. 2024. Qwen2.5 Technical Report. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2412.15115 (2024)."},{"key":"e_1_3_3_1_77_2","unstructured":"Jingyang Yuan Huazuo Gao Damai Dai Junyu Luo Liang Zhao Zhengyan Zhang Zhenda Xie Y.\u00a0X. Wei Lean Wang Zhiping Xiao Yuqing Wang Chong Ruan Ming Zhang Wenfeng Liang and Wangding Zeng. 2025. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. 
https:\/\/arxiv.org\/abs\/2502.11089"},{"key":"e_1_3_3_1_78_2","unstructured":"Chenggang Zhao Liang Zhao Jiashi Li and Zhean Xu. 2025. DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling. https:\/\/github.com\/deepseek-ai\/DeepGEMM."},{"key":"e_1_3_3_1_79_2","unstructured":"Chenggang Zhao Shangyan Zhou Liyue Zhang Chengqi Deng Zhean Xu Yuxuan Liu Kuai Yu Jiashi Li and Liang Zhao. 2025. DeepEP: an efficient expert-parallel communication library. https:\/\/github.com\/deepseek-ai\/DeepEP."},{"key":"e_1_3_3_1_80_2","unstructured":"Size Zheng Jin Fang Xuegui Zheng Qi Hou Wenlei Bao Ningxin Zheng Ziheng Jiang Dongyang Wang Jianxi Ye Haibin Lin Li-Wen Chang and Xin Liu. 2025. TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives. arxiv:https:\/\/arXiv.org\/abs\/2503.20313\u00a0[cs.DC] https:\/\/arxiv.org\/abs\/2503.20313"},{"key":"e_1_3_3_1_81_2","first-page":"193","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Zhong Yinmin","year":"2024","unstructured":"Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 193\u2013210. https:\/\/www.usenix.org\/conference\/osdi24\/presentation\/zhong-yinmin"}],"event":{"name":"ISCA '25: Proceedings of the 52nd Annual International Symposium on Computer Architecture","location":"Tokyo Japan","acronym":"SIGARCH '25","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture"]},"container-title":["Proceedings of the 52nd Annual International Symposium on Computer Architecture"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3695053.3731412","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,21]],"date-time":"2025-06-21T11:02:43Z","timestamp":1750503763000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3695053.3731412"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,20]]},"references-count":80,"alternative-id":["10.1145\/3695053.3731412","10.1145\/3695053"],"URL":"https:\/\/doi.org\/10.1145\/3695053.3731412","relation":{},"subject":[],"published":{"date-parts":[[2025,6,20]]},"assertion":[{"value":"2025-06-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
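The block above is the Crossref REST API response ("message-type": "work") for DOI 10.1145/3695053.3731412. As a minimal sketch of how such a record can be consumed, the Python snippet below pulls the title, authors, pages, and reference information out of the "message" object. It assumes the payload has been saved locally as crossref_record.json (a hypothetical filename chosen here for illustration); the same JSON is served by the public Crossref endpoint https://api.crossref.org/works/10.1145/3695053.3731412 if re-fetching is preferred.

import json

# Minimal sketch: parse a Crossref "work" record like the one above.
# Assumes the JSON has been saved locally as "crossref_record.json"
# (hypothetical filename); the identical payload is available from
# https://api.crossref.org/works/10.1145/3695053.3731412.

with open("crossref_record.json", encoding="utf-8") as f:
    record = json.load(f)

msg = record["message"]                  # the work metadata itself

title = msg["title"][0]                  # "Insights into DeepSeek-V3: ..."
doi = msg["DOI"]                         # "10.1145/3695053.3731412"
venue = msg["container-title"][0]        # ISCA '25 proceedings volume
pages = msg.get("page")                  # "1731-1745"
n_refs = msg.get("references-count")     # 80

# Authors are listed as {"given": ..., "family": ..., "affiliation": [...]}.
authors = [f'{a["given"]} {a["family"]}' for a in msg.get("author", [])]

# Reference entries carry a "DOI" field when Crossref resolved one,
# otherwise only a "key" and an "unstructured" citation string.
ref_dois = [r["DOI"] for r in msg.get("reference", []) if "DOI" in r]

print(f"{title}\n{venue}, pp. {pages}\nDOI: {doi}")
print(f"{len(authors)} authors; {n_refs} references ({len(ref_dois)} with DOIs)")

Run against the record shown here, this prints the paper title, the ISCA '25 proceedings name and page range, the 15-author list count, and how many of the 80 references carry resolvable DOIs.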