{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T23:03:23Z","timestamp":1768345403660,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":62,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,11,19]]},"DOI":"10.1145\/3772052.3772270","type":"proceedings-article","created":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T16:19:00Z","timestamp":1768321140000},"page":"687-694","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Understanding GPU Resource Interference One Level Deeper"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-6025-844X","authenticated-orcid":false,"given":"Paul","family":"Elvinger","sequence":"first","affiliation":[{"name":"ETH Zurich, Zurich, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3364-2109","authenticated-orcid":false,"given":"Foteini","family":"Strati","sequence":"additional","affiliation":[{"name":"ETH Zurich, Zurich, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0526-2080","authenticated-orcid":false,"given":"Natalie Enright","family":"Jerger","sequence":"additional","affiliation":[{"name":"University of Toronto, Toronto, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8559-0529","authenticated-orcid":false,"given":"Ana","family":"Klimovic","sequence":"additional","affiliation":[{"name":"ETH Zurich, Zurich, 
Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,1,13]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"https:\/\/www.olcf.ornl.gov\/wp-content\/uploads\/2019\/10\/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf","author":"AMD.","year":"2019","unstructured":"AMD. Amd gpu hardware basics. https:\/\/www.olcf.ornl.gov\/wp-content\/uploads\/2019\/10\/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf, 2019."},{"key":"e_1_3_2_1_2_1","volume-title":"https:\/\/rocm.docs.amd.com\/projects\/omniperf\/en\/docs-6.2.0\/","author":"AMD.","year":"2024","unstructured":"AMD. Omniperf documentation. https:\/\/rocm.docs.amd.com\/projects\/omniperf\/en\/docs-6.2.0\/, 2024."},{"key":"e_1_3_2_1_3_1","first-page":"224","volume-title":"Proceedings of the 25th International Middleware Conference, Middleware '24","author":"Bhasi Vivek M.","year":"2024","unstructured":"Vivek M. Bhasi, Aakash Sharma, Rishabh Jain, Jashwant Raj Gunasekaran, Ashutosh Pattnaik, Mahmut Taylan Kandemir, and Chita Das. Towards SLO-compliant and cost-effective serverless computing on emerging gpu architectures. In Proceedings of the 25th International Middleware Conference, Middleware '24, page 211\u2013224, New York, NY, USA, 2024. Association for Computing Machinery."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620678.3624660"},{"key":"e_1_3_2_1_5_1","volume-title":"Lithos: An operating system for efficient machine learning on gpus","author":"Coppock Patrick H.","year":"2025","unstructured":"Patrick H. Coppock, Brian Zhang, Eliot H. Solomon, Vasilis Kypriotis, Leon Yang, Bikash Sharma, Dan Schatzberg, Todd C. Mowry, and Dimitrios Skarlatos. 
Lithos: An operating system for efficient machine learning on gpus, 2025."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2499368.2451125"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2644865.2541941"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3597503.3639232"},{"key":"e_1_3_2_1_9_1","first-page":"706","volume-title":"2022 USENIX Annual Technical Conference (USENIX ATC 22)","author":"Graur Dan","year":"2022","unstructured":"Dan Graur, Damien Aymon, Dan Kluser, Tanguy Albrici, Chandramohan A. Thekkath, and Ana Klimovic. Cachew: Machine learning input data processing as a service. In 2022 USENIX Annual Technical Conference (USENIX ATC 22), pages 689\u2013706, Carlsbad, CA, July 2022. USENIX Association."},{"key":"e_1_3_2_1_10_1","first-page":"462","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Gujarati Arpan","unstructured":"Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. Serving DNNs like clockwork: Performance predictability from the bottom up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 443\u2013462. USENIX Association, November 2020."},{"key":"e_1_3_2_1_11_1","first-page":"558","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Han Mingcong","year":"2022","unstructured":"Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 539\u2013558, Carlsbad, CA, July 2022. 
USENIX Association."},{"key":"e_1_3_2_1_12_1","volume-title":"Uellm: A unified and efficient approach for llm inference serving","author":"He Yiyuan","year":"2024","unstructured":"Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, and Chengzhong Xu. Uellm: A unified and efficient approach for llm inference serving, 2024."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/RTAS.2019.00011"},{"key":"e_1_3_2_1_14_1","volume-title":"Pod-attention: Unlocking full prefill-decode overlap for faster llm inference","author":"Kamath Aditya K","year":"2024","unstructured":"Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, and Ashish Panwar. Pod-attention: Unlocking full prefill-decode overlap for faster llm inference, 2024."},{"key":"e_1_3_2_1_15_1","first-page":"175","volume-title":"2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Lim Gangmuk","unstructured":"Gangmuk Lim, Jeongseob Ahn, Wencong Xiao, Youngjin Kwon, and Myeongjae Jeon. Zico: Efficient GPU memory sharing for concurrent DNN training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 161\u2013175. USENIX Association, July 2021."},{"key":"e_1_3_2_1_16_1","volume-title":"https:\/\/huggingface.co\/meta-llama\/Llama-3.1-8B-Instruct","year":"2024","unstructured":"Meta. Llama-3.1 8b instruct. https:\/\/huggingface.co\/meta-llama\/Llama-3.1-8B-Instruct, 2024."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613163"},{"key":"e_1_3_2_1_18_1","volume-title":"Gpu pro tip: Cuda 7 streams simplify concurrency. https:\/\/developer. nvidia.com\/blog\/gpu-pro-tip-cuda-7-streams-simplify-concurrency\/","author":"NVIDIA.","year":"2015","unstructured":"NVIDIA. Gpu pro tip: Cuda 7 streams simplify concurrency. https:\/\/developer. nvidia.com\/blog\/gpu-pro-tip-cuda-7-streams-simplify-concurrency\/, 2015."},{"key":"e_1_3_2_1_19_1","volume-title":"Nvidia multi-process service. 
https:\/\/docs.nvidia.com\/deploy\/mps\/index.html","author":"NVIDIA.","year":"2015","unstructured":"NVIDIA. Nvidia multi-process service. https:\/\/docs.nvidia.com\/deploy\/mps\/index.html, 2015."},{"key":"e_1_3_2_1_20_1","volume-title":"Question about sp and sm. https:\/\/forums.developer.nvidia.com\/t\/questions- about- sp- and- sm\/76700\/6","author":"NVIDIA.","year":"2019","unstructured":"NVIDIA. Question about sp and sm. https:\/\/forums.developer.nvidia.com\/t\/questions- about- sp- and- sm\/76700\/6, 2019."},{"key":"e_1_3_2_1_21_1","volume-title":"Nsight compute, l2 hit rate. https:\/\/forums.developer.nvidia.com\/t-\/l2-hit-rate-always-at-100\/257458\/2?u=elpaul","author":"NVIDIA.","year":"2023","unstructured":"NVIDIA. Nsight compute, l2 hit rate. https:\/\/forums.developer.nvidia.com\/t-\/l2-hit-rate-always-at-100\/257458\/2?u=elpaul, 2023."},{"key":"e_1_3_2_1_22_1","volume-title":"Cuda c++programming guide. https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/","author":"NVIDIA.","year":"2024","unstructured":"NVIDIA. Cuda c++programming guide. https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/, 2024."},{"key":"e_1_3_2_1_23_1","volume-title":"Cuda shared memory configuration. https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html#shared-memory-7-x","author":"NVIDIA.","year":"2024","unstructured":"NVIDIA. Cuda shared memory configuration. https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html#shared-memory-7-x, 2024."},{"key":"e_1_3_2_1_24_1","unstructured":"NVIDIA. Is there a document about in which hardware unit(ie. alu fmu\u2026) an instruction is executed? https:\/\/forums.developer.nvidia.com\/t\/is-there-a-document-about-in-which-hardware-unit-ie-alu-fmu-an- instruction-is-executed\/227475 2024."},{"key":"e_1_3_2_1_25_1","volume-title":"Mapping of pipelines to functional units. 
https:\/\/forums.developer.nvidia.com\/t\/mapping-of-pipelines-to-functional-units\/315200","author":"NVIDIA.","year":"2024","unstructured":"NVIDIA. Mapping of pipelines to functional units. https:\/\/forums.developer.nvidia.com\/t\/mapping-of-pipelines-to-functional-units\/315200, 2024."},{"key":"e_1_3_2_1_26_1","volume-title":"Nsight compute. https:\/\/developer.nvidia.com\/nsight- compute","author":"NVIDIA.","year":"2024","unstructured":"NVIDIA. Nsight compute. https:\/\/developer.nvidia.com\/nsight- compute, 2024."},{"key":"e_1_3_2_1_27_1","volume-title":"Nsight systems. https:\/\/developer.nvidia.com\/nsight- systems","author":"NVIDIA.","year":"2024","unstructured":"NVIDIA. Nsight systems. https:\/\/developer.nvidia.com\/nsight- systems, 2024."},{"key":"e_1_3_2_1_28_1","volume-title":"Nvidia double precision intrinsics. https:\/\/docs.nvidia.com\/cuda\/cuda-math-api\/cuda_math_api\/group__CUDA__MATH__INTRINSIC__DOUBLE.html","author":"NVIDIA.","year":"2024","unstructured":"NVIDIA. Nvidia double precision intrinsics. https:\/\/docs.nvidia.com\/cuda\/cuda-math-api\/cuda_math_api\/group__CUDA__MATH__INTRINSIC__DOUBLE.html, 2024."},{"key":"e_1_3_2_1_29_1","volume-title":"Nvidia management library (nvml). https:\/\/developer.nvidia.com\/management-library-nvml","author":"NVIDIA.","year":"2024","unstructured":"NVIDIA. Nvidia management library (nvml). https:\/\/developer.nvidia.com\/management-library-nvml, 2024."},{"key":"e_1_3_2_1_30_1","volume-title":"nvidia-smi. https:\/\/docs.nvidia.com\/deploy\/nvidia-smi\/index.html","author":"NVIDIA.","year":"2024","unstructured":"NVIDIA. nvidia-smi. https:\/\/docs.nvidia.com\/deploy\/nvidia-smi\/index.html, 2024."},{"key":"e_1_3_2_1_31_1","volume-title":"Roofline charts. https:\/\/docs.nvidia.com\/nsight-compute\/ProfilingGuide\/index.html#roofline-charts","author":"NVIDIA.","year":"2024","unstructured":"NVIDIA. Roofline charts. 
https:\/\/docs.nvidia.com\/nsight-compute\/ProfilingGuide\/index.html#roofline-charts, 2024."},{"key":"e_1_3_2_1_32_1","volume-title":"Separate cuda core pipeline for fp16 and fp32? https:\/\/forums.developer.nvidia.com\/t\/separate-cuda-core-pipeline-for-fp16-and-fp32\/302018","author":"NVIDIA.","year":"2024","unstructured":"NVIDIA. Separate cuda core pipeline for fp16 and fp32? https:\/\/forums.developer.nvidia.com\/t\/separate-cuda-core-pipeline-for-fp16-and-fp32\/302018, 2024."},{"key":"e_1_3_2_1_33_1","unstructured":"NVIDIA. What does achieved active warps per sm in nsight means and how to calculate it? https:\/\/forums.developer.nvidia.com\/t\/what-does-achieved-active-warps-per-sm-in-nsight-means-and-how-to-calculate-it\/128256l 2024."},{"key":"e_1_3_2_1_34_1","volume-title":"what is ipc(instructions per cycle)? https:\/\/forums.developer.nvidia.com\/t\/what-is-ipc-instructions-per-cycle\/66138","author":"NVIDIA.","year":"2024","unstructured":"NVIDIA. what is ipc(instructions per cycle)? https:\/\/forums.developer.nvidia.com\/t\/what-is-ipc-instructions-per-cycle\/66138, 2024."},{"key":"e_1_3_2_1_35_1","volume-title":"Achieved occupancy. https:\/\/docs.nvidia.com\/gameworks\/content\/developertools\/desktop\/analysis\/report\/cudaexperiments\/kernellevel\/achievedoccupancy.htm","author":"NVIDIA.","year":"2025","unstructured":"NVIDIA. Achieved occupancy. https:\/\/docs.nvidia.com\/gameworks\/content\/developertools\/desktop\/analysis\/report\/cudaexperiments\/kernellevel\/achievedoccupancy.htm, 2025."},{"key":"e_1_3_2_1_36_1","volume-title":"Cuda green context. https:\/\/docs.nvidia.com\/cuda\/cuda-driver-api\/group__CUDA__GREEN__CONTEXTS.html","author":"NVIDIA.","year":"2025","unstructured":"NVIDIA. Cuda green context. https:\/\/docs.nvidia.com\/cuda\/cuda-driver-api\/group__CUDA__GREEN__CONTEXTS.html, 2025."},{"key":"e_1_3_2_1_37_1","volume-title":"Cuda green context. 
https:\/\/docs.nvidia.com\/deploy\/mps\/index.html#volta-mps-execution-resource-provisioning","author":"NVIDIA.","year":"2025","unstructured":"NVIDIA. Cuda green context. https:\/\/docs.nvidia.com\/deploy\/mps\/index.html#volta-mps-execution-resource-provisioning, 2025."},{"key":"e_1_3_2_1_38_1","volume-title":"Nsight compute metrics decoder. https:\/\/docs.nvidia.com\/nsight-compute\/ProfilingGuide\/index.html#metrics-decoder","author":"NVIDIA.","year":"2025","unstructured":"NVIDIA. Nsight compute metrics decoder. https:\/\/docs.nvidia.com\/nsight-compute\/ProfilingGuide\/index.html#metrics-decoder, 2025."},{"key":"e_1_3_2_1_39_1","volume-title":"Special register smid. https:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/#special-registers-smid","author":"NVIDIA.","year":"2025","unstructured":"NVIDIA. Special register smid. https:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/#special-registers-smid, 2025."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620666.3651329"},{"key":"e_1_3_2_1_41_1","volume-title":"torch.mm. https:\/\/pytorch.org\/docs\/stable\/generated\/torch.mm.html","year":"2024","unstructured":"PyTorch. torch.mm. https:\/\/pytorch.org\/docs\/stable\/generated\/torch.mm.html, 2024."},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC58863.2023.10363447"},{"key":"e_1_3_2_1_43_1","first-page":"964","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Shubha Sudipta Saha","year":"2024","unstructured":"Sudipta Saha Shubha, Haiying Shen, and Anand Iyer. USHER: Holistic interference avoidance for resource optimized ML inference. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 947\u2013964, Santa Clara, CA, July 2024. 
USENIX Association."},{"key":"e_1_3_2_1_44_1","volume-title":"Dynamollm: Designing llm inference clusters for performance and energy efficiency","author":"Stojkovic Jovan","year":"2024","unstructured":"Jovan Stojkovic, Chaojie Zhang, \u00cd\u00f1igo Goiri, Josep Torrellas, and Esha Choukse. Dynamollm: Designing llm inference clusters for performance and energy efficiency, 2024."},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3627703.3629578"},{"key":"e_1_3_2_1_46_1","first-page":"332","volume-title":"Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, PACT '22","author":"Tan Xiaodan Serina","year":"2023","unstructured":"Xiaodan Serina Tan, Pavel Golikov, Nandita Vijaykumar, and Gennady Pekhimenko. Gpupool: A holistic approach to fine-grained gpu sharing in the cloud. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, PACT '22, page 317\u2013332, New York, NY, USA, 2023. Association for Computing Machinery."},{"key":"e_1_3_2_1_47_1","unstructured":"Gemma Team. Gemma 3. 2025."},{"key":"e_1_3_2_1_48_1","first-page":"960","volume-title":"19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)","author":"Weng Qizhen","year":"2022","unstructured":"Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. MLaaS in the wild: Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), pages 945\u2013960, Renton, WA, April 2022. USENIX Association."},{"key":"e_1_3_2_1_49_1","first-page":"85","volume-title":"20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)","author":"Wu Bingyang","year":"2023","unstructured":"Bingyang Wu, Zili Zhang, Zhihao Bai, Xuanzhe Liu, and Xin Jin. Transparent GPU sharing in container clouds for deep learning workloads. 
In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 69\u201385, Boston, MA, April 2023. USENIX Association."},{"key":"e_1_3_2_1_50_1","first-page":"610","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Xiao Wencong","year":"2018","unstructured":"Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 595\u2013610, Carlsbad, CA, October 2018. USENIX Association."},{"key":"e_1_3_2_1_51_1","first-page":"548","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Xiao Wencong","unstructured":"Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. AntMan: Dynamic scaling on GPU clusters for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 533\u2013548. USENIX Association, November 2020."},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2022.3232715"},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.5555\/3485849.3485855"},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3079202"},{"key":"e_1_3_2_1_55_1","volume-title":"Salus: Fine-grained GPU sharing primitives for deep learning applications. CoRR, abs\/1902.04610","author":"Yu Peifeng","year":"2019","unstructured":"Peifeng Yu and Mosharaf Chowdhury. Salus: Fine-grained GPU sharing primitives for deep learning applications. 
CoRR, abs\/1902.04610, 2019."},{"key":"e_1_3_2_1_56_1","volume-title":"Fine-grained, hardware-level gpu resource isolation for multi-tenant dnn inference","author":"Zhang Yongkang","year":"2024","unstructured":"Yongkang Zhang, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yang Li, Xiaowen Chu, and Huaicheng Li. Missile: Fine-grained, hardware-level gpu resource isolation for multi-tenant dnn inference, 2024."},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/3405671.3405810"},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2019.00074"},{"key":"e_1_3_2_1_59_1","first-page":"1385","volume-title":"Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '20","author":"Zhao Xia","year":"2020","unstructured":"Xia Zhao, Magnus Jahre, and Lieven Eeckhout. Hsm: A hybrid slowdown model for multitasking gpus. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '20, page 1371\u20131385, New York, NY, USA, 2020. Association for Computing Machinery."},{"key":"e_1_3_2_1_60_1","volume-title":"Muxflow: Efficient and safe gpu sharing in large-scale production deep learning clusters","author":"Zhao Yihao","year":"2023","unstructured":"Yihao Zhao, Xin Liu, Shufan Liu, Xiang Li, Yibo Zhu, Gang Huang, Xuanzhe Liu, and Xin Jin. Muxflow: Efficient and safe gpu sharing in large-scale production deep learning clusters, 2023."},{"key":"e_1_3_2_1_61_1","first-page":"97","volume-title":"Proceedings of the 21st ACM Conference on Embedded Networked Sensor Systems, SenSys '23","author":"Zhao Zhihe","year":"2024","unstructured":"Zhihe Zhao, Neiwen Ling, Nan Guan, and Guoliang Xing. Miriam: Exploiting elastic kernels for real-time multi-dnn inference on edge gpu. 
In Proceedings of the 21st ACM Conference on Embedded Networked Sensor Systems, SenSys '23, page 97\u2013110, New York, NY, USA, 2024. Association for Computing Machinery."},{"key":"e_1_3_2_1_62_1","volume-title":"Nanoflow: Towards optimal large language model serving throughput","author":"Zhu Kan","year":"2024","unstructured":"Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci. Nanoflow: Towards optimal large language model serving throughput, 2024."}],"event":{"name":"SoCC '25: ACM Symposium on Cloud Computing","location":"Online USA","acronym":"SoCC '25","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems","SIGMOD ACM Special Interest Group on Management of Data"]},"container-title":["Proceedings of the 2025 ACM Symposium on Cloud Computing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3772052.3772270","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T16:19:20Z","timestamp":1768321160000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3772052.3772270"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,19]]},"references-count":62,"alternative-id":["10.1145\/3772052.3772270","10.1145\/3772052"],"URL":"https:\/\/doi.org\/10.1145\/3772052.3772270","relation":{},"subject":[],"published":{"date-parts":[[2025,11,19]]},"assertion":[{"value":"2026-01-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}