{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T20:52:12Z","timestamp":1777063932876,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":72,"publisher":"ACM","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62572341"],"award-info":[{"award-number":["62572341"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,4,27]]},"DOI":"10.1145\/3767295.3803581","type":"proceedings-article","created":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T20:20:04Z","timestamp":1777062004000},"page":"423-438","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["PARD: Enhancing Goodput for Inference Pipeline via Proactive Request Dropping"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-7865-4905","authenticated-orcid":false,"given":"Zhixin","family":"Zhao","sequence":"first","affiliation":[{"name":"Tianjin University, Tianjin, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-0458-0900","authenticated-orcid":false,"given":"Yitao","family":"Hu","sequence":"additional","affiliation":[{"name":"Tianjin University, Tianjin, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5035-3398","authenticated-orcid":false,"given":"Simin","family":"Chen","sequence":"additional","affiliation":[{"name":"University of Texas at Dallas, Richardson, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-0691-7994","authenticated-orcid":false,"given":"Mingfang","family":"Ji","sequence":"additional","affiliation":[{"name":"Tianjin University, Tianjin, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5338-7347","authenticated-orcid":false,"given":"Wei","family":"Yang","sequence":"additional","affiliation":[{"name":"University of Texas at Dallas, Richardson, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0118-8135","authenticated-orcid":false,"given":"Yuhao","family":"Zhang","sequence":"additional","affiliation":[{"name":"Tianjin University, Tianjin, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1967-2192","authenticated-orcid":false,"given":"Laiping","family":"Zhao","sequence":"additional","affiliation":[{"name":"Tianjin University, Tianjin, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8507-0339","authenticated-orcid":false,"given":"Wenxin","family":"Li","sequence":"additional","affiliation":[{"name":"Tianjin University, Tianjin, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8414-1668","authenticated-orcid":false,"given":"Xiulong","family":"Liu","sequence":"additional","affiliation":[{"name":"Tianjin University, Tianjin, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4817-5187","authenticated-orcid":false,"given":"Wenyu","family":"Qu","sequence":"additional","affiliation":[{"name":"Tianjin University, Tianjin, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1444-2657","authenticated-orcid":false,"given":"Hao","family":"Wang","sequence":"additional","affiliation":[{"name":"Stevens Institute of 
Technology, Hoboken, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,4,26]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359658"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3472883.3486993"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/RTAS54340.2022.00020"},{"key":"e_1_3_2_1_4_1","volume-title":"Ipa: Inference pipeline adaptation to achieve high accuracy and cost-efficiency. arXiv preprint arXiv:2308.12871","author":"Ghafouri Saeid","year":"2023","unstructured":"Saeid Ghafouri, Kamran Razavi, Mehran Salmani, Alireza Sanaee, Tania Lorido-Botran, Lin Wang, Joseph Doyle, and Pooyan Jamshidi. Ipa: Inference pipeline adaptation to achieve high accuracy and cost-efficiency. arXiv preprint arXiv:2308.12871, 2023."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3419111.3421285"},{"key":"e_1_3_2_1_6_1","first-page":"286","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Hsieh Kevin","year":"2018","unstructured":"Kevin Hsieh, Ganesh Ananthanarayanan, Peter Bodik, Shivaram Venkataraman, Paramvir Bahl, Matthai Philipose, Phillip B Gibbons, and Onur Mutlu. Focus: Querying large video datasets with low latency and low cost. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 269\u2013286, 2018."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/RTSS46320.2019.00042"},{"key":"e_1_3_2_1_8_1","first-page":"1062","volume-title":"USENIX Annual Technical Conference","author":"Zhang Chengliang","year":"2019","unstructured":"Chengliang Zhang, Minchen Yu, Wei Wang, and Feng Yan. MArk: Exploiting cloud services for cost-effective, slo-aware machine learning inference serving. 
In USENIX Annual Technical Conference, pages 1049\u20131062, 2019."},{"key":"e_1_3_2_1_9_1","first-page":"808","volume-title":"20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)","author":"Zhang Hong","year":"2023","unstructured":"Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. SHEPHERD: Serving DNNs in the wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 787\u2013808, 2023."},{"key":"e_1_3_2_1_10_1","first-page":"1057","volume-title":"USENIX NSDI","author":"Gunasekaran Jashwant Raj","year":"2022","unstructured":"Jashwant Raj Gunasekaran, Cyan Subhra Mishra, Prashanth Thinakaran, Bikash Sharma, Mahmut Taylan Kandemir, and Chita R Das. Cocktail: A multidimensional optimization for model serving in cloud. In USENIX NSDI, pages 1041\u20131057, 2022."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3617232.3624849"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2022.3232715"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCC.2020.3006751"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3190508.3190541"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-48424-7_18"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3631311.3632401"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2023.102888"},{"key":"e_1_3_2_1_18_1","volume-title":"xllm technical report","author":"Liu Tongxuan","year":"2026","unstructured":"Tongxuan Liu, Tao Peng, Peijun Yang, Xiaoyang Zhao, Xiusheng Lu, Weizhe Huang, Zirui Liu, Xiaoyu Chen, Zhiwei Liang, Jun Xiong, Donghe Jin, Minchao Zhang, Jinrong Guo, Yingxu Deng, Xu Zhang, Xianzhe Dong, Siqi Wang, Siyu Wu, Yu Wu, Zihan Tang, Yuting Zeng, Yanshu Wang, Jinguang Liu, Meng Kang, Menxin Li, Yunlong Wang, Yiming Liu, Xiaolong Ma, Yifan Wang, Yichen Zhang, Jinrun Yin, Keyang Zheng, Jiawei Yin, 
Jun Zhang, Ziyue Wang, Xiaobo Lin, Liangyu Liu, Liwei Lan, Yang Liu, Chunhua Peng, Han Liu, Songcheng Ren, Xuezhu Wang, Yunheng Shen, Yi Wang, Guyue Liu, Yitao Hu, Hui Chen, Tong Yang, Hailong Yang, Jing Li, Guiguang Ding, and Ke Zhang. xllm technical report, 2026."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD51958.2021.9643501"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2408776.2408794"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3623278.3624753"},{"key":"e_1_3_2_1_22_1","first-page":"627","volume-title":"NSDI","volume":"17","author":"Crankshaw Daniel","year":"2017","unstructured":"Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J Franklin, Joseph E Gonzalez, and Ion Stoica. Clipper: A low-latency online prediction serving system. In NSDI, volume 17, pages 613\u2013627, 2017."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3135974.3135993"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2025.3528125"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM55648.2025.11044536"},{"key":"e_1_3_2_1_26_1","first-page":"495","article-title":"Ml-optimized cache management for inference-oriented gpus","volume":"5","author":"Fu Yaosheng","year":"2023","unstructured":"Yaosheng Fu, Evgeny Bolotin, Aamer Jaleel, Gal Dalal, Shie Mannor, Jacob Subag, Noam Korem, Michael Behar, and David Nellans. Auto-scratch: Ml-optimized cache management for inference-oriented gpus. Proceedings of Machine Learning and Systems, 5:495\u2013512, 2023.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_27_1","volume-title":"Flexible, high-performance ml serving. arXiv preprint arXiv:1712.06139","author":"Olston Christopher","year":"2017","unstructured":"Christopher Olston, Noah Fiedel, Kiril Gorovoy, Jeremiah Harmsen, Li Lao, Fangwei Li, Vinu Rajashekhar, Sukriti Ramesh, and Jordan Soyke. 
Tensorflow-serving: Flexible, high-performance ml serving. arXiv preprint arXiv:1712.06139, 2017."},{"key":"e_1_3_2_1_28_1","first-page":"462","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Gujarati Arpan","year":"2020","unstructured":"Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. Serving DNNs like clockwork: Performance predictability from the bottom up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 443\u2013462, 2020."},{"key":"e_1_3_2_1_29_1","first-page":"411","volume-title":"USENIX Annual Technical Conference","author":"Romero Francisco","year":"2021","unstructured":"Francisco Romero, Qian Li, Neeraja J Yadwadkar, and Christos Kozyrakis. INFaaS: Automated model-less inference serving. In USENIX Annual Technical Conference, pages 397\u2013411, 2021."},{"key":"e_1_3_2_1_30_1","volume-title":"Orloj: Predictably serving unpredictable dnns. arXiv preprint arXiv:2209.00159","author":"Yu Peifeng","year":"2022","unstructured":"Peifeng Yu, Yuqing Qiu, Xin Jin, and Mosharaf Chowdhury. Orloj: Predictably serving unpredictable dnns. arXiv preprint arXiv:2209.00159, 2022."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3503222.3507709"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00049"},{"key":"e_1_3_2_1_33_1","volume-title":"PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems, 32, 2019."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM55648.2025.11044490"},{"key":"e_1_3_2_1_35_1","volume-title":"Network time protocol version 4: Protocol and algorithms specification. Technical report","author":"Mills David","year":"2010","unstructured":"David Mills, Jim Martin, Jack Burbank, and William Kasch. Network time protocol version 4: Protocol and algorithms specification. Technical report, University of Delaware, 2010."},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/NetGames.2018.8463362"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3097964"},{"key":"e_1_3_2_1_38_1","volume-title":"Youtube live streaming. https:\/\/www.youtube.com\/howyoutubeworks\/product-features\/live\/","year":"2024","unstructured":"YouTube. Youtube live streaming. https:\/\/www.youtube.com\/howyoutubeworks\/product-features\/live\/, 2024."},{"key":"e_1_3_2_1_39_1","volume-title":"https:\/\/www.twitch.tv\/","year":"2024","unstructured":"Twitch. Twitch. https:\/\/www.twitch.tv\/, 2024."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.comnet.2009.02.019"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/GHTC.2018.8601887"},{"key":"e_1_3_2_1_42_1","volume-title":"Proceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC'20, USA","author":"Shahrad Mohammad","year":"2020","unstructured":"Mohammad Shahrad, Rodrigo Fonseca, \u00cd\u00f1igo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, and Ricardo Bianchini. Serverless in the wild: characterizing and optimizing the serverless workload at a large cloud provider. In Proceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC'20, USA, 2020. 
USENIX Association."},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3627703.3629578"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3267809.3267823"},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41406.2024.00044"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3472883.3486992"},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3423211.3425690"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA56546.2023.10071120"},{"key":"e_1_3_2_1_49_1","first-page":"314","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Cho Inho","year":"2020","unstructured":"Inho Cho, Ahmed Saeed, Joshua Fried, Seo Jin Park, Mohammad Alizadeh, and Adam Belay. Overload control for \u03bcs-scale RPCs with breakwater. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 299\u2013314, 2020."},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/1133373.1133386"},{"key":"e_1_3_2_1_51_1","volume-title":"Seda: An architecture for well-conditioned, scalable internet services. ACM SIGOPS operating systems review, 35(5):230\u2013243","author":"Welsh Matt","year":"2001","unstructured":"Matt Welsh, David Culler, and Eric Brewer. Seda: An architecture for well-conditioned, scalable internet services. ACM SIGOPS operating systems review, 35(5):230\u2013243, 2001."},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132747.3132774"},{"key":"e_1_3_2_1_53_1","first-page":"738","volume-title":"20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)","author":"Cho Inho","year":"2023","unstructured":"Inho Cho, Ahmed Saeed, Seo Jin Park, Mohammad Alizadeh, and Adam Belay. Protego: Overload control for applications with unpredictable lock contention. 
In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 725\u2013738, 2023."},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3445814.3446693"},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/3445814.3446700"},{"key":"e_1_3_2_1_56_1","first-page":"165","volume-title":"21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)","author":"Wang Zibo","year":"2024","unstructured":"Zibo Wang, Pinghe Li, Chieh-Jan Mike Liang, Feng Wu, and Francis Y Yan. Autothrottle: A practical Bi-Level approach to resource management for SLO-Targeted microservices. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 149\u2013165, 2024."},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/3302424.3303958"},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_2_1_59_1","volume-title":"LangChain: A framework for building LLM-powered applications. https:\/\/github.com\/langchain-ai\/langchain","author":"LangChain","year":"2025","unstructured":"LangChain AI. LangChain: A framework for building LLM-powered applications. https:\/\/github.com\/langchain-ai\/langchain, 2025. GitHub repository, supports chaining together models, embeddings, vector stores, and tools."},{"key":"e_1_3_2_1_60_1","volume-title":"Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600","author":"Yang Zhilin","year":"2018","unstructured":"Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018."},{"key":"e_1_3_2_1_61_1","volume-title":"The llama 3 herd of models. 
arXiv preprint arXiv:2407.21783","author":"Grattafiori Aaron","year":"2024","unstructured":"Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024."},{"key":"e_1_3_2_1_62_1","volume-title":"The faiss library","author":"Douze Matthijs","year":"2024","unstructured":"Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazar\u00e9, Maria Lomeli, Lucas Hosseini, and Herv\u00e9 J\u00e9gou. The faiss library. 2024."},{"key":"e_1_3_2_1_63_1","volume-title":"Tavily: The Web Access Layer for AI Agents. https:\/\/www.tavily.com\/","year":"2025","unstructured":"Tavily. Tavily: The Web Access Layer for AI Agents. https:\/\/www.tavily.com\/, 2025. Official website, provides APIs for real-time search, extraction, crawling."},{"key":"e_1_3_2_1_64_1","first-page":"20","article-title":"Sla-driven ml inference framework for clouds with heterogeneous accelerators","volume":"4","author":"Cho Junguk","year":"2022","unstructured":"Junguk Cho, Diman Zad Tootaghaj, Lianjie Cao, and Puneet Sharma. Sla-driven ml inference framework for clouds with heterogeneous accelerators. Proceedings of Machine Learning and Systems, 4:20\u201332, 2022.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_65_1","article-title":"Efficient inference serving for hybrid deep learning with slo guarantees via dnn re-alignment","author":"Wu Jing","year":"2023","unstructured":"Jing Wu, Lin Wang, Qirui Jin, and Fangming Liu. Graft: Efficient inference serving for hybrid deep learning with slo guarantees via dnn re-alignment. 
IEEE Transactions on Parallel and Distributed Systems, 2023.","journal-title":"IEEE Transactions on Parallel and Distributed Systems"},{"key":"e_1_3_2_1_66_1","volume-title":"Automated ensemble for deep learning inference on edge computing platforms","author":"Bai Yang","year":"2021","unstructured":"Yang Bai, Lixing Chen, Mohamed Abdel-Mottaleb, and Jie Xu. Automated ensemble for deep learning inference on edge computing platforms. IEEE internet of things journal, 9(6):4202\u20134213, 2021."},{"key":"e_1_3_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/3552326.3567508"},{"key":"e_1_3_2_1_68_1","first-page":"172","article-title":"Accelerating training and inference of graph neural networks with fast sampling and pipelining","volume":"4","author":"Kaler Tim","year":"2022","unstructured":"Tim Kaler, Nickolas Stathas, Anne Ouyang, Alexandros-Stavros Iliopoulos, Tao Schardl, Charles E Leiserson, and Jie Chen. Accelerating training and inference of graph neural networks with fast sampling and pipelining. Proceedings of Machine Learning and Systems, 4:172\u2013189, 2022.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_69_1","first-page":"13","volume-title":"Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC '18","author":"Harlap Aaron","year":"2018","unstructured":"Aaron Harlap, Andrew Chung, Alexey Tumanov, Gregory R. Ganger, and Phillip B. Gibbons. Tributary: Spot-dancing for elastic services with latency slos. In Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC '18, page 1\u201313, USA, 2018."},{"key":"e_1_3_2_1_70_1","volume-title":"Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2:1","author":"Gao Yunfan","year":"2023","unstructured":"Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. 
Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2:1, 2023."},{"key":"e_1_3_2_1_71_1","first-page":"93","volume-title":"2024 USENIX Annual Technical Conference (USENIX ATC 24)","author":"Qiu Haoran","year":"2024","unstructured":"Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew Kalbarczyk, Tamer Ba\u015far, and Ravishankar K Iyer. Power-aware deep learning model serving with \u03bc-Serve. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 75\u201393, 2024."},{"key":"e_1_3_2_1_72_1","first-page":"18015","article-title":"Increasing gpu utilization during generative inference for higher throughput","volume":"36","author":"Jin Yunho","year":"2023","unstructured":"Yunho Jin, Chun-Feng Wu, David Brooks, and Gu-Yeon Wei. s3: Increasing gpu utilization during generative inference for higher throughput. Advances in Neural Information Processing Systems, 36:18015\u201318027, 2023.","journal-title":"Advances in Neural Information Processing Systems"}],"event":{"name":"EUROSYS '26: 21st European Conference on Computer Systems","location":"McEwan Hall\/The University of Edinburgh Edinburgh Scotland UK","acronym":"EUROSYS '26","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems"]},"container-title":["Proceedings of the 21st European Conference on Computer 
Systems"],"original-title":[],"deposited":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T20:23:59Z","timestamp":1777062239000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3767295.3803581"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,4,26]]},"references-count":72,"alternative-id":["10.1145\/3767295.3803581","10.1145\/3767295"],"URL":"https:\/\/doi.org\/10.1145\/3767295.3803581","relation":{},"subject":[],"published":{"date-parts":[[2026,4,26]]},"assertion":[{"value":"2026-04-26","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}