{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,17]],"date-time":"2025-11-17T12:08:07Z","timestamp":1763381287292,"version":"3.45.0"},"publisher-location":"New York, NY, USA","reference-count":48,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,11,17]]},"DOI":"10.1145\/3772356.3772390","type":"proceedings-article","created":{"date-parts":[[2025,11,17]],"date-time":"2025-11-17T12:02:48Z","timestamp":1763380968000},"page":"402-410","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-7628-6673","authenticated-orcid":false,"given":"Jiangkai","family":"Wu","sequence":"first","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-4133-7416","authenticated-orcid":false,"given":"Zhiyuan","family":"Ren","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-9168-4897","authenticated-orcid":false,"given":"Liming","family":"Liu","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0484-5951","authenticated-orcid":false,"given":"Xinggong","family":"Zhang","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,11,17]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2025. Qwen3-VL-Plus. https:\/\/bailian.console.aliyun.com\/?spm=a2c4g.11186623.0.0.74e555efL5VoGI&tab=model#\/model-market\/detail\/qwen3-vl-plus."},{"key":"e_1_3_2_1_2_1","unstructured":"2025. VMAF. 
https:\/\/github.com\/Netflix\/vmaf."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3230543.3230558"},{"key":"e_1_3_2_1_4_1","volume-title":"22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25)","author":"An Congkai","year":"2025","unstructured":"Congkai An, Huanhuan Zhang, Shibo Wang, Jingyang Kang, Anfu Zhou, Liang Liu, Huadong Ma, Zili Meng, Delei Ma, Yusheng Dong, et al. 2025. Tooth: Toward Optimal Balance of Video QoE and Redundancy Cost by Fine-Grained FEC in Cloud Gaming Streaming. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25). 635\u2013651."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3009824"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2910017.2910605"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01742"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3651890.3672260"},{"key":"e_1_3_2_1_9_1","volume-title":"Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811","author":"Chen Xiaokang","year":"2025","unstructured":"Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. 2025. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025)."},{"key":"e_1_3_2_1_10_1","volume-title":"21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)","author":"Cheng Yihua","year":"2024","unstructured":"Yihua Cheng, Ziyi Zhang, Hanchen Li, Anton Arapin, Yue Zhang, Qizheng Zhang, Yuhan Liu, Kuntai Du, Xu Zhang, Francis Y Yan, et al. 2024. GRACE: Loss-Resilient Real-Time video through neural codecs. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). 
509\u2013531."},{"key":"e_1_3_2_1_11_1","volume-title":"12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15)","author":"Dong Mo","year":"2015","unstructured":"Mo Dong, Qingxi Li, Doron Zarchy, P Brighten Godfrey, and Michael Schapira. 2015. PCC: Re-architecting congestion control for consistent high performance. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15). 395\u2013408."},{"key":"e_1_3_2_1_12_1","volume-title":"15th USENIX symposium on networked systems design and implementation (NSDI 18)","author":"Dong Mo","year":"2018","unstructured":"Mo Dong, Tong Meng, Doron Zarchy, Engin Arslan, Yossi Gilad, Brighten Godfrey, and Michael Schapira. 2018. PCC vivace: Online-Learning congestion control. In 15th USENIX symposium on networked systems design and implementation (NSDI 18). 343\u2013356."},{"key":"e_1_3_2_1_13_1","unstructured":"Chaoyou Fu Haojia Lin Xiong Wang Yi-Fan Zhang Yunhang Shen Xiaoyu Liu Haoyu Cao Zuwei Long Heting Gao Ke Li et al. 2025. Vita-1.5: Towards GPT-4o level real-time vision and speech interaction. arXiv preprint arXiv:2501.01957 (2025)."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2619239.2626296"},{"key":"e_1_3_2_1_15_1","unstructured":"Aaron Hurst Adam Lerer Adam P Goucher Adam Perelman Aditya Ramesh Aidan Clark AJ Ostrow Akila Welihinda Alan Hayes Alec Radford et al. 2024. GPT-4o system card. 
arXiv preprint arXiv:2410.21276 (2024)."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2413176.2413189"},{"key":"e_1_3_2_1_17_1","unstructured":"Zhiwei Jin Xiaohui Song Nan Wang Yafei Liu Chao Li Xin Li Ruichen Wang Zhihao Li Qi Qi Long Cheng Dongze Hao Quanlong Zheng Yanhao Zhang Haobo Ji Jian Ma Zhitong Zheng Zhenyi Lin Haolin Deng Xin Zou Xiaojie Yin Ruilin Wang Liankai Cai Haijing Liu Yuqing Qiu Ke Chen Zixian Li Chi Xie Huafei Li Chenxing Li Chuangchuang Wang Kai Tang Zhiguang Zhu Kai Tang Wenmei Gao Rui Wang Jun Wu Chao Liu Qin Xie Chen Chen and Haonan Lu. 2025. AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model. arXiv:2510.11496 [cs.CV] https:\/\/arxiv.org\/abs\/2510.11496"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM48880.2022.9796887"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.5573\/IEIESPC.2023.12.2.122"},{"key":"e_1_3_2_1_20_1","volume-title":"Reparo: Loss-resilient generative codec for video conferencing. arXiv preprint arXiv:2305.14135","author":"Li Tianhong","year":"2023","unstructured":"Tianhong Li, Vibhaalakshmi Sivaraman, Pantea Karimi, Lijie Fan, Mohammad Alizadeh, and Dina Katabi. 2023. Reparo: Loss-resilient generative codec for video conferencing. arXiv preprint arXiv:2305.14135 (2023)."},{"key":"e_1_3_2_1_21_1","volume-title":"MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer. arXiv preprint arXiv:2509.16197","author":"Li Yanghao","year":"2025","unstructured":"Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, et al. 2025. MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer. arXiv preprint arXiv:2509.16197 (2025)."},{"key":"e_1_3_2_1_22_1","volume-title":"Streamingbench: Assessing the gap for MLLMs to achieve streaming video understanding. 
arXiv preprint arXiv:2411.03628","author":"Lin Junming","year":"2024","unstructured":"Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, and Maosong Sun. 2024. Streamingbench: Assessing the gap for MLLMs to achieve streaming video understanding. arXiv preprint arXiv:2411.03628 (2024)."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3098822.3098843"},{"key":"e_1_3_2_1_24_1","volume-title":"21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)","author":"Meng Zili","year":"2024","unstructured":"Zili Meng, Xiao Kong, Jing Chen, Bo Wang, Mingwei Xu, Rui Han, Honghao Liu, Venkat Arun, Hongxin Hu, and Xue Wei. 2024. Hairpin: Rethinking packet loss recovery in edge-based interactive video streaming. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). 907\u2013926."},{"key":"e_1_3_2_1_25_1","volume-title":"Latency Optimization in Interactive Multimedia Streaming","author":"Meng Zili","unstructured":"Zili Meng and Mingwei Xu. 2024. Feedback on Control Path: Early Congestion Feedback. In Latency Optimization in Interactive Multimedia Streaming. Springer, 23\u201342."},{"key":"e_1_3_2_1_26_1","volume-title":"Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction. arXiv preprint arXiv:2501.03218","author":"Qian Rui","year":"2025","unstructured":"Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. 2025. Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction. arXiv preprint arXiv:2501.03218 (2025)."},{"key":"e_1_3_2_1_27_1","unstructured":"Haoran Qiu Anish Biswas Zihan Zhao Jayashree Mohan Alind Khare Esha Choukse \u00cd\u00f1igo Goiri Zeyu Zhang Haiying Shen Chetan Bansal Ramachandran Ramjee and Rodrigo Fonseca. 2025. 
Mod-Serve: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving. arXiv:2502.00937 [cs.DC] https:\/\/arxiv.org\/abs\/2502.00937"},{"key":"e_1_3_2_1_28_1","volume-title":"International conference on machine learning. PMLR, 8748\u20138763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748\u20138763."},{"key":"e_1_3_2_1_29_1","volume-title":"20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)","author":"Rudow Michael","year":"2023","unstructured":"Michael Rudow, Francis Y Yan, Abhishek Kumar, Ganesh Ananthanarayanan, Martin Ellis, and KV Rashmi. 2023. Tambur: Efficient loss recovery for videoconferencing via streaming codes. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 953\u2013971."},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2007.905532"},{"key":"e_1_3_2_1_31_1","volume-title":"Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818","author":"Team Chameleon","year":"2024","unstructured":"Chameleon Team. 2024. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024)."},{"key":"e_1_3_2_1_32_1","volume-title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530","author":"Team Gemini","year":"2024","unstructured":"Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. 
arXiv preprint arXiv:2403.05530 (2024)."},{"key":"e_1_3_2_1_33_1","unstructured":"V Team Wenyi Hong Wenmeng Yu et al. 2025. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. arXiv preprint arXiv:2507.01006 (2025)."},{"key":"e_1_3_2_1_34_1","unstructured":"Aaron Van Den Oord Oriol Vinyals et al. 2017. Neural discrete representation learning. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_2_1_35_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 15963\u201315974","author":"Anasosalu Vasu Pavan Kumar","year":"2024","unstructured":"Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, and Oncel Tuzel. 2024. Mobileclip: Fast image-text models through multi-modal reinforced training. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 15963\u201315974."},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2973796"},{"key":"e_1_3_2_1_37_1","volume-title":"StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant. arXiv preprint arXiv:2505.05467","author":"Wang Haibo","year":"2025","unstructured":"Haibo Wang, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge, Afshin Dehghan, Meng Cao, and Ping Huang. 2025. StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant. arXiv preprint arXiv:2505.05467 (2025)."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.01763"},{"key":"e_1_3_2_1_39_1","volume-title":"Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge. arXiv preprint arXiv:2501.13468","author":"Xiong Haomiao","year":"2025","unstructured":"Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, and Huchuan Lu. 2025. 
Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge. arXiv preprint arXiv:2501.13468 (2025)."},{"key":"e_1_3_2_1_40_1","unstructured":"Jin Xu Zhifang Guo Jinzheng He Hangrui Hu Ting He Shuai Bai Keqin Chen Jialin Wang Yang Fan Kai Dang et al. 2025. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215 (2025)."},{"key":"e_1_3_2_1_41_1","volume-title":"17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20)","author":"Yan Francis Y","year":"2020","unstructured":"Francis Y Yan, Hudson Ayers, Chenzhi Zhu, Sadjad Fouladi, James Hong, Keyi Zhang, Philip Levis, and Keith Winstein. 2020. Learning in situ: a randomized experiment in video streaming. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20). 495\u2013511."},{"key":"e_1_3_2_1_42_1","volume-title":"SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding. In The Thirteenth International Conference on Learning Representations.","author":"Yang Zhenyu","unstructured":"Zhenyu Yang, Yuhang Hu, Zemin Du, Dizhan Xue, Shengsheng Qian, Jiahong Wu, Fan Yang, Weiming Dong, and Changsheng Xu. [n. d.]. SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding. In The Thirteenth International Conference on Learning Representations."},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"crossref","unstructured":"Linli Yao Yicheng Li Yuancheng Wei Lei Li Shuhuai Ren Yuanxin Liu Kun Ouyang Lean Wang Shicheng Li Sida Li et al. 2025. TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos. arXiv preprint arXiv:2504.17343 (2025).","DOI":"10.1145\/3746027.3754839"},{"key":"e_1_3_2_1_44_1","unstructured":"Yuan Yao Tianyu Yu Ao Zhang Chongyi Wang Junbo Cui Hongji Zhu Tianchi Cai Haoyu Li Weilin Zhao Zhihui He Qianyu Chen Huarong Zhou Zhensheng Zou Haoye Zhang Shengding Hu Zhi Zheng Jie Zhou Jie Cai Xu Han Guoyang Zeng Dahai Li Zhiyuan Liu and Maosong Sun. 2024. 
MiniCPM-V: A GPT-4V Level MLLM on Your Phone. arXiv:2408.01800 [cs.CV] https:\/\/arxiv.org\/abs\/2408.01800"},{"key":"e_1_3_2_1_45_1","volume-title":"The Twelfth International Conference on Learning Representations.","author":"Yu Lijun","unstructured":"Lijun Yu, Jose Lezama, Nitesh Bharadwaj Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. [n. d.]. Language Model Beats Diffusion-Tokenizer is key to visual generation. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_3_2_1_46_1","unstructured":"Pan Zhang Xiaoyi Dong Yuhang Cao Yuhang Zang Rui Qian Xilin Wei Lin Chen Yifei Li Junbo Niu Shuangrui Ding et al. 2024. InternLM-XComposer2.5-OmniLive: A comprehensive multimodal system for long-term streaming video and audio interactions. arXiv preprint arXiv:2412.09596 (2024)."},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3651863.3651881"},{"key":"e_1_3_2_1_48_1","volume-title":"Aim: Adaptive inference of multi-modal LLMs via token merging and pruning. arXiv preprint arXiv:2412.03248","author":"Zhong Yiwu","year":"2024","unstructured":"Yiwu Zhong, Zhuoming Liu, Yin Li, and Liwei Wang. 2024. Aim: Adaptive inference of multi-modal LLMs via token merging and pruning. 
arXiv preprint arXiv:2412.03248 (2024)."}],"event":{"name":"HotNets '25: 24th ACM Workshop on Hot Topics in Networks","location":"UMD Campus College Park MD USA","acronym":"HotNets '25","sponsor":["SIGCOMM ACM Special Interest Group on Data Communication"]},"container-title":["Proceedings of the 24th ACM Workshop on Hot Topics in Networks"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3772356.3772390","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,17]],"date-time":"2025-11-17T12:04:05Z","timestamp":1763381045000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3772356.3772390"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,17]]},"references-count":48,"alternative-id":["10.1145\/3772356.3772390","10.1145\/3772356"],"URL":"https:\/\/doi.org\/10.1145\/3772356.3772390","relation":{},"subject":[],"published":{"date-parts":[[2025,11,17]]},"assertion":[{"value":"2025-11-17","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}