{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T15:52:02Z","timestamp":1781538722586,"version":"3.54.5"},"publisher-location":"New York, NY, USA","reference-count":44,"publisher":"ACM","license":[{"start":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T00:00:00Z","timestamp":1781481600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/legalcode"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62372314"],"award-info":[{"award-number":["62372314"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"The Centre for Large AI Models (CLAIM) of The Hong Kong Polytechnic University"},{"name":"The Intelligent Computing Center of Shenzhen University"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,6,16]]},"DOI":"10.1145\/3805622.3810774","type":"proceedings-article","created":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T14:42:57Z","timestamp":1781534577000},"page":"1598-1607","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Adaptive Multi-Agent Reasoning for Text-to-Video Retrieval"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4074-3442","authenticated-orcid":false,"given":"Jiaxin","family":"Wu","sequence":"first","affiliation":[{"name":"School of Artificial Intelligence, Shenzhen University, Shenzhen, Guangdong, China and Department of Computing, The Hong Kong Polytechnic University, Hong Kong, Hong Kong"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5706-5177","authenticated-orcid":false,"given":"Xiao-Yong","family":"Wei","sequence":"additional","affiliation":[{"name":"College of Computer Science, Sichuan University, Chengdu, China and Department of Computing, The Hong Kong Polytechnic University, Hong Kong, Hong Kong"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3370-471X","authenticated-orcid":false,"given":"Qing","family":"Li","sequence":"additional","affiliation":[{"name":"Department of Computing, The Hong Kong Polytechnic University, Hong Kong, Hong Kong"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2026,6,15]]},"reference":[{"key":"e_1_3_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.02757"},{"key":"e_1_3_3_1_3_2","volume-title":"Proceedings of the TRECVid 2019 Workshop","author":"Awad George","year":"2019","unstructured":"George Awad, Asad Butt, Keith Curtis, Yooyoung Lee, Jonathan Fiscus, Godil Afzal, Andrew Delgado, Zhang Jesse, Eliot Godard, Lukas Diduch, Alan\u00a0F. Smeaton, Yvette Graham, Wessel Kraaij, and Georges Quenot. 2019. TRECVid 2019: An evaluation campaign to benchmark Video Activity Detection, Video Captioning and Matching, and Video Search and retrieval. In Proceedings of the TRECVid 2019 Workshop."},{"key":"e_1_3_3_1_4_2","unstructured":"George Awad Keith Curtis Asad Butt Jonathan Fiscus Afzal Godil Yooyoung Lee Andrew Delgado Eliot Godard Lukas Diduch Jeffrey Liu Yvette Graham and Georges Quenot. 2022. An overview on the evaluated video retrieval tasks at TRECVID 2022. arxiv:https:\/\/arXiv.org\/abs\/2306.13118\u00a0[cs.AI]"},{"key":"e_1_3_3_1_5_2","volume-title":"Proceedings of TRECVID 2023","author":"Awad George","year":"2023","unstructured":"George Awad, Keith Curtis, Asad\u00a0A. Butt, Jonathan Fiscus, Afzal Godil, Yooyoung Lee, Andrew Delgado, Eliot Godard, Lukas Diduch, Yvette Graham, , and Georges Qu\u00e9not. 2023. TRECVID 2023 - A series of evaluation tracks in video understanding. In Proceedings of TRECVID 2023. NIST, USA."},{"key":"e_1_3_3_1_6_2","first-page":"1","volume-title":"Proceedings of the TRECVid 2016 Workshop","author":"Awad George","year":"2016","unstructured":"George Awad, Fiscus Jonathan, Joy David, Michel Martial, Smeaton Alan, Kraaij Wessel, Quenot Georges, Eskevich Maria, Aly Robin, Ordelman Roeland, Jones Gareth, Huet Benoit, and LarsonMartha. 2016. TRECVid 2016: Evaluating Video Search, Video Event Detection, Localization, and Hyperlinking. In Proceedings of the TRECVid 2016 Workshop. 1\u201354."},{"key":"e_1_3_3_1_7_2","unstructured":"Shuai Bai Keqin Chen Xuejing Liu Jialin Wang Wenbin Ge Sibo Song Kai Dang Peng Wang Shijie Wang Jun Tang Humen Zhong Yuanzhi Zhu Mingkun Yang Zhaohai Li Jianqiang Wan Pengfei Wang Wei Ding Zheren Fu Yiheng Xu Jiabo Ye Xi Zhang Tianbao Xie Zesen Cheng Hang Zhang Zhibo Yang Haiyang Xu and Junyang Lin. 2025. Qwen2.5-VL Technical Report. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2502.13923 (2025)."},{"key":"e_1_3_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2025.findings-emnlp.622"},{"key":"e_1_3_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01065"},{"key":"e_1_3_3_1_10_2","doi-asserted-by":"crossref","unstructured":"Jianfeng Dong Xirong Li and Cees G.\u00a0M. Snoek. 2018. Predicting Visual Features From Text for Image and Video Caption Retrieval. IEEE Transactions on Multimedia 20 12 (2018) 3377\u20133388.","DOI":"10.1109\/TMM.2018.2832602"},{"key":"e_1_3_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00957"},{"key":"e_1_3_3_1_12_2","doi-asserted-by":"crossref","unstructured":"Jianfeng Dong Yabing Wang Xianke Chen Xiaoye Qu Xirong Li Yuan He and Xun Wang. 2022. Reading-Strategy Inspired Visual Representation Learning for Text-to-Video Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 32 (2022) 5680\u20135694.","DOI":"10.1109\/TCSVT.2022.3150959"},{"key":"e_1_3_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3372278.3390737"},{"key":"e_1_3_3_1_14_2","unstructured":"Grand View Research. 2024. Video Analytics Market Size Share & Trends Analysis Report by Type (Software Services) by Deployment (Cloud On-Premise) by Application by Vertical by Region and Segment Forecasts 2025\u20132030. https:\/\/www.grandviewresearch.com\/industry-analysis\/video-analytics-market Accessed: 2025-11-06."},{"key":"e_1_3_3_1_15_2","unstructured":"Dong Guo Faming Wu and et al. 2025. Seed1.5-VL Technical Report. arxiv:https:\/\/arXiv.org\/abs\/2505.07062\u00a0[cs.CV] https:\/\/arxiv.org\/abs\/2505.07062"},{"key":"e_1_3_3_1_16_2","doi-asserted-by":"crossref","unstructured":"Silvan Heller Viktor Gsteiger Werner Bailer Cathal Gurrin Bj\u00f6rn \u00de\u00f3r J\u00f3nsson Jakub Loko\u010d Andreas Leibetseder Franti\u0161ek Mejzl\u00edk Ladislav Pe\u0161ka Luca Rossetto Konstantin Schall Klaus Schoeffmann Heiko Schuldt Florian Spiess Ly-Duyen Tran Lucia Vadicamo Patrik Vesel\u00fd Stefanos Vrochidis and Jiaxin Wu. 2022. Interactive video retrieval evaluation at a distance: comparing sixteen interactive video search systems in a remote setting at the 10th Video Browser Showdown. International Journal of Multimedia Information Retrieval 11 (2022) 1\u201318.","DOI":"10.1007\/s13735-021-00225-2"},{"key":"e_1_3_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19781-9_26"},{"key":"e_1_3_3_1_18_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Huang Sihong","year":"2026","unstructured":"Sihong Huang, Dongmei Wu, Jiaxin\u00a0Jiang, Yi Cai, , Yaowei Wang, and Xiaoyong Wei. 2026. Compositional Transformation Reasoning for Composed Video Retrieval. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_3_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.02695"},{"key":"e_1_3_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.02242"},{"key":"e_1_3_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.02440"},{"key":"e_1_3_3_1_22_2","unstructured":"Junnan Li Dongxu Li Silvio Savarese and Steven Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv 2301.12597 (2023)."},{"key":"e_1_3_3_1_23_2","doi-asserted-by":"crossref","unstructured":"Xirong Li Fangming Zhou Chaoxi Xu Jiaqi Ji and Gang Yang. 2021. SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries. IEEE Transactions on Multimedia (2021) 4351\u20134362.","DOI":"10.1109\/TMM.2020.3042067"},{"key":"e_1_3_3_1_24_2","volume-title":"Proceedings of the TRECVid 2016 Workshop","author":"Liang Junwei","year":"2016","unstructured":"Junwei Liang, Poyao Huang, Lu Jiang, Zhenzhong Lan, Jia Chen, and Alexander Hauptmann. 2016. Informedia @ TRECVid 2016 MED and AVS. In Proceedings of the TRECVid 2016 Workshop."},{"key":"e_1_3_3_1_25_2","unstructured":"Huaishao Luo Lei Ji Ming Zhong Yang Chen Wen Lei Nan Duan and Tianrui Li. 2021. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2104.08860 (2021) 1\u201314."},{"key":"e_1_3_3_1_26_2","volume-title":"IEEE Computer Society","author":"Ngo Chong-Wah","year":"2008","unstructured":"Chong-Wah Ngo, Yu-Gang Jiang, Xiao-Yong Wei, Wanlei Zhao, Feng Wang, Xiao Wu, and Hung-Khoon Tan. 2008. Beyond semantic search: What you observe may not be what you think. In IEEE Computer Society."},{"key":"e_1_3_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-981-96-4291-5_24"},{"key":"e_1_3_3_1_28_2","first-page":"1","volume-title":"Proceedings of the TRECVid 2019 Workshop","author":"Nguyen Phuong\u00a0Anh","year":"2019","unstructured":"Phuong\u00a0Anh Nguyen, Jiaxin Wu, Chong-Wah Ngo, Francis Danny, and Huet Benoit. 2019. VIREO-EURECOM @ TRECVid 2019: Ad-hoc Video Search. In Proceedings of the TRECVid 2019 Workshop. 1\u20138."},{"key":"e_1_3_3_1_29_2","volume-title":"How Short Video Became the Center of the Social Media Universe","author":"Pellicer Miquel","year":"2025","unstructured":"Miquel Pellicer. 2025. How Short Video Became the Center of the Social Media Universe. https:\/\/foresights.substack.com\/p\/the-4-second-battle-how-short-video Accessed: 2025-11-06."},{"key":"e_1_3_3_1_30_2","first-page":"8748","volume-title":"Proceedings of the 38th International Conference on Machine Learning, ICML","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong\u00a0Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML. 8748\u20138763."},{"key":"e_1_3_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/1178677.1178722"},{"key":"e_1_3_3_1_32_2","unstructured":"Qwen Team. 2025. Qwen3 Technical Report. arxiv:https:\/\/arXiv.org\/abs\/2505.09388\u00a0[cs.CL] https:\/\/arxiv.org\/abs\/2505.09388"},{"key":"e_1_3_3_1_33_2","first-page":"1","volume-title":"Proceedings of the TRECVid 2016 Workshop","author":"Ueki Kazuya","year":"2016","unstructured":"Kazuya Ueki, Kotaro Kikuchi, Susumu Saito, and Tetsunori Kobayashi. 2016. Waseda at TRECVid 2016: Ad-hoc Video Search. In Proceedings of the TRECVid 2016 Workshop. 1\u20135."},{"key":"e_1_3_3_1_34_2","volume-title":"The Twelfth International Conference on Learning Representations","author":"Wang Yi","year":"2023","unstructured":"Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et\u00a0al. 2023. InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_3_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3547968"},{"key":"e_1_3_3_1_36_2","first-page":"1","volume-title":"Proceedings of the TRECVid 2021 Workshop","author":"Wu Jiaxin","year":"2021","unstructured":"Jiaxin Wu, Zhijian Hou, Zhixin Ma, and Chong-Wah Ngo. 2021. VIREO @ TRECVid 2021 Ad-hoc Video Search. In Proceedings of the TRECVid 2021 Workshop. 1\u20139."},{"key":"e_1_3_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413916"},{"key":"e_1_3_3_1_38_2","volume-title":"ACM Transactions on Information Systems","author":"Wu Jiaxin","year":"2023","unstructured":"Jiaxin Wu, Chong-Wah Ngo, Wing-Kwong Chan, and Zhijian Hou. 2023. (Un)likelihood Training for Interpretable Embedding. In ACM Transactions on Information Systems , Vol.\u00a042. Article 75, 26\u00a0pages."},{"key":"e_1_3_3_1_39_2","unstructured":"Jiaxin Wu Chong-Wah Ngo Wing-Kwong Chan Sheng-Hua Zhong Xiong-Yong Wei and Qing Li. 2025. Multimodal LLM-based Query Paraphrasing for Video Search. arxiv:https:\/\/arXiv.org\/abs\/2407.12341\u00a0[cs.MM] https:\/\/arxiv.org\/abs\/2407.12341"},{"key":"e_1_3_3_1_40_2","volume-title":"Proceedings of the TRECVid 2020 Workshop","author":"Wu Jiaxin","year":"2020","unstructured":"Jiaxin Wu, Phuong\u00a0Anh Nguyen, and Chong-Wah Ngo. 2020. VIREO @ TRECVid 2020: Ad-hoc Video Search. In Proceedings of the TRECVid 2020 Workshop."},{"key":"e_1_3_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3652583.3658052"},{"key":"e_1_3_3_1_42_2","doi-asserted-by":"crossref","unstructured":"Zhen-Qun\u00a0Yang Xiao-Yong\u00a0Wei. 2013. Coaching the exploration and exploitation in active learning for interactive video retrieval. IEEE Transactions on Image Processing 22 3 (2013) 955\u2013968.","DOI":"10.1109\/TIP.2012.2222902"},{"key":"e_1_3_3_1_43_2","volume-title":"The Thirteenth International Conference on Learning Representations","author":"Ye Jiabo","year":"2025","unstructured":"Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. 2025. mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models. In The Thirteenth International Conference on Learning Representations."},{"key":"e_1_3_3_1_44_2","doi-asserted-by":"crossref","unstructured":"Haonan Zhang Pengpeng Zeng Lianli Gao Jingkuan Song Yihang Duan Xinyu Lyu and Heng\u00a0Tao Shen. 2025. Text-Video Retrieval With Global-Local Semantic Consistent Learning. IEEE Transactions on Image Processing 34 (2025) 3463\u20133474.","DOI":"10.1109\/TIP.2025.3574925"},{"key":"e_1_3_3_1_45_2","volume-title":"The Twelfth International Conference on Learning Representations","author":"Zhu Bin","year":"2024","unstructured":"Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, WANG HongFa, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Cai\u00a0Wan Zhang, Zhifeng Li, Wei Liu, and Li Yuan. 2024. LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. In The Twelfth International Conference on Learning Representations."}],"event":{"name":"ICMR '26: International Conference on Multimedia Retrieval","location":"Amsterdam The Netherlands","acronym":"ICMR '26","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 2026 International Conference on Multimedia Retrieval"],"original-title":[],"deposited":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T14:52:43Z","timestamp":1781535163000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3805622.3810774"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,6,15]]},"references-count":44,"alternative-id":["10.1145\/3805622.3810774","10.1145\/3805622"],"URL":"https:\/\/doi.org\/10.1145\/3805622.3810774","relation":{},"subject":[],"published":{"date-parts":[[2026,6,15]]},"assertion":[{"value":"2026-06-15","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}