{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T23:35:38Z","timestamp":1761176138100,"version":"build-2065373602"},"reference-count":0,"publisher":"IOS Press","isbn-type":[{"value":"9781643686318","type":"electronic"}],"license":[{"start":{"date-parts":[[2025,10,21]],"date-time":"2025-10-21T00:00:00Z","timestamp":1761004800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,10,21]]},"abstract":"<jats:p>The burgeoning field of video-text retrieval has witnessed significant advancements with the advent of deep learning. However, understanding and matching textual descriptions and video data remains a formidable challenge due to the large information gap across textual and video modalities. As observed, the caption of a video is commonly under-described, lacking expressions of minor characters or local details. Some recent advances have attempted to leverage multimodal Large Language Model (mLLM) to bridge the comprehension gap. However, mLLMs\u2019 potential in enhancing video-text retrieval (VTR) is understudied. This paper aims to fill this research vacancy, analyzing the practical significance and model preferences for utilizing mLLMs in VTR enhancement, as well as investigating the effective integration of mLLM-derived information into the retrieval learning. Based on our analytical insights, we innovatively propose treating mLLM as caption supplements rather than substitutes to bridge the expression gap across modalities. To achieve better cross-modal alignment, we systematically generate diverse variations of videos to construct an elastic visual space. By treating mLLM-supplemented captions as out-of-space points, cross-modal representation learning is accomplished through the optimization of a conical-like representation space. Our model achieves state-of-the-art results on various benchmarks, including MSR-VTT, MSVD, and DiDeMo, and analytical experiments suggest appropriate prompt proposals and indicate our method\u2019s robustness to different mLLMs.<\/jats:p>","DOI":"10.3233\/faia250879","type":"book-chapter","created":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T09:44:47Z","timestamp":1761126287000},"source":"Crossref","is-referenced-by-count":0,"title":["Unlocking the Potential of mLLMs: Enhancing Video-Text Retrieval Through Caption Supplementation and Conical Embedding Optimization"],"prefix":"10.3233","author":[{"given":"Baoyao","family":"Yang","sequence":"first","affiliation":[{"name":"School of Computers, Guangdong University of Technology"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Junxiang","family":"Chen","sequence":"additional","affiliation":[{"name":"WeChat, Tencent"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wenbin","family":"Yao","sequence":"additional","affiliation":[{"name":"WeChat, Tencent"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"7437","container-title":["Frontiers in Artificial Intelligence and Applications","ECAI 2025"],"original-title":[],"link":[{"URL":"https:\/\/ebooks.iospress.nl\/pdf\/doi\/10.3233\/FAIA250879","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T09:44:47Z","timestamp":1761126287000},"score":1,"resource":{"primary":{"URL":"https:\/\/ebooks.iospress.nl\/doi\/10.3233\/FAIA250879"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,21]]},"ISBN":["9781643686318"],"references-count":0,"URL":"https:\/\/doi.org\/10.3233\/faia250879","relation":{},"ISSN":["0922-6389","1879-8314"],"issn-type":[{"value":"0922-6389","type":"print"},{"value":"1879-8314","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,21]]}}}