{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,6]],"date-time":"2025-06-06T04:32:05Z","timestamp":1749184325138},"publisher-location":"California","reference-count":0,"publisher":"International Joint Conferences on Artificial Intelligence Organization","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,8]]},"abstract":"<jats:p>Multi-modal cues presented in videos are usually beneficial for the challenging video-text retrieval task on internet-scale datasets. Recent video retrieval methods take advantage of multi-modal cues by aggregating them to holistic high-level semantics for matching with text representations in a global view. In contrast to this global alignment, the local alignment of detailed semantics encoded within both multi-modal cues and distinct phrases is still not well conducted. Thus, in this paper, we leverage the hierarchical video-text alignment to fully explore the detailed diverse characteristics in multi-modal cues for fine-grained alignment with local semantics from phrases, as well as to capture a high-level semantic correspondence. Specifically, multi-step attention is learned for progressively comprehensive local alignment and a holistic transformer is utilized to summarize multi-modal cues for global alignment. With hierarchical alignment, our model outperforms state-of-the-art methods on three public video retrieval datasets.<\/jats:p>","DOI":"10.24963\/ijcai.2021\/154","type":"proceedings-article","created":{"date-parts":[[2021,8,11]],"date-time":"2021-08-11T11:00:49Z","timestamp":1628679649000},"page":"1113-1121","source":"Crossref","is-referenced-by-count":12,"title":["Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment"],"prefix":"10.24963","author":[{"given":"Wenzhe","family":"Wang","sequence":"first","affiliation":[{"name":"Zhejiang University"}]},{"given":"Mengdan","family":"Zhang","sequence":"additional","affiliation":[{"name":"Youtu, Tencent"}]},{"given":"Runnan","family":"Chen","sequence":"additional","affiliation":[{"name":"The University of Hong Kong"}]},{"given":"Guanyu","family":"Cai","sequence":"additional","affiliation":[{"name":"Tongji University"}]},{"given":"Penghao","family":"Zhou","sequence":"additional","affiliation":[{"name":"Youtu, Tencent"}]},{"given":"Pai","family":"Peng","sequence":"additional","affiliation":[{"name":"Youtu, Tencent"}]},{"given":"Xiaowei","family":"Guo","sequence":"additional","affiliation":[{"name":"Youtu, Tencent"}]},{"given":"Jian","family":"Wu","sequence":"additional","affiliation":[{"name":"Zhejiang University"}]},{"given":"Xing","family":"Sun","sequence":"additional","affiliation":[{"name":"Youtu, Tencent"}]}],"member":"10584","event":{"number":"30","sponsor":["International Joint Conferences on Artificial Intelligence Organization (IJCAI)"],"acronym":"IJCAI-2021","name":"Thirtieth International Joint Conference on Artificial Intelligence {IJCAI-21}","start":{"date-parts":[[2021,8,19]]},"theme":"Artificial Intelligence","location":"Montreal, Canada","end":{"date-parts":[[2021,8,27]]}},"container-title":["Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence"],"original-title":[],"deposited":{"date-parts":[[2021,8,11]],"date-time":"2021-08-11T11:01:40Z","timestamp":1628679700000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.ijcai.org\/proceedings\/2021\/154"}},"subtitle":[],"proceedings-subject":"Artificial Intelligence Research Articles","short-title":[],"issued":{"date-parts":[[2021,8]]},"references-count":0,"URL":"https:\/\/doi.org\/10.24963\/ijcai.2021\/154","relation":{},"subject":[],"published":{"date-parts":[[2021,8]]}}}