{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,11]],"date-time":"2026-04-11T00:52:26Z","timestamp":1775868746760,"version":"3.50.1"},"reference-count":88,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,11,7]],"date-time":"2023-11-07T00:00:00Z","timestamp":1699315200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation (NSF) of China","doi-asserted-by":"crossref","award":["62276155, 62376140, 62206156, 62206157 and 62006142"],"award-info":[{"award-number":["62276155, 62376140, 62206156, 62206157 and 62006142"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100003091","name":"NSF of Shandong Province","doi-asserted-by":"crossref","award":["ZR2021MF040 and ZR2022QF047"],"award-info":[{"award-number":["ZR2021MF040 and ZR2022QF047"]}],"id":[{"id":"10.13039\/501100003091","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Key R&D Program of Shandong","award":["2022CXGC020107"],"award-info":[{"award-number":["2022CXGC020107"]}]},{"name":"Alibaba Group through Alibaba Innovative Research Program","award":["21169774"],"award-info":[{"award-number":["21169774"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Inf. Syst."],"published-print":{"date-parts":[[2024,3,31]]},"abstract":"<jats:p>Localizing a desired moment within an untrimmed video via a given natural language query, i.e., cross-modal moment localization, has attracted widespread research attention recently. 
However, it is a challenging task because it requires not only accurately understanding intra-modal semantic information, but also explicitly capturing inter-modal semantic correlations\u00a0(consistency and complementarity). Existing efforts mainly focus on intra-modal semantic understanding and inter-modal semantic alignment, while ignoring necessary semantic supplement. Consequently, we present a cross-modal semantic perception network for more effective intra-modal semantic understanding and inter-modal semantic collaboration. Concretely, we design a dual-path representation network for intra-modal semantic modeling. Meanwhile, we develop a semantic collaborative network to achieve multi-granularity semantic alignment and hierarchical semantic supplement. Thereby, effective moment localization can be achieved based on sufficient semantic collaborative learning. Extensive comparison experiments demonstrate the promising performance of our model compared with existing state-of-the-art competitors.<\/jats:p>","DOI":"10.1145\/3620669","type":"journal-article","created":{"date-parts":[[2023,9,7]],"date-time":"2023-09-07T11:08:09Z","timestamp":1694084889000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["Semantic Collaborative Learning for Cross-Modal Moment Localization"],"prefix":"10.1145","volume":"42","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5653-8286","authenticated-orcid":false,"given":"Yupeng","family":"Hu","sequence":"first","affiliation":[{"name":"Shandong University, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-4856-8806","authenticated-orcid":false,"given":"Kun","family":"Wang","sequence":"additional","affiliation":[{"name":"Shandong University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1582-5764","authenticated-orcid":false,"given":"Meng","family":"Liu","sequence":"additional","affiliation":[{"name":"Shandong Jianzhu University, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1344-2513","authenticated-orcid":false,"given":"Haoyu","family":"Tang","sequence":"additional","affiliation":[{"name":"Shandong University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1476-0273","authenticated-orcid":false,"given":"Liqiang","family":"Nie","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology (Shenzhen), China"}]}],"member":"320","published-online":{"date-parts":[[2023,11,7]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.618"},{"key":"e_1_3_2_3_2","first-page":"4462","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Bojanowski Piotr","year":"2015","unstructured":"Piotr Bojanowski, R\u00e9mi Lajugie, Edouard Grave, Francis Bach, Ivan Laptev, Jean Ponce, and Cordelia Schmid. 2015. Weakly-supervised alignment of video with text. In Proceedings of the IEEE International Conference on Computer Vision. 4462\u20134470."},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_5_2","first-page":"1072","volume-title":"Proceedings of the American Association for Artificial Intelligence","author":"Chen Qingchao","year":"2021","unstructured":"Qingchao Chen, Yang Liu, and Samuel Albanie. 2021. Mind-the-gap! Unsupervised domain adaptation for text-video retrieval. In Proceedings of the American Association for Artificial Intelligence. 1072\u20131080."},{"key":"e_1_3_2_6_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3406109","article-title":"Fine-grained privacy detection with graph-regularized hierarchical attentive representation learning","volume":"38","author":"Chen Xiaolin","year":"2020","unstructured":"Xiaolin Chen, Xuemeng Song, Ruiyang Ren, Lei Zhu, Zhiyong Cheng, and Liqiang Nie. 2020. Fine-grained privacy detection with graph-regularized hierarchical attentive representation learning. 
ACM Transactions on Information Systems 38, 4 (2020), 1\u201326.","journal-title":"ACM Transactions on Information Systems"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3490477"},{"key":"e_1_3_2_8_2","first-page":"4171","volume-title":"Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4171\u20134186."},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2018.2832602"},{"key":"e_1_3_2_10_2","first-page":"1523","volume-title":"Proceedings of the World Wide Web Conference","author":"Feng Fuli","year":"2018","unstructured":"Fuli Feng, Xiangnan He, Yiqun Liu, Liqiang Nie, and Tat-Seng Chua. 2018. Learning on partial-order hypergraphs. In Proceedings of the World Wide Web Conference. 1523\u20131532."},{"key":"e_1_3_2_11_2","first-page":"455","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval","author":"Feng Fuli","year":"2017","unstructured":"Fuli Feng, Liqiang Nie, Xiang Wang, Richang Hong, and Tat-Seng Chua. 2017. Computational social indicators: A case study of Chinese university ranking. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval. 
455\u2013464."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.563"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00155"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.392"},{"key":"e_1_3_2_15_2","first-page":"1","article-title":"Learning to respond with your favorite stickers: A framework of unifying multi-modality and user preference in multi-turn dialog","volume":"39","author":"Gao Shen","year":"2021","unstructured":"Shen Gao, Xiuying Chen, Li Liu, Dongyan Zhao, and Rui Yan. 2021. Learning to respond with your favorite stickers: A framework of unifying multi-modality and user preference in multi-turn dialog. ACM Transactions on Information Systems 39, 2 (2021), 1\u201332.","journal-title":"ACM Transactions on Information Systems"},{"key":"e_1_3_2_16_2","first-page":"9819","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Gong Guoqiang","year":"2020","unstructured":"Guoqiang Gong, Xinghan Wang, Yadong Mu, and Qi Tian. 2020. Learning temporal co-attention models for unsupervised video action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9819\u20139828."},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3486250"},{"key":"e_1_3_2_18_2","first-page":"1","article-title":"Enhancing factorization machines with generalized metric learning","author":"Guo Yangyang","year":"2022","unstructured":"Yangyang Guo, Zhiyong Cheng, Jiazheng Jing, Yanpeng Lin, Liqiang Nie, and Meng Wang. 2022. Enhancing factorization machines with generalized metric learning. 
IEEE Transactions on Knowledge and Data Engineering 34, 8 (2022), 1\u201315.","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3331184.3331186"},{"key":"e_1_3_2_20_2","first-page":"3826","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Han Ning","year":"2021","unstructured":"Ning Han, Jingjing Chen, Guangyi Xiao, Hao Zhang, Yawen Zeng, and Hao Chen. 2021. Fine-grained cross-modal alignment network for text-video retrieval. In Proceedings of the ACM International Conference on Multimedia. 3826\u20133834."},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2019.2936742"},{"key":"e_1_3_2_22_2","first-page":"4528","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Han Yudong","year":"2021","unstructured":"Yudong Han, Yangyang Guo, Jianhua Yin, Meng Liu, Yupeng Hu, and Liqiang Nie. 2021. Focal and composed vision-semantic modeling for visual question answering. In Proceedings of the ACM International Conference on Multimedia. 4528\u20134536."},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1168"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.493"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3073867"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3090521"},{"key":"e_1_3_2_27_2","first-page":"7404","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Huang Haoshuo","year":"2019","unstructured":"Haoshuo Huang, Vihan Jain, Harsh Mehta, Alexander Ku, Gabriel Magalhaes, Jason Baldridge, and Eugene Ie. 2019. Transferable representation learning in vision-and-language navigation. In Proceedings of the IEEE International Conference on Computer Vision. 
7404\u20137413."},{"key":"e_1_3_2_28_2","first-page":"1114","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval","author":"Jin Weike","year":"2021","unstructured":"Weike Jin, Zhou Zhao, Pengcheng Zhang, Jieming Zhu, Xiuqiang He, and Yueting Zhuang. 2021. Hierarchical cross-modal graph consistency learning for video-text retrieval. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval. 1114\u20131124."},{"key":"e_1_3_2_29_2","first-page":"1","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Kingma Diederik P.","year":"2015","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations. 1\u201315."},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3480967"},{"key":"e_1_3_2_31_2","first-page":"1902","volume-title":"Proceedings of the American Association for Artificial Intelligence","author":"Li Kun","year":"2021","unstructured":"Kun Li, Dan Guo, and Meng Wang. 2021. Proposal-free video grounding with contextual pyramid network. In Proceedings of the American Association for Artificial Intelligence. 1902\u20131910."},{"key":"e_1_3_2_32_2","first-page":"8553","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Li Yanwei","year":"2020","unstructured":"Yanwei Li, Lin Song, Yukang Chen, Zeming Li, Xiangyu Zhang, Xingang Wang, and Jian Sun. 2020. Learning dynamic routing for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 
8553\u20138562."},{"key":"e_1_3_2_33_2","first-page":"2657","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Lin Dahua","year":"2014","unstructured":"Dahua Lin, Sanja Fidler, Chen Kong, and Raquel Urtasun. 2014. Visual semantic search: Retrieving videos via complex textual queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2657\u20132664."},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.2965987"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00139"},{"key":"e_1_3_2_36_2","first-page":"1","article-title":"An attribute-aware attentive GCN model for attribute missing in recommendation","author":"Liu Fan","year":"2022","unstructured":"Fan Liu, Zhiyong Cheng, Chenghao Liu, and Liqiang Nie. 2022. An attribute-aware attentive GCN model for attribute missing in recommendation. IEEE Transactions on Knowledge and Data Engineering 34, 9 (2022), 1\u201312.","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"e_1_3_2_37_2","first-page":"1526","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Liu Fan","year":"2019","unstructured":"Fan Liu, Zhiyong Cheng, Changchang Sun, Yinglong Wang, Liqiang Nie, and Mohan Kankanhalli. 2019. User diverse preference modeling by multimodal attentive metric learning. In Proceedings of the ACM International Conference on Multimedia. 1526\u20131534."},{"key":"e_1_3_2_38_2","first-page":"1296","volume-title":"Proceedings of the International Conference on World Wide Web","author":"Liu Fan","year":"2021","unstructured":"Fan Liu, Zhiyong Cheng, Lei Zhu, Zan Gao, and Liqiang Nie. 2021. Interest-aware message-passing GCN for recommendation. In Proceedings of the International Conference on World Wide Web. 
1296\u20131305."},{"key":"e_1_3_2_39_2","first-page":"970","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Liu Meng","year":"2017","unstructured":"Meng Liu, Liqiang Nie, Meng Wang, and Baoquan Chen. 2017. Towards micro-video understanding by joint sequential-sparse modeling. In Proceedings of the ACM International Conference on Multimedia. 970\u2013978."},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2875363"},{"key":"e_1_3_2_41_2","first-page":"1","article-title":"A survey on video moment localization","volume":"55","author":"Liu Meng","year":"2023","unstructured":"Meng Liu, Liqiang Nie, Yunxiao Wang, Meng Wang, and Yong Rui. 2023. A survey on video moment localization. ACM Computing Surveys 55, 9 (2023), 1\u201337.","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.3026625"},{"key":"e_1_3_2_43_2","first-page":"15","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval","author":"Liu Meng","year":"2018","unstructured":"Meng Liu, Xiang Wang, Liqiang Nie, Xiangnan He, Baoquan Chen, and Tat-Seng Chua. 2018. Attentive moment retrieval in videos. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval. 15\u201324."},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240549"},{"key":"e_1_3_2_45_2","first-page":"14954","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Liu Yang","year":"2021","unstructured":"Yang Liu, Qingchao Chen, and Samuel Albanie. 2021. Adaptive cross-modal prototypes for cross-domain visual-language retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 
14954\u201314964."},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01186"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01082"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00055"},{"key":"e_1_3_2_49_2","first-page":"1047","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Qu Leigang","year":"2020","unstructured":"Leigang Qu, Meng Liu, Da Cao, Liqiang Nie, and Qi Tian. 2020. Context-aware multi-view summarization network for image-text matching. In Proceedings of the ACM International Conference on Multimedia. 1047\u20131055."},{"key":"e_1_3_2_50_2","first-page":"1104","volume-title":"Proceedings of the International ACM Conference on Research and Development in Information Retrieval","author":"Qu Leigang","year":"2021","unstructured":"Leigang Qu, Meng Liu, Jianlong Wu, Zan Gao, and Liqiang Nie. 2021. Dynamic modality interaction modeling for image-text retrieval. In Proceedings of the International ACM Conference on Research and Development in Information Retrieval. 1104\u20131113."},{"key":"e_1_3_2_51_2","first-page":"706","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Krishna Ranjay","year":"2017","unstructured":"Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision. 706\u2013715."},{"key":"e_1_3_2_52_2","first-page":"615","volume-title":"Proceedings of the IEEE Winter Conference on Applications of Computer Vision","author":"Rashid Maheen","year":"2020","unstructured":"Maheen Rashid, Hedvig Kjellstrom, and Yong Jae Lee. 2020. Action graphs: Weakly-supervised action localization with graph convolution networks. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision. 
615\u2013624."},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00207"},{"key":"e_1_3_2_54_2","article-title":"Faster R-CNN: Towards real-time object detection with region proposal networks","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems.","journal-title":"Proceedings of the 28th International Conference on Neural Information Processing Systems"},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33718-5_11"},{"key":"e_1_3_2_56_2","first-page":"1049","volume-title":"Proceedings of the International Conference on Computer Vision and Pattern Recognition","author":"Shou Zheng","year":"2016","unstructured":"Zheng Shou, Dongang Wang, and Shih-Fu Chang. 2016. Temporal action localization in untrimmed videos via multi-stage CNNs. In Proceedings of the International Conference on Computer Vision and Pattern Recognition. 1049\u20131058."},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46448-0_31"},{"key":"e_1_3_2_58_2","first-page":"5","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval","author":"Song Xuemeng","year":"2018","unstructured":"Xuemeng Song, Fuli Feng, Xianjing Han, Xin Yang, Wei Liu, and Liqiang Nie. 2018. Neural compatibility modeling with attentive knowledge distillation. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval. 5\u201314."},{"key":"e_1_3_2_59_2","first-page":"4073","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Su Tianyu","year":"2021","unstructured":"Tianyu Su, Xuemeng Song, Na Zheng, Weili Guan, Yan Li, and Liqiang Nie. 2021. 
Complementary factorization towards outfit compatibility modeling. In Proceedings of the ACM International Conference on Multimedia. 4073\u20134081."},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2021.108027"},{"key":"e_1_3_2_61_2","article-title":"Frame-wise Cross-modal Matching for Video Moment Retrieval","volume":"24","author":"Tang Haoyu","year":"2022","unstructured":"Haoyu Tang, Jihua Zhu, Meng Liu, Zan Gao, and Zhiyong Cheng. 2022. Frame-wise Cross-modal Matching for Video Moment Retrieval. IEEE Transactions on Multimedia 24, 1 (2022), 1338\u20131349.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_2_62_2","article-title":"Multi-level query interaction for temporal language grounding","volume":"1","author":"Tang Haoyu","year":"2022","unstructured":"Haoyu Tang, Jihua Zhu, Lin Wang, Qinghai Zheng, and Tianwei Zhang. 2022. Multi-level query interaction for temporal language grounding. IEEE Transactions on Intelligent Transportation Systems 1, 12 (2022), 25479\u201325488.","journal-title":"IEEE Transactions on Intelligent Transportation Systems"},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_64_2","doi-asserted-by":"publisher","DOI":"10.1145\/3594633"},{"key":"e_1_3_2_65_2","first-page":"4116","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Wang Hao","year":"2020","unstructured":"Hao Wang, Zheng-Jun Zha, Xuejin Chen, Zhiwei Xiong, and Jiebo Luo. 2020. Dual path interaction network for video moment localization. In Proceedings of the ACM International Conference on Multimedia. 
4116\u20134124."},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00695"},{"key":"e_1_3_2_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00679"},{"key":"e_1_3_2_68_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3052774","article-title":"Unifying virtual and physical worlds: Learning toward local and global consistency","volume":"36","author":"Wang Xiang","year":"2017","unstructured":"Xiang Wang, Liqiang Nie, Xuemeng Song, Dongxiang Zhang, and Tat-Seng Chua. 2017. Unifying virtual and physical worlds: Learning toward local and global consistency. ACM Transactions on Information Systems 36, 1 (2017), 1\u201326.","journal-title":"ACM Transactions on Information Systems"},{"key":"e_1_3_2_69_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2019.2923608"},{"key":"e_1_3_2_70_2","first-page":"1","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Xu Bing","year":"2015","unstructured":"Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. 2015. Empirical evaluation of rectified activations in convolutional network. In Proceedings of the International Conference on Machine Learning. 1\u20135."},{"key":"e_1_3_2_71_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.617"},{"key":"e_1_3_2_72_2","first-page":"9062","volume-title":"Proceedings of the American Association for Artificial Intelligence","author":"Xu Huijuan","year":"2019","unstructured":"Huijuan Xu, Kun He, Bryan A. Plummer, Leonid Sigal, Stan Sclaroff, and Kate Saenko. 2019. Multilevel language and vision integration for text-to-clip retrieval. In Proceedings of the American Association for Artificial Intelligence. 
9062\u20139069."},{"key":"e_1_3_2_73_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3070200"},{"key":"e_1_3_2_74_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2914889"},{"key":"e_1_3_2_75_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2019.2920652"},{"key":"e_1_3_2_76_2","first-page":"1339","volume-title":"Proceedings of the ACM International Conference on Research and Development in Information Retrieval","author":"Yang Xun","year":"2021","unstructured":"Xun Yang, Jianfeng Dong, Yixin Cao, Xun Wang, Meng Wang, and Tat-Seng Chua. 2021. Tree-augmented cross-modal encoding for complex-query video retrieval. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval. 1339\u20131348."},{"key":"e_1_3_2_77_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2021.11.035"},{"key":"e_1_3_2_78_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2018.2877127"},{"key":"e_1_3_2_79_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.347"},{"key":"e_1_3_2_80_2","first-page":"9159","volume-title":"Proceedings of the American Association for Artificial Intelligence","author":"Yuan Yitian","year":"2019","unstructured":"Yitian Yuan, Tao Mei, and Wenwu Zhu. 2019. To find where you talk: Temporal sentence localization in video with attention based location regression. In Proceedings of the American Association for Artificial Intelligence. 9159\u20139166."},{"key":"e_1_3_2_81_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01030"},{"key":"e_1_3_2_82_2","first-page":"1328","volume-title":"Proceedings of the International Conference on Web Search and Data Mining","author":"Zhan Jingtao","year":"2022","unstructured":"Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. 2022. Learning discrete representations via constrained clustering for effective and efficient dense retrieval. In Proceedings of the International Conference on Web Search and Data Mining. 
1328\u20131336."},{"key":"e_1_3_2_83_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00134"},{"key":"e_1_3_2_84_2","first-page":"6543","volume-title":"Proceedings of the Association for Computational Linguistics","author":"Zhang Hao","year":"2020","unstructured":"Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. 2020. Span-based localizing network for natural language video localization. In Proceedings of the Association for Computational Linguistics. 6543\u20136554."},{"key":"e_1_3_2_85_2","first-page":"12870","volume-title":"Proceedings of the American Association for Artificial Intelligence","author":"Zhang Songyang","year":"2020","unstructured":"Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo. 2020. Learning 2D temporal adjacent networks for moment localization with natural language. In Proceedings of the American Association for Artificial Intelligence. 12870\u201312877."},{"key":"e_1_3_2_86_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3113791"},{"key":"e_1_3_2_87_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2020.3023339"},{"key":"e_1_3_2_88_2","doi-asserted-by":"publisher","DOI":"10.1145\/3476107"},{"key":"e_1_3_2_89_2","first-page":"655","volume-title":"Proceedings of the International Conference on Research and Development in Information Retrieval","author":"Zhang Zhu","year":"2019","unstructured":"Zhu Zhang, Zhijie Lin, Zhou Zhao, and Zhenxin Xiao. 2019. Cross-modal interaction networks for query-based moment retrieval in videos. In Proceedings of the International Conference on Research and Development in Information Retrieval. 
655\u2013664."}],"container-title":["ACM Transactions on Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3620669","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3620669","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:03:43Z","timestamp":1750291423000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3620669"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,11,7]]},"references-count":88,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,3,31]]}},"alternative-id":["10.1145\/3620669"],"URL":"https:\/\/doi.org\/10.1145\/3620669","relation":{},"ISSN":["1046-8188","1558-2868"],"issn-type":[{"value":"1046-8188","type":"print"},{"value":"1558-2868","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,11,7]]},"assertion":[{"value":"2022-05-16","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-08-19","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-11-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}