{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T23:05:25Z","timestamp":1768345525422,"version":"3.49.0"},"reference-count":94,"publisher":"Association for Computing Machinery (ACM)","issue":"2s","license":[{"start":{"date-parts":[[2022,6,30]],"date-time":"2022-06-30T00:00:00Z","timestamp":1656547200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Konica Minolta research funding"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2022,6,30]]},"abstract":"<jats:p>As a bridge between videos and natural languages, video-text matching has been a hot multimedia research topic in recent years. Such cross-modal retrieval is usually achieved by learning a common embedding space where videos and text captions are directly comparable. It is still challenging because existing visual representations cannot exploit semantic correlations within videos well, resulting in a mismatch with semantic concepts that are contained in the corresponding text descriptions. In this article, we propose a new Guided Graph Attention Learning (GGAL) model to enhance video embedding learning by capturing important region-level semantic concepts within the spatiotemporal space. Our model builds connections between object regions and performs hierarchical graph reasoning on both frame-level and whole video\u2013level region graphs. During this process, global context is used to guide attention learning on this hierarchical graph topology so that the learned overall video embedding can focus on essential semantic concepts and can be better aligned with text captions. 
Experiments on commonly used benchmarks validate that GGAL outperforms many recent video-text retrieval methods with a clear margin. As multimedia data in dynamic environments becomes critically important, we also validate GGAL learned video-text representations that can be generalized well to unseen out-of-domain data via cross-dataset evaluations. To further investigate the interpretability of our model, we visualize attention weights learned by GGAL models. We find that GGAL successfully focuses on key semantic concepts in the video and has complementary attention on the context parts based on different ways of building region graphs.<\/jats:p>","DOI":"10.1145\/3538533","type":"journal-article","created":{"date-parts":[[2022,9,9]],"date-time":"2022-09-09T14:09:23Z","timestamp":1662732563000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Guided Graph Attention Learning for Video-Text Matching"],"prefix":"10.1145","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5805-793X","authenticated-orcid":false,"given":"Kunpeng","family":"Li","sequence":"first","affiliation":[{"name":"Northeastern University, Boston, Massachusetts, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0219-4748","authenticated-orcid":false,"given":"Chang","family":"Liu","sequence":"additional","affiliation":[{"name":"Northeastern University, Boston, Massachusetts, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1418-2437","authenticated-orcid":false,"given":"Mike","family":"Stopa","sequence":"additional","affiliation":[{"name":"Konica Minolta, San Mateo, California, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9653-794X","authenticated-orcid":false,"given":"Jun","family":"Amano","sequence":"additional","affiliation":[{"name":"Konica 
Minolta, San Mateo, California, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5098-2853","authenticated-orcid":false,"given":"Yun","family":"Fu","sequence":"additional","affiliation":[{"name":"Northeastern University, Boston, Massachusetts, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,1,6]]},"reference":[{"key":"e_1_3_1_2_2","first-page":"9198","article-title":"Watch your step: Learning node embeddings via graph attention","author":"Abu-El-Haija Sami","year":"2018","unstructured":"Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alexander A. Alemi. 2018. Watch your step: Learning node embeddings via graph attention. Advances in Neural Information Processing Systems (NeurIPS\u201918) (2018), 9198\u20139208.","journal-title":"Advances in Neural Information Processing Systems (NeurIPS\u201918)"},{"key":"e_1_3_1_3_2","doi-asserted-by":"crossref","first-page":"6077","DOI":"10.1109\/CVPR.2018.00636","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201918)","author":"Anderson Peter","year":"2018","unstructured":"Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201918). 6077\u20136086."},{"key":"e_1_3_1_4_2","first-page":"858","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201917)","author":"Bertasius Gedas","year":"2017","unstructured":"Gedas Bertasius, Lorenzo Torresani, Stella X. Yu, and Jianbo Shi. 2017. Convolutional random walk networks for semantic image segmentation. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201917). 
858\u2013866."},{"key":"e_1_3_1_5_2","first-page":"1989","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919)","author":"Cadene Remi","year":"2019","unstructured":"Remi Cadene, Hedi Ben-Younes, Matthieu Cord, and Nicolas Thome. 2019. Murel: Multimodal relational reasoning for visual question answering. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919). 1989\u20131998."},{"key":"e_1_3_1_6_2","first-page":"2956","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201915)","author":"Cao Chunshui","year":"2015","unstructured":"Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, et\u00a0al. 2015. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201915). 2956\u20132964."},{"key":"e_1_3_1_7_2","first-page":"6299","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201917)","author":"Carreira Joao","year":"2017","unstructured":"Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201917). 6299\u20136308."},{"key":"e_1_3_1_8_2","first-page":"5103","volume-title":"IEEE\/CVF International Conference on Computer Vision (ICCV\u201917)","author":"Chandra Siddhartha","year":"2017","unstructured":"Siddhartha Chandra, Nicolas Usunier, and Iasonas Kokkinos. 2017. Dense and low-rank Gaussian CRFs using deep embeddings. In IEEE\/CVF International Conference on Computer Vision (ICCV\u201917). 5103\u20135112."},{"key":"e_1_3_1_9_2","first-page":"190","volume-title":"Annual Meeting of the Association for Computational Linguistics (ACL\u201911)","author":"Chen David L.","year":"2011","unstructured":"David L. 
Chen and William B. Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Annual Meeting of the Association for Computational Linguistics (ACL\u201911). 190\u2013200."},{"key":"e_1_3_1_10_2","doi-asserted-by":"crossref","first-page":"3073","DOI":"10.1109\/TMM.2020.3019710","article-title":"Interclass-relativity-adaptive metric learning for cross-modal matching and beyond","volume":"23","author":"Chen Feiyu","year":"2020","unstructured":"Feiyu Chen, Jie Shao, Yonghui Zhang, Xing Xu, and Heng Tao Shen. 2020. Interclass-relativity-adaptive metric learning for cross-modal matching and beyond. IEEE Transactions on Multimedia 23 (2020), 3073\u20133084.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_1_11_2","first-page":"15789","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201921)","author":"Chen Jiacheng","year":"2021","unstructured":"Jiacheng Chen, Hexiang Hu, Hao Wu, Yuning Jiang, and Changhu Wang. 2021. Learning the best pooling strategy for visual semantic embedding. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201921). 15789\u201315798."},{"key":"e_1_3_1_12_2","first-page":"1072","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Chen Qingchao","year":"2021","unstructured":"Qingchao Chen and Samuel Albanie. 2021. Mind-the-Gap! Unsupervised domain adaptation for text-video retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence. 1072\u20131080."},{"key":"e_1_3_1_13_2","first-page":"10638","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920)","author":"Chen Shizhe","year":"2020","unstructured":"Shizhe Chen, Yida Zhao, Qin Jin, and Qi Wu. 2020. Fine-grained video-text retrieval with hierarchical graph reasoning. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920). 
10638\u201310647."},{"key":"e_1_3_1_14_2","first-page":"1724","volume-title":"Conference on Empirical Methods in Natural Language Processing (EMNLP\u201914)","author":"Cho Kyunghyun","year":"2014","unstructured":"Kyunghyun Cho, Bart Van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP\u201914). 1724\u20131734."},{"key":"e_1_3_1_15_2","first-page":"2388","volume-title":"IEEE International Conference on Image Processing (ICIP\u201921)","author":"Choo Sungkwon","year":"2021","unstructured":"Sungkwon Choo, Seong Jong Ha, and Joonsoo Lee. 2021. Semantic-preserving metric learning for video-text retrieval. In IEEE International Conference on Image Processing (ICIP\u201921). IEEE, 2388\u20132392."},{"key":"e_1_3_1_16_2","first-page":"1","article-title":"Empirical evaluation of gated recurrent neural networks on sequence modeling","author":"Chung Junyoung","year":"2014","unstructured":"Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv (2014), 1\u20139.","journal-title":"arXiv"},{"key":"e_1_3_1_17_2","first-page":"4171","volume-title":"Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL\u201919)","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL\u201919). 
4171\u20134186."},{"key":"e_1_3_1_18_2","first-page":"9346","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919)","author":"Dong Jianfeng","year":"2019","unstructured":"Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, and Xun Wang. 2019. Dual encoding for zero-example video retrieval. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919). 9346\u20139355."},{"key":"e_1_3_1_19_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TPAMI.2021.3059295","article-title":"Dual encoding for video retrieval by text","author":"Dong Jianfeng","year":"2021","unstructured":"Jianfeng Dong, Xirong Li, Chaoxi Xu, Xun Yang, Gang Yang, Xun Wang, and Meng Wang. 2021. Dual encoding for video retrieval by text. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021), 1\u20131.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_20_2","first-page":"1","volume-title":"British Machine Vision Conference (BMVC\u201918)","author":"Faghri Fartash","year":"2018","unstructured":"Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving visual-semantic embeddings with hard negatives. In British Machine Vision Conference (BMVC\u201918). 1\u201313."},{"key":"e_1_3_1_21_2","first-page":"1005","volume-title":"International Joint Conference on Artificial Intelligence (IJCAI 20)","author":"Feng Zerun","year":"2020","unstructured":"Zerun Feng, Zhimin Zeng, Caili Guo, and Zheng Li. 2020. Exploiting visual semantic reasoning for video-text retrieval. In International Joint Conference on Artificial Intelligence (IJCAI 20). 1005\u20131011."},{"key":"e_1_3_1_22_2","first-page":"1868","volume-title":"IEEE\/CVF International Conference on Computer Vision (ICCV\u201919) Workshops","author":"Francis Danny","year":"2019","unstructured":"Danny Francis, Phuong Anh Nguyen, Benoit Huet, and Chong-Wah Ngo. 2019. 
Fusion of multimodal embeddings for ad-hoc video search. In IEEE\/CVF International Conference on Computer Vision (ICCV\u201919) Workshops. 1868\u20131872."},{"key":"e_1_3_1_23_2","first-page":"214","volume-title":"European Conference on Computer Vision (ECCV\u201920)","author":"Gabeur Valentin","year":"2020","unstructured":"Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. 2020. Multi-modal transformer for video retrieval. In European Conference on Computer Vision (ECCV\u201920). 214\u2013229."},{"key":"e_1_3_1_24_2","first-page":"1","article-title":"CLIP2TV: An empirical study on transformer-based methods for video-text retrieval","author":"Gao Zijian","year":"2021","unstructured":"Zijian Gao, Jingyu Liu, Sheng Chen, Dedan Chang, Hao Zhang, and Jinwei Yuan. 2021. CLIP2TV: An empirical study on transformer-based methods for video-text retrieval. arXiv:2111.05610 (2021), 1\u201317.","journal-title":"arXiv:2111.05610"},{"key":"e_1_3_1_25_2","first-page":"9543","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201921)","author":"Guo Dongyan","year":"2021","unstructured":"Dongyan Guo, Yanyan Shao, Ying Cui, Zhenhua Wang, Liyan Zhang, and Chunhua Shen. 2021. Graph attention tracking. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201921). 9543\u20139552."},{"issue":"1","key":"e_1_3_1_26_2","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1016\/0167-2789(90)90087-6","article-title":"The symbol grounding problem","volume":"42","author":"Harnad Stevan","year":"1990","unstructured":"Stevan Harnad. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena 42, 1-3 (1990), 335\u2013346.","journal-title":"Physica D: Nonlinear Phenomena"},{"key":"e_1_3_1_27_2","first-page":"770","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201916)","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 
2016. Deep residual learning for image recognition. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201916). 770\u2013778."},{"issue":"1","key":"e_1_3_1_28_2","doi-asserted-by":"crossref","first-page":"69","DOI":"10.1016\/0004-3702(93)90015-4","article-title":"Interpretation as abduction","volume":"63","author":"Hobbs Jerry R.","year":"1993","unstructured":"Jerry R. Hobbs, Mark E. Stickel, and Paul Martin. 1993. Interpretation as abduction. Artificial Intelligence 63, 1-2 (1993), 69\u2013142.","journal-title":"Artificial Intelligence"},{"key":"e_1_3_1_29_2","first-page":"7132","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201918)","author":"Hu Jie","year":"2018","unstructured":"Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201918). 7132\u20137141."},{"issue":"2","key":"e_1_3_1_30_2","first-page":"1","article-title":"Multi-peak graph-based multi-instance learning for weakly supervised object detection","volume":"17","author":"Ji Ruyi","year":"2021","unstructured":"Ruyi Ji, Zeyu Liu, Libo Zhang, Jianwei Liu, Xin Zuo, Yanjun Wu, Chen Zhao, Haofeng Wang, and Lin Yang. 2021. Multi-peak graph-based multi-instance learning for weakly supervised object detection. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 2s (2021), 1\u201321.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"key":"e_1_3_1_31_2","first-page":"3128","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201915)","author":"Karpathy Andrej","year":"2015","unstructured":"Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201915). 
3128\u20133137."},{"issue":"5","key":"e_1_3_1_32_2","doi-asserted-by":"crossref","first-page":"509","DOI":"10.1177\/1073858413514136","article-title":"Bottom-up and top-down attention: Different processes and overlapping neural systems","volume":"20","author":"Katsuki Fumi","year":"2014","unstructured":"Fumi Katsuki and Christos Constantinidis. 2014. Bottom-up and top-down attention: Different processes and overlapping neural systems. The Neuroscientist 20, 5 (2014), 509\u2013521.","journal-title":"The Neuroscientist"},{"key":"e_1_3_1_33_2","first-page":"94","volume-title":"IEEE\/CVF International Conference on Computer Vision (ICCV\u201917)","author":"Kaufman Dotan","year":"2017","unstructured":"Dotan Kaufman, Gil Levi, Tal Hassner, and Lior Wolf. 2017. Temporal tessellation: A unified approach for video analysis. In IEEE\/CVF International Conference on Computer Vision (ICCV\u201917). 94\u2013104."},{"key":"e_1_3_1_34_2","first-page":"1","article-title":"Adam: A method for stochastic optimization","author":"Kingma Diederik P.","year":"2014","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv (2014), 1\u201315.","journal-title":"arXiv"},{"key":"e_1_3_1_35_2","first-page":"1","volume-title":"ICLR\u201917","author":"Kipf Thomas N.","year":"2017","unstructured":"Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In ICLR\u201917. 1\u201314."},{"key":"e_1_3_1_36_2","first-page":"1","article-title":"Unifying visual-semantic embeddings with multimodal neural language models","author":"Kiros Ryan","year":"2015","unstructured":"Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2015. Unifying visual-semantic embeddings with multimodal neural language models. 
Transactions of the Association for Computational Linguistics (2015), 1\u201313.","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"e_1_3_1_37_2","first-page":"706","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201917)","author":"Krishna Ranjay","year":"2017","unstructured":"Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201917). 706\u2013715."},{"issue":"1","key":"e_1_3_1_38_2","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1007\/s11263-016-0981-7","article-title":"Visual genome: Connecting language and vision using crowdsourced dense image annotations","volume":"123","author":"Krishna Ranjay","year":"2017","unstructured":"Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et\u00a0al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV 123, 1 (2017), 32\u201373.","journal-title":"IJCV"},{"key":"e_1_3_1_39_2","first-page":"529","volume-title":"Conference on Empirical Methods in Natural Language Processing (EMNLP\u201911)","author":"Lao Ni","year":"2011","unstructured":"Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Conference on Empirical Methods in Natural Language Processing (EMNLP\u201911). 529\u2013539."},{"key":"e_1_3_1_40_2","first-page":"201","volume-title":"European Conference on Computer Vision (ECCV\u201918)","author":"Lee Kuang-Huei","year":"2018","unstructured":"Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. In European Conference on Computer Vision (ECCV\u201918). 
201\u2013216."},{"key":"e_1_3_1_41_2","first-page":"7331","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201921)","author":"Lei Jie","year":"2021","unstructured":"Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, and Jingjing Liu. 2021. Less is more: ClipBERT for video-and-language learning via sparse sampling. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201921). 7331\u20137341."},{"key":"e_1_3_1_42_2","first-page":"12526","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920)","author":"Li Kunpeng","year":"2020","unstructured":"Kunpeng Li, Chen Fang, Zhaowen Wang, Seokhwan Kim, Hailin Jin, and Yun Fu. 2020. Screencast tutorial video understanding. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920). 12526\u201312535."},{"issue":"12","key":"e_1_3_1_43_2","first-page":"2996","article-title":"Guided attention inference network","volume":"42","author":"Li Kunpeng","year":"2019","unstructured":"Kunpeng Li, Ziyan Wu, Kuan-Chuan Peng, Jan Ernst, and Yun Fu. 2019. Guided attention inference network. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 12 (2019), 2996\u20133010.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_44_2","first-page":"4654","volume-title":"IEEE\/CVF International Conference on Computer Vision (ICCV\u201919)","author":"Li Kunpeng","year":"2019","unstructured":"Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. 2019. Visual semantic reasoning for image-text matching. In IEEE\/CVF International Conference on Computer Vision (ICCV\u201919). 4654\u20134662."},{"key":"e_1_3_1_45_2","first-page":"1","article-title":"Image-text embedding learning via visual and textual semantic reasoning","author":"Li Kunpeng","year":"2022","unstructured":"Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. 2022. 
Image-text embedding learning via visual and textual semantic reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022), 1\u201314.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_46_2","first-page":"1","volume-title":"British Machine Vision Conference (BMVC\u201919)","author":"Liu Yang","year":"2019","unstructured":"Yang Liu, Samuel Albanie, Arsha Nagrani, and Andrew Zisserman. 2019. Use what you have: Video retrieval using representations from collaborative experts. In British Machine Vision Conference (BMVC\u201919). 1\u201319."},{"key":"e_1_3_1_47_2","first-page":"1","article-title":"UniVL: A unified video and language pre-training model for multimodal understanding and generation","author":"Luo Huaishao","year":"2020","unstructured":"Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. 2020. UniVL: A unified video and language pre-training model for multimodal understanding and generation. arXiv:2002.06353 (2020), 1\u201316.","journal-title":"arXiv:2002.06353"},{"key":"e_1_3_1_48_2","first-page":"293","article-title":"Clip4clip: An empirical study of clip for end to end video clip retrieval","author":"Luo Huaishao","year":"2021","unstructured":"Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2021. Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv:2104.08860 (2021), 293\u2013304.","journal-title":"arXiv:2104.08860"},{"key":"e_1_3_1_49_2","first-page":"9879","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920)","author":"Miech Antoine","year":"2020","unstructured":"Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. 2020. End-to-end learning of visual representations from uncurated instructional videos. 
In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920). 9879\u20139889."},{"key":"e_1_3_1_50_2","first-page":"2630","volume-title":"IEEE\/CVF International Conference on Computer Vision (ICCV\u201919)","author":"Miech Antoine","year":"2019","unstructured":"Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In IEEE\/CVF International Conference on Computer Vision (ICCV\u201919). 2630\u20132640."},{"key":"e_1_3_1_51_2","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1145\/3206025.3206064","volume-title":"ACM International Conference on Multimedia Retrieval (ICMR\u201918)","author":"Mithun Niluthpol Chowdhury","year":"2018","unstructured":"Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, and Amit K. Roy-Chowdhury. 2018. Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In ACM International Conference on Multimedia Retrieval (ICMR\u201918). 19\u201327."},{"issue":"2","key":"e_1_3_1_52_2","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1207\/s15516709cog0402_2","article-title":"Physical symbol systems","volume":"4","author":"Newell Allen","year":"1980","unstructured":"Allen Newell. 1980. Physical symbol systems. Cognitive Science 4, 2 (1980), 135\u2013183.","journal-title":"Cognitive Science"},{"key":"e_1_3_1_53_2","first-page":"1","volume-title":"Advances in Neural Information Processing Systems (NeurIPS\u201918)","author":"Norcliffe-Brown Will","year":"2018","unstructured":"Will Norcliffe-Brown, Stathis Vafeias, and Sarah Parisot. 2018. Learning conditioned graph structures for interpretable visual question answering. In Advances in Neural Information Processing Systems (NeurIPS\u201918), Vol. 31. 
1\u201310."},{"key":"e_1_3_1_54_2","first-page":"10870","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920)","author":"Pan Boxiao","year":"2020","unstructured":"Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, and Juan Carlos Niebles. 2020. Spatio-temporal graph for video captioning with knowledge distillation. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920). 10870\u201310879."},{"key":"e_1_3_1_55_2","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1007\/978-3-030-77004-4_1","volume-title":"Mexican Conference on Pattern Recognition (MCPR\u201921)","author":"Portillo-Quintero Jes\u00fas Andr\u00e9s","year":"2021","unstructured":"Jes\u00fas Andr\u00e9s Portillo-Quintero, Jos\u00e9 Carlos Ortiz-Bayliss, and Hugo Terashima-Mar\u00edn. 2021. A straightforward framework for video retrieval using CLIP. In Mexican Conference on Pattern Recognition (MCPR\u201921). 3\u201312."},{"key":"e_1_3_1_56_2","doi-asserted-by":"crossref","first-page":"2989","DOI":"10.1109\/TIP.2020.3048680","article-title":"Semantics-aware spatial-temporal binaries for cross-modal video retrieval","volume":"30","author":"Qi Mengshi","year":"2021","unstructured":"Mengshi Qi, Jie Qin, Yi Yang, Yunhong Wang, and Jiebo Luo. 2021. Semantics-aware spatial-temporal binaries for cross-modal video retrieval. IEEE Transactions on Image Processing 30 (2021), 2989\u20133004.","journal-title":"IEEE Transactions on Image Processing"},{"key":"e_1_3_1_57_2","first-page":"8748","volume-title":"International Conference on Machine Learning (ICML\u201921)","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et\u00a0al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML\u201921). 
8748\u20138763."},{"key":"e_1_3_1_58_2","first-page":"1","volume-title":"Advances in Neural Information Processing Systems (NeurIPS\u201915)","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS\u201915), Vol. 28. 1\u20139."},{"key":"e_1_3_1_59_2","first-page":"2039","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919)","author":"Schwartz Idan","year":"2019","unstructured":"Idan Schwartz, Seunghak Yu, Tamir Hazan, and Alexander G. Schwing. 2019. Factor graph attention. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919). 2039\u20132048."},{"key":"e_1_3_1_60_2","first-page":"618","volume-title":"IEEE\/CVF International Conference on Computer Vision (ICCV\u201917)","author":"Selvaraju Ramprasaath R.","year":"2017","unstructured":"Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In IEEE\/CVF International Conference on Computer Vision (ICCV\u201917). 618\u2013626."},{"key":"e_1_3_1_61_2","first-page":"1","volume-title":"ICLR Workshop\u201914","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR Workshop\u201914. 1\u20138."},{"key":"e_1_3_1_62_2","first-page":"2914","article-title":"Spatial-temporal graphs for cross-modal text2video retrieval","author":"Song Xue","year":"2021","unstructured":"Xue Song, Jingjing Chen, Zuxuan Wu, and Yu-Gang Jiang. 2021. Spatial-temporal graphs for cross-modal text2video retrieval. 
IEEE Transactions on Multimedia (2021), 2914\u20132923.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_1_63_2","first-page":"1979","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919)","author":"Song Yale","year":"2019","unstructured":"Yale Song and Mohammad Soleymani. 2019. Polysemous visual-semantic embedding for cross-modal retrieval. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919). 1979\u20131988."},{"key":"e_1_3_1_64_2","first-page":"1","article-title":"Learning language-visual embedding for movie understanding with natural-language","author":"Torabi Atousa","year":"2016","unstructured":"Atousa Torabi, Niket Tandon, and Leonid Sigal. 2016. Learning language-visual embedding for movie understanding with natural-language. arXiv (2016), 1\u201313.","journal-title":"arXiv"},{"key":"e_1_3_1_65_2","first-page":"4489","volume-title":"IEEE\/CVF International Conference on Computer Vision (ICCV\u201915)","author":"Tran Du","year":"2015","unstructured":"Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In IEEE\/CVF International Conference on Computer Vision (ICCV\u201915). 4489\u20134497."},{"key":"e_1_3_1_66_2","first-page":"1","volume-title":"Advances in Neural Information Processing Systems (NeurIPS\u201917)","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS\u201917), Vol. 30. 1\u20139."},{"key":"e_1_3_1_67_2","first-page":"1","volume-title":"ICLR\u201918","author":"Veli\u010dkovi\u0107 Petar","year":"2018","unstructured":"Petar Veli\u010dkovi\u0107, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. 
Graph attention networks. In ICLR\u201918. 1\u201312."},{"key":"e_1_3_1_68_2","first-page":"4534","volume-title":"IEEE\/CVF International Conference on Computer Vision (ICCV\u201915)","author":"Venugopalan Subhashini","year":"2015","unstructured":"Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence-video to text. In IEEE\/CVF International Conference on Computer Vision (ICCV\u201915). 4534\u20134542."},{"key":"e_1_3_1_69_2","first-page":"10296","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919)","author":"Wang Lei","year":"2019","unstructured":"Lei Wang, Yuchun Huang, Yaolin Hou, Shenman Zhang, and Jie Shan. 2019. Graph attention convolution for point cloud semantic segmentation. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919). 10296\u201310305."},{"key":"e_1_3_1_70_2","doi-asserted-by":"crossref","first-page":"7794","DOI":"10.1109\/CVPR.2018.00813","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201918)","author":"Wang Xiaolong","year":"2018","unstructured":"Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201918). 7794\u20137803."},{"key":"e_1_3_1_71_2","doi-asserted-by":"crossref","first-page":"2022","DOI":"10.1145\/3308558.3313562","volume-title":"The World Wide Web Conference (WWW\u201919)","author":"Wang Xiao","year":"2019","unstructured":"Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S. Yu. 2019. Heterogeneous graph attention network. In The World Wide Web Conference (WWW\u201919). 
2022\u20132032."},{"key":"e_1_3_1_72_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TPAMI.2020.3015894","article-title":"Symbiotic attention for egocentric action recognition with object-centric alignment","author":"Wang Xiaohan","year":"2020","unstructured":"Xiaohan Wang, Linchao Zhu, Yu Wu, and Yi Yang. 2020. Symbiotic attention for egocentric action recognition with object-centric alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020), 1\u201313.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_73_2","first-page":"5079","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201921)","author":"Wang Xiaohan","year":"2021","unstructured":"Xiaohan Wang, Linchao Zhu, and Yi Yang. 2021. T2VLAD: Global-local sequence alignment for text-video retrieval. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201921). 5079\u20135088."},{"key":"e_1_3_1_74_2","first-page":"5764","volume-title":"IEEE\/CVF International Conference on Computer Vision (ICCV\u201919)","author":"Wang Zihao","year":"2019","unstructured":"Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng, Junjie Yan, Xiaogang Wang, and Jing Shao. 2019. CAMP: Cross-modal adaptive message passing for text-image retrieval. In IEEE\/CVF International Conference on Computer Vision (ICCV\u201919). 5764\u20135773."},{"key":"e_1_3_1_75_2","first-page":"13005","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920)","author":"Wei Jiwei","year":"2020","unstructured":"Jiwei Wei, Xing Xu, Yang Yang, Yanli Ji, Zheng Wang, and Heng Tao Shen. 2020. Universal weighting metric learning for cross-modal matching. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920). 
13005\u201313014."},{"key":"e_1_3_1_76_2","first-page":"450","volume-title":"IEEE\/CVF International Conference on Computer Vision (ICCV\u201919)","author":"Wray Michael","year":"2019","unstructured":"Michael Wray, Diane Larlus, Gabriela Csurka, and Dima Damen. 2019. Fine-grained action retrieval through multiple parts-of-speech embeddings. In IEEE\/CVF International Conference on Computer Vision (ICCV\u201919). 450\u2013459."},{"key":"e_1_3_1_77_2","first-page":"1","volume-title":"Advances in Neural Information Processing Systems (NeurIPS\u201919)","author":"Wu Aming","year":"2019","unstructured":"Aming Wu, Linchao Zhu, Yahong Han, and Yi Yang. 2019. Connective cognition network for directional visual commonsense reasoning. In Advances in Neural Information Processing Systems (NeurIPS\u201919), Vol. 32. 1\u201310."},{"key":"e_1_3_1_78_2","first-page":"3518","volume-title":"ACM International Conference on Multimedia (ACM MM\u201921)","author":"Wu Peng","year":"2021","unstructured":"Peng Wu, Xiangteng He, Mingqian Tang, Yiliang Lv, and Jing Liu. 2021. HANet: Hierarchical alignment networks for video-text retrieval. In ACM International Conference on Multimedia (ACM MM\u201921). 3518\u20133527."},{"key":"e_1_3_1_79_2","first-page":"1492","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201917)","author":"Xie Saining","year":"2017","unstructured":"Saining Xie, Ross Girshick, Piotr Doll\u00e1r, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201917). 1492\u20131500."},{"key":"e_1_3_1_80_2","first-page":"5288","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201916)","author":"Xu Jun","year":"2016","unstructured":"Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. 
In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201916). 5288\u20135296."},{"key":"e_1_3_1_81_2","first-page":"11562","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201921)","author":"Yang Jianwei","year":"2021","unstructured":"Jianwei Yang, Yonatan Bisk, and Jianfeng Gao. 2021. TACo: Token-aware cascade contrastive learning for video-text alignment. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201921). 11562\u201311572."},{"key":"e_1_3_1_82_2","first-page":"1","article-title":"Visual semantic navigation using scene priors","author":"Yang Wei","year":"2019","unstructured":"Wei Yang, Xiaolong Wang, Ali Farhadi, Abhinav Gupta, and Roozbeh Mottaghi. 2019. Visual semantic navigation using scene priors. ICLR (2019), 1\u201314.","journal-title":"ICLR"},{"key":"e_1_3_1_83_2","first-page":"1339","volume-title":"ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR\u201920)","author":"Yang Xun","year":"2020","unstructured":"Xun Yang, Jianfeng Dong, Yixin Cao, Xun Wang, Meng Wang, and Tat-Seng Chua. 2020. Tree-augmented cross-modal encoding for complex-query video retrieval. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR\u201920). 1339\u20131348."},{"key":"e_1_3_1_84_2","first-page":"684","volume-title":"European Conference on Computer Vision (ECCV\u201918)","author":"Yao Ting","year":"2018","unstructured":"Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. 2018. Exploring visual relationship for image captioning. In European Conference on Computer Vision (ECCV\u201918). 684\u2013699."},{"key":"e_1_3_1_85_2","first-page":"1","volume-title":"Advances in Neural Information Processing Systems (NeurIPS\u201919)","author":"Yu Weijiang","year":"2019","unstructured":"Weijiang Yu, Jingwen Zhou, Weihao Yu, Xiaodan Liang, and Nong Xiao. 2019. Heterogeneous graph learning for visual commonsense reasoning. 
In Advances in Neural Information Processing Systems (NeurIPS\u201919), Vol. 32. 1\u201310."},{"key":"e_1_3_1_86_2","first-page":"471","volume-title":"European Conference on Computer Vision (ECCV\u201918)","author":"Yu Youngjae","year":"2018","unstructured":"Youngjae Yu, Jongseok Kim, and Gunhee Kim. 2018. A joint sequence fusion model for video question answering and retrieval. In European Conference on Computer Vision (ECCV\u201918). 471\u2013487."},{"key":"e_1_3_1_87_2","first-page":"1","volume-title":"European Conference on Computer Vision (ECCV\u201916)","author":"Yu Youngjae","year":"2016","unstructured":"Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. 2016. Video captioning and retrieval models with semantic attention. In European Conference on Computer Vision (ECCV\u201916). 1\u201314."},{"key":"e_1_3_1_88_2","first-page":"3165","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201917)","author":"Yu Youngjae","year":"2017","unstructured":"Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. 2017. End-to-end concept word detection for video captioning, retrieval, and question answering. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201917). 3165\u20133173."},{"key":"e_1_3_1_89_2","first-page":"818","volume-title":"European Conference on Computer Vision (ECCV\u201914)","author":"Zeiler Matthew D.","year":"2014","unstructured":"Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV\u201914). 818\u2013833."},{"key":"e_1_3_1_90_2","first-page":"1084","volume-title":"European Conference on Computer Vision (ECCV\u201916)","author":"Zhang Jianming","year":"2016","unstructured":"Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. 2016. Top-down neural attention by excitation backprop. In European Conference on Computer Vision (ECCV\u201916). 
1084\u20131102."},{"key":"e_1_3_1_91_2","first-page":"286","volume-title":"European Conference on Computer Vision (ECCV\u201918)","author":"Zhang Yulun","year":"2018","unstructured":"Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. 2018. Image super-resolution using very deep residual channel attention networks. In European Conference on Computer Vision (ECCV\u201918). 286\u2013301."},{"key":"e_1_3_1_92_2","first-page":"1","article-title":"Memory enhanced embedding learning for cross-modal video-text retrieval","author":"Zhao Rui","year":"2021","unstructured":"Rui Zhao, Kecheng Zheng, Zheng-Jun Zha, Hongtao Xie, and Jiebo Luo. 2021. Memory enhanced embedding learning for cross-modal video-text retrieval. arXiv:2103.15686 (2021), 1\u20139.","journal-title":"arXiv:2103.15686"},{"key":"e_1_3_1_93_2","first-page":"2921","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201916)","author":"Zhou Bolei","year":"2016","unstructured":"Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2016. Learning deep features for discriminative localization. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201916). 2921\u20132929."},{"key":"e_1_3_1_94_2","first-page":"1","article-title":"Double attention based on graph attention network for image multi-label classification","author":"Zhou Wei","year":"2022","unstructured":"Wei Zhou, Zhiwu Xia, Peng Dou, Tao Su, and Haifeng Hu. 2022. Double attention based on graph attention network for image multi-label classification. ACM Transactions on Multimedia Computing, Communications, and Applications (2022), 1\u201322.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"key":"e_1_3_1_95_2","first-page":"8746","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920)","author":"Zhu Linchao","year":"2020","unstructured":"Linchao Zhu and Yi Yang. 2020. 
ActBERT: Learning global-local video-text representations. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920). 8746\u20138755."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3538533","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3538533","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:14Z","timestamp":1750186814000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3538533"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,30]]},"references-count":94,"journal-issue":{"issue":"2s","published-print":{"date-parts":[[2022,6,30]]}},"alternative-id":["10.1145\/3538533"],"URL":"https:\/\/doi.org\/10.1145\/3538533","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,6,30]]},"assertion":[{"value":"2021-11-16","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-05-06","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-01-06","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}