{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:09:48Z","timestamp":1750219788306,"version":"3.41.0"},"reference-count":71,"publisher":"Association for Computing Machinery (ACM)","issue":"5s","license":[{"start":{"date-parts":[[2023,6,7]],"date-time":"2023-06-07T00:00:00Z","timestamp":1686096000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2023,10,31]]},"abstract":"<jats:p>\n            This article introduces the task of visual named entity discovery in videos without the need for task-specific supervision or task-specific external knowledge sources. Assigning specific names to entities (e.g., faces, scenes, or objects) in video frames is a long-standing challenge. Commonly, this problem is addressed as a supervised learning objective by manually annotating entities with labels. To bypass the annotation burden of this setup, several works have investigated the problem by utilizing external knowledge sources such as movie databases. While effective, such approaches do not work when task-specific knowledge sources are not provided and can only be applied to movies and TV series. In this work, we take the problem a step further and propose to discover entities in videos from videos and corresponding captions or subtitles. We introduce a three-stage method where we (i) create bipartite entity-name graphs from frame\u2013caption pairs, (ii) find visual entity agreements, and (iii) refine the entity assignment through entity-level prototype construction. 
To tackle this new problem, we outline two new benchmarks,\n            <jats:sc>SC-Friends<\/jats:sc>\n            and\n            <jats:sc>SC-BBT<\/jats:sc>\n            , based on the\n            <jats:italic>Friends<\/jats:italic>\n            and\n            <jats:italic>Big Bang Theory<\/jats:italic>\n            TV series. Experiments on the benchmarks demonstrate the ability of our approach to discover which named entity belongs to which face or scene, with an accuracy close to a supervised oracle, just from the multimodal information present in videos. Additionally, our qualitative examples show the potential challenges of self-contained discovery of any visual entity for future work. The code and the data are available on GitHub.\n            <jats:xref ref-type=\"fn\">\n              <jats:sup>1<\/jats:sup>\n            <\/jats:xref>\n          <\/jats:p>\n          <jats:p\/>","DOI":"10.1145\/3583138","type":"journal-article","created":{"date-parts":[[2023,2,7]],"date-time":"2023-02-07T13:22:20Z","timestamp":1675776140000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Self-contained Entity Discovery from Captioned Videos"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6847-5999","authenticated-orcid":false,"given":"Melika","family":"Ayoughi","sequence":"first","affiliation":[{"name":"University of Amsterdam, The Netherlands"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9275-5942","authenticated-orcid":false,"given":"Pascal","family":"Mettes","sequence":"additional","affiliation":[{"name":"University of Amsterdam, The Netherlands"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0183-6910","authenticated-orcid":false,"given":"Paul","family":"Groth","sequence":"additional","affiliation":[{"name":"University of Amsterdam, The 
Netherlands"}]}],"member":"320","published-online":{"date-parts":[[2023,6,7]]},"reference":[{"key":"e_1_3_2_2_2","volume-title":"CVPR","author":"Alayrac Jean-Baptiste","year":"2016","unstructured":"Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, and Simon Lacoste-Julien. 2016. Unsupervised learning from narrated instruction videos. In CVPR."},{"key":"e_1_3_2_3_2","volume-title":"ACCV","author":"Bain Max","year":"2020","unstructured":"Max Bain, Arsha Nagrani, Andrew Brown, and Andrew Zisserman. 2020. Condensed movies: Story based retrieval with contextual embeddings. In ACCV."},{"key":"e_1_3_2_4_2","volume-title":"ACM Multimedia","author":"Bin Yi","year":"2021","unstructured":"Yi Bin, Xindi Shang, Bo Peng, Yujuan Ding, and Tat-Seng Chua. 2021. Multi-perspective video captioning. In ACM Multimedia."},{"key":"e_1_3_2_5_2","volume-title":"MediaEval 2016","author":"Bredin Herv\u00e9","year":"2016","unstructured":"Herv\u00e9 Bredin, Claude Barras, and Camille Guinaudeau. 2016. Multimodal person discovery in broadcast TV at MediaEval 2016. In MediaEval 2016."},{"key":"e_1_3_2_6_2","volume-title":"IEEE-MIPR","author":"Brown Andrew","year":"2021","unstructured":"Andrew Brown, Ernesto Coto, and Andrew Zisserman. 2021. Automated video labelling: Identifying faces by corroborative evidence. In IEEE-MIPR."},{"key":"e_1_3_2_7_2","volume-title":"FG","author":"Cao Qiong","year":"2018","unstructured":"Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. 2018. Vggface2: A dataset for recognising faces across pose and age. In FG. IEEE."},{"key":"e_1_3_2_8_2","article-title":"Generating video descriptions with latent topic guidance","author":"Chen Shizhe","year":"2019","unstructured":"Shizhe Chen, Qin Jin, Jia Chen, and Alexander G. Hauptmann. 2019. Generating video descriptions with latent topic guidance. IEEE Trans. Multimedia (2019).","journal-title":"IEEE Trans. 
Multimedia"},{"key":"e_1_3_2_9_2","volume-title":"ICMR","author":"Curtis Keith","year":"2020","unstructured":"Keith Curtis, George Awad, Shahzad Rajput, and Ian Soboroff. 2020. HLVU: A new challenge to test deep understanding of movies the way humans do. In ICMR."},{"key":"e_1_3_2_10_2","volume-title":"CVPR","author":"Divvala Santosh K.","year":"2014","unstructured":"Santosh K. Divvala, Ali Farhadi, and Carlos Guestrin. 2014. Learning everything about anything: Webly-supervised visual concept learning. In CVPR."},{"key":"e_1_3_2_11_2","unstructured":"Tim Esler. 2021. facenet-pytorch: Pretrained Pytorch Face Detection (MTCNN) and Facial Recognition (Inception Resnet) Models. Retrieved from https:\/\/github.com\/timesler\/facenet-pytorch#use-this-repo-in-your-own-git-project."},{"key":"e_1_3_2_12_2","volume-title":"BMVC","author":"Everingham Mark","year":"2006","unstructured":"Mark Everingham, Josef Sivic, and Andrew Zisserman. 2006. \u201cHello! My name is... Buffy\u201d\u2014Automatic naming of characters in TV video. In BMVC."},{"key":"e_1_3_2_13_2","volume-title":"ICCV","author":"Feichtenhofer Christoph","year":"2019","unstructured":"Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slowfast networks for video recognition. In ICCV."},{"key":"e_1_3_2_14_2","volume-title":"NeurIPS","author":"Feichtenhofer Christoph","year":"2016","unstructured":"Christoph Feichtenhofer, Axel Pinz, and Richard P. Wildes. 2016. Spatiotemporal residual networks for video action recognition. In NeurIPS."},{"key":"e_1_3_2_15_2","volume-title":"ECCV","author":"Gan Chuang","year":"2016","unstructured":"Chuang Gan, Chen Sun, Lixin Duan, and Boqing Gong. 2016. Webly-supervised video recognition by mutually voting for relevant web images and web video frames. In ECCV."},{"key":"e_1_3_2_16_2","volume-title":"CVPR","author":"Gan Chuang","year":"2016","unstructured":"Chuang Gan, Ting Yao, Kuiyuan Yang, Yi Yang, and Tao Mei. 2016. 
You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images. In CVPR."},{"key":"e_1_3_2_17_2","unstructured":"Priya Goyal Quentin Duval Isaac Seessel Mathilde Caron Mannat Singh Ishan Misra Levent Sagun Armand Joulin and Piotr Bojanowski. 2022. Vision Models Are More Robust and Fair When Pretrained on Uncurated Images without Supervision. Retrieved from https:\/\/github.com\/facebookresearch\/vissl\/blob\/main\/projects\/SEER\/README.md."},{"key":"e_1_3_2_18_2","volume-title":"CVPR","author":"Grauman Kristen","year":"2022","unstructured":"Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et\u00a0al. 2022. Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR."},{"key":"e_1_3_2_19_2","volume-title":"CVPR","author":"Gu Chunhui","year":"2018","unstructured":"Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et\u00a0al. 2018. Ava: A video dataset of spatio-temporally localized atomic visual actions. In CVPR."},{"key":"e_1_3_2_20_2","volume-title":"ECCV","author":"Guo Sheng","year":"2018","unstructured":"Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R Scott, and Dinglong Huang. 2018. Curriculumnet: Weakly supervised learning from large-scale web images. In ECCV."},{"key":"e_1_3_2_21_2","volume-title":"ECCV","author":"Guo Yandong","year":"2016","unstructured":"Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. 2016. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In ECCV. Springer."},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2018.2890560"},{"key":"e_1_3_2_23_2","volume-title":"ACM Multimedia","author":"Huang Feiran","year":"2018","unstructured":"Feiran Huang, Xiaoming Zhang, and Zhoujun Li. 2018. 
Learning joint multimodal representation with adversarial attention networks. In ACM Multimedia."},{"key":"e_1_3_2_24_2","volume-title":"ECCV","author":"Huang Qingqiu","year":"2020","unstructured":"Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. 2020. Movienet: A holistic dataset for movie understanding. In ECCV. Springer."},{"key":"e_1_3_2_25_2","volume-title":"ECCV","author":"Huang Qingqiu","year":"2020","unstructured":"Qingqiu Huang, Lei Yang, Huaiyi Huang, Tong Wu, and Dahua Lin. 2020. Caption-supervised face recognition: Training a state-of-the-art face model without manual annotation. In ECCV."},{"key":"e_1_3_2_26_2","volume-title":"CVPR","author":"Jang Yunseok","year":"2017","unstructured":"Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. 2017. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In CVPR."},{"key":"e_1_3_2_27_2","volume-title":"ICCV","author":"Ji Jingwei","year":"2021","unstructured":"Jingwei Ji, Rishi Desai, and Juan Carlos Niebles. 2021. Detecting human-object relationships in videos. In ICCV."},{"key":"e_1_3_2_28_2","volume-title":"ICML","author":"Jiang Lu","year":"2018","unstructured":"Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. 2018. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML. PMLR."},{"key":"e_1_3_2_29_2","volume-title":"BMVC","author":"Kalogeiton Vicky","year":"2020","unstructured":"Vicky Kalogeiton and Andrew Zisserman. 2020. Constrained video face clustering using 1NN relations. In BMVC."},{"key":"e_1_3_2_30_2","volume-title":"CVPR","author":"Kemelmacher-Shlizerman Ira","year":"2016","unstructured":"Ira Kemelmacher-Shlizerman, Steven M Seitz, Daniel Miller, and Evan Brossard. 2016. The megaface benchmark: 1 million faces for recognition at scale. 
In CVPR."},{"key":"e_1_3_2_31_2","volume-title":"IJCAI","author":"Kim Kyung-Min","year":"2017","unstructured":"Kyung-Min Kim, Min-Oh Heo, Seong-Ho Choi, and Byoung-Tak Zhang. 2017. DeepStory: Video story QA by deep embedded memory networks. In IJCAI."},{"key":"e_1_3_2_32_2","volume-title":"CVPR","author":"Kukleva Anna","year":"2020","unstructured":"Anna Kukleva, Makarand Tapaswi, and Ivan Laptev. 2020. Learning interactions and relationships between movie characters. In CVPR."},{"key":"e_1_3_2_33_2","volume-title":"International Workshop on Content-Based Multimedia Indexing","author":"Le Nam","year":"2017","unstructured":"Nam Le, Herv\u00e9 Bredin, Gabriel Sargent, Miquel India, Paula Lopez-Otero, Claude Barras, Camille Guinaudeau, Guillaume Gravier, Gabriel Barbosa da Fonseca, Izabela Lyon Freire, et\u00a0al. 2017. Towards large scale multimedia indexing: A case study on person discovery in broadcast news. In International Workshop on Content-Based Multimedia Indexing."},{"key":"e_1_3_2_34_2","volume-title":"EMNLP","author":"Lei Jie","year":"2018","unstructured":"Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L. Berg. 2018. TVQA: Localized, compositional video question answering. In EMNLP."},{"key":"e_1_3_2_35_2","volume-title":"ACL","author":"Lei Jie","year":"2020","unstructured":"Jie Lei, Licheng Yu, Tamara Berg, and Mohit Bansal. 2020. TVQA+: Spatio-temporal grounding for video question answering. In ACL."},{"key":"e_1_3_2_36_2","article-title":"Unified spatio-temporal attention networks for action recognition in videos","author":"Li Dong","year":"2018","unstructured":"Dong Li, Ting Yao, Ling-Yu Duan, Tao Mei, and Yong Rui. 2018. Unified spatio-temporal attention networks for action recognition in videos. IEEE Trans. Multimedia (2018).","journal-title":"IEEE Trans. Multimedia"},{"key":"e_1_3_2_37_2","volume-title":"ICCV","author":"Li Yikang","year":"2017","unstructured":"Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. 2017. 
Scene graph generation from objects, phrases and region captions. In ICCV."},{"key":"e_1_3_2_38_2","volume-title":"ICMLA","author":"Mahon Louis","year":"2020","unstructured":"Louis Mahon, Eleonora Giunchiglia, Bowen Li, and Thomas Lukasiewicz. 2020. Knowledge graph extraction from videos. In ICMLA."},{"key":"e_1_3_2_39_2","volume-title":"CVPR","author":"Marsza\u0142ek Marcin","year":"2009","unstructured":"Marcin Marsza\u0142ek, Ivan Laptev, and Cordelia Schmid. 2009. Actions in context. In CVPR."},{"key":"e_1_3_2_40_2","volume-title":"CVPR","author":"Miech Antoine","year":"2020","unstructured":"Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. 2020. End-to-end learning of visual representations from uncurated instructional videos. In CVPR."},{"key":"e_1_3_2_41_2","volume-title":"ICCV","author":"Miech Antoine","year":"2019","unstructured":"Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV."},{"key":"e_1_3_2_42_2","volume-title":"ICMR","author":"Mithun Niluthpol Chowdhury","year":"2018","unstructured":"Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, and Amit K Roy-Chowdhury. 2018. Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In ICMR."},{"key":"e_1_3_2_43_2","volume-title":"ICCV","author":"Mun Jonghwan","year":"2017","unstructured":"Jonghwan Mun, Paul Hongsuck Seo, Ilchae Jung, and Bohyung Han. 2017. Marioqa: Answering questions by watching gameplay videos. In ICCV."},{"key":"e_1_3_2_44_2","volume-title":"MediaEval 2015","author":"Poignant Johann","year":"2015","unstructured":"Johann Poignant, Herv\u00e9 Bredin, and Claude Barras. 2015. Limsi at mediaeval 2015: Person discovery in broadcast tv task. 
In MediaEval 2015."},{"key":"e_1_3_2_45_2","volume-title":"MediaEval 2015","author":"Poignant Johann","year":"2015","unstructured":"Johann Poignant, Herv\u00e9 Bredin, and Claude Barras. 2015. Multimodal person discovery in broadcast tv at mediaeval 2015. In MediaEval 2015."},{"key":"e_1_3_2_46_2","volume-title":"ISCA","author":"Poignant Johann","year":"2012","unstructured":"Johann Poignant, Herv\u00e9 Bredin, Viet-Bac Le, Laurent Besacier, Claude Barras, and Georges Qu\u00e9not. 2012. Unsupervised speaker identification using overlaid texts in TV broadcast. In ISCA."},{"key":"e_1_3_2_47_2","volume-title":"CVPR","author":"Rohrbach Anna","year":"2015","unstructured":"Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. 2015. A dataset for movie description. In CVPR."},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0987-1"},{"key":"e_1_3_2_49_2","article-title":"Robust face-name graph matching for movie character identification","author":"Sang Jitao","year":"2012","unstructured":"Jitao Sang and Changsheng Xu. 2012. Robust face-name graph matching for movie character identification. IEEE Trans. Multimedia (2012).","journal-title":"IEEE Trans. Multimedia"},{"key":"e_1_3_2_50_2","volume-title":"AAAI","author":"Sanh Victor","year":"2019","unstructured":"Victor Sanh, Thomas Wolf, and Sebastian Ruder. 2019. A hierarchical multi-task approach for learning embeddings from semantic tasks. In AAAI."},{"key":"e_1_3_2_51_2","volume-title":"FG","author":"Sharma Vivek","year":"2020","unstructured":"Vivek Sharma, Makarand Tapaswi, M. Saquib Sarfraz, and Rainer Stiefelhagen. 2020. Clustering based contrastive learning for improving face representations. In FG. IEEE."},{"key":"e_1_3_2_52_2","volume-title":"NeurIPS","author":"Snell Jake","year":"2017","unstructured":"Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. 
In NeurIPS."},{"key":"e_1_3_2_53_2","volume-title":"ICCV","author":"Sun Chen","year":"2019","unstructured":"Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. Videobert: A joint model for video and language representation learning. In ICCV."},{"key":"e_1_3_2_54_2","article-title":"Look at what i\u2019m doing: Self-supervised spatial grounding of narrations in instructional videos","author":"Tan Reuben","year":"2021","unstructured":"Reuben Tan, Bryan Plummer, Kate Saenko, Hailin Jin, and Bryan Russell. 2021. Look at what i\u2019m doing: Self-supervised spatial grounding of narrations in instructional videos. NeurIPS.","journal-title":"NeurIPS"},{"key":"e_1_3_2_55_2","volume-title":"CVPR","author":"Tapaswi Makarand","year":"2015","unstructured":"Makarand Tapaswi, Martin Bauml, and Rainer Stiefelhagen. 2015. Book2movie: Aligning video scenes with book chapters. In CVPR."},{"key":"e_1_3_2_56_2","volume-title":"ICCV","author":"Tapaswi Makarand","year":"2019","unstructured":"Makarand Tapaswi, Marc T. Law, and Sanja Fidler. 2019. Video face clustering with unknown number of clusters. In ICCV."},{"key":"e_1_3_2_57_2","volume-title":"CVPR","author":"Tapaswi Makarand","year":"2016","unstructured":"Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. Movieqa: Understanding stories in movies through question-answering. In CVPR."},{"key":"e_1_3_2_58_2","unstructured":"the MDZ Digital Library team. 2021. Large-cased BERT Finetuned on English. Retrieved from https:\/\/huggingface.co\/dbmdz\/bert-large-cased-finetuned-conll03-english."},{"key":"e_1_3_2_59_2","volume-title":"ACM Multimedia","author":"Tian Hongshuo","year":"2020","unstructured":"Hongshuo Tian, Ning Xu, An-An Liu, and Yongdong Zhang. 2020. Part-aware interactive learning for scene graph generation. In ACM Multimedia."},{"key":"e_1_3_2_60_2","unstructured":"Atousa Torabi Christopher J. Pal Hugo Larochelle and Aaron C. Courville. 2015. 
Using descriptive video services to create a large data source for video annotation research. (unpublished)."},{"key":"e_1_3_2_61_2","volume-title":"CVPR","author":"Tran Du","year":"2018","unstructured":"Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In CVPR."},{"key":"e_1_3_2_62_2","volume-title":"CVPR","author":"Veit Andreas","year":"2017","unstructured":"Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge Belongie. 2017. Learning from noisy large-scale datasets with minimal supervision. In CVPR."},{"key":"e_1_3_2_63_2","volume-title":"CVPR","author":"Vicol Paul","year":"2018","unstructured":"Paul Vicol, Makarand Tapaswi, Lluis Castrejon, and Sanja Fidler. 2018. Moviegraphs: Towards understanding human-centric situations from videos. In CVPR."},{"key":"e_1_3_2_64_2","volume-title":"NeurIPS","author":"Woo Sanghyun","year":"2018","unstructured":"Sanghyun Woo, Dahun Kim, Donghyeon Cho, and In So Kweon. 2018. LinkNet: Relational embedding for scene graph. In NeurIPS."},{"key":"e_1_3_2_65_2","volume-title":"CVPR","author":"Wu Chao-Yuan","year":"2021","unstructured":"Chao-Yuan Wu and Philipp Krahenbuhl. 2021. Towards long-form video understanding. In CVPR."},{"key":"e_1_3_2_66_2","article-title":"STAT: Spatial-temporal attention mechanism for video captioning","author":"Yan Chenggang","year":"2019","unstructured":"Chenggang Yan, Yunbin Tu, Xingzheng Wang, Yongbing Zhang, Xinhong Hao, Yongdong Zhang, and Qionghai Dai. 2019. STAT: Spatial-temporal attention mechanism for video captioning. IEEE Trans. Multimedia (2019).","journal-title":"IEEE Trans. Multimedia"},{"key":"e_1_3_2_67_2","volume-title":"ECCV","author":"Yang Jianwei","year":"2018","unstructured":"Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. 2018. Graph r-cnn for scene graph generation. 
In ECCV."},{"key":"e_1_3_2_68_2","volume-title":"CVPR","author":"Zellers Rowan","year":"2019","unstructured":"Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In CVPR."},{"key":"e_1_3_2_69_2","volume-title":"CVPR","author":"Zellers Rowan","year":"2018","unstructured":"Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. 2018. Neural motifs: Scene graph parsing with global context. In CVPR."},{"key":"e_1_3_2_70_2","volume-title":"ICCV","author":"Zhan Xunlin","year":"2021","unstructured":"Xunlin Zhan, Yangxin Wu, Xiao Dong, Yunchao Wei, Minlong Lu, Yichi Zhang, Hang Xu, and Xiaodan Liang. 2021. Product1m: Towards weakly supervised instance-level product retrieval via cross-modal pretraining. In ICCV."},{"key":"e_1_3_2_71_2","volume-title":"ICCV","author":"Zhu Yukun","year":"2015","unstructured":"Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV."},{"key":"e_1_3_2_72_2","volume-title":"CVPR","author":"Zhuang Bohan","year":"2017","unstructured":"Bohan Zhuang, Lingqiao Liu, Yao Li, Chunhua Shen, and Ian Reid. 2017. Attend in groups: A weakly-supervised deep learning framework for learning from web data. 
In CVPR."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3583138","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3583138","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:37:54Z","timestamp":1750178274000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3583138"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,7]]},"references-count":71,"journal-issue":{"issue":"5s","published-print":{"date-parts":[[2023,10,31]]}},"alternative-id":["10.1145\/3583138"],"URL":"https:\/\/doi.org\/10.1145\/3583138","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2023,6,7]]},"assertion":[{"value":"2022-08-03","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-01-22","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-06-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}