{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,23]],"date-time":"2025-08-23T05:25:10Z","timestamp":1755926710473,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":20,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3551610","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:42:35Z","timestamp":1665416555000},"page":"7215-7219","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Leveraging Text Representation and Face-head Tracking for Long-form Multimodal Semantic Relation Understanding"],"prefix":"10.1145","author":[{"given":"Raksha","family":"Ramesh","sequence":"first","affiliation":[{"name":"Columbia University & Graphen Inc., New York, NY, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Vishal","family":"Anand","sequence":"additional","affiliation":[{"name":"Columbia University & Microsoft Corporation, Redmond, WA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zifan","family":"Chen","sequence":"additional","affiliation":[{"name":"Columbia University & Graphen Inc., New York, NY, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yifei","family":"Dong","sequence":"additional","affiliation":[{"name":"Columbia University & Graphen Inc., New York, NY, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yun","family":"Chen","sequence":"additional","affiliation":[{"name":"Graphen Inc., New York, NY, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ching-Yung","family":"Lin","sequence":"additional","affiliation":[{"name":"Columbia University & Graphen Inc., New York, NY, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"volume-title":"MultiModal Language Modelling on Knowledge Graphs for Deep Video Understanding","author":"Anand Vishal","key":"e_1_3_2_1_1_1","unstructured":"Vishal Anand, Raksha Ramesh, Boshen Jin, Ziyin Wang, Xiaoxiao Lei, and Ching-Yung Lin. 2021. MultiModal Language Modelling on Knowledge Graphs for Deep Video Understanding. Association for Computing Machinery, New York, NY, USA, 4868--4872. https:\/\/doi.org\/10.1145\/3474085.3479220"},{"volume-title":"Story Semantic Relationships from Multimodal Cognitions","author":"Anand Vishal","key":"e_1_3_2_1_2_1","unstructured":"Vishal Anand, Raksha Ramesh, Ziyin Wang, Yijing Feng, Jiana Feng, Wenfeng Lyu, Tianle Zhu, Serena Yuan, and Ching-Yung Lin. 2020. Story Semantic Relationships from Multimodal Cognitions. Association for Computing Machinery, New York, NY, USA, 4650--4654. https:\/\/doi.org\/10.1145\/3394171.3416305"},{"key":"e_1_3_2_1_3_1","volume-title":"VQA: Visual Question Answering. 
In International Conference on Computer Vision (ICCV).","author":"Antol Stanislaw","year":"2015","unstructured":"Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV)."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00051"},{"key":"e_1_3_2_1_5_1","volume-title":"Advances in Neural Information Processing Systems","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877--1901. 
https:\/\/proceedings.neurips.cc\/paper\/2020\/file\/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3372278.3390742"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"crossref","unstructured":"J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_1_8_1","volume-title":"ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . 4685--4694","author":"Deng Jiankang","year":"2019","unstructured":"Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. 2019. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. 
In 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4685--4694. https:\/\/doi.org\/10.1109\/CVPR.2019.00482"},{"volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Hudson Drew A.","key":"e_1_3_2_1_9_1","unstructured":"Drew A. Hudson and Christopher D. Manning. 2019. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_2_1_11_1","volume-title":"Advances in Neural Information Processing Systems","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch\u00e9-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc. 
https:\/\/proceedings.neurips.cc\/paper\/2019\/file\/c74d97b01eae257e44aa9d5bade97baf-Paper.pdf"},{"key":"e_1_3_2_1_12_1","volume-title":"maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch","author":"Massa Francisco","year":"2018","unstructured":"Francisco Massa and Ross Girshick. 2018. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https:\/\/github.com\/facebookresearch\/maskrcnn-benchmark. Accessed: July 25, 2022."},{"key":"e_1_3_2_1_13_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning","volume":"139","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 8748--8763. 
https:\/\/proceedings.mlr.press\/v139\/radford21a.html"},{"volume-title":"Kinetics and Scene Features for Intent Detection. In Companion Publication of the 2020 International Conference on Multimodal Interaction","author":"Ramesh Raksha","key":"e_1_3_2_1_14_1","unstructured":"Raksha Ramesh, Vishal Anand, Ziyin Wang, Tianle Zhu, Wenfeng Lyu, Serena Yuan, and Ching-Yung Lin. 2020. Kinetics and Scene Features for Intent Detection. In Companion Publication of the 2020 International Conference on Multimodal Interaction (Virtual Event, Netherlands) (ICMI '20 Companion). Association for Computing Machinery, New York, NY, USA, 135--139. https:\/\/doi.org\/10.1145\/3395035.3425641"},{"key":"e_1_3_2_1_15_1","volume-title":"Film: An international history of the medium","author":"Sklar Robert","year":"2002","unstructured":"Robert Sklar. 2002. Film: An international history of the medium. Prentice Hall, 526."},{"key":"e_1_3_2_1_16_1","volume-title":"VideoBERT: A Joint Model for Video and Language Representation Learning. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019","author":"Sun Chen","year":"2019","unstructured":"Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A Joint Model for Video and Language Representation Learning. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. IEEE, 7463--7472. 
https:\/\/doi.org\/10.1109\/ICCV.2019.00756"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"crossref","unstructured":"Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 5100--5111. https:\/\/doi.org\/10.18653\/v1\/D19-1514","DOI":"10.18653\/v1\/D19-1514"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00377"},{"key":"e_1_3_2_1_19_1","unstructured":"Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. 2019. Detectron2. https:\/\/github.com\/facebookresearch\/detectron2."},{"volume-title":"Deep Relationship Analysis in Video with Multimodal Feature Fusion","author":"Yu Fan","key":"e_1_3_2_1_20_1","unstructured":"Fan Yu, DanDan Wang, Beibei Zhang, and Tongwei Ren. 2020. Deep Relationship Analysis in Video with Multimodal Feature Fusion. 
Association for Computing Machinery, New York, NY, USA, 4640--4644. https:\/\/doi.org\/10.1145\/3394171.3416303"}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Lisboa Portugal","acronym":"MM '22"},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3551610","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3551610","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:18Z","timestamp":1750182558000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3551610"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":20,"alternative-id":["10.1145\/3503161.3551610","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3551610","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}