{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,4]],"date-time":"2026-06-04T21:18:46Z","timestamp":1780607926434,"version":"3.54.1"},"reference-count":66,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,11,7]],"date-time":"2023-11-07T00:00:00Z","timestamp":1699315200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["U1903214, 62372339, 62371350, and 61876135"],"award-info":[{"award-number":["U1903214, 62372339, 62371350, and 61876135"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Inf. Syst."],"published-print":{"date-parts":[[2024,3,31]]},"abstract":"<jats:p>Person-Action instance search (P-A INS) aims to retrieve the instances of a specific person doing a specific action, which appears in the 2019\u20132021 INS tasks of the world-famous TREC Video Retrieval Evaluation (TRECVID). Most of the top-ranking solutions can be summarized with a Division-Fusion-Optimization (DFO) framework, in which person and action recognition scores are obtained separately, then fused, and, optionally, further optimized to generate the final ranking. However, TRECVID only evaluates the final ranking results, ignoring the effects of intermediate steps and their implementation methods. 
We argue that conducting the fine-grained evaluations of intermediate steps of DFO framework will (1) provide a quantitative analysis of the different methods\u2019 performance in intermediate steps; (2) find out better design choices that contribute to improving retrieval performance; and (3) inspire new ideas for future research from the limitation analysis of current techniques. Particularly, we propose an indirect evaluation method motivated by the leave-one-out strategy, which finds an optimal solution surpassing the champion teams in 2020\u20132021 INS tasks. Moreover, to validate the generalizability and robustness of the proposed solution under various scenarios, we specifically construct a new large-scale P-A INS dataset and conduct comparative experiments with both the leading NIST TRECVID INS solution and the state-of-the-art P-A INS method. Finally, we discuss the limitations of our evaluation work and suggest future research directions.<\/jats:p>","DOI":"10.1145\/3617892","type":"journal-article","created":{"date-parts":[[2023,8,29]],"date-time":"2023-08-29T11:48:14Z","timestamp":1693309694000},"page":"1-34","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Person-action Instance Search in Story Videos: An Experimental Study"],"prefix":"10.1145","volume":"42","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5056-1477","authenticated-orcid":false,"given":"Yanrui","family":"Niu","sequence":"first","affiliation":[{"name":"NERCMS, School of Computer Science, Wuhan University, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8287-8655","authenticated-orcid":false,"given":"Chao","family":"Liang","sequence":"additional","affiliation":[{"name":"NERCMS, School of Computer Science, Wuhan University, 
China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-7009-9205","authenticated-orcid":false,"given":"Ankang","family":"Lu","sequence":"additional","affiliation":[{"name":"NERCMS, School of Computer Science, Wuhan University, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4882-5787","authenticated-orcid":false,"given":"Baojin","family":"Huang","sequence":"additional","affiliation":[{"name":"NERCMS, School of Computer Science, Wuhan University, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9796-488X","authenticated-orcid":false,"given":"Zhongyuan","family":"Wang","sequence":"additional","affiliation":[{"name":"NERCMS, School of Computer Science, Wuhan University, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-6682-7867","authenticated-orcid":false,"given":"Jiahao","family":"Guo","sequence":"additional","affiliation":[{"name":"NERCMS, School of Computer Science, Wuhan University, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2023,11,7]]},"reference":[{"key":"e_1_3_3_2_2","volume-title":"Proceedings of the TREC Video Retrieval Evaluation","year":"2021","unstructured":"George Awad, Asad Butt, Keith Curtis, Jonathan G. Fiscus, Afzal A. Godil, Yooyoung Lee, Andrew Delgado, Eliot Godard, Baptiste Chocot, Lukas Diduch, Jeffrey Liu, Yvette Graham, Gareth Jones, and Georges Quenot. 2021. Evaluating multiple video understanding and retrieval tasks at TRECVID 2021. In Proceedings of the TREC Video Retrieval Evaluation. 
Retrieved from https:\/\/www-nlpir.nist.gov\/projects\/tvpubs\/tv21.papers\/tv21overview.pdf"},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/FG.2018.00020"},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_3_5_2","first-page":"381","volume-title":"Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV\u201918)","author":"Chao Yu-Wei","year":"2018","unstructured":"Yu-Wei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. 2018. Learning to detect human-object interactions. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV\u201918). IEEE, 381\u2013389."},{"key":"e_1_3_3_6_2","first-page":"9004","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Chen Mingfei","year":"2021","unstructured":"Mingfei Chen, Yue Liao, Si Liu, Zhiyuan Chen, Fei Wang, and Chen Qian. 2021. Reformulating HOI detection as adaptive set prediction. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 9004\u20139013."},{"key":"e_1_3_3_7_2","article-title":"Building a large concept bank for representing events in video","author":"Cui Yin","year":"2014","unstructured":"Yin Cui, Dong Liu, Jiawei Chen, and Shih-Fu Chang. 2014. Building a large concept bank for representing events in video. 
arXiv preprint arXiv:1403.7591 (2014).","journal-title":"arXiv preprint arXiv:1403.7591"},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00525"},{"key":"e_1_3_3_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00482"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00630"},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2526008"},{"key":"e_1_3_3_12_2","first-page":"1","volume-title":"Proceedings of the 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS\u201918)","author":"Galiyawala Hiren","year":"2018","unstructured":"Hiren Galiyawala, Kenil Shah, Vandit Gajjar, and Mehul S. Raval. 2018. Person retrieval in surveillance video using height, color and gender. In Proceedings of the 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS\u201918). IEEE, 1\u20136."},{"key":"e_1_3_3_13_2","doi-asserted-by":"crossref","unstructured":"Cuixiang Guo. 2023. Research on sports video retrieval algorithm based on semantic feature extraction. Multim. Tools Applic . 82 (2023) 21941\u201321955.","DOI":"10.1007\/s11042-020-10178-z"},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2018.2890560"},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00685"},{"key":"e_1_3_3_16_2","first-page":"8401","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"33","author":"He Dongliang","year":"2019","unstructured":"Dongliang He, Zhichao Zhou, Chuang Gan, Fu Li, Xiao Liu, Yandong Li, Limin Wang, and Shilei Wen. 2019. StNet: Local and global spatial-temporal modeling for action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 
8401\u20138408."},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00745"},{"key":"e_1_3_3_18_2","first-page":"p.1168\u20131181","article-title":"Semantic-based surveillance video retrieval","volume":"16","author":"Hu W.","year":"2007","unstructured":"W. Hu, D. Xie, Z. Fu, W. Zeng, and S. Maybank. 2007. Semantic-based surveillance video retrieval. IEEE Trans. Image Process. 16 (2007), p.1168\u20131181.","journal-title":"IEEE Trans. Image Process."},{"key":"e_1_3_3_19_2","first-page":"709","volume-title":"Proceedings of the 16th European Conference on Computer Vision","author":"Huang Qingqiu","year":"2020","unstructured":"Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. 2020. MovieNet: A holistic dataset for movie understanding. In Proceedings of the 16th European Conference on Computer Vision. Springer, 709\u2013727."},{"key":"e_1_3_3_20_2","first-page":"603","volume-title":"Proceedings of the International Conference on Multimedia Retrieval","author":"Iinuma Yuko","year":"2021","unstructured":"Yuko Iinuma and Shin\u2019ichi Satoh. 2021. Video action retrieval using action recognition model. In Proceedings of the International Conference on Multimedia Retrieval. 603\u2013606."},{"key":"e_1_3_3_21_2","volume-title":"Proceedings of the TRECVID Workshop","author":"Jiang Longxiang","year":"2019","unstructured":"Longxiang Jiang, Jingyao Yang, Erxuan Guo, Fan Xia, Ruxing Meng, Jingfeng Luo, Xiangyu Li, Xinyi Yan, Zengmin Xu, and Chao Liang. 2019. WHU-NERCMS at TRECVID2019: Instance search task. In Proceedings of the TRECVID Workshop. 
Retrieved from https:\/\/www-nlpir.nist.gov\/projects\/tvpubs\/tv19.papers\/whu_nercms.pdf"},{"key":"e_1_3_3_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/1282280.1282352"},{"key":"e_1_3_3_23_2","volume-title":"Proceedings of the TRECVID Workshop","year":"2019","unstructured":"Martin Klinkigt, Duy-Dinh Le, Atsushi Hiroike, Hung-Quoc Vo, Mohit Chabra, Vu-Minh-Hieu Dang, Quan Kong, Vinh-Tiep Nguyen, Tomokazu Murakami, Tien-Van Do, Tomoaki Yoshinaga, Duy-Nhat Nguyen, Sinha Saptarshi, Thanh-Duc Ngo, Charles Limasanches, Tushar Agrawal, Jian Manish Vora, Manikandan Ravikiran, Zheng Wang, and Shin'ichi Satoh. 2019. NII Hitachi UIT at TRECVID 2019. In Proceedings of the TRECVID Workshop. Retrieved from https:\/\/www-nlpir.nist.gov\/projects\/tvpubs\/tv19.papers\/nii_hitachi_uit.pdf"},{"key":"e_1_3_3_24_2","volume-title":"Proceedings of the TRECVID Workshop","author":"Le Duy-Dinh","year":"2020","unstructured":"Duy-Dinh Le, Hung-Quoc Vo, Dung-Minh Nguyen, Tien-Van Do, Thinh-Le-Gia Pham, Tri-Le-Minh Vo, Thua-Ngoc Nguyen, Vinh-Tiep Nguyen, Thanh-Duc Ngo, Zheng Wang, and Shin\u2019ichi Satoh. 2020. NII_UIT AT TRECVID 2020. In Proceedings of the TRECVID Workshop. Retrieved from https:\/\/www-nlpir.nist.gov\/projects\/tvpubs\/tv20.papers\/nii_uit.pdf"},{"key":"e_1_3_3_25_2","volume-title":"Proceedings of the TRECVID Workshop","author":"Li Ya","year":"2019","unstructured":"Ya Li, Guanyu Chen, Xiangqian Cheng, Chong Chen, Shaoqiang Xu, Xinyu Li, Xuanlu Xiang, Yanyun Zhao, Zhicheng Zhao, and Fei Su. 2019. BUPT-MCPRL at TRECVID 2019: ActEV and INS. In Proceedings of the TRECVID Workshop. Retrieved from https:\/\/www-nlpir.nist.gov\/projects\/tvpubs\/tv19.papers\/bupt-mcprl.pdf"},{"key":"e_1_3_3_26_2","first-page":"3377","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Liang Chao","year":"2011","unstructured":"Chao Liang, Changsheng Xu, Jian Cheng, and Hanqing Lu. 2011. TVParser: An automatic TV video parsing method. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3377\u20133384."},{"key":"e_1_3_3_27_2","first-page":"482","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Liao Yue","year":"2020","unstructured":"Yue Liao, Si Liu, Fei Wang, Yanjie Chen, Chen Qian, and Jiashi Feng. 2020. PPDM: Parallel point detection and matching for real-time human-object interaction detection. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 482\u2013490."},{"key":"e_1_3_3_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00718"},{"key":"e_1_3_3_29_2","first-page":"3202","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Liu Ze","year":"2022","unstructured":"Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2022. Video swin transformer. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 3202\u20133211."},{"key":"e_1_3_3_30_2","unstructured":"Robert McKee. 2010. Story: Style Structure Substance and the Principles of Screenwriting. HarperCollins e-books."},{"issue":"1","key":"e_1_3_3_31_2","doi-asserted-by":"crossref","first-page":"116","DOI":"10.1109\/TMM.2015.2500734","article-title":"Object instance search in videos via spatio-temporal trajectory discovery","volume":"18","author":"Meng Jingjing","year":"2015","unstructured":"Jingjing Meng, Junsong Yuan, Jiong Yang, Gang Wang, and Yap-Peng Tan. 2015. Object instance search in videos via spatio-temporal trajectory discovery. IEEE Trans. Multim. 18, 1 (2015), 116\u2013127.","journal-title":"IEEE Trans. Multim."},{"key":"e_1_3_3_32_2","article-title":"Efficient estimation of word representations in vector space","author":"Mikolov Tomas","year":"2013","unstructured":"Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. 
Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).","journal-title":"arXiv preprint arXiv:1301.3781"},{"key":"e_1_3_3_33_2","volume-title":"Proceedings of the TRECVID Workshop","author":"Mizuno Sosuke","year":"2020","unstructured":"Sosuke Mizuno and Keiji Yanai. 2020. UEC at TRECVID 2020: INS and ActEV. In Proceedings of the TRECVID Workshop. Retrieved from https:\/\/www-nlpir.nist.gov\/projects\/tvpubs\/tv20.papers\/uec.pdf"},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.omega.2020.102254"},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/MMUL.2006.63"},{"key":"e_1_3_3_36_2","first-page":"82","volume-title":"Proceedings of the 29th International Conference on MultiMedia Modeling","author":"Niu Yanrui","year":"2023","unstructured":"Yanrui Niu, Jingyao Yang, Chao Liang, Baojin Huang, and Zhongyuan Wang. 2023. A spatio-temporal identity verification method for person-action instance search in movies. In Proceedings of the 29th International Conference on MultiMedia Modeling. Springer, 82\u201394."},{"key":"e_1_3_3_37_2","volume-title":"Proceedings of the TRECVID Workshop","year":"2021","unstructured":"Yanrui Niu, Jingyao Yang, Ankang Lu, Baojin Huang, Yue Zhang, Ji Huang, Shishi Wen, Dongshu Xu, Chao Liang, Zhongyuan Wang, and Jun Chen. 2021. WHU-NERCMS at TRECVID2021: Instance search task. In Proceedings of the TRECVID Workshop. Retrieved from https:\/\/www-nlpir.nist.gov\/projects\/tvpubs\/tv21.papers\/whu-nercms.pdf"},{"key":"e_1_3_3_38_2","first-page":"3135","volume-title":"Advances in Neural Information Processing Systems","author":"Ouyang Jianbo","year":"2021","unstructured":"Jianbo Ouyang, Hui Wu, Min Wang, Wengang Zhou, and Houqiang Li. 2021. Contextual similarity aggregation with self-attention for visual re-ranking. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. 
Curran Associates, Inc., 3135\u20133148."},{"key":"e_1_3_3_39_2","doi-asserted-by":"crossref","unstructured":"Omkar M. Parkhi Andrea Vedaldi and Andrew Zisserman. 2015. Deep face recognition. In Proceedings of the British Machine Vision Conference 2015 (BMVC 2015 Swansea UK September 7-10 2015) Xianghua Xie Mark W. Jones and Gary K. L. Tam (Eds.). BMVA Press 41.1\u201341.12.","DOI":"10.5244\/C.29.41"},{"key":"e_1_3_3_40_2","volume-title":"Proceedings of the TRECVID Workshop","author":"Peng Yuxin","year":"2019","unstructured":"Yuxin Peng, Xin Huang, Jinwei Qi, Junjie Zhao, Junchao Zhang, Yunzhen Zhao, Yuxin Yuan, Xiangteng He, and Jian Zhang. 2019. PKU-ICST at TRECVID 2019: Instance search task. In Proceedings of the TRECVID Workshop. Retrieved from https:\/\/www-nlpir.nist.gov\/projects\/tvpubs\/tv19.papers\/pku-icst.pdf"},{"key":"e_1_3_3_41_2","volume-title":"Proceedings of the TRECVID Workshop","author":"Peng Yuxin","year":"2020","unstructured":"Yuxin Peng, Zhaoda Ye, Junchao Zhang, Hongbo Sun, Dejie Yang, and Zhenyu Cui. 2020. PKU_WICT at TRECVID 2020: Instance search task. In Proceedings of the TRECVID Workshop. Retrieved from https:\/\/www-nlpir.nist.gov\/projects\/tvpubs\/tv20.papers\/pku-wict.pdf"},{"key":"e_1_3_3_42_2","volume-title":"Proceedings of the TRECVID Workshop","author":"Peng Yuxin","year":"2021","unstructured":"Yuxin Peng, Zhaoda Ye, Junchao Zhang, Hongbo Sun, Dejie Yang, and Zhenyu Cui. 2021. PKU_WICT at TRECVID 2021: Instance search task. In Proceedings of the TRECVID Workshop. 
Retrieved from https:\/\/www-nlpir.nist.gov\/projects\/tvpubs\/tv21.papers\/pku_wict.pdf"},{"key":"e_1_3_3_43_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4419-9326-7_1"},{"key":"e_1_3_3_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01016"},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"e_1_3_3_46_2","doi-asserted-by":"crossref","first-page":"115","DOI":"10.1007\/978-981-15-7345-3_10","volume-title":"Inventive Communication and Computational Technologies","author":"Shambharkar Prashant Giridhar","year":"2021","unstructured":"Prashant Giridhar Shambharkar, Umesh Kumar Nimesh, Nihal Kumar, Vj Duy Du, and M. N. Doja. 2021. Automatic face recognition and finding occurrence of actors in movies. In Inventive Communication and Computational Technologies. Springer, 115\u2013129."},{"key":"e_1_3_3_47_2","article-title":"Very deep convolutional networks for large-scale image recognition","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).","journal-title":"arXiv preprint arXiv:1409.1556"},{"key":"e_1_3_3_48_2","first-page":"5800","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"34","author":"Siqueira Henrique","year":"2020","unstructured":"Henrique Siqueira, Sven Magg, and Stefan Wermter. 2020. Efficient facial feature learning with wide ensemble-based convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 5800\u20135809."},{"key":"e_1_3_3_49_2","volume-title":"Proceedings of the TRECVID Workshop","author":"Song Yinan","year":"2021","unstructured":"Yinan Song, Wenhao Yang, Zhicheng Zhao, Yanyun Zhao, and Fei Su. 2021. BUPT-MCPRL at TRECVID 2021. In Proceedings of the TRECVID Workshop. 
Retrieved from https:\/\/www-nlpir.nist.gov\/projects\/tvpubs\/tv21.papers\/bupt-mcprl.pdf"},{"key":"e_1_3_3_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00584"},{"key":"e_1_3_3_51_2","first-page":"10410","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Tamura Masato","year":"2021","unstructured":"Masato Tamura, Hiroki Ohashi, and Tomoaki Yoshinaga. 2021. QPIC: Query-based pairwise human-object interaction detection with image-wide contextual information. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 10410\u201310419."},{"key":"e_1_3_3_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_3_53_2","first-page":"527","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"Ulutan Oytun","year":"2020","unstructured":"Oytun Ulutan, Swati Rallapalli, Mudhakar Srivatsa, Carlos Torres, and B. S. Manjunath. 2020. Actor conditioned attention maps for video action detection. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision. 527\u2013536."},{"key":"e_1_3_3_54_2","first-page":"8581","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Vicol Paul","year":"2018","unstructured":"Paul Vicol, Makarand Tapaswi, Lluis Castrejon, and Sanja Fidler. 2018. MovieGraphs: Towards understanding human-centric situations from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8581\u20138590."},{"key":"e_1_3_3_55_2","first-page":"1","volume-title":"Proceedings of the International Conference on Multimedia Analysis and Pattern Recognition (MAPR\u201920)","year":"2020","unstructured":"Hung-Quoc Vo, Dung-Minh Nguyen, Tien Do, Vinh-Tiep Nguyen, Nhat-Duy Nguyen, Thanh Duc Ngo, Duy-Dinh Le, and Shin'ichi Satoh. 2020. 
Searching for desired person doing desired action based on visual and audio feature in large scale video database. In Proceedings of the International Conference on Multimedia Analysis and Pattern Recognition (MAPR\u201920). IEEE, 1\u20136."},{"key":"e_1_3_3_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2019.2956143"},{"key":"e_1_3_3_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00813"},{"key":"e_1_3_3_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/3338533.3366594"},{"key":"e_1_3_3_59_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46478-7_31"},{"key":"e_1_3_3_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2008.917346"},{"key":"e_1_3_3_61_2","unstructured":"Akira Yanagawa, Shih-Fu Chang, Lyndon Kennedy, and Winston Hsu. 2007. Columbia University\u2019s baseline detectors for 374 LSCOM semantic visual concepts. Technical Report. Columbia University. Retrieved from http:\/\/www.ee.columbia.edu\/dvmm\/columbia374"},{"key":"e_1_3_3_62_2","volume-title":"Proceedings of the TRECVID Workshop","author":"Yang Jingyao","year":"2020","unstructured":"Jingyao Yang, Yanrui Niu, Kang\u2019an Chen, Xinyao Fan, and Chao Liang. 2020. WHU-NERCMS at TRECVID2020: Instance search task. In Proceedings of the TRECVID Workshop. Retrieved from https:\/\/www-nlpir.nist.gov\/projects\/tvpubs\/tv20.papers\/whu_nercms.pdf"},{"key":"e_1_3_3_63_2","first-page":"2323","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Yang Wenhao","year":"2021","unstructured":"Wenhao Yang, Yinan Song, Zhicheng Zhao, and Fei Su. 2021. Instance search via fusing hierarchical multi-level retrieval and human-object interaction detection. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 
2323\u20132327."},{"key":"e_1_3_3_64_2","volume-title":"Proceedings of the TRECVID Workshop","author":"Yu En","year":"2019","unstructured":"En Yu, Wenhe Liu, Guoliang Kang, Xiaojun Chang, Jiande Sun, and Alexander Hauptmann. 2019. Inf@TRECVID 2019: Instance search task. In Proceedings of the TRECVID Workshop. Retrieved from https:\/\/www-nlpir.nist.gov\/projects\/tvpubs\/tv19.papers\/inf_ins.pdf"},{"key":"e_1_3_3_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2016.2603342"},{"key":"e_1_3_3_66_2","volume-title":"Proceedings of the TRECVID Workshop","author":"Zhang Qi","year":"2020","unstructured":"Qi Zhang, Jiacheng Zhang, Zhicheng Zhao, Yanyun Zhao, and Fei Su. 2020. BUPT-MCPRL at TRECVID 2020: INS. In Proceedings of the TRECVID Workshop. Retrieved from https:\/\/www-nlpir.nist.gov\/projects\/tvpubs\/tv20.papers\/bupt-mcprl_ins.pdf"},{"key":"e_1_3_3_67_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01216-8_43"}],"container-title":["ACM Transactions on Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3617892","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3617892","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:37:57Z","timestamp":1750178277000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3617892"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,11,7]]},"references-count":66,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,3,31]]}},"alternative-id":["10.1145\/3617892"],"URL":"https:\/\/doi.org\/10.1145\/3617892","relation":{},"ISSN":["1046-8188","1558-2868"],"issn-type":[{"value":"1046-8188","type":"print"},{"value":"1558-2868","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,11,7]]},"assertion":
[{"value":"2022-10-26","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-08-11","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-11-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}