{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,7]],"date-time":"2026-03-07T18:23:27Z","timestamp":1772907807731,"version":"3.50.1"},"reference-count":47,"publisher":"Association for Computing Machinery (ACM)","issue":"8","license":[{"start":{"date-parts":[[2024,6,12]],"date-time":"2024-06-12T00:00:00Z","timestamp":1718150400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62222213, U22B2059, 62072423"],"award-info":[{"award-number":["62222213, U22B2059, 62072423"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,8,31]]},"abstract":"<jats:p>The overwhelming surge of online video platforms has raised an urgent need for social interaction recognition techniques. Compared with simple short-term actions, long-term social interactions in semantic-rich videos could reflect more complicated semantics such as character relationships or emotions, which will better support various downstream applications, e.g., story summarization and fine-grained clip retrieval. However, considering the longer duration of social interactions with severe mutual overlap, involving multiple characters, dynamic scenes, and multi-modal cues, among other factors, traditional solutions for short-term action recognition may probably fail in this task. To address these challenges, in this article, we propose a hierarchical graph-based system, named InteractNet, to recognize social interactions in a multi-modal perspective. 
Specifically, our approach first generates a semantic graph for each sampled frame by integrating multi-modal cues, and then learns the node representations as short-term interaction patterns via an adapted GCN module. Along this line, global interaction representations are accumulated through a sub-clip identification module, effectively filtering out irrelevant information and resolving temporal overlaps between interactions. Finally, the associations among simultaneous interactions are captured and modelled by constructing a global-level character-pair graph to predict the final social interactions. Comprehensive experiments on publicly available datasets demonstrate the effectiveness of our approach compared with state-of-the-art baseline methods.<\/jats:p>","DOI":"10.1145\/3663668","type":"journal-article","created":{"date-parts":[[2024,5,3]],"date-time":"2024-05-03T11:56:37Z","timestamp":1714737397000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["InteractNet: Social Interaction Recognition for Semantic-rich Videos"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-2628-334X","authenticated-orcid":false,"given":"Yuanjie","family":"Lyu","sequence":"first","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1538-0475","authenticated-orcid":false,"given":"Penggang","family":"Qin","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4246-5386","authenticated-orcid":false,"given":"Tong","family":"Xu","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4817-482X","authenticated-orcid":false,"given":"Chen","family":"Zhu","sequence":"additional","affiliation":[{"name":"BOSS Zhipin, Beijing, China and University of Science and Technology of China, Hefei, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4835-4102","authenticated-orcid":false,"given":"Enhong","family":"Chen","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}]}],"member":"320","published-online":{"date-parts":[[2024,6,12]]},"reference":[{"key":"e_1_3_1_2_2","article-title":"YouTube-8M: A large-scale video classification benchmark","author":"Abu-El-Haija Sami","year":"2016","unstructured":"Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016).","journal-title":"arXiv preprint arXiv:1609.08675"},{"key":"e_1_3_1_3_2","first-page":"132","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201918)","author":"Caron Mathilde","year":"2018","unstructured":"Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. 2018. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV\u201918). 132\u2013149."},{"key":"e_1_3_1_4_2","first-page":"6299","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Carreira Joao","year":"2017","unstructured":"Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 
6299\u20136308."},{"key":"e_1_3_1_5_2","first-page":"84","volume-title":"Proceedings of the 4th International Conference on Intelligent Computing and Human-Computer Interaction (ICHCI\u201923)","author":"Dong Wenlong","year":"2023","unstructured":"Wenlong Dong, Zhongchen Ma, Qing Zhu, and Qirong Mao. 2023. Two-stage multi-instance multi-label learning model for video social relationship recognition. In Proceedings of the 4th International Conference on Intelligent Computing and Human-Computer Interaction (ICHCI\u201923). IEEE, 84\u201388."},{"key":"e_1_3_1_6_2","first-page":"3575","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Farha Yazan Abu","year":"2019","unstructured":"Yazan Abu Farha and Jurgen Gall. 2019. MS-TCN: Multi-stage temporal convolutional network for action segmentation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 3575\u20133584."},{"key":"e_1_3_1_7_2","first-page":"1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Felzenszwalb Pedro","year":"2008","unstructured":"Pedro Felzenszwalb, David McAllester, and Deva Ramanan. 2008. A discriminatively trained, multiscale, deformable part model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1\u20138."},{"issue":"2","key":"e_1_3_1_8_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3052930","article-title":"Crowd scene understanding from video: A survey","volume":"13","author":"Grant Jason M.","year":"2017","unstructured":"Jason M. Grant and Patrick J. Flynn. 2017. Crowd scene understanding from video: A survey. ACM Trans. Multim. Comput., Commun. Applic. 13, 2 (2017), 1\u201323.","journal-title":"ACM Trans. Multim. Comput., Commun. 
Applic."},{"key":"e_1_3_1_9_2","first-page":"770","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770\u2013778."},{"issue":"8","key":"e_1_3_1_10_2","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter Sepp","year":"1997","unstructured":"Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long short-term memory. Neural Computat. 9, 8 (1997), 1735\u20131780.","journal-title":"Neural Computat."},{"key":"e_1_3_1_11_2","first-page":"57","volume-title":"Proceedings of the International Conference on Multimedia Modeling","author":"Hu Yibo","year":"2023","unstructured":"Yibo Hu, Chenyu Cao, Fangtao Li, Chenghao Yan, Jinsheng Qi, and Bin Wu. 2023. Overall-distinctive GCN for social relation recognition on videos. In Proceedings of the International Conference on Multimedia Modeling. Springer, 57\u201368."},{"key":"e_1_3_1_12_2","first-page":"425","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201918)","author":"Huang Qingqiu","year":"2018","unstructured":"Qingqiu Huang, Wentao Liu, and Dahua Lin. 2018. Person search in videos with one portrait through visual and temporal links. In Proceedings of the European Conference on Computer Vision (ECCV\u201918). 425\u2013441."},{"key":"e_1_3_1_13_2","first-page":"709","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201920)","author":"Huang Qingqiu","year":"2020","unstructured":"Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. 2020. MovieNet: A holistic dataset for movie understanding. In Proceedings of the European Conference on Computer Vision (ECCV\u201920). 
Springer, 709\u2013727."},{"key":"e_1_3_1_14_2","article-title":"Adam: A method for stochastic optimization","author":"Kingma Diederik P.","year":"2014","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).","journal-title":"arXiv preprint arXiv:1412.6980"},{"key":"e_1_3_1_15_2","article-title":"Semi-supervised classification with graph convolutional networks","author":"Kipf Thomas N.","year":"2016","unstructured":"Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).","journal-title":"arXiv preprint arXiv:1609.02907"},{"issue":"5","key":"e_1_3_1_16_2","doi-asserted-by":"crossref","first-page":"1366","DOI":"10.1007\/s11263-022-01594-9","article-title":"Human action recognition and prediction: A survey","volume":"130","author":"Kong Yu","year":"2022","unstructured":"Yu Kong and Yun Fu. 2022. Human action recognition and prediction: A survey. Int. J. Comput. Vis. 130, 5 (2022), 1366\u20131401.","journal-title":"Int. J. Comput. Vis."},{"issue":"3","key":"e_1_3_1_17_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2000486.2000488","article-title":"Video quality for face detection, recognition, and tracking","volume":"7","author":"Korshunov Pavel","year":"2011","unstructured":"Pavel Korshunov and Wei Tsang Ooi. 2011. Video quality for face detection, recognition, and tracking. ACM Trans. Multim. Comput., Commun. Applic. 7, 3 (2011), 1\u201321.","journal-title":"ACM Trans. Multim. Comput., Commun. Applic."},{"key":"e_1_3_1_18_2","first-page":"9849","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Kukleva Anna","year":"2020","unstructured":"Anna Kukleva, Makarand Tapaswi, and Ivan Laptev. 2020. Learning interactions and relationships between movie characters. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 9849\u20139858."},{"key":"e_1_3_1_19_2","first-page":"156","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Lea Colin","year":"2017","unstructured":"Colin Lea, Michael D. Flynn, Rene Vidal, Austin Reiter, and Gregory D. Hager. 2017. Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 156\u2013165."},{"issue":"3","key":"e_1_3_1_20_2","first-page":"1","article-title":"Social context-aware person search in videos via multi-modal cues","volume":"40","author":"Li Dan","year":"2021","unstructured":"Dan Li, Tong Xu, Peilun Zhou, Weidong He, Yanbin Hao, Yi Zheng, and Enhong Chen. 2021. Social context-aware person search in videos via multi-modal cues. ACM Trans. Inf. Syst. 40, 3 (2021), 1\u201325.","journal-title":"ACM Trans. Inf. Syst."},{"key":"e_1_3_1_21_2","doi-asserted-by":"crossref","unstructured":"Shi-Jie Li Yazan AbuFarha Yun Liu Ming-Ming Cheng and Juergen Gall. 2023. MS-TCN++: Multi-stage temporal convolutional network for action segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 45 6 (2023) 6647--6658.","DOI":"10.1109\/TPAMI.2020.3021756"},{"key":"e_1_3_1_22_2","first-page":"2980","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Lin Tsung-Yi","year":"2017","unstructured":"Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll\u00e1r. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision. 2980\u20132988."},{"issue":"6","key":"e_1_3_1_23_2","doi-asserted-by":"crossref","first-page":"166708","DOI":"10.1007\/s11704-021-1248-1","article-title":"Instance-sequence reasoning for video question answering","volume":"16","author":"Liu Rui","year":"2022","unstructured":"Rui Liu and Yahong Han. 2022. 
Instance-sequence reasoning for video question answering. Front. Comput. Sci. 16, 6 (2022), 166708.","journal-title":"Front. Comput. Sci."},{"key":"e_1_3_1_24_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919)","author":"Liu Xinchen","year":"2019","unstructured":"Xinchen Liu, Wu Liu, Meng Zhang, Jingwen Chen, Lianli Gao, Chenggang Yan, and Tao Mei. 2019. Social relation recognition from videos via multi-scale spatial-temporal reasoning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919)."},{"key":"e_1_3_1_25_2","first-page":"355","volume-title":"Proceedings of the International Conference on Multimedia Modeling","author":"Lv Jinna","year":"2018","unstructured":"Jinna Lv, Wu Liu, Lili Zhou, Bin Wu, and Huadong Ma. 2018. Multi-stream fusion model for social relation recognition from videos. In Proceedings of the International Conference on Multimedia Modeling. Springer, 355\u2013368."},{"issue":"4","key":"e_1_3_1_26_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3458051","article-title":"A new foreground-background based method for behavior-oriented social media image classification","volume":"17","author":"Nandanwar Lokesh","year":"2021","unstructured":"Lokesh Nandanwar, Palaiahnakote Shivakumara, Divya Krishnani, Raghavendra Ramachandra, Tong Lu, Umapada Pal, and Mohan Kankanhalli. 2021. A new foreground-background based method for behavior-oriented social media image classification. ACM Trans. Multim. Comput., Commun. Applic. 17, 4 (2021), 1\u201325.","journal-title":"ACM Trans. Multim. Comput., Commun. 
Applic."},{"issue":"12","key":"e_1_3_1_27_2","doi-asserted-by":"crossref","first-page":"2441","DOI":"10.1109\/TPAMI.2012.24","article-title":"Structured learning of human interactions in TV shows","volume":"34","author":"Patron-Perez Alonso","year":"2012","unstructured":"Alonso Patron-Perez, Marcin Marszalek, Ian Reid, and Andrew Zisserman. 2012. Structured learning of human interactions in TV shows. IEEE Trans. Pattern Anal. Mach. Intell. 34, 12 (2012), 2441\u20132453.","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"issue":"2","key":"e_1_3_1_28_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2659521","article-title":"Social event classification via boosted multimodal supervised latent Dirichlet allocation","volume":"11","author":"Qian Shengsheng","year":"2015","unstructured":"Shengsheng Qian, Tianzhu Zhang, Changsheng Xu, and M. Shamim Hossain. 2015. Social event classification via boosted multimodal supervised latent Dirichlet allocation. ACM Trans. Multim. Comput., Commun. Applic. 11, 2 (2015), 1\u201322.","journal-title":"ACM Trans. Multim. Comput., Commun. Applic."},{"key":"e_1_3_1_29_2","unstructured":"Penggang Qin Shiwei Wu Tong Xu Yanbin Hao Fuli Feng Chen Zhu and Enhong Chen. 2023. When I fall in love: Capturing video-oriented social relationship evolution via attentive GNN. IEEE Trans. Circ. Syst. Vid. Technol. (2023). (Early Access)."},{"key":"e_1_3_1_30_2","article-title":"Making monolingual sentence embeddings multilingual using knowledge distillation","author":"Reimers Nils","year":"2020","unstructured":"Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. 
arXiv preprint arXiv:2004.09813 (2020).","journal-title":"arXiv preprint arXiv:2004.09813"},{"key":"e_1_3_1_31_2","article-title":"Faster R-CNN: Towards real-time object detection with region proposal networks","volume":"28","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 28 (2015).","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"e_1_3_1_32_2","first-page":"4","volume-title":"Proceedings of the IEEE International Conference on Pattern Recognition Workshops","volume":"2","author":"Ryoo Michael S.","year":"2010","unstructured":"Michael S. Ryoo and J. K. Aggarwal. 2010. UT-interaction dataset, ICPR contest on semantic description of human activities (SDHA). In Proceedings of the IEEE International Conference on Pattern Recognition Workshops, Vol. 2. 4."},{"key":"e_1_3_1_33_2","first-page":"1593","volume-title":"Proceedings of the IEEE 12th International Conference on Computer Vision","author":"Ryoo Michael S.","year":"2009","unstructured":"Michael S. Ryoo and Jake K. Aggarwal. 2009. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In Proceedings of the IEEE 12th International Conference on Computer Vision. IEEE, 1593\u20131600."},{"key":"e_1_3_1_34_2","article-title":"Two-stream convolutional networks for action recognition in videos","volume":"27","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 27 (2014).","journal-title":"Adv. Neural Inf. Process. 
Syst."},{"key":"e_1_3_1_35_2","doi-asserted-by":"crossref","first-page":"1377","DOI":"10.1109\/LSP.2022.3181849","article-title":"Learning social relationship from videos via pre-trained multimodal transformer","volume":"29","author":"Teng Yiyang","year":"2022","unstructured":"Yiyang Teng, Chenguang Song, and Bin Wu. 2022. Learning social relationship from videos via pre-trained multimodal transformer. IEEE Sig. Process. Lett. 29 (2022), 1377\u20131381.","journal-title":"IEEE Sig. Process. Lett."},{"key":"e_1_3_1_36_2","first-page":"4489","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Tran Du","year":"2015","unstructured":"Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489\u20134497."},{"key":"e_1_3_1_37_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"e_1_3_1_38_2","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201918)","author":"Vicol Paul","year":"2018","unstructured":"Paul Vicol, Makarand Tapaswi, Lluis Castrejon, and Sanja Fidler. 2018. MovieGraphs: Towards understanding human-centric situations from videos. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201918)."},{"key":"e_1_3_1_39_2","doi-asserted-by":"crossref","first-page":"67","DOI":"10.1145\/3581783.3612175","volume-title":"Proceedings of the 31st ACM International Conference on Multimedia","author":"Wang Haorui","year":"2023","unstructured":"Haorui Wang, Yibo Hu, Yangfu Zhu, Jinsheng Qi, and Bin Wu. 2023. Shifted GCN-GAT and cumulative-transformer based social relation recognition for long videos. In Proceedings of the 31st ACM International Conference on Multimedia. 67\u201376."},{"key":"e_1_3_1_40_2","first-page":"4305","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Wang Limin","year":"2015","unstructured":"Limin Wang, Yu Qiao, and Xiaoou Tang. 2015. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4305\u20134314."},{"key":"e_1_3_1_41_2","first-page":"1884","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Wu Chao-Yuan","year":"2021","unstructured":"Chao-Yuan Wu and Philipp Krahenbuhl. 2021. Towards long-form video understanding. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 1884\u20131894."},{"key":"e_1_3_1_42_2","first-page":"4716","volume-title":"Proceedings of the 29th ACM International Conference on Multimedia","author":"Wu Shiwei","year":"2021","unstructured":"Shiwei Wu, Joya Chen, Tong Xu, Liyi Chen, Lingfei Wu, Yao Hu, and Enhong Chen. 2021. Linking the characters: Video-oriented social graph generation via hierarchical-cumulative GCN. In Proceedings of the 29th ACM International Conference on Multimedia. 
4716\u20134724."},{"key":"e_1_3_1_43_2","first-page":"4592","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Xiong Yu","year":"2019","unstructured":"Yu Xiong, Qingqiu Huang, Lingfeng Guo, Hang Zhou, Bolei Zhou, and Dahua Lin. 2019. A graph-based framework to bridge movies and synopses. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 4592\u20134601."},{"issue":"5","key":"e_1_3_1_44_2","doi-asserted-by":"crossref","first-page":"175612","DOI":"10.1007\/s11704-022-2223-1","article-title":"Quantifying predictability of sequential recommendation via logical constraints","volume":"17","author":"Xu En","year":"2023","unstructured":"En Xu, Zhiwen Yu, Nuo Li, Helei Cui, Lina Yao, and Bin Guo. 2023. Quantifying predictability of sequential recommendation via logical constraints. Front. Comput. Sci. 17, 5 (2023), 175612.","journal-title":"Front. Comput. Sci."},{"issue":"1","key":"e_1_3_1_45_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3416493","article-title":"Socializing the videos: A multimodal approach for social relation recognition","volume":"17","author":"Xu Tong","year":"2021","unstructured":"Tong Xu, Peilun Zhou, Linkang Hu, Xiangnan He, Yao Hu, and Enhong Chen. 2021. Socializing the videos: A multimodal approach for social relation recognition. ACM Trans. Multim. Comput., Commun. Applic. 17, 1 (2021), 1\u201323.","journal-title":"ACM Trans. Multim. Comput., Commun. Applic."},{"key":"e_1_3_1_46_2","first-page":"937","volume-title":"Proceedings of the 22nd ACM International Conference on Multimedia","author":"Xu Yuanlu","year":"2014","unstructured":"Yuanlu Xu, Bingpeng Ma, Rui Huang, and Liang Lin. 2014. Person search in a scene by jointly modeling people commonness and person uniqueness. In Proceedings of the 22nd ACM International Conference on Multimedia. 
937\u2013940."},{"key":"e_1_3_1_47_2","first-page":"358","volume-title":"Proceedings of the International Conference on Multimedia Retrieval","author":"Yan Chenghao","year":"2021","unstructured":"Chenghao Yan, Zihe Liu, Fangtao Li, Chenyu Cao, Zheng Wang, and Bin Wu. 2021. Social relation analysis from videos via multi-entity reasoning. In Proceedings of the International Conference on Multimedia Retrieval. 358\u2013366."},{"key":"e_1_3_1_48_2","doi-asserted-by":"crossref","unstructured":"Bolei Zhou Agata Lapedriza Aditya Khosla Aude Oliva and Antonio Torralba. 2018. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40 6 (2018) 1452--1464.","DOI":"10.1109\/TPAMI.2017.2723009"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3663668","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3663668","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:57:59Z","timestamp":1750294679000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3663668"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,6,12]]},"references-count":47,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2024,8,31]]}},"alternative-id":["10.1145\/3663668"],"URL":"https:\/\/doi.org\/10.1145\/3663668","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,6,12]]},"assertion":[{"value":"2023-07-31","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2024-04-24","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-06-12","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}