{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,22]],"date-time":"2025-11-22T11:25:39Z","timestamp":1763810739927,"version":"3.41.0"},"reference-count":74,"publisher":"Association for Computing Machinery (ACM)","issue":"1s","license":[{"start":{"date-parts":[[2022,1,25]],"date-time":"2022-01-25T00:00:00Z","timestamp":1643068800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Open Research Project of the State Key Laboratory of Media Convergence and Communication, Communication University of China, China","award":["SKLMCC2020KF004"],"award-info":[{"award-number":["SKLMCC2020KF004"]}]},{"DOI":"10.13039\/501100009592","name":"Beijing Municipal Science & Technology Commission","doi-asserted-by":"crossref","award":["Z191100007119002"],"award-info":[{"award-number":["Z191100007119002"]}],"id":[{"id":"10.13039\/501100009592","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Key Research Program of Frontier Sciences, CAS","award":["ZDBS-LY-7024"],"award-info":[{"award-number":["ZDBS-LY-7024"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62006221"],"award-info":[{"award-number":["62006221"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2022,2,28]]},"abstract":"<jats:p>Existing video self-supervised learning methods mainly rely on trimmed videos for model training. They apply their methods and verify the effectiveness on trimmed video datasets including UCF101 and Kinetics-400, among others. However, trimmed datasets are manually annotated from untrimmed videos. In this sense, these methods are not truly unsupervised. In this article, we propose a novel self-supervised method, referred to as Exploring Relations in Untrimmed Videos (ERUV), which can be straightforwardly applied to untrimmed videos (real unlabeled) to learn spatio-temporal features. ERUV first generates single-shot videos by shot change detection. After that, some designed sampling strategies are used to model relations for video clips. The strategies are saved as our self-supervision signals. Finally, the network learns representations by predicting the category of relations between the video clips. ERUV is able to compare the differences and similarities of video clips, which is also an essential procedure for video-related tasks. We validate our learned models with action recognition, video retrieval, and action similarity labeling tasks with four kinds of 3D convolutional neural networks. 
Experimental results show that ERUV learns richer representations from untrimmed videos and outperforms state-of-the-art self-supervised methods by significant margins.<\/jats:p>","DOI":"10.1145\/3473342","type":"journal-article","created":{"date-parts":[[2022,1,25]],"date-time":"2022-01-25T15:06:00Z","timestamp":1643123160000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":14,"title":["Exploring Relations in Untrimmed Videos for Self-Supervised Learning"],"prefix":"10.1145","volume":"18","author":[{"given":"Dezhao","family":"Luo","sequence":"first","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences and University of Chinese Academy of Sciences, Beijing, China"}]},{"given":"Yu","family":"Zhou","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences and University of Chinese Academy of Sciences, Beijing, China"}]},{"given":"Bo","family":"Fang","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences and University of Chinese Academy of Sciences, Beijing, China"}]},{"given":"Yucan","family":"Zhou","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China"}]},{"given":"Dayan","family":"Wu","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China"}]},{"given":"Weiping","family":"Wang","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2022,1,25]]},"reference":[{"doi-asserted-by":"publisher","key":"e_1_3_1_2_2","DOI":"10.1109\/ICCV.2017.73"},{"doi-asserted-by":"publisher","key":"e_1_3_1_3_2","DOI":"10.1007\/978-3-030-01246-5_27"},{"doi-asserted-by":"publisher","key":"e_1_3_1_4_2","DOI":"10.1109\/CVPR42600.2020.00994"},{"doi-asserted-by":"publisher","key":"e_1_3_1_5_2","DOI":"10.1007\/978-3-030-01267-0_47"},{"key":"e_1_3_1_6_2","first-page":"1597","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Chen Ting","year":"2020","unstructured":"Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning.
1597\u20131607."},{"doi-asserted-by":"publisher","key":"e_1_3_1_7_2","DOI":"10.1109\/ICPR48806.2021.9412558"},{"doi-asserted-by":"publisher","key":"e_1_3_1_8_2","DOI":"10.1007\/978-3-030-29894-4_11"},{"doi-asserted-by":"publisher","key":"e_1_3_1_9_2","DOI":"10.1109\/CVPR.2016.278"},{"doi-asserted-by":"publisher","key":"e_1_3_1_10_2","DOI":"10.1109\/ICCV.2015.167"},{"doi-asserted-by":"publisher","key":"e_1_3_1_11_2","DOI":"10.1109\/ICCV.2015.316"},{"doi-asserted-by":"publisher","key":"e_1_3_1_12_2","DOI":"10.1109\/ICCV.2019.00630"},{"doi-asserted-by":"publisher","key":"e_1_3_1_13_2","DOI":"10.1109\/TPAMI.2009.167"},{"doi-asserted-by":"publisher","key":"e_1_3_1_14_2","DOI":"10.1109\/CVPR.2017.607"},{"doi-asserted-by":"publisher","key":"e_1_3_1_15_2","DOI":"10.1109\/CVPR.2018.00586"},{"key":"e_1_3_1_16_2","article-title":"Unsupervised representation learning by predicting image rotations","author":"Gidaris Spyros","year":"2018","unstructured":"Spyros Gidaris, Praveer Singh, and Nikos Komodakis. 2018. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 (2018).","journal-title":"arXiv preprint arXiv:1803.07728"},{"key":"e_1_3_1_17_2","article-title":"Self-supervised co-training for video representation learning","author":"Han Tengda","year":"2020","unstructured":"Tengda Han, Weidi Xie, and Andrew Zisserman. 2020. Self-supervised co-training for video representation learning. arXiv preprint arXiv:2010.09709 (2020).","journal-title":"arXiv preprint arXiv:2010.09709"},{"doi-asserted-by":"publisher","key":"e_1_3_1_18_2","DOI":"10.1109\/CVPR42600.2020.00975"},{"doi-asserted-by":"publisher","key":"e_1_3_1_19_2","DOI":"10.1109\/CVPR.2018.00378"},{"doi-asserted-by":"publisher","key":"e_1_3_1_20_2","DOI":"10.1109\/ICCV.2011.6126543"},{"unstructured":"Yu-Gang Jiang Jingen Liu A. Roshan Zamir George Toderici Ivan Laptev Mubarak Shah and Rahul Sukthankar. 2014. THUMOS challenge: Action recognition with a large number of classes. http:\/\/crcv.ucf.edu\/THUMOS14\/","key":"e_1_3_1_21_2"},{"key":"e_1_3_1_22_2","article-title":"Self-supervised spatiotemporal feature learning via video rotation prediction","author":"Jing Longlong","year":"2018","unstructured":"Longlong Jing, Xiaodong Yang, Jingen Liu, and Yingli Tian. 2018. Self-supervised spatiotemporal feature learning via video rotation prediction. arXiv preprint arXiv:1811.11387 (2018).","journal-title":"arXiv preprint arXiv:1811.11387"},{"key":"e_1_3_1_23_2","article-title":"The Kinetics human action video dataset","author":"Kay Will","year":"2017","unstructured":"Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, et\u00a0al. 2017. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).","journal-title":"arXiv preprint arXiv:1705.06950"},{"doi-asserted-by":"publisher","key":"e_1_3_1_24_2","DOI":"10.1609\/aaai.v33i01.33018545"},{"doi-asserted-by":"publisher","key":"e_1_3_1_25_2","DOI":"10.1109\/TPAMI.2011.209"},{"doi-asserted-by":"publisher","key":"e_1_3_1_26_2","DOI":"10.1109\/CVPR.2019.00202"},{"key":"e_1_3_1_27_2","article-title":"Cycle-contrast for self-supervised video representation learning","author":"Kong Quan","year":"2020","unstructured":"Quan Kong, Wenpeng Wei, Ziwei Deng, Tomoaki Yoshinaga, and Tomokazu Murakami. 2020. Cycle-contrast for self-supervised video representation learning. 
arXiv preprint arXiv:2010.14810 (2020).","journal-title":"arXiv preprint arXiv:2010.14810"},{"doi-asserted-by":"publisher","key":"e_1_3_1_28_2","DOI":"10.5555\/2999134.2999257"},{"doi-asserted-by":"publisher","key":"e_1_3_1_29_2","DOI":"10.1109\/CVPR.2017.96"},{"doi-asserted-by":"publisher","key":"e_1_3_1_30_2","DOI":"10.1109\/ICCV.2017.79"},{"doi-asserted-by":"publisher","key":"e_1_3_1_31_2","DOI":"10.1109\/CVPR.2019.00512"},{"key":"e_1_3_1_32_2","article-title":"MMF: Multi-task multi-structure fusion for hierarchical image classification","author":"Li Xiaoni","year":"2021","unstructured":"Xiaoni Li, Yucan Zhou, Yu Zhou, and Weiping Wang. 2021. MMF: Multi-task multi-structure fusion for hierarchical image classification. arXiv preprint arXiv:2107.00808 (2021).","journal-title":"arXiv preprint arXiv:2107.00808"},{"doi-asserted-by":"publisher","key":"e_1_3_1_33_2","DOI":"10.1609\/aaai.v34i07.6840"},{"doi-asserted-by":"publisher","key":"e_1_3_1_34_2","DOI":"10.1109\/CVPR42600.2020.00674"},{"doi-asserted-by":"publisher","key":"e_1_3_1_35_2","DOI":"10.1007\/978-3-319-46448-0_32"},{"doi-asserted-by":"publisher","key":"e_1_3_1_36_2","DOI":"10.1007\/978-3-319-46466-4_5"},{"key":"e_1_3_1_37_2","article-title":"VideoMoCo: Contrastive video representation learning with temporally adversarial examples","author":"Pan Tian","year":"2021","unstructured":"Tian Pan, Yibing Song, Tianyu Yang, Wenhao Jiang, and Wei Liu. 2021. VideoMoCo: Contrastive video representation learning with temporally adversarial examples. arXiv preprint arXiv:2103.05905 (2021).","journal-title":"arXiv preprint arXiv:2103.05905"},{"doi-asserted-by":"publisher","key":"e_1_3_1_38_2","DOI":"10.1109\/CVPR.2016.278"},{"doi-asserted-by":"publisher","key":"e_1_3_1_39_2","DOI":"10.1109\/CVPR.2019.01018"},{"key":"e_1_3_1_40_2","article-title":"Spatiotemporal contrastive video representation learning","author":"Qian Rui","year":"2020","unstructured":"Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, and Yin Cui. 2020. Spatiotemporal contrastive video representation learning. arXiv preprint arXiv:2008.03800 (2020).","journal-title":"arXiv preprint arXiv:2008.03800"},{"doi-asserted-by":"publisher","key":"e_1_3_1_41_2","DOI":"10.1109\/ICPR48806.2021.9412806"},{"doi-asserted-by":"publisher","key":"e_1_3_1_42_2","DOI":"10.1109\/CVPR42600.2020.01354"},{"doi-asserted-by":"publisher","key":"e_1_3_1_43_2","DOI":"10.1109\/ICASSP39728.2021.9413821"},{"doi-asserted-by":"publisher","key":"e_1_3_1_44_2","DOI":"10.1109\/ICDAR.2019.00095"},{"doi-asserted-by":"publisher","key":"e_1_3_1_45_2","DOI":"10.1007\/s11263-015-0816-y"},{"doi-asserted-by":"publisher","key":"e_1_3_1_46_2","DOI":"10.5555\/2968826.2968890"},{"key":"e_1_3_1_47_2","article-title":"UCF101: A dataset of 101 human actions classes from videos in the wild","author":"Soomro Khurram","year":"2012","unstructured":"Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. 
arXiv preprint arXiv:1212.0402 (2012).","journal-title":"arXiv preprint arXiv:1212.0402"},{"doi-asserted-by":"publisher","key":"e_1_3_1_48_2","DOI":"10.1109\/CVPR.2018.00151"},{"doi-asserted-by":"publisher","key":"e_1_3_1_49_2","DOI":"10.1109\/ICCV.2015.510"},{"doi-asserted-by":"publisher","key":"e_1_3_1_50_2","DOI":"10.1109\/CVPR.2018.00675"},{"doi-asserted-by":"publisher","key":"e_1_3_1_51_2","DOI":"10.5555\/3157096.3157165"},{"doi-asserted-by":"publisher","key":"e_1_3_1_52_2","DOI":"10.1109\/CVPR.2019.00413"},{"doi-asserted-by":"publisher","key":"e_1_3_1_53_2","DOI":"10.1007\/978-3-030-58520-4_30"},{"doi-asserted-by":"publisher","key":"e_1_3_1_54_2","DOI":"10.1109\/CVPR.2018.00155"},{"doi-asserted-by":"publisher","key":"e_1_3_1_55_2","DOI":"10.1109\/CVPR.2015.7299059"},{"doi-asserted-by":"publisher","key":"e_1_3_1_56_2","DOI":"10.1109\/CVPR.2017.678"},{"doi-asserted-by":"publisher","key":"e_1_3_1_57_2","DOI":"10.1007\/978-3-319-46484-8_2"},{"doi-asserted-by":"publisher","key":"e_1_3_1_58_2","DOI":"10.1109\/TMM.2020.2995290"},{"doi-asserted-by":"publisher","key":"e_1_3_1_59_2","DOI":"10.1109\/CVPR.2018.00840"},{"doi-asserted-by":"publisher","key":"e_1_3_1_60_2","DOI":"10.1145\/3231737"},{"doi-asserted-by":"publisher","key":"e_1_3_1_61_2","DOI":"10.1007\/978-3-030-01267-0_19"},{"doi-asserted-by":"publisher","key":"e_1_3_1_62_2","DOI":"10.1109\/CVPR.2019.01058"},{"key":"e_1_3_1_63_2","article-title":"Multi-view correlation distillation for incremental object detection","author":"Yang Dongbao","year":"2021","unstructured":"Dongbao Yang, Yu Zhou, and Weiping Wang. 2021. Multi-view correlation distillation for incremental object detection. arXiv preprint arXiv:2107.01787 (2021).","journal-title":"arXiv preprint arXiv:2107.01787"},{"key":"e_1_3_1_64_2","article-title":"Two-level residual distillation based triple network for incremental object detection","author":"Yang Dongbao","year":"2020","unstructured":"Dongbao Yang, Yu Zhou, Dayan Wu, Can Ma, Fei Yang, and Weiping Wang. 2020. Two-level residual distillation based triple network for incremental object detection. arXiv preprint arXiv:2007.13428 (2020).","journal-title":"arXiv preprint arXiv:2007.13428"},{"doi-asserted-by":"publisher","key":"e_1_3_1_65_2","DOI":"10.1109\/CVPR42600.2020.00658"},{"doi-asserted-by":"publisher","key":"e_1_3_1_66_2","DOI":"10.1109\/CVPR.2016.293"},{"key":"e_1_3_1_67_2","article-title":"Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer","author":"Zagoruyko Sergey","year":"2016","unstructured":"Sergey Zagoruyko and Nikos Komodakis. 2016. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928 (2016).","journal-title":"arXiv preprint arXiv:1612.03928"},{"doi-asserted-by":"publisher","key":"e_1_3_1_68_2","DOI":"10.1109\/CVPR.2016.297"},{"doi-asserted-by":"publisher","key":"e_1_3_1_69_2","DOI":"10.1109\/ICPR48806.2021.9412301"},{"key":"e_1_3_1_70_2","article-title":"Exploring instance relations for unsupervised feature embedding","author":"Zhang Yifei","year":"2021","unstructured":"Yifei Zhang, Yu Zhou, and Weiping Wang. 2021. Exploring instance relations for unsupervised feature embedding. 
arXiv preprint arXiv:2105.03341 (2021).","journal-title":"arXiv preprint arXiv:2105.03341"},{"doi-asserted-by":"publisher","key":"e_1_3_1_71_2","DOI":"10.1145\/3123266.3123451"},{"doi-asserted-by":"publisher","key":"e_1_3_1_72_2","DOI":"10.1109\/TIP.2018.2885238"},{"doi-asserted-by":"publisher","key":"e_1_3_1_73_2","DOI":"10.1007\/978-3-030-01246-5_49"},{"doi-asserted-by":"publisher","key":"e_1_3_1_74_2","DOI":"10.1109\/TIP.2020.3004267"},{"key":"e_1_3_1_75_2","article-title":"Expert training: Task hardness aware meta-learning for few-shot classification","author":"Zhou Yucan","year":"2020","unstructured":"Yucan Zhou, Yu Wang, Jianfei Cai, Yu Zhou, Qinghua Hu, and Weiping Wang. 2020. Expert training: Task hardness aware meta-learning for few-shot classification. arXiv preprint arXiv:2007.06240 (2020).","journal-title":"arXiv preprint arXiv:2007.06240"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3473342","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3473342","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:28:15Z","timestamp":1750195695000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3473342"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,1,25]]},"references-count":74,"journal-issue":{"issue":"1s","published-print":{"date-parts":[[2022,2,28]]}},"alternative-id":["10.1145\/3473342"],"URL":"https:\/\/doi.org\/10.1145\/3473342","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2022,1,25]]},"assertion":[{"value":"2021-01-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-06-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-01-25","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
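The abstract in this record describes ERUV's pretext task only at a high level: sample pairs of clips from single-shot videos under a handful of fixed strategies, then train the network to predict which strategy (relation category) produced each pair. The following minimal Python sketch illustrates how such a self-supervision signal could be constructed. It is a hypothetical illustration under assumed choices, not the paper's actual design: the four-way relation taxonomy and all names (Relation, sample_clip, sample_pair) are assumptions, since the record does not specify the paper's sampling strategies.

```python
# Hypothetical sketch of an ERUV-style self-supervision signal: pairs of clips
# are drawn from (single-shot) videos under fixed sampling strategies, and the
# strategy index serves as the classification label. Illustrative only.
import random
from enum import IntEnum

class Relation(IntEnum):
    OVERLAPPING = 0   # the two clips share frames
    ADJACENT = 1      # the second clip starts where the first ends
    DISTANT = 2       # the clips are separated by a temporal gap
    DIFFERENT = 3     # the clips come from two different shots/videos

def sample_clip(num_frames, clip_len, start):
    """Return the frame indices of a clip of length clip_len starting at start."""
    return list(range(start, min(start + clip_len, num_frames)))

def sample_pair(num_frames, clip_len, relation, other_num_frames=None):
    """Sample a clip pair whose temporal relation matches `relation`."""
    # Leave room for the farthest strategy (DISTANT) after the first clip.
    a_start = random.randrange(0, num_frames - 3 * clip_len)
    clip_a = sample_clip(num_frames, clip_len, a_start)
    if relation is Relation.OVERLAPPING:
        clip_b = sample_clip(num_frames, clip_len, a_start + clip_len // 2)
    elif relation is Relation.ADJACENT:
        clip_b = sample_clip(num_frames, clip_len, a_start + clip_len)
    elif relation is Relation.DISTANT:
        clip_b = sample_clip(num_frames, clip_len, a_start + 2 * clip_len)
    else:  # DIFFERENT: draw the second clip from another video
        n = other_num_frames or num_frames
        clip_b = sample_clip(n, clip_len, random.randrange(0, n - clip_len))
    # The strategy index is the self-supervision label the network must predict.
    return clip_a, clip_b, int(relation)

if __name__ == "__main__":
    rel = random.choice(list(Relation))
    a, b, label = sample_pair(num_frames=300, clip_len=16, relation=rel,
                              other_num_frames=240)
    print(rel.name, label, a[:3], b[:3])
```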