{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,17]],"date-time":"2026-05-17T04:24:30Z","timestamp":1778991870959,"version":"3.51.4"},"reference-count":69,"publisher":"Association for Computing Machinery (ACM)","issue":"5","license":[{"start":{"date-parts":[[2025,5,22]],"date-time":"2025-05-22T00:00:00Z","timestamp":1747872000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Science and Technology Program of Guangdong Province","award":["2022B0701180001"],"award-info":[{"award-number":["2022B0701180001"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,5,31]]},"abstract":"<jats:p>\n            <jats:bold>Weakly-Supervised Temporal Action Localization (WTAL)<\/jats:bold>\n            aims to identify the temporal boundaries and classify actions in untrimmed videos using only video-level labels during training. Despite recent progress, many existing approaches primarily follow a localization-by-classification pipeline, treating snippets as independent instances and thus exploiting only limited contextual information. Besides, these methods struggle to capture multi-scale temporal information and neglect both the internal temporal structures within videos and the semantic consistency between videos, resulting in misclassification and inaccurate localization. To address these limitations, we introduce a novel\n            <jats:bold>Temporal and Semantic Correlation Network (TSC-Net)<\/jats:bold>\n            for WTAL task, which can be trained end-to-end. First, we propose a\n            <jats:bold>Multi-Scale Features Integration Pyramid (MFIP)<\/jats:bold>\n            module to integrate multi-scale temporal features, effectively addressing the challenge of missed detections caused by short action durations. Furthermore, we design a\n            <jats:bold>Temporal Correlation Enhancement (TCE)<\/jats:bold>\n            branch to enhance segment correlations by video-level temporal structures to improve the completeness of action localization. Finally, a\n            <jats:bold>Dataset-Wide Semantic Awareness (DSA)<\/jats:bold>\n            branch is designed to construct and propagate a dataset-level action semantics bank, enhancing the model\u2019s awareness of semantic consistency in actions. Extensive experiments show that TSC-Net outperforms most existing WTAL methods, achieving an average mAP of 46.3% on the THUMOS-14 dataset and 26.5% on the ActivityNet1.2 dataset. Detailed ablation studies further confirm the effectiveness of each component in our model. The code and models are publicly available at\n            <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/linkang-els\/TSC-Net-main\">https:\/\/github.com\/linkang-els\/TSC-Net-main<\/jats:ext-link>\n            .\n          <\/jats:p>","DOI":"10.1145\/3721433","type":"journal-article","created":{"date-parts":[[2025,4,17]],"date-time":"2025-04-17T16:11:46Z","timestamp":1744906306000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Temporal and Semantic Correlation Network for Weakly-Supervised Temporal Action Localization"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-6716-9226","authenticated-orcid":false,"given":"Kang","family":"Lin","sequence":"first","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-Sen University, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9237-7205","authenticated-orcid":false,"given":"Wei","family":"Zhou","sequence":"additional","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-Sen University, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5584-1785","authenticated-orcid":false,"given":"Zhijie","family":"Zheng","sequence":"additional","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-Sen University, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5432-8149","authenticated-orcid":false,"given":"Dihu","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Integrated Circuits, Sun Yat-sen University, Shenzhen, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5227-1337","authenticated-orcid":false,"given":"Tao","family":"Su","sequence":"additional","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-Sen University, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,5,22]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00124"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3174344"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/3567828"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01937"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2023.3311447"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2020.2985708"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2020.107686"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3654670"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01355"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475298"},{"key":"e_1_3_1_14_2","first-page":"5410","article-title":"Exploring rich semantics for open-set action recognition","volume":"26","author":"Hu Yufan","year":"2023","unstructured":"Yufan Hu, Junyu Gao, Jianfeng Dong, Bin Fan, and Hongmin Liu. 2023. Exploring rich semantics for open-set action recognition. IEEE Transactions on Multimedia 26 (2023), 5410\u20135421.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3078324"},{"issue":"9","key":"e_1_3_1_16_2","first-page":"5729","article-title":"Two-branch relational prototypical network for weakly supervised temporal action localization","volume":"44","author":"Huang Linjiang","year":"2021","unstructured":"Linjiang Huang, Yan Huang, Wanli Ouyang, and Liang Wang. 2021. Two-branch relational prototypical network for weakly supervised temporal action localization. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 9 (2021), 5729\u20135746.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00790"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00327"},{"key":"e_1_3_1_19_2","first-page":"1637","volume-title":"Proceedings of the 35th AAAI Conference on Artificial Intelligence","author":"Islam Ashraful","year":"2021","unstructured":"Ashraful Islam, Chengjiang Long, and Richard Radke. 2021. A hybrid attention mechanism for weakly-supervised temporal action localization. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, 1637\u20131645."},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3681197"},{"key":"e_1_3_1_21_2","unstructured":"Y.-G. Jiang J. Liu A. Roshan Zamir G. Toderici I. Laptev M. Shah and R. Sukthankar. 2014. THUMOS Challenge: Action Recognition with a Large Number of Classes. Retrieved from http:\/\/crcv.ucf.edu\/THUMOS14\/"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3213478"},{"key":"e_1_3_1_23_2","unstructured":"Chen Ju Peisen Zhao Ya Zhang Yanfeng Wang and Qi Tian. 2020. Point-level temporal action localization: Bridging fully-supervised proposals to weakly-supervised losses. arXiv:2012.08236. Retrieved from https:\/\/arxiv.org\/abs\/2012.08236"},{"key":"e_1_3_1_24_2","first-page":"11320","volume-title":"Proceedings of the 34th AAAI Conference on Artificial Intelligence","author":"Lee Pilhyeon","year":"2020","unstructured":"Pilhyeon Lee, Youngjung Uh, and Hyeran Byun. 2020. Background suppression network for weakly-supervised temporal action localization. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, 11320\u201311327."},{"key":"e_1_3_1_25_2","first-page":"1854","volume-title":"Proceedings of the 35th AAAI Conference on Artificial Intelligence","author":"Lee Pilhyeon","year":"2021","unstructured":"Pilhyeon Lee, Jinglu Wang, Yan Lu, and Hyeran Byun. 2021. Weakly-supervised temporal action localization by uncertainty modeling. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, 1854\u20131862."},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00399"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123343"},{"key":"e_1_3_1_28_2","first-page":"3","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV)","author":"Lin Tianwei","year":"2018","unstructured":"Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. 2018. Bsn: Boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), 3\u201319."},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.106"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3472669"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00139"},{"key":"e_1_3_1_32_2","first-page":"21","volume-title":"Proceedings of the 14th European Conference (Computer Vision\u2013ECCV \u201916), Part I","author":"Liu Wei","year":"2016","unstructured":"Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. Ssd: Single shot multibox detector. In Proceedings of the 14th European Conference (Computer Vision\u2013ECCV \u201916), Part I. Springer, 21\u201337."},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2022.3195321"},{"key":"e_1_3_1_34_2","first-page":"6176","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Liu Yuan","year":"2021","unstructured":"Yuan Liu, Jingyuan Chen, Zhenfang Chen, Bing Deng, Jianqiang Huang, and Hanwang Zhang. 2021. The blessings of unlabeled background in untrimmed videos. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 6176\u20136185."},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00043"},{"key":"e_1_3_1_36_2","first-page":"420","volume-title":"Proceedings of the 16th European Conference (Computer Vision\u2013ECCV \u201920)","author":"Ma Fan","year":"2020","unstructured":"Fan Ma, Linchao Zhu, Yi Yang, Shengxin Zha, Gourab Kundu, Matt Feiszli, and Zheng Shou. 2020. Sf-net: Single-frame supervision for temporal action localization. In Proceedings of the 16th European Conference (Computer Vision\u2013ECCV \u201920), Part IV. Springer, 420\u2013437."},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00706"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV48630.2021.00336"},{"key":"e_1_3_1_39_2","article-title":"Pytorch: An imperative style, high-performance deep learning library","volume":"32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_40_2","first-page":"563","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV)","author":"Paul Sujoy","year":"2018","unstructured":"Sujoy Paul, Sourya Roy, and Amit K. Roy-Chowdhury. 2018. W-talc: Weakly-supervised temporal activity localization and classification. In Proceedings of the European Conference on Computer Vision (ECCV), 563\u2013579."},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3539577"},{"key":"e_1_3_1_42_2","unstructured":"Sanqing Qu Guang Chen Zhijun Li Lijun Zhang Fan Lu and Alois Knoll. 2021. Acm-net: Action context modeling network for weakly-supervised temporal action localization. arXiv:2104.02967. Retrieved from https:\/\/arxiv.org\/abs\/2104.02967"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00109"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2023.3244411"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548077"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.155"},{"key":"e_1_3_1_47_2","first-page":"154","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV)","author":"Shou Zheng","year":"2018","unstructured":"Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa, and Shih-Fu Chang. 2018. Autoloc: Weakly-supervised temporal action localization in untrimmed videos. In Proceedings of the European Conference on Computer Vision (ECCV), 154\u2013171."},{"key":"e_1_3_1_48_2","first-page":"13739","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Sridhar Deepak","year":"2021","unstructured":"Deepak Sridhar, Niamul Quader, Srikanth Muralidharan, Yaoxin Li, Peng Dai, and Juwei Lu. 2021. Class semantics-based attention for action detection. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 13739\u201313748."},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.3044218"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.678"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01810"},{"key":"e_1_3_1_52_2","first-page":"358","volume-title":"European Conference on Computer Vision","author":"Weng Yuetian","year":"2022","unstructured":"Yuetian Weng, Zizheng Pan, Mingfei Han, Xiaojun Chang, and Bohan Zhuang. 2022. An efficient spatio-temporal pyramid transformer for action detection. In European Conference on Computer Vision. Springer, 358\u2013375."},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01351"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1145\/3567827"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01017"},{"key":"e_1_3_1_56_2","first-page":"9070","volume-title":"Proceedings of the 33rd AAAI Conference on Artificial Intelligence","author":"Xu Yunlu","year":"2019","unstructured":"Yunlu Xu, Chengwei Zhang, Zhanzhan Cheng, Jianwen Xie, Yi Niu, Shiliang Pu, and Fei Wu. 2019. Segregated temporal assembly recurrent networks for weakly supervised multiple action detection. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 9070\u20139078."},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3132058"},{"key":"e_1_3_1_58_2","first-page":"53","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Yang Wenfei","year":"2021","unstructured":"Wenfei Yang, Tianzhu Zhang, Xiaoyuan Yu, Tian Qi, Yongdong Zhang, and Feng Wu. 2021. Uncertainty guided collaborative training for weakly supervised temporal action detection. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 53\u201363."},{"key":"e_1_3_1_59_2","first-page":"3090","volume-title":"Proceedings of the 36th AAAI Conference on Artificial Intelligence","author":"Yang Zichen","year":"2022","unstructured":"Zichen Yang, Jie Qin, and Di Huang. 2022. Acgnet: Action complement graph network for weakly-supervised temporal action localization. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, 3090\u20133098."},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2008.05.030"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00719"},{"key":"e_1_3_1_62_2","first-page":"37","volume-title":"Proceedings of the 16th European Conference (Computer Vision\u2013ECCV \u201920)","author":"Zhai Yuanhao","year":"2020","unstructured":"Yuanhao Zhai, Le Wang, Wei Tang, Qilin Zhang, Junsong Yuan, and Gang Hua. 2020. Two-stream consensus network for weakly-supervised temporal action localization. In Proceedings of the 16th European Conference (Computer Vision\u2013ECCV \u201920), Part VI. Springer, 37\u201354."},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01575"},{"key":"e_1_3_1_64_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2024.3379887"},{"key":"e_1_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2022.3185485"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01340"},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-021-01473-9"},{"issue":"3","key":"e_1_3_1_68_2","first-page":"3019","article-title":"Equivalent classification mapping for weakly supervised temporal action localization","volume":"45","author":"Zhao Tao","year":"2022","unstructured":"Tao Zhao, Junwei Han, Le Yang, and Dingwen Zhang. 2022. Equivalent classification mapping for weakly supervised temporal action localization. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 3 (2022), 3019\u20133031.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_69_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.317"},{"key":"e_1_3_1_70_2","first-page":"13516","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Zhu Zixin","year":"2021","unstructured":"Zixin Zhu, Wei Tang, Le Wang, Nanning Zheng, and Gang Hua. 2021. Enriching local and global contexts for temporal action localization. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 13516\u201313525."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3721433","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3721433","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:09:47Z","timestamp":1750295387000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3721433"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,22]]},"references-count":69,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2025,5,31]]}},"alternative-id":["10.1145\/3721433"],"URL":"https:\/\/doi.org\/10.1145\/3721433","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,22]]},"assertion":[{"value":"2024-09-16","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-02-02","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-05-22","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}