{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,16]],"date-time":"2026-04-16T03:37:59Z","timestamp":1776310679731,"version":"3.50.1"},"reference-count":62,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2023,2,25]],"date-time":"2023-02-25T00:00:00Z","timestamp":1677283200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62076183, 61936014 and 61976159"],"award-info":[{"award-number":["62076183, 61936014 and 61976159"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100007219","name":"Natural Science Foundation of Shanghai","doi-asserted-by":"crossref","award":["20ZR1473500, 19ZR1461200"],"award-info":[{"award-number":["20ZR1473500, 19ZR1461200"]}],"id":[{"id":"10.13039\/100007219","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Shanghai Innovation Action Project of Science and Technology","award":["20511100700"],"award-info":[{"award-number":["20511100700"]}]},{"name":"National Key Research and Development Project","award":["2019YFB2102300"],"award-info":[{"award-number":["2019YFB2102300"]}]},{"name":"Shanghai Municipal Science and Technology Major Project","award":["2021SHZDZX0100"],"award-info":[{"award-number":["2021SHZDZX0100"]}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. 
Appl."],"published-print":{"date-parts":[[2023,8,31]]},"abstract":"<jats:p>Weakly supervised action localization is a challenging problem in video understanding and action recognition. Existing models usually formulate the training process as direct classification using video-level supervision. They tend to only locate the most discriminative parts of action instances and produce temporally incomplete detection results. A natural solution to this problem, the adversarial erasing strategy, is to remove such parts from training so that models can attend to complementary parts. Previous works do it in an offline and heuristic way. They adopt a multi-stage pipeline, where discriminative regions are determined and erased under the guidance of detection results from the previous stage. Such a pipeline can be both ineffective and inefficient, possibly hindering the overall performance. In contrast, we combine adversarial erasing with the dropout mechanism and propose a Temporal Dropout Module that learns where to remove in a data-driven and online manner. This plug-and-play module is trained without iterative stages, which not only simplifies the pipeline but also makes the regularization during training easier and more adaptive. Experiments show that the proposed method outperforms previous erasing-based methods by a large margin. 
More importantly, it achieves universal improvement when plugged into various direct classification methods and obtains state-of-the-art performance.<\/jats:p>","DOI":"10.1145\/3567827","type":"journal-article","created":{"date-parts":[[2022,11,7]],"date-time":"2022-11-07T11:58:07Z","timestamp":1667822287000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["Temporal Dropout for Weakly Supervised Action Localization"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5808-1742","authenticated-orcid":false,"given":"Chi","family":"Xie","sequence":"first","affiliation":[{"name":"Tongji University, Jiading Qu, Shanghai Shi, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2242-4918","authenticated-orcid":false,"given":"Zikun","family":"Zhuang","sequence":"additional","affiliation":[{"name":"Tongji University, Jiading Qu, Shanghai Shi, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4301-394X","authenticated-orcid":false,"given":"Shengjie","family":"Zhao","sequence":"additional","affiliation":[{"name":"Tongji University, Jiading Qu, Shanghai Shi, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0457-6093","authenticated-orcid":false,"given":"Shuang","family":"Liang","sequence":"additional","affiliation":[{"name":"Tongji University, Jiading Qu, Shanghai Shi, China"}]}],"member":"320","published-online":{"date-parts":[[2023,2,25]]},"reference":[{"key":"e_1_3_2_2_2","first-page":"751","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Arnab Anurag","year":"2020","unstructured":"Anurag Arnab, Chen Sun, Arsha Nagrani, and Cordelia Schmid. 2020. Uncertainty-aware weakly supervised action detection from untrimmed videos. In Proceedings of the European Conference on Computer Vision. 
Springer, 751\u2013768."},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00124"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2959977"},{"key":"e_1_3_2_7_2","first-page":"10727","volume-title":"Advances in Neural Information Processing Systems","author":"Ghiasi Golnaz","year":"2018","unstructured":"Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. 2018. Dropblock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems. 10727\u201310737."},{"key":"e_1_3_2_8_2","first-page":"11053","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"34","author":"Huang Linjiang","year":"2020","unstructured":"Linjiang Huang, Yan Huang, Wanli Ouyang, and Liang Wang. 2020. Relational prototypical network for weakly supervised temporal action localization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 11053\u201311060."},{"key":"e_1_3_2_9_2","first-page":"8002","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Huang Linjiang","year":"2021","unstructured":"Linjiang Huang, Liang Wang, and Hongsheng Li. 2021. Foreground-action consistency network for weakly supervised temporal action localization. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 8002\u20138011."},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2016.10.018"},{"key":"e_1_3_2_11_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Jang Eric","year":"2017","unstructured":"Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with gumbel-softmax. 
In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_12_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Kingma Diederik P.","year":"2015","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_13_2","first-page":"2575","volume-title":"Advances in Neural Information Processing Systems","author":"Kingma Durk P.","year":"2015","unstructured":"Durk P. Kingma, Tim Salimans, and Max Welling. 2015. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems. 2575\u20132583."},{"key":"e_1_3_2_14_2","first-page":"13648","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Lee Pilhyeon","year":"2021","unstructured":"Pilhyeon Lee and Hyeran Byun. 2021. Learning action completeness from points for weakly-supervised temporal action localization. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 13648\u201313657."},{"key":"e_1_3_2_15_2","first-page":"11320","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"34","author":"Lee Pilhyeon","year":"2020","unstructured":"Pilhyeon Lee, Youngjung Uh, and Hyeran Byun. 2020. Background suppression network for weakly-supervised temporal action localization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 11320\u201311327."},{"key":"e_1_3_2_16_2","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"2","author":"Lee Pilhyeon","year":"2021","unstructured":"Pilhyeon Lee, Jinglu Wang, Yan Lu, and Hyeran Byun. 2021. Weakly-supervised temporal action localization by uncertainty modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 
2."},{"key":"e_1_3_2_17_2","first-page":"3889","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Lin Tianwei","year":"2019","unstructured":"Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei Wen. 2019. Bmn: Boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 3889\u20133898."},{"key":"e_1_3_2_18_2","first-page":"1298","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Liu Daochang","year":"2019","unstructured":"Daochang Liu, Tingting Jiang, and Yizhou Wang. 2019. Completeness modeling and context separation for weakly supervised temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1298\u20131307."},{"key":"e_1_3_2_19_2","first-page":"3899","volume-title":"Proceedings of the International Conference on Computer Vision","author":"Liu Ziyi","year":"2019","unstructured":"Ziyi Liu, Le Wang, Qilin Zhang, Zhanning Gao, Zhenxing Niu, Nanning Zheng, and Gang Hua. 2019. Weakly supervised temporal action localization through contrast based evaluation networks. In Proceedings of the International Conference on Computer Vision. 3899\u20133908."},{"key":"e_1_3_2_20_2","doi-asserted-by":"crossref","unstructured":"Ziyi Liu Le Wang Qilin Zhang Wei Tang Junsong Yuan Nanning Zheng and Gang Hua. 2021. ACSNet: Action-context separation network for weakly supervised temporal action localization. In Proceedings of the AAAI Conference on Artificial Intelligence . Vol. 35 2233\u20132241.","DOI":"10.1609\/aaai.v35i3.16322"},{"key":"e_1_3_2_21_2","first-page":"344","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Long Fuchen","year":"2019","unstructured":"Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, and Tao Mei. 2019. 
Gaussian temporal awareness networks for action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 344\u2013353."},{"key":"e_1_3_2_22_2","first-page":"9969","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Luo Wang","year":"2021","unstructured":"Wang Luo, Tianzhu Zhang, Wenfei Yang, Jingen Liu, Tao Mei, Feng Wu, and Yongdong Zhang. 2021. Action unit memory network for weakly supervised temporal action localization. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 9969\u20139979."},{"key":"e_1_3_2_23_2","first-page":"729","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Luo Zhekun","year":"2020","unstructured":"Zhekun Luo, Devin Guillory, Baifeng Shi, Wei Ke, Fang Wan, Trevor Darrell, and Huijuan Xu. 2020. Weakly-supervised action localization with expectation-maximization multi-instance learning. In Proceedings of the European Conference on Computer Vision. Springer, 729\u2013745."},{"key":"e_1_3_2_24_2","first-page":"420","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Ma Fan","year":"2020","unstructured":"Fan Ma, Linchao Zhu, Yi Yang, Shengxin Zha, Gourab Kundu, Matt Feiszli, and Zheng Shou. 2020. Sf-net: Single-frame supervision for temporal action localization. In Proceedings of the European Conference on Computer Vision. Springer, 420\u2013437."},{"key":"e_1_3_2_25_2","first-page":"7587","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Ma Junwei","year":"2021","unstructured":"Junwei Ma, Satya Krishna Gorti, Maksims Volkovs, and Guangwei Yu. 2021. Weakly supervised action selection learning in video. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 
7587\u20137596."},{"key":"e_1_3_2_26_2","first-page":"9915","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Moltisanti Davide","year":"2019","unstructured":"Davide Moltisanti, Sanja Fidler, and Dima Damen. 2019. Action recognition from single timestamp supervision in untrimmed videos. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 9915\u20139924."},{"key":"e_1_3_2_27_2","first-page":"13608","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Narayan Sanath","year":"2021","unstructured":"Sanath Narayan, Hisham Cholakkal, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. 2021. D2-Net: Weakly-supervised action localization via discriminative embeddings and denoised activations. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 13608\u201313617."},{"key":"e_1_3_2_28_2","first-page":"8679","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Narayan Sanath","year":"2019","unstructured":"Sanath Narayan, Hisham Cholakkal, Fahad Shahbaz Khan, and Ling Shao. 2019. 3c-net: Category count and center loss for weakly-supervised action localization. In Proceedings of the IEEE International Conference on Computer Vision. 8679\u20138687."},{"key":"e_1_3_2_29_2","first-page":"6752","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Nguyen Phuc","year":"2018","unstructured":"Phuc Nguyen, Ting Liu, Gautam Prasad, and Bohyung Han. 2018. Weakly supervised action localization by sparse temporal pooling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 
6752\u20136761."},{"key":"e_1_3_2_30_2","first-page":"5502","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Nguyen Phuc Xuan","year":"2019","unstructured":"Phuc Xuan Nguyen, Deva Ramanan, and Charless C. Fowlkes. 2019. Weakly-supervised action localization with background modeling. In Proceedings of the IEEE International Conference on Computer Vision. 5502\u20135511."},{"key":"e_1_3_2_31_2","first-page":"3319","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"Pardo Alejandro","year":"2021","unstructured":"Alejandro Pardo, Humam Alwassel, Fabian Caba, Ali Thabet, and Bernard Ghanem. 2021. Refineloc: Iterative refinement for weakly-supervised action localization. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision. 3319\u20133328."},{"key":"e_1_3_2_32_2","first-page":"563","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201918)","author":"Paul Sujoy","year":"2018","unstructured":"Sujoy Paul, Sourya Roy, and Amit K. Roy-Chowdhury. 2018. W-talc: Weakly-supervised temporal activity localization and classification. In Proceedings of the European Conference on Computer Vision (ECCV\u201918). 563\u2013579."},{"key":"e_1_3_2_33_2","first-page":"485","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Qing Zhiwu","year":"2021","unstructured":"Zhiwu Qing, Haisheng Su, Weihao Gan, Dongliang Wang, Wei Wu, Xiang Wang, Yu Qiao, Junjie Yan, Changxin Gao, and Nong Sang. 2021. Temporal context aggregation network for temporal action proposal refinement. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 
485\u2013494."},{"key":"e_1_3_2_34_2","article-title":"ACM-Net: Action context modeling network for weakly-supervised temporal action localization","author":"Qu Sanqing","year":"2021","unstructured":"Sanqing Qu, Guang Chen, Zhijun Li, Lijun Zhang, Fan Lu, and Alois Knoll. 2021. ACM-Net: Action context modeling network for weakly-supervised temporal action localization. arXiv:2104.02967. Retrieved from https:\/\/arxiv.org\/abs\/2104.02967.","journal-title":"arXiv:2104.02967"},{"key":"e_1_3_2_35_2","first-page":"10598","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Ren Zhongzheng","year":"2020","unstructured":"Zhongzheng Ren, Zhiding Yu, Xiaodong Yang, Ming-Yu Liu, Yong Jae Lee, Alexander G Schwing, and Jan Kautz. 2020. Instance-aware, context-focused, and memory-efficient weakly supervised object detection. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 10598\u201310607."},{"key":"e_1_3_2_36_2","first-page":"1009","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Shi Baifeng","year":"2020","unstructured":"Baifeng Shi, Qi Dai, Yadong Mu, and Jingdong Wang. 2020. Weakly-supervised action localization by generative attention modeling. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 1009\u20131019."},{"issue":"3","key":"e_1_3_2_37_2","first-page":"1","article-title":"Shuffle-invariant network for action recognition in videos","volume":"18","author":"Shi Qinghongya","year":"2022","unstructured":"Qinghongya Shi, Hong-Bo Zhang, Zhe Li, Ji-Xiang Du, Qing Lei, and Jing-Hua Liu. 2022. Shuffle-invariant network for action recognition in videos. ACM Trans. Multimedia Comput. Commun. Appl. 18, 3 (2022), 1\u201318.","journal-title":"ACM Trans. Multimedia Comput. Commun. 
Appl."},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.155"},{"key":"e_1_3_2_39_2","first-page":"154","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201918)","author":"Shou Zheng","year":"2018","unstructured":"Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa, and Shih-Fu Chang. 2018. Autoloc: Weakly-supervised temporal action localization in untrimmed videos. In Proceedings of the European Conference on Computer Vision (ECCV\u201918). 154\u2013171."},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.119"},{"key":"e_1_3_2_41_2","first-page":"568","volume-title":"Advances in Neural Information Processing Systems","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems. 568\u2013576."},{"key":"e_1_3_2_42_2","first-page":"3544","volume-title":"Proceedings of the IEEE International Conference on Computer Vision (ICCV\u201917)","author":"Singh Krishna Kumar","year":"2017","unstructured":"Krishna Kumar Singh and Yong Jae Lee. 2017. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV\u201917). IEEE, 3544\u20133553."},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.5555\/2627435.2670313"},{"key":"e_1_3_2_44_2","first-page":"558","volume-title":"Proceedings of the Asian Conference on Computer Vision","author":"Su Haisheng","year":"2018","unstructured":"Haisheng Su, Xu Zhao, and Tianwei Lin. 2018. Cascaded pyramid mining network for weakly supervised temporal action localization. In Proceedings of the Asian Conference on Computer Vision. 
Springer, 558\u2013574."},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298664"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.678"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.687"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.617"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.3016486"},{"key":"e_1_3_2_50_2","doi-asserted-by":"crossref","unstructured":"Zichen Yang Jie Qin and Di Huang. 2022. ACGNet: Action complement graph network for weakly-supervised temporal action localization. In Proceedings of the AAAI Conference on Artificial Intelligence . Vol. 36 3090\u20133098.","DOI":"10.1609\/aaai.v36i3.20216"},{"key":"e_1_3_2_51_2","first-page":"5522","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Yu Tan","year":"2019","unstructured":"Tan Yu, Zhou Ren, Yuncheng Li, Enxu Yan, Ning Xu, and Junsong Yuan. 2019. Temporal structure mining for weakly supervised action detection. In Proceedings of the IEEE International Conference on Computer Vision. 5522\u20135531."},{"key":"e_1_3_2_52_2","doi-asserted-by":"crossref","first-page":"214","DOI":"10.1007\/978-3-540-74936-3_22","volume-title":"Proceedings of the Joint Pattern Recognition Symposium","author":"Zach Christopher","year":"2007","unstructured":"Christopher Zach, Thomas Pock, and Horst Bischof. 2007. A duality based approach for realtime tv-l 1 optical flow. In Proceedings of the Joint Pattern Recognition Symposium. Springer, 214\u2013223."},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2019.2922108"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00719"},{"key":"e_1_3_2_55_2","doi-asserted-by":"crossref","unstructured":"Runhao Zeng Wenbing Huang Mingkui Tan Yu Rong Peilin Zhao Junzhou Huang and Chuang Gan. 2021. 
Graph convolutional module for temporal action localization in videos. IEEE Trans. Pattern Anal. Mach. Intell. 44 (2021) 6209\u20136223.","DOI":"10.1109\/TPAMI.2021.3090167"},{"key":"e_1_3_2_56_2","first-page":"37","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Zhai Yuanhao","year":"2020","unstructured":"Yuanhao Zhai, Le Wang, Wei Tang, Qilin Zhang, Junsong Yuan, and Gang Hua. 2020. Two-stream consensus network for weakly-supervised temporal action localization. In Proceedings of the European Conference on Computer Vision. Springer, 37\u201354."},{"key":"e_1_3_2_57_2","doi-asserted-by":"crossref","unstructured":"Yuanhao Zhai Le Wang Wei Tang Qilin Zhang Nanning Zheng and Gang Hua. 2021. Action coherence network for weakly-supervised temporal action localization. IEEE Trans. Multimedia 24 (2021) 1857\u20131870.","DOI":"10.1109\/TMM.2021.3073235"},{"key":"e_1_3_2_58_2","first-page":"16010","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhang Can","year":"2021","unstructured":"Can Zhang, Meng Cao, Dongming Yang, Jie Chen, and Yuexian Zou. 2021. Cola: Weakly-supervised temporal action localization with snippet contrastive learning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 16010\u201316019."},{"issue":"10","key":"e_1_3_2_59_2","doi-asserted-by":"crossref","first-page":"2610","DOI":"10.1109\/TMM.2019.2959425","article-title":"Glnet: Global local network for weakly supervised action localization","volume":"22","author":"Zhang Shiwei","year":"2019","unstructured":"Shiwei Zhang, Lin Song, Changxin Gao, and Nong Sang. 2019. Glnet: Global local network for weakly supervised action localization. IEEE Trans. Multimedia 22, 10 (2019), 2610\u20132622.","journal-title":"IEEE Trans. 
Multimedia"},{"key":"e_1_3_2_60_2","first-page":"539","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Zhao Peisen","year":"2020","unstructured":"Peisen Zhao, Lingxi Xie, Chen Ju, Ya Zhang, Yanfeng Wang, and Qi Tian. 2020. Bottom-up temporal action localization with mutual regularization. In Proceedings of the European Conference on Computer Vision. Springer, 539\u2013555."},{"key":"e_1_3_2_61_2","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1145\/3240508.3240511","volume-title":"Proceedings of the 26th ACM International Conference on Multimedia","author":"Zhong Jia-Xing","year":"2018","unstructured":"Jia-Xing Zhong, Nannan Li, Weijie Kong, Tao Zhang, Thomas H. Li, and Ge Li. 2018. Step-by-step erasion, one-by-one collection: A weakly supervised temporal action detector. In Proceedings of the 26th ACM International Conference on Multimedia. 35\u201344."},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.319"},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.1145\/3361845"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and 
Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3567827","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3567827","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:51:13Z","timestamp":1750182673000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3567827"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,25]]},"references-count":62,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,8,31]]}},"alternative-id":["10.1145\/3567827"],"URL":"https:\/\/doi.org\/10.1145\/3567827","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,25]]},"assertion":[{"value":"2022-03-20","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-09-25","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-02-25","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}