{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T07:25:55Z","timestamp":1740122755225,"version":"3.37.3"},"reference-count":37,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2024,4,10]],"date-time":"2024-04-10T00:00:00Z","timestamp":1712707200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,4,10]],"date-time":"2024-04-10T00:00:00Z","timestamp":1712707200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100017538","name":"Zhejiang Provincial Ten Thousand Plan for Young Top Talents","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100017538","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Zhejiang Provincial Natural Science Foundation of China","award":["LQ22F020007"],"award-info":[{"award-number":["LQ22F020007"]}]},{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62036009, 62276237, 62206250"],"award-info":[{"award-number":["62036009, 62276237, 62206250"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural 
Process Lett"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Point-level weakly-supervised temporal action localization aims to accurately recognize and localize action segments in untrimmed videos, using only point-level annotations during training. Current methods primarily focus on mining sparse pseudo-labels and generating dense pseudo-labels. However, due to the sparsity of point-level labels and the impact of scene information on action representations, the reliability of dense pseudo-label methods remains an issue. In this paper, we propose a point-level weakly-supervised temporal action localization method based on local representation enhancement and global temporal optimization. This method comprises two modules that enhance the representation capacity of action features and improve the reliability of class activation sequence classification, thereby enhancing the reliability of dense pseudo-labels and strengthening the model\u2019s capability for completeness learning. Specifically, we first generate representative action features from pseudo-labeled features and calculate weights based on the feature similarity between these representative features and segment features to adjust the class activation sequence. Additionally, we maintain fixed-length queues for annotated segments and design an inter-video action contrastive learning framework. 
The experimental results demonstrate that our modules enhance the model\u2019s capability for completeness learning, particularly achieving state-of-the-art results at high IoU thresholds.<\/jats:p>","DOI":"10.1007\/s11063-024-11598-w","type":"journal-article","created":{"date-parts":[[2024,4,10]],"date-time":"2024-04-10T05:01:52Z","timestamp":1712725312000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Learning Reliable Dense Pseudo-Labels for Point-Level Weakly-Supervised Action Localization"],"prefix":"10.1007","volume":"56","author":[{"given":"Yuanjie","family":"Dang","sequence":"first","affiliation":[]},{"given":"Guozhu","family":"Zheng","sequence":"additional","affiliation":[]},{"given":"Peng","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Nan","family":"Gao","sequence":"additional","affiliation":[]},{"given":"Ruohong","family":"Huan","sequence":"additional","affiliation":[]},{"given":"Dongdong","family":"Zhao","sequence":"additional","affiliation":[]},{"given":"Ronghua","family":"Liang","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,4,10]]},"reference":[{"key":"11598_CR1","doi-asserted-by":"publisher","first-page":"4307","DOI":"10.1007\/s11063-022-11042-x","volume":"55","author":"M Bi","year":"2023","unstructured":"Bi M, Li J, Liu X, Zhang Q, Yang Z (2023) Action-aware network with upper and lower limit loss for weakly-supervised temporal action localization. Neural Process Lett 55:4307\u20134324","journal-title":"Neural Process Lett"},{"key":"11598_CR2","doi-asserted-by":"crossref","unstructured":"Carreira J, Zisserman A (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 
6299\u20136308","DOI":"10.1109\/CVPR.2017.502"},{"key":"11598_CR3","doi-asserted-by":"publisher","first-page":"3995","DOI":"10.1109\/TIP.2021.3068644","volume":"30","author":"C Chen","year":"2021","unstructured":"Chen C, Wang G, Peng C, Fang Y, Zhang D, Qin H (2021) Exploring rich and efficient spatial temporal interactions for real-time video salient object detection. IEEE Trans Image Process 30:3995\u20134007","journal-title":"IEEE Trans Image Process"},{"key":"11598_CR4","doi-asserted-by":"publisher","first-page":"1090","DOI":"10.1109\/TIP.2019.2934350","volume":"29","author":"C Chen","year":"2019","unstructured":"Chen C, Wang G, Peng C, Zhang X, Qin H (2019) Improved robust video saliency detection based on long-term spatial-temporal information. IEEE Trans Image Process 29:1090\u20131010","journal-title":"IEEE Trans Image Process"},{"key":"11598_CR5","unstructured":"Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. In: Proceedings of the International Conference on Machine Learning (ICML). 1597\u20131607"},{"key":"11598_CR6","doi-asserted-by":"publisher","first-page":"98","DOI":"10.1016\/j.cviu.2016.02.016","volume":"149","author":"D Damen","year":"2016","unstructured":"Damen D, Leelasawassuk T, Mayol-Cuevas W (2016) You-do, i-learn: egocentric unsupervised discovery of objects and their modes of interaction towards video-based guidance. Comput Vis Image Underst 149:98\u2013112","journal-title":"Comput Vis Image Underst"},{"key":"11598_CR7","doi-asserted-by":"crossref","unstructured":"Dou P, Hu H (2023) Complementary attention network for weakly supervised temporal action localization. 
Neural Process Lett 55:6713\u20136732","DOI":"10.1007\/s11063-023-11156-w"},{"key":"11598_CR8","doi-asserted-by":"publisher","first-page":"2545","DOI":"10.1007\/s00530-023-01128-4","volume":"29","author":"Z Fang","year":"2023","unstructured":"Fang Z, Fan J, Yu J (2023) Lpr: learning point-level temporal action localization through re-training. Multimedia Syst 29:2545\u20132562","journal-title":"Multimedia Syst"},{"key":"11598_CR9","doi-asserted-by":"publisher","first-page":"7363","DOI":"10.1109\/TIP.2022.3222623","volume":"31","author":"J Fu","year":"2022","unstructured":"Fu J, Gao J, Xu C (2022) Compact representation and reliable classification learning for point-level weakly-supervised action localization. IEEE Trans Image Process 31:7363\u20137377","journal-title":"IEEE Trans Image Process"},{"key":"11598_CR10","doi-asserted-by":"crossref","unstructured":"He B, Yang X, Kang L, Cheng Z, Zhou X, Shrivastava A (2022) Asm-loc: action-aware segment modeling for weakly-supervised temporal action localization. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 13925\u201313935","DOI":"10.1109\/CVPR52688.2022.01355"},{"key":"11598_CR11","doi-asserted-by":"crossref","unstructured":"He K, Fan H, Wu Y, Xie S, Girshick R (2020) Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (CVPR). 9729\u20139738","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"11598_CR12","doi-asserted-by":"crossref","unstructured":"Huang L, Wang L, Li H (2022) Weakly supervised temporal action localization via representative snippet knowledge propagation. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 
3272\u20133281","DOI":"10.1109\/CVPR52688.2022.00327"},{"key":"11598_CR13","doi-asserted-by":"crossref","unstructured":"Idrees H, Zamir AR, Jiang YG (2017) The THUMOS challenge on action recognition for videos \u201cin the wild.\u201d Comput Vis Image Underst 155:1\u201323","DOI":"10.1016\/j.cviu.2016.10.018"},{"key":"11598_CR14","doi-asserted-by":"crossref","unstructured":"Ju C, Zhao P, Chen S, Zhang Y, Wang Y, Tian Q (2021) Divide and conquer for single-frame temporal action localization. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV). 13455\u201313464","DOI":"10.1109\/ICCV48922.2021.01320"},{"key":"11598_CR15","unstructured":"Ju C, Zhao P, Zhang Y, Wang Y, Tian Q (2020) Point-level temporal action localization: bridging fully-supervised proposals to weakly-supervised losses. arXiv preprint arXiv:2012.08236"},{"key":"11598_CR16","unstructured":"Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980"},{"key":"11598_CR17","doi-asserted-by":"crossref","unstructured":"Lee P, Byun H (2021) Learning action completeness from points for weakly-supervised temporal action localization. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV). 13648\u201313657","DOI":"10.1109\/ICCV48922.2021.01339"},{"key":"11598_CR18","doi-asserted-by":"crossref","unstructured":"Lei P, Todorovic S (2018) Temporal deformable residual networks for action segmentation in videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 6742\u20136751","DOI":"10.1109\/CVPR.2018.00705"},{"key":"11598_CR19","doi-asserted-by":"crossref","unstructured":"Li B, Pan Y, Liu R, Zhu Y (2023) Separately guided context-aware network for weakly supervised temporal action detection. 
Neural Process Lett 55:6269\u20136288","DOI":"10.1007\/s11063-022-11138-4"},{"key":"11598_CR20","doi-asserted-by":"crossref","unstructured":"Li Z, Abu\u00a0Farha Y, Gall J (2021) Temporal action segmentation from timestamp supervision. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 8365\u20138374","DOI":"10.1109\/CVPR46437.2021.00826"},{"key":"11598_CR21","doi-asserted-by":"crossref","unstructured":"Ma F, Zhu L, Yang Y, Zha S, Kundu G, Feiszli M, Shou Z (2020) Sf-net: single-frame supervision for temporal action localization. In: Proceedings of the European Conference on Computer Vision (ECCV). 420\u2013437","DOI":"10.1007\/978-3-030-58548-8_25"},{"key":"11598_CR22","unstructured":"Mamshad Nayeem R, Mittal G, Yu Y, Hall M, Sajeev S, Shah M, Chen M (2023) Pivotal: Prior-driven supervision for weakly-supervised temporal action localization. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 22992\u201323002"},{"key":"11598_CR23","doi-asserted-by":"crossref","unstructured":"Moltisanti D, Fidler S, Damen D (2019) Action recognition from single timestamp supervision in untrimmed videos. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 9915\u20139924","DOI":"10.1109\/CVPR.2019.01015"},{"key":"11598_CR24","doi-asserted-by":"crossref","unstructured":"Ren H, Yang W, Zhang T, Zhang Y (2023) Proposal-based multiple instance learning for weakly-supervised temporal action localization. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2394\u20132404","DOI":"10.1109\/CVPR52729.2023.00237"},{"key":"11598_CR25","doi-asserted-by":"crossref","unstructured":"Shi B, Dai Q, Mu Y, Wang J (2020) Weakly-supervised action localization by generative attention modeling. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 
1009\u20131019","DOI":"10.1109\/CVPR42600.2020.00109"},{"key":"11598_CR26","doi-asserted-by":"crossref","unstructured":"Shi D, Zhong Y, Cao Q, Ma L, Li J, Tao D (2023) Tridet: Temporal action detection with relative boundary modeling. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 18857\u201318866","DOI":"10.1109\/CVPR52729.2023.01808"},{"key":"11598_CR27","doi-asserted-by":"crossref","unstructured":"Shou Z, Gao H, Zhang L, Miyazawa K, Chang S.F (2018) Autoloc: weakly-supervised temporal action localization in untrimmed videos. In: Proceedings of the European Conference on Computer Vision (ECCV). 154\u2013171","DOI":"10.1007\/978-3-030-01270-0_10"},{"key":"11598_CR28","doi-asserted-by":"crossref","unstructured":"Tian Y, Krishnan D, Isola P (2020) Contrastive multiview coding. In: Proceedings of the European Conference on Computer Vision (ECCV). 776\u2013794","DOI":"10.1007\/978-3-030-58621-8_45"},{"key":"11598_CR29","doi-asserted-by":"crossref","unstructured":"Wang G, Chen C, Fan D, Hao A, Qin H (2021) From semantic categories to fixations: a novel weakly-supervised visual-auditory saliency detection approach. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 15119\u201315128","DOI":"10.1109\/CVPR46437.2021.01487"},{"key":"11598_CR30","doi-asserted-by":"crossref","unstructured":"Xu M, Zhao C, Rojas D.S, Thabet A, Ghanem B (2020) G-tad: sub-graph localization for temporal action detection. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10156\u201310165","DOI":"10.1109\/CVPR42600.2020.01017"},{"key":"11598_CR31","doi-asserted-by":"crossref","unstructured":"Xu S, Luo W, Jia X (2023) Graph contrastive learning with constrained graph data augmentation. 
Neural Process Lett 55:10705\u201310726","DOI":"10.1007\/s11063-023-11346-6"},{"key":"11598_CR32","doi-asserted-by":"crossref","unstructured":"Zeng R, Huang W, Tan M, Rong Y, Zhao P, Huang J, Gan C (2019) Graph convolutional networks for temporal action localization. In: Proceedings of the IEEE\/CVF international conference on computer vision (ICCV). 7094\u20137103","DOI":"10.1109\/ICCV.2019.00719"},{"issue":"10","key":"11598_CR33","doi-asserted-by":"publisher","first-page":"6209","DOI":"10.1109\/TPAMI.2021.3090167","volume":"44","author":"R Zeng","year":"2021","unstructured":"Zeng R, Huang W, Tan M, Rong Y, Zhao P, Huang J, Gan C (2021) Graph convolutional module for temporal action localization in videos. IEEE Trans Pattern Anal Mach Intell (TPAMI) 44(10):6209\u20136223","journal-title":"IEEE Trans Pattern Anal Mach Intell (TPAMI)"},{"key":"11598_CR34","doi-asserted-by":"crossref","unstructured":"Zhang C, Cao M, Yang D, Chen J, Zou Y (2021) Cola: weakly-supervised temporal action localization with snippet contrastive learning. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 16010\u201316019","DOI":"10.1109\/CVPR46437.2021.01575"},{"key":"11598_CR35","doi-asserted-by":"crossref","unstructured":"Zhang C.L, Wu J, Li Y (2022) Actionformer: localizing moments of actions with transformers. In: Proceedings of the European Conference on Computer Vision (ECCV). 492\u2013510","DOI":"10.1007\/978-3-031-19772-7_29"},{"key":"11598_CR36","doi-asserted-by":"crossref","unstructured":"Zhang S, Chen F, Zhang J, Liu A, Wang F (2022) Multi-level self-supervised representation learning via triple-way attention fusion and local similarity optimization. Neural Process Lett 55:5763\u20135781","DOI":"10.1007\/s11063-022-11110-2"},{"key":"11598_CR37","doi-asserted-by":"crossref","unstructured":"Zhao P, Xie L, Ju C, Zhang Y, Wang Y, Tian Q (2020) Bottom-up temporal action localization with mutual regularization. 
In: Proceedings of the European Conference on Computer Vision (ECCV). 539\u2013555","DOI":"10.1007\/978-3-030-58598-3_32"}],"container-title":["Neural Processing Letters"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11063-024-11598-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11063-024-11598-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11063-024-11598-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,16]],"date-time":"2024-05-16T20:47:56Z","timestamp":1715892476000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11063-024-11598-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,4,10]]},"references-count":37,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2024,4]]}},"alternative-id":["11598"],"URL":"https:\/\/doi.org\/10.1007\/s11063-024-11598-w","relation":{},"ISSN":["1573-773X"],"issn-type":[{"type":"electronic","value":"1573-773X"}],"subject":[],"published":{"date-parts":[[2024,4,10]]},"assertion":[{"value":"17 March 2024","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 April 2024","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that there is no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflicts of interest"}}],"article-number":"145"}}