{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,8]],"date-time":"2026-02-08T04:10:41Z","timestamp":1770523841811,"version":"3.49.0"},"reference-count":42,"publisher":"Springer Science and Business Media LLC","issue":"6","license":[{"start":{"date-parts":[[2024,7,19]],"date-time":"2024-07-19T00:00:00Z","timestamp":1721347200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,7,19]],"date-time":"2024-07-19T00:00:00Z","timestamp":1721347200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Fundamental Research Funds for Central Universities, North Minzu University","award":["2022QNPY18"],"award-info":[{"award-number":["2022QNPY18"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Complex Intell. Syst."],"published-print":{"date-parts":[[2024,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Most existing 3D action recognition works rely on the supervised learning paradigm, yet the limited availability of annotated data limits the full potential of encoding networks. As a result, effective self-supervised pre-training strategies have been actively researched. In this paper, we target to explore a self-supervised learning approach for 3D action recognition, and propose the Attention-guided Mask Learning (AML) scheme. Specifically, the dropping mechanism is introduced into contrastive learning to develop Attention-guided Mask (AM) module as well as mask learning strategy, respectively. The AM module leverages the spatial and temporal attention to guide the corresponding features masking, so as to produce the masked contrastive object. The mask learning strategy enables the model to discriminate different actions even with important features masked, which makes action representation learning more discriminative. What\u2019s more, to alleviate the strict positive constraint that would hinder representation learning, the positive-enhanced learning strategy is leveraged in the second-stage training. Extensive experiments on NTU-60, NTU-120, and PKU-MMD datasets show that the proposed AML scheme improves the performance in self-supervised 3D action recognition, achieving state-of-the-art results.<\/jats:p>","DOI":"10.1007\/s40747-024-01558-1","type":"journal-article","created":{"date-parts":[[2024,7,19]],"date-time":"2024-07-19T05:01:45Z","timestamp":1721365305000},"page":"7487-7496","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Attention-guided mask learning for self-supervised 3D action recognition"],"prefix":"10.1007","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6895-3578","authenticated-orcid":false,"given":"Haoyuan","family":"Zhang","sequence":"first","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,7,19]]},"reference":[{"key":"1558_CR1","first-page":"16","volume":"14","author":"S Berretti","year":"2018","unstructured":"Berretti S, Daoudi M, Turaga P, Basu A (2018) Representation, analysis, and recognition of 3d humans: a survey. ACM Trans Multimed Comput Commun Appl (TOMM) 14:16","journal-title":"ACM Trans Multimed Comput Commun Appl (TOMM)"},{"key":"1558_CR2","doi-asserted-by":"crossref","unstructured":"Caetano C, Br\u00e9mond F, Schwartz WR (2019) Skeleton image representation for 3d action recognition based on tree structure and reference joints. In: 2019 32nd SIBGRAPI conference on graphics, patterns and images (SIBGRAPI). IEEE, pp 16\u201323","DOI":"10.1109\/SIBGRAPI.2019.00011"},{"key":"1558_CR3","unstructured":"Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. In: International conference on machine learning. PMLR, pp 1597\u20131607"},{"key":"1558_CR4","doi-asserted-by":"crossref","unstructured":"Chen X, He K (2021) Exploring simple siamese representation learning. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. IEEE, pp 15750\u201315758","DOI":"10.1109\/CVPR46437.2021.01549"},{"key":"1558_CR5","doi-asserted-by":"crossref","unstructured":"Cheng K, Zhang Y, Cao C, Shi L, Cheng J, Lu H (2020) Decoupling gcn with dropgraph module for skeleton-based action recognition. In: European conference on computer vision. Springer, pp 536\u2013553","DOI":"10.1007\/978-3-030-58586-0_32"},{"key":"1558_CR6","doi-asserted-by":"crossref","unstructured":"Dong J, Sun S, Liu Z, Chen S, Liu B, Wang X (2023) Hierarchical contrast for unsupervised skeleton-based action representation learning. In: Proceedings of the AAAI conference on artificial intelligence. Springer, pp 525\u2013533","DOI":"10.1609\/aaai.v37i1.25127"},{"key":"1558_CR7","unstructured":"Grill JB, Strub F, Altch\u00e9 F, Tallec C, Richemond PH, Buchatskaya E, Doersch C, Pires BA, Guo ZD, Azar MG et al (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733"},{"key":"1558_CR8","doi-asserted-by":"crossref","unstructured":"Gui LY, Wang YX, Liang X, Moura JM (2018) Adversarial geometry-aware human motion prediction. In: Proceedings of the European conference on computer vision (ECCV). pp 786\u2013803","DOI":"10.1007\/978-3-030-01225-0_48"},{"key":"1558_CR9","unstructured":"Gutmann MU, Hyv\u00e4rinen A (2012) Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J Mach Learn Res 13:307\u2013361"},{"key":"1558_CR10","doi-asserted-by":"crossref","unstructured":"He K, Fan H, Wu Y, Xie S, Girshick R (2020) Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. pp 9729\u20139738","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"1558_CR11","doi-asserted-by":"crossref","unstructured":"Hou Q, Zhou D, Feng J (2021) Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. pp 13713\u201313722","DOI":"10.1109\/CVPR46437.2021.01350"},{"key":"1558_CR12","doi-asserted-by":"publisher","first-page":"4293","DOI":"10.1007\/s00521-019-04615-w","volume":"32","author":"C Jing","year":"2020","unstructured":"Jing C, Wei P, Sun H, Zheng N (2020) Spatiotemporal neural networks for action recognition based on joint loss. Neural Comput Appl 32:4293\u20134302","journal-title":"Neural Comput Appl"},{"key":"1558_CR13","doi-asserted-by":"crossref","unstructured":"Ke Q, Bennamoun M, An S, Sohel F, Boussaid F (2017) A new representation of skeleton sequences for 3d action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 3288\u20133297","DOI":"10.1109\/CVPR.2017.486"},{"key":"1558_CR14","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2023.104689","volume":"135","author":"D Li","year":"2023","unstructured":"Li D, Tang Y, Zhang Z, Zhang W (2023) Cross-stream contrastive learning for self-supervised skeleton-based action recognition. Image Vis Comput 135:104689","journal-title":"Image Vis Comput"},{"key":"1558_CR15","unstructured":"Li J, Wong Y, Zhao Q, Kankanhalli MS (2018) Unsupervised learning of view-invariant action representations. arXiv preprint arXiv:1809.01844"},{"key":"1558_CR16","doi-asserted-by":"crossref","unstructured":"Li L, Wang M, Ni B, Wang H, Yang J, Zhang W (2021) 3d human action representation learning via cross-view consistency pursuit. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. pp 4741\u20134750","DOI":"10.1109\/CVPR46437.2021.00471"},{"key":"1558_CR17","doi-asserted-by":"publisher","first-page":"384","DOI":"10.3390\/math10030384","volume":"10","author":"Y Li","year":"2022","unstructured":"Li Y, Tang Y (2022) Design on intelligent feature graphics based on convolution operation. Mathematics 10:384","journal-title":"Mathematics"},{"key":"1558_CR18","doi-asserted-by":"publisher","first-page":"1644","DOI":"10.3390\/math11071644","volume":"11","author":"Y Li","year":"2023","unstructured":"Li Y, Tang Y (2023) Novel creation method of feature graphics for image generation based on deep learning algorithms. Mathematics 11:1644","journal-title":"Mathematics"},{"key":"1558_CR19","doi-asserted-by":"crossref","unstructured":"Lin L, Song S, Yang W, Liu J (2020) Ms2l: multi-task self-supervised learning for skeleton based action recognition. In: Proceedings of the 28th ACM international conference on multimedia. pp 2490\u20132498","DOI":"10.1145\/3394171.3413548"},{"key":"1558_CR20","doi-asserted-by":"publisher","first-page":"2684","DOI":"10.1109\/TPAMI.2019.2916873","volume":"42","author":"J Liu","year":"2019","unstructured":"Liu J, Shahroudy A, Perez M, Wang G, Duan LY, Kot AC (2019) Ntu rgb+ d 120: a large-scale benchmark for 3d human activity understanding. IEEE Trans Pattern Anal Mach Intell 42:2684\u20132701","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"1558_CR21","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3418212","volume":"16","author":"J Liu","year":"2020","unstructured":"Liu J, Song S, Liu C, Li Y, Hu Y (2020) A benchmark dataset and comparison study for multi-modal human action analytics. ACM Trans Multimed Comput Commun Appl (TOMM) 16:1\u201324","journal-title":"ACM Trans Multimed Comput Commun Appl (TOMM)"},{"key":"1558_CR22","doi-asserted-by":"publisher","first-page":"14593","DOI":"10.1007\/s00521-020-05144-7","volume":"32","author":"Z Liu","year":"2020","unstructured":"Liu Z, Li Z, Wang R, Zong M, Ji W (2020) Spatiotemporal saliency-based multi-stream networks with attention-aware lstm for action recognition. Neural Comput Appl 32:14593\u201314602","journal-title":"Neural Comput Appl"},{"key":"1558_CR23","unstructured":"Loshchilov I, Hutter F (2016) Sgdr: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983"},{"key":"1558_CR24","doi-asserted-by":"crossref","unstructured":"Luo Z, Peng B, Huang DA, Alahi A, Fei-Fei L (2017) Unsupervised learning of long-term motion dynamics for videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 2203\u20132212","DOI":"10.1109\/CVPR.2017.751"},{"key":"1558_CR25","unstructured":"Oord Avd, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748"},{"key":"1558_CR26","doi-asserted-by":"publisher","first-page":"90","DOI":"10.1016\/j.ins.2021.04.023","volume":"569","author":"H Rao","year":"2021","unstructured":"Rao H, Xu S, Hu X, Cheng J, Hu B (2021) Augmented skeleton based contrastive action learning with momentum lstm for unsupervised action recognition. Inf Sci 569:90\u2013109","journal-title":"Inf Sci"},{"key":"1558_CR27","doi-asserted-by":"crossref","unstructured":"Shahroudy A, Liu J, Ng TT, Wang G (2016) Ntu rgb+ d: a large scale dataset for 3d human activity analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 1010\u20131019","DOI":"10.1109\/CVPR.2016.115"},{"key":"1558_CR28","doi-asserted-by":"crossref","unstructured":"Shi Z, Kim TK (2017) Learning and refining of privileged information-based rnns for action recognition from depth sequences. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 3461\u20133470","DOI":"10.1109\/CVPR.2017.498"},{"key":"1558_CR29","doi-asserted-by":"publisher","first-page":"469","DOI":"10.1007\/s00521-020-05018-y","volume":"33","author":"T Singh","year":"2021","unstructured":"Singh T, Vishwakarma DK (2021) A deeply coupled convnet for human activity recognition using dynamic and rgb images. Neural Comput Appl 33:469\u2013485","journal-title":"Neural Comput Appl"},{"key":"1558_CR30","unstructured":"Srivastava N, Mansimov E, Salakhudinov R (2015) Unsupervised learning of video representations using lstms. In: International conference on machine learning. PMLR, pp 843\u2013852"},{"key":"1558_CR31","doi-asserted-by":"crossref","unstructured":"Su K, Liu X, Shlizerman E (2020) Predict & cluster: unsupervised skeleton based action recognition. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. pp 9631\u20139640","DOI":"10.1109\/CVPR42600.2020.00965"},{"key":"1558_CR32","doi-asserted-by":"crossref","unstructured":"Thoker FM, Doughty H, Snoek CG (2021) Skeleton-contrastive 3d action representation learning. In: Proceedings of the 29th ACM international conference on multimedia. pp 1655\u20131663","DOI":"10.1145\/3474085.3475307"},{"key":"1558_CR33","doi-asserted-by":"crossref","unstructured":"Wang P, Li W, Gao Z, Zhang Y, Tang C, Ogunbona P (2017) Scene flow to action map: a new representation for rgb-d based action recognition with convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 595\u2013604","DOI":"10.1109\/CVPR.2017.52"},{"key":"1558_CR34","doi-asserted-by":"crossref","unstructured":"Wu C, Wu XJ, Kittler J, Xu T, Ahmed S, Awais M, Feng Z (2024) Scd-net: Spatiotemporal clues disentanglement network for self-supervised skeleton-based action recognition. In: Proceedings of the AAAI conference on artificial intelligence. pp 5949\u20135957","DOI":"10.1609\/aaai.v38i6.28409"},{"key":"1558_CR35","doi-asserted-by":"crossref","unstructured":"Wu Z, Xiong Y, Yu SX, Lin D (2018) Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 3733\u20133742","DOI":"10.1109\/CVPR.2018.00393"},{"key":"1558_CR36","doi-asserted-by":"crossref","unstructured":"Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-second AAAI conference on artificial intelligence","DOI":"10.1609\/aaai.v32i1.12328"},{"key":"1558_CR37","unstructured":"You Y, Gitman I, Ginsburg B (2017) Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888"},{"key":"1558_CR38","unstructured":"Zbontar J, Jing L, Misra I, LeCun Y, Deny S (2021) Barlow twins: self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230"},{"key":"1558_CR39","doi-asserted-by":"crossref","unstructured":"Zhang H, Hou Y, Zhang W, Li W (2022a) Contrastive positive mining for unsupervised 3d action representation learning. In: European conference on computer vision. Springer, pp 36\u201351","DOI":"10.1007\/978-3-031-19772-7_3"},{"key":"1558_CR40","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s00521-022-07584-9","volume":"34","author":"W Zhang","year":"2022","unstructured":"Zhang W, Hou Y, Zhang H (2022) Unsupervised skeleton-based action representation learning via relation consistency pursuit. Neural Comput Appl 34:1\u201313","journal-title":"Neural Comput Appl"},{"key":"1558_CR41","doi-asserted-by":"crossref","unstructured":"Zheng N, Wen J, Liu R, Long L, Dai J, Gong Z (2018) Unsupervised representation learning with long-term dynamics for skeleton based action recognition. In: Proceedings of the AAAI conference on artificial intelligence","DOI":"10.1609\/aaai.v32i1.11853"},{"key":"1558_CR42","doi-asserted-by":"crossref","unstructured":"Zhou Y, Duan H, Rao A, Su B, Wang J (2023) Self-supervised action representation learning from partial spatio-temporal skeleton sequences. In: Proceedings of the AAAI conference on artificial intelligence. pp 3825\u20133833","DOI":"10.1609\/aaai.v37i3.25495"}],"container-title":["Complex &amp; Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-024-01558-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40747-024-01558-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-024-01558-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,16]],"date-time":"2024-10-16T22:07:30Z","timestamp":1729116450000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40747-024-01558-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,19]]},"references-count":42,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2024,12]]}},"alternative-id":["1558"],"URL":"https:\/\/doi.org\/10.1007\/s40747-024-01558-1","relation":{},"ISSN":["2199-4536","2198-6053"],"issn-type":[{"value":"2199-4536","type":"print"},{"value":"2198-6053","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,19]]},"assertion":[{"value":"25 January 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"6 July 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 July 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no conflict of interest. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}