{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,30]],"date-time":"2025-12-30T20:08:19Z","timestamp":1767125299944,"version":"build-2065373602"},"reference-count":53,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2023,2,22]],"date-time":"2023-02-22T00:00:00Z","timestamp":1677024000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Chinese National Natural Science Foundation","award":["62071378"],"award-info":[{"award-number":["62071378"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Human action recognition has drawn significant attention because of its importance in computer vision-based applications. Action recognition based on skeleton sequences has rapidly advanced in the last decade. Conventional deep learning-based approaches are based on extracting skeleton sequences through convolutional operations. Most of these architectures are implemented by learning spatial and temporal features through multiple streams. These studies have enlightened the action recognition endeavor from various algorithmic angles. However, three common issues are observed: (1) The models are usually complicated; therefore, they have a correspondingly higher computational complexity. (2) For supervised learning models, the reliance on labels during training is always a drawback. (3) Implementing large models is not beneficial to real-time applications. To address the above issues, in this paper, we propose a multi-layer perceptron (MLP)-based self-supervised learning framework with a contrastive learning loss function (ConMLP). ConMLP does not require a massive computational setup; it can effectively reduce the consumption of computational resources. Compared with supervised learning frameworks, ConMLP is friendly to the huge amount of unlabeled training data. In addition, it has low requirements for system configuration and is more conducive to being embedded in real-world applications. Extensive experiments show that ConMLP achieves the top one inference result of 96.9% on the NTU RGB+D dataset. This accuracy is higher than the state-of-the-art self-supervised learning method. Meanwhile, ConMLP is also evaluated in a supervised learning manner, which has achieved comparable performance to the state of the art of recognition accuracy.<\/jats:p>","DOI":"10.3390\/s23052452","type":"journal-article","created":{"date-parts":[[2023,2,23]],"date-time":"2023-02-23T02:01:25Z","timestamp":1677117685000},"page":"2452","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["ConMLP: MLP-Based Self-Supervised Contrastive Learning for Skeleton Data Analysis and Action Recognition"],"prefix":"10.3390","volume":"23","author":[{"given":"Chuan","family":"Dai","sequence":"first","affiliation":[{"name":"School of Computing and Engineering, University of Huddersfield, Huddersfield HD1 3DH, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yajuan","family":"Wei","sequence":"additional","affiliation":[{"name":"School of Computing and Engineering, University of Huddersfield, Huddersfield HD1 3DH, UK"},{"name":"School of Cyberspace Security, Xi\u2019an University of Posts and Telecommunications, Xi\u2019an 710061, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhijie","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Computing and Engineering, University of Huddersfield, Huddersfield HD1 3DH, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6628-1421","authenticated-orcid":false,"given":"Minsi","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Computing and Engineering, University of Huddersfield, Huddersfield HD1 3DH, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ying","family":"Liu","sequence":"additional","affiliation":[{"name":"International Joint Research Center for Wireless Communication and Information Processing, Xi\u2019an 710121, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jiulun","family":"Fan","sequence":"additional","affiliation":[{"name":"School of Communications and Information Engineering, Xi\u2019an University of Posts and Telecommunications, Xi\u2019an 710061, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2023,2,22]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Lemieux, N., and Noumeir, R. (2020). A Hierarchical Learning Approach for Human Action Recognition. Sensors, 20.","DOI":"10.3390\/s20174946"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"116","DOI":"10.1145\/2398356.2398381","article-title":"Real-Time Human Pose Recognition in Parts from Single Depth Images","volume":"56","author":"Shotton","year":"2013","journal-title":"Commun. ACM"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Ke, Q., Bennamoun, M., An, S., Sohel, F., and Boussaid, F. (2017, January 21\u201326). A New Representation of Skeleton Sequences for 3d Action Recognition. Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.486"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Lee, G.C., and Loo, C.K. (2022). On the Post Hoc Explainability of Optimized Self-Organizing Reservoir Network for Action Recognition. Sensors, 22.","DOI":"10.3390\/s22051905"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., and Zheng, N. (2017, January 22\u201329). View Adaptive Recurrent Neural Networks for High Performance Human Action Recognition from Skeleton Data. Proceedings of the 16th IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy.","DOI":"10.1109\/ICCV.2017.233"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Yan, S., Xiong, Y., and Lin, D. (2018, January 2\u20137). Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.12328"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Liu, Z., Zhang, H., Chen, Z., Wang, Z., and Ouyang, W. (2020, January 13\u201319). Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00022"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Pan, Q., Zhao, Z., Xie, X., Li, J., Cao, Y., and Shi, G. (2022). View-Normalized and Subject-Independent Skeleton Generation for Action Recognition. IEEE Trans. Circuits Syst. Video Technol., 1.","DOI":"10.1109\/TCSVT.2022.3219864"},{"key":"ref_9","first-page":"1131","article-title":"Towards to-a-T Spatio-Temporal Focus for Skeleton-Based Action Recognition","volume":"36","author":"Ke","year":"2022","journal-title":"Proc. AAAI Conf. Artif. Intell."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Leibe, B., Matas, J., Sebe, N., and Welling, M. (2016, January 11\u201314). Colorful Image Colorization. Proceedings of the 14th European Conference on Computer Vision, ECCV 2016, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46454-1"},{"key":"ref_11","unstructured":"Gidaris, S., Singh, P., and Komodakis, N. (May, January 30). Unsupervised Representation Learning by Predicting Image Rotations. Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Leibe, B., Matas, J., Sebe, N., and Welling, M. (2016, January 11\u201314). Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. Proceedings of the 14th European Conference on Computer Vision, ECCV 2016, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46454-1"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Kolesnikov, A., Zhai, X., and Beyer, L. (2019, January 15\u201320). Revisiting Self-Supervised Visual Representation Learning. Proceedings of the 32nd IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00202"},{"key":"ref_14","unstructured":"Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., and Krishnan, D. (2020, January 6\u201312). Supervised Contrastive Learning. Proceedings of the 34th Conference on Neural Information Processing Systems, NeurIPS 2020, Online."},{"key":"ref_15","unstructured":"Kingma, D.P., and Welling, M. (2014, January 14\u201316). Auto-Encoding Variational Bayes. Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Schroff, F., Kalenichenko, D., and Philbin, J. (2015, January 7\u201312). Facenet: A Unified Embedding for Face Recognition and Clustering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"ref_17","unstructured":"Sohn, K. (2016, January 5\u201310). Improved Deep Metric Learning with Multi-Class N-Pair Loss Objective. Proceedings of the 30th Annual Conference on Neural Information Processing Systems, NIPS 2016, Barcelona, Spain."},{"key":"ref_18","unstructured":"Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020, January 12\u201318). A Simple Framework for Contrastive Learning of Visual Representations. Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Virtual."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"20327","DOI":"10.1007\/s00521-022-07584-9","article-title":"Unsupervised Skeleton-Based Action Representation Learning Via Relation Consistency Pursuit","volume":"34","author":"Zhang","year":"2022","journal-title":"Neural Comput. Appl."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"8623","DOI":"10.1109\/TCSVT.2022.3194350","article-title":"Motion Guided Attention Learning for Self-Supervised 3d Human Action Recognition","volume":"32","author":"Yang","year":"2022","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"6224","DOI":"10.1109\/TIP.2022.3207577","article-title":"Contrast-Reconstruction Representation Learning for Self-Supervised Skeleton-Based Action Recognition","volume":"31","author":"Wang","year":"2022","journal-title":"IEEE Trans. Image Process."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Tanfous, A.B., Zerroug, A., Linsley, D., and Serre, T. (2022, January 4\u20138). How and What to Learn: Taxonomizing Self-Supervised Learning for 3d Action Recognition. Proceedings of the 22nd IEEE\/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA.","DOI":"10.1109\/WACV51458.2022.00294"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Zhang, H., Hou, Y., and Zhang, W. (2022, January 11\u201315). Skeletal Twins: Unsupervised Skeleton-Based Action Representation Learning. Proceedings of the 2022 IEEE International Conference on Multimedia and Expo, ICME 2022, Taipei, Taiwan.","DOI":"10.1109\/ICME52920.2022.9859595"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Qiu, H., Wu, Y., Duan, M., and Jin, C. (2022, January 11\u201315). Glta-Gcn: Global-Local Temporal Attention Graph Convolutional Network for Unsupervised Skeleton-Based Action Recognition. Proceedings of the 2022 IEEE International Conference on Multimedia and Expo, ICME 2022, Taipei, Taiwan.","DOI":"10.1109\/ICME52920.2022.9859752"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.neucom.2020.10.037","article-title":"Skeleton Edge Motion Networks for Human Action Recognition","volume":"423","author":"Wang","year":"2021","journal-title":"Neurocomputing"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Thoker, F.M., Doughty, H., and Snoek, C.G.M. (2021, January 20\u201324). Skeleton-Contrastive 3d Action Representation Learning. Proceedings of the 29th ACM International Conference on Multimedia, MM 2021, Virtual.","DOI":"10.1145\/3474085.3475307"},{"key":"ref_27","unstructured":"Xu, Z., Shen, X., Wong, Y., and Kankanhalli, M.S. (2021, January 6\u201314). Unsupervised Motion Representation Learning with Capsule Autoencoders. Proceedings of the 35th Conference on Neural Information Processing Systems, NeurIPS 2021, Virtual."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Zhu, C., Li, X., Li, J., Dai, S., and Tong, W. (2022). Multi-Sourced Knowledge Integration for Robust Self-Supervised Facial Landmark Tracking. IEEE Trans. Multimed., 1\u201313.","DOI":"10.1109\/TMM.2022.3212265"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Dong, X., Yu, S.I., Weng, X., Wei, S.E., Yang, Y., and Sheikh, Y. (2018, January 18\u201323). Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors. Proceedings of the 31st Meeting of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00045"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"1346","DOI":"10.1038\/s41551-022-00914-1","article-title":"Self-Supervised Learning in Medicine and Healthcare","volume":"6","author":"Krishnan","year":"2022","journal-title":"Nat. Biomed. Eng."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Shi, L., Zhang, Y., Cheng, J., and Lu, H. (2019, January 15\u201320). Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. Proceedings of the 32nd IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01230"},{"key":"ref_32","unstructured":"Yu, F., and Koltun, V. (2016, January 2\u20134). Multi-Scale Context Aggregation by Dilated Convolutions. Proceedings of the 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Chen, Z., Li, S., Yang, B., Li, Q., and Liu, H. (2021, January 7\u201312). Multi-Scale Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition. Proceedings of the 35th AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual.","DOI":"10.1609\/aaai.v35i2.16197"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Cheng, K., Zhang, Y., He, X., Chen, W., Cheng, J., and Lu, H. (2020, January 13\u201319). Skeleton-Based Action Recognition with Shift Graph Convolutional Network. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00026"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Wu, B., Wan, A., Yue, X., Jin, P., Zhao, S., Golmant, N., Gholaminejad, A., Gonzalez, J., and Keutzer, K. (2018, January 18\u201323). Shift: A Zero Flop, Zero Parameter Alternative to Spatial Convolutions. Proceedings of the 31st Meeting of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00951"},{"key":"ref_36","unstructured":"Tolstikhin, I., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., and Uszkoreit, J. (2021, January 6\u201314). Mlp-Mixer: An All-Mlp Architecture for Vision. Proceedings of the 35th Conference on Neural Information Processing Systems, NeurIPS 2021, Virtual."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Touvron, H., Bojanowski, P., Caron, M., Cord, M., El-Nouby, A., Grave, E., Izacard, G., Joulin, A., Synnaeve, G., and Verbeek, J. (2022). Resmlp: Feedforward Networks for Image Classification with Data-Efficient Training. IEEE Trans. Pattern Anal. Mach. Intell., 1\u20139.","DOI":"10.1109\/TPAMI.2022.3206148"},{"key":"ref_38","unstructured":"Liu, H., Dai, Z., So, D.R., and Le, Q.V. (2021, January 6\u201314). Pay Attention to Mlps. Proceedings of the 35th Conference on Neural Information Processing Systems, NeurIPS 2021, Virtual."},{"key":"ref_39","unstructured":"Ding, X., Zhang, X., Han, J., and Ding, G. (2021). Repmlp: Re-Parameterizing Convolutions into Fully-Connected Layers for Image Recognition. arXiv."},{"key":"ref_40","unstructured":"Chen, S., Xie, E., Ge, C., Chen, R., Liang, D., and Luo, P. (2021). Cyclemlp: A Mlp-Like Architecture for Dense Prediction. arXiv."},{"key":"ref_41","unstructured":"Hendrycks, D., and Gimpel, K. (2016). Gaussian Error Linear Units (Gelus). arXiv."},{"key":"ref_42","unstructured":"Lei Ba, J., Ryan Kiros, J., and Geoffrey Hinton, E. (2016). Layer Normalization. arXiv."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"90","DOI":"10.1016\/j.ins.2021.04.023","article-title":"Augmented Skeleton Based Contrastive Action Learning with Momentum Lstm for Unsupervised Action Recognition","volume":"569","author":"Rao","year":"2021","journal-title":"Inf. Sci."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Budisteanu, E.A., and Mocanu, I.G. (2021). Combining Supervised and Unsupervised Learning Algorithms for Human Activity Recognition. Sensors, 21.","DOI":"10.3390\/s21186309"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., and Tian, Q. (2019, January 15\u201320). Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition. Proceedings of the 32nd IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00371"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Si, C., Chen, W., Wang, W., Wang, L., and Tan, T. (2019, January 15\u201320). An Attention Enhanced Graph Convolutional Lstm Network for Skeleton-Based Action Recognition. Proceedings of the 32nd IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00132"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Shi, L., Zhang, Y., Cheng, J., and Lu, H. (2019, January 15\u201320). Skeleton-Based Action Recognition with Directed Graph Neural Networks. Proceedings of the 32nd IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00810"},{"key":"ref_49","unstructured":"GitHub (2022, November 03). GitHub-Sovrasov\/Flops-Counter.Pytorch: Flops Counter for Convolutional Networks in Pytorch Framework. Available online: https:\/\/github.com\/sovrasov\/flops-counter.pytorch."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Shahroudy, A., Liu, J., Ng, T.T., and Wang, G. (2016, January 27\u201330). Ntu Rgb+D: A Large Scale Dataset for 3d Human Activity Analysis. Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.115"},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"2684","DOI":"10.1109\/TPAMI.2019.2916873","article-title":"Ntu Rgb+D 120: A Large-Scale Benchmark for 3d Human Activity Understanding","volume":"42","author":"Liu","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_52","unstructured":"Loshchilov, I., and Hutter, F. (2017, January 24\u201326). Sgdr: Stochastic Gradient Descent with Warm Restarts. Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Wang, F., and Liu, H. (2021, January 20\u201325). Understanding the Behaviour of Contrastive Loss. Proceedings of the 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00252"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/5\/2452\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T18:39:52Z","timestamp":1760121592000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/5\/2452"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,22]]},"references-count":53,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2023,3]]}},"alternative-id":["s23052452"],"URL":"https:\/\/doi.org\/10.3390\/s23052452","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2023,2,22]]}}}