{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T23:43:59Z","timestamp":1775000639885,"version":"3.50.1"},"reference-count":45,"publisher":"Association for Computing Machinery (ACM)","issue":"1","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62277014, 62262056"],"award-info":[{"award-number":["62277014, 62262056"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2026,1,31]]},"abstract":"<jats:p>Recently, Mamba has gained widespread attention due to its ability to model long-range dependencies with linear computational complexity. To explore the application of Mamba in 2D human pose estimation, we propose MatPose, a Mamba-Transformer hybrid model specifically designed for efficient 2D human pose estimation. The model aims to combine Mamba\u2019s efficient modeling of long-range dependencies with the powerful global context modeling capabilities of the Transformer to effectively extract human pose keypoints. Initially, to address the lack of local features when Mamba is applied to computer vision tasks, we design a Cross-Stage Multi-Scale Convolution (CSMSC) module by integrating multi-scale convolution, cross-stage feature fusion, and spatial attention mechanisms to effectively extract local features. Then, to mitigate the long-range forgetting issue inherent in Mamba, we shorten the sequence length using the Conv-Reduce operation. In addition, we design a Channel Selection Attention (CSA) mechanism to compensate for the feature loss caused by the Conv-Reduce operation. 
Finally, to explore a suitable integration method for the Mamba-Transformer hybrid model in 2D human pose estimation, we conduct a comprehensive ablation study on the feasibility of integrating Mamba and Transformer models. Experimental results show that the proposed method, compared to the baseline model, improves performance while reducing computational overhead. On the COCO val2017 dataset, MatPose achieves an AP of 74.6 with only 5.18 GFLOPs, outperforming most existing human pose estimation models.<\/jats:p>","DOI":"10.1145\/3777469","type":"journal-article","created":{"date-parts":[[2025,11,19]],"date-time":"2025-11-19T16:05:03Z","timestamp":1763568303000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["MatPose: A 2D Human Pose Estimation Model with Hybrid Mamba-Transformer"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1292-3868","authenticated-orcid":false,"given":"Wenjun","family":"Xie","sequence":"first","affiliation":[{"name":"School of Software, Hefei University of Technology, Hefei, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-5396-9767","authenticated-orcid":false,"given":"Kejun","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9457-263X","authenticated-orcid":false,"given":"Dong","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Electronic and Information Engineering, Anhui Jianzhu University, Hefei, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0760-6262","authenticated-orcid":false,"given":"Xiaoping","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, 
China"}]}],"member":"320","published-online":{"date-parts":[[2026,1,13]]},"reference":[{"key":"e_1_3_1_2_2","first-page":"691","volume-title":"Proceedings of the 38th AAAI Conference on Artificial Intelligence","author":"An Xiaoqi","year":"2024","unstructured":"Xiaoqi An, Lin Zhao, Chen Gong, Nannan Wang, Di Wang, and Jian Yang. 2024. Sharpose: Sparse high-resolution representation for human pose estimation. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, 691\u2013699."},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00742"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503464"},{"key":"e_1_3_1_5_2","first-page":"2334","volume-title":"2017 IEEE International Conference on Computer Vision (ICCV)","author":"Fang Hao-Shu","year":"2017","unstructured":"Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. 2017. RMPE: Regional multi-person pose estimation. In 2017 IEEE International Conference on Computer Vision (ICCV), 2334\u20132343."},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3633781"},{"key":"e_1_3_1_7_2","unstructured":"Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752. Retrieved from https:\/\/arxiv.org\/abs\/2312.00752"},{"key":"e_1_3_1_8_2","first-page":"222","volume-title":"European Conference on Computer Vision","author":"Guo Hang","year":"2024","unstructured":"Hang Guo, Jinmin Li, Tao Dai, Zhihao Ouyang, Xudong Ren, and Shu-Tao Xia. 2024. Mambair: A simple baseline for image restoration with state-space model. In European Conference on Computer Vision. Springer, 222\u2013241."},{"key":"e_1_3_1_9_2","unstructured":"Ali Hatamizadeh and Jan Kautz. 2024. Mambavision: A hybrid mamba-transformer vision backbone. arXiv:2407.08083. 
Retrieved from https:\/\/arxiv.org\/abs\/2407.08083"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.322"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00745"},{"key":"e_1_3_1_12_2","unstructured":"Tao Huang Xiaohuan Pei Shan You Fei Wang Chen Qian and Chang Xu. 2024. Localmamba: Visual state space model with windowed selective scan. arXiv:2403.09338. Retrieved from https:\/\/arxiv.org\/abs\/2403.09338"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00311"},{"key":"e_1_3_1_14_2","doi-asserted-by":"crossref","first-page":"121352","DOI":"10.1016\/j.eswa.2023.121352","article-title":"Large separable kernel attention: Rethinking the large kernel attention design in cnn","volume":"236","author":"Wai Lau Kin","year":"2024","unstructured":"Kin Wai Lau, Lai-Man Po, and Yasar Abbas Ur Rehman. 2024. Large separable kernel attention: Rethinking the large kernel attention design in cnn. Expert Systems with Applications 236 (2024), 121352.","journal-title":"Expert Systems with Applications"},{"key":"e_1_3_1_15_2","unstructured":"Boyun Li Haiyu Zhao Wenxin Wang Peng Hu Yuanbiao Gou and Xi Peng. 2024. MaIR: A locality-and continuity-preserving Mamba for image restoration. arXiv:2412.20066. Retrieved from https:\/\/arxiv.org\/abs\/2412.20066"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01112"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2023.3248144"},{"issue":"1","key":"e_1_3_1_18_2","doi-asserted-by":"crossref","first-page":"e12339","DOI":"10.1049\/cvi2.12339","article-title":"SMGNFORMER: Fusion mamba-graph transformer network for human pose estimation","volume":"19","author":"Li Yi","year":"2025","unstructured":"Yi Li, Zan Wang, and Weiran Niu. 2025. SMGNFORMER: Fusion mamba-graph transformer network for human pose estimation. 
IET Computer Vision 19, 1 (2025), e12339.","journal-title":"IET Computer Vision"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01112"},{"key":"e_1_3_1_20_2","unstructured":"Huajun Liu Fuqiang Liu Xinyi Fan and Dong Huang. 2021. Polarized self-attention: Towards high-quality pixel-wise regression. arXiv:2107.00782. Retrieved from https:\/\/arxiv.org\/abs\/2107.00782"},{"key":"e_1_3_1_21_2","unstructured":"Mushui Liu Jun Dan Ziqian Lu Yunlong Yu Yingming Li and Xi Li. 2024. CM-UNet: Hybrid CNN-Mamba UNet for remote sensing image semantic segmentation. arXiv:2405.10530. Retrieved from https:\/\/arxiv.org\/abs\/2405.10530"},{"key":"e_1_3_1_22_2","first-page":"103031","article-title":"Vmamba: Visual state space model","volume":"37","author":"Liu Yue","year":"2024","unstructured":"Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. 2024. Vmamba: Visual state space model. Advances in Neural Information Processing Systems 37 (2024), 103031\u2013103063.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_23_2","first-page":"1","volume-title":"2024 International Joint Conference on Neural Networks (IJCNN)","author":"Lu LiPing","year":"2024","unstructured":"LiPing Lu, Qian Xiong, Bingrong Xu, and Duanfeng Chu. 2024. MixDehazeNet: Mix structure block for image dehazing network. In 2024 International Joint Conference on Neural Networks (IJCNN). IEEE, 1\u201310."},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20065-6_25"},{"key":"e_1_3_1_25_2","unstructured":"Yucong Meng Zhiwei Yang Zhijian Song and Yonghong Shi. 2025. DM-Mamba: Dual-domain multi-scale Mamba for MRI reconstruction. arXiv:2501.08163. Retrieved from https:\/\/arxiv.org\/abs\/2501.08163"},{"key":"e_1_3_1_26_2","unstructured":"Weichao Pan Xu Wang and Wenqing Huan. 2024. EFA-YOLO: An efficient feature attention model for fire and flame detection. 
arXiv:2409.12635. Retrieved from https:\/\/arxiv.org\/abs\/2409.12635"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58529-7_29"},{"key":"e_1_3_1_28_2","unstructured":"Yuheng Shi Minjing Dong and Chang Xu. 2024. Multi-scale vmamba: Hierarchy in hierarchy visual state space model. arXiv:2405.14174. Retrieved from https:\/\/arxiv.org\/abs\/2405.14174"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00584"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.3390\/electronics12071648"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/ADICS58448.2024.10533619"},{"key":"e_1_3_1_32_2","first-page":"1","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017), 1\u201311.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_33_2","doi-asserted-by":"crossref","first-page":"992","DOI":"10.1109\/LSP.2022.3163678","article-title":"A fast and effective transformer for human pose estimation","volume":"29","author":"Wang Dong","year":"2022","unstructured":"Dong Wang, Wenjun Xie, Youcheng Cai, and Xiaoping Liu. 2022. A fast and effective transformer for human pose estimation. IEEE Signal Processing Letters 29 (2022), 992\u2013996.","journal-title":"IEEE Signal Processing Letters"},{"key":"e_1_3_1_34_2","first-page":"213","volume-title":"European Conference on Computer Vision","author":"Wang Haonan","year":"2024","unstructured":"Haonan Wang, Jie Liu, Jie Tang, Gangshan Wu, Bo Xu, Yanbing Chou, and Yong Wang. 2024. GTPT: Group-based token pruning transformer for efficient human pose estimation. In European Conference on Computer Vision. 
Springer, 213\u2013230."},{"key":"e_1_3_1_35_2","first-page":"911","volume-title":"2024 5th International Conference on Big Data and Artificial Intelligence and Software Engineering (ICBASE)","author":"Wang Li","year":"2024","unstructured":"Li Wang and Rihan Gu. 2024. Combining dynamic split convolutions and lightweight inverse residual module for human pose estimation. In 2024 5th International Conference on Big Data and Artificial Intelligence and Software Engineering (ICBASE). IEEE, 911\u2013914."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3688803"},{"key":"e_1_3_1_37_2","unstructured":"Juan Wen Weiyan Hou Luc Van Gool and Radu Timofte. 2025. MatIR: A hybrid Mamba-Transformer image restoration model. arXiv:2501.18401. Retrieved from https:\/\/arxiv.org\/abs\/2501.18401"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01231-1_29"},{"issue":"2","key":"e_1_3_1_40_2","first-page":"1212","article-title":"Vitpose++: vision transformer for generic body pose estimation","volume":"46","author":"Xu Yufei","year":"2023","unstructured":"Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. 2023. Vitpose++: vision transformer for generic body pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 46, 2 (2023), 1212\u20131230.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_41_2","unstructured":"Chenhongyi Yang Zehui Chen Miguel Espinosa Linus Ericsson Zhenyu Wang Jiaming Liu and Elliot J. Crowley. 2024. Plainmamba: Improving non-hierarchical mamba in visual recognition. arXiv:2403.17695. 
Retrieved from https:\/\/arxiv.org\/abs\/2403.17695"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01159"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00215"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3596445"},{"key":"e_1_3_1_45_2","first-page":"1","volume-title":"41st International Conference on Machine Learning","author":"Zhu Lianghui","year":"2024","unstructured":"Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. 2024. Vision mamba: Efficient visual representation learning with bidirectional state space model. In 41st International Conference on Machine Learning, 1\u201314."},{"key":"e_1_3_1_46_2","article-title":"Merging context clustering with visual state space models for medical image segmentation","author":"Zhu Yun","year":"2025","unstructured":"Yun Zhu, Dong Zhang, Yi Lin, Yifei Feng, and Jinhui Tang. 2025. Merging context clustering with visual state space models for medical image segmentation. IEEE Transactions on Medical Imaging. 
2025.","journal-title":"IEEE Transactions on Medical Imaging"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3777469","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T14:20:20Z","timestamp":1768314020000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3777469"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,13]]},"references-count":45,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,1,31]]}},"alternative-id":["10.1145\/3777469"],"URL":"https:\/\/doi.org\/10.1145\/3777469","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,13]]},"assertion":[{"value":"2025-04-27","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-12","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-01-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}