{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T13:07:26Z","timestamp":1775653646360,"version":"3.50.1"},"reference-count":50,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T00:00:00Z","timestamp":1775001600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T00:00:00Z","timestamp":1775606400000},"content-version":"vor","delay-in-days":7,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach. Intell. Res."],"published-print":{"date-parts":[[2026,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Micro-gesture is an imperceptible non-verbal behaviour characterised by low-intensity movement. However, its low-intensity and short-duration nature poses challenges for traditional action recognition models. To address this, we propose micro-gesture Mamba-inspired linear attention (MGMILA), a motion-aware framework integrating Mamba-inspired linear attention (MILA), a linear-complexity model optimized for video-based micro-gesture recognition. Additionally, we design motion extraction module variants, motion as layer (MAL), motion as content (MAC), and motion as gate (MAG) to enhance spatiotemporal motion localization. Furthermore, we introduce human segmentation mask prediction as an auxiliary task to guide the network in attending to human-related regions, thereby improving its motion perception and recognition capability. 
Experiments on iMiGUE, spontaneous micro gesture (SMG), and MA-52 demonstrate state-of-the-art (SOTA) performance, validating the effectiveness of our approach.<\/jats:p>","DOI":"10.1007\/s11633-025-1587-8","type":"journal-article","created":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T10:35:35Z","timestamp":1775644535000},"page":"352-365","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["MGMILA: Eulerian Motion-aware MILA for Micro-gesture Recognition"],"prefix":"10.1007","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-5924-4178","authenticated-orcid":false,"given":"Bohao","family":"Xing","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7140-0701","authenticated-orcid":false,"given":"Deng","family":"Li","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-8938-8222","authenticated-orcid":false,"given":"Rong","family":"Gao","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2242-6139","authenticated-orcid":false,"given":"Xin","family":"Liu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0790-6847","authenticated-orcid":false,"given":"Heikki","family":"K\u00e4lvi\u00e4inen","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2026,4,8]]},"reference":[{"key":"1587_CR1","doi-asserted-by":"publisher","first-page":"10626","DOI":"10.1109\/CVPR46437.2021.01049","volume-title":"Proceedings of Conference on Computer Vision and Pattern Recognition","author":"X Liu","year":"2021","unstructured":"X. Liu, H. Shi, H. Chen, Z. Yu, X. Li, G. Zhao. 
iMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis. In Proceedings of Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 10626\u201310637, 2021. DOI: https:\/\/doi.org\/10.1109\/CVPR46437.2021.01049."},{"key":"1587_CR2","doi-asserted-by":"publisher","first-page":"5626","DOI":"10.1109\/TIP.2021.3087348","volume":"30","author":"Z Yu","year":"2021","unstructured":"Z. Yu, B. Zhou, J. Wan, P. Wang, H. Chen, X. Liu, S. Z. Li, G. Zhao. Searching multi-rate and multi-modal temporal enhanced networks for gesture recognition. IEEE Transactions on Image Processing, vol. 30, pp. 5626\u20135640, 2021. DOI: https:\/\/doi.org\/10.1109\/TIP.2021.3087348.","journal-title":"IEEE Transactions on Image Processing"},{"issue":"4","key":"1587_CR3","doi-asserted-by":"publisher","first-page":"103","DOI":"10.1109\/MIS.2022.3147585","volume":"37","author":"H Shi","year":"2022","unstructured":"H. Shi, W. Peng, H. Chen, X. Liu, G. Zhao. Multiscale 3D-shift graph convolution network for emotion recognition from human actions. IEEE Intelligent Systems, vol. 37, no. 4, pp. 103\u2013110, 2022. DOI: https:\/\/doi.org\/10.1109\/MIS.2022.3147585.","journal-title":"IEEE Intelligent Systems"},{"issue":"6","key":"1587_CR4","doi-asserted-by":"publisher","first-page":"1346","DOI":"10.1007\/s11263-023-01761-6","volume":"131","author":"H Chen","year":"2023","unstructured":"H. Chen, H. Shi, X. Liu, X. Li, G. Zhao. SMG: A micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis. International Journal of Computer Vision, vol. 131, no. 6, pp. 1346\u20131366, 2023. DOI: https:\/\/doi.org\/10.1007\/s11263-023-01761-6.","journal-title":"International Journal of Computer Vision"},{"key":"1587_CR5","doi-asserted-by":"publisher","first-page":"1309","DOI":"10.1109\/LSP.2024.3396656","volume":"31","author":"D Li","year":"2024","unstructured":"D. Li, B. Xing, X. Liu. 
Enhancing micro gesture recognition for emotion understanding via context-aware visual-text contrastive learning. IEEE Signal Processing Letters, vol. 31, pp. 1309\u20131313, 2024. DOI: https:\/\/doi.org\/10.1109\/LSP.2024.3396656.","journal-title":"IEEE Signal Processing Letters"},{"issue":"7","key":"1587_CR6","doi-asserted-by":"publisher","first-page":"6238","DOI":"10.1109\/TCSVT.2024.3358415","volume":"34","author":"D Guo","year":"2024","unstructured":"D. Guo, K. Li, B. Hu, Y. Zhang, M. Wang. Benchmarking micro-action recognition: Dataset, methods, and applications. IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 7, pp. 6238\u20136252, 2024. DOI: https:\/\/doi.org\/10.1109\/TCSVT.2024.3358415.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"1587_CR7","volume-title":"EMO-LLaMA: Enhancing facial emotion understanding with instruction tuning","author":"B Xing","year":"2024","unstructured":"B. Xing, Z. Yu, X. Liu, K. Yuan, Q. Ye, W. Xie, H. Yue, J. Yang, H. K\u00e4lvi\u00e4inen. EMO-LLaMA: Enhancing facial emotion understanding with instruction tuning, [Online], Available: https:\/\/arxiv.org\/abs\/2408.11424, 2024."},{"key":"1587_CR8","volume-title":"EALD-MLLM: Emotion analysis in long-sequential and de-identity videos with multi-modal large language model","author":"D Li","year":"2024","unstructured":"D. Li, X. Liu, B. Xing, B. Xia, Y. Zong, B. Wen, H. K\u00e4lvi\u00e4inen. EALD-MLLM: Emotion analysis in long-sequential and de-identity videos with multi-modal large language model, [Online], Available: https:\/\/arxiv.org\/abs\/2405.00574, 2024."},{"key":"1587_CR9","volume-title":"Identity-free artificial emotional intelligence via micro-gesture understanding","author":"R Gao","year":"2024","unstructured":"R. Gao, X. Liu, B. Xing, Z. Yu, B. W. Schuller, H. K\u00e4lvi\u00e4inen. 
Identity-free artificial emotional intelligence via micro-gesture understanding, [Online], Available: https:\/\/arxiv.org\/abs\/2405.13206, 2024."},{"key":"1587_CR10","volume-title":"DEEMO: De-identity multimodal emotion recognition and reasoning","author":"D Li","year":"2025","unstructured":"D. Li, B. Xing, X. Liu, B. Xia, B. Wen, H. K\u00e4lvi\u00e4inen. DEEMO: De-identity multimodal emotion recognition and reasoning, [Online], Available: https:\/\/arxiv.org\/abs\/2504.19549, 2025."},{"key":"1587_CR11","doi-asserted-by":"publisher","DOI":"10.1109\/MicroCom.2016.7522586","volume-title":"Proceedings of International Conference on Microelectronics, Computing and Communications","author":"M Pal","year":"2016","unstructured":"M. Pal, S. Saha, A. Konar. Distance matching based gesture recognition for healthcare using Microsoft\u2019s Kinect sensor. In Proceedings of International Conference on Microelectronics, Computing and Communications, IEEE, Durgapur, India, 2016. DOI: https:\/\/doi.org\/10.1109\/MicroCom.2016.7522586."},{"key":"1587_CR12","doi-asserted-by":"publisher","first-page":"190","DOI":"10.1109\/IE.2016.42","volume-title":"Proceedings of International Conference on Intelligent Environments","author":"R Ne\u00dfelrath","year":"2016","unstructured":"R. Ne\u00dfelrath, M. M. Moniri, M. Feld. Combining speech, gaze, and micro-gestures for the multimodal control of in-car functions. In Proceedings of International Conference on Intelligent Environments, IEEE, London, UK, pp. 190\u2013193, 2016. DOI: https:\/\/doi.org\/10.1109\/IE.2016.42."},{"key":"1587_CR13","doi-asserted-by":"publisher","first-page":"22078","DOI":"10.1109\/ICCV51070.2023.02023","volume-title":"Proceedings of International Conference on Computer Vision","author":"X Guo","year":"2023","unstructured":"X. Guo, N. M. Selvaraj, Z. Yu, A. W. K. Kong, B. Shen, A. Kot. Audio-visual deception detection: DOLOS dataset and parameter-efficient crossmodal learning. 
In Proceedings of International Conference on Computer Vision, IEEE, Paris, France, pp. 22078\u201322088, 2023. DOI: https:\/\/doi.org\/10.1109\/ICCV51070.2023.02023."},{"issue":"6","key":"1587_CR14","doi-asserted-by":"publisher","first-page":"1307","DOI":"10.1007\/s11263-023-01758-1","volume":"131","author":"Z Yu","year":"2023","unstructured":"Z. Yu, Y. Shen, J. Shi, H. Zhao, Y. Cui, J. Zhang, P. Torr, G. Zhao. PhysFormer++: Facial video-based physiological measurement with SlowFast temporal difference transformer. International Journal of Computer Vision, vol. 131, no. 6, pp. 1307\u20131330, 2023. DOI: https:\/\/doi.org\/10.1007\/s11263-023-01758-1.","journal-title":"International Journal of Computer Vision"},{"key":"1587_CR15","volume-title":"EmotionHallucer: Evaluating emotion hallucinations in multimodal large language models","author":"B Xing","year":"2025","unstructured":"B. Xing, X. Liu, G. Zhao, C. Liu, X. Fu, H. K\u00e4lvi\u00e4inen. EmotionHallucer: Evaluating emotion hallucinations in multimodal large language models, [Online], Available: https:\/\/arxiv.org\/abs\/2505.11405, 2025."},{"key":"1587_CR16","doi-asserted-by":"publisher","first-page":"13595","DOI":"10.1109\/CVPR52734.2025.01269","volume-title":"Proceedings of Conference on Computer Vision and Pattern Recognition","author":"R Gao","year":"2025","unstructured":"R. Gao, X. Liu, Z. Hu, B. Xing, B. Xia, Z. Yu, H. K\u00e4lvi\u00e4inen. FSBench: A figure skating benchmark for advancing artistic sports understanding. In Proceedings of Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 13595\u201313605, 2025. DOI: https:\/\/doi.org\/10.1109\/CVPR52734.2025.01269."},{"key":"1587_CR17","doi-asserted-by":"publisher","first-page":"150","DOI":"10.1145\/3267782.3267796","volume-title":"Proceedings of Symposium on Spatial User Interaction","author":"R Boldu","year":"2018","unstructured":"R. Boldu, A. Dancu, D. J. C. Matthies, P. G. Casc\u00f3n, S. Ransir, S. Nanayakkara. 
Thumb-in-motion: Evaluating thumb-to-ring microgestures for athletic activity. In Proceedings of Symposium on Spatial User Interaction, Berlin, Germany, pp. 150\u2013157, 2018. DOI: https:\/\/doi.org\/10.1145\/3267782.3267796."},{"issue":"1","key":"1587_CR18","doi-asserted-by":"publisher","first-page":"84","DOI":"10.3724\/SP.J.2096-5796.2018.0006","volume":"1","author":"Y Li","year":"2019","unstructured":"Y. Li, J. Huang, F. Tian, H. A. Wang, G. Z. Dai. Gesture interaction in virtual reality. Virtual Reality & Intelligent Hardware, vol. 1, no. 1, pp. 84\u2013112, 2019. DOI: https:\/\/doi.org\/10.3724\/SP.J.2096-5796.2018.0006.","journal-title":"Virtual Reality & Intelligent Hardware"},{"issue":"2","key":"1587_CR19","doi-asserted-by":"publisher","first-page":"505","DOI":"10.1109\/TAFFC.2018.2874986","volume":"12","author":"F Noroozi","year":"2021","unstructured":"F. Noroozi, C. A. Corneanu, D. Kaminska, T. Sapinski, S. Escalera, G. Anbarjafari. Survey on emotional body gesture recognition. IEEE Transactions on Affective Computing, vol. 12, no. 2, pp. 505\u2013523, 2021. DOI: https:\/\/doi.org\/10.1109\/TAFFC.2018.2874986.","journal-title":"IEEE Transactions on Affective Computing"},{"issue":"6111","key":"1587_CR20","doi-asserted-by":"publisher","first-page":"1225","DOI":"10.1126\/science.1224313","volume":"338","author":"H Aviezer","year":"2012","unstructured":"H. Aviezer, Y. Trope, A. Todorov. Body cues, not facial expressions, discriminate between intense positive and negative emotions. Science, vol. 338, no. 6111, pp. 1225\u20131229, 2012. DOI: https:\/\/doi.org\/10.1126\/science.1224313.","journal-title":"Science"},{"key":"1587_CR21","volume-title":"AU-TTT: Vision test-time training model for facial action unit detection","author":"B Xing","year":"2025","unstructured":"B. Xing, K. Yuan, Z. Yu, X. Liu, H. K\u00e4lvi\u00e4inen. 
AU-TTT: Vision test-time training model for facial action unit detection, [Online], Available: https:\/\/arxiv.org\/abs\/2503.23450, 2025."},{"key":"1587_CR22","doi-asserted-by":"publisher","first-page":"427","DOI":"10.1007\/978-3-031-72973-7_25","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"K Yuan","year":"2024","unstructured":"K. Yuan, Z. Yu, X. Liu, W. Xie, H. Yue, J. Yang. AU-Former: Vision transformers are parameter-efficient facial action unit detectors. In Proceedings of the 18th European Conference on Computer Vision, Springer, Milan, Italy, pp. 427\u2013445, 2024. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72973-7_25."},{"issue":"2","key":"1587_CR23","doi-asserted-by":"publisher","first-page":"697","DOI":"10.1109\/TAFFC.2024.3460538","volume":"16","author":"X Liu","year":"2025","unstructured":"X. Liu, K. Yuan, X. Niu, J. Shi, Z. Yu, H. Yue, J. Yang. Multi-scale promoted self-adjusting correlation learning for facial action unit detection. IEEE Transactions on Affective Computing, vol. 16, no. 2, pp. 697\u2013711, 2025. DOI: https:\/\/doi.org\/10.1109\/TAFFC.2024.3460538.","journal-title":"IEEE Transactions on Affective Computing"},{"issue":"1","key":"1587_CR24","doi-asserted-by":"publisher","first-page":"363","DOI":"10.1109\/TCSS.2022.3223251","volume":"11","author":"S Xu","year":"2024","unstructured":"S. Xu, J. Fang, X. Hu, E. Ngai, W. Wang, Y. Guo, V. C. M. Leung. Emotion recognition from gait analyses: Current research and future directions. IEEE Transactions on Computational Social Systems, vol. 11, no. 1, pp. 363\u2013377, 2024. DOI: https:\/\/doi.org\/10.1109\/TCSS.2022.3223251.","journal-title":"IEEE Transactions on Computational Social Systems"},{"key":"1587_CR25","doi-asserted-by":"publisher","first-page":"1933","DOI":"10.1109\/CVPR.2016.213","volume-title":"Proceedings of Conference on Computer Vision and Pattern Recognition","author":"C Feichtenhofer","year":"2016","unstructured":"C. Feichtenhofer, A. 
Pinz, A. Zisserman. Convolutional two-stream network fusion for video action recognition. In Proceedings of Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 1933\u20131941, 2016. DOI: https:\/\/doi.org\/10.1109\/CVPR.2016.213."},{"key":"1587_CR26","doi-asserted-by":"publisher","first-page":"22","DOI":"10.1109\/ICCV48922.2021.00009","volume-title":"Proceedings of International Conference on Computer Vision","author":"H Wu","year":"2021","unstructured":"H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, L. Zhang. CvT: Introducing convolutions to vision transformers. In Proceedings of International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 22\u201331, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00009."},{"key":"1587_CR27","doi-asserted-by":"publisher","unstructured":"S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, M. Shah. Transformers in vision: A survey. ACM Computing Surveys, vol. 54, no. 10s, Article number 200, 2022. DOI: https:\/\/doi.org\/10.1145\/3505244.","DOI":"10.1145\/3505244"},{"issue":"11","key":"1587_CR28","doi-asserted-by":"publisher","first-page":"2740","DOI":"10.1109\/TPAMI.2018.2868668","volume":"41","author":"L Wang","year":"2019","unstructured":"L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, L. Van Gool. Temporal segment networks for action recognition in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 11, pp. 2740\u20132755, 2019. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2018.2868668.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1587_CR29","doi-asserted-by":"publisher","first-page":"7082","DOI":"10.1109\/ICCV.2019.00718","volume-title":"Proceedings of International Conference on Computer Vision","author":"J Lin","year":"2019","unstructured":"J. Lin, C. Gan, S. Han. TSM: Temporal shift module for efficient video understanding. 
In Proceedings of International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 7082\u20137092, 2019. DOI: https:\/\/doi.org\/10.1109\/ICCV.2019.00718."},{"key":"1587_CR30","doi-asserted-by":"publisher","first-page":"4489","DOI":"10.1109\/ICCV.2015.510","volume-title":"Proceedings of International Conference on Computer Vision","author":"D Tran","year":"2015","unstructured":"D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 4489\u20134497, 2015. DOI: https:\/\/doi.org\/10.1109\/ICCV.2015.510."},{"key":"1587_CR31","doi-asserted-by":"publisher","first-page":"4724","DOI":"10.1109\/CVPR.2017.502","volume-title":"Proceedings of Conference on Computer Vision and Pattern Recognition","author":"J Carreira","year":"2017","unstructured":"J. Carreira, A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp. 4724\u20134733, 2017. DOI: https:\/\/doi.org\/10.1109\/CVPR.2017.502."},{"key":"1587_CR32","doi-asserted-by":"publisher","first-page":"3192","DOI":"10.1109\/CVPR52688.2022.00320","volume-title":"Proceedings of Conference on Computer Vision and Pattern Recognition","author":"Z Liu","year":"2022","unstructured":"Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, H. Hu. Video Swin transformer. In Proceedings of Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 3192\u20133201, 2022. DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.00320."},{"key":"1587_CR33","volume-title":"Proceedings of the 38th International Conference on Machine Learning","author":"G Bertasius","year":"2021","unstructured":"G. Bertasius, H. Wang, L. Torresani. Is space-time attention all you need for video understanding? 
In Proceedings of the 38th International Conference on Machine Learning, Article number 4, 2021."},{"key":"1587_CR34","doi-asserted-by":"publisher","first-page":"1632","DOI":"10.1109\/ICCV51070.2023.00157","volume-title":"Proceedings of International Conference on Computer Vision","author":"K Li","year":"2023","unstructured":"K. Li, Y. Wang, Y. He, Y. Li, Y. Wang, L. Wang, Y. Qiao. UniFormerV2: Unlocking the potential of image ViTs for video understanding. In Proceedings of International Conference on Computer Vision, IEEE, Paris, France, pp. 1632\u20131643, 2023. DOI: https:\/\/doi.org\/10.1109\/ICCV51070.2023.00157."},{"issue":"1","key":"1587_CR35","doi-asserted-by":"publisher","first-page":"549","DOI":"10.1109\/TAFFC.2020.3031841","volume":"14","author":"A Behera","year":"2023","unstructured":"A. Behera, Z. Wharton, Y. Liu, M. Ghahremani, S. Kumar, N. Bessis. Regional attention network (RAN) for head pose and fine-grained gesture recognition. IEEE Transactions on Affective Computing, vol. 14, no. 1, pp. 549\u2013562, 2023. DOI: https:\/\/doi.org\/10.1109\/TAFFC.2020.3031841.","journal-title":"IEEE Transactions on Affective Computing"},{"key":"1587_CR36","doi-asserted-by":"publisher","first-page":"386","DOI":"10.1007\/978-3-031-19772-7_23","volume-title":"Proceedings of the 17th European Conference on Computer Vision","author":"T Li","year":"2022","unstructured":"T. Li, L. G. Foo, Q. Ke, H. Rahmani, A. Wang, J. Wang, J. Liu. Dynamic spatio-temporal specialization learning for fine-grained action recognition. In Proceedings of the 17th European Conference on Computer Vision, Springer, Tel Aviv, Israel, pp. 386\u2013403, 2022. DOI: https:\/\/doi.org\/10.1007\/978-3-031-19772-7_23."},{"key":"1587_CR37","volume-title":"Proceedings of International Conference on Learning Representations","author":"A Dosovitskiy","year":"2020","unstructured":"A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. 
Gelly, J. Uszkoreit, N. Houlsby. An image is worth 16\u00d716 words: Transformers for image recognition at scale. In Proceedings of International Conference on Learning Representations, Vienna, Austria, 2020."},{"key":"1587_CR38","doi-asserted-by":"publisher","first-page":"9992","DOI":"10.1109\/ICCV48922.2021.00986","volume-title":"Proceedings of International Conference on Computer Vision","author":"Z Liu","year":"2021","unstructured":"Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 9992\u201310002, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00986."},{"key":"1587_CR39","doi-asserted-by":"publisher","first-page":"548","DOI":"10.1109\/ICCV48922.2021.00061","volume-title":"Proceedings of International Conference on Computer Vision","author":"W Wang","year":"2021","unstructured":"W. Wang, E. Xie, X. Li, D. P. Fan, K. Song, D. Liang, T. Lu, P. Luo, L. Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 548\u2013558, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00061."},{"key":"1587_CR40","volume-title":"Proceedings of the 37th International Conference on Machine Learning","author":"A Katharopoulos","year":"2020","unstructured":"A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, Article number 478, 2020."},{"key":"1587_CR41","volume-title":"Proceedings of the 1st Conference on Language Modeling","author":"A Gu","year":"2023","unstructured":"A. Gu, T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. 
In Proceedings of the 1st Conference on Language Modeling, Philadelphia, USA, 2023."},{"key":"1587_CR42","doi-asserted-by":"publisher","unstructured":"X. Xie, Y. Cui, T. Tan, X. Zheng, Z. Yu. FusionMamba: Dynamic feature enhancement for multimodal image fusion with mamba. Visual Intelligence, vol. 2, no. 1, Article number 37, 2024. DOI: https:\/\/doi.org\/10.1007\/s44267-024-00072-9.","DOI":"10.1007\/s44267-024-00072-9"},{"key":"1587_CR43","volume-title":"Proceedings of the 38th International Conference on Neural Information Processing Systems","author":"Y Liu","year":"2024","unstructured":"Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, J. Jiao, Y. Liu. VMamba: Visual state space model. In Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 3273, 2024."},{"key":"1587_CR44","doi-asserted-by":"publisher","first-page":"12","DOI":"10.1007\/978-3-031-91979-4_2","volume-title":"Proceedings of Computer Vision","author":"T Huang","year":"2024","unstructured":"T. Huang, X. Pei, S. You, F. Wang, C. Qian, C. Xu. LocalMamba: Visual state space model with windowed selective scan. In Proceedings of Computer Vision, Springer, Milan, Italy, pp. 12\u201322, 2024. DOI: https:\/\/doi.org\/10.1007\/978-3-031-91979-4_2."},{"key":"1587_CR45","volume-title":"Proceedings of the 38th International Conference on Neural Information Processing Systems","author":"D Han","year":"2024","unstructured":"D. Han, Z. Wang, Z. Xia, Y. Han, Y. Pu, C. Ge, J. Song, S. Song, B. Zheng, G. Huang. Demystify Mamba in vision: A linear attention perspective. 
In Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 4039, 2024."},{"key":"1587_CR46","doi-asserted-by":"publisher","first-page":"7482","DOI":"10.1109\/CVPR.2018.00781","volume-title":"Proceedings of Conference on Computer Vision and Pattern Recognition","author":"R Cipolla","year":"2018","unstructured":"R. Cipolla, Y. Gal, A. Kendall. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 7482\u20137491, 2018. DOI: https:\/\/doi.org\/10.1109\/CVPR.2018.00781."},{"key":"1587_CR47","doi-asserted-by":"publisher","first-page":"3992","DOI":"10.1109\/ICCV51070.2023.00371","volume-title":"Proceedings of International Conference on Computer Vision","author":"A Kirillov","year":"2023","unstructured":"A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W. Y. Lo, P. Doll\u00e1r, R. Girshick. Segment anything. In Proceedings of International Conference on Computer Vision, IEEE, Paris, France, pp. 3992\u20134003, 2023. DOI: https:\/\/doi.org\/10.1109\/ICCV51070.2023.00371."},{"key":"1587_CR48","doi-asserted-by":"publisher","first-page":"608","DOI":"10.1109\/TIP.2025.3528347","volume":"34","author":"C Hao","year":"2025","unstructured":"C. Hao, Z. Yu, X. Liu, J. Xu, H. Yue, J. Yang. A simple yet effective network based on vision transformer for camouflaged object and salient object detection. IEEE Transactions on Image Processing, vol. 34, pp. 608\u2013622, 2025. DOI: https:\/\/doi.org\/10.1109\/TIP.2025.3528347.","journal-title":"IEEE Transactions on Image Processing"},{"key":"1587_CR49","doi-asserted-by":"publisher","first-page":"350","DOI":"10.1109\/CVPR.2018.00044","volume-title":"Proceedings of Conference on Computer Vision and Pattern Recognition","author":"R Girdhar","year":"2018","unstructured":"R. Girdhar, G. 
Gkioxari, L. Torresani, M. Paluri, D. Tran. Detect-and-track: Efficient pose estimation in videos. In Proceedings of Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 350\u2013359, 2018. DOI: https:\/\/doi.org\/10.1109\/CVPR.2018.00044."},{"key":"1587_CR50","doi-asserted-by":"publisher","first-page":"831","DOI":"10.1007\/978-3-030-01246-5_49","volume-title":"Proceedings of the 15th European Conference on Computer Vision","author":"B Zhou","year":"2018","unstructured":"B. Zhou, A. Andonian, A. Oliva, A. Torralba. Temporal relational reasoning in videos. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 831\u2013846, 2018. DOI: https:\/\/doi.org\/10.1007\/978-3-030-01246-5_49."}],"container-title":["Machine Intelligence Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11633-025-1587-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11633-025-1587-8","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11633-025-1587-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T12:04:56Z","timestamp":1775649896000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11633-025-1587-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,4]]},"references-count":50,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,4]]}},"alternative-id":["1587"],"URL":"https:\/\/doi.org\/10.1007\/s11633-025-1587-8","relation":{},"ISSN":["2731-538X","2731-5398"],"issn-type":[{"value":"2731-538X","type":"print"},{"value":"2731-5398","type":"electronic"}],"subject":[],"publis
hed":{"date-parts":[[2026,4]]},"assertion":[{"value":"3 June 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 August 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 April 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declared that they have no conflicts of interest to this work.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations of conflict of interest"}}]}}