{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,16]],"date-time":"2026-06-16T07:11:26Z","timestamp":1781593886914,"version":"3.54.5"},"reference-count":36,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,8,12]],"date-time":"2025-08-12T00:00:00Z","timestamp":1754956800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100019180","name":"HORIZON EUROPE European Research Council","doi-asserted-by":"publisher","award":["101070408"],"award-info":[{"award-number":["101070408"]}],"id":[{"id":"10.13039\/100019180","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Comput. Sci."],"abstract":"<jats:p>The lack of labeled sensor data for Human Activity Recognition (HAR) has driven researchers to synthesize Inertial Measurement Unit (IMU) data from video, utilizing the rich activity annotations available in video datasets. However, current synthetic IMU data often struggles to capture subtle, fine-grained motions, limiting its effectiveness in real-world HAR applications. To address these limitations, we introduce Multi<jats:sup>3<\/jats:sup>Net+, an advanced framework leveraging cross-modal, multitask representations of text, pose, and IMU data. Building on its predecessor, Multi<jats:sup>3<\/jats:sup>Net, it uses improved pre-training strategies and a mixture of experts classifier to effectively learn robust joint representations. By leveraging refined contrastive learning across modalities, Multi<jats:sup>3<\/jats:sup>Net+ bridges the gap between video and wearable sensor data, enhancing HAR performance for complex, fine-grained activities. 
Our experiments validate the superiority of Multi<jats:sup>3<\/jats:sup>Net+, showing significant improvements in generating high-quality synthetic IMU data and achieving state-of-the-art performance in wearable HAR tasks. These results demonstrate the efficacy of the proposed approach in advancing real-world HAR by combining cross-modal learning with multi-task optimization.<\/jats:p>","DOI":"10.3389\/fcomp.2025.1569205","type":"journal-article","created":{"date-parts":[[2025,8,12]],"date-time":"2025-08-12T05:31:07Z","timestamp":1754976667000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Improving IMU based human activity recognition using simulated multimodal representations and a MoE classifier"],"prefix":"10.3389","volume":"7","author":[{"given":"Lala Shakti Swarup","family":"Ray","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Qingxin","family":"Xia","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Vitor Fortes","family":"Rey","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Kaishun","family":"Wu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Paul","family":"Lukowicz","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1965","published-online":{"date-parts":[[2025,8,12]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","first-page":"4596","DOI":"10.3390\/s22124596","article-title":"The state-of-the-art sensing techniques in human activity recognition: a survey","volume":"22","author":"Bian","year":"2022","journal-title":"Sensors"},{"key":"B2","doi-asserted-by":"publisher","first-page":"3891","DOI":"10.3390\/s24123891","article-title":"Real-time sensor-based human activity recognition for efitness and ehealth 
platforms","volume":"24","author":"Czekaj","year":"2024","journal-title":"Sensors"},{"key":"B3","first-page":"2735","article-title":"\u201cHow2sign: a large-scale multimodal dataset for continuous American sign language,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Duarte","year":"2021"},{"key":"B4","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1145\/3675095.3676609","article-title":"\u201cEnhancing inertial hand based har through joint representation of language, pose and synthetic IMUS,\u201d","volume-title":"Proceedings of the 2024 ACM International Symposium on Wearable Computers","author":"Fortes Rey","year":"2024"},{"key":"B5","first-page":"5152","article-title":"\u201cGenerating diverse and natural 3D human motions from text,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Guo","year":"2022"},{"key":"B6","first-page":"770","article-title":"\u201cDeep residual learning for image recognition,\u201d","author":"He","year":"2016","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition"},{"key":"B7","doi-asserted-by":"publisher","first-page":"1244","DOI":"10.1145\/3351244","article-title":"Integrating activity recognition and nursing care records: The system, deployment, and a verification study","volume":"3","author":"Inoue","year":"2019","journal-title":"Proc. ACM Interact. Mob. 
Wearable Ubiquitous Technol"},{"key":"B8","doi-asserted-by":"publisher","first-page":"40811","DOI":"10.1007\/s11042-023-16795-8","article-title":"HAR-CO: a comparative analytical review for recognizing conventional human activity in stream data relying on challenges and approaches","volume":"83","author":"Keyvanpour","year":"2024","journal-title":"Multimedia Tools Appl."},{"key":"B9","doi-asserted-by":"publisher","first-page":"11841","DOI":"10.1145\/3411841","article-title":"Imutube: automatic extraction of virtual on-body accelerometry from video for human activity recognition","volume":"4","author":"Kwon","year":"2020","journal-title":"Proc. ACM Interact. Mob. Wearable Ubiquitous Technol"},{"key":"B10","doi-asserted-by":"publisher","first-page":"545","DOI":"10.1145\/3678545","article-title":"Imugpt 2.0: language-based cross modality transfer for sensor-based human activity recognition","volume":"8","author":"Leng","year":"2024","journal-title":"Proc. ACM Interact. Mob. Wearable Ubiquitous Technol"},{"key":"B11","first-page":"55","article-title":"\u201cOn the utility of virtual on-body acceleration data for fine-grained human activity recognition,\u201d","volume-title":"Proceedings of the 2023 ACM International Symposium on Wearable Computers, ISWC '23","author":"Leng","year":""},{"key":"B12","first-page":"39","article-title":"\u201cGenerating virtual on-body accelerometer data from virtual textual descriptions for human activity recognition,\u201d","volume-title":"Proceedings of the 2023 ACM International Symposium on Wearable Computers, ISWC '23","author":"Leng","year":""},{"key":"B13","first-page":"39","article-title":"\u201cGenerating virtual on-body accelerometer data from virtual textual descriptions for human activity recognition,\u201d","author":"Leng","year":"","journal-title":"Proceedings of the 2023 ACM International Symposium on Wearable 
Computers"},{"key":"B14","doi-asserted-by":"publisher","first-page":"2818013","DOI":"10.1145\/2816795.2818013","article-title":"Smpl: a skinned multi-person linear model","volume":"34","author":"Loper","year":"2015","journal-title":"ACM Trans. Graph"},{"key":"B15","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/CSCAIoT62585.2024.00005","article-title":"\u201cEarDA: towards accurate and data-efficient earable activity sensing,\u201d","volume-title":"2024 IEEE Coupling of Sensing & Computing in AIoT Systems (CSCAIoT)","author":"Lyu","year":"2024"},{"key":"B16","doi-asserted-by":"publisher","first-page":"13246","DOI":"10.18653\/v1\/2023.findings-emnlp.883","article-title":"\u201cIMU2CLIP: language-grounded motion sensor translation with multimodal contrastive learning,\u201d","author":"Moon","year":"2023","journal-title":"Findings of the Association for Computational Linguistics: EMNLP 2023"},{"key":"B17","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1807.03748","article-title":"Representation learning with contrastive predictive coding","author":"Oord","year":"2018","journal-title":"arXiv"},{"key":"B18","first-page":"8748","article-title":"\u201cLearning transferable visual models from natural language supervision,\u201d","author":"Radford","year":"2021","journal-title":"Proceedings of the 38th International Conference on Machine Learning, Volume 139 of Proceedings of Machine Learning Research"},{"key":"B19","doi-asserted-by":"publisher","first-page":"133","DOI":"10.1007\/978-3-031-78110-0_9","article-title":"\u201cAls-har: harnessing wearable ambient light sensors to enhance imu-based human activity recognition,\u201d","author":"Ray","year":"2025","journal-title":"Pattern Recognition"},{"key":"B20","doi-asserted-by":"crossref","first-page":"461","DOI":"10.1109\/PerComWorkshops59983.2024.10503379","article-title":"\u201cText me the data: generating ground pressure sequence from textual descriptions for har,\u201d","volume-title":"2024 IEEE 
International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)","author":"Ray","year":"2024"},{"key":"B21","doi-asserted-by":"crossref","DOI":"10.1109\/PerComWorkshops56833.2023.10150221","article-title":"\u201cPressim: an end-to-end framework for dynamic ground pressure profile generation from monocular videos using physics-based 3D simulation,\u201d","volume-title":"2023 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)","author":"Ray","year":"2023"},{"key":"B22","first-page":"699","article-title":"\u201cLet there be imu data: generating training data for wearable, motion sensor based activity recognition from monocular rgb videos,\u201d","volume-title":"Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers, UbiComp\/ISWC '19 Adjunct","author":"Rey","year":"2019"},{"key":"B23","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3596261","article-title":"Synthetic smartwatch imu data generation from in-the-wild asl videos","volume":"74","author":"Santhalingam","year":"2023","journal-title":"Proc. ACM Interact. Mob. Wearable Ubiquitous Technol"},{"key":"B24","article-title":"Outrageously large neural networks: the sparsely-gated mixture-of-experts layer","author":"Shazeer","year":"2017","journal-title":"arXiv preprint"},{"key":"B25","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3432701","article-title":"Mm-fit: multimodal deep learning for automatic exercise logging across sensing devices","volume":"4","author":"Str\u00f6mb\u00e4ck","year":"2020","journal-title":"Proc. ACM Inter. Mobile, Wear. Ubiquit. 
Technol"},{"key":"B26","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2212.09741","article-title":"One embedder, any task: instruction-finetuned text embeddings","author":"Su","year":"2022","journal-title":"arXiv"},{"key":"B27","doi-asserted-by":"crossref","first-page":"581","DOI":"10.1007\/978-3-030-58548-8_34","article-title":"\u201cGrab: a dataset of whole-body human grasping of objects,\u201d","volume-title":"Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23\u201328, 2020, Proceedings, Part IV 16","author":"Taheri","year":"2020"},{"key":"B28","doi-asserted-by":"publisher","first-page":"3411836","DOI":"10.1145\/3411836","article-title":"Robust unsupervised factory activity recognition with body-worn accelerometer using temporal structure of multiple sensor data motifs","volume":"4","author":"Xia","year":"2020","journal-title":"Proc. ACM Interact. Mob. Wearable Ubiquitous Technol"},{"key":"B29","doi-asserted-by":"publisher","first-page":"261","DOI":"10.1007\/978-3-030-69873-7_19","article-title":"\u201cA deep learning method for complex human activity recognition using virtual wearable sensors,\u201d","author":"Xiao","year":"2021","journal-title":"Spatial Data and Intelligence"},{"key":"B30","first-page":"10225","article-title":"\u201cSpatial-related sensors matters: 3D human motion reconstruction assisted with textual semantics,\u201d","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Yang","year":"2024"},{"key":"B31","doi-asserted-by":"crossref","first-page":"90","DOI":"10.1109\/PerCom59722.2024.10494448","article-title":"\u201cOpenpack: a large-scale dataset for recognizing packaging works in iot-enabled logistic environments,\u201d","volume-title":"2024 IEEE International Conference on Pervasive Computing and Communications (PerCom)","author":"Yoshimura","year":"2024"},{"key":"B32","first-page":"199","article-title":"\u201cImusim: a simulation environment for inertial sensing algorithm 
design and evaluation,\u201d","volume-title":"Proceedings of the 10th ACM\/IEEE International Conference on Information Processing in Sensor Networks","author":"Young","year":"2011"},{"key":"B33","first-page":"14730","article-title":"\u201cGenerating human motion from textual descriptions with discrete representations,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Zhang","year":"2023"},{"key":"B34","doi-asserted-by":"publisher","first-page":"4115","DOI":"10.1109\/TPAMI.2024.3355414","article-title":"Motiondiffuse: text-driven human motion generation with diffusion model","volume":"46","author":"Zhang","year":"2024","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell"},{"key":"B35","first-page":"364","article-title":"\u201cRemodiffuse: retrieval-augmented motion diffusion model,\u201d","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Zhang","year":"2023"},{"key":"B36","first-page":"11656","article-title":"\u201c3D human pose estimation with spatial and temporal transformers,\u201d","volume-title":"Proceedings of the IEEE International Conference on Computer Vision (ICCV)","author":"Zheng","year":"2021"}],"container-title":["Frontiers in Computer 
Science"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fcomp.2025.1569205\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,12]],"date-time":"2025-08-12T05:31:12Z","timestamp":1754976672000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fcomp.2025.1569205\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,12]]},"references-count":36,"alternative-id":["10.3389\/fcomp.2025.1569205"],"URL":"https:\/\/doi.org\/10.3389\/fcomp.2025.1569205","relation":{},"ISSN":["2624-9898"],"issn-type":[{"value":"2624-9898","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,8,12]]},"article-number":"1569205"}}