{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,25]],"date-time":"2026-03-25T16:11:15Z","timestamp":1774455075570,"version":"3.50.1"},"reference-count":61,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2023,9,27]],"date-time":"2023-09-27T00:00:00Z","timestamp":1695772800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Tsinghua University Initiative Scientifc Research Program"},{"name":"Institute for Artifcial Intelligence, Tsinghua University"},{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62132010, 62002198"],"award-info":[{"award-number":["62132010, 62002198"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Young Elite Scientists Sponsorship Program by CAST","award":["2021QNRC001"],"award-info":[{"award-number":["2021QNRC001"]}]},{"name":"Beijing Key Lab of Networked Multimedia"},{"DOI":"10.13039\/501100017582","name":"Beijing National Research Center for Information Science and Technology","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100017582","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."],"published-print":{"date-parts":[[2023,9,27]]},"abstract":"<jats:p>Multimodal sensors provide complementary information to develop accurate machine-learning methods for human activity recognition (HAR), but introduce significantly higher computational load, which reduces efficiency. This paper proposes an efficient multimodal neural architecture for HAR using an RGB camera and inertial measurement units (IMUs) called Multimodal Temporal Segment Attention Network (MMTSA). MMTSA first transforms IMU sensor data into a temporal and structure-preserving gray-scale image using the Gramian Angular Field (GAF), representing the inherent properties of human activities. MMTSA then applies a multimodal sparse sampling method to reduce data redundancy. Lastly, MMTSA adopts an inter-segment attention module for efficient multimodal fusion. Using three well-established public datasets, we evaluated MMTSA's effectiveness and efficiency in HAR. Results show that our method achieves superior performance improvements (11.13% of cross-subject F1-score on the MMAct dataset) than the previous state-of-the-art (SOTA) methods. The ablation study and analysis suggest that MMTSA's effectiveness in fusing multimodal data for accurate HAR. 
The efficiency evaluation on an edge device showed that MMTSA achieved significantly better accuracy, lower computational load, and lower inference latency than SOTA methods.<\/jats:p>","DOI":"10.1145\/3610872","type":"journal-article","created":{"date-parts":[[2023,9,27]],"date-time":"2023-09-27T15:45:03Z","timestamp":1695829503000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":16,"title":["MMTSA"],"prefix":"10.1145","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-4553-7828","authenticated-orcid":false,"given":"Ziqi","family":"Gao","sequence":"first","affiliation":[{"name":"Key Laboratory of Pervasive Computing, Ministry of Education, Department of Computer Science and Technology, Global Innovation Exchange (GIX) Institute, Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4249-8893","authenticated-orcid":false,"given":"Yuntao","family":"Wang","sequence":"additional","affiliation":[{"name":"Key Laboratory of Pervasive Computing, Ministry of Education, Department of Computer Science and Technology, Tsinghua University, Beijing, China and Department of Computer Technology and Application, Qinghai University, Xining, Qinghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-4616-7500","authenticated-orcid":false,"given":"Jianguo","family":"Chen","sequence":"additional","affiliation":[{"name":"University of Virginia, Charlottesville, VA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6801-0510","authenticated-orcid":false,"given":"Junliang","family":"Xing","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Technology, Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6300-4389","authenticated-orcid":false,"given":"Shwetak","family":"Patel","sequence":"additional","affiliation":[{"name":"Paul G. Allen School for Computer Science and Engineering, University of Washington, Seattle, WA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9279-5386","authenticated-orcid":false,"given":"Xin","family":"Liu","sequence":"additional","affiliation":[{"name":"Paul G. Allen School for Computer Science and Engineering, University of Washington, Seattle, WA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2273-6927","authenticated-orcid":false,"given":"Yuanchun","family":"Shi","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Technology, Tsinghua University, Beijing, China and Qinghai University, Xining, Qinghai, China"}]}],"member":"320","published-online":{"date-parts":[[2023,9,27]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cogsys.2018.04.002"},{"key":"e_1_2_1_2_1","first-page":"4","article-title":"Is space-time attention all you need for video understanding?","volume":"2","author":"Bertasius Gedas","year":"2021","unstructured":"Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding?. In ICML, Vol. 2. 
4.","journal-title":"ICML"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/JIOT.2019.2920283"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447744"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1049\/iet-cvi.2018.5088"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2019.00552"},{"key":"e_1_2_1_8_1","volume-title":"Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity Recognition. arXiv preprint arXiv:2211.04331","author":"Choi Hyeongju","year":"2022","unstructured":"Hyeongju Choi, Apoorva Beedu, Harish Haresamudram, and Irfan Essa. 2022. Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity Recognition. arXiv preprint arXiv:2211.04331 (2022)."},{"key":"e_1_2_1_9_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00630"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.213"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.3390\/s17112688"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.asoc.2017.09.027"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/IROS45743.2020.9340987"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2021.3059624"},{"key":"e_1_2_1_16_1","doi-asserted-by":"crossref","unstructured":"Md Mofijul Islam and Tariq Iqbal. 2022. MuMu: Cooperative multitask learning-based guided multimodal fusion \". AAAI.","DOI":"10.1609\/aaai.v36i1.19988"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01330"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.223"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00559"},{"key":"e_1_2_1_20_1","volume-title":"Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451","author":"Kitaev Nikita","year":"2020","unstructured":"Nikita Kitaev, \u0141ukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451 (2020)."},{"key":"e_1_2_1_21_1","volume-title":"MMAct: A Large-Scale Dataset for Cross Modal Human Action Understanding. In The IEEE International Conference on Computer Vision (ICCV).","author":"Kong Quan","year":"2019","unstructured":"Quan Kong, Ziming Wu, Ziwei Deng, Martin Klinkigt, Bin Tong, and Tomokazu Murakami. 2019. MMAct: A Large-Scale Dataset for Cross Modal Human Action Understanding. 
In The IEEE International Conference on Computer Vision (ICCV)."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1964897.1964918"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-49409-8_7"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3478114"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00718"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3086590"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12319"},{"key":"e_1_2_1_28_1","volume-title":"Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSEN.2019.2911204"},{"key":"e_1_2_1_30_1","volume-title":"Wearable sensors for human activity monitoring: A review","author":"Mukhopadhyay Subhas Chandra","year":"2014","unstructured":"Subhas Chandra Mukhopadhyay. 2014. Wearable sensors for human activity monitoring: A review. IEEE sensors journal 15, 3 (2014), 1321--1330."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.3390\/s17112556"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV.2013.6474999"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/EMBC.2017.8037349"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/EMBC.2017.8037349"},{"key":"e_1_2_1_35_1","volume-title":"International conference on machine learning. PMLR, 1310--1318","author":"Pascanu Razvan","year":"2013","unstructured":"Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International conference on machine learning. PMLR, 1310--1318."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00625"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3070646"},{"key":"e_1_2_1_38_1","volume-title":"Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems 27","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems 27 (2014)."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.14569\/IJACSA.2019.0100311"},{"key":"e_1_2_1_40_1","volume-title":"Asian conference on computer vision. Springer, 445--458","author":"Song Sibo","year":"2014","unstructured":"Sibo Song, Vijay Chandrasekhar, Ngai-Man Cheung, Sanath Narayan, Liyuan Li, and Joo-Hwee Lim. 2014. Activity recognition in egocentric life-logging videos. In Asian conference on computer vision. 
Springer, 445--458."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2016.54"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2016.7472171"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.3390\/s18092892"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3025453.3026027"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.236"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.2968529"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3494995"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_2_1_49_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2018.02.010"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.2991\/cnci-19.2019.95"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3491102.3517698"},{"key":"e_1_2_1_54_1","volume-title":"Workshops at the twenty-ninth AAAI conference on artificial intelligence.","author":"Wang Zhiguang","year":"2015","unstructured":"Zhiguang Wang and Tim Oates. 2015. Encoding time series as images for visual inspection and classification using tiled convolutional neural networks. In Workshops at the twenty-ninth AAAI conference on artificial intelligence."},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.3390\/s19173680"},{"key":"e_1_2_1_56_1","volume-title":"Fastformer: Additive attention can be all you need. arXiv preprint arXiv:2108.09084","author":"Wu Chuhan","year":"2021","unstructured":"Chuhan Wu, Fangzhao Wu, Tao Qi, Yongfeng Huang, and Xing Xie. 2021. Fastformer: Additive attention can be all you need. arXiv preprint arXiv:2108.09084 (2021)."},{"key":"e_1_2_1_57_1","volume-title":"A comparison of 1-D and 2-D deep convolutional neural networks in ECG classification. arXiv preprint arXiv:1810.07088","author":"Wu Yunan","year":"2018","unstructured":"Yunan Wu, Feng Yang, Ying Liu, Xuefan Zha, and Shaofeng Yuan. 2018. A comparison of 1-D and 2-D deep convolutional neural networks in ECG classification. 
arXiv preprint arXiv:1810.07088 (2018)."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3485730.3485937"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01367"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.4108\/icst.mobicase.2014.257786"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544549.3585903"}],"container-title":["Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3610872","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3610872","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,28]],"date-time":"2025-07-28T16:27:26Z","timestamp":1753720046000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3610872"}},"subtitle":["Multi-Modal Temporal Segment Attention Network for Efficient Human Activity Recognition"],"short-title":[],"issued":{"date-parts":[[2023,9,27]]},"references-count":61,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,9,27]]}},"alternative-id":["10.1145\/3610872"],"URL":"https:\/\/doi.org\/10.1145\/3610872","relation":{},"ISSN":["2474-9567"],"issn-type":[{"value":"2474-9567","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,9,27]]},"assertion":[{"value":"2023-09-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
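Note: the abstract above describes encoding each IMU stream as a temporal, structure-preserving gray-scale image via the Gramian Angular Field (GAF), the technique of Wang and Oates (2015), which appears in the reference list (key e_1_2_1_54_1). The following is a minimal illustrative sketch of a GAF (summation-field) encoding in Python with NumPy, not the authors' released MMTSA implementation; the function name, segment length, and test signal are assumptions for demonstration only.

import numpy as np

def gaf_image(x):
    """Encode a 1D sensor segment as a Gramian Angular Summation Field image.

    Illustrative sketch of the GAF transform (Wang & Oates, 2015);
    not the MMTSA authors' code.
    """
    x = np.asarray(x, dtype=np.float64)
    # Rescale the series to [-1, 1] so that arccos is well defined.
    x_min, x_max = x.min(), x.max()
    x_scaled = 2.0 * (x - x_min) / (x_max - x_min + 1e-12) - 1.0
    x_scaled = np.clip(x_scaled, -1.0, 1.0)
    # Polar encoding: each sample becomes an angle phi = arccos(x).
    phi = np.arccos(x_scaled)
    # GASF entry (i, j) = cos(phi_i + phi_j), computed via an outer sum.
    gasf = np.cos(phi[:, None] + phi[None, :])
    # Map [-1, 1] to an 8-bit gray-scale image.
    return ((gasf + 1.0) / 2.0 * 255.0).astype(np.uint8)

if __name__ == "__main__":
    # Example: one accelerometer axis sampled over a 64-step segment (synthetic).
    t = np.linspace(0, 2 * np.pi, 64)
    signal = np.sin(3 * t) + 0.1 * np.random.randn(64)
    img = gaf_image(signal)
    print(img.shape)  # (64, 64) gray-scale image

Each IMU channel encoded this way yields a 2D image that preserves temporal correlations between time steps, which is what allows the abstract's image-based backbone and inter-segment attention to consume IMU data alongside RGB frames.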