{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T16:56:21Z","timestamp":1775235381596,"version":"3.50.1"},"reference-count":62,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,11,21]],"date-time":"2024-11-21T00:00:00Z","timestamp":1732147200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"The National Key R&D Program of China","award":["2022ZD0119100"],"award-info":[{"award-number":["2022ZD0119100"]}]},{"name":"China NSF grant","award":["62472278, 62025204, 62432007, 62332014, and 62332013"],"award-info":[{"award-number":["62472278, 62025204, 62432007, 62332014, and 62332013"]}]},{"name":"Alibaba Innovation Research Program"},{"name":"Tencent Rhino Bird Key Research Project"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."],"published-print":{"date-parts":[[2024,11,21]]},"abstract":"<jats:p>This paper proposes a novel contrastive cross-modal knowledge transfer framework, SemiCMT, for multi-modal IoT sensing applications. It effectively transfers the feature extraction capability (also called knowledge) learned from a source modality (e.g., acoustic signals) with abundant unlabeled training data, to a target modality (e.g., seismic signals) that lacks enough training data, in a self-supervised manner with the help of only a small set of synchronized multi-modal pairs. The transferred model can be quickly finetuned to downstream target-modal tasks with only limited labels. The key design constitutes of three aspects: First, we factorize the latent embedding of each modality into shared and private components and perform knowledge transfer considering both the modality information commonality and gaps. 
Second, we enforce structural correlation constraints between the source modality and the target modality, to push the target-modal embedding space to be symmetric to the source-modal embedding space, with the anchoring of additional source-modal samples, which expands the existing modal-matching objective in current multi-modal contrastive frameworks. Finally, we conduct downstream task finetuning in the spherical space with a KNN classifier to better align with the structured modality embedding space. Extensive evaluations on five multi-modal IoT datasets are performed to validate the effectiveness of SemiCMT in cross-modal knowledge transfer, including a new self-collected dataset using seismic and acoustic signals for office activity monitoring. SemiCMT consistently outperforms existing self-supervised and knowledge transfer approaches by up to 36.47% in the finetuned target-modal classification tasks. The code and the self-collected dataset will be released at https:\/\/github.com\/SJTU-RTEAS\/SemiCMT.<\/jats:p>","DOI":"10.1145\/3699779","type":"journal-article","created":{"date-parts":[[2024,11,21]],"date-time":"2024-11-21T12:23:32Z","timestamp":1732191812000},"page":"1-30","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["SemiCMT: Contrastive Cross-Modal Knowledge Transfer for IoT Sensing with Semi-Paired Multi-Modal Signals"],"prefix":"10.1145","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8954-9109","authenticated-orcid":false,"given":"Yatong","family":"Chen","sequence":"first","affiliation":[{"name":"Shanghai Jiao Tong University"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-0713-4393","authenticated-orcid":false,"given":"Chenzhi","family":"Hu","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong 
University"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-4297-5865","authenticated-orcid":false,"given":"Tomoyoshi","family":"Kimura","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4881-8376","authenticated-orcid":false,"given":"Qinya","family":"Li","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7643-7239","authenticated-orcid":false,"given":"Shengzhong","family":"Liu","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0965-9058","authenticated-orcid":false,"given":"Fan","family":"Wu","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6934-1685","authenticated-orcid":false,"given":"Guihai","family":"Chen","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University"}]}],"member":"320","published-online":{"date-parts":[[2024,11,21]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01594"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3494994"},{"key":"e_1_2_1_3_1","volume-title":"International Conference on Machine Learning (ICML). PMLR, 1597--1607","author":"Chen Ting","year":"2020","unstructured":"Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML). 
PMLR, 1597--1607."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3616855.3635795"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3550316"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2023.3308189"},{"key":"e_1_2_1_7_1","first-page":"1","article-title":"MMTSA: Multi-Modal Temporal Segment Attention Network for Efficient Human Activity Recognition","volume":"7","author":"Gao Ziqi","year":"2023","unstructured":"Ziqi Gao, Yuntao Wang, Jianguo Chen, Junliang Xing, Shwetak Patel, Xin Liu, and Yuanchun Shi. 2023. MMTSA: Multi-Modal Temporal Segment Attention Network for Efficient Human Activity Recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT) 7, 3 (2023), 1--26.","journal-title":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)"},{"key":"e_1_2_1_8_1","first-page":"1","article-title":"Deep Heterogeneous Contrastive Hyper-Graph Learning for In-the-Wild Context-Aware Human Activity Recognition","volume":"7","author":"Ge Wen","year":"2024","unstructured":"Wen Ge, Guanyi Mou, Emmanuel O Agu, and Kyumin Lee. 2024. Deep Heterogeneous Contrastive Hyper-Graph Learning for In-the-Wild Context-Aware Human Activity Recognition. 
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT) 7, 4 (2024), 1--23.","journal-title":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2018.2858933"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.3390\/s20030949"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3410531.3414306"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3463506"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3550299"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01553"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN.2019.8852082"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3517246"},{"key":"e_1_2_1_17_1","volume-title":"A Survey of IMU Based Cross-Modal Transfer Learning in Human Activity Recognition. arXiv preprint arXiv:2403.15444","author":"Kamboj Abhi","year":"2024","unstructured":"Abhi Kamboj and Minh Do. 2024. A Survey of IMU Based Cross-Modal Transfer Learning in Human Activity Recognition. arXiv preprint arXiv:2403.15444 (2024), 1--18."},{"key":"e_1_2_1_18_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.202"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544794.3558471"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475247"},{"key":"e_1_2_1_22_1","unstructured":"Hangyu Lin Chen Liu Chengming Xu Zhengqi Gao Hang Zhao Yanwei Fu and Yuan Yao. 2023. 
Generalizable Cross-Modality Distillation with Contrastive Learning. (2023) 1--22."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCCN52240.2021.9522151"},{"key":"e_1_2_1_24_1","first-page":"1","article-title":"Contrastive Learning based Modality-Invariant Feature Acquisition for Robust Multimodal Emotion Recognition with Missing Modalities","volume":"01","author":"Liu Rui","year":"2024","unstructured":"Rui Liu, Haolin Zuo, Zheng Lian, Bjorn W Schuller, and Haizhou Li. 2024. Contrastive Learning based Modality-Invariant Feature Acquisition for Robust Multimodal Emotion Recognition with Missing Modalities. IEEE Transactions on Affective Computing (TAC) 01 (2024), 1--18.","journal-title":"IEEE Transactions on Affective Computing (TAC)"},{"key":"e_1_2_1_25_1","volume-title":"FOCAL: Contrastive Learning for Multimodal Time-Series Sensing Signals in Factorized Orthogonal Latent Space. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS). 1--30","author":"Liu Shengzhong","year":"2023","unstructured":"Shengzhong Liu, Tomoyoshi Kimura, Dongxin Liu, Ruijie Wang, Jinyang Li, Suhas Diggavi, Mani Srivastava, and Tarek Abdelzaher. 2023. FOCAL: Contrastive Learning for Multimodal Time-Series Sensing Signals in Factorized Orthogonal Latent Space. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS). 1--30."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00320"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2024.3356910"},{"key":"e_1_2_1_28_1","volume-title":"SGDR: Stochastic Gradient Descent with Warm Restarts. In International Conference on Learning Representations (ICLR). 1--16","author":"Loshchilov Ilya","year":"2016","unstructured":"Ilya Loshchilov and Frank Hutter. 2016. SGDR: Stochastic Gradient Descent with Warm Restarts. In International Conference on Learning Representations (ICLR). 
1--16."},{"key":"e_1_2_1_29_1","volume-title":"Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR). 1--19","author":"Loshchilov Ilya","year":"2018","unstructured":"Ilya Loshchilov and Frank Hutter. 2018. Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR). 1--19."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_11"},{"key":"e_1_2_1_31_1","volume-title":"Spherical text embedding. Advances in neural information processing systems (NeurIPS) 32","author":"Meng Yu","year":"2019","unstructured":"Yu Meng, Jiaxin Huang, Guangyuan Wang, Chao Zhang, Honglei Zhuang, Lance Kaplan, and Jiawei Han. 2019. Spherical text embedding. Advances in neural information processing systems (NeurIPS) 32 (2019), 1--10."},{"key":"e_1_2_1_32_1","first-page":"1","article-title":"Spatial-Temporal Masked Autoencoder for Multi-Device Wearable Human Activity Recognition","volume":"7","author":"Miao Shenghuan","year":"2024","unstructured":"Shenghuan Miao, Ling Chen, and Rong Hu. 2024. Spatial-Temporal Masked Autoencoder for Multi-Device Wearable Human Activity Recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT) 7, 4 (2024), 1--25.","journal-title":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)"},{"key":"e_1_2_1_33_1","volume-title":"International Conference on Artificial Intelligence and Statistics (AISTATS). PMLR, 4348--4380","author":"Nakada Ryumei","year":"2023","unstructured":"Ryumei Nakada, Halil Ibrahim Gulluk, Zhun Deng, Wenlong Ji, James Zou, and Linjun Zhang. 2023. Understanding multimodal contrastive learning and incorporating unpaired data. In International Conference on Artificial Intelligence and Statistics (AISTATS). 
PMLR, 4348--4380."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548238"},{"key":"e_1_2_1_35_1","volume-title":"International Conference on Machine Learning (ICML). PMLR, 16969--16989","author":"Nonnenmacher Manuel T","year":"2022","unstructured":"Manuel T Nonnenmacher, Lukas Oldenburg, Ingo Steinwart, and David Reeb. 2022. Utilizing expert features for contrastive learning of time-series representations. In International Conference on Machine Learning (ICML). PMLR, 16969--16989."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3495243.3560519"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISWC.2012.13"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01312"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3328932"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSPW59220.2023.10193365"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV56688.2023.00338"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3495243.3560529"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/PERCOM.2016.7456521"},{"key":"e_1_2_1_44_1","volume-title":"Exploring contrastive learning in human activity recognition for healthcare. arXiv preprint arXiv:2011.11542","author":"Tang Chi Ian","year":"2020","unstructured":"Chi Ian Tang, Ignacio Perez-Pozuelo, Dimitris Spathis, and Cecilia Mascolo. 2020. Exploring contrastive learning in human activity recognition for healthcare. arXiv preprint arXiv:2011.11542 (2020), 1--6."},{"key":"e_1_2_1_45_1","volume-title":"Contrastive Multiview Coding. In European Conference on Computer Vision (ECCV). 776--794","author":"Tian Yonglong","year":"2020","unstructured":"Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2020. Contrastive Multiview Coding. In European Conference on Computer Vision (ECCV). 
776--794."},{"key":"e_1_2_1_46_1","volume-title":"CVF International Conference on Computer Vision (ICCV). 11--17","author":"Yu TIAN","unstructured":"Yu TIAN, Guansong PANG, Yuanhong CHEN, Rajvinder SINGH, Johan W VERJANS, and Gustavo CARNEIRO. [n.d.]. Weakly-supervised video anomaly detection with contrastive learning of long and short-range temporal features. In CVF International Conference on Computer Vision (ICCV). 11--17."},{"key":"e_1_2_1_47_1","volume-title":"Unsupervised Representation Learning for Time Series with Temporal Neighborhood Coding. In International Conference on Learning Representations (ICLR). 1--17","author":"Tonekaboni Sana","year":"2020","unstructured":"Sana Tonekaboni, Danny Eytan, and Anna Goldenberg. 2020. Unsupervised Representation Learning for Time Series with Temporal Neighborhood Coding. In International Conference on Learning Representations (ICLR). 1--17."},{"key":"e_1_2_1_48_1","volume-title":"Proceedings of the Asian Conference on Computer Vision (ACCV). 3223--3240","author":"Tran Vinh","year":"2022","unstructured":"Vinh Tran, Niranjan Balasubramanian, and Minh Hoai. 2022. From Within to Between: Knowledge Distillation for Cross Modality Retrieval. In Proceedings of the Asian Conference on Computer Vision (ACCV). 3223--3240."},{"key":"e_1_2_1_49_1","article-title":"Visualizing data using t-SNE","volume":"9","author":"der Maaten Laurens Van","year":"2008","unstructured":"Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).","journal-title":"Journal of machine learning research"},{"key":"e_1_2_1_50_1","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV) Workshops. 1--15","author":"Vielzeuf Valentin","year":"2018","unstructured":"Valentin Vielzeuf, Alexis Lechervy, St\u00e9phane Pateux, and Fr\u00e9d\u00e9ric Jurie. 2018. Centralnet: a multilayer approach for multimodal fusion. 
In Proceedings of the European Conference on Computer Vision (ECCV) Workshops. 1--15."},{"key":"e_1_2_1_51_1","first-page":"1","article-title":"RF-CM: Cross-modal framework for RF-enabled few-shot human activity recognition","volume":"7","author":"Wang Xuan","year":"2023","unstructured":"Xuan Wang, Tong Liu, Chao Feng, Dingyi Fang, and Xiaojiang Chen. 2023. RF-CM: Cross-modal framework for RF-enabled few-shot human activity recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT) 7, 1 (2023), 1--28.","journal-title":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)"},{"key":"e_1_2_1_52_1","volume-title":"Connecting multi-modal contrastive representations. Advances in Neural Information Processing Systems (NeurIPS) 36","author":"Wang Zehan","year":"2024","unstructured":"Zehan Wang, Yang Zhao, Haifeng Huang, Jiageng Liu, Aoxiong Yin, Li Tang, Linjun Li, Yongqi Wang, Ziang Zhang, and Zhou Zhao. 2024. Connecting multi-modal contrastive representations. Advances in Neural Information Processing Systems (NeurIPS) 36 (2024)."},{"key":"e_1_2_1_53_1","volume-title":"The Modality Focusing Hypothesis: Towards Understanding Crossmodal Knowledge Distillation. In The Eleventh International Conference on Learning Representations (ICLR). 1--41","author":"Xue Zihui","year":"2022","unstructured":"Zihui Xue, Zhengqi Gao, Sucheng Ren, and Hang Zhao. 2022. The Modality Focusing Hypothesis: Towards Understanding Crossmodal Knowledge Distillation. In The Eleventh International Conference on Learning Representations (ICLR). 
1--41."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR56361.2022.9956607"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3050314"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00837"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i8.20881"},{"key":"e_1_2_1_58_1","first-page":"3988","article-title":"Self-supervised contrastive pre-training for time series via time-frequency consistency","volume":"35","author":"Zhang Xiang","year":"2022","unstructured":"Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, and Marinka Zitnik. 2022. Self-supervised contrastive pre-training for time series via time-frequency consistency. Advances in Neural Information Processing Systems (NiPS) 35 (2022), 3988--4003.","journal-title":"Advances in Neural Information Processing Systems (NiPS)"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3569482"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2020.3029181"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3610905"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00148"}],"container-title":["Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous 
Technologies"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3699779","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3699779","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,25]],"date-time":"2025-09-25T16:26:26Z","timestamp":1758817586000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3699779"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,21]]},"references-count":62,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,11,21]]}},"alternative-id":["10.1145\/3699779"],"URL":"https:\/\/doi.org\/10.1145\/3699779","relation":{},"ISSN":["2474-9567"],"issn-type":[{"value":"2474-9567","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,21]]},"assertion":[{"value":"2024-11-21","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}