{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T19:26:06Z","timestamp":1770751566810,"version":"3.50.0"},"reference-count":69,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2026,2,28]]},"abstract":"<jats:p>This article addresses a new task: distributed multimedia sensor event analysis (DiMSEA). DiMSEA aims to analyze a series of human and machine activities (called \u201cevents\u201d in this article) in complex and extensive real-world environments. Since an observation from a single sensor is often missing or fragmented in such an environment, observations from multiple locations and modalities should be integrated to analyze events comprehensively. However, a learning method has yet to be established to extract joint representations that effectively combine such distributed observations. Therefore, we propose guided masked self-distillation modeling (Guided-MELD) for inter-sensor relationship modeling. The basic idea of Guided-MELD is to learn to supplement the information from the masked sensor with information from other sensors needed to detect the event. Guided-MELD is expected to effectively distill fragmented target event information from sensors without over-relying on any specific sensors. To validate the effectiveness of the proposed method in DiMSEA, we recorded two new datasets: MM-Store and MM-Office. These datasets consist of human activities in a convenience store and an office, recorded using distributed cameras and microphones. Experimental results show that the proposed Guided-MELD improves event tagging and detection performance and outperforms conventional inter-sensor relationship modeling methods. 
Furthermore, the proposed method performed robustly even when sensors were reduced.<\/jats:p>","DOI":"10.1145\/3779057","type":"journal-article","created":{"date-parts":[[2025,12,19]],"date-time":"2025-12-19T14:15:40Z","timestamp":1766153740000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Guided Masked Self-Distillation Modeling for Distributed Multimedia Sensor Event Analysis"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2776-9701","authenticated-orcid":false,"given":"Masahiro","family":"Yasuda","sequence":"first","affiliation":[{"name":"NTT, Inc., Musashino, Japan and Tokyo Metropolitan University, Hachioji, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1759-4533","authenticated-orcid":false,"given":"Noboru","family":"Harada","sequence":"additional","affiliation":[{"name":"NTT, Inc., Atsugi-shi, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7856-248X","authenticated-orcid":false,"given":"Yasunori","family":"Ohishi","sequence":"additional","affiliation":[{"name":"NTT, Inc., Atsugi-shi, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8712-0464","authenticated-orcid":false,"given":"Shoichiro","family":"Saito","sequence":"additional","affiliation":[{"name":"NTT, Inc., Musashino-shi, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-3815-5254","authenticated-orcid":false,"given":"Akira","family":"Nakayama","sequence":"additional","affiliation":[{"name":"NTT, Inc., Musashino-shi, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4242-2773","authenticated-orcid":false,"given":"Nobutaka","family":"Ono","sequence":"additional","affiliation":[{"name":"Tokyo Metropolitan University, Hachioji, Japan"}]}],"member":"320","published-online":{"date-parts":[[2026,2,10]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"crossref","first-page":"34","DOI":"10.1109\/JSTSP.2018.2885636","article-title":"Sound event localization and 
detection of overlapping sources using convolutional recurrent neural networks","volume":"13","author":"Adavanne S.","year":"2018","unstructured":"S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen. 2018. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing 13 (2018), 34\u201348.","journal-title":"IEEE Journal of Selected Topics in Signal Processing"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2008.2001369"},{"key":"e_1_3_2_4_2","first-page":"609","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Arandjelovi\u0107 Relja","year":"2017","unstructured":"Relja Arandjelovi\u0107 and Andrew Zisserman. 2017. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision, 609\u2013617."},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19821-2_26"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2022.3186162"},{"key":"e_1_3_2_7_2","first-page":"11618","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Caesar H.","year":"2020","unstructured":"H. Caesar, V. K. R. Bankiti, A. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. 2020. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 11618\u201311628."},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2017.10.009"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2020.3037496"},{"key":"e_1_3_2_10_2","volume-title":"Proceedings of the 37th International Conference on Machine Learning 2020","author":"Chen Ting","year":"2020","unstructured":"Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. 
A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning 2020, Article 149."},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.691"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2006.886263"},{"key":"e_1_3_2_13_2","first-page":"4171","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171\u20134186."},{"key":"e_1_3_2_14_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Dosovitskiy A.","year":"2021","unstructured":"A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_15_2","first-page":"1","volume-title":"Proceedings of the 1st Annual Conference on Robot Learning 2017","volume":"78","author":"Dosovitskiy A.","year":"2017","unstructured":"A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. 2017. CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning 2017, Vol. 78, 1\u201316."},{"key":"e_1_3_2_16_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Gao R.","year":"2020","unstructured":"R. Gao, T. H. Oh, K. Grauman, and L. Torresani. 2020. 
Listen to look: Action recognition by previewing audio. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_17_2","volume-title":"Proceedings of Neural Information Processing Systems 2020","author":"Grill Jean-Bastien","year":"2020","unstructured":"Jean-Bastien Grill, Florian Strub, Florent Altch\u00e9, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent: A new approach to self-supervised learning. In Proceedings of Neural Information Processing Systems 2020."},{"key":"e_1_3_2_18_2","first-page":"6047","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Gu C.","year":"2018","unstructured":"C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. 2018. AVA: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 6047\u20136056."},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1148\/radiology.143.1.7063747"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_21_2","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing","author":"Hershey S.","year":"2017","unstructured":"S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. 2017. CNN architectures for large-scale audio classification. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing."},{"key":"e_1_3_2_22_2","first-page":"4806","article-title":"The real-world-weight cross-entropy loss function: Modeling the costs of mislabeling","author":"Ho Y.","year":"2019","unstructured":"Y. 
Ho and S. Wookey. 2019. The real-world-weight cross-entropy loss function: Modeling the costs of mislabeling. IEEE Access 8 (2019), 4806\u20134813.","journal-title":"IEEE Access"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2017.2771462"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2017.2690559"},{"key":"e_1_3_2_25_2","volume-title":"Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop","author":"Imoto Keisuke","year":"2019","unstructured":"Keisuke Imoto and Nobutaka Ono. 2019. RU multichannel domestic acoustic scenes 2019: A multichannel dataset recorded by distributed microphones with various properties. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop."},{"key":"e_1_3_2_26_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Joze H. R. V.","year":"2020","unstructured":"H. R. V. Joze, A. Shaban, M. L. Iuzzolino, and K. Koishida. 2020. MMTM: Multimodal transfer module for CNN fusion. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_27_2","first-page":"15979","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Kaiming He","year":"2022","unstructured":"He Kaiming, Chen Xinlei, Xie Saining, Li Yanghao, Doll\u00e1r Piotr, and B. Girshick Ross. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 15979\u201315988."},{"key":"e_1_3_2_28_2","volume-title":"Proceedings of the Automatic Speech Recognition and Understanding Workshop","author":"Karita S.","year":"2019","unstructured":"S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang, et al. 2019. A comparative study on transformer vs RNN in speech applications. 
In Proceedings of the Automatic Speech Recognition and Understanding Workshop."},{"key":"e_1_3_2_29_2","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Kazakos E.","year":"2019","unstructured":"E. Kazakos, A. Nagrani, A. Zisserman, and D. Damen. 2019. EPIC-fusion: Audio-visual temporal binding for egocentric action recognition. In Proceedings of the IEEE International Conference on Computer Vision."},{"key":"e_1_3_2_30_2","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Kong Q.","year":"2019","unstructured":"Q. Kong, Z. Wu, Z. Deng, M. Klinkigt, B. Tong, and T. Murakami. 2019. MMAct: A large-scale dataset for cross modal human action understanding. In Proceedings of the IEEE International Conference on Computer Vision."},{"key":"e_1_3_2_31_2","first-page":"777","volume-title":"IEEE\/ACM Transactions on Audio, Speech, and Language Processing","author":"Kong Q.","year":"2018","unstructured":"Q. Kong, Y. Xu, I. Sobieraj, and M. Plumbley. 2018. Sound event detection and Time-Frequency segmentation from weakly labelled data. IEEE\/ACM Transactions on Audio, Speech, and Language Processing 27 (2018), 777\u2013787."},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3150469"},{"key":"e_1_3_2_33_2","unstructured":"Yinhan Liu Myle Ott Naman Goyal Jingfei Du Mandar Joshi Danqi Chen Omer Levy Mike Lewis Luke Zettlemoyer and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692. Retrieved from https:\/\/arxiv.org\/abs\/1907.11692"},{"key":"e_1_3_2_34_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Loshchilov I.","year":"2019","unstructured":"I. Loshchilov and F. Hutter. 2019. Decoupled weight decay regularization. 
In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_35_2","first-page":"5694","article-title":"Dual masked modeling for weakly-supervised temporal boundary discovery","author":"Ma Yuer","year":"2023","unstructured":"Yuer Ma, Yi Liu, Limin Wang, Wenxiong Kang, Yu Qiao, and Yali Wang. 2023. Dual masked modeling for weakly-supervised temporal boundary discovery. IEEE Transactions on Multimedia 26 (2023), 5694\u20135704.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_2_36_2","volume-title":"Proceedings of the 24th European Signal Processing Conference","author":"Mesaros A.","year":"2016","unstructured":"A. Mesaros, T. Heittola, and T. Virtanen. 2016. TUT database for acoustic scene classification and sound event detection. In Proceedings of the 24th European Signal Processing Conference."},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/1882992.1883032"},{"key":"e_1_3_2_38_2","first-page":"1","volume-title":"Proceedings of the International Joint Conference on Neural Networks (IJCNN)","author":"Niizumi Daisuke","year":"2021","unstructured":"Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, and Kunio Kashino. 2021. BYOL for audio: Self-supervised learning for general-purpose audio representation. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), 1\u20138."},{"key":"e_1_3_2_39_2","first-page":"1","volume-title":"Holistic Evaluation of Audio Representations (HEAR)","volume":"166","author":"Niizumi Daisuke","year":"2022","unstructured":"Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, and Kunio Kashino. 2022. Masked spectrogram modeling using masked autoencoders for learning general-purpose audio representation. In Holistic Evaluation of Audio Representations (HEAR), Vol. 
166, 1\u201324."},{"key":"e_1_3_2_40_2","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing","author":"Niizumi D.","year":"2023","unstructured":"D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada, and K. Kashino. 2023. Masked modeling duo: Learning representations by encouraging both networks to model the input. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing."},{"issue":"2","key":"e_1_3_2_41_2","first-page":"45","article-title":"Human activity recognition from multiple sensors data using multi-fusion representations and CNNs","volume":"16","author":"Noori Farzan Majeed","year":"2020","unstructured":"Farzan Majeed Noori, Michael Riegler, Md Zia Uddin, and Jim Torresen. 2020. Human activity recognition from multiple sensors data using multi-fusion representations and CNNs. ACM Transactions on Multimedia Computing, Communications, and Applications 16, 2 (2020), 45.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"key":"e_1_3_2_42_2","first-page":"2405","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Owens A.","year":"2016","unstructured":"A. Owens, P. Isola, J. H. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. 2016. Visually indicated sounds. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2405\u20132413."},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2020.3004325"},{"key":"e_1_3_2_44_2","first-page":"2613","volume-title":"Proceedings of the Interspeech","author":"Park Daniel S.","year":"2019","unstructured":"Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. 
In Proceedings of the Interspeech, 2613\u20132617."},{"key":"e_1_3_2_45_2","first-page":"9557","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Patrick Mandela","year":"2020","unstructured":"Mandela Patrick, Yuki M. Asano, Polina Kuznetsova, Ruth C. Fong, Jo\u00e3o F. Henriques, Geoffrey Zweig, and Andrea Vedaldi. 2020. On compositions of transformations in contrastive self-supervised learning. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 9557\u20139567."},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00713"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00700"},{"key":"e_1_3_2_48_2","first-page":"6960","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Qian Rui","year":"2021","unstructured":"Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, and Yin Cui. 2021. Spatiotemporal contrastive video representation learning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 6960\u20136970."},{"issue":"10","key":"e_1_3_2_49_2","doi-asserted-by":"crossref","first-page":"2576","DOI":"10.1109\/TMM.2019.2902489","article-title":"Multi-Speaker tracking from an audio\u2013visual sensing device","volume":"21","author":"Qian Xinyuan","year":"2019","unstructured":"Xinyuan Qian, Alessio Brutti, Oswald Lanz, Maurizio Omologo, and Andrea Cavallaro. 2019. Multi-Speaker tracking from an audio\u2013visual sensing device. IEEE Transactions on Multimedia 21, 10 (2019), 2576\u20132588.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.115"},{"key":"e_1_3_2_51_2","first-page":"3544","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Singh K.","year":"2017","unstructured":"K. Singh and Y. Lee. 
2017. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In Proceedings of the IEEE International Conference on Computer Vision, 3544\u20133553."},{"key":"e_1_3_2_52_2","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Su H.","year":"2015","unstructured":"H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. 2015. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE\/CVF International Conference on Computer Vision."},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00756"},{"key":"e_1_3_2_54_2","volume-title":"Proceedings of Neural Information Processing Systems","volume":"30","author":"Vaswani A.","year":"2017","unstructured":"A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, A. N. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In Proceedings of Neural Information Processing Systems, Vol. 30."},{"key":"e_1_3_2_55_2","volume-title":"Proceedings of the European Conference on Computer Vision Workshops","author":"Vielzeuf V.","year":"2018","unstructured":"V. Vielzeuf, A. Lechervy, S. Pateux, and F. Jurie. 2018. CentralNet: A multilayer approach for multimodal fusion. In Proceedings of the European Conference on Computer Vision Workshops."},{"key":"e_1_3_2_56_2","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Wang D.","year":"2018","unstructured":"D. Wang, W. Ouyang, W. Li, and D. Xu. 2018. Dividing and aggregating network for multi-view action recognition. In Proceedings of the European Conference on Computer Vision."},{"key":"e_1_3_2_57_2","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Wang J.","year":"2014","unstructured":"J. Wang, X. Nie, Y. Xia, Y. Wu, and S. Zhu. 2014. Cross-view action modeling, learning, and recognition. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2868668"},{"key":"e_1_3_2_59_2","doi-asserted-by":"crossref","first-page":"6906","DOI":"10.1109\/TMM.2024.3358085","article-title":"CM-MaskSD: Cross-modality masked self-distillation for referring image segmentation","author":"Wang Wenxuan","year":"2024","unstructured":"Wenxuan Wang, Xingjian He, Yisi Zhang, Longteng Guo, Jiachen Shen, Jiangyun Li, and Jing Liu. 2024. CM-MaskSD: Cross-modality masked self-distillation for referring image segmentation. IEEE Transactions on Multimedia 26 (2024), 6906\u20136916.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2020.3016222"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3147369"},{"key":"e_1_3_2_62_2","doi-asserted-by":"crossref","unstructured":"Yi Wu Yuxin Wu Georgia Gkioxari and Yuandong Tian. 2018. Building generalizable agents with a realistic and rich 3D environment. arXiv:1801.02209. Retrieved from https:\/\/arxiv.org\/abs\/1801.02209","DOI":"10.1016\/j.nano.2017.11.172"},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00943"},{"key":"e_1_3_2_64_2","first-page":"4638","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing","author":"Yasuda Masahiro","year":"2022","unstructured":"Masahiro Yasuda, Yasunori Ohishi, Shoichiro Saito, and Noboru Harada. 2022. Multi-view and multi-modal event detection utilizing transformer-based multi-sensor fusion. 
In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 4638\u20134642."},{"key":"e_1_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.1007\/s12652-017-0597-y"},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2023.3325965"},{"key":"e_1_3_2_67_2","first-page":"1528","article-title":"MAMO: Fine-grained vision-language representations learning with masked multimodal modeling","author":"Zhao Zijia","year":"2023","unstructured":"Zijia Zhao, Longteng Guo, Xingjian He, Shuai Shao, Zehuan Yuan, and Jing Liu. 2023. MAMO: Fine-grained vision-language representations learning with masked multimodal modeling. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR \u201923), 1528\u20131538.","journal-title":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR \u201923)"},{"key":"e_1_3_2_68_2","doi-asserted-by":"crossref","first-page":"2542","DOI":"10.1109\/TIP.2016.2548242","article-title":"Cross-view action recognition via a transferable dictionary pair","volume":"25","author":"Zheng J.","year":"2012","unstructured":"J. Zheng, Z. Jiang, P. J. Phillips, and R. Chellappa. 2012. Cross-view action recognition via a transferable dictionary pair. IEEE Transactions on Image Processing 25 (2012), 2542\u20132556.","journal-title":"IEEE Transactions on Image Processing"},{"key":"e_1_3_2_69_2","first-page":"13001","article-title":"Random erasing data augmentation","volume":"34","author":"Zhong Zhun","year":"2020","unstructured":"Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. 2020. Random erasing data augmentation. In Proceedings of the Association for the Advancement of Artificial Intelligence, Vol. 
34, 13001\u201313008.","journal-title":"Proceedings of the Association for the Advancement of Artificial Intelligence"},{"key":"e_1_3_2_70_2","doi-asserted-by":"crossref","first-page":"1619","DOI":"10.1109\/ICME55011.2023.00279","volume-title":"Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME)","author":"Zhu He","year":"2023","unstructured":"He Zhu, Yang Chen, Guyue Hu, and Shan Yu. 2023. Information-density masking strategy for masked image modeling. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), 1619\u20131624."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3779057","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T12:13:26Z","timestamp":1770725606000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3779057"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,10]]},"references-count":69,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,2,28]]}},"alternative-id":["10.1145\/3779057"],"URL":"https:\/\/doi.org\/10.1145\/3779057","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,10]]},"assertion":[{"value":"2024-12-04","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-16","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-02-10","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}