{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T03:07:14Z","timestamp":1760238434941,"version":"build-2065373602"},"reference-count":29,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2020,8,9]],"date-time":"2020-08-09T00:00:00Z","timestamp":1596931200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Future Internet"],"abstract":"<jats:p>Deep learning (DL) models have emerged in recent years as the state-of-the-art technique across numerous machine learning application domains. In particular, image processing-related tasks have seen a significant improvement in terms of performance due to increased availability of large datasets and extensive growth of computing power. In this paper we investigate the problem of group activity recognition in office environments using a multimodal deep learning approach, by fusing audio and visual data from video. Group activity recognition is a complex classification task, given that it extends beyond identifying the activities of individuals, by focusing on the combinations of activities and the interactions between them. The proposed fusion network was trained based on the audio\u2013visual stream from the AMI Corpus dataset. The procedure consists of two steps. First, we extract a joint audio\u2013visual feature representation for activity recognition, and second, we account for the temporal dependencies in the video in order to complete the classification task. We provide a comprehensive set of experimental results showing that our proposed multimodal deep network architecture outperforms previous approaches, which have been designed for unimodal analysis, on the aforementioned AMI dataset.<\/jats:p>","DOI":"10.3390\/fi12080133","type":"journal-article","created":{"date-parts":[[2020,8,10]],"date-time":"2020-08-10T05:07:23Z","timestamp":1597036043000},"page":"133","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":12,"title":["Multimodal Deep Learning for Group Activity Recognition in Smart Office Environments"],"prefix":"10.3390","volume":"12","author":[{"given":"George Albert","family":"Florea","sequence":"first","affiliation":[{"name":"Department of Computer Science, Malm\u00f6 University, 20506 Malm\u00f6, Sweden"}]},{"given":"Radu-Casian","family":"Mihailescu","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Malm\u00f6 University, 20506 Malm\u00f6, Sweden"},{"name":"Internet of Things and People Research Center, Malm\u00f6 University, 20506 Malm\u00f6, Sweden"}]}],"member":"1968","published-online":{"date-parts":[[2020,8,9]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"678","DOI":"10.1109\/ACCESS.2015.2437951","article-title":"The Internet of Things for Health Care: A Comprehensive Survey","volume":"3","author":"Islam","year":"2015","journal-title":"IEEE Access"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1662","DOI":"10.1016\/j.eswa.2012.09.004","article-title":"Elderly activities recognition and classification for applications in assisted living","volume":"40","author":"Chernbumroong","year":"2013","journal-title":"Expert Syst. Appl."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"269","DOI":"10.1109\/JIOT.2017.2647881","article-title":"IoT Considerations, Requirements, and Architectures for Smart Buildings\u2014Energy Optimization and Next-Generation Building Management Systems","volume":"4","author":"Minoli","year":"2017","journal-title":"IEEE Internet Things J."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Lim, B., Van Den Briel, M., Thi\u00e9baux, S., Backhaus, S., and Bent, R. (2015, January 25\u201330). HVAC-Aware Occupancy Scheduling. Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI\u201915, Austin, TX, USA.","DOI":"10.1609\/aaai.v29i1.9236"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Renals, S., and Bengio, S. (2006). The AMI Meeting Corpus: A Pre-announcement. Machine Learning for Multimodal Interaction, Springer.","DOI":"10.1007\/11965152"},{"key":"ref_6","unstructured":"Truong, N.C., Baarslag, T., Ramchurn, G., and Tran-Thanh, L. (2016, January 9\u201311). Interactive scheduling of appliance usage in the home. Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI-160 (15\/07\/16), New York, NY, USA."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Yang, Y., Hao, J., Zheng, Y., and Yu, C. (2019, January 10\u201316). Large-Scale Home Energy Management Using Entropy-Based Collective Multiagent Deep Reinforcement Learning Framework. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China.","DOI":"10.24963\/ijcai.2019\/89"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"146","DOI":"10.1016\/j.apenergy.2017.11.055","article-title":"Real-time activity recognition for energy efficiency in buildings","volume":"211","author":"Ghahramani","year":"2018","journal-title":"Appl. Energy"},{"key":"ref_9","unstructured":"Ye, H., Gu, T., Zhu, X., Xu, J., Tao, X., Lu, J., and Jin, N. (2012, January 19\u201323). FTrack: Infrastructure-free floor localization via mobile phone sensing. Proceedings of the 2012 IEEE International Conference on Pervasive Computing and Communications, Lugano, Switzerland."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Sarker, K., Masoud, M., Belkasim, S., and Ji, S. (2018, January 17\u201320). Towards Robust Human Activity Recognition from RGB Video Stream with Limited Labeled Data. Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA.","DOI":"10.1109\/ICMLA.2018.00029"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Haubrick, P., and Ye, J. (2019, January 11\u201315). Robust Audio Sensing with Multi-Sound Classification. Proceedings of the 2019 IEEE International Conference on Pervasive Computing and Communications, Kyoto, Japan.","DOI":"10.1109\/PERCOM.2019.8767402"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Badica, C., El Fallah Seghrouchni, A., Beynier, A., Camacho, D., Herpson, C., Hindriks, K., and Novais, P. (2017). Towards Collaborative Sensing using Dynamic Intelligent Virtual Sensors. Intelligent Distributed Computing, Springer International Publishing.","DOI":"10.1007\/978-3-319-48829-5"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Wu, Z., Jiang, Y.G., Wang, X., Ye, H., and Xue, X. (2016, January 15\u201319). Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification. Proceedings of the 24th ACM International Conference on Multimedia, MM \u201916, Amsterdam, The Netherlands.","DOI":"10.1145\/2964284.2964328"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Arabac\u0131, M.A., \u00d6zkan, F., Surer, E., Jan\u010dovi\u010d, P., and Temizel, A. (2020). Multi-modal egocentric activity recognition using multi-kernel learning. Multimed. Tools Appl.","DOI":"10.1007\/s11042-020-08789-7"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Kazakos, E., Nagrani, A., Zisserman, A., and Damen, D. (2019, January 27\u201328). EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00559"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Leibe, B., Matas, J., Sebe, N., and Welling, M. (2016). Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. Computer Vision\u2014ECCV 2016, Springer International Publishing.","DOI":"10.1007\/978-3-319-46454-1"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Casserfelt, K., and Mihailescu, R. (2019, January 11\u201315). An investigation of transfer learning for deep architectures in group activity recognition. Proceedings of the IEEE International Conference on Pervasive Computing and Communications Workshops, PerCom Workshops 2019, Kyoto, Japan.","DOI":"10.1109\/PERCOMW.2019.8730589"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_20","unstructured":"Springenberg, J., Dosovitskiy, A., Brox, T., and Riedmiller, M. (2014). Striving for Simplicity: The All Convolutional Net. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K.Q. (2017, January 21\u201326). Densely Connected Convolutional Networks. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.243"},{"key":"ref_22","unstructured":"Larsson, G., Maire, M., and Shakhnarovich, G. (2017, January 24\u201326). FractalNet: Ultra-Deep Neural Networks without Residuals. Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France."},{"key":"ref_23","unstructured":"Srivastava, R.K., Greff, K., and Schmidhuber, J. (2015). Training Very Deep Networks. Proceedings of the 28th International Conference on Neural Information Processing Systems\u2014Volume 2;, MIT Press."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Sapru, A., and Valente, F. (2012, January 25\u201330). Automatic speaker role labeling in AMI meetings: Recognition of formal and social roles. Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan.","DOI":"10.1109\/ICASSP.2012.6289057"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Zhao, Z., Pan, H., Fan, C., Liu, Y., Li, L., Yang, M., and Cai, D. (2019, January 13\u201317). Abstractive Meeting Summarization via Hierarchical Adaptive Segmental Network Learning. Proceedings of the World Wide Web Conference, WWW \u201919, San Francisco, CA USA.","DOI":"10.1145\/3308558.3313619"},{"key":"ref_26","unstructured":"Simonyan, K., and Zisserman, A. (2015, January 7\u20139). Very Deep Convolutional Networks for Large-Scale Image Recognition. Proceedings of the International Conference on Learning Representations, San Diego, CA, USA."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016, January 27\u201330). Rethinking the Inception Architecture for Computer Vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.308"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Corchado, E., Yin, H., Botti, V., and Fyfe, C. (2006). Audio and Video Feature Fusion for Activity Recognition in Unconstrained Videos. Intelligent Data Engineering and Automated Learning\u2014IDEAL 2006, Springer.","DOI":"10.1007\/11875581"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"ImageNet Large Scale Visual Recognition Challenge","volume":"115","author":"Russakovsky","year":"2015","journal-title":"Int. J. Comput. Vis."}],"container-title":["Future Internet"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-5903\/12\/8\/133\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T09:58:30Z","timestamp":1760176710000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-5903\/12\/8\/133"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,8,9]]},"references-count":29,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2020,8]]}},"alternative-id":["fi12080133"],"URL":"https:\/\/doi.org\/10.3390\/fi12080133","relation":{},"ISSN":["1999-5903"],"issn-type":[{"type":"electronic","value":"1999-5903"}],"subject":[],"published":{"date-parts":[[2020,8,9]]}}}