{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,12]],"date-time":"2026-05-12T16:46:01Z","timestamp":1778604361821,"version":"3.51.4"},"reference-count":99,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T00:00:00Z","timestamp":1767830400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"FCT.IP","award":["UID\/04516\/NOVA"],"award-info":[{"award-number":["UID\/04516\/NOVA"]}]},{"name":"AI.EVENT: Monitor Live Audience with AI","award":["ALGARVE-FEDER-01180500"],"award-info":[{"award-number":["ALGARVE-FEDER-01180500"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MTI"],"abstract":"<jats:p>The accurate measurement of audience engagement in real-world live events remains a significant challenge, with the majority of existing research confined to controlled environments like classrooms. This paper presents a comprehensive survey of Computer Vision AI-driven methods for real-time audience engagement monitoring and proposes a novel, holistic architecture to address this gap, with this architecture being the main contribution of the paper. The paper identifies and defines five core constructs essential for a robust analysis: Attention, Emotion and Sentiment, Body Language, Scene Dynamics, and Behaviours. Through a selective review of state-of-the-art techniques for each construct, the necessity of a multimodal approach that surpasses the limitations of isolated indicators is highlighted. The work synthesises a fragmented field into a unified taxonomy and introduces a modular architecture that integrates these constructs with practical, business-oriented metrics such as Commitment, Conversion, and Retention. 
Finally, by integrating cognitive, affective, and behavioural signals, this work provides a roadmap for developing operational systems that can transform live event experience and management through data-driven, real-time analytics.<\/jats:p>","DOI":"10.3390\/mti10010008","type":"journal-article","created":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T10:10:50Z","timestamp":1767867050000},"page":"8","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["From Cues to Engagement: A Comprehensive Survey and Holistic Architecture for Computer Vision-Based Audience Analysis in Live Events"],"prefix":"10.3390","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-8727-4254","authenticated-orcid":false,"given":"Marco","family":"Lemos","sequence":"first","affiliation":[{"name":"NOVA LINCS and ISE, Universidade do Algarve, 8005-139 Faro, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4803-7964","authenticated-orcid":false,"given":"Pedro J. S.","family":"Cardoso","sequence":"additional","affiliation":[{"name":"NOVA LINCS and ISE, Universidade do Algarve, 8005-139 Faro, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3562-6025","authenticated-orcid":false,"given":"Jo\u00e3o M. F.","family":"Rodrigues","sequence":"additional","affiliation":[{"name":"NOVA LINCS and ISE, Universidade do Algarve, 8005-139 Faro, Portugal"}]}],"member":"1968","published-online":{"date-parts":[[2026,1,8]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"318","DOI":"10.1016\/j.inffus.2020.07.008","article-title":"Revisiting crowd behaviour analysis through deep learning: Taxonomy, anomaly detection, crowd emotions, datasets, opportunities and prospects","volume":"64","author":"Hupont","year":"2020","journal-title":"Inf. 
Fusion"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"380","DOI":"10.1080\/09540091.2020.1772723","article-title":"Towards the cognitive and psychological perspectives of crowd behaviour: A vision-based analysis","volume":"33","author":"Varghese","year":"2021","journal-title":"Connect. Sci."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Lemos, M., Cardoso, P.J.S., and Rodrigues, J.M.F. (2025, January 7\u20139). Microscopic Binary Engagement Model. Proceedings of the Computational Science\u2014ICCS 2025, Singapore.","DOI":"10.1007\/978-3-031-97632-2_9"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1398","DOI":"10.1109\/JPROC.2023.3309560","article-title":"Engagement Detection and Its Applications in Learning: A Tutorial and Selective Review","volume":"111","author":"Booth","year":"2023","journal-title":"Proc. IEEE"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"4069","DOI":"10.1007\/s10639-022-11370-4","article-title":"Facial emotion recognition of deaf and hard-of-hearing students for engagement detection using deep learning","volume":"28","author":"Lasri","year":"2023","journal-title":"Educ. Inf. Technol."},{"key":"ref_6","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very Deep Convolutional Networks For Large-Scale Image Recognition. arXiv, arXiv:1409.1556."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"136063","DOI":"10.1109\/ACCESS.2023.3337435","article-title":"ICAPD Framework and simAM-YOLOv8n for Student Cognitive Engagement Detection in Classroom","volume":"11","author":"Xu","year":"2023","journal-title":"IEEE Access"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"1012","DOI":"10.1109\/TAFFC.2021.3127692","article-title":"Multimodal Engagement Analysis From Facial Videos in the Classroom","volume":"14","author":"Sumer","year":"2023","journal-title":"IEEE Trans. Affect. 
Comput."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"3553","DOI":"10.1007\/s11760-024-03020-8","article-title":"A lightweight facial expression recognition model for automated engagement detection","volume":"18","author":"Zhao","year":"2024","journal-title":"Signal Image Video Process."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Vrochidis, A., Dimitriou, N., Krinidis, S., Panagiotidis, S., Parcharidis, S., and Tzovaras, D. (2024). A Deep Learning Framework for Monitoring Audience Engagement in Online Video Events. Int. J. Comput. Intell. Syst., 17.","DOI":"10.1007\/s44196-024-00512-w"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Ruiz, N., Chong, E., and Rehg, J.M. (2018, January 18\u201322). Fine-grained head pose estimation without keypoints. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPRW.2018.00281"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1007\/s11263-020-01378-z","article-title":"JAA-Net: Joint facial action unit detection and face alignment via adaptive attention","volume":"129","author":"Shao","year":"2021","journal-title":"Int. J. Comput. Vis."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Deng, J., Guo, J., Zhou, Y., Yu, J., Kotsia, I., and Zafeiriou, S. (2019). Retinaface: Single-stage dense face localisation in the wild. arXiv.","DOI":"10.1109\/CVPR42600.2020.00525"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K.Q. (2017, January 21\u201326). Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.243"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"habib Albohamood, A., Shaker Alqattan, M., and Padua Vizcarra, C. (2025, January 13\u201314). 
Real-time Student Engagement Monitoring in Classroom Environments using Machine Learning and Computer Vision. Proceedings of the 2025 4th International Conference on Computing and Information Technology (ICCIT), Tabuk, Saudi Arabia.","DOI":"10.1109\/ICCIT63348.2025.10989406"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Qarbal, I., Sael, N., and Ouahabi, S. (2024, January 1\u20133). Student Engagement Detection Based on Head Pose Estimation and Facial Expressions Using Transfer Learning. Proceedings of the International Conference on Smart City Applications, Tangier, Morocco.","DOI":"10.1007\/978-3-031-88653-9_25"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Teotia, J., Zhang, X., Mao, R., and Cambria, E. (2024, January 9\u201312). Evaluating Vision Language Models in Detecting Learning Engagement. Proceedings of the 2024 IEEE International Conference on Data Mining Workshops (ICDMW), Abu Dhabi, United Arab Emirates.","DOI":"10.1109\/ICDMW65004.2024.00069"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1641","DOI":"10.1007\/s12369-024-01146-w","article-title":"From the definition to the automatic assessment of engagement in human\u2013robot interaction: A systematic review","volume":"16","author":"Sorrentino","year":"2024","journal-title":"Int. J. Soc. Robot."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"884","DOI":"10.1080\/01691864.2025.2526668","article-title":"Exploring task and social engagement in companion social robots: A comparative analysis of feedback types","volume":"39","author":"Ravandi","year":"2025","journal-title":"Adv. 
Robot."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"140519","DOI":"10.1109\/ACCESS.2025.3596885","article-title":"Students Engagement Detection Based on Computer Vision: A Systematic Literature Review","volume":"13","author":"Qarbal","year":"2025","journal-title":"IEEE Access"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Bei, Y., Guo, S., Gao, K., and Feng, Z. (2025). Behavior capture guided engagement recognition. Pattern Recognit., 164.","DOI":"10.1016\/j.patcog.2025.111534"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Wang, J., Yuan, S., Lu, T., Zhao, H., and Zhao, Y. (2025). Video-based real-time monitoring of engagement in E-learning using MediaPipe through multi-feature analysis. Expert Syst. Appl., 288.","DOI":"10.1016\/j.eswa.2025.128239"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Lu, W., Yang, Y., Song, R., Chen, Y., Wang, T., and Bian, C. (2025). A Video Dataset for Classroom Group Engagement Recognition. Sci. Data, 12.","DOI":"10.1038\/s41597-025-04987-w"},{"key":"ref_24","unstructured":"(2025, August 28). HAPPEI Dataset. Available online: https:\/\/users.cecs.anu.edu.au\/~few_group\/Group.htm."},{"key":"ref_25","unstructured":"(2025, August 28). MED: Multimodal Event Dataset. Available online: https:\/\/github.com\/hosseinm\/med?tab=readme-ov-file."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"537","DOI":"10.1109\/TAFFC.2024.3507289","article-title":"Affective Computing Databases: In-Depth Analysis of Systematic Reviews and Surveys","volume":"16","author":"Vaz","year":"2025","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_27","unstructured":"(2025, August 28). UCSD Anomaly Detection Dataset. Available online: http:\/\/www.svcl.ucsd.edu\/projects\/anomaly\/dataset.htm."},{"key":"ref_28","unstructured":"(2025, August 28). CUHK Avenue Dataset. 
Available online: https:\/\/www.cse.cuhk.edu.hk\/leojia\/projects\/detectabnormal\/dataset.html."},{"key":"ref_29","unstructured":"(2025, August 28). UMN Crowd Dataset. Available online: https:\/\/mha.cs.umn.edu\/proj_events.shtml."},{"key":"ref_30","unstructured":"(2025, August 28). ShanghaiTech Campus Dataset. Available online: https:\/\/svip-lab.github.io\/dataset\/campus_dataset.html."},{"key":"ref_31","unstructured":"(2025, August 28). UT-Interaction Dataset. Available online: https:\/\/cvrc.ece.utexas.edu\/SDHA2010\/Human_Interaction.html."},{"key":"ref_32","unstructured":"(2025, August 28). UCF-Crime Dataset. Available online: https:\/\/www.crcv.ucf.edu\/research\/real-world-anomaly-detection-in-surveillance-videos\/."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"1077","DOI":"10.1016\/j.neucom.2015.10.022","article-title":"A new descriptor of gradients self-similarity for smile detection in unconstrained scenarios","volume":"174","author":"Gao","year":"2016","journal-title":"Neurocomputing"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Lingenfelter, B., Davis, S.R., and Hand, E.M. (2022, January 3\u20135). A quantitative analysis of labeling issues in the celeba dataset. Proceedings of the International Symposium on Visual Computing, San Diego, CA, USA.","DOI":"10.1007\/978-3-031-20713-6_10"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"96053","DOI":"10.1109\/ACCESS.2022.3205018","article-title":"Deep learning-based approach for continuous affect prediction from facial expression images in valence-arousal space","volume":"10","author":"Hwooi","year":"2022","journal-title":"IEEE Access"},{"key":"ref_36","unstructured":"Savchenko, A.V. (2022). Frame-level prediction of facial expressions, valence, arousal and action units for mobile devices. arXiv."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Cao, Q., Shen, L., Xie, W., Parkhi, O.M., and Zisserman, A. (2018, January 15\u201319). 
Vggface2: A dataset for recognising faces across pose and age. Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi\u2019an, China.","DOI":"10.1109\/FG.2018.00020"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Nguyen, H.H., Huynh, V.T., and Kim, S.H. (2022). An ensemble approach for facial expression analysis in video. arXiv.","DOI":"10.1109\/CVPRW56347.2022.00281"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., and Doll\u00e1r, P. (2020, January 13\u201319). Designing network design spaces. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01044"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Cho, K. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv, arXiv:1409.1259.","DOI":"10.3115\/v1\/W14-4012"},{"key":"ref_41","unstructured":"Devlin, J. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"11365","DOI":"10.1007\/s11042-022-13558-9","article-title":"Facial emotion recognition based real-time learner engagement detection system in online learning context using deep learning models","volume":"82","author":"Gupta","year":"2023","journal-title":"Multimed. Tools Appl."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Hai, L., and Guo, H. (2020, January 23\u201325). Face detection with improved face r-CNN training method. Proceedings of the 3rd International Conference on Control and Computer Vision, Macau, China.","DOI":"10.1145\/3425577.3425582"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Bruin, J., Stuldreher, I.V., Perone, P., Hogenelst, K., Naber, M., Kamphuis, W., and Brouwer, A.M. (2024). 
Detection of arousal and valence from facial expressions and physiological responses evoked by different types of stressors. Front. Neuroergonomics, 5.","DOI":"10.3389\/fnrgo.2024.1338243"},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"185","DOI":"10.1142\/S0219720005001004","article-title":"Minimum redundancy feature selection from microarray gene expression data","volume":"3","author":"Ding","year":"2005","journal-title":"J. Bioinform. Comput. Biol."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Berlincioni, L., Cultrera, L., Becattini, F., and Bimbo, A.D. (2024). Neuromorphic valence and arousal estimation. J. Ambient Intell. Humaniz. Comput., 1\u201311.","DOI":"10.1007\/s12652-024-04885-w"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1016\/j.imavis.2017.02.001","article-title":"AFEW-VA database for valence and arousal estimation in-the-wild","volume":"65","author":"Kossaifi","year":"2017","journal-title":"Image Vis. Comput."},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Hu, Y., Liu, S.C., and Delbruck, T. (2021, January 20\u201325). v2e: From video frames to realistic DVS events. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPRW53098.2021.00144"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Manojkumar, K., and Helen, L.S. (2025). Monitoring the crowd emotion using valence and arousal of crowd based on prominent features of crowd. Signal Image Video Process., 19.","DOI":"10.1007\/s11760-025-04062-2"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Hempel, T., Abdelrahman, A.A., and Al-Hamadi, A. (2022, January 16\u201319). 6d rotation representation for unconstrained head pose estimation. 
Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France.","DOI":"10.1109\/ICIP46576.2022.9897219"},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"2377","DOI":"10.1109\/TIP.2024.3378180","article-title":"Toward Robust and Unconstrained Full Range of Rotation Head Pose Estimation","volume":"33","author":"Hempel","year":"2024","journal-title":"IEEE Trans. Image Process."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Reich, A., and Wuensche, H.J. (2021, January 1\u20134). Monocular 3d multi-object tracking with an ekf approach for long-term stable tracks. Proceedings of the 2021 IEEE 24th International Conference on Information Fusion (FUSION), Sun City, South Africa.","DOI":"10.23919\/FUSION49465.2021.9626850"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Zhou, X., Koltun, V., and Kr\u00e4henb\u00fchl, P. (2020, January 23\u201328). Tracking objects as points. Proceedings of the European Conference on Computer Vision, Online.","DOI":"10.1007\/978-3-030-58548-8_28"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Hossain, M.R., Rahman, M.M., Karim, M.R., Al Amin, M.J., and Bepery, C. (2022, January 26\u201329). Determination of 3D Coordinates of Objects from Image with Deep Learning Model. Proceedings of the 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA.","DOI":"10.1109\/CCWC54503.2022.9720795"},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., and Zitnick, C.L. (2014, January 6\u201312). Microsoft coco: Common objects in context. Proceedings of the Computer Vision\u2014ECCV 2014: 13th European Conference, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_56","first-page":"21875","article-title":"Depth anything v2","volume":"37","author":"Yang","year":"2024","journal-title":"Adv. Neural Inf. 
Process. Syst."},{"key":"ref_57","unstructured":"Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., and El-Nouby, A. (2024). DINOv2: Learning Robust Visual Features without Supervision. arXiv."},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Sundararaman, R., De Almeida Braga, C., Marchand, E., and Pettre, J. (2021, January 20\u201325). Tracking pedestrian heads in dense crowd. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00386"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Doll\u00e1r, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017, January 21\u201326). Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.106"},{"key":"ref_61","doi-asserted-by":"crossref","unstructured":"Zhu, C., Tao, R., Luu, K., and Savvides, M. (2018, January 18\u201323). Seeing small faces from robust anchor\u2019s perspective. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00538"},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Tang, X., Du, D.K., He, Z., and Liu, J. (2018, January 8\u201314). Pyramidbox: A context-assisted single shot face detector. 
Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01240-3_49"},{"key":"ref_63","doi-asserted-by":"crossref","unstructured":"Deb, T., Rahmun, M., Bijoy, S.A., Raha, M.H., and Khan, M.A. (2021, January 18\u201322). UUCT-HyMP: Towards tracking dispersed crowd groups from UAVs. Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China.","DOI":"10.1109\/IJCNN52387.2021.9533600"},{"key":"ref_64","doi-asserted-by":"crossref","unstructured":"Shah, S., Dey, D., Lovett, C., and Kapoor, A. (2018). Airsim: High-fidelity visual and physical simulation for autonomous vehicles. Field and Service Robotics: Results of the 11th International Conference, Springer.","DOI":"10.1007\/978-3-319-67361-5_40"},{"key":"ref_65","unstructured":"Bhat, G., Danelljan, M., Gool, L.V., and Timofte, R. (November, January 27). Learning discriminative model prediction for tracking. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_66","unstructured":"Kipf, T.N., and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv."},{"key":"ref_67","doi-asserted-by":"crossref","unstructured":"Yeh, K.H., Hsu, I.C., Chou, Y.Z., Chen, G.Y., and Tsai, Y.S. (2022, January 3\u20136). An aerial crowd-flow analyzing system for drone under YOLOv5 and StrongSort. Proceedings of the 2022 International Automatic Control Conference (CACS), Kaohsiung, Taiwan.","DOI":"10.1109\/CACS55319.2022.9969785"},{"key":"ref_68","unstructured":"Jocher, G., Stoken, A., Borovec, J., Changyu, L., Hogan, A., Diaconu, L., Ingham, F., Poznanski, J., Fang, J., and Yu, L. (2020). ultralytics\/yolov5: V3. 1-bug fixes and performance improvements. 
Zenodo."},{"key":"ref_69","doi-asserted-by":"crossref","first-page":"8725","DOI":"10.1109\/TMM.2023.3240881","article-title":"Strongsort: Make deepsort great again","volume":"25","author":"Du","year":"2023","journal-title":"IEEE Trans. Multimed."},{"key":"ref_70","unstructured":"Zhou, K., Yang, Y., Cavallaro, A., and Xiang, T. (November, January 27). Omni-scale feature learning for person re-identification. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_71","unstructured":"(2025, August 28). Real-Time Multi-Camera Multi-Object Tracker Using YOLOv5 and StrongSORT with OSNet. Available online: https:\/\/github.com\/mikel-brostrom\/Yolov5_StrongSORT_OSNet."},{"key":"ref_72","doi-asserted-by":"crossref","first-page":"6032","DOI":"10.1109\/TIP.2022.3205210","article-title":"Video crowd localization with multifocus gaussian neighborhood attention and a large-scale benchmark","volume":"31","author":"Li","year":"2022","journal-title":"IEEE Trans. Image Process."},{"key":"ref_73","doi-asserted-by":"crossref","unstructured":"Ekanayake, E., Lei, Y., and Li, C. (2022). Crowd density level estimation and anomaly detection using multicolumn multistage bilinear convolution attention network (MCMS-BCNN-Attention). Appl. Sci., 13.","DOI":"10.3390\/app13010248"},{"key":"ref_74","unstructured":"Tan, M., and Le, Q. (2021, January 18\u201324). Efficientnetv2: Smaller models and faster training. Proceedings of the International Conference on Machine Learning, Virtual."},{"key":"ref_75","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Zhou, D., Chen, S., Gao, S., and Ma, Y. (2016, January 26\u201330). Single-image crowd counting via multi-column convolutional neural network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.70"},{"key":"ref_76","doi-asserted-by":"crossref","unstructured":"Ferryman, J., and Shahrokni, A. 
(2009, January 7\u20139). Pets2009: Dataset and challenge. Proceedings of the 2009 Twelfth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, Snowbird, UT, USA.","DOI":"10.1109\/PETS-WINTER.2009.5399556"},{"key":"ref_77","doi-asserted-by":"crossref","unstructured":"Mehran, R., Oyama, A., and Shah, M. (2009, January 20\u201325). Abnormal crowd behavior detection using social force model. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPRW.2009.5206641"},{"key":"ref_78","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1016\/j.neucom.2022.04.106","article-title":"Crowd counting method via a dynamic-refined density map network","volume":"497","author":"Liu","year":"2022","journal-title":"Neurocomputing"},{"key":"ref_79","doi-asserted-by":"crossref","unstructured":"Lin, H., Ma, Z., Ji, R., Wang, Y., and Hong, X. (2022, January 18\u201324). Boosting crowd counting via multifaceted attention. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01901"},{"key":"ref_80","doi-asserted-by":"crossref","first-page":"306","DOI":"10.1016\/j.ins.2022.01.046","article-title":"Hybrid attention network based on progressive embedding scale-context for crowd counting","volume":"591","author":"Wang","year":"2022","journal-title":"Inf. Sci."},{"key":"ref_81","doi-asserted-by":"crossref","unstructured":"Qi, Z., Zhou, M., Zhu, G., and Xue, Y. (2022). Multiple pedestrian tracking in dense crowds combined with head tracking. Appl. Sci., 13.","DOI":"10.3390\/app13010440"},{"key":"ref_82","doi-asserted-by":"crossref","unstructured":"Pan, Z. (2023, January 20\u201321). Multi-Scale Occluded Pedestrian Detection Based on Deep Learning. 
Proceedings of the 2023 International Conference on Evolutionary Algorithms and Soft Computing Techniques (EASCT), Bengaluru, India.","DOI":"10.1109\/EASCT59475.2023.10393467"},{"key":"ref_83","doi-asserted-by":"crossref","first-page":"557","DOI":"10.1007\/s00371-021-02356-3","article-title":"Motion pattern-based crowd scene classification using histogram of angular deviations of trajectories","volume":"39","author":"Pai","year":"2023","journal-title":"Vis. Comput."},{"key":"ref_84","doi-asserted-by":"crossref","unstructured":"Zhou, B., Tang, X., and Wang, X. (2013, January 23\u201328). Measuring crowd collectiveness. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA.","DOI":"10.1109\/CVPR.2013.392"},{"key":"ref_85","doi-asserted-by":"crossref","unstructured":"Zhang, X., Sun, Y., Li, Q., Li, X., and Shi, X. (2023). Crowd density estimation and mapping method based on surveillance video and GIS. ISPRS Int. J. Geo-Inf., 12.","DOI":"10.3390\/ijgi12020056"},{"key":"ref_86","doi-asserted-by":"crossref","unstructured":"Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. (2018, January 8\u201314). Encoder-decoder with atrous separable convolution for semantic image segmentation. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_49"},{"key":"ref_87","doi-asserted-by":"crossref","first-page":"11089","DOI":"10.1007\/s13369-021-05634-3","article-title":"A back-propagation neural network model based on genetic algorithm for prediction of build-up rate in drilling process","volume":"47","author":"Qiu","year":"2022","journal-title":"Arab. J. Sci. Eng."},{"key":"ref_88","unstructured":"Zhang, Y., Chen, H., Lai, Z., Zhang, Z., and Yuan, D. (December, January 18). Handling heavy occlusion in dense crowd tracking by focusing on the heads. 
Proceedings of the Australasian Joint Conference on Artificial Intelligence, Brisbane, Australia."},{"key":"ref_89","unstructured":"Ge, Z., Liu, S., Wang, F., Li, Z., and Sun, J. (2021). Yolox: Exceeding yolo series in 2021. arXiv."},{"key":"ref_90","unstructured":"Shao, S., Zhao, Z., Li, B., Xiao, T., Yu, G., Zhang, X., and Sun, J. (2018). CrowdHuman: A Benchmark for Detecting Human in a Crowd. arXiv."},{"key":"ref_91","unstructured":"Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., and Leal-Taix\u00e9, L. (2020). Mot20: A benchmark for multi object tracking in crowded scenes. arXiv."},{"key":"ref_92","doi-asserted-by":"crossref","unstructured":"Mei, L., Yu, M., Jia, L., and Fu, M. (2024). Crowd Density Estimation via Global Crowd Collectiveness Metric. Drones, 8.","DOI":"10.3390\/drones8110616"},{"key":"ref_93","doi-asserted-by":"crossref","unstructured":"Ranasinghe, Y., Nair, N.G., Bandara, W.G.C., and Patel, V.M. (2024, January 16\u201322). CrowdDiff: Multi-hypothesis crowd density estimation using diffusion models. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.01217"},{"key":"ref_94","first-page":"6840","article-title":"Denoising diffusion probabilistic models","volume":"33","author":"Ho","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_95","unstructured":"Nichol, A.Q., and Dhariwal, P. (2021, January 18\u201324). Improved denoising diffusion probabilistic models. Proceedings of the International Conference on Machine Learning, Virtual."},{"key":"ref_96","doi-asserted-by":"crossref","unstructured":"Akpulat, M., and Ekinci, M. (2025). Anomaly detection in crowd scenes via cross trajectories. Appl. 
Intell., 55.","DOI":"10.1007\/s10489-025-06338-z"},{"key":"ref_97","doi-asserted-by":"crossref","first-page":"65","DOI":"10.5530\/irc.1.2.10","article-title":"Detection and Tracking of People in a Dense Crowd through Deep Learning Approach-A Systematic Literature Review","volume":"1","author":"Badauraudine","year":"2025","journal-title":"Inf. Res. Commun."},{"key":"ref_98","doi-asserted-by":"crossref","first-page":"192871","DOI":"10.1109\/ACCESS.2025.3630563","article-title":"Affective Computing Emotional Body Gesture Recognition: Evolution and the Cream of the Crop","volume":"13","author":"Martins","year":"2025","journal-title":"IEEE Access"},{"key":"ref_99","doi-asserted-by":"crossref","unstructured":"Rodrigues, J.M.F., Cardoso, P.J.S., Lemos, M., Cherniavska, O., and Bica, P. (2025, January 13\u201315). Engagement Monitorization in Crowded Environments: A Conceptual Framework. Proceedings of the 11th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-Exclusion, New York, NY, USA. DSAI \u201924.","DOI":"10.1145\/3696593.3696632"}],"container-title":["Multimodal Technologies and Interaction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2414-4088\/10\/1\/8\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T10:36:18Z","timestamp":1767868578000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2414-4088\/10\/1\/8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,8]]},"references-count":99,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2026,1]]}},"alternative-id":["mti10010008"],"URL":"https:\/\/doi.org\/10.3390\/mti10010008","relation":{},"ISSN":["2414-4088"],"issn-type":[{"value":"2414-4088","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,8]]}}}