{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,12]],"date-time":"2026-06-12T16:21:04Z","timestamp":1781281264985,"version":"3.54.1"},"reference-count":70,"publisher":"Springer Science and Business Media LLC","issue":"10","license":[{"start":{"date-parts":[[2024,1,9]],"date-time":"2024-01-09T00:00:00Z","timestamp":1704758400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,1,9]],"date-time":"2024-01-09T00:00:00Z","timestamp":1704758400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100004681","name":"Higher Education Commission, Pakistan","doi-asserted-by":"publisher","award":["PM\/HRDI-UESTPs\/UETs-457 I\/Phase-1\/Batch-VI\/2018"],"award-info":[{"award-number":["PM\/HRDI-UESTPs\/UETs-457 I\/Phase-1\/Batch-VI\/2018"]}],"id":[{"id":"10.13039\/501100004681","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100022855","name":"Office of National Intelligence","doi-asserted-by":"publisher","award":["NIPG-2021-001"],"award-info":[{"award-number":["NIPG-2021-001"]}],"id":[{"id":"10.13039\/501100022855","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001798","name":"Edith Cowan University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100001798","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Comput &amp; Applic"],"published-print":{"date-parts":[[2024,4]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Multimodal Human Action Recognition (MHAR) is an important research topic in computer vision and event recognition fields. 
In this work, we address the problem of MHAR by developing a novel audio-image and video fusion-based deep learning framework that we call Multimodal Audio-Image and Video Action Recognizer (MAiVAR). We extract temporal information using image representations of audio signals and spatial information from the video modality with the help of Convolutional Neural Network (CNN)-based feature extractors and fuse these features to recognize respective action classes. We apply a high-level weights assignment algorithm for improving audio-visual interaction and convergence. This proposed fusion-based framework utilizes the influence of audio and video feature maps and uses them to classify an action. Compared with state-of-the-art audio-visual MHAR techniques, the proposed approach features a simpler yet more accurate and more generalizable architecture, one that performs better with different audio-image representations. The system achieves an accuracy of 87.9% and 79.0% on the UCF51 and Kinetics Sounds datasets, respectively. 
All code and models for this paper will be available at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/tinyurl.com\/4ps2ux6n\">https:\/\/tinyurl.com\/4ps2ux6n<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s00521-023-09186-5","type":"journal-article","created":{"date-parts":[[2024,1,9]],"date-time":"2024-01-09T09:02:09Z","timestamp":1704790929000},"page":"5499-5513","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":43,"title":["Multimodal fusion for audio-image and video action recognition"],"prefix":"10.1007","volume":"36","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9042-5018","authenticated-orcid":false,"given":"Muhammad Bilal","family":"Shaikh","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Douglas","family":"Chai","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Syed Mohammed Shamsul","family":"Islam","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Naveed","family":"Akhtar","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2024,1,9]]},"reference":[{"key":"9186_CR1","doi-asserted-by":"crossref","unstructured":"Arandjelovic R, Zisserman A (2017) Look, listen and learn. In: IEEE, Proceedings of the ICCV, pp 609\u2013617","DOI":"10.1109\/ICCV.2017.73"},{"key":"9186_CR2","doi-asserted-by":"publisher","first-page":"38","DOI":"10.1016\/j.neucom.2017.12.049","volume":"283","author":"A Baldominos","year":"2018","unstructured":"Baldominos A, Saez Y, Isasi P (2018) Evolutionary convolutional neural networks: an application to handwriting recognition. 
Neurocomputing 283:38\u201352","journal-title":"Neurocomputing"},{"issue":"6","key":"9186_CR3","doi-asserted-by":"publisher","first-page":"723","DOI":"10.1038\/s43018-022-00388-9","volume":"3","author":"KM Boehm","year":"2022","unstructured":"Boehm KM, Aherne EA, Ellenson L et al (2022) Multimodal data integration using machine learning improves risk stratification of high-grade serous ovarian cancer. Nat Cancer 3(6):723\u2013733","journal-title":"Nature Cancer"},{"key":"9186_CR4","doi-asserted-by":"crossref","unstructured":"Brousmiche M, Rouat J, Dupont S (2019) Audio-visual fusion and conditioning with neural networks for event recognition. In: IEEE, Proceedings of the machine learning for signal processing (MLSP) Workshop, pp 1\u20136","DOI":"10.1109\/MLSP.2019.8918712"},{"key":"9186_CR5","doi-asserted-by":"publisher","first-page":"52","DOI":"10.1016\/j.inffus.2022.03.001","volume":"85","author":"M Brousmiche","year":"2022","unstructured":"Brousmiche M, Rouat J, Dupont S (2022) Multimodal attentive fusion network for audio-visual event recognition. Inf Fusion 85:52\u201359","journal-title":"Inf Fusion"},{"key":"9186_CR6","doi-asserted-by":"crossref","unstructured":"Deng Z, Lei L, Sun H, et\u00a0al (2017) An enhanced deep convolutional neural network for densely packed objects detection in remote sensing images. In: IEEE, proceedings of the remote sensing with intelligent processing (RSIP) workshops, pp 1\u20134","DOI":"10.1109\/RSIP.2017.7958800"},{"key":"9186_CR7","doi-asserted-by":"crossref","unstructured":"Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: IEEE, Proceedings of The CVPR, pp 11933\u201311941","DOI":"10.1109\/CVPR.2016.213"},{"key":"9186_CR8","doi-asserted-by":"crossref","unstructured":"Feichtenhofer C, et\u00a0al (2019) Slowfast networks for video recognition. 
In: Proceedings of the ICCV, pp 6202\u20136211","DOI":"10.1109\/ICCV.2019.00630"},{"key":"9186_CR9","doi-asserted-by":"publisher","unstructured":"Gao R, Grauman K (2021) VisualVoice: Audio-visual speech separation with cross-modal consistency. IEEE, Proceedings of the CVPR, pp 15495\u201315505, https:\/\/doi.org\/10.1109\/CVPR46437.2021.01524","DOI":"10.1109\/CVPR46437.2021.01524"},{"key":"9186_CR10","doi-asserted-by":"crossref","unstructured":"Gao R, et\u00a0al (2020) Listen to look: action recognition by previewing audio. In: IEEE Proceedings of the CVPR, pp 10457\u201310467","DOI":"10.1109\/CVPR42600.2020.01047"},{"key":"9186_CR11","doi-asserted-by":"crossref","unstructured":"Gao Y, Beijbom O, Zhang N, et\u00a0al (2016) Compact bilinear pooling. In: IEEE, Proceedings of the CVPR, pp 317\u2013326","DOI":"10.1109\/CVPR.2016.41"},{"issue":"1","key":"9186_CR12","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1207\/s15326969eco0501_1","volume":"5","author":"WW Gaver","year":"1993","unstructured":"Gaver WW (1993) What in the world do we hear?: an ecological approach to auditory event perception. Ecol. Psychol. 5(1):1\u201329","journal-title":"Ecol. Psychol."},{"key":"9186_CR13","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-79337-3","author":"DC Gibbon","year":"2008","unstructured":"Gibbon DC, Liu Z (2008) Introduction to video search engines. Springer. https:\/\/doi.org\/10.1007\/978-3-540-79337-3","journal-title":"Springer"},{"key":"9186_CR14","doi-asserted-by":"crossref","unstructured":"Girdhar R, et\u00a0al (2017) ActionVLAD: Learning spatio-temporal aggregation for action classification. In: IEEE, Proceedings of the CVPR, pp 971\u2013980","DOI":"10.1109\/CVPR.2017.337"},{"key":"9186_CR15","unstructured":"Gouyon F, Dixon S, Pampalk E, et\u00a0al (2004) Evaluating rhythmic descriptors for musical genre classification. 
In: Proceedings of the AESIC, p 204"},{"key":"9186_CR16","doi-asserted-by":"crossref","unstructured":"Gu J, et\u00a0al (2021) NTIRE 2021 challenge on perceptual image quality assessment. In: IEEE, Proceedings of the CVPR, pp 677\u2013690","DOI":"10.1109\/CVPRW53098.2021.00077"},{"key":"9186_CR17","doi-asserted-by":"crossref","unstructured":"He D, et\u00a0al (2019) StNet: Local and global spatial-temporal modeling for action recognition. In: Proceedings of the AAAI conference on artificial intelligence, pp 8401\u20138408","DOI":"10.1609\/aaai.v33i01.33018401"},{"key":"9186_CR18","doi-asserted-by":"publisher","unstructured":"He K, Zhang X, Ren S, et\u00a0al (2016) Deep residual learning for image recognition. In: IEEE, Proceedings of the CVPR, pp 770\u2013778, https:\/\/doi.org\/10.1109\/CVPR.2016.90","DOI":"10.1109\/CVPR.2016.90"},{"key":"9186_CR19","doi-asserted-by":"publisher","first-page":"4","DOI":"10.1016\/j.imavis.2017.01.010","volume":"60","author":"S Herath","year":"2017","unstructured":"Herath S, Harandi M, Porikli F (2017) Going deeper into action recognition: a survey. Image Vis Comput 60:4\u201321","journal-title":"Image Vis Comput"},{"key":"9186_CR20","doi-asserted-by":"publisher","unstructured":"Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: PMLR, Proceedings of the ICML, pp 448\u2013456, https:\/\/doi.org\/10.5555\/3045118.3045167","DOI":"10.5555\/3045118.3045167"},{"key":"9186_CR21","doi-asserted-by":"publisher","first-page":"4293","DOI":"10.1007\/s00521-019-04615-w","volume":"32","author":"C Jing","year":"2020","unstructured":"Jing C, Wei P, Sun H et al (2020) Spatiotemporal neural networks for action recognition based on joint loss. 
Neural Comput Appl 32:4293\u20134302","journal-title":"Neural Comput Appl"},{"key":"9186_CR22","doi-asserted-by":"crossref","unstructured":"Jung D, Son JW, Kim SJ (2018) Shot category detection based on object detection using convolutional neural networks. In: IEEE, Proceedings of the ICACT, pp 36\u201339","DOI":"10.23919\/ICACT.2018.8323637"},{"key":"9186_CR23","volume-title":"On-road intelligent vehicles: motion planning for intelligent transportation systems","author":"R Kala","year":"2016","unstructured":"Kala R (2016) On-road intelligent vehicles: motion planning for intelligent transportation systems. Butterworth-Heinemann, Oxford"},{"key":"9186_CR24","unstructured":"Kay W, Carreira J, Simonyan K, et\u00a0al (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950"},{"key":"9186_CR25","doi-asserted-by":"crossref","unstructured":"Kazakos E, et\u00a0al (2019) EPIC-Fusion: audio-visual temporal binding for egocentric action recognition. In: Proceedings of the ICCV, pp 5492\u20135501","DOI":"10.1109\/ICCV.2019.00559"},{"key":"9186_CR26","doi-asserted-by":"publisher","unstructured":"Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. https:\/\/doi.org\/10.48550\/arXiv.1412.6980","DOI":"10.48550\/arXiv.1412.6980"},{"key":"9186_CR27","doi-asserted-by":"publisher","first-page":"118","DOI":"10.1016\/j.neunet.2018.03.019","volume":"103","author":"SR Kulkarni","year":"2018","unstructured":"Kulkarni SR, Rajendran B (2018) Spiking neural networks for handwritten digit recognition-supervised learning and network optimization. Neural Netw 103:118\u2013127","journal-title":"Neural Netw"},{"key":"9186_CR28","doi-asserted-by":"crossref","unstructured":"Kwon H, Kim M, Kwak S, et\u00a0al (2021) Learning self-similarity in space and time as generalized motion for video action recognition. 
In: Proceedings of the ICCV, pp 13065\u201313075","DOI":"10.1109\/ICCV48922.2021.01282"},{"key":"9186_CR29","doi-asserted-by":"crossref","unstructured":"Lei J, Li L, Zhou L, et\u00a0al (2021) Less is more: clipbert for video-and-language learning via sparse sampling. In: IEEE, Proceedings of the CVPR, pp 7331\u20137341","DOI":"10.1109\/CVPR46437.2021.00725"},{"issue":"2","key":"9186_CR30","doi-asserted-by":"publisher","first-page":"49","DOI":"10.1109\/MIC.2020.2971447","volume":"24","author":"Y Li","year":"2020","unstructured":"Li Y, Zou B, Deng S et al (2020) Using feature fusion strategies in continuous authentication on smartphones. IEEE Internet Comput 24(2):49\u201356","journal-title":"IEEE Internet Comput"},{"issue":"2","key":"9186_CR31","first-page":"1","volume":"18","author":"Y Li","year":"2021","unstructured":"Li Y, Tao P, Deng S et al (2021) Deffusion: Cnn-based continuous authentication using deep feature fusion. ACM Trans Sens Netw (TOSN) 18(2):1\u201320","journal-title":"ACM Trans Sens Netw (TOSN)"},{"key":"9186_CR32","doi-asserted-by":"publisher","DOI":"10.1109\/TMC.2022.3186614","author":"Y Li","year":"2022","unstructured":"Li Y, Liu L, Qin H et al (2022) Adaptive deep feature fusion for continuous authentication with data augmentation. IEEE Trans Mobile Comput. https:\/\/doi.org\/10.1109\/TMC.2022.3186614","journal-title":"IEEE Trans Mobile Comput"},{"key":"9186_CR33","doi-asserted-by":"crossref","unstructured":"Li Y, et\u00a0al (2016) VLAD3: encoding dynamics of deep features for action recognition. In: IEEE, Proceedings of the CVPR, pp 1951\u20131960","DOI":"10.1109\/CVPR.2016.215"},{"issue":"11","key":"9186_CR34","doi-asserted-by":"publisher","first-page":"1989","DOI":"10.1109\/TMM.2015.2477035","volume":"17","author":"Z Li","year":"2015","unstructured":"Li Z, Tang J (2015) Weakly supervised deep metric learning for community-contributed image retrieval. 
IEEE Trans Multimed 17(11):1989\u20131999","journal-title":"IEEE Trans Multimed"},{"issue":"9","key":"9186_CR35","doi-asserted-by":"publisher","first-page":"2070","DOI":"10.1109\/TPAMI.2018.2852750","volume":"41","author":"Z Li","year":"2018","unstructured":"Li Z, Tang J, Mei T (2018) Deep collaborative embedding for social image understanding. IEEE Trans Pattern Anal Mach Intell 41(9):2070\u20132083","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"9186_CR36","unstructured":"Lidy T, Rauber A (2005) Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In: Proceedings of the ISMIR, pp 34\u201341"},{"key":"9186_CR37","doi-asserted-by":"crossref","unstructured":"Lin J, Gan C, Han S (2019) TSM: Temporal shift module for efficient video understanding. In: Proceedings of the ICCV, pp 7083\u20137093","DOI":"10.1109\/ICCV.2019.00718"},{"key":"9186_CR38","doi-asserted-by":"crossref","unstructured":"Long X, Gan C, De\u00a0Melo G, et\u00a0al (2018a) Attention clusters: purely attention based local feature integration for video classification. In: IEEE, Proceedings of the CVPR, pp 7834\u20137843","DOI":"10.1109\/CVPR.2018.00817"},{"key":"9186_CR39","doi-asserted-by":"crossref","unstructured":"Long X, Gan C, Melo G, et\u00a0al (2018b) Multimodal keyless attention fusion for video classification. In: No.\u00a01 in Proceedings of the AAAI","DOI":"10.1609\/aaai.v32i1.12319"},{"key":"9186_CR40","doi-asserted-by":"crossref","unstructured":"Long X, De\u00a0Melo G, He D, et\u00a0al (2020) Purely attention based local feature integration for video classification. IEEE TPAMI pp 2140\u20132154","DOI":"10.1109\/TPAMI.2020.3029554"},{"issue":"86","key":"9186_CR41","first-page":"2579","volume":"9","author":"LV der Maaten","year":"2008","unstructured":"der Maaten LV, Hinton G (2008) Visualizing data using t-SNE. 
J Mach Learn Res 9(86):2579\u20132605","journal-title":"J Mach Learn Res"},{"key":"9186_CR42","doi-asserted-by":"crossref","unstructured":"McFee B, Raffel C, Liang D, et\u00a0al (2015) Librosa: audio and music signal analysis in python. In: Proceedings of the python in science conference, pp 18\u201325","DOI":"10.25080\/Majora-7b98e3ed-003"},{"issue":"8","key":"9186_CR43","doi-asserted-by":"publisher","first-page":"1224","DOI":"10.1038\/s41591-020-0931-3","volume":"26","author":"X Mei","year":"2020","unstructured":"Mei X, Lee HC, Ky Diao et al (2020) Artificial intelligence-enabled rapid diagnosis of patients with covid-19. Nat Med 26(8):1224\u20131228","journal-title":"Nat Med"},{"key":"9186_CR44","doi-asserted-by":"publisher","unstructured":"Neimark D, Bar O, Zohar M, et\u00a0al (2021) Video transformer network. In: Proceedings of the ICCV, pp 3163\u20133172, https:\/\/doi.org\/10.1109\/ICCVW54120.2021.00355","DOI":"10.1109\/ICCVW54120.2021.00355"},{"key":"9186_CR45","doi-asserted-by":"publisher","first-page":"120","DOI":"10.1016\/j.isprsjprs.2017.11.021","volume":"145","author":"M Paoletti","year":"2018","unstructured":"Paoletti M, Haut J, Plaza J et al (2018) A new deep convolutional neural network for fast hyperspectral image classification. ISPRS J Photogramm Remote Sens 145:120\u2013147. https:\/\/doi.org\/10.1016\/j.isprsjprs.2017.11.021","journal-title":"ISPRS J Photogramm Remote Sens"},{"key":"9186_CR46","first-page":"8024","volume":"32","author":"A Paszke","year":"2019","unstructured":"Paszke A et al (2019) PyTorch: an imperative style, high-performance deep learning library. 
Adv Neural Inf Process Syst 32:8024\u20138035","journal-title":"Adv Neural Inf Process Syst"},{"key":"9186_CR47","doi-asserted-by":"publisher","first-page":"284","DOI":"10.1016\/j.compeleceng.2016.06.004","volume":"70","author":"CI Patel","year":"2018","unstructured":"Patel CI, Garg S, Zaveri T et al (2018) Human action recognition using fusion of features for unconstrained video sequences. Comput Electr Eng 70:284\u2013301","journal-title":"Comput Electr Eng"},{"key":"9186_CR48","doi-asserted-by":"crossref","unstructured":"Roitberg A, Pollert T, Haurilet M, et\u00a0al (2019) Analysis of deep fusion strategies for multi-modal gesture recognition. In: IEEE, Proceedings of The CVPRW, pp 198\u2013206","DOI":"10.1109\/CVPRW.2019.00029"},{"issue":"3","key":"9186_CR49","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","volume":"115","author":"O Russakovsky","year":"2015","unstructured":"Russakovsky O, Deng J, Su H et al (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211\u2013252","journal-title":"Int J Comput Vis"},{"key":"9186_CR50","doi-asserted-by":"publisher","first-page":"328","DOI":"10.1016\/j.eswa.2018.09.022","volume":"116","author":"Y Seo","year":"2019","unstructured":"Seo Y, Ks Shin (2019) Hierarchical convolutional neural networks for fashion image classification. Expert Syst Appl 116:328\u2013339. https:\/\/doi.org\/10.1016\/j.eswa.2018.09.022","journal-title":"Expert Syst Appl"},{"issue":"12","key":"9186_CR51","doi-asserted-by":"publisher","first-page":"4246","DOI":"10.3390\/s21124246","volume":"21","author":"MB Shaikh","year":"2021","unstructured":"Shaikh MB, Chai D (2021) RGB-D data-based action recognition: a review. Sensors 21(12):4246","journal-title":"Sensors"},{"key":"9186_CR52","doi-asserted-by":"crossref","unstructured":"Shaikh MB, Chai D, Islam SMS, et\u00a0al (2022) Maivar: multimodal audio-image and video action recognizer. 
In: IEEE, Proceedings of the VCIP, pp 1\u20135","DOI":"10.1109\/VCIP56404.2022.10008833"},{"key":"9186_CR53","doi-asserted-by":"publisher","first-page":"377","DOI":"10.1016\/j.procs.2018.05.198","volume":"132","author":"N Sharma","year":"2018","unstructured":"Sharma N, Jain V, Mishra A (2018) An analysis of convolutional neural networks for image classification. Procedia Comput Sci 132:377\u2013384","journal-title":"Procedia Comput Sci"},{"issue":"11","key":"9186_CR54","doi-asserted-by":"publisher","first-page":"9205","DOI":"10.1007\/s00521-022-06947-6","volume":"34","author":"S Slade","year":"2022","unstructured":"Slade S, Zhang L, Yu Y et al (2022) An evolving ensemble model of multi-stream convolutional neural networks for human action recognition in still images. Neural Comput Appl 34(11):9205\u20139231","journal-title":"Neural Comput Appl"},{"key":"9186_CR55","unstructured":"Soomro K, Zamir AR, Shah M (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402"},{"key":"9186_CR56","doi-asserted-by":"crossref","unstructured":"Sudhakaran S, Escalera S, Lanz O (2020) Gate-shift networks for video action recognition. In: IEEE, Proceedings of the CVPR, pp 1102\u20131111","DOI":"10.1109\/CVPR42600.2020.00118"},{"key":"9186_CR57","doi-asserted-by":"publisher","unstructured":"Szegedy C, et\u00a0al (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Proceedings of the AAAI, pp 4278\u20134284, https:\/\/doi.org\/10.5555\/3298023.3298188","DOI":"10.5555\/3298023.3298188"},{"issue":"3","key":"9186_CR58","first-page":"513","volume":"20","author":"N Takahashi","year":"2017","unstructured":"Takahashi N, Gygli M, Van Gool L (2017) AENet: learning deep audio features for video analysis. 
IEEE TMM 20(3):513\u2013524","journal-title":"IEEE TMM"},{"key":"9186_CR59","doi-asserted-by":"publisher","unstructured":"Tan M, Le Q (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In: Proceedings of the ICML, pp 6105\u20136114, https:\/\/doi.org\/10.48550\/arXiv.1905.11946","DOI":"10.48550\/arXiv.1905.11946"},{"key":"9186_CR60","doi-asserted-by":"publisher","first-page":"202","DOI":"10.1016\/j.engappai.2018.09.006","volume":"76","author":"W Tao","year":"2018","unstructured":"Tao W, Leu MC, Yin Z (2018) American sign language alphabet recognition using convolutional neural networks with multiview augmentation and inference fusion. Eng Appl Artif Intell 76:202\u2013213","journal-title":"Eng Appl Artif Intell"},{"key":"9186_CR61","doi-asserted-by":"crossref","unstructured":"Tian Y, Shi J, Li B, et\u00a0al (2018) Audio-visual event localization in unconstrained videos. In: Proceedings of the ECCV, pp 247\u2013263","DOI":"10.1007\/978-3-030-01216-8_16"},{"key":"9186_CR62","doi-asserted-by":"publisher","unstructured":"Tran D, Bourdev L, Fergus R, et\u00a0al (2015) Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the ICCV, pp 4489\u20134497, https:\/\/doi.org\/10.1109\/ICCV.2015.510","DOI":"10.1109\/ICCV.2015.510"},{"key":"9186_CR63","doi-asserted-by":"publisher","first-page":"12295","DOI":"10.1007\/s00521-019-04408-1","volume":"32","author":"B Vandersmissen","year":"2020","unstructured":"Vandersmissen B, Knudde N, Jalalvand A et al (2020) Indoor human activity recognition using high-dimensional sensors and deep neural networks. Neural Comput Appl 32:12295\u201312309","journal-title":"Neural Comput Appl"},{"key":"9186_CR64","doi-asserted-by":"publisher","unstructured":"Vinyes\u00a0Mora S, Knottenbelt WJ (2017) Deep learning for domain-specific action recognition in tennis. 
In: IEEE, Proceedings of the CVPR Workshops, pp 114\u2013122, https:\/\/doi.org\/10.1109\/CVPRW.2017.27","DOI":"10.1109\/CVPRW.2017.27"},{"key":"9186_CR65","doi-asserted-by":"publisher","first-page":"274","DOI":"10.1016\/j.compeleceng.2018.07.042","volume":"72","author":"S Wan","year":"2018","unstructured":"Wan S, Liang Y, Zhang Y (2018) Deep convolutional neural networks for diabetic retinopathy detection by image classification. Comput Electr Eng 72:274\u2013282","journal-title":"Comput Electr Eng"},{"key":"9186_CR66","doi-asserted-by":"publisher","unstructured":"Wang L, et\u00a0al (2016) Temporal segment networks: towards good practices for deep action recognition. In: Proceedings of the ECCV, pp 20\u201336, https:\/\/doi.org\/10.1007\/978-3-319-46484-8_2","DOI":"10.1007\/978-3-319-46484-8_2"},{"issue":"3s","key":"9186_CR67","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3468872","volume":"17","author":"C Yan","year":"2021","unstructured":"Yan C, Teng T, Liu Y et al (2021) Precise no-reference image quality evaluation based on distortion identification. ACM Trans Multimed Comput Commun Appl(TOMM) 17(3s):1\u201321","journal-title":"ACM Trans Multimed Comput Commun Appl(TOMM)"},{"issue":"3","key":"9186_CR68","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1371\/journal.pone.0265115","volume":"17","author":"G Yang","year":"2022","unstructured":"Yang G et al (2022) STA-TSN: spatial-temporal attention temporal segment network for action recognition in video. PloS one 17(3):1\u201319","journal-title":"PloS one"},{"issue":"4","key":"9186_CR69","doi-asserted-by":"publisher","first-page":"1085","DOI":"10.3390\/s20041085","volume":"20","author":"K Zhang","year":"2020","unstructured":"Zhang K, Li D, Huang J et al (2020) Automated video behavior recognition of pigs using two-stream convolutional networks. 
Sensors 20(4):1085","journal-title":"Sensors"},{"key":"9186_CR70","doi-asserted-by":"crossref","unstructured":"Zhou B, et\u00a0al (2018) Temporal relational reasoning in videos. In: Proceedings of the ECCV, pp 803\u2013818","DOI":"10.1007\/978-3-030-01246-5_49"}],"container-title":["Neural Computing and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-023-09186-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00521-023-09186-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-023-09186-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,3,7]],"date-time":"2024-03-07T20:13:18Z","timestamp":1709842398000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00521-023-09186-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,9]]},"references-count":70,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2024,4]]}},"alternative-id":["9186"],"URL":"https:\/\/doi.org\/10.1007\/s00521-023-09186-5","relation":{},"ISSN":["0941-0643","1433-3058"],"issn-type":[{"value":"0941-0643","type":"print"},{"value":"1433-3058","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,9]]},"assertion":[{"value":"13 April 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 October 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 January 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article 
History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}