{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,28]],"date-time":"2025-10-28T03:17:35Z","timestamp":1761621455255,"version":"3.41.0"},"reference-count":170,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2019,5,31]],"date-time":"2019-05-31T00:00:00Z","timestamp":1559260800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2019,5,31]]},"abstract":"<jats:p>Event recognition is one of the areas in multimedia that is attracting great attention of researchers. Being applicable in a wide range of applications, from personal to collective events, a number of interesting solutions for event recognition using multimedia information sources have been proposed. On the other hand, following their immense success in classification, object recognition, and detection, deep learning has been shown to perform well in event recognition tasks also. Thus, a large portion of the literature on event analysis relies nowadays on deep learning architectures. In this article, we provide an extensive overview of the existing literature in this field, analyzing how deep features and deep learning architectures have changed the performance of event recognition frameworks. The literature on event-based analysis of multimedia contents can be categorized into four groups, namely (i) event recognition in single images; (ii) event recognition in personal photo collections; (iii) event recognition in videos; and (iv) event recognition in audio recordings. In this article, we extensively review different deep-learning-based frameworks for event recognition in these four domains. 
Furthermore, we also review some benchmark datasets made available to the scientific community to validate novel event recognition pipelines. In the final part of the manuscript, we also provide a detailed discussion on basic insights gathered from the literature review, and identify future trends and challenges.<\/jats:p>","DOI":"10.1145\/3306240","type":"journal-article","created":{"date-parts":[[2019,6,6]],"date-time":"2019-06-06T12:28:42Z","timestamp":1559824122000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":40,"title":["How Deep Features Have Improved Event Recognition in Multimedia"],"prefix":"10.1145","volume":"15","author":[{"given":"Kashif","family":"Ahmad","sequence":"first","affiliation":[{"name":"University of Trento, Trento, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nicola","family":"Conci","sequence":"additional","affiliation":[{"name":"University of Trento, Trento, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2019,6,5]]},"reference":[{"unstructured":"Sharath Adavanne Giambattista Parascandolo Pasi Pertil\u00e4 Toni Heittola and Tuomas Virtanen. 2017. Sound event detection in multichannel audio using spatial and harmonic features. arXiv preprint arXiv:1706.02293 (2017).","key":"e_1_2_1_1_1"},{"doi-asserted-by":"crossref","unstructured":"Sharath Adavanne Archontis Politis and Tuomas Virtanen. 2018. 
Multichannel sound event detection using 3D convolutional neural networks for learning inter-channel features. arXiv preprint arXiv:1801.09522 (2018).","key":"e_1_2_1_2_1","DOI":"10.1109\/IJCNN.2018.8489542"},{"doi-asserted-by":"publisher","key":"e_1_2_1_3_1","DOI":"10.1145\/2910017.2910624"},{"doi-asserted-by":"publisher","key":"e_1_2_1_4_1","DOI":"10.1117\/1.JEI.26.6.060502"},{"doi-asserted-by":"publisher","key":"e_1_2_1_5_1","DOI":"10.1016\/j.image.2017.09.009"},{"doi-asserted-by":"publisher","key":"e_1_2_1_6_1","DOI":"10.1109\/GlobalSIP.2016.7906036"},{"doi-asserted-by":"publisher","key":"e_1_2_1_7_1","DOI":"10.1109\/ICIP.2017.8296810"},{"doi-asserted-by":"publisher","key":"e_1_2_1_8_1","DOI":"10.2352\/ISSN.2470-1173.2018.2.VIPC-173"},{"doi-asserted-by":"publisher","key":"e_1_2_1_9_1","DOI":"10.1145\/3199668"},{"doi-asserted-by":"publisher","key":"e_1_2_1_10_1","DOI":"10.1007\/s11042-018-5982-9"},{"volume-title":"Proceedings of the MediaEval 2017 Workshop","author":"Ahmad Kashif","key":"e_1_2_1_11_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_12_1","DOI":"10.1109\/IVMSPW.2018.8448670"},{"volume-title":"Proceedings of the MediaEval 2017 Workshop (Sept. 13--15","year":"2017","author":"Ahmad Sheharyar","key":"e_1_2_1_13_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_14_1","DOI":"10.1109\/IGARSS.2016.7730352"},{"doi-asserted-by":"crossref","unstructured":"Nazia Attari Ferda Ofli Mohammad Awad Ji Lucas and Sanjay Chawla. 2016. Nazr-CNN: Fine-grained classification of UAV imagery for damage assessment. 
arXiv preprint arXiv:1611.06474 (2016).","key":"e_1_2_1_15_1","DOI":"10.1109\/DSAA.2017.72"},{"volume-title":"Proceedings of the Working Notes Proceeding MediaEval Workshop","year":"2017","author":"Avgerinakis Konstantinos","key":"e_1_2_1_16_1"},{"volume-title":"Soundnet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems. 892--900.","year":"2016","author":"Aytar Yusuf","key":"e_1_2_1_17_1"},{"unstructured":"Elham Babaee Nor Badrul Anuar Ainuddin Wahid Abdul Wahab Shahaboddin Shamshirband and Anthony T. Chronopoulos. 2018. An overview of audio event detection methods from feature extraction to classification. Applied Artificial Intelligence (2018) 1--54.","key":"e_1_2_1_18_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_19_1","DOI":"10.1016\/j.jvcir.2016.07.021"},{"doi-asserted-by":"publisher","key":"e_1_2_1_20_1","DOI":"10.5555\/1698924.1699041"},{"doi-asserted-by":"publisher","key":"e_1_2_1_21_1","DOI":"10.1007\/11744023_32"},{"volume-title":"Proceedings of the Working Notes Proceeding MediaEval Workshop","author":"Bischke Benjamin","key":"e_1_2_1_22_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_23_1","DOI":"10.1145\/2964284.2984063"},{"volume-title":"Proceedings of the MediaEval 2017 Workshop (Sept. 
13-15","year":"2017","author":"Bischke Benjamin","key":"e_1_2_1_24_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_25_1","DOI":"10.1145\/1282280.1282340"},{"doi-asserted-by":"publisher","key":"e_1_2_1_26_1","DOI":"10.1109\/ICCV.2013.151"},{"doi-asserted-by":"publisher","key":"e_1_2_1_27_1","DOI":"10.1145\/2324796.2324823"},{"doi-asserted-by":"publisher","key":"e_1_2_1_28_1","DOI":"10.1109\/IJCNN.2015.7280624"},{"doi-asserted-by":"publisher","key":"e_1_2_1_29_1","DOI":"10.1109\/IJCNN.2016.7727634"},{"volume":"28","volume-title":"NIST TRECVID Workshop","author":"Cao Liangliang","key":"e_1_2_1_30_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_31_1","DOI":"10.1109\/TPAMI.2016.2608901"},{"key":"e_1_2_1_32_1","volume-title":"Proceedings of the 6th IASTED International Conference","volume":"134643","author":"Chatzichristofis S.","year":"2009"},{"volume-title":"Proceedings of the International Conference on Computer Vision Systems. Springer, 312--322","author":"Savvas","key":"e_1_2_1_33_1"},{"volume-title":"Mosift: Recognizing human actions in surveillance videos.","year":"2009","author":"Alexander Hauptmann Chen","key":"e_1_2_1_34_1"},{"unstructured":"Tao Chen Damian Borth Trevor Darrell and Shih-Fu Chang. 2014. DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks. 
arXiv preprint arXiv:1410.8586 (2014).","key":"e_1_2_1_35_1"},{"volume-title":"Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)","year":"2016","author":"Choi Inkyu","key":"e_1_2_1_36_1"},{"volume-title":"Proceedings of the 2006 IEEE International Conference on Multimedia and Expo, IEEE, 885--888","author":"Chu Selina","key":"e_1_2_1_37_1"},{"volume-title":"Proceedings of the 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 69--72","author":"Courtenay","key":"e_1_2_1_38_1"},{"doi-asserted-by":"crossref","unstructured":"Ekin D. Cubuk Barret Zoph Dandelion Mane Vijay Vasudevan and Quoc V. Le. 2018. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501 (2018).","key":"e_1_2_1_39_1","DOI":"10.1109\/CVPR.2019.00020"},{"volume-title":"Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events.","year":"2016","author":"Dai Wei Juncheng Li","key":"e_1_2_1_40_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_41_1","DOI":"10.1007\/s11042-012-1153-6"},{"doi-asserted-by":"publisher","key":"e_1_2_1_42_1","DOI":"10.1109\/CVPR.2009.5206848"},{"unstructured":"Terrance DeVries and Graham W. Taylor. 2017. Dataset augmentation in feature space. 
arXiv preprint arXiv:1702.05538 (2017).","key":"e_1_2_1_43_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_44_1","DOI":"10.1016\/j.patcog.2015.04.005"},{"doi-asserted-by":"publisher","key":"e_1_2_1_45_1","DOI":"10.1109\/ICCVW.2015.40"},{"doi-asserted-by":"publisher","key":"e_1_2_1_46_1","DOI":"10.1109\/CVPR.2018.00630"},{"doi-asserted-by":"publisher","key":"e_1_2_1_47_1","DOI":"10.1145\/2964284.2967290"},{"unstructured":"Jonathan G. Fiscus. 2010. TRECVID multimedia event detection 2010 evaluation. (2010).","key":"e_1_2_1_48_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_49_1","DOI":"10.1016\/j.patrec.2015.06.026"},{"doi-asserted-by":"publisher","key":"e_1_2_1_50_1","DOI":"10.1109\/TITS.2015.2470216"},{"doi-asserted-by":"publisher","key":"e_1_2_1_51_1","DOI":"10.1145\/2502081.2502245"},{"doi-asserted-by":"publisher","key":"e_1_2_1_52_1","DOI":"10.1109\/MMUL.2005.87"},{"doi-asserted-by":"crossref","unstructured":"Steve Frolking Jianjun Qiu Stephen Boles Xiangming Xiao Jiyuan Liu Yahui Zhuang Changsheng Li and Xiaoguang Qin. 2002. Combining remote sensing and ground census data to develop new maps of the distribution of rice agriculture in China. Global Biogeochemical Cycles 16 4 (2002).","key":"e_1_2_1_53_1","DOI":"10.1029\/2001GB001425"},{"doi-asserted-by":"publisher","key":"e_1_2_1_54_1","DOI":"10.1109\/ICCV.2015.230"},{"doi-asserted-by":"publisher","key":"e_1_2_1_55_1","DOI":"10.1007\/978-3-319-46487-9_52"},{"volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 
2568--2577","author":"Gan Chuang","key":"e_1_2_1_56_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_57_1","DOI":"10.1109\/CVPR.2016.106"},{"volume-title":"Proceedings of the 2014 22nd European Signal Processing Conference (EUSIPCO). IEEE, 506--510","year":"2014","author":"Gencoglu Oguzhan","key":"e_1_2_1_58_1"},{"doi-asserted-by":"crossref","unstructured":"D. Giannoulis E. Benetos D. Stowell M. Rossignol M. Lagrange and M. Plumbley. 2013. IEEE AASP challenge: Detection and classification of acoustic scenes and events. Queen Mary University of London: London UK (2013).","key":"e_1_2_1_59_1","DOI":"10.1109\/WASPAA.2013.6701819"},{"doi-asserted-by":"publisher","key":"e_1_2_1_60_1","DOI":"10.1109\/CVPR.2014.81"},{"unstructured":"Ian Goodfellow Yoshua Bengio and Aaron Courville. 2016. Deep Learning. Vol. 1. MIT Press Cambridge.","key":"e_1_2_1_61_1"},{"unstructured":"Ian Goodfellow Jean Pouget-Abadie Mehdi Mirza Bing Xu David Warde-Farley Sherjil Ozair Aaron Courville and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672--2680.","key":"e_1_2_1_62_1"},{"volume-title":"Proceedings of the 2015 IEEE 17th International Workshop on Multimedia Signal Processing (MMSP). IEEE, 1--6.","year":"2015","author":"Guo Cong","key":"e_1_2_1_63_1"},{"unstructured":"Cong Guo Xinmei Tian and Tao Mei. 2017. Multi-granular event recognition of personal photo albums. 
IEEE Transactions on Multimedia (2017).","key":"e_1_2_1_64_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_65_1","DOI":"10.1109\/ICME.2005.1521503"},{"volume-title":"Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)","year":"2016","author":"Hayashi Tomoki","key":"e_1_2_1_66_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_67_1","DOI":"10.1109\/ICASSP.2017.7952259"},{"unstructured":"Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015).","key":"e_1_2_1_68_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_69_1","DOI":"10.1109\/CVPR.2016.90"},{"doi-asserted-by":"publisher","key":"e_1_2_1_70_1","DOI":"10.1109\/ICASSP.2013.6639360"},{"doi-asserted-by":"publisher","key":"e_1_2_1_71_1","DOI":"10.1016\/j.cviu.2004.02.005"},{"volume-title":"Sound Event Detection in Real Life Audio Using Multimodel System. Technical Report. DCASE2017 Challenge, Tech. Rep.","year":"2017","author":"Hou Yuanbo","key":"e_1_2_1_72_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_73_1","DOI":"10.1007\/s11263-015-0823-z"},{"doi-asserted-by":"publisher","key":"e_1_2_1_74_1","DOI":"10.1109\/ICPR.2014.125"},{"doi-asserted-by":"publisher","key":"e_1_2_1_75_1","DOI":"10.1145\/2647868.2654889"},{"doi-asserted-by":"publisher","key":"e_1_2_1_76_1","DOI":"10.1145\/2393347.2393412"},{"doi-asserted-by":"publisher","key":"e_1_2_1_77_1","DOI":"10.1109\/TPAMI.2017.2670560"},{"doi-asserted-by":"publisher","key":"e_1_2_1_78_1","DOI":"10.1145\/2964284.2964309"},{"unstructured":"Andreas Kamilaris and Francesc X. Prenafeta-Bold\u00fa. 2018. 
Disaster monitoring using unmanned aerial vehicles and deep learning. arXiv preprint arXiv:1807.11805 (2018).","key":"e_1_2_1_79_1"},{"volume-title":"Proceedings of the MediaEval 2017 Workshop (Sept. 13--15","year":"2017","author":"Nogueira Keiller","key":"e_1_2_1_80_1"},{"doi-asserted-by":"crossref","unstructured":"Zvi Kons and Orith Toledo-Ronen. 2013. Audio event classification using deep neural networks. In Interspeech. 1482--1486.","key":"e_1_2_1_81_1","DOI":"10.21437\/Interspeech.2013-384"},{"unstructured":"Alex Krizhevsky Ilya Sutskever and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097--1105.","key":"e_1_2_1_82_1"},{"volume-title":"Proceedings on the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE'16)","author":"K\u00fcrby Julian","key":"e_1_2_1_83_1"},{"unstructured":"Ying-Hui Lai Chun-Hao Wang Shi-Yan Hou Bang-Yin Chen Yu Tsao and Yi-Wen Liu. 2016. DCASE report for task 3: Sound event detection in real life audio. 
IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events (2016).","key":"e_1_2_1_84_1"},{"key":"e_1_2_1_85_1","volume-title":"TRECVID 2013 Workshop","volume":"1","author":"Lan Zhen-Zhong","year":"2013"},{"volume-title":"DCASE2017 Challenge.","year":"2017","author":"Lee Donmoon","key":"e_1_2_1_86_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_87_1","DOI":"10.1109\/ICCV.2007.4408872"},{"volume-title":"Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop.","year":"2017","author":"Lim Hyungui","key":"e_1_2_1_88_1"},{"unstructured":"Min Lin Qiang Chen and Shuicheng Yan. 2013. Network in network. arXiv preprint arXiv:1312.4400 (2013).","key":"e_1_2_1_89_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_90_1","DOI":"10.1109\/ICCVW.2015.44"},{"doi-asserted-by":"publisher","key":"e_1_2_1_91_1","DOI":"10.1145\/2461466.2461493"},{"doi-asserted-by":"publisher","key":"e_1_2_1_92_1","DOI":"10.1016\/j.procs.2016.07.144"},{"doi-asserted-by":"publisher","key":"e_1_2_1_93_1","DOI":"10.1609\/aaai.v32i1.12319"},{"doi-asserted-by":"publisher","key":"e_1_2_1_94_1","DOI":"10.1109\/CVPR.2018.00817"},{"volume-title":"Proceedings of the MediaEval 2017 Workshop (Sept. 13--15","year":"2017","author":"Lopez-Fuentes Laura","key":"e_1_2_1_95_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_96_1","DOI":"10.1023\/B:VISI.0000029664.99615.94"},{"doi-asserted-by":"publisher","key":"e_1_2_1_97_1","DOI":"10.1145\/2910017.2910630"},{"key":"e_1_2_1_98_1","first-page":"9","article-title":"Event-based media organization and indexing","volume":"3","author":"Mattivi R.","year":"2011","journal-title":"Infocommunications Journal"},{"volume-title":"DCASE 2017 challenge setup: Tasks, datasets and baseline system. 
In Proceedings of the DCASE 2017-Workshop on Detection and Classification of Acoustic Scenes and Events.","year":"2017","author":"Mesaros Annamaria","key":"e_1_2_1_99_1"},{"volume-title":"Proceedings of the 2010 18th European Signal Processing Conference. IEEE, 1267--1271","year":"2010","author":"Mesaros Annamaria","key":"e_1_2_1_100_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_101_1","DOI":"10.3390\/app6060162"},{"doi-asserted-by":"publisher","key":"e_1_2_1_102_1","DOI":"10.1145\/2911996.2912036"},{"unstructured":"Matthias Meyer Lukas Cavigelli and Lothar Thiele. 2017. Efficient convolutional neural network for audio event detection. arXiv preprint arXiv:1709.09888 (2017).","key":"e_1_2_1_103_1"},{"volume-title":"Proc. of the MediaEval Workshop (Sept. 13--15","year":"2017","author":"Minh-Son Dao","key":"e_1_2_1_104_1"},{"volume-title":"Proceedings of the MediaEval 2017 Workshop (Sept. 13--15","year":"2017","author":"Muhammad Hanif","key":"e_1_2_1_105_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_106_1","DOI":"10.1109\/MMUL.2006.63"},{"unstructured":"Keiller Nogueira Samuel G. Fadel \u00cdcaro C. Dourado Rafael de O. Werneck Javier A. V. Mu\u00f1oz Ot\u00e1vio A. B. Penatti Rodrigo T. Calumby Lin Tzy Li Jefersson A. dos Santos and Ricardo da S. Torres. 2017. Exploiting ConvNet diversity for flooding identification. 
arXiv preprint arXiv:1711.03564 (2017).","key":"e_1_2_1_107_1"},{"volume-title":"TRECVID Workshop.","year":"2012","author":"Oneata Dan","key":"e_1_2_1_108_1"},{"volume-title":"TRECVID 2014--An overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID. 52","year":"2014","author":"Over Paul","key":"e_1_2_1_109_1"},{"volume-title":"Proceedings of MediaEval.","year":"2011","author":"Papadopoulos Symeon","key":"e_1_2_1_110_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_111_1","DOI":"10.1109\/ICASSP.2016.7472917"},{"doi-asserted-by":"publisher","key":"e_1_2_1_112_1","DOI":"10.1109\/CVPRW.2015.7301335"},{"volume-title":"Proceedings of the International Conference on Multimedia Retrieval Workshop on Social Events in Web Multimedia (SEWM).","year":"2014","author":"Petkos Georgios","key":"e_1_2_1_113_1"},{"doi-asserted-by":"crossref","unstructured":"Huy Phan Lars Hertel Marco Maass and Alfred Mertins. 2016. Robust audio event recognition with 1-max pooling convolutional neural networks. arXiv preprint arXiv:1604.06338 (2016).","key":"e_1_2_1_114_1","DOI":"10.21437\/Interspeech.2016-123"},{"doi-asserted-by":"publisher","key":"e_1_2_1_115_1","DOI":"10.1145\/2733373.2806390"},{"doi-asserted-by":"publisher","key":"e_1_2_1_116_1","DOI":"10.1109\/ICASSP.2014.6854293"},{"doi-asserted-by":"publisher","key":"e_1_2_1_117_1","DOI":"10.1109\/ISM.2016.0048"},{"doi-asserted-by":"publisher","key":"e_1_2_1_118_1","DOI":"10.1142\/S1793351X17400050"},{"volume-title":"Proceedings of the Korea-Japan Joint Workshop on Frontiers of Computer Vision. 
85--90","year":"2016","author":"Rachmadi Reza Fuad","key":"e_1_2_1_119_1"},{"volume-title":"Proceedings of the MediaEval Multimedia Benchmark Workshop Barcelona, Spain, October 18--19","year":"2013","author":"Reuter Timo","key":"e_1_2_1_120_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_121_1","DOI":"10.1016\/j.rse.2010.07.005"},{"doi-asserted-by":"publisher","key":"e_1_2_1_122_1","DOI":"10.1109\/FG.2015.7163105"},{"doi-asserted-by":"publisher","key":"e_1_2_1_123_1","DOI":"10.1145\/2647868.2655045"},{"doi-asserted-by":"publisher","key":"e_1_2_1_124_1","DOI":"10.1109\/CVPRW.2015.7301334"},{"unstructured":"Emmanouil Schinas Georgios Petkos Symeon Papadopoulos and Yiannis Kompatsiaris. 2012. CERTH@ MediaEval 2012 social event detection task. In MediaEval. Citeseer.","key":"e_1_2_1_125_1"},{"key":"e_1_2_1_126_1","volume-title":"Proceedings of the 1999 Congress on Evolutionary Computation (CEC\u201999)","volume":"3","author":"Shi Yuhui","year":"1999"},{"unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).","key":"e_1_2_1_127_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_128_1","DOI":"10.1109\/ICCV.2015.518"},{"doi-asserted-by":"publisher","key":"e_1_2_1_130_1","DOI":"10.1109\/CVPR.2018.00151"},{"doi-asserted-by":"publisher","key":"e_1_2_1_131_1","DOI":"10.1109\/CVPR.2015.7298594"},{"doi-asserted-by":"publisher","key":"e_1_2_1_132_1","DOI":"10.1109\/CVPR.2016.308"},{"doi-asserted-by":"crossref","unstructured":"Naoya Takahashi Michael Gygli Beat Pfister and Luc Van Gool. 2016. 
Deep convolutional neural networks and data augmentation for acoustic event detection. arXiv preprint arXiv:1604.07160 (2016).","key":"e_1_2_1_133_1","DOI":"10.21437\/Interspeech.2016-805"},{"doi-asserted-by":"publisher","key":"e_1_2_1_134_1","DOI":"10.1109\/TMM.2017.2751969"},{"unstructured":"Planet Team. 2016. Planet application program interface: In Space for Life on Earth. San Francisco CA.","key":"e_1_2_1_135_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_136_1","DOI":"10.1145\/2812802"},{"doi-asserted-by":"publisher","key":"e_1_2_1_137_1","DOI":"10.1109\/ICCV.2015.510"},{"doi-asserted-by":"publisher","key":"e_1_2_1_138_1","DOI":"10.1109\/ICME.2011.6012232"},{"doi-asserted-by":"publisher","key":"e_1_2_1_139_1","DOI":"10.1016\/j.imavis.2016.05.005"},{"unstructured":"Dmitrii Ubskii and Alexei Pugachev. 2016. Sound event detection in real-life audio. IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events (2016).","key":"e_1_2_1_140_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_141_1","DOI":"10.1007\/s11263-013-0620-5"},{"volume-title":"Proceedings of the NAG\/DAGA International Conference on Acoustics.","year":"2009","author":"Van Grootel MWW","key":"e_1_2_1_142_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_143_1","DOI":"10.1109\/CVPR.2011.5995407"},{"doi-asserted-by":"publisher","key":"e_1_2_1_144_1","DOI":"10.1109\/ICCV.2013.441"},{"unstructured":"
Jun Wang and Jean-Daniel Zucker. 2000. Solving multiple-instance problem: A lazy learning approach. (2000).","key":"e_1_2_1_145_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_146_1","DOI":"10.1109\/ICCVW.2015.46"},{"doi-asserted-by":"publisher","key":"e_1_2_1_147_1","DOI":"10.1007\/s11263-017-1043-5"},{"doi-asserted-by":"publisher","key":"e_1_2_1_148_1","DOI":"10.1109\/CVPR.2018.00813"},{"doi-asserted-by":"crossref","unstructured":"Xiaolong Wang and Abhinav Gupta. 2018. Videos as space-time region graphs. arXiv preprint arXiv:1806.01810 (2018).","key":"e_1_2_1_149_1","DOI":"10.1007\/978-3-030-01228-1_25"},{"doi-asserted-by":"publisher","key":"e_1_2_1_150_1","DOI":"10.1109\/CVPR.2015.7299071"},{"doi-asserted-by":"publisher","key":"e_1_2_1_151_1","DOI":"10.1145\/2911996.2912048"},{"doi-asserted-by":"publisher","key":"e_1_2_1_152_1","DOI":"10.1109\/ICASSP.2017.7952704"},{"doi-asserted-by":"publisher","key":"e_1_2_1_153_1","DOI":"10.1109\/ICASSP.2016.7472176"},{"doi-asserted-by":"publisher","key":"e_1_2_1_154_1","DOI":"10.1109\/ICCVW.2015.45"},{"unstructured":"Sebastien C. Wong Adam Gatt Victor Stamatescu and Mark D. McDonnell. 2016. Understanding data augmentation for classification: When to warp? arXiv preprint arXiv:1609.08764 (2016).","key":"e_1_2_1_155_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_156_1","DOI":"10.1109\/TMM.2015.2477681"},{"volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1600--1609","year":"2015","author":"Xiong Yuanjun","key":"e_1_2_1_157_1"},{"doi-asserted-by":"crossref","unstructured":"Dan Xu Elisa Ricci Yan Yan Jingkuan Song and Nicu Sebe. 2015. 
Learning deep representations of appearance and motion for anomalous event detection. arXiv preprint arXiv:1510.01553 (2015).","key":"e_1_2_1_158_1","DOI":"10.5244\/C.29.8"},{"volume-title":"Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1798--1807","author":"Xu Zhongwen","key":"e_1_2_1_159_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_160_1","DOI":"10.1109\/3477.752789"},{"doi-asserted-by":"publisher","key":"e_1_2_1_161_1","DOI":"10.1109\/ICCV.2015.512"},{"doi-asserted-by":"publisher","key":"e_1_2_1_162_1","DOI":"10.1145\/2733373.2806221"},{"doi-asserted-by":"publisher","key":"e_1_2_1_163_1","DOI":"10.1007\/s11263-017-1013-y"},{"doi-asserted-by":"publisher","key":"e_1_2_1_164_1","DOI":"10.1016\/j.neucom.2016.03.102"},{"key":"e_1_2_1_165_1","volume-title":"MER. In Proceedings of the NIST TRECVID Video Retrieval Evaluation Workshop","volume":"24","author":"Yu I","year":"2014"},{"volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 53--61","author":"Yue-Hei Ng Joe","key":"e_1_2_1_166_1"},{"unstructured":"Shengxin Zha Florian Luisier Walter Andrews Nitish Srivastava and Ruslan Salakhutdinov. 2015. Exploiting image-trained CNN architectures for unconstrained video classification. arXiv preprint arXiv:1503.04144 (2015).","key":"e_1_2_1_167_1"},{"unstructured":"Dongqing Zhang and Dan Ellis. 2001. 
Detecting sound events in basketball video archive. Dept. Electronic Eng. Columbia Univ. New York (2001).","key":"e_1_2_1_168_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_169_1","DOI":"10.1109\/TIP.2015.2511585"},{"unstructured":"Bolei Zhou Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva. 2014. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems. 487--495.","key":"e_1_2_1_170_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_171_1","DOI":"10.1007\/s11263-017-1033-7"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3306240","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3306240","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:25:28Z","timestamp":1750206328000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3306240"}},"subtitle":["A 
Survey"],"short-title":[],"issued":{"date-parts":[[2019,5,31]]},"references-count":170,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2019,5,31]]}},"alternative-id":["10.1145\/3306240"],"URL":"https:\/\/doi.org\/10.1145\/3306240","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2019,5,31]]},"assertion":[{"value":"2018-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-01-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-06-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}