{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,19]],"date-time":"2026-02-19T02:18:05Z","timestamp":1771467485854,"version":"3.50.1"},"reference-count":177,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Intell. Syst. Technol."],"published-print":{"date-parts":[[2026,4,30]]},"abstract":"<jats:p>Analyzing action scenes in soccer is a challenging task due to the complex and dynamic nature of the game, as well as the interactions between players. This article provides a comprehensive overview of this task, divided into action recognition, spotting key moments, and identifying actions in both time and space (spatio-temporal action localization) in soccer. We explore publicly available data sources and metrics used to evaluate models\u2019 performance. The article reviews recent state-of-the-art methods that leverage deep learning techniques and traditional approaches. Our analysis begins with methods based on feature engineering, followed by an exploration of various deep learning techniques. This includes using Convolutional Neural Networks (CNNs) for visual information processing, Recurrent Neural Networks (RNNs) for analyzing temporal sequences, and transformer architectures to effectively capture context. In particular, we focus on the specifics of multimodal data, illustrating the potential for improved model accuracy and robustness. This includes an exploration of methods that integrate information from multiple sources, such as video and audio data, and methods that represent a single data source through multiple analytical lenses, offering a richer, more nuanced understanding of soccer actions (e.g., using a graph representation of players). 
Finally, the article highlights some of the open research questions and future directions in the field of soccer action analysis, especially the potential for multimodal methods to advance this field. Overall, this survey provides a valuable resource for researchers interested in the field of analyzing action scenes in soccer.<\/jats:p>","DOI":"10.1145\/3776541","type":"journal-article","created":{"date-parts":[[2025,11,11]],"date-time":"2025-11-11T14:45:57Z","timestamp":1762872357000},"page":"1-37","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Survey of Action Recognition, Spotting, and Spatio-Temporal Localization in Soccer\u2014Current Trends and Research Perspectives"],"prefix":"10.1145","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0617-7301","authenticated-orcid":false,"given":"Karolina","family":"Seweryn","sequence":"first","affiliation":[{"name":"NASK National Research Institute, Warszawa, Poland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3407-7570","authenticated-orcid":false,"given":"Anna","family":"Wr\u00f3blewska","sequence":"additional","affiliation":[{"name":"Warsaw University of Technology, Warszawa, Poland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6716-610X","authenticated-orcid":false,"given":"Szymon","family":"\u0141ukasik","sequence":"additional","affiliation":[{"name":"AGH University of Krakow, Krakow, Poland and NASK National Research Institute, Warszawa, Poland"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,1,20]]},"reference":[{"issue":"3","key":"e_1_3_3_2_2","doi-asserted-by":"crossref","first-page":"897","DOI":"10.1007\/s00530-022-01027-0","article-title":"Use of deep learning in soccer videos analysis: Survey","volume":"29","author":"Akan Sara","year":"2023","unstructured":"Sara 
Akan and Song\u00fcl Varl\u0131. 2023. Use of deep learning in soccer videos analysis: Survey. Multimedia Systems 29, 3 (2023), 897\u2013915.","journal-title":"Multimedia Systems"},{"key":"e_1_3_3_3_2","doi-asserted-by":"crossref","first-page":"1437","DOI":"10.1109\/TPAMI.2017.2711011","article-title":"NetVLAD: CNN architecture for weakly supervised place recognition","volume":"40","author":"Arandjelovi\u0107 Relja","year":"2015","unstructured":"Relja Arandjelovi\u0107, Petr Gron\u00e1t, Akihiko Torii, Tom\u00e1s Pajdla, and Josef Sivic. 2015. NetVLAD: CNN architecture for weakly supervised place recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (2015), 1437\u20131451.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_3_4_2","first-page":"6816","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Arnab Anurag","year":"2021","unstructured":"Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. 2021. ViViT: A video vision transformer. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), 6816\u20136826. DOI: 10.1109\/ICCV48922.2021.00676"},{"issue":"2","key":"e_1_3_3_5_2","doi-asserted-by":"crossref","first-page":"423","DOI":"10.1109\/TPAMI.2018.2798607","article-title":"Multimodal machine learning: A survey and taxonomy","volume":"41","author":"Baltru\u0161aitis Tadas","year":"2018","unstructured":"Tadas Baltru\u0161aitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. 
IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 2 (2018), 423\u2013443.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_3_6_2","first-page":"1","volume-title":"Proceedings of the 4th International Workshop on Multimedia Content Analysis in Sports (MMSports \u201921)","author":"Biermann Henrik","year":"2021","unstructured":"Henrik Biermann, Jonas Theiner, Manuel Bassek, Dominik Raabe, Daniel Memmert, and Ralph Ewerth. 2021. A unified taxonomy and multimodal dataset for events in invasion games. In Proceedings of the 4th International Workshop on Multimedia Content Analysis in Sports (MMSports \u201921). ACM, New York, NY, 1\u201310. DOI: 10.1145\/3475722.3482792"},{"key":"e_1_3_3_7_2","doi-asserted-by":"crossref","first-page":"5562","DOI":"10.1109\/ICCV.2017.593","volume-title":"2017 IEEE International Conference on Computer Vision (ICCV)","author":"Bodla Navaneeth","year":"2017","unstructured":"Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. 2017. Soft-NMS\u2014Improving object detection with one line of code. In 2017 IEEE International Conference on Computer Vision (ICCV), 5562\u20135570."},{"key":"e_1_3_3_8_2","first-page":"8.1","volume-title":"Proceedings of the British Machine Vision Conference","author":"Bregonzio Matteo","year":"2010","unstructured":"Matteo Bregonzio, Jian Li, Shaogang Gong, and Tao Xiang. 2010. Discriminative topics modelling for action feature selection and recognition. In Proceedings of the British Machine Vision Conference. BMVA Press, 8.1\u20138.11. DOI: 10.5244\/C.24.8"},{"key":"e_1_3_3_9_2","article-title":"Using network science to analyse football passing networks: Dynamics, space, time, and the multilayer nature of the game","volume":"9","author":"Buld\u00fa Javier M.","year":"2018","unstructured":"Javier M. Buld\u00fa, Javier Busquets, Johann H. Mart\u00ednez, Jos\u00e9 L. Herrera-Diestra, Ignacio Echegoyen, Javier Galeano, and Jordi Luque. 2018. 
Using network science to analyse football passing networks: Dynamics, space, time, and the multilayer nature of the game. Frontiers in Psychology 9 (2018), 1900.","journal-title":"Frontiers in Psychology"},{"key":"e_1_3_3_10_2","first-page":"3386","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops","author":"Cabado Bruno","year":"2024","unstructured":"Bruno Cabado, Anthony Cioppa, Silvio Giancola, Andr\u00e9s Villa, Bertha Guijarro-Berdi\u00f1as, Emilio J. Padr\u00f3n, Bernard Ghanem, and Marc Van Droogenbroeck. 2024. Beyond the premier: Assessing action spotting transfer capability across diverse domains. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 3386\u20133398."},{"key":"e_1_3_3_11_2","first-page":"1","volume-title":"2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP)","author":"Cao Mengqi","year":"2022","unstructured":"Mengqi Cao, Min Yang, Guozhen Zhang, Xiaotian Li, Yilu Wu, Gangshan Wu, and Limin Wang. 2022. SpotFormer: A transformer-based framework for precise soccer action spotting. In 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP), 1\u20136. DOI: 10.1109\/MMSP55362.2022.9948888"},{"key":"e_1_3_3_12_2","first-page":"6299","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Carreira Joao","year":"2017","unstructured":"Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299\u20136308."},{"key":"e_1_3_3_13_2","first-page":"93","volume-title":"Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports (MMSports \u201922)","author":"Cartas Alejandro","year":"2022","unstructured":"Alejandro Cartas, Coloma Ballester, and Gloria Haro. 2022. 
A graph-based method for soccer action spotting using unsupervised player classification. In Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports (MMSports \u201922). ACM, New York, NY, 93\u2013102. DOI: 10.1145\/3552437.3555691"},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","unstructured":"Shimin Chen Chen Chen Wei Li Xunqiang Tao and Yandong Guo. 2022. Faster-TAD: Towards Temporal Action Detection with Proposal Generation and Classification in a Unified Network. DOI: 10.48550\/ARXIV.2204.02674","DOI":"10.48550\/ARXIV.2204.02674"},{"key":"e_1_3_3_15_2","unstructured":"Junyoung Chung Caglar Gulcehre Kyunghyun Cho and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555. Retrieved from https:\/\/arxiv.org\/abs\/1412.3555."},{"key":"e_1_3_3_16_2","doi-asserted-by":"crossref","first-page":"4532","DOI":"10.1109\/CVPRW53098.2021.00511","volume-title":"2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW),","author":"Cioppa Anthony","year":"2021","unstructured":"Anthony Cioppa, Adrien Deli\u00e8ge, Floriane Magera, Silvio Giancola, Olivier Barnich, Bernard Ghanem, and Marc Van Droogenbroeck. 2021. Camera calibration and player localization in SoccerNet-v2 and investigation of their representations for action spotting. In 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 4532\u20134541."},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41597-022-01469-1"},{"key":"e_1_3_3_18_2","first-page":"13123","volume-title":"The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Cioppa Anthony","year":"2020","unstructured":"Anthony Cioppa, Adrien Deli\u00e8ge, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck, Rikke Gade, and Thomas B. Moeslund. 2020. A context-aware loss function for action spotting in soccer videos. 
In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 13123\u201313133."},{"key":"e_1_3_3_19_2","first-page":"1107","volume-title":"Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers","author":"Conneau Alexis","year":"2017","unstructured":"Alexis Conneau, Holger Schwenk, Lo\u00efc Barrault, and Yann Lecun. 2017. Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Association for Computational Linguistics, Valencia, Spain, 1107\u20131116. Retrieved from https:\/\/aclanthology.org\/E17-1104"},{"key":"e_1_3_3_20_2","first-page":"886","volume-title":"2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR \u201905)","author":"Dalal N.","year":"2005","unstructured":"N. Dalal and B. Triggs. 2005. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR \u201905), 886\u2013893. DOI: 10.1109\/CVPR.2005.177"},{"key":"e_1_3_3_21_2","first-page":"87","volume-title":"Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports (MMSports \u201922)","author":"Darwish Abdulrahman","year":"2022","unstructured":"Abdulrahman Darwish and Tallal El-Shabrway. 2022. STE: Spatio-temporal encoder for action spotting in soccer videos. In Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports (MMSports \u201922). ACM, New York, NY, 87\u201392. 
DOI: 10.1145\/3552437.3555704"},{"key":"e_1_3_3_22_2","doi-asserted-by":"crossref","first-page":"4503","DOI":"10.1109\/CVPRW53098.2021.00508","volume-title":"2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","author":"Deli\u00e8ge Adrien","year":"2021","unstructured":"Adrien Deli\u00e8ge, Anthony Cioppa, Silvio Giancola, Meisam J. Seikavandi, Jacob V. Dueholm, Kamal Nasrollahi, Bernard Ghanem, Thomas B. Moeslund, and Marc Van Droogenbroeck. 2021. SoccerNet-v2: A dataset and benchmarks for holistic understanding of broadcast soccer videos. In 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 4503\u20134514. DOI: 10.1109\/CVPRW53098.2021.00508"},{"key":"e_1_3_3_23_2","first-page":"530","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops","author":"Denize Julien","year":"2024","unstructured":"Julien Denize, Mykola Liashuha, Jaonary Rabarisoa, Astrid Orcesi, and Romain H\u00e9rault. 2024. COMEDIAN: Self-supervised learning and knowledge distillation for action spotting using transformers. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops. IEEE, 530\u2013540."},{"key":"e_1_3_3_24_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. 
Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_3_25_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2010.03.009"},{"issue":"12","key":"e_1_3_3_26_2","doi-asserted-by":"crossref","first-page":"16995","DOI":"10.1007\/s11042-018-7083-1","article-title":"Event detection in soccer videos using unsupervised learning of spatio-temporal features based on pooled spatial pyramid model","volume":"78","author":"Fakhar Babak","year":"2019","unstructured":"Babak Fakhar, Hamidreza Rashidy Kanan, and Alireza Behrad. 2019. Event detection in soccer videos using unsupervised learning of spatio-temporal features based on pooled spatial pyramid model. Multimedia Tools and Applications 78, 12 (Jun. 2019), 16995\u201317025.","journal-title":"Multimedia Tools and Applications"},{"key":"e_1_3_3_27_2","first-page":"6804","volume-title":"2021 IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Fan Haoqi","year":"2021","unstructured":"Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. 2021. Multiscale vision transformers. In 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), 6804\u20136815."},{"key":"e_1_3_3_28_2","first-page":"3340","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"Faure Gueter Josmy","year":"2023","unstructured":"Gueter Josmy Faure, Min-Hung Chen, and Shang-Hong Lai. 2023. Holistic interaction transformer network for action detection. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, 3340\u20133350."},{"key":"e_1_3_3_29_2","first-page":"6201","volume-title":"IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Feichtenhofer Christoph","year":"2018","unstructured":"Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2018. SlowFast networks for video recognition. 2019. 
In IEEE\/CVF International Conference on Computer Vision (ICCV), 6201\u20136210."},{"key":"e_1_3_3_30_2","doi-asserted-by":"crossref","first-page":"28971","DOI":"10.1007\/s11042-020-09414-3","article-title":"SSET: A dataset for shot segmentation, event detection, player tracking in soccer videos","author":"Feng Na","year":"2020","unstructured":"Na Feng, Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, Yizhu Zhao, Yunfeng He, and Tao Guan. 2020. SSET: A dataset for shot segmentation, event detection, player tracking in soccer videos. Multimedia Tools and Applications 79, 39\u201340 (2020), 28971\u201328992.","journal-title":"Multimedia Tools and Applications"},{"key":"e_1_3_3_31_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00530-024-01341-9"},{"key":"e_1_3_3_32_2","volume-title":"International Conference on Learning Representations","author":"Foret Pierre","year":"2021","unstructured":"Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. 2021. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=6Tm1mposlrM"},{"key":"e_1_3_3_33_2","first-page":"1","volume-title":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","author":"Gan Yaozong","year":"2022","unstructured":"Yaozong Gan, Ren Togo, Takahiro Ogawa, and Miki Haseyama. 2022. Transformer based multimodal scene recognition in soccer videos. In 2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 1\u20136. DOI: 10.1109\/ICMEW56448.2022.9859304"},{"key":"e_1_3_3_34_2","first-page":"1","volume-title":"2020 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","author":"Gao Xin","year":"2020","unstructured":"Xin Gao, Xusheng Liu, Taotao Yang, Guilin Deng, Hao Peng, Qiaosong Zhang, Hai Li, and Junhui Liu. 2020. 
Automatic key moment extraction and highlights generation based on comprehensive soccer video understanding. In 2020 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 1\u20136. DOI: 10.1109\/ICMEW46912.2020.9106051"},{"key":"e_1_3_3_35_2","doi-asserted-by":"crossref","first-page":"13","DOI":"10.1145\/3552463.3557019","volume-title":"Proceedings of the 1st Workshop on User-Centric Narrative Summarization of Long Videos (NarSUM \u201922)","author":"Gautam Sushant","year":"2022","unstructured":"Sushant Gautam, Cise Midoglu, Saeed Shafiee Sabet, Dinesh Baniya Kshatri, and P\u00e5l Halvorsen. 2022. Soccer game summarization using audio commentary, metadata, and captions. In Proceedings of the 1st Workshop on User-Centric Narrative Summarization of Long Videos (NarSUM \u201922). ACM, New York, NY, 13\u201322. DOI: 10.1145\/3552463.3557019"},{"key":"e_1_3_3_36_2","doi-asserted-by":"crossref","first-page":"71","DOI":"10.1109\/ISM63611.2024.00016","volume-title":"2024 International Symposium on Multimedia (ISM)","author":"Gautam Sushant","year":"2024","unstructured":"Sushant Gautam, Mehdi Houshmand Sarkhoosh, Jan Held, Cise Midoglu, Anthony Cioppa, Silvio Giancola, Vajira Thambawita, Michael A. Riegler, Pal Halvorsen, and Mubarak Shah. 2024. SoccerNet-Echoes: A soccer game audio commentary dataset. In 2024 International Symposium on Multimedia (ISM), 71\u201378. DOI: 10.1109\/ISM63611.2024.00016"},{"key":"e_1_3_3_37_2","doi-asserted-by":"crossref","first-page":"776","DOI":"10.1109\/ICASSP.2017.7952261","volume-title":"2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Gemmeke Jort F.","year":"2017","unstructured":"Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio set: An ontology and human-labeled dataset for audio events. 
In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 776\u2013780."},{"key":"e_1_3_3_38_2","doi-asserted-by":"crossref","first-page":"1792","DOI":"10.1109\/CVPRW.2018.00223","volume-title":"2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","author":"Giancola Silvio","year":"2018","unstructured":"Silvio Giancola, Mohieddine Amine, Tarek Dghaily, and Bernard Ghanem. 2018. SoccerNet: A scalable dataset for action spotting in soccer videos. In 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 1792\u2013179210. DOI: 10.1109\/CVPRW.2018.00223"},{"key":"e_1_3_3_39_2","doi-asserted-by":"crossref","first-page":"4485","DOI":"10.1109\/CVPRW53098.2021.00506","volume-title":"2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","author":"Giancola Silvio","year":"2021","unstructured":"Silvio Giancola and Bernard Ghanem. 2021. Temporally-aware feature pooling for action spotting in soccer broadcasts. In 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 4485\u20134494."},{"key":"e_1_3_3_40_2","doi-asserted-by":"crossref","unstructured":"Yuan Gong Yu-An Chung and James R. Glass. 2021. AST: Audio spectrogram transformer. arXiv:2104.01778. Retrieved from https:\/\/arxiv.org\/abs\/2104.01778","DOI":"10.21437\/Interspeech.2021-698"},{"key":"e_1_3_3_41_2","doi-asserted-by":"crossref","first-page":"63373","DOI":"10.1109\/ACCESS.2019.2916887","article-title":"Deep multimodal representation learning: A survey","volume":"7","author":"Guo Wenzhong","year":"2019","unstructured":"Wenzhong Guo, Jianwen Wang, and Shiping Wang. 2019. Deep multimodal representation learning: A survey. 
IEEE Access 7 (2019), 63373\u201363394.","journal-title":"IEEE Access"},{"key":"e_1_3_3_42_2","volume-title":"British Machine Vision Conference","author":"He Bo","year":"2020","unstructured":"Bo He, Xitong Yang, Zuxuan Wu, Hao Chen, Ser-Nam Lim, and Abhinav Shrivastava. 2020. GTA: Global temporal attention for video action understanding. In British Machine Vision Conference."},{"key":"e_1_3_3_43_2","first-page":"386","article-title":"Mask R-CNN","volume":"42","author":"He Kaiming","year":"2017","unstructured":"Kaiming He, Georgia Gkioxari, Piotr Doll\u00e1r, and Ross B. Girshick. 2017. Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2017), 386\u2013397.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_3_44_2","first-page":"770","volume-title":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","author":"He Kaiming","year":"2015","unstructured":"Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770\u2013778."},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_3_46_2","first-page":"33","volume-title":"European Conference on Computer Vision","volume":"13695","author":"Hong James","year":"2022","unstructured":"James Hong, Haotian Zhang, Micha\u00ebl Gharbi, Matthew Fisher, and Kayvon Fatahalian. 2022. Spotting temporally precise, fine-grained events in video. In European Conference on Computer Vision, Vol. 13695, 33\u201351."},{"key":"e_1_3_3_47_2","doi-asserted-by":"publisher","unstructured":"Yuxi Hong Chen Ling and Zuochang Ye. 2018. End-to-end soccer video scene and event classification with deep transfer learning. 1\u20134. 
DOI: 10.1109\/ISACV.2018.8369043","DOI":"10.1109\/ISACV.2018.8369043"},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.asoc.2012.10.007"},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2006.876289"},{"key":"e_1_3_3_50_2","doi-asserted-by":"crossref","first-page":"1971","DOI":"10.1109\/CVPR.2016.217","volume-title":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Ibrahim Mostafa S.","year":"2016","unstructured":"Mostafa S. Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, and Greg Mori. 2016. A hierarchical deep temporal model for group activity recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1971\u20131980. DOI: 10.1109\/CVPR.2016.217"},{"key":"e_1_3_3_51_2","first-page":"1647","volume-title":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Ilg Eddy","year":"2016","unstructured":"Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. 2016. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1647\u20131655."},{"key":"e_1_3_3_52_2","first-page":"14","volume-title":"2013 International Conference on Signal-Image Technology and Internet-Based Systems","author":"Itoh Hiroki","year":"2013","unstructured":"Hiroki Itoh, Tetsuya Takiguchi, and Yasuo Ariki. 2013. Event detection and recognition using HMM with whistle sounds. In 2013 International Conference on Signal-Image Technology and Internet-Based Systems, 14\u201321. DOI: 10.1109\/SITIS.2013.14"},{"key":"e_1_3_3_53_2","doi-asserted-by":"publisher","unstructured":"Haohao Jiang Yao Lu and Jing Xue. 2016. Automatic soccer video event detection based on a deep neural network combined CNN and RNN. 490\u2013494. 
DOI: 10.1109\/ICTAI.2016.0081","DOI":"10.1109\/ICTAI.2016.0081"},{"key":"e_1_3_3_54_2","first-page":"1","volume-title":"Proceedings of the 3rd International Workshop on Multimedia Content Analysis in Sports (MMSports \u201920)","author":"Jiang Yudong","year":"2020","unstructured":"Yudong Jiang, Kaixu Cui, Leilei Chen, Canjin Wang, and Changliang Xu. 2020. SoccerDB: A large-scale database for comprehensive video understanding. In Proceedings of the 3rd International Workshop on Multimedia Content Analysis in Sports (MMSports \u201920). ACM, New York, NY, 1\u20138. DOI: 10.1145\/3422844.3423051"},{"key":"e_1_3_3_55_2","doi-asserted-by":"crossref","first-page":"4415","DOI":"10.1109\/ICCV.2017.472","volume-title":"2017 IEEE International Conference on Computer Vision (ICCV)","author":"Kalogeiton Vicky S.","year":"2017","unstructured":"Vicky S. Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. 2017. Action tubelet detector for spatio-temporal action localization. In 2017 IEEE International Conference on Computer Vision (ICCV), 4415\u20134423."},{"key":"e_1_3_3_56_2","first-page":"293","volume-title":"Computer Vision in Sports","author":"Kapela Rafal","year":"2015","unstructured":"Rafal Kapela, Kevin McGuinness, Aleksandra Swietlicka, and Noel E. O\u2019Connor. 2015. Real-time event detection in field sport videos. In Computer Vision in Sports. Springer, 293\u2013316."},{"key":"e_1_3_3_57_2","unstructured":"Ali Karimi Ramin Toosi and Mohammad Ali Akhaee. 2021. Soccer event detection using deep learning. arXiv:2102.04331. Retrieved from https:\/\/arxiv.org\/abs\/2102.04331"},{"key":"e_1_3_3_58_2","doi-asserted-by":"crossref","first-page":"48","DOI":"10.1109\/ICCKE57176.2022.9959985","volume-title":"2022 12th International Conference on Computer and Knowledge Engineering (ICCKE)","author":"Karimi Ali","year":"2022","unstructured":"Ali Karimi, Ramin Toosi, and Mohammad Ali Akhaee. 2022. Soccer video event detection using metric learning. 
In 2022 12th International Conference on Computer and Knowledge Engineering (ICCKE), 48\u201352. DOI: 10.1109\/ICCKE57176.2022.9959985"},{"key":"e_1_3_3_59_2","first-page":"250","volume-title":"Proceedings of the 15th ACM Multimedia Systems Conference (MMSys \u201924)","author":"Kassab Evan J\u00e5sund","year":"2024","unstructured":"Evan J\u00e5sund Kassab, H\u00e5kon Maric Solberg, Sushant Gautam, Saeed Shafiee Sabet, Thomas Torjusen, Michael Riegler, P\u00e5l Halvorsen, and Cise Midoglu. 2024. TACDEC: Dataset of tackle events in soccer game videos. In Proceedings of the 15th ACM Multimedia Systems Conference (MMSys \u201924). ACM, New York, NY, 250\u2013256. DOI: 10.1145\/3625468.3652166"},{"key":"e_1_3_3_60_2","first-page":"119","volume-title":"International Conference on Image Processing and Pattern Recognition (IPPR)","author":"Khan Abdullah","year":"2018","unstructured":"Abdullah Khan, Beatrice Lazzerini, Gaetano Calabrese, and Luciano Serafini. 2018. Soccer event detection. In International Conference on Image Processing and Pattern Recognition (IPPR), 119\u2013129. DOI: 10.5121\/csit.2018.80509"},{"key":"e_1_3_3_61_2","first-page":"1","volume-title":"2018 14th International Conference on Emerging Technologies (ICET)","author":"Khan Muhammad Zeeshan","year":"2018","unstructured":"Muhammad Zeeshan Khan, Summra Saleem, Muhammad A. Hassan, and Muhammad Usman Ghanni Khan. 2018. Learning deep C3D features for soccer video event detection. In 2018 14th International Conference on Emerging Technologies (ICET), 1\u20136."},{"key":"e_1_3_3_62_2","doi-asserted-by":"publisher","DOI":"10.3390\/app10228046"},{"key":"e_1_3_3_63_2","first-page":"1746","volume-title":"Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Kim Yoon","year":"2014","unstructured":"Yoon Kim. 2014. Convolutional neural networks for sentence classification. 
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1746\u20131751. DOI: 10.3115\/v1\/D14-1181"},{"key":"e_1_3_3_64_2","first-page":"9796","volume-title":"2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Kirillov Alexander","year":"2019","unstructured":"Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross B. Girshick. 2019. PointRend: Image segmentation as rendering. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9796\u20139805."},{"key":"e_1_3_3_65_2","unstructured":"Gregory Koch Richard Zemel and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. arXiv:1506.04080. Retrieved from https:\/\/arxiv.org\/abs\/1506.04080"},{"key":"e_1_3_3_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/TBC.2015.2424011"},{"key":"e_1_3_3_67_2","unstructured":"Okan K\u00f6p\u00fckl\u00fc Xiangyu Wei and Gerhard Rigoll. 2019. You only watch once: A unified CNN architecture for real-time spatiotemporal action localization. arXiv:1911.06644. Retrieved from https:\/\/arxiv.org\/abs\/1911.06644"},{"key":"e_1_3_3_68_2","first-page":"542","volume-title":"5th International AAAI Conference on Weblogs and Social Media","author":"Lanagan James","year":"2011","unstructured":"James Lanagan and Alan Smeaton. 2011. Using Twitter to detect and tag important events in sports media. In 5th International AAAI Conference on Weblogs and Social Media, 542\u2013545."},{"key":"e_1_3_3_69_2","first-page":"3280","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops","author":"Leduc Arnaud","year":"2024","unstructured":"Arnaud Leduc, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, and Marc Van Droogenbroeck. 2024. SoccerNet-Depth: A scalable dataset for monocular depth estimation in sports videos. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 3280\u20133292."},{"key":"e_1_3_3_70_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2004.826751"},{"key":"e_1_3_3_71_2","first-page":"833","volume-title":"European Conference on Computer Vision","volume":"9910","author":"Lev Guy","year":"2015","unstructured":"Guy Lev, Gil Sadeh, Benjamin Klein, and Lior Wolf. 2015. RNN fisher vectors for action recognition and image annotation. In European Conference on Computer Vision, Vol.\u202f9910, 833\u2013850."},{"key":"e_1_3_3_72_2","first-page":"169","volume-title":"Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP \u201903)","author":"Li Baoxin","year":"2003","unstructured":"Baoxin Li, Hao Pan, and I. Sezan. 2003. A general framework for sports video summarization with its application to soccer. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP \u201903), Vol.\u202f3, 169\u2013172. DOI: 10.1109\/ICASSP.2003.1199134"},{"key":"e_1_3_3_73_2","unstructured":"Guohao Li Chenxin Xiong Ali K. Thabet and Bernard Ghanem. 2020. DeeperGCN: All you need to train deeper GCNs. arXiv:2006.07739. Retrieved from https:\/\/arxiv.org\/abs\/2006.07739"},{"key":"e_1_3_3_74_2","first-page":"13516","volume-title":"2021 IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Li Yixuan","year":"2021","unstructured":"Yixuan Li, Lei Chen, Runyu He, Zhenzhi Wang, Gangshan Wu, and Limin Wang. 2021. MultiSports: A multi-person video dataset of spatio-temporally localized sports actions. In 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), 13516\u201313525. 
DOI: 10.1109\/ICCV48922.2021.01328"},{"key":"e_1_3_3_75_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58517-4_5"},{"key":"e_1_3_3_76_2","first-page":"936","volume-title":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Lin Tsung-Yi","year":"2016","unstructured":"Tsung-Yi Lin, Piotr Doll\u00e1r, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. 2016. Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 936\u2013944."},{"key":"e_1_3_3_77_2","first-page":"21","volume-title":"European Conference on Computer Vision","volume":"9905","author":"Liu W.","year":"2015","unstructured":"W. Liu, Dragomir Anguelov, D. Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. 2015. SSD: Single shot multiBox detector. In European Conference on Computer Vision, Vol. 9905, 21\u201337."},{"key":"e_1_3_3_78_2","doi-asserted-by":"crossref","first-page":"9992","DOI":"10.1109\/ICCV48922.2021.00986","volume-title":"2021 IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Liu Ze","year":"2021","unstructured":"Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), 9992\u201310002."},{"key":"e_1_3_3_79_2","first-page":"3192","volume-title":"2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Liu Ze","year":"2021","unstructured":"Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2021. Video swin transformer. 
In 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3192\u20133201."},{"key":"e_1_3_3_80_2","first-page":"97","volume-title":"2011 IEEE International Conference on Automatic Face and Gesture Recognition (FG)","author":"Lui Yui Man","year":"2011","unstructured":"Yui Man Lui and J. Ross Beveridge. 2011. Tangent bundle for human action recognition. In 2011 IEEE International Conference on Automatic Face and Gesture Recognition (FG), 97\u2013102. DOI: 10.1109\/FG.2011.5771378"},{"key":"e_1_3_3_81_2","doi-asserted-by":"publisher","unstructured":"Sifan Ma En Shao Xiang Xie and Wei Liu. 2020. Event Detection in Soccer Video Based on Self-Attention. 1852\u20131856. DOI: 10.1109\/ICCC51575.2020.9344896","DOI":"10.1109\/ICCC51575.2020.9344896"},{"key":"e_1_3_3_82_2","doi-asserted-by":"crossref","first-page":"61929","DOI":"10.1109\/ACCESS.2021.3074831","article-title":"Spotting football events using two-stream convolutional neural network and dilated recurrent neural network","volume":"9","author":"Mahaseni Behzad","year":"2021","unstructured":"Behzad Mahaseni, Erma Rahayu Mohd Faizal, and Ram Gopal Raj. 2021. Spotting football events using two-stream convolutional neural network and dilated recurrent neural network. IEEE Access 9 (2021), 61929\u201361942.","journal-title":"IEEE Access"},{"key":"e_1_3_3_83_2","doi-asserted-by":"crossref","first-page":"4121","DOI":"10.1109\/TMM.2022.3171679","article-title":"Multimodal information bottleneck: Learning minimal sufficient unimodal and multimodal representations","volume":"25","author":"Mai Sijie","year":"2023","unstructured":"Sijie Mai, Ying Zeng, and Haifeng Hu. 2023. Multimodal information bottleneck: Learning minimal sufficient unimodal and multimodal representations. IEEE Transactions on Multimedia 25 (2023), 4121\u20134134.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_3_84_2","unstructured":"Antoine Miech Ivan Laptev and Josef Sivic. 2017. 
Learnable pooling with context gating for video classification. arXiv:1706.06905. Retrieved from https:\/\/arxiv.org\/abs\/1706.06905"},{"key":"e_1_3_3_85_2","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1007\/978-3-030-50347-5_11","volume-title":"Image Analysis and Recognition","author":"Morra Lia","year":"2020","unstructured":"Lia Morra, Francesco Manigrasso, Giuseppe Canto, Claudio Gianfrate, Enrico Guarino, and Fabrizio Lamberti. 2020. Slicing and dicing soccer: automatic detection of complex events from spatio-temporal data. In Image Analysis and Recognition. Aur\u00e9lio Campilho, Fakhri Karray, and Zhou Wang (Eds.). Springer International Publishing, Cham, 107\u2013121."},{"key":"e_1_3_3_86_2","first-page":"14200","volume-title":"Advances in Neural Information Processing Systems","author":"Nagrani Arsha","year":"2021","unstructured":"Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, and Chen Sun. 2021. Attention bottlenecks for multimodal fusion. In Advances in Neural Information Processing Systems. M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 14200\u201314213. Retrieved from https:\/\/proceedings.neurips.cc\/paper\/2021\/file\/76ba9f564ebbc35b1014ac498fafadd0-Paper.pdf"},{"key":"e_1_3_3_87_2","doi-asserted-by":"publisher","DOI":"10.3390\/app12094429"},{"key":"e_1_3_3_88_2","doi-asserted-by":"crossref","first-page":"3156","DOI":"10.1109\/ICCVW54120.2021.00355","volume-title":"2021 IEEE\/CVF International Conference on Computer Vision Workshops (ICCVW)","author":"Neimark Daniel","year":"2021","unstructured":"Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. 2021. Video transformer network. 
In 2021 IEEE\/CVF International Conference on Computer Vision Workshops (ICCVW), 3156\u20133165."},{"key":"e_1_3_3_89_2","first-page":"119","volume-title":"Proceedings of the 6th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP \u201908)","author":"Nisha J.","year":"2009","unstructured":"J. Nisha, Santanu Chaudhury, Sumantra Dutta Roy, Prasenjit Mukherjee, Krishanu Seal, and Kumar Talluri. 2009. A novel learning-based framework for detecting interesting events in soccer videos. In Proceedings of the 6th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP \u201908), 119\u2013125. DOI: 10.1109\/ICVGIP.2008.71"},{"key":"e_1_3_3_90_2","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1109\/ISM.2020.00030","volume-title":"2020 IEEE International Symposium on Multimedia (ISM)","author":"Rongved Olav A. Norg\u00e5rd","year":"2020","unstructured":"Olav A. Norg\u00e5rd Rongved, Steven A. Hicks, Vajira Thambawita, H\u00e5kon K. Stensland, Evi Zouganeli, Dag Johansen, Michael A. Riegler, and P\u00e5l Halvorsen. 2020. Real-time detection of events in soccer videos using 3D convolutional neural networks. In 2020 IEEE International Symposium on Multimedia (ISM), 135\u2013144. DOI: 10.1109\/ISM.2020.00030"},{"key":"e_1_3_3_91_2","first-page":"1210","volume-title":"2012 IEEE Conference on Computer Vision and Pattern Recognition","author":"O\u2019Hara Stephen","year":"2012","unstructured":"Stephen O\u2019Hara and Bruce A. Draper. 2012. Scalable action recognition with a subspace forest. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 1210\u20131217. 
DOI: 10.1109\/CVPR.2012.6247803"},{"key":"e_1_3_3_92_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10462-012-9332-4"},{"key":"e_1_3_3_93_2","volume-title":"Proceedings of the 3rd International Workshop on Multimedia Content Analysis in Sports (MMSports \u201920)","author":"Panse Neeraj","year":"2020","unstructured":"Neeraj Panse and Ameya Mahabaleshwarkar. 2020. A dataset and methodology for computer vision based offside detection in soccer. In Proceedings of the 3rd International Workshop on Multimedia Content Analysis in Sports (MMSports \u201920). ACM, New York, NY. DOI: 10.1145\/3422844.3423055"},{"issue":"1","key":"e_1_3_3_94_2","doi-asserted-by":"crossref","first-page":"236","DOI":"10.1038\/s41597-019-0247-7","article-title":"A public data set of spatio-temporal match events in soccer competitions","volume":"6","author":"Pappalardo Luca","year":"2019","unstructured":"Luca Pappalardo, Paolo Cintia, Alessio Rossi, Emanuele Massucco, Paolo Ferragina, Dino Pedreschi, and Fosca Giannotti. 2019. A public data set of spatio-temporal match events in soccer competitions. Scientific Data 6, 1 (Oct. 2019), 236.","journal-title":"Scientific Data"},{"key":"e_1_3_3_95_2","first-page":"1","volume-title":"Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR)","author":"Parcalabescu Letitia","year":"2021","unstructured":"Letitia Parcalabescu, Nils Trost, and Anette Frank. 2021. What is multimodality? In Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR). Association for Computational Linguistics, Groningen, Netherlands, 1\u201310. 
Retrieved from https:\/\/aclanthology.org\/2021.mmsr-1.1"},{"issue":"2","key":"e_1_3_3_96_2","doi-asserted-by":"crossref","first-page":"170","DOI":"10.1016\/j.jsams.2010.10.459","article-title":"Networks as a novel tool for studying team ball sports as complex social systems","volume":"14","author":"Passos Pedro","year":"2011","unstructured":"Pedro Passos, Keith Davids, Duarte Ara\u00fajo, Natasha Sophia C. Paz, J. Mingu\u00e9ns, and Joana Mendes. 2011. Networks as a novel tool for studying team ball sports as complex social systems. Journal of Science and Medicine in Sport 14, 2 (2011), 170\u2013176.","journal-title":"Journal of Science and Medicine in Sport"},{"key":"e_1_3_3_97_2","unstructured":"Dhanuja S. Patil and S. B. Waykar. 2014. A survey on event recognition and summarization in football videos. International Journal of Science and Research. Retrieved February 23 2023 from https:\/\/www.ijsr.net\/get_abstract.php?paper_id=OCT14705"},{"key":"e_1_3_3_98_2","doi-asserted-by":"crossref","first-page":"3743","DOI":"10.1109\/CVPR.2015.7298998","volume-title":"2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Perronnin Florent","year":"2015","unstructured":"Florent Perronnin and Diane Larlus. 2015. Fisher vectors meet neural networks: A hybrid classification architecture. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3743\u20133752."},{"key":"e_1_3_3_99_2","first-page":"1","volume-title":"2008 IEEE Conference on Computer Vision and Pattern Recognition","author":"Philbin James","year":"2008","unstructured":"James Philbin, Ond\u0159ej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. 2008. Lost in quantization: Improving particular object retrieval in large scale image databases. 
In 2008 IEEE Conference on Computer Vision and Pattern Recognition, 1\u20138."},{"key":"e_1_3_3_100_2","doi-asserted-by":"crossref","first-page":"865","DOI":"10.1109\/ICCIS.2010.215","volume-title":"2010 International Conference on Computational and Information Sciences","author":"Pixi Zhao","year":"2010","unstructured":"Zhao Pixi, Li Hongyan, and Wang Wei. 2010. Research on event detection of soccer video based on hidden Markov model. In 2010 International Conference on Computational and Information Sciences, 865\u2013868. DOI: 10.1109\/ICCIS.2010.215"},{"key":"e_1_3_3_101_2","doi-asserted-by":"crossref","first-page":"549","DOI":"10.1109\/TCSVT.2019.2894161","article-title":"stagNet: An attentive semantic RNN for group activity and individual action recognition","volume":"30","author":"Qi Mengshi","year":"2020","unstructured":"Mengshi Qi, Yunhong Wang, Jie Qin, Annan Li, Jiebo Luo, and Luc Van Gool. 2020. stagNet: An attentive semantic RNN for group activity and individual action recognition. IEEE Transactions on Circuits and Systems for Video Technology 30 (2020), 549\u2013565.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_3_102_2","first-page":"439","volume-title":"Advances in Multimedia Information Processing (PCM \u201910)","author":"Qian Xueming","year":"2010","unstructured":"Xueming Qian, Guizhong Liu, Huan Wang, Zhi Li, and Zhe Wang. 2010. Soccer video event detection by fusing middle level visual semantics of an event clip. In Advances in Multimedia Information Processing (PCM \u201910). Springer Berlin, Berlin, 439\u2013451."},{"key":"e_1_3_3_103_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-011-0817-y"},{"key":"e_1_3_3_104_2","first-page":"28492","volume-title":"International Conference on Machine Learning","author":"Radford Alec","year":"2023","unstructured":"Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. 
Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning. PMLR, 28492\u201328518."},{"key":"e_1_3_3_105_2","doi-asserted-by":"crossref","first-page":"10425","DOI":"10.1109\/CVPR42600.2020.01044","volume-title":"2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Radosavovic Ilija","year":"2020","unstructured":"Ilija Radosavovic, Raj Prateek Kosaraju, Ross B. Girshick, Kaiming He, and Piotr Doll\u00e1r. 2020. Designing network design spaces. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10425\u201310433."},{"key":"e_1_3_3_106_2","doi-asserted-by":"publisher","DOI":"10.11591\/ijeecs.v11.i3.pp987-993"},{"issue":"1","key":"e_1_3_3_107_2","doi-asserted-by":"crossref","first-page":"301","DOI":"10.1186\/s40064-015-1065-9","article-title":"Automatic summarization of soccer highlights using audio-visual descriptors","volume":"4","author":"Ravent\u00f3s A.","year":"2015","unstructured":"A. Ravent\u00f3s, R. Quijada, Luis Torres, and Francesc Tarr\u00e9s. 2015. Automatic summarization of soccer highlights using audio-visual descriptors. SpringerPlus 4, 1 (Jun. 2015), 301.","journal-title":"SpringerPlus"},{"key":"e_1_3_3_108_2","unstructured":"Fitsum Reda Robert Pottorff Jon Barker and Bryan Catanzaro. 2017. flownet2-pytorch: Pytorch implementation of FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. Retrieved from https:\/\/github.com\/NVIDIA\/flownet2-pytorch"},{"key":"e_1_3_3_109_2","first-page":"1137","article-title":"Faster R-CNN: Towards real-time object detection with region proposal networks","volume":"39","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. 
IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2015), 1137\u20131149.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_3_110_2","first-page":"27","volume-title":"Proceedings of the International Conference on Internet of Things and Big Data (IoTBD)","author":"Richly Keven","year":"2016","unstructured":"Keven Richly, Max Bothe, Tobias Rohloff, and Christian Schwarz. 2016. Recognizing compound events in spatio-temporal football data. In Proceedings of the International Conference on Internet of Things and Big Data (IoTBD), 27\u201335."},{"key":"e_1_3_3_111_2","first-page":"1","volume-title":"2008 IEEE Conference on Computer Vision and Pattern Recognition","author":"Rodriguez Mikel D.","year":"2008","unstructured":"Mikel D. Rodriguez, Javed Ahmed, and Mubarak Shah. 2008. Action MACH a spatio-temporal maximum average correlation height filter for action recognition. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, 1\u20138. DOI: 10.1109\/CVPR.2008.4587727"},{"key":"e_1_3_3_112_2","doi-asserted-by":"crossref","first-page":"1030","DOI":"10.3390\/make3040051","article-title":"Automated event detection and classification in soccer: The potential of using multiple modalities","volume":"3","author":"Nerg\u00e5rd Rongved Olav Andre","year":"2021","unstructured":"Olav Andre Nerg\u00e5rd Rongved, Markus Stige, S. Hicks, Vajira Lasantha Thambawita, Cise Midoglu, Evi Zouganeli, Dag Johansen, M. Riegler, and P. Halvorsen. 2021. Automated event detection and classification in soccer: The potential of using multiple modalities. Machine Learning and Knowledge Extraction 3 (2021), 1030\u20131054.","journal-title":"Machine Learning and Knowledge Extraction"},{"key":"e_1_3_3_113_2","first-page":"234","volume-title":"Medical Image Computing and Computer-Assisted Intervention (MICCAI \u201915)","author":"Ronneberger Olaf","year":"2015","unstructured":"Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 
2015. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI \u201915). Springer International Publishing, Cham, 234\u2013241."},{"key":"e_1_3_3_114_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2005.854237"},{"key":"e_1_3_3_115_2","first-page":"3163","volume-title":"Proceedings of the Computer Vision and Pattern Recognition Conference","author":"Santra Sanchayan","year":"2025","unstructured":"Sanchayan Santra, Vishal Chudasama, Pankaj Wasnik, and Vineeth N. Balasubramanian. 2025. Precise event spotting in sports videos: Solving Long-Range dependency and class imbalance. In Proceedings of the Computer Vision and Pattern Recognition Conference, 3163\u20133172."},{"key":"e_1_3_3_116_2","first-page":"13624","volume-title":"2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Sha Long","year":"2020","unstructured":"Long Sha, Jennifer Hobbs, Panna Felsen, Xinyu Wei, Patrick Lucey, and Sujoy Ganguly. 2020. End-to-end camera calibration for broadcast videos. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13624\u201313633. DOI: 10.1109\/CVPR42600.2020.01364"},{"key":"e_1_3_3_117_2","first-page":"3183","volume-title":"2022 26th International Conference on Pattern Recognition (ICPR)","author":"Shi Yuzhi","year":"2022","unstructured":"Yuzhi Shi, Hiroaki Minoura, Takayoshi Yamashita, Tsubasa Hirakawa, Hironobu Fujiyoshi, Mitsuru Nakazawa, Yeongnam Chae, and Bj\u00f6rn Stenger. 2022. Action spotting in soccer videos using multiple scene encoders. In 2022 26th International Conference on Pattern Recognition (ICPR), 3183\u20133189. DOI: 10.1109\/ICPR56361.2022.9956667"},{"key":"e_1_3_3_118_2","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. 
Retrieved from https:\/\/arxiv.org\/abs\/1409.1556"},{"key":"e_1_3_3_119_2","first-page":"5998","volume-title":"2023 IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV)","author":"Singh Gurkirt","year":"2022","unstructured":"Gurkirt Singh, Vasileios Choutas, Suman Saha, Fisher Yu, and Luc Van Gool. 2022. Spatio-temporal action detection under large motion. In 2023 IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV), 5998\u20136007."},{"key":"e_1_3_3_120_2","first-page":"3657","volume-title":"2017 IEEE International Conference on Computer Vision (ICCV)","author":"Singh Gurkirt","year":"2016","unstructured":"Gurkirt Singh, Suman Saha, Michael Sapienza, Philip H. S. Torr, and Fabio Cuzzolin. 2016. Online real-time multiple spatiotemporal action localisation and prediction. In 2017 IEEE International Conference on Computer Vision (ICCV), 3657\u20133666."},{"key":"e_1_3_3_121_2","doi-asserted-by":"crossref","first-page":"2796","DOI":"10.1109\/ICIP46576.2022.9897256","volume-title":"2022 IEEE International Conference on Image Processing (ICIP)","author":"Soares Jo\u00e3o V. B.","year":"2022","unstructured":"Jo\u00e3o V. B. Soares, Avijit Shah, and Topojoy Biswas. 2022. Temporally precise action spotting in soccer videos using dense detection anchors. In 2022 IEEE International Conference on Image Processing (ICIP). IEEE, 2796\u20132800."},{"key":"e_1_3_3_122_2","doi-asserted-by":"publisher","unstructured":"Jo\u00e3o V. B. Soares and Avijit Shah. 2022. Action Spotting using Dense Detection Anchors Revisited: Submission to the SoccerNet Challenge 2022. DOI: 10.48550\/ARXIV.2206.07846","DOI":"10.48550\/ARXIV.2206.07846"},{"key":"e_1_3_3_123_2","first-page":"1","volume-title":"2017 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE)","author":"Song Wei","year":"2017","unstructured":"Wei Song and Hani Hagras. 2017. A type-2 fuzzy logic system for event detection in soccer videos. 
In 2017 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 1\u20136. DOI: 10.1109\/FUZZ-IEEE.2017.8015426"},{"key":"e_1_3_3_124_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-09396-3_9"},{"key":"e_1_3_3_125_2","unstructured":"Khurram Soomro Amir Roshan Zamir and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402. Retrieved from https:\/\/arxiv.org\/abs\/1212.0402"},{"key":"e_1_3_3_126_2","unstructured":"Michael St\u00f6ckl Thomas Seidl Daniel Marley and Paul Power. 2021. Making Offensive Play Predictable-Using a Graph Convolutional Network to Understand Defensive Performance in Soccer. Retrieved from https:\/\/www.statsperform.com\/wp-content\/uploads\/2021\/04\/Making-Offensive-Play-Predictable.pdf"},{"key":"e_1_3_3_127_2","first-page":"1099","volume-title":"2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Sudhakaran Swathikiran","year":"2019","unstructured":"Swathikiran Sudhakaran, Sergio Escalera, and Oswald Lanz. 2019. Gate-shift networks for video action recognition. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1099\u20131108."},{"key":"e_1_3_3_128_2","unstructured":"Alessandro Suglia Jos\u00e9 Gabriel Pereira Lopes Emanuele Bastianelli Andrea Vanzo Shubham Agarwal Malvina Nikandrou Lu Yu Ioannis Konstas and Verena Rieser. 2022. Going for GOAL: A resource for grounded football commentaries. arXiv:2211.04534. Retrieved from https:\/\/arxiv.org\/abs\/2211.04534"},{"key":"e_1_3_3_129_2","doi-asserted-by":"crossref","first-page":"1402","DOI":"10.1109\/CVPR.2014.182","volume-title":"2014 IEEE Conference on Computer Vision and Pattern Recognition","author":"Sydorov Vladyslav","year":"2014","unstructured":"Vladyslav Sydorov, Mayu Sakurada, and Christoph H. Lampert. 2014. Deep fisher kernels\u2014End to end learning of the fisher kernel GMM parameters. 
In 2014 IEEE Conference on Computer Vision and Pattern Recognition, 1402\u20131409."},{"key":"e_1_3_3_130_2","first-page":"10096","volume-title":"International Conference on Machine Learning","author":"Tan Mingxing","year":"2021","unstructured":"Mingxing Tan and Quoc Le. 2021. EfficientNetV2: Smaller models and faster training. In International Conference on Machine Learning. PMLR, 10096\u201310106."},{"key":"e_1_3_3_131_2","first-page":"6105","volume-title":"Proceedings of the 36th International Conference on Machine Learning (ICML \u201919)","author":"Tan Mingxing","year":"2019","unstructured":"Mingxing Tan and Quoc V. Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML \u201919). PMLR, 6105\u20136114. Retrieved from http:\/\/proceedings.mlr.press\/v97\/tan19a.html"},{"key":"e_1_3_3_132_2","doi-asserted-by":"crossref","first-page":"4619","DOI":"10.1109\/BigData.2018.8621906","volume-title":"2018 IEEE International Conference on Big Data (Big Data)","author":"Tang Kaiyu","year":"2018","unstructured":"Kaiyu Tang, Yixin Bao, Zhijian Zhao, Liang Zhu, Yining Lin, and Yao Peng. 2018. AutoHighlight: Automatic highlights detection and segmentation in soccer matches. In 2018 IEEE International Conference on Big Data (Big Data), 4619\u20134624. DOI: 10.1109\/BigData.2018.8621906"},{"key":"e_1_3_3_133_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2013.2243640"},{"key":"e_1_3_3_134_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2017.04.011"},{"key":"e_1_3_3_135_2","doi-asserted-by":"crossref","first-page":"7699","DOI":"10.1109\/ICPR48806.2021.9412268","volume-title":"2020 25th International Conference on Pattern Recognition (ICPR)","author":"Tomei Matteo","year":"2021","unstructured":"Matteo Tomei, Lorenzo Baraldi, Simone Calderara, Simone Bronzin, and Rita Cucchiara. 2021. RMS-Net: Regression and masking for soccer event spotting. 
In 2020 25th International Conference on Pattern Recognition (ICPR), 7699\u20137706."},{"key":"e_1_3_3_136_2","first-page":"10078","article-title":"Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training","volume":"35","author":"Tong Zhan","year":"2022","unstructured":"Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. 2022. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems 35 (2022), 10078\u201310093.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_137_2","doi-asserted-by":"publisher","unstructured":"Du Tran Lubomir Bourdev Rob Fergus Lorenzo Torresani and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. 4489\u20134497. DOI: 10.1109\/ICCV.2015.510","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_3_138_2","doi-asserted-by":"crossref","first-page":"5551","DOI":"10.1109\/ICCV.2019.00565","volume-title":"2019 IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Tran Du","year":"2019","unstructured":"Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. 2019. Video classification with channel-separated convolutional networks. In 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), 5551\u20135560."},{"key":"e_1_3_3_139_2","first-page":"1","volume-title":"2024 International Joint Conference on Neural Networks (IJCNN)","author":"Tran Kim Hoang","year":"2024","unstructured":"Kim Hoang Tran, Phuc Vuong Do, Ngoc Quoc Ly, and Ngan Le. 2024. Unifying global and local scene entities modelling for precise action spotting. In 2024 International Joint Conference on Neural Networks (IJCNN), 1\u20138. 
DOI: 10.1109\/IJCNN60899.2024.10650009"},{"key":"e_1_3_3_140_2","doi-asserted-by":"publisher","DOI":"10.2352\/ISSN.2470-1173.2017.16.CVAS-344"},{"key":"e_1_3_3_141_2","doi-asserted-by":"crossref","first-page":"155","DOI":"10.1109\/CVPRW.2017.25","volume-title":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","author":"Tsunoda Takamasa","year":"2017","unstructured":"Takamasa Tsunoda, Yasuhiro Komori, Masakazu Matsugu, and Tatsuya Harada. 2017. Football action recognition using hierarchical LSTM. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 155\u2013163. DOI: 10.1109\/CVPRW.2017.25"},{"key":"e_1_3_3_142_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2023.3286254"},{"key":"e_1_3_3_143_2","doi-asserted-by":"crossref","first-page":"154","DOI":"10.1007\/s11263-013-0620-5","article-title":"Selective search for object recognition","volume":"104","author":"Uijlings Jasper R. R.","year":"2013","unstructured":"Jasper R. R. Uijlings, Koen E. A. van de Sande, Theo Gevers, and Arnold W. M. Smeulders. 2013. Selective search for object recognition. International Journal of Computer Vision 104 (2013), 154\u2013171.","journal-title":"International Journal of Computer Vision"},{"key":"e_1_3_3_144_2","doi-asserted-by":"crossref","first-page":"3921","DOI":"10.1109\/CVPRW50498.2020.00456","volume-title":"2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","author":"Vanderplaetse Bastien","year":"2020","unstructured":"Bastien Vanderplaetse and St\u00e9phane Dupont. 2020. Improved soccer action spotting using both audio and video streams. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 3921\u20133931."},{"key":"e_1_3_3_145_2","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. 
Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. Retrieved from https:\/\/proceedings.neurips.cc\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"},{"key":"e_1_3_3_146_2","doi-asserted-by":"crossref","first-page":"3856","DOI":"10.1109\/CVPRW50498.2020.00449","volume-title":"2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","author":"Vats Kanav","year":"2020","unstructured":"Kanav Vats, Mehrnaz Fani, Pascale Walters, David A. Clausi, and John S. Zelek. 2020. Event detection in coarsely annotated sports videos via parallel multi receptive field 1D convolutions. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 3856\u20133865."},{"key":"e_1_3_3_147_2","doi-asserted-by":"publisher","DOI":"10.1007\/s12283-022-00381-6"},{"key":"e_1_3_3_148_2","first-page":"599","volume-title":"Proceedings of the IEEE International Conference on Multimedia and Expo (ICME)","author":"Wang Jinjun","year":"2004","unstructured":"Jinjun Wang, Changsheng Xu, Eng Chng, and Qi Tian. 2004. Sports highlight detection from keyword sequences using HMM. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 599\u2013602. DOI: 10.1109\/ICME.2004.1394263"},{"key":"e_1_3_3_149_2","first-page":"7794","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Wang Xiaolong","year":"2018","unstructured":"Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7794\u20137803."},{"key":"e_1_3_3_150_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3326362","article-title":"Dynamic graph CNN for learning on point clouds","volume":"38","author":"Wang Yue","year":"2018","unstructured":"Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. 2018. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics 38 (2018), 1\u201312.","journal-title":"ACM Transactions on Graphics"},{"key":"e_1_3_3_151_2","doi-asserted-by":"crossref","first-page":"3164","DOI":"10.1109\/ICCV.2015.362","volume-title":"2015 IEEE International Conference on Computer Vision (ICCV)","author":"Weinzaepfel Philippe","year":"2015","unstructured":"Philippe Weinzaepfel, Za\u00efd Harchaoui, and Cordelia Schmid. 2015. Learning to track for spatio-temporal action localization. In 2015 IEEE International Conference on Computer Vision (ICCV), 3164\u20133172."},{"key":"e_1_3_3_152_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3232034"},{"key":"e_1_3_3_153_2","first-page":"489","volume-title":"CVPR 2011","author":"Wu Xinxiao","year":"2011","unstructured":"Xinxiao Wu, Dong Xu, Lixin Duan, and Jiebo Luo. 2011. Action recognition using context and appearance distribution features. In CVPR 2011, 489\u2013496. DOI: 10.1109\/CVPR.2011.5995624"},{"key":"e_1_3_3_154_2","first-page":"93","volume-title":"Proceedings of the 6th International Workshop on Multimedia Content Analysis in Sports (MMSports \u201923)","author":"Xarles Artur","year":"2023","unstructured":"Artur Xarles, Sergio Escalera, Thomas B. Moeslund, and Albert Clap\u00e9s. 2023. ASTRA: An action spotting TRAnsformer for soccer videos. In Proceedings of the 6th International Workshop on Multimedia Content Analysis in Sports (MMSports \u201923). ACM, New York, NY, 93\u2013102. 
DOI: 10.1145\/3606038.3616153"},{"key":"e_1_3_3_155_2","first-page":"5987","volume-title":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Xie Saining","year":"2016","unstructured":"Saining Xie, Ross B. Girshick, Piotr Doll\u00e1r, Zhuowen Tu, and Kaiming He. 2016. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5987\u20135995."},{"issue":"4","key":"e_1_3_3_156_2","first-page":"670","article-title":"A novel framework for soccer goal detection based on semantic rule","volume":"28","author":"Xie Wenjuan","year":"2011","unstructured":"Wenjuan Xie and Ming Tong. 2011. A novel framework for soccer goal detection based on semantic rule. Journal of Electronics 28, 4\u20136 (2011), 670\u2013674.","journal-title":"Journal of Electronics"},{"issue":"1","key":"e_1_3_3_157_2","doi-asserted-by":"crossref","first-page":"62","DOI":"10.1007\/s10044-005-0244-7","article-title":"Audio-visual sports highlights extraction using coupled hidden Markov models","volume":"8","author":"Xiong Ziyou","year":"2005","unstructured":"Ziyou Xiong. 2005. Audio-visual sports highlights extraction using coupled hidden Markov models. Pattern Analysis and Applications 8, 1\u20132 (Sept. 2005), 62\u201371.","journal-title":"Pattern Analysis and Applications"},{"key":"e_1_3_3_158_2","first-page":"III","volume-title":"Proceedings of the 2003 International Conference on Multimedia and Expo (ICME \u201903)","author":"Xiong Ziyou","year":"2003","unstructured":"Ziyou Xiong, R. Radhakrishnan, A. Divakaran, and T. S. Huang. 2003. Audio events detection based highlights extraction from baseball, golf and soccer games in a unified framework. In Proceedings of the 2003 International Conference on Multimedia and Expo (ICME \u201903). III\u2013401. 
DOI: 10.1109\/ICME.2003.1221333"},{"key":"e_1_3_3_159_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2008.2004912"},{"key":"e_1_3_3_160_2","first-page":"566","volume-title":"Proceedings of the 5th Pacific Rim Conference on Multimedia on Advances in Multimedia Information Processing (PCM \u201904)","author":"Xu Min","year":"2005","unstructured":"Min Xu, Ling-Yu Duan, Jianfei Cai, Liang-Tien Chia, Changsheng Xu, and Qi Tian. 2005. HMM-based audio keyword generation. In Proceedings of the 5th Pacific Rim Conference on Multimedia on Advances in Multimedia Information Processing (PCM \u201904), Part III. Springer, 566\u2013574."},{"key":"e_1_3_3_161_2","first-page":"II","volume-title":"Proceedings of the 2003 International Conference on Multimedia and Expo (ICME \u201903)","author":"Xu Min","year":"2003","unstructured":"Min Xu, N. C. Maddage, Changsheng Xu, M. Kankanhalli, and Qi Tian. 2003. Creating audio keywords for event detection in soccer video. In Proceedings of the 2003 International Conference on Multimedia and Expo (ICME \u201903), II\u2013281. DOI: 10.1109\/ICME.2003.1221608"},{"key":"e_1_3_3_162_2","doi-asserted-by":"crossref","first-page":"588","DOI":"10.1109\/CVPR42600.2020.00067","volume-title":"2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Yang Ceyuan","year":"2020","unstructured":"Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai, and Bolei Zhou. 2020. Temporal pyramid network for action recognition. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 588\u2013597."},{"key":"e_1_3_3_163_2","first-page":"1480","volume-title":"Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Yang Zichao","year":"2016","unstructured":"Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. 
In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, 1480\u20131489. DOI: 10.18653\/v1\/N16-1174"},{"key":"e_1_3_3_164_2","first-page":"455","volume-title":"Proceedings of the 13th Annual ACM International Conference on Multimedia (MULTIMEDIA \u201905)","author":"Ye Qixiang","year":"2005","unstructured":"Qixiang Ye, Qingming Huang, Wen Gao, and Shuqiang Jiang. 2005. Exciting event detection in broadcast soccer video with mid-level description and incremental learning. In Proceedings of the 13th Annual ACM International Conference on Multimedia (MULTIMEDIA \u201905). ACM, New York, NY, 455\u2013458. DOI: 10.1145\/1101149.1101250"},{"key":"e_1_3_3_165_2","first-page":"377","volume-title":"MultiMedia Modeling","author":"Yu Junqing","year":"2019","unstructured":"Junqing Yu, Aiping Lei, and Yangliu Hu. 2019. Soccer video event detection based on deep learning. In MultiMedia Modeling. Springer International Publishing, Cham, 377\u2013389."},{"key":"e_1_3_3_166_2","first-page":"418","volume-title":"2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR)","author":"Yu Junqing","year":"2018","unstructured":"Junqing Yu, Aiping Lei, Zikai Song, Tingting Wang, Hengyou Cai, and Na Feng. 2018. Comprehensive dataset of broadcast soccer videos. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 418\u2013423. DOI: 10.1109\/MIPR.2018.00090"},{"key":"e_1_3_3_167_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-387-39940-9_481"},{"key":"e_1_3_3_168_2","unstructured":"Hongyi Zhang, Moustapha Ciss\u00e9, Yann Dauphin, and David Lopez-Paz. 2017. mixup: Beyond empirical risk minimization. arXiv:1710.09412. 
Retrieved from https:\/\/arxiv.org\/abs\/1710.09412"},{"key":"e_1_3_3_169_2","volume-title":"The 12th International Conference on Learning Representations","author":"Zhang Jiaxu","year":"2024","unstructured":"Jiaxu Zhang, Shaoli Huang, Zhigang Tu, Xin Chen, Xiaohang Zhan, Gang Yu, and Ying Shan. 2024. TapMo: Shape-aware motion generation of skeleton-free characters. In The 12th International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=OeH6Fdhv7q"},{"key":"e_1_3_3_170_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2024.3386777"},{"key":"e_1_3_3_171_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2024.3386553"},{"key":"e_1_3_3_172_2","doi-asserted-by":"publisher","DOI":"10.1145\/3649447"},{"key":"e_1_3_3_173_2","doi-asserted-by":"crossref","first-page":"341","DOI":"10.1109\/ACPR.2015.7486522","volume-title":"2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR)","author":"Zhao Wei","year":"2015","unstructured":"Wei Zhao, Yao Lu, Haohao Jiang, and Wei Huang. 2015. Event detection in soccer videos using shot focus identification. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), 341\u2013345. DOI: 10.1109\/ACPR.2015.7486522"},{"key":"e_1_3_3_174_2","doi-asserted-by":"crossref","first-page":"11636","DOI":"10.1109\/ICCV48922.2021.01145","volume-title":"2021 IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Zheng Ce","year":"2021","unstructured":"Ce Zheng, Sijie Zhu, Mat\u00edas Mendieta, Taojiannan Yang, Chen Chen, and Zhengming Ding. 2021. 3D human pose estimation with spatial and temporal transformers. In 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), 11636\u201311645."},{"key":"e_1_3_3_175_2","unstructured":"Xin Zhou, Le Kang, Zhiyu Cheng, Bo He, and Jingyu Xin. 2021. Feature combination meets attention: Baidu soccer embeddings and transformer based temporal detection. arXiv:2106.14447. 
Retrieved from https:\/\/arxiv.org\/abs\/2106.14447"},{"key":"e_1_3_3_176_2","first-page":"2049","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Zhou Yuxuan","year":"2024","unstructured":"Yuxuan Zhou, Xudong Yan, Zhi-Qi Cheng, Yan Yan, Qi Dai, and Xian-Sheng Hua. 2024. BlockGCN: Redefine topology awareness for skeleton-based action recognition. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2049\u20132058."},{"key":"e_1_3_3_177_2","first-page":"103","volume-title":"Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports (MMSports \u201922)","author":"Zhu He","year":"2022","unstructured":"He Zhu, Junwei Liang, Chengzhi Lin, Jun Zhang, and Jianming Hu. 2022. A transformer-based system for action spotting in soccer videos. In Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports (MMSports \u201922). ACM, New York, NY, 103\u2013109. DOI: 10.1145\/3552437.3555693"},{"key":"e_1_3_3_178_2","first-page":"695","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV)","author":"Zolfaghari Mohammadreza","year":"2018","unstructured":"Mohammadreza Zolfaghari, Kamaljeet Singh, and Thomas Brox. 2018. ECO: Efficient convolutional network for online video understanding. 
In Proceedings of the European Conference on Computer Vision (ECCV), 695\u2013712."}],"container-title":["ACM Transactions on Intelligent Systems and Technology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3776541","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,20]],"date-time":"2026-01-20T14:15:16Z","timestamp":1768918516000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3776541"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,20]]},"references-count":177,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,4,30]]}},"alternative-id":["10.1145\/3776541"],"URL":"https:\/\/doi.org\/10.1145\/3776541","relation":{},"ISSN":["2157-6904","2157-6912"],"issn-type":[{"value":"2157-6904","type":"print"},{"value":"2157-6912","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,20]]},"assertion":[{"value":"2024-04-09","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-27","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-01-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}