{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,31]],"date-time":"2025-12-31T03:41:37Z","timestamp":1767152497442,"version":"3.48.0"},"reference-count":49,"publisher":"World Scientific Pub Co Pte Ltd","issue":"04","funder":[{"name":"Research Council of Norway,","award":["346671"],"award-info":[{"award-number":["346671"]}]},{"name":"Research Council of Norway,","award":["270053"],"award-info":[{"award-number":["270053"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Int. J. Semantic Computing"],"published-print":{"date-parts":[[2025,12]]},"abstract":"<jats:p>In this paper, we present SoccerNet-Echoes, an extension of the SoccerNet dataset which has been curated by augmenting the 550 games in the original dataset with multilingual audio commentary transcriptions, with a pipeline utilizing OpenAI\u2019s Whisper models for transcription and Google Translate for translation to English. We demonstrate the potential of SoccerNet-Echoes through several applications. Our experiments reveal that incorporating ASR-generated transcripts as a third modality alongside audio and video can improve the performance of multimodal event detection, with our audio\u2013video-text model achieving a top F1-score of 0.7175. We also introduce a novel framework that leverages Large Language Models (LLMs) to extract both predefined, official events, as well as unscripted, unofficial events directly from the commentary. Our evaluation shows that the Gemini-1.5-Pro model effectively identifies official events from text alone, and that LLM-generated game summaries are more descriptive and accurate when using SoccerNet-Echoes compared to using only structured event data. 
Surprisingly, our experiments also show that feeding powerful LLMs like Gemini-1.5-Pro with visual data may not improve results compared to their text-only counterpart, but rather degrade performance, for which we analyze the potential reasons. By releasing SoccerNet-Echoes, we provide a resource for the scientific community and offer benchmarks that highlight the current capabilities and limitations of ASR and LLM technologies in the domain of multimodal sports analysis.<\/jats:p>","DOI":"10.1142\/s1793351x25450035","type":"journal-article","created":{"date-parts":[[2025,10,24]],"date-time":"2025-10-24T09:38:29Z","timestamp":1761298709000},"page":"589-613","source":"Crossref","is-referenced-by-count":0,"title":["Beyond Audio: Enhancing SoccerNet-Echoes with Multimodal Event Extraction Using LLMs"],"prefix":"10.1142","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-4616-4592","authenticated-orcid":false,"given":"Mehdi Houshmand","family":"Sarkhoosh","sequence":"first","affiliation":[{"name":"Oslo Metropolitan University (OsloMet), Norway"},{"name":"Forzasys AS, Norway"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9232-2661","authenticated-orcid":false,"given":"Sushant","family":"Gautam","sequence":"additional","affiliation":[{"name":"Oslo Metropolitan University (OsloMet), Norway"},{"name":"Simula Metropolitan Center for Digital Engineering (SimulaMet), Norway"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0991-4418","authenticated-orcid":false,"given":"Cise","family":"Midoglu","sequence":"additional","affiliation":[{"name":"Forzasys AS, Norway"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7044-1731","authenticated-orcid":false,"given":"Thu","family":"Nguyen","sequence":"additional","affiliation":[{"name":"Simula Metropolitan Center for Digital Engineering (SimulaMet), 
Norway"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-7907-2895","authenticated-orcid":false,"given":"Jan","family":"Held","sequence":"additional","affiliation":[{"name":"University of Li\u00e8ge, Belgium"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5314-9015","authenticated-orcid":false,"given":"Anthony","family":"Cioppa","sequence":"additional","affiliation":[{"name":"University of Li\u00e8ge, Belgium"},{"name":"King Abdullah University of Science and Technology (KAUST), Saudi Arabia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3937-9834","authenticated-orcid":false,"given":"Silvio","family":"Giancola","sequence":"additional","affiliation":[{"name":"King Abdullah University of Science and Technology (KAUST), Saudi Arabia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6026-0929","authenticated-orcid":false,"given":"Vajira","family":"Thambawita","sequence":"additional","affiliation":[{"name":"Simula Metropolitan Center for Digital Engineering (SimulaMet), Norway"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3153-2064","authenticated-orcid":false,"given":"Michael A.","family":"Riegler","sequence":"additional","affiliation":[{"name":"Simula Research Laboratory, Norway"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2073-7029","authenticated-orcid":false,"given":"P\u00e5l","family":"Halvorsen","sequence":"additional","affiliation":[{"name":"Oslo Metropolitan University (OsloMet), Norway"},{"name":"Forzasys AS, Norway"},{"name":"Simula Metropolitan Center for Digital Engineering (SimulaMet), 
Norway"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"219","published-online":{"date-parts":[[2025,11,27]]},"reference":[{"key":"S1793351X25450035BIB001","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2018.00223"},{"key":"S1793351X25450035BIB002","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW53098.2021.00508"},{"key":"S1793351X25450035BIB003","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW56347.2022.00393"},{"key":"S1793351X25450035BIB004","doi-asserted-by":"publisher","DOI":"10.1038\/s41597-022-01469-1"},{"key":"S1793351X25450035BIB005","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW59228.2023.00536"},{"key":"S1793351X25450035BIB006","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW59228.2023.00537"},{"key":"S1793351X25450035BIB007","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW63382.2024.00332"},{"key":"S1793351X25450035BIB008","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW63382.2024.00334"},{"key":"S1793351X25450035BIB009","unstructured":"R. Chakraborty, R. Chakraborty, A. Dasgupta and S. Chaurasia, Do we need large VLMs for spotting soccer actions? arXiv:2506.17144."},{"key":"S1793351X25450035BIB010","unstructured":"OpenAI, Whisper GitHub (2024), https:\/\/github.com\/openai\/whisper."},{"key":"S1793351X25450035BIB011","unstructured":"Google, Google Translate (2024), https:\/\/translate.google.com."},{"key":"S1793351X25450035BIB012","doi-asserted-by":"publisher","DOI":"10.1109\/ISM63611.2024.00016"},{"key":"S1793351X25450035BIB013","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-020-10073-7"},{"key":"S1793351X25450035BIB014","doi-asserted-by":"publisher","DOI":"10.1561\/116.00000050"},{"key":"S1793351X25450035BIB015","doi-asserted-by":"publisher","DOI":"10.1007\/s10586-020-03097-z"},{"key":"S1793351X25450035BIB016","first-page":"306","volume-title":"Proc. Third IEEE Int. Conf. Multimedia Computing and Systems","author":"Chang Y.-L.","year":"1996"},{"key":"S1793351X25450035BIB017","unstructured":"K. 
Soomro, A. R. Zamir and M. Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, arXiv:1212.0402."},{"key":"S1793351X25450035BIB018","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-34372-9"},{"key":"S1793351X25450035BIB019","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2017.2655624"},{"key":"S1793351X25450035BIB020","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW63382.2024.00343"},{"key":"S1793351X25450035BIB021","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.83"},{"key":"S1793351X25450035BIB022","doi-asserted-by":"publisher","DOI":"10.1109\/NNSP.2003.1318046"},{"key":"S1793351X25450035BIB023","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW50498.2020.00456"},{"key":"S1793351X25450035BIB024","doi-asserted-by":"publisher","DOI":"10.1145\/3577190.3614225"},{"key":"S1793351X25450035BIB025","doi-asserted-by":"publisher","DOI":"10.1145\/3552463.3557019"},{"key":"S1793351X25450035BIB026","first-page":"25","volume-title":"Proc. 12th IOE Graduate Conf.","author":"Gautam S.","year":"2022"},{"key":"S1793351X25450035BIB027","doi-asserted-by":"publisher","DOI":"10.1109\/ICMEW46912.2020.9106051"},{"key":"S1793351X25450035BIB028","unstructured":"A. Cioppa et al., SoccerNet 2023 challenges results, arXiv:2309.06006."},{"key":"S1793351X25450035BIB029","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW50498.2020.00487"},{"key":"S1793351X25450035BIB030","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00497"},{"key":"S1793351X25450035BIB031","doi-asserted-by":"publisher","DOI":"10.1145\/3583780.3615120"},{"key":"S1793351X25450035BIB032","doi-asserted-by":"publisher","DOI":"10.1109\/ICOSST60641.2023.10414235"},{"key":"S1793351X25450035BIB033","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01743"},{"key":"S1793351X25450035BIB034","first-page":"1","volume":"9","author":"Zhang Z.","year":"2018","journal-title":"ACM Trans. Intell. Syst. 
Technol."},{"key":"S1793351X25450035BIB035","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2024.3365948"},{"volume-title":"29th Annual Conf. Language Processing Society","year":"2023","author":"Yin Y.","key":"S1793351X25450035BIB036"},{"key":"S1793351X25450035BIB037","first-page":"2520","volume-title":"IEEE\/CVF Conf. Computer Vision and Pattern Recognition Workshops","author":"Merler M.","year":"2018"},{"key":"S1793351X25450035BIB038","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2018.2876046"},{"key":"S1793351X25450035BIB039","unstructured":"K. Ge, L. Chen, K. Zhang, Y. Luo, T. Shi, L. Fan, X. Li, G. Wang and S. Zhang, SCBench: A sports commentary benchmark for video LLMs, arXiv:2412.17637."},{"key":"S1793351X25450035BIB040","unstructured":"S. Schneider et al., wav2vec: Unsupervised pre-training for speech recognition, arXiv:1904.05862."},{"key":"S1793351X25450035BIB041","unstructured":"Z. Tong et al., VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training, arXiv:2203.12602."},{"key":"S1793351X25450035BIB042","unstructured":"Y. Liu et al., RoBERTa: A robustly optimized BERT pretraining approach, arXiv:1907.11692."},{"key":"S1793351X25450035BIB043","unstructured":"R. Zhang, J. Han, C. Liu, A. Zhou, P. Lu, Y. Qiao, H. Li and P. Gao, LLaMA-Adapter: Efficient fine-tuning of large language models with zero-initialized attention, arXiv:2303.16199."},{"key":"S1793351X25450035BIB044","unstructured":"Z. Li, Q. Xu, D. Zhang, H. Song, Y. Cai, Q. Qi, R. Zhou, J. Pan, Z. Li, V. T. Vu, Z. Huang and T. Wang, GroundingGPT: Language enhanced multi-modal grounding model, arXiv:2401.06071."},{"key":"S1793351X25450035BIB045","first-page":"1","volume-title":"34th Conf. Neural Information Processing Systems","author":"Brown T.","year":"2020"},{"key":"S1793351X25450035BIB046","unstructured":"M. 
H. Sarkhoosh, S. M. M. Dorcheh, S. Gautam, C. Midoglu, S. S. Sabet and P. Halvorsen, Soccer on social media, arXiv:2310.12328."},{"key":"S1793351X25450035BIB047","first-page":"74","volume-title":"Proc. Workshop on Text Summarization Branches Out","author":"Lin C.-Y.","year":"2004"},{"key":"S1793351X25450035BIB048","unstructured":"T. Zhang et al., BERTScore: Evaluating text generation with BERT, arXiv:1904.09675."},{"key":"S1793351X25450035BIB049","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/K19-1039"}],"container-title":["International Journal of Semantic Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.worldscientific.com\/doi\/pdf\/10.1142\/S1793351X25450035","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,31]],"date-time":"2025-12-31T03:36:45Z","timestamp":1767152205000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.worldscientific.com\/doi\/10.1142\/S1793351X25450035"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,27]]},"references-count":49,"journal-issue":{"issue":"04","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["10.1142\/S1793351X25450035"],"URL":"https:\/\/doi.org\/10.1142\/s1793351x25450035","relation":{},"ISSN":["1793-351X","1793-7108"],"issn-type":[{"type":"print","value":"1793-351X"},{"type":"electronic","value":"1793-7108"}],"subject":[],"published":{"date-parts":[[2025,11,27]]}}}