{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T23:36:40Z","timestamp":1761176200856,"version":"build-2065373602"},"reference-count":0,"publisher":"IOS Press","isbn-type":[{"value":"9781643686318","type":"electronic"}],"license":[{"start":{"date-parts":[[2025,10,21]],"date-time":"2025-10-21T00:00:00Z","timestamp":1761004800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,10,21]]},"abstract":"<jats:p>Dense video captioning is crucial for enhancing video understanding in daily applications and presents a significant challenge in multimodal analysis. Existing methods often overlook video-to-dynamic-space mapping at varying scales, resulting in captions that lack specificity and remain overly general, failing to capture real-world physical detail. To address this limitation, we propose a multi-granularity Spatio-Temporal Reasoning (STaR) approach, which integrates: (i) efficient global feature integration to model long-term temporal dependencies, (ii) spatial attention mechanisms with position encoding to capture absolute spatial information, and (iii) cross-modal feature fusion to align and unify global, local, and spatial representations. Moreover, we enhance the framework using a Large Language Model (LLM) to improve the richness and naturalness of the generated descriptions. Comparative experiments have been conducted to evaluate the effectiveness of the proposed method on SoccerNet dataset. Experimental results demonstrate that our model effectively enhances localization accuracy and generates captions with superior temporal and spatial detail fidelity. The code is available at https:\/\/github.com\/bread-555\/STaR.<\/jats:p>","DOI":"10.3233\/faia251078","type":"book-chapter","created":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T09:50:51Z","timestamp":1761126651000},"source":"Crossref","is-referenced-by-count":0,"title":["STaR: Multi-Granular Spatio-Temporal Reasoning for Long-Form Dense Video Captioning"],"prefix":"10.3233","author":[{"given":"Yihao","family":"Wu","sequence":"first","affiliation":[{"name":"Hangzhou Dianzi University"}]},{"given":"Chenhuan","family":"Cai","sequence":"additional","affiliation":[{"name":"University of Zurich"}]},{"given":"Liqi","family":"Yan","sequence":"additional","affiliation":[{"name":"Hangzhou Dianzi University"}]},{"given":"Huapeng","family":"Li","sequence":"additional","affiliation":[{"name":"University of Zurich"}]},{"given":"Jianhui","family":"Zhang","sequence":"additional","affiliation":[{"name":"Hangzhou Dianzi University"}]},{"given":"Jiahao","family":"Liu","sequence":"additional","affiliation":[{"name":"Meituan"}]},{"given":"Qifan","family":"Wang","sequence":"additional","affiliation":[{"name":"Meta AI"}]},{"given":"Fangli","family":"Guan","sequence":"additional","affiliation":[{"name":"Hangzhou Dianzi University"}]},{"given":"Pan","family":"Li","sequence":"additional","affiliation":[{"name":"Hangzhou Dianzi University"}]}],"member":"7437","container-title":["Frontiers in Artificial Intelligence and Applications","ECAI 2025"],"original-title":[],"link":[{"URL":"https:\/\/ebooks.iospress.nl\/pdf\/doi\/10.3233\/FAIA251078","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T09:50:52Z","timestamp":1761126652000},"score":1,"resource":{"primary":{"URL":"https:\/\/ebooks.iospress.nl\/doi\/10.3233\/FAIA251078"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,21]]},"ISBN":["9781643686318"],"references-count":0,"URL":"https:\/\/doi.org\/10.3233\/faia251078","relation":{},"ISSN":["0922-6389","1879-8314"],"issn-type":[{"value":"0922-6389","type":"print"},{"value":"1879-8314","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,21]]}}}