{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,6]],"date-time":"2026-06-06T17:07:18Z","timestamp":1780765638697,"version":"3.54.1"},"reference-count":0,"publisher":"Association for the Advancement of Artificial Intelligence (AAAI)","issue":"13","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["AAAI"],"abstract":"<jats:p>Recent advances in video understanding have been driven by MLLMs.\nBut these MLLMs are good at analyzing short videos,\nwhile suffering from difficulties in understanding videos with a longer context.\nTo address this difficulty,\nseveral agent paradigms have recently been proposed, \nusing MLLMs as agents for retrieving extra contextual knowledge in a long video.\nHowever,\nmost existing agents ignore the key fact that a long video is composed with multiple shots,\ni.e.,\nto answer the user question from a long video, \nit is critical to deeply understand its relevant shots like human.\nWithout such insight,\nthese agents often mistakenly find redundant even noisy temporal context,\nrestricting their capacity for long video understanding.\nTo fill this gap,\nwe propose VideoChat-A1, \na novel long video agent paradigm.\nDifferent from the previous works,\nour VideoChat-A1 can deeply think with long videos,\nvia a distinct chain-of-shot reasoning paradigm.\nMore specifically,\nit can progressively select the relevant shots of user question,\nand \nlook into these shots in a coarse-to-fine partition.\nBy multi-modal reasoning along the shot chain,\nVideoChat-A1 can effectively mimic step-by-step human thinking process,\nallowing the interactive discovery of preferable temporal context for thoughtful understanding in long videos.\nExtensive experiments show that,\nVideoChat-A1 achieves the state-of-the-art performance on the mainstream long video QA benchmarks,\ne.g., it achieves 77.0 on VideoMME(w\/ subs) and 70.1 on EgoSchema, \noutperforming its strong baselines (e.g., InternVL2.5-8B and InternVideo2.5-8B),\nby up to 10.1% and 6.2%.  Compared to leading closed-source GPT-4o and Gemini 1.5 Pro,  VideoChat-A1 offers competitive accuracy, \nbut only with 7% input frames and 12% inference time on average.<\/jats:p>","DOI":"10.1609\/aaai.v40i13.38018","type":"journal-article","created":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T00:04:04Z","timestamp":1773792244000},"page":"10467-10475","source":"Crossref","is-referenced-by-count":1,"title":["VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning"],"prefix":"10.1609","volume":"40","author":[{"given":"Zikang","family":"Wang","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Boyu","family":"Chen","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zhengrong","family":"Yue","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yi","family":"Wang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yu","family":"Qiao","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Limin","family":"Wang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yali","family":"Wang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"9382","published-online":{"date-parts":[[2026,3,14]]},"container-title":["Proceedings of the AAAI Conference on Artificial Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/download\/38018\/41980","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/download\/38018\/41980","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T00:04:04Z","timestamp":1773792244000},"score":1,"resource":{"primary":{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/view\/38018"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,14]]},"references-count":0,"journal-issue":{"issue":"13","published-online":{"date-parts":[[2026,3,17]]}},"URL":"https:\/\/doi.org\/10.1609\/aaai.v40i13.38018","relation":{},"ISSN":["2374-3468","2159-5399"],"issn-type":[{"value":"2374-3468","type":"electronic"},{"value":"2159-5399","type":"print"}],"subject":[],"published":{"date-parts":[[2026,3,14]]}}}