{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T02:47:44Z","timestamp":1773802064768,"version":"3.50.1"},"reference-count":0,"publisher":"Association for the Advancement of Artificial Intelligence (AAAI)","issue":"14","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["AAAI"],"abstract":"<jats:p>Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an \u201cAudio-Visual Confusion\u201d scene by modifying the corresponding sound of an object in the video, e.g., mute the sounding object and ask MLLMs \u201cIs there a\/an {muted-object} sound\u201d. Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM that is built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. Then, we design a Step-wise Reasoning Reward function that enables MLLMs to self-improve audio-visual reasoning with the audio-only reference. 2) To ensure an accurate answer prediction, we introduce Answer-centered Confidence Optimization to reduce the uncertainty of potential heterogeneous reasoning differences. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves accuracy by 10~30% over the baseline model with limited training data.<\/jats:p>","DOI":"10.1609\/aaai.v40i14.38183","type":"journal-article","created":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T00:09:20Z","timestamp":1773792560000},"page":"11955-11963","source":"Crossref","is-referenced-by-count":0,"title":["When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?"],"prefix":"10.1609","volume":"40","author":[{"given":"Qilang","family":"Ye","sequence":"first","affiliation":[]},{"given":"Wei","family":"Zeng","sequence":"additional","affiliation":[]},{"given":"Meng","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Jie","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Yupeng","family":"Hu","sequence":"additional","affiliation":[]},{"given":"Zitong","family":"Yu","sequence":"additional","affiliation":[]},{"given":"Yu","family":"Zhou","sequence":"additional","affiliation":[]}],"member":"9382","published-online":{"date-parts":[[2026,3,14]]},"container-title":["Proceedings of the AAAI Conference on Artificial 
Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/download\/38183\/42145","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/download\/38183\/42145","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T00:09:20Z","timestamp":1773792560000},"score":1,"resource":{"primary":{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/view\/38183"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,14]]},"references-count":0,"journal-issue":{"issue":"14","published-online":{"date-parts":[[2026,3,17]]}},"URL":"https:\/\/doi.org\/10.1609\/aaai.v40i14.38183","relation":{},"ISSN":["2374-3468","2159-5399"],"issn-type":[{"value":"2374-3468","type":"electronic"},{"value":"2159-5399","type":"print"}],"subject":[],"published":{"date-parts":[[2026,3,14]]}}}