{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T02:42:06Z","timestamp":1773801726399,"version":"3.50.1"},"reference-count":0,"publisher":"Association for the Advancement of Artificial Intelligence (AAAI)","issue":"12","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["AAAI"],"abstract":"<jats:p>Most existing multi-modal trackers adopt uniform fusion strategies, overlooking the inherent differences between modalities. Moreover, they propagate temporal information through mixed tokens, leading to entangled and less discriminative temporal representations.\nTo address these limitations, we propose MDTrack, a novel framework for modality-aware fusion and decoupled temporal propagation in multi-modal object tracking.\nSpecifically, for modality-aware fusion, we allocate dedicated experts to each modality (Infrared, Event, Depth, and RGB) to process their respective representations. The gating mechanism within the Mixture of Experts (MoE) then dynamically selects the optimal experts based on the input features, enabling adaptive and modality-specific fusion.\nFor decoupled temporal propagation, we introduce two separate State Space Model (SSM) structures to independently store and update the hidden states h of the RGB and X-modal streams, effectively capturing their distinct temporal information. To ensure synergy between the two temporal representations, we incorporate a set of cross-attentions between the input features of the two SSMs, facilitating implicit information exchange. The resulting temporally enriched features are then integrated into the backbone via another set of cross-attention, enhancing MDTrack\u2019s ability to leverage temporal information. \nExtensive experiments demonstrate the effectiveness of our proposed method. 
Both MDTrack-S (Modality-Specific Training) and MDTrack-U (Unified-Modality Training) achieve state-of-the-art performance across five multi-modal tracking benchmarks.

URL: https://ojs.aaai.org/index.php/AAAI/article/view/37973
ISSN: 2374-3468 (electronic), 2159-5399 (print)
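The abstract does not spell out the fusion mechanics, but the design it describes (dedicated per-modality experts plus a gating network that picks experts per input) matches a standard token-wise Mixture-of-Experts layer. The PyTorch sketch below is a minimal illustration under that assumption; the class name ModalityMoE, the expert MLP width, and top-1 routing are illustrative choices, not details taken from the paper.

```python
# Illustrative sketch only: module names, expert width, and top-1 routing
# are assumptions, not MDTrack's actual implementation.
import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    """Token-wise Mixture of Experts with dedicated per-modality experts."""
    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 1):
        super().__init__()
        # One small MLP expert per modality (e.g. RGB, Depth, Event, Infrared).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)  # scores each expert per token
        self.top_k = top_k

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) features from the fused input.
        probs = self.gate(tokens).softmax(dim=-1)        # (B, N, E) gate weights
        weights, idx = probs.topk(self.top_k, dim=-1)    # per-token routing
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            sel = (idx == e).any(dim=-1)                 # tokens routed to expert e
            if sel.any():
                w = weights.masked_fill(idx != e, 0.0).sum(-1, keepdim=True)
                out[sel] += w[sel] * expert(tokens[sel])
        return out

x = torch.randn(2, 196, 256)      # hypothetical fused RGB+X token sequence
y = ModalityMoE(dim=256)(x)       # same shape out: (2, 196, 256)
```

Each token passes only through its gated expert(s), which is what lets the layer treat, say, thermal-dominant tokens differently from RGB-dominant ones instead of applying one uniform fusion transform.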
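Likewise, the decoupled temporal propagation can be pictured as two state-space branches that keep separate hidden states for the RGB and X-modal streams, with cross-attention mixing their inputs before each update. The sketch below stands in a simple diagonal linear recurrence h_t = a * h_{t-1} + b * x_t for whatever SSM parameterization MDTrack actually uses; every name and shape here is an assumption for illustration.

```python
# Illustrative sketch only: the recurrence and all names are assumptions,
# not the paper's actual SSM parameterization.
import torch
import torch.nn as nn

class DecoupledTemporalSSM(nn.Module):
    """Two SSM branches with separate hidden states for RGB and X-modal streams."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Cross-attention lets each stream's input attend to the other before
        # its own state update (the implicit information exchange).
        self.rgb_to_x = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.x_to_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Diagonal linear SSM parameters per branch: h_t = a * h_{t-1} + b * x_t.
        self.a_rgb = nn.Parameter(torch.full((dim,), 0.9))
        self.b_rgb = nn.Parameter(torch.ones(dim))
        self.a_x = nn.Parameter(torch.full((dim,), 0.9))
        self.b_x = nn.Parameter(torch.ones(dim))

    def forward(self, rgb, x, h_rgb=None, h_x=None):
        # rgb, x: (batch, num_tokens, dim) features for the current frame.
        if h_rgb is None:
            h_rgb = torch.zeros_like(rgb)
            h_x = torch.zeros_like(x)
        # Implicit exchange between the two SSM inputs.
        rgb_in = rgb + self.x_to_rgb(rgb, x, x)[0]
        x_in = x + self.rgb_to_x(x, rgb, rgb)[0]
        # Independent hidden-state updates keep the temporal streams decoupled.
        h_rgb = self.a_rgb * h_rgb + self.b_rgb * rgb_in
        h_x = self.a_x * h_x + self.b_x * x_in
        return h_rgb, h_x

# Carrying the decoupled states across a short clip (illustrative usage).
ssm = DecoupledTemporalSSM(dim=256)
frames = [(torch.randn(2, 196, 256), torch.randn(2, 196, 256)) for _ in range(3)]
h_rgb = h_x = None
for rgb_t, x_t in frames:
    h_rgb, h_x = ssm(rgb_t, x_t, h_rgb, h_x)
```

In the pipeline the abstract describes, the temporally enriched states would then be injected back into the backbone through a second cross-attention stage, which this sketch omits.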