{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,15]],"date-time":"2026-01-15T23:28:19Z","timestamp":1768519699670,"version":"3.49.0"},"publisher-location":"California","reference-count":0,"publisher":"International Joint Conferences on Artificial Intelligence Organization","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,8]]},"abstract":"<jats:p>We consider the problem of forecasting the future locations of pedestrians in an ego-centric view of a moving vehicle. Current CNNs or RNNs are flawed in capturing the high dynamics of motion between pedestrians and the ego-vehicle, and suffer from the massive parameter usages due to the inefficiency of learning long-term temporal dependencies. To address these issues, we propose an efficient multimodal transformer network that aggregates the trajectory and ego-vehicle speed variations at a coarse granularity and interacts with the optical flow in a fine-grained level to fill the vacancy of highly dynamic motion. Specifically, a coarse-grained fusion stage fuses the information between trajectory and ego-vehicle speed modalities to capture the general temporal consistency. Meanwhile, a fine-grained fusion stage merges the optical flow in the center area and pedestrian area, which compensates the highly dynamic motion of ego-vehicle and target pedestrian. Besides, the whole network is only attention-based that can efficiently model long-term sequences for better capturing the temporal variations. Our multimodal transformer is validated on the PIE and JAAD datasets and achieves state-of-the-art performance with the most light-weight model size. The codes are available at https:\/\/github.com\/ericyinyzy\/MTN_trajectory.<\/jats:p>","DOI":"10.24963\/ijcai.2021\/174","type":"proceedings-article","created":{"date-parts":[[2021,8,11]],"date-time":"2021-08-11T11:00:49Z","timestamp":1628679649000},"page":"1259-1265","source":"Crossref","is-referenced-by-count":40,"title":["Multimodal Transformer Networks for Pedestrian Trajectory Prediction"],"prefix":"10.24963","author":[{"given":"Ziyi","family":"Yin","sequence":"first","affiliation":[{"name":"Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, China"}]},{"given":"Ruijin","family":"Liu","sequence":"additional","affiliation":[{"name":"Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, China"}]},{"given":"Zhiliang","family":"Xiong","sequence":"additional","affiliation":[{"name":"Shenzhen Forward Innovation Digital Technology Co. Ltd"}]},{"given":"Zejian","family":"Yuan","sequence":"additional","affiliation":[{"name":"Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, China"}]}],"member":"10584","event":{"name":"Thirtieth International Joint Conference on Artificial Intelligence {IJCAI-21}","theme":"Artificial Intelligence","location":"Montreal, Canada","acronym":"IJCAI-2021","number":"30","sponsor":["International Joint Conferences on Artificial Intelligence Organization (IJCAI)"],"start":{"date-parts":[[2021,8,19]]},"end":{"date-parts":[[2021,8,27]]}},"container-title":["Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence"],"original-title":[],"deposited":{"date-parts":[[2021,8,11]],"date-time":"2021-08-11T11:01:47Z","timestamp":1628679707000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.ijcai.org\/proceedings\/2021\/174"}},"subtitle":[],"proceedings-subject":"Artificial Intelligence Research Articles","short-title":[],"issued":{"date-parts":[[2021,8]]},"references-count":0,"URL":"https:\/\/doi.org\/10.24963\/ijcai.2021\/174","relation":{},"subject":[],"published":{"date-parts":[[2021,8]]}}}