{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,5]],"date-time":"2026-05-05T20:45:56Z","timestamp":1778013956224,"version":"3.51.4"},"reference-count":244,"publisher":"Association for Computing Machinery (ACM)","issue":"9","license":[{"start":{"date-parts":[[2024,4,24]],"date-time":"2024-04-24T00:00:00Z","timestamp":1713916800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2024,10,31]]},"abstract":"<jats:p>Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data (e.g., images, texts, or data collected from different sensors), feature engineering (e.g., extraction, combination\/fusion), and decision-making (e.g., majority vote). As architectures become more sophisticated, multimodal neural networks can integrate feature extraction, feature fusion, and decision-making into a single model, and the boundaries between those processes are increasingly blurred. The conventional multimodal data fusion taxonomy (e.g., early\/late fusion), which classifies models by the stage at which fusion occurs, is no longer suitable for the modern deep learning era. Therefore, based on the mainstream techniques used, we propose a new fine-grained taxonomy grouping the state-of-the-art (SOTA) models into five classes: Encoder-Decoder methods, Attention Mechanism methods, Graph Neural Network methods, Generative Neural Network methods, and other Constraint-based methods. Most existing surveys on multimodal data fusion focus only on one specific task with a combination of two specific modalities. 
Unlike those, this survey covers a broader combination of modalities, including Vision + Language (e.g., videos, texts), Vision + Sensors (e.g., images, LiDAR), and so on, and their corresponding tasks (e.g., video captioning, object detection). Moreover, a comparison among these methods is provided, as well as challenges and future directions in this area.<\/jats:p>","DOI":"10.1145\/3649447","type":"journal-article","created":{"date-parts":[[2024,2,24]],"date-time":"2024-02-24T09:17:19Z","timestamp":1708766239000},"page":"1-36","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":304,"title":["Deep Multimodal Data Fusion"],"prefix":"10.1145","volume":"56","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-8535-4923","authenticated-orcid":false,"given":"Fei","family":"Zhao","sequence":"first","affiliation":[{"name":"The University of Alabama at Birmingham, Birmingham, AL, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5868-6450","authenticated-orcid":false,"given":"Chengcui","family":"Zhang","sequence":"additional","affiliation":[{"name":"The University of Alabama at Birmingham, Birmingham, AL, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9596-0359","authenticated-orcid":false,"given":"Baocheng","family":"Geng","sequence":"additional","affiliation":[{"name":"The University of Alabama at Birmingham, Birmingham, AL, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,4,24]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.3390\/s21103465"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2021.06.003"},{"key":"e_1_3_1_4_2","article-title":"Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text","author":"Akbari Hassan","year":"2021","unstructured":"Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. 2021. 
Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems 34 (2021), 24206\u201324221.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2017.2697839"},{"key":"e_1_3_1_6_2","doi-asserted-by":"crossref","unstructured":"Chris Alberti Jeffrey Ling Michael Collins and David Reitter. 2019. Fusion of detected objects in text for visual question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2131\u20132140.","DOI":"10.18653\/v1\/D19-1219"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.618"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.infrared.2022.104209"},{"key":"e_1_3_1_10_2","unstructured":"Mehmet Ayg\u00fcn Yusuf H\u00fcseyin \u015eahin and G\u00f6zde \u00dcnal. 2018. Multi modal convolutional neural networks for brain tumor segmentation. arXiv:1809.06191. Retrieved from https:\/\/arxiv.org\/abs\/1809.06191"},{"key":"e_1_3_1_11_2","unstructured":"Dzmitry Bahdanau Kyunghyun Cho and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473. Retrieved from https:\/\/arxiv.org\/abs\/1409.0473"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2798607"},{"key":"e_1_3_1_13_2","first-page":"65","volume-title":"Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. 
In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization. 65\u201372."},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33863-2_43"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP42928.2021.9506108"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2013.50"},{"key":"e_1_3_1_17_2","first-page":"1","volume-title":"Proceedings of the 2021 International Conference on Cyber Situational Awareness, Data Analytics, and Assessment (CyberSA)","year":"2021","unstructured":"Padmalochan Bera and Shobh. 2021. ModCGAN: A multimodal approach to detect new malware. In Proceedings of the 2021 International Conference on Cyber Situational Awareness, Data Analytics, and Assessment (CyberSA). IEEE, 1\u20132."},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1177\/0278364913507326"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2006.1657816"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.2975093"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2015.7350781"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58565-5_13"},{"key":"e_1_3_1_23_2","unstructured":"Feilong Chen Minglun Han Haozhi Zhao Qingyang Zhang Jing Shi Shuang Xu and Bo Xu. 2023. X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv:2305.04160. 
Retrieved from https:\/\/arxiv.org\/abs\/2305.04160"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2020.02.002"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01124"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR.2014.757"},{"key":"e_1_3_1_28_2","unstructured":"Camille Couprie Cl\u00e9ment Farabet Laurent Najman and Yann LeCun. 2013. Indoor semantic segmentation using depth information. arXiv:1301.3572. Retrieved from https:\/\/arxiv.org\/abs\/1301.3572"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/SIBGRAPI-T.2012.13"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2018\/94"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2821921"},{"key":"e_1_3_1_32_2","unstructured":"Liuyuan Deng Ming Yang Tianyi Li Yuesheng He and Chunxiang Wang. 2019. RFBNet: Deep multimodal networks with residual fusion blocks for RGB-D semantic segmentation. arXiv:1907.00135. Retrieved from https:\/\/arxiv.org\/abs\/1907.00135"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2023.3293423"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2022.08.011"},{"key":"e_1_3_1_35_2","first-page":"19822","article-title":"Cogview: Mastering text-to-image generation via transformers","volume":"34","author":"Ding Ming","year":"2021","unstructured":"Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. 2021. Cogview: Mastering text-to-image generation via transformers. 
Advances in Neural Information Processing Systems 34 (2021), 19822\u201319835.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_36_2","unstructured":"Denis Dresvyanskiy Elena Ryumina Heysem Kaya Maxim Markitantov Alexey Karpov and Wolfgang Minker. 2020. An audio-video deep and transfer learning framework for multimodal emotion recognition in the wild. arXiv:2010.03692. Retrieved from https:\/\/arxiv.org\/abs\/2010.03692"},{"key":"e_1_3_1_37_2","unstructured":"Weichen Fan Jinghuan Chen Jiabin Ma Jun Hou and Shuai Yi. 2022. Styleflow for content-fixed image to image translation. arXiv:2207.01909. Retrieved from https:\/\/arxiv.org\/abs\/2207.01909"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2020.12.001"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cobeha.2021.02.018"},{"key":"e_1_3_1_40_2","unstructured":"Fahimeh Fooladgar and Shohreh Kasaei. 2019. Multi-modal attention-based fusion model for semantic segmentation of RGB-depth images. arXiv:1912.11691. Retrieved from https:\/\/arxiv.org\/abs\/1912.11691"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","unstructured":"K. Foster G. Christie and M. Brown. 2020. IEEE Dataport: Urban semantic 3D dataset. 10.21227\/9frn-7208","DOI":"10.21227\/9frn-7208"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58548-8_13"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01276"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01419"},{"key":"e_1_3_1_45_2","article-title":"Are you talking to a machine? Dataset and methods for multilingual image question answering","author":"Gao Haoyuan","year":"2015","unstructured":"Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are you talking to a machine? Dataset and methods for multilingual image question answering. 
Advances in Neural Information Processing Systems 28 (2015), 2296\u20132304.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco_a_01273"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2907071"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00680"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/JBHI.2021.3097721"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1177\/0278364913491297"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/MASS.2019.00014"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACII.2019.8925444"},{"key":"e_1_3_1_53_2","volume-title":"Proceedings of the 14th International Conference on Quantitative Infrared Thermography","author":"Ghiass Reza Shoja","year":"2018","unstructured":"Reza Shoja Ghiass, Hakim Bendada, and Xavier Maldague. 2018. Universit\u00e9 laval face motion and time-lapse video database (ul-fmtv). In Proceedings of the 14th International Conference on Quantitative Infrared Thermography."},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.670"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01816"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSMC.2016.2617465"},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/3483381"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1162\/0899766042321814"},{"key":"e_1_3_1_59_2","first-page":"6840","article-title":"Denoising diffusion probabilistic models","volume":"33","author":"Ho Jonathan","year":"2020","unstructured":"Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. 
Advances in Neural Information Processing Systems 33 (2020), 6840\u20136851.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_60_2","article-title":"Deep encoder-decoder networks for classification of hyperspectral and LiDAR data","author":"Hong Danfeng","year":"2022","unstructured":"Danfeng Hong, Lianru Gao, Renlong Hang, Bing Zhang, and Jocelyn Chanussot. 2022. Deep encoder-decoder networks for classification of hyperspectral and LiDAR data. IEEE Geoscience and Remote Sensing Letters 19 (2022), 1\u20135.","journal-title":"IEEE Geoscience and Remote Sensing Letters"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3551593"},{"key":"e_1_3_1_62_2","unstructured":"Jingwen Hu Yuchen Liu Jinming Zhao and Qin Jin. 2021. MMGCN: Multimodal fusion via deep graph convolution network for emotion recognition in conversation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online 5666\u20135675."},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00147"},{"key":"e_1_3_1_64_2","doi-asserted-by":"publisher","DOI":"10.1364\/JOSAA.32.000431"},{"key":"e_1_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2019.8803025"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00448"},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i2.16253"},{"key":"e_1_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01050"},{"key":"e_1_3_1_69_2","unstructured":"Zhicheng Huang Zhaoyang Zeng Bei Liu Dongmei Fu and Jianlong Fu. 2020. Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. arXiv:2004.00849. 
Retrieved from https:\/\/arxiv.org\/abs\/2004.00849"},{"key":"e_1_3_1_70_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2019.8803360"},{"key":"e_1_3_1_71_2","unstructured":"O. Iosifova I. Iosifov and O. Rolik. 2020. Techniques and components for natural language processing. MoMLeT&DS 2631 I (2020) 57\u201367."},{"key":"e_1_3_1_72_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4471-4640-7_8"},{"key":"e_1_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.1177\/0278364919843996"},{"key":"e_1_3_1_74_2","doi-asserted-by":"crossref","unstructured":"Myeonghun Jeong Hyeongju Kim Sung Jun Cheon Byoung Jin Choi and Nam Soo Kim. 2021. Diff-tts: A denoising diffusion model for text-to-speech. arXiv:2104.01409. Retrieved from https:\/\/arxiv.org\/abs\/2104.01409","DOI":"10.21437\/Interspeech.2021-469"},{"key":"e_1_3_1_75_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2022.02.062"},{"key":"e_1_3_1_76_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00521-022-07862-6"},{"key":"e_1_3_1_77_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2020.03.109"},{"key":"e_1_3_1_78_2","doi-asserted-by":"publisher","DOI":"10.3390\/electronics12061390"},{"key":"e_1_3_1_79_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.215"},{"key":"e_1_3_1_80_2","unstructured":"Laurynas Karazija Iro Laina and Christian Rupprecht. 2021. Clevrtex: A texture-rich benchmark for unsupervised multi-object segmentation. arXiv:2111.10265. Retrieved from https:\/\/arxiv.org\/abs\/2111.10265"},{"key":"e_1_3_1_81_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1086"},{"key":"e_1_3_1_82_2","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313552"},{"key":"e_1_3_1_83_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01267-0_41"},{"key":"e_1_3_1_84_2","unstructured":"Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv:1312.6114. 
Retrieved from https:\/\/arxiv.org\/abs\/1312.6114"},{"key":"e_1_3_1_85_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ophoto.2021.100001"},{"key":"e_1_3_1_86_2","unstructured":"D. N. Krishna. 2021. Using large pre-trained models with cross-modal attention for multi-modal emotion recognition. arXiv:2108.09669. Retrieved from https:\/\/arxiv.org\/abs\/2108.09669"},{"key":"e_1_3_1_87_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.83"},{"key":"e_1_3_1_88_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2017.54"},{"key":"e_1_3_1_89_2","doi-asserted-by":"publisher","DOI":"10.1038\/nature14539"},{"key":"e_1_3_1_90_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475431"},{"key":"e_1_3_1_91_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00725"},{"key":"e_1_3_1_92_2","doi-asserted-by":"crossref","unstructured":"Jie Lei Licheng Yu Mohit Bansal and Tamara L. Berg. 2018. Tvqa: Localized compositional video question answering. arXiv:1809.01696. Retrieved from https:\/\/arxiv.org\/abs\/1809.01696","DOI":"10.18653\/v1\/D18-1167"},{"key":"e_1_3_1_93_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6795"},{"key":"e_1_3_1_94_2","article-title":"Graphcfc: A directed graph based cross-modal feature complementation approach for multimodal conversational emotion recognition","author":"Li Jiang","year":"2024","unstructured":"Jiang Li, Xiaoping Wang, Guoqing Lv, and Zhigang Zeng. 2024. Graphcfc: A directed graph based cross-modal feature complementation approach for multimodal conversational emotion recognition. IEEE Transactions on Multimedia 26 (2024), 77\u201389.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_1_95_2","doi-asserted-by":"crossref","unstructured":"Linjie Li Yen-Chun Chen Yu Cheng Zhe Gan Licheng Yu and Jingjing Liu. 2020. Hero: Hierarchical encoder for video+language omni-representation pre-training. arXiv:2005.00200. 
Retrieved from https:\/\/arxiv.org\/abs\/2005.00200","DOI":"10.18653\/v1\/2020.emnlp-main.161"},{"key":"e_1_3_1_96_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.01041"},{"key":"e_1_3_1_97_2","unstructured":"Liunian Harold Li Mark Yatskar Da Yin Cho-Jui Hsieh and Kai-Wei Chang. 2019. Visualbert: A simple and performant baseline for vision and language. arXiv:1908.03557. Retrieved from https:\/\/arxiv.org\/abs\/1908.03557"},{"key":"e_1_3_1_98_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00602"},{"key":"e_1_3_1_99_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00602"},{"key":"e_1_3_1_100_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.502"},{"key":"e_1_3_1_101_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR48806.2021.9412891"},{"key":"e_1_3_1_102_2","unstructured":"Junyang Lin An Yang Yichang Zhang Jie Liu Jingren Zhou and Hongxia Yang. 2020. Interbert: Vision-and-language interaction for multi-modal pretraining. arXiv:2003.13198. Retrieved from https:\/\/arxiv.org\/abs\/2003.13198"},{"key":"e_1_3_1_103_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV48630.2021.00122"},{"key":"e_1_3_1_104_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2021.3065495"},{"key":"e_1_3_1_105_2","article-title":"Multi-modal mutual attention and iterative interaction for referring image segmentation","author":"Liu Chang","year":"2023","unstructured":"Chang Liu, Henghui Ding, Yulun Zhang, and Xudong Jiang. 2023. Multi-modal mutual attention and iterative interaction for referring image segmentation. 
IEEE Transactions on Image Processing 32 (2023), 3054\u20133065.","journal-title":"IEEE Transactions on Image Processing"},{"key":"e_1_3_1_106_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.trit.2017.04.001"},{"key":"e_1_3_1_107_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00431"},{"key":"e_1_3_1_108_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01170"},{"key":"e_1_3_1_109_2","doi-asserted-by":"crossref","unstructured":"Zhun Liu Ying Shen Varun Bharadhwaj Lakshminarasimhan Paul Pu Liang Amir Zadeh and Louis-Philippe Morency. 2018. Efficient low-rank multimodal fusion with modality-specific factors. arXiv:1806.00064. Retrieved from https:\/\/arxiv.org\/abs\/1806.00064","DOI":"10.18653\/v1\/P18-1209"},{"key":"e_1_3_1_110_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2019.07.012"},{"key":"e_1_3_1_111_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10489-020-02036-0"},{"key":"e_1_3_1_112_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2022.3180725"},{"key":"e_1_3_1_113_2","article-title":"Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks","volume":"32","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems 32 (2019), 13\u201323.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_114_2","unstructured":"Huaishao Luo Lei Ji Botian Shi Haoyang Huang Nan Duan Tianrui Li Jason Li Taroon Bharti and Ming Zhou. 2020. Univl: A unified video and language pre-training model for multimodal understanding and generation. arXiv:2002.06353. 
Retrieved from https:\/\/arxiv.org\/abs\/2002.06353"},{"key":"e_1_3_1_115_2","doi-asserted-by":"publisher","DOI":"10.1177\/0278364916679498"},{"key":"e_1_3_1_116_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01231-1_1"},{"key":"e_1_3_1_117_2","article-title":"A multi-world approach to question answering about real-world scenes based on uncertain input","volume":"27","author":"Malinowski Mateusz","year":"2014","unstructured":"Mateusz Malinowski and Mario Fritz. 2014. A multi-world approach to question answering about real-world scenes based on uncertain input. Advances in Neural Information Processing Systems 27 (2014), 1682\u20131690.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_118_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503927"},{"key":"e_1_3_1_119_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.9"},{"key":"e_1_3_1_120_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01252-6_39"},{"key":"e_1_3_1_121_2","doi-asserted-by":"crossref","unstructured":"Md. Kamal Uddin. 2021. Cross-modal and multi-modal person re-identification with RGB-D sensors. 
Array 12 (2021), 100089.","DOI":"10.1016\/j.array.2021.100089"},{"key":"e_1_3_1_122_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2019.12.001"},{"key":"e_1_3_1_123_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00272"},{"key":"e_1_3_1_124_2","doi-asserted-by":"publisher","DOI":"10.1145\/3206025.3206064"},{"key":"e_1_3_1_125_2","doi-asserted-by":"publisher","DOI":"10.1007\/s13735-018-00166-3"},{"key":"e_1_3_1_126_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2023.109848"},{"key":"e_1_3_1_127_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW50498.2020.00054"},{"key":"e_1_3_1_128_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4471-6296-4_8"},{"key":"e_1_3_1_129_2","doi-asserted-by":"publisher","DOI":"10.1109\/EHB52898.2021.9657722"},{"key":"e_1_3_1_130_2","article-title":"Out of the box: Reasoning with graph convolution nets for factual visual question answering","volume":"31","author":"Narasimhan Medhini","year":"2018","unstructured":"Medhini Narasimhan, Svetlana Lazebnik, and Alexander Schwing. 2018. Out of the box: Reasoning with graph convolution nets for factual visual question answering. Advances in Neural Information Processing Systems 31 (2018), 2654\u20132665.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_131_2","volume-title":"Proceedings of the ECCV","author":"Silberman Nathan","year":"2012","unstructured":"Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoor segmentation and support inference from RGBD images. In Proceedings of the ECCV."},{"key":"e_1_3_1_132_2","doi-asserted-by":"publisher","DOI":"10.3390\/s17030605"},{"key":"e_1_3_1_133_2","article-title":"Learning conditioned graph structures for interpretable visual question answering","volume":"31","author":"Norcliffe-Brown Will","year":"2018","unstructured":"Will Norcliffe-Brown, Stathis Vafeias, and Sarah Parisot. 2018. 
Learning conditioned graph structures for interpretable visual question answering. Advances in Neural Information Processing Systems 31 (2018), 8334\u20138343.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_134_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV.2013.6474999"},{"key":"e_1_3_1_135_2","unstructured":"Ozan Oktay Jo Schlemper Loic Le Folgoc Matthew Lee Mattias Heinrich Kazunari Misawa Kensaku Mori Steven McDonagh Nils Y. Hammerla Bernhard Kainz Ben Glocker and Daniel Rueckert. 2018. Attention u-net: Learning where to look for the pancreas. arXiv:1804.03999. Retrieved from https:\/\/arxiv.org\/abs\/1804.03999"},{"key":"e_1_3_1_136_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2015.2424056"},{"key":"e_1_3_1_137_2","first-page":"311","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311\u2013318."},{"key":"e_1_3_1_138_2","doi-asserted-by":"publisher","DOI":"10.1145\/3284750"},{"key":"e_1_3_1_139_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2022.3195643"},{"key":"e_1_3_1_140_2","unstructured":"Di Qi Lin Su Jia Song Edward Cui Taroon Bharti and Arun Sacheti. 2020. Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv:2001.07966. 
Retrieved from https:\/\/arxiv.org\/abs\/2001.07966"},{"key":"e_1_3_1_141_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-020-10431-5"},{"key":"e_1_3_1_142_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.556"},{"key":"e_1_3_1_143_2","doi-asserted-by":"publisher","DOI":"10.1145\/3451215"},{"key":"e_1_3_1_144_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2942480"},{"issue":"2","key":"e_1_3_1_145_2","first-page":"3","article-title":"Hierarchical text-conditional image generation with clip latents","volume":"1","author":"Ramesh Aditya","year":"2022","unstructured":"Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1, 2 (2022), 3.","journal-title":"arXiv preprint arXiv:2204.06125"},{"key":"e_1_3_1_146_2","first-page":"8821","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Ramesh Aditya","year":"2021","unstructured":"Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning. PMLR, 8821\u20138831."},{"key":"e_1_3_1_147_2","doi-asserted-by":"publisher","DOI":"10.1109\/IROS45743.2020.9340849"},{"key":"e_1_3_1_148_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV56688.2023.00121"},{"key":"e_1_3_1_149_2","unstructured":"Adri\u00e0 Recasens Jason Lin Jo\u00e3o Carreira Drew Jaegle Luyu Wang Jean-Baptiste Alayrac Pauline Luc Antoine Miech Lucas Smaira Ross Hemsley and Andrew Zisserman. 2023. Zorro: The masked multimodal transformer. arXiv:2301.09595. Retrieved from https:\/\/arxiv.org\/abs\/2301.09595"},{"key":"e_1_3_1_150_2","unstructured":"Joseph Redmon and Ali Farhadi. 2018. Yolov3: An incremental improvement. arXiv:1804.02767. 
Retrieved from https:\/\/arxiv.org\/abs\/1804.02767"},{"key":"e_1_3_1_151_2","article-title":"Exploring models and data for image question answering","volume":"28","author":"Ren Mengye","year":"2015","unstructured":"Mengye Ren, Ryan Kiros, and Richard Zemel. 2015. Exploring models and data for image question answering. Advances in Neural Information Processing Systems 28 (2015), 2953\u20132961.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_152_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_1_153_2","first-page":"36479","article-title":"Photorealistic text-to-image diffusion models with deep language understanding","volume":"35","author":"Saharia Chitwan","year":"2022","unstructured":"Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35 (2022), 36479\u201336494.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_154_2","doi-asserted-by":"crossref","unstructured":"Gaurav Sahu and Olga Vechtomova. 2021. Adaptive fusion techniques for multimodal data. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Online 3156\u20133166.","DOI":"10.18653\/v1\/2021.eacl-main.275"},{"key":"e_1_3_1_155_2","first-page":"3070","article-title":"Multimodal graph networks for compositional generalization in visual question answering","volume":"33","author":"Saqur Raeid","year":"2020","unstructured":"Raeid Saqur and Karthik Narasimhan. 2020. Multimodal graph networks for compositional generalization in visual question answering. 
Advances in Neural Information Processing Systems 33 (2020), 3070\u20133081.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_156_2","doi-asserted-by":"crossref","unstructured":"Paul Hongsuck Seo Arsha Nagrani Anurag Arnab and Cordelia Schmid. 2022. End-to-end Generative Pretraining for Multimodal Video Captioning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201922) Los Alamitos CA 17938\u201317947.","DOI":"10.1109\/CVPR52688.2022.01743"},{"key":"e_1_3_1_157_2","doi-asserted-by":"publisher","DOI":"10.1587\/transinf.2021EDP7189"},{"key":"e_1_3_1_158_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10489-022-04355-w"},{"key":"e_1_3_1_159_2","unstructured":"Gen Shi Yifan Zhu Wenjin Liu and Xuesong Li. 2021. A heterogeneous graph based framework for multimodal neuroimaging fusion learning. arXiv:2110.08465. Retrieved from https:\/\/arxiv.org\/abs\/2110.08465"},{"key":"e_1_3_1_160_2","doi-asserted-by":"crossref","unstructured":"Nir Shlezinger Jay Whang Yonina C. Eldar and Alexandros G. Dimakis. 2023. Model-based deep learning. Proc. IEEE 111 5 (2023) 465\u2013499.","DOI":"10.1109\/JPROC.2023.3247480"},{"key":"e_1_3_1_161_2","unstructured":"Gunnar A. Sigurdsson Abhinav Gupta Cordelia Schmid Ali Farhadi and Karteek Alahari. 2018. Charades-ego: A large-scale dataset of paired third and first person videos. arXiv:1804.09626. Retrieved from https:\/\/arxiv.org\/abs\/1804.09626"},{"key":"e_1_3_1_162_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2011.6130298"},{"key":"e_1_3_1_163_2","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. 
Retrieved from https:\/\/arxiv.org\/abs\/1409.1556"},{"key":"e_1_3_1_164_2","doi-asserted-by":"publisher","DOI":"10.1504\/IJSI.2022.121102"},{"key":"e_1_3_1_165_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298655"},{"key":"e_1_3_1_166_2","article-title":"Generative modeling by estimating gradients of the data distribution","volume":"32","author":"Song Yang","year":"2019","unstructured":"Yang Song and Stefano Ermon. 2019. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems 32 (2019), 11918\u201311930.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_167_2","doi-asserted-by":"publisher","DOI":"10.23919\/APNOMS.2019.8892906"},{"key":"e_1_3_1_168_2","unstructured":"Chen Sun Fabien Baradel Kevin Murphy and Cordelia Schmid. 2019. Learning video representations using contrastive bidirectional transformer. arXiv:1906.05743. Retrieved from https:\/\/arxiv.org\/abs\/1906.05743"},{"key":"e_1_3_1_169_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00756"},{"key":"e_1_3_1_170_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9414654"},{"key":"e_1_3_1_171_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3551575"},{"key":"e_1_3_1_172_2","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2019.2904733"},{"key":"e_1_3_1_173_2","doi-asserted-by":"crossref","unstructured":"Hao Tan and Mohit Bansal. 2019. Lxmert: Learning cross-modality encoder representations from transformers. 
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) Hong Kong 5100\u20135111.","DOI":"10.18653\/v1\/D19-1514"},{"key":"e_1_3_1_174_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3551607"},{"key":"e_1_3_1_175_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.193"},{"key":"e_1_3_1_176_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2020.102277"},{"key":"e_1_3_1_177_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMI.2018.2868977"},{"key":"e_1_3_1_178_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017), 5998\u20136008.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_179_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_1_180_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-020-74399-w"},{"key":"e_1_3_1_181_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2019.00200"},{"key":"e_1_3_1_182_2","doi-asserted-by":"publisher","DOI":"10.3389\/fphar.2019.01592"},{"key":"e_1_3_1_183_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2019.107075"},{"key":"e_1_3_1_184_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00631"},{"key":"e_1_3_1_185_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2754246"},{"issue":"1","key":"e_1_3_1_186_2","first-page":"1","article-title":"MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification","volume":"12","author":"Wang Tongxin","year":"2021","unstructured":"Tongxin Wang, Wei Shao, Zhi Huang, Haixu Tang, Jie Zhang, 
Zhengming Ding, and Kun Huang. 2021. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nature Communications 12, 1 (2021), 1\u201313.","journal-title":"Nature Communications"},{"key":"e_1_3_1_187_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00813"},{"key":"e_1_3_1_188_2","first-page":"4835","article-title":"Deep multimodal fusion by channel exchanging","volume":"33","author":"Wang Yikai","year":"2020","unstructured":"Yikai Wang, Wenbing Huang, Fuchun Sun, Tingyang Xu, Yu Rong, and Junzhou Huang. 2020. Deep multimodal fusion by channel exchanging. Advances in Neural Information Processing Systems 33 (2020), 4835\u20134845.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_189_2","doi-asserted-by":"publisher","DOI":"10.1145\/3372278.3390713"},{"key":"e_1_3_1_190_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2022.3174215"},{"key":"e_1_3_1_191_2","doi-asserted-by":"crossref","unstructured":"Yanan Wang Michihiro Yasunaga Hongyu Ren Shinya Wada and Jure Leskovec. 2022. Vqa-gnn: Reasoning with multimodal semantic graph for visual question answering. arXiv:2205.11501. 
Retrieved from https:\/\/arxiv.org\/abs\/2205.11501","DOI":"10.1109\/ICCV51070.2023.01973"},{"key":"e_1_3_1_192_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00071"},{"key":"e_1_3_1_193_2","doi-asserted-by":"publisher","DOI":"10.3390\/s20102905"},{"key":"e_1_3_1_194_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413556"},{"key":"e_1_3_1_195_2","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3351034"},{"key":"e_1_3_1_196_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00365"},{"key":"e_1_3_1_197_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.575"},{"key":"e_1_3_1_198_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01023"},{"key":"e_1_3_1_199_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2018.07.029"},{"key":"e_1_3_1_200_2","unstructured":"Shengqiong Wu Hao Fei Leigang Qu Wei Ji and Tat-Seng Chua. 2023. NExT-GPT: Any-to-Any Multimodal LLM. arXiv:2309.05519. Retrieved from https:\/\/arxiv.org\/abs\/2309.05519"},{"key":"e_1_3_1_201_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-acl.226"},{"key":"e_1_3_1_202_2","doi-asserted-by":"crossref","unstructured":"Hu Xu Gargi Ghosh Po-Yao Huang Prahal Arora Masoumeh Aminzadeh Christoph Feichtenhofer Florian Metze and Luke Zettlemoyer. 2021. VLM: Task-agnostic video-language model pre-training for video understanding. arXiv:2105.09996. 
Retrieved from https:\/\/arxiv.org\/abs\/2105.09996","DOI":"10.18653\/v1\/2021.findings-acl.370"},{"key":"e_1_3_1_203_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.571"},{"key":"e_1_3_1_204_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3547728"},{"key":"e_1_3_1_205_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2019\/138"},{"key":"e_1_3_1_206_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2022.103207"},{"key":"e_1_3_1_207_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.engappai.2023.106729"},{"key":"e_1_3_1_208_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548769"},{"key":"e_1_3_1_209_2","doi-asserted-by":"crossref","unstructured":"Xu Yang Jiawei Peng Zihua Wang Haiyang Xu Qinghao Ye Chenliang Li Ming Yan Fei Huang Zhangzikang Li and Yu Zhang. 2023. Transforming Visual Scene Graphs to Image Captions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Toronto 12427\u201312440.","DOI":"10.18653\/v1\/2023.acl-long.694"},{"key":"e_1_3_1_210_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01094"},{"key":"e_1_3_1_211_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_42"},{"key":"e_1_3_1_212_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01075"},{"key":"e_1_3_1_213_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12293"},{"key":"e_1_3_1_214_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58520-4_14"},{"key":"e_1_3_1_215_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00218"},{"key":"e_1_3_1_216_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i11.21483"},{"key":"e_1_3_1_217_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2019.2947482"},{"key":"e_1_3_1_218_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2018.08.017"},{"key":"e_1_3_1_219_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-29551-6_3"},{"key":"e_1_3_1_220_2","doi-asserted-by":"publis
her","DOI":"10.1007\/978-3-319-46475-6_5"},{"key":"e_1_3_1_221_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2018.2817340"},{"key":"e_1_3_1_222_2","unstructured":"Yu Yuan Jiaqi Wu Zhongliang Jing Henry Leung and Han Pan. 2022. Multimodal Image Fusion based on Hybrid CNN-Transformer and Non-local Cross-modal Attention. arXiv:2210.09847. Retrieved from https:\/\/arxiv.org\/abs\/2210.09847"},{"key":"e_1_3_1_223_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33019176"},{"key":"e_1_3_1_224_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2020.3002667"},{"key":"e_1_3_1_225_2","doi-asserted-by":"crossref","unstructured":"Amir Zadeh Minghai Chen Soujanya Poria Erik Cambria and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing Copenhagen Denmark 1103\u20131114.","DOI":"10.18653\/v1\/D17-1115"},{"key":"e_1_3_1_226_2","doi-asserted-by":"crossref","unstructured":"Yawen Zeng Yiru Wang Dongliang Liao Gongfu Li Jin Xu Xiangmin Xu Bo Liu and Hong Man. 2024. Contrastive topic-enhanced network for video captioning. Expert Systems with Applications 237 (2024) 121601.","DOI":"10.1016\/j.eswa.2023.121601"},{"key":"e_1_3_1_227_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3551600"},{"key":"e_1_3_1_228_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2021.06.008"},{"key":"e_1_3_1_229_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3549201"},{"key":"e_1_3_1_230_2","doi-asserted-by":"publisher","DOI":"10.23919\/FUSION43075.2019.9011282"},{"key":"e_1_3_1_231_2","doi-asserted-by":"publisher","DOI":"10.1109\/TBIOM.2020.2973001"},{"key":"e_1_3_1_232_2","unstructured":"Shan Zhang Pranay Sharma Baocheng Geng and Pramod K. Varshney. 2022. Distributed estimation in large scale wireless sensor networks via a two step group-based approach. arXiv:2203.09567. 
Retrieved from https:\/\/arxiv.org\/abs\/2203.09567"},{"key":"e_1_3_1_233_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00101"},{"key":"e_1_3_1_234_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_18"},{"key":"e_1_3_1_235_2","first-page":"336","volume-title":"Proceedings of the VISIGRAPP (5: VISAPP)","author":"Zhang Yifei","year":"2019","unstructured":"Yifei Zhang, Olivier Morel, Marc Blanchon, Ralph Seulin, Mojdeh Rastgoo, and D\u00e9sir\u00e9 Sidib\u00e9. 2019. Exploration of deep learning-based multimodal fusion for semantic road scene segmentation. In Proceedings of the VISIGRAPP (5: VISAPP). 336\u2013343."},{"key":"e_1_3_1_236_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2020.104042"},{"key":"e_1_3_1_237_2","article-title":"Spatial-information guided adaptive context-aware network for efficient rgb-d semantic segmentation","author":"Zhang Yang","year":"2023","unstructured":"Yang Zhang, Chenyun Xiong, Junjie Liu, Xuhui Ye, and Guodong Sun. 2023. Spatial-information guided adaptive context-aware network for efficient rgb-d semantic segmentation. 
IEEE Sensors Journal 23, 19 (2023), 23512\u201323521.","journal-title":"IEEE Sensors Journal"},{"key":"e_1_3_1_238_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2018.11.004"},{"key":"e_1_3_1_239_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2023.126389"},{"key":"e_1_3_1_240_2","doi-asserted-by":"publisher","DOI":"10.3390\/rs12111887"},{"key":"e_1_3_1_241_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICNSC48988.2020.9238079"},{"key":"e_1_3_1_242_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.aiopen.2021.01.001"},{"key":"e_1_3_1_243_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12342"},{"key":"e_1_3_1_244_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.array.2019.100004"},{"key":"e_1_3_1_245_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00877"}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3649447","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3649447","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:03:21Z","timestamp":1750291401000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3649447"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,4,24]]},"references-count":244,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2024,10,31]]}},"alternative-id":["10.1145\/3649447"],"URL":"https:\/\/doi.org\/10.1145\/3649447","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,4,24]]},"assertion":[{"value":"2022-11-06","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2024-01-31","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-04-24","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}