{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T20:18:28Z","timestamp":1776111508160,"version":"3.50.1"},"reference-count":419,"publisher":"Association for Computing Machinery (ACM)","issue":"10","license":[{"start":{"date-parts":[[2024,6,22]],"date-time":"2024-06-22T00:00:00Z","timestamp":1719014400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2024,10,31]]},"abstract":"<jats:p>\n            Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning through integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this article is designed to provide an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining three key principles of modality\n            <jats:italic>heterogeneity<\/jats:italic>\n            ,\n            <jats:italic>connections<\/jats:italic>\n            , and\n            <jats:italic>interactions<\/jats:italic>\n            that have driven subsequent innovations, and propose a taxonomy of six core technical challenges:\n            <jats:italic>representation<\/jats:italic>\n            ,\n            <jats:italic>alignment<\/jats:italic>\n            ,\n            <jats:italic>reasoning<\/jats:italic>\n            ,\n            <jats:italic>generation<\/jats:italic>\n            ,\n            <jats:italic>transference<\/jats:italic>\n            , and\n            <jats:italic>quantification<\/jats:italic>\n            covering historical and recent trends. Recent technical achievements will be presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. 
We end by motivating several open problems for future research as identified by our taxonomy.\n          <\/jats:p>","DOI":"10.1145\/3656580","type":"journal-article","created":{"date-parts":[[2024,4,9]],"date-time":"2024-04-09T11:52:50Z","timestamp":1712663570000},"page":"1-42","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":143,"title":["Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions"],"prefix":"10.1145","volume":"56","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7768-3610","authenticated-orcid":false,"given":"Paul Pu","family":"Liang","sequence":"first","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5297-3571","authenticated-orcid":false,"given":"Amir","family":"Zadeh","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6376-7696","authenticated-orcid":false,"given":"Louis-Philippe","family":"Morency","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, United States"}]}],"member":"320","published-online":{"date-parts":[[2024,6,22]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2018.2875385"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3461702.3462624"},{"issue":"1","key":"e_1_3_1_4_2","first-page":"1","article-title":"Multi-modal haptic feedback for grip force reduction in robotic surgery","volume":"9","year":"2019","unstructured":"Ahmad Abiri, Jake Pensa, Anna Tao, Ji Ma, Yen-Yi Juo, and Syed J. Askari. 2019. Multi-modal haptic feedback for grip force reduction in robotic surgery. Scientific Reports 9, 1 (2019), 1\u201310.","journal-title":"Scientific Reports"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41591-022-01981-2"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.02072"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00971"},{"key":"e_1_3_1_8_2","unstructured":"Andrea Agostinelli Timo I. Denk Zal\u00e1n Borsos Jesse Engel Mauro Verzetti and Antoine Caillon. 2023. MusicLM: Generating music from text. arXiv:2301.11325. Retrieved from https:\/\/arxiv.org\/abs\/2301.11325"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1203"},{"key":"e_1_3_1_10_2","doi-asserted-by":"crossref","unstructured":"Aishwarya Agrawal Jiasen Lu Stanislaw Antol Margaret Mitchell C. Lawrence Zitnick Devi Parikh and Dhruv Batra. 2017. VQA: Visual question answering: www.visualqa.org. International Journal of Computer Vision 123 1 (2017) 4\u201331.","DOI":"10.1007\/s11263-016-0966-6"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58523-5_15"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/3DV.2019.00084"},{"key":"e_1_3_1_13_2","unstructured":"Hana Ajakan Pascal Germain Hugo Larochelle Fran\u00e7ois Laviolette and Mario Marchand. 2014. Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446 (2014)."},{"key":"e_1_3_1_14_2","unstructured":"Hassan Akbari Liangzhe Yuan and Rui Qian. 2021. VATT: Transformers for multimodal self-supervised learning from raw video audio and text. 
arXiv preprint arXiv:2104.11178 (2021)."},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2019.08.019"},{"key":"e_1_3_1_16_2","unstructured":"Jean-Baptiste Alayrac Jeff Donahue Pauline Luc Antoine Miech Iain Barr and Yana Hasson. 2022. Flamingo: A visual language model for few-shot learning. arXiv preprint arXiv:2204.14198 (2022)."},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1080\/0163853X.2020.1768500"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.4103\/jfmpc.jfmpc_440_19"},{"key":"e_1_3_1_19_2","first-page":"279","volume-title":"ICML","author":"Amizadeh Saeed","year":"2020","unstructured":"Saeed Amizadeh, Hamid Palangi, Alex Polozov, Yichen Huang, and Kazuhito Koishida. 2020. Neuro-symbolic visual reasoning: Disentangling visual from reasoning. In ICML. PMLR, 279\u2013290."},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.12"},{"key":"e_1_3_1_22_2","volume-title":"ICML","author":"Andrew Galen","year":"2013","unstructured":"Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. 2013. Deep canonical correlation analysis. In ICML."},{"key":"e_1_3_1_23_2","volume-title":"INTERSPEECH","author":"Anguera Xavier","year":"2014","unstructured":"Xavier Anguera, Jordi Luque, and Ciro Gracia. 2014. Audio-to-text alignment for speech recognition with very limited resources. In INTERSPEECH."},{"key":"e_1_3_1_24_2","unstructured":"John Arevalo Thamar Solorio Manuel Montes-y G\u00f3mez and Fabio A. Gonz\u00e1lez. 2017. Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992 (2017)."},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00530-010-0182-0"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-76298-0_52"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-14400-4_20"},{"key":"e_1_3_1_28_2","unstructured":"Anas Awadalla Irena Gao Josh Gardner Jack Hessel and Yusuf Hanafy. 2023. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390 (2023)."},{"key":"e_1_3_1_29_2","article-title":"Co-training and expansion: Towards bridging theory and practice","author":"Balcan Maria-Florina","year":"2004","unstructured":"Maria-Florina Balcan, Avrim Blum, and Ke Yang. 2004. Co-training and expansion: Towards bridging theory and practice. In NeurIPS.","journal-title":"NeurIPS"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2798607"},{"key":"e_1_3_1_31_2","volume-title":"NeurIPS 2020 Workshop SVRHM","author":"Barnum George","year":"2020","unstructured":"George Barnum, Sabera J. Talukder, and Yisong Yue. 2020. On the benefits of early fusion in multimodal representation learning. In NeurIPS 2020 Workshop SVRHM."},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1037\/0022-3514.51.6.1173"},{"key":"e_1_3_1_33_2","volume-title":"Image-Music-Text","author":"Barthes Roland","year":"1977","unstructured":"Roland Barthes. 1977. Image-Music-Text. Macmillan."},{"key":"e_1_3_1_34_2","article-title":"Analysis of representations for domain adaptation","author":"Ben-David Shai","year":"2006","unstructured":"Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. 2006. Analysis of representations for domain adaptation. 
In NeurIPS.","journal-title":"NeurIPS"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.285"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3442188.3445922"},{"key":"e_1_3_1_37_2","doi-asserted-by":"crossref","unstructured":"Yoshua Bengio Aaron Courville and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 8 (2013) 1798\u20131828.","DOI":"10.1109\/TPAMI.2013.50"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1089\/1092642041255441"},{"key":"e_1_3_1_39_2","unstructured":"Abeba Birhane Vinay Uday Prabhu and Emmanuel Kahembwe. 2021. Multimodal datasets: Misogyny pornography and malignant stereotypes. arXiv preprint arXiv:2110.01963 (2021)."},{"key":"e_1_3_1_40_2","first-page":"8718","volume-title":"EMNLP","year":"2020","unstructured":"Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, and Aleksandr Nisnevich. 2020. Experience grounds language. In EMNLP. 8718\u20138735."},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/279943.279962"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376746"},{"key":"e_1_3_1_43_2","first-page":"4349","volume-title":"NeurIPS","author":"Bolukbasi Tolga","year":"2016","unstructured":"Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NeurIPS. 4349\u20134357."},{"key":"e_1_3_1_44_2","volume-title":"NIPS 2017\u2019s Visually-Grounded Interaction and Language Workshop","year":"2017","unstructured":"Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, and Luca Celotti. 2017. HoME: A household multimodal environment. In NIPS 2017\u2019s Visually-Grounded Interaction and Language Workshop."},{"key":"e_1_3_1_45_2","unstructured":"Anthony Brohan Noah Brown Justice Carbajal Yevgen Chebotar Xi Chen Krzysztof Choromanski Tianli Ding Danny Driess Avinava Dubey and Chelsea Finn. 2023. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818 (2023)."},{"key":"e_1_3_1_46_2","unstructured":"Michael M. Bronstein Joan Bruna Taco Cohen and Petar Veli\u010dkovi\u0107. 2021. Geometric deep learning: Grids groups graphs geodesics and gauges. arXiv preprint arXiv:2104.13478 (2021)."},{"key":"e_1_3_1_47_2","unstructured":"Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In FAccT. PMLR 77\u201391."},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2941419"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33275-3_42"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58539-6_34"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.10487"},{"key":"e_1_3_1_52_2","first-page":"2633","volume-title":"USENIX Security","year":"2021","unstructured":"Nicholas Carlini, Florian Tramer, EricWallace, Matthew Jagielski, and Ariel Herbert-Voss. 2021. Extracting training data from large language models. In USENIX Security. 
2633\u20132650."},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1455"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1128"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-acl.375"},{"key":"e_1_3_1_56_2","doi-asserted-by":"crossref","unstructured":"Wilson Chango Juan A. Lara Rebeca Cerezo and Cristobal Romero. 2022. A review on data fusion in multimodal learning analytics and educational data mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 12 4 (2022) e1458.","DOI":"10.1002\/widm.1458"},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11832"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.3026276"},{"key":"e_1_3_1_59_2","first-page":"8012","volume-title":"ICCV","year":"2021","unstructured":"Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, and Angie Boggust. 2021. Multimodal clustering networks for self-supervised learning from unlabeled videos. In ICCV. 8012\u20138021."},{"key":"e_1_3_1_60_2","doi-asserted-by":"crossref","unstructured":"Jun Chen Han Guo Kai Yi Boyang Li and Mohamed Elhoseiny. 2021. VisualGPT: Data-efficient adaptation of pretrained language models for image captioning. arXiv preprint arXiv:2102.10407 (2021).","DOI":"10.1109\/CVPR52688.2022.01750"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-acl.46"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1438"},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/SKG.2018.00033"},{"key":"e_1_3_1_64_2","unstructured":"Lele Chen Guofeng Cui Ziyi Kou Haitian Zheng and Chenliang Xu. 2020. What comprises a good talking-head video generation?: A survey and benchmark. arXiv preprint arXiv:2005.03201 (2020)."},{"key":"e_1_3_1_65_2","first-page":"1542","volume-title":"ICML","author":"Chen Liqun","year":"2020","unstructured":"Liqun Chen, Zhe Gan, Yu Cheng, Linjie Li, Lawrence Carin, and Jingjing Liu. 2020. Graph optimal transport for cross-domain alignment. In ICML. PMLR, 1542\u20131553."},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1145\/3136755.3136801"},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2017.2675998"},{"key":"e_1_3_1_68_2","first-page":"3345","volume-title":"IJCAI","year":"2016","unstructured":"Yanhua Cheng, Xin Zhao, Rui Cai, Zhiwei Li, Kaiqi Huang, and Yong Rui. 2016. Semi-supervised multimodal deep learning for RGB-D object recognition. In IJCAI. 3345\u20133351."},{"key":"e_1_3_1_69_2","doi-asserted-by":"crossref","unstructured":"Jaemin Cho Abhay Zala and Mohit Bansal. 2022. DALL-Eval: Probing the reasoning skills and social biases of text-to-image generative transformers. arXiv preprint arXiv:2202.04053 (2022).","DOI":"10.1109\/ICCV51070.2023.00283"},{"key":"e_1_3_1_70_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12343"},{"key":"e_1_3_1_71_2","volume-title":"ACL","author":"Cirik Volkan","year":"2020","unstructured":"Volkan Cirik, Taylor Berg-Kirkpatrick, and L.-P. Morency. 2020. Refer360: A referring expression recognition dataset in 360 images. In ACL."},{"key":"e_1_3_1_72_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-2123"},{"key":"e_1_3_1_73_2","unstructured":"Jade Copet Felix Kreuk Itai Gat Tal Remez David Kant Gabriel Synnaeve Yossi Adi and Alexandre D\u00e9fossez. 2023. Simple and controllable music generation. 
arXiv preprint arXiv:2306.05284 (2023)."},{"key":"e_1_3_1_74_2","unstructured":"Wenliang Dai Junnan Li Dongxu Li Anthony Meng Huat Tiong Junqi Zhao and Weisheng Wang. 2023. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv:2305.06500. Retrieved from https:\/\/arxiv.org\/abs\/2305.06500"},{"key":"e_1_3_1_75_2","doi-asserted-by":"publisher","DOI":"10.1109\/MIC.2015.72"},{"key":"e_1_3_1_76_2","first-page":"14961","article-title":"See, hear, explore: Curiosity via audio-visual association","author":"Dean Victoria","year":"2020","unstructured":"Victoria Dean, Shubham Tulsiani, and Abhinav Gupta. 2020. See, hear, explore: Curiosity via audio-visual association. In NeurIPS. 14961\u201314972.","journal-title":"NeurIPS"},{"key":"e_1_3_1_77_2","doi-asserted-by":"publisher","DOI":"10.1145\/1878116.1878131"},{"key":"e_1_3_1_78_2","unstructured":"Joseph DelPreto Chao Liu and Yiyue Luo. 2022. ActionSense: A multimodal dataset and recording framework for human activities using wearable sensors in a kitchen environment. In NeurIPS."},{"key":"e_1_3_1_79_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.599"},{"key":"e_1_3_1_80_2","first-page":"1174","volume-title":"ICML","author":"Denton Emily","year":"2018","unstructured":"Emily Denton and Rob Fergus. 2018. Stochastic video generation with a learned prior. In ICML. PMLR, 1174\u20131183."},{"key":"e_1_3_1_81_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2005.03.007"},{"key":"e_1_3_1_82_2","volume-title":"NAACL-HLT (1)","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1)."},{"key":"e_1_3_1_83_2","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2021.3058873"},{"key":"e_1_3_1_84_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn and Xiaohua Zhai. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR."},{"key":"e_1_3_1_85_2","unstructured":"Danny Driess Fei Xia Mehdi S. M. Sajjadi Corey Lynch Aakanksha Chowdhery Brian Ichter and Ayzaan Wahid. 2023. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)."},{"key":"e_1_3_1_86_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9052990"},{"key":"e_1_3_1_87_2","unstructured":"Yifan Du Zikang Liu Junyi Li and Wayne Xin Zhao. 2022. A survey of vision-language pre-trained models. arXiv preprint arXiv:2202.10936 (2022)."},{"key":"e_1_3_1_88_2","doi-asserted-by":"crossref","unstructured":"Jared A. Dunnmon Alexander J. Ratner Khaled Saab Nishith Khandwala Matthew Markert Hersh Sagreiya Roger Goldman Christopher Lee-Messer Matthew P. Lungren and Daniel L. Rubin. 2020. Cross-modal data programming enables rapid medical machine learning. Patterns 1 2 (2020).","DOI":"10.1016\/j.patter.2020.100019"},{"key":"e_1_3_1_89_2","unstructured":"Chris Dyer. 2014. Notes on noise contrastive estimation and negative sampling. arXiv preprint arXiv:1410.8251 (2014)."},{"key":"e_1_3_1_90_2","doi-asserted-by":"publisher","DOI":"10.1145\/2522848.2532595"},{"key":"e_1_3_1_91_2","doi-asserted-by":"crossref","unstructured":"Georgios Evangelopoulos Athanasia Zlatintsi Alexandros Potamianos Petros Maragos Konstantinos Rapantzikos Georgios Skoumas and Yannis Avrithis. 2013. Multimodal saliency and fusion for movie summarization based on aural visual and textual attention. 
IEEE Transactions on Multimedia 15 7 (2013) 1553\u20131568.","DOI":"10.1109\/TMM.2013.2267205"},{"key":"e_1_3_1_92_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00118"},{"key":"e_1_3_1_93_2","doi-asserted-by":"publisher","DOI":"10.5555\/1888089.1888092"},{"key":"e_1_3_1_94_2","doi-asserted-by":"publisher","DOI":"10.1145\/3461615.3486570"},{"key":"e_1_3_1_95_2","doi-asserted-by":"publisher","DOI":"10.1214\/07-AOAS148"},{"key":"e_1_3_1_96_2","first-page":"2121","volume-title":"NeurIPS","author":"Frome Andrea","year":"2013","unstructured":"Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc\u2019Aurelio Ranzato, and Tomas Mikolov. 2013. Devise: A deep visual-semantic embedding model. In NeurIPS. 2121\u20132129."},{"key":"e_1_3_1_97_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1044"},{"key":"e_1_3_1_98_2","unstructured":"Hiroki Furuta Ofir Nachum Kuang-Huei Lee Yutaka Matsuo Shixiang Shane Gu and Izzeddin Gur. 2023. Multimodal web navigation with instruction-finetuned foundation models. arXiv preprint arXiv:2305.11854 (2023)."},{"key":"e_1_3_1_99_2","first-page":"1","volume-title":"FUSION","author":"Gadzicki Konrad","year":"2020","unstructured":"Konrad Gadzicki, Razieh Khamsehashari, and Christoph Zetzsche. 2020. Early vs late fusion in multimodal convolutional neural networks. In FUSION. IEEE, 1\u20136."},{"key":"e_1_3_1_100_2","doi-asserted-by":"crossref","unstructured":"Zhe Gan Linjie Li Chunyuan Li Lijuan Wang Zicheng Liu and Jianfeng Gao. 2022. Vision-language pre-training: Basics recent advances and future trends. Foundations and Trends\u00ae in Computer Graphics and Vision 14 3\u20134 (2022) 163\u2013352.","DOI":"10.1561\/0600000105"},{"key":"e_1_3_1_101_2","unstructured":"Peng Gao Jiaming Han Renrui Zhang Ziyi Lin Shijie Geng Aojun Zhou and Wei Zhang. 2023. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010 (2023)."},{"key":"e_1_3_1_102_2","first-page":"324","volume-title":"CVPR","author":"Gao Ruohan","year":"2019","unstructured":"Ruohan Gao and Kristen Grauman. 2019. 2.5D visual sound. In CVPR. 324\u2013333."},{"key":"e_1_3_1_103_2","doi-asserted-by":"crossref","unstructured":"Enrique Garcia-Ceja Michael Riegler Tine Nordgreen Petter Jakobsen Ketil J. Oedegaard and Jim T\u00f8rresen. 2018. Mental health monitoring with multimodal sensing and machine learning: A survey. Pervasive and Mobile Computing 51 (2018) 1\u201326.","DOI":"10.1016\/j.pmcj.2018.09.003"},{"key":"e_1_3_1_104_2","article-title":"Perceptual score: What data modalities does your model perceive?","author":"Gat Itai","year":"2021","unstructured":"Itai Gat, Idan Schwartz, and Alex Schwing. 2021. Perceptual score: What data modalities does your model perceive? In NeurIPS.","journal-title":"NeurIPS"},{"key":"e_1_3_1_105_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.301"},{"key":"e_1_3_1_106_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1107"},{"key":"e_1_3_1_107_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33013681"},{"key":"e_1_3_1_108_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58589-1_23"},{"key":"e_1_3_1_109_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.670"},{"key":"e_1_3_1_110_2","unstructured":"Yash Goyal Akrit Mohapatra Devi Parikh and Dhruv Batra. 2016. Towards transparent AI systems: Interpreting visual question answering models. 
arXiv preprint arXiv:1608.08974 (2016)."},{"key":"e_1_3_1_111_2","first-page":"1880","volume-title":"AISTATS","author":"Grave Edouard","year":"2019","unstructured":"Edouard Grave, Armand Joulin, and Quentin Berthet. 2019. Unsupervised alignment of embeddings with Wasserstein Procrustes. In AISTATS. PMLR, 1880\u20131890."},{"key":"e_1_3_1_112_2","unstructured":"Liangke Gui Borui Wang Qiuyuan Huang Alex Hauptmann Yonatan Bisk and Jianfeng Gao. 2021. KAT: A knowledge augmented transformer for vision-and-language. arXiv preprint arXiv:2112.08614 (2021)."},{"key":"e_1_3_1_113_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2010.5540120"},{"key":"e_1_3_1_114_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00380"},{"key":"e_1_3_1_115_2","doi-asserted-by":"crossref","unstructured":"Jeffrey T. Hancock and Jeremy N. Bailenson. 2021. The Social Impact of Deepfakes. 149\u2013152 pages.","DOI":"10.1089\/cyber.2021.29208.jth"},{"key":"e_1_3_1_116_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00550"},{"key":"e_1_3_1_117_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i14.17534"},{"key":"e_1_3_1_118_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1211"},{"key":"e_1_3_1_119_2","unstructured":"Xuehai He Yichen Zhang Luntian Mou Eric Xing and Pengtao Xie. 2020. PathVQA: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286 (2020)."},{"key":"e_1_3_1_120_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01219-9_47"},{"key":"e_1_3_1_121_2","doi-asserted-by":"crossref","unstructured":"Lisa Anne Hendricks John Mellor Rosalia Schneider Jean-Baptiste Alayrac and Aida Nematzadeh. 2021. Decoupling the role of data attention and losses in multimodal transformers. arXiv preprint arXiv:2102.00529 (2021).","DOI":"10.1162\/tacl_a_00385"},{"key":"e_1_3_1_122_2","volume-title":"EMNLP","author":"Hessel Jack","year":"2020","unstructured":"Jack Hessel and Lillian Lee. 2020. Does my multimodal model learn cross-modal interactions? It\u2019s harder to tell than you might think!. In EMNLP."},{"key":"e_1_3_1_123_2","doi-asserted-by":"crossref","unstructured":"Jack Hessel Ana Marasovi\u0107 Jena D. Hwang Lillian Lee Jeff Da Rowan Zellers Robert Mankoff and Yejin Choi. 2022. Do androids laugh at electric sheep? Humor \u201cUnderstanding\u201d benchmarks from the New Yorker Caption Contest. arXiv preprint arXiv:2209.06293 (2022).","DOI":"10.18653\/v1\/2023.acl-long.41"},{"key":"e_1_3_1_124_2","unstructured":"Irina Higgins Loic Matthey Arka Pal Christopher Burgess Xavier Glorot Matthew Botvinick Shakir Mohamed and Alexander Lerchner. 2016. beta-VAE: Learning basic visual concepts with a constrained variational framework. (2016)."},{"key":"e_1_3_1_125_2","unstructured":"Ryota Hinami Junwei Liang Shin\u2019ichi Satoh and Alexander Hauptmann. 2018. Multimodal co-training for selecting good examples from webly labeled video. arXiv preprint arXiv:1804.06057 (2018)."},{"key":"e_1_3_1_126_2","doi-asserted-by":"publisher","DOI":"10.1109\/TGRS.2020.3016820"},{"key":"e_1_3_1_127_2","doi-asserted-by":"crossref","unstructured":"Richang Hong Daqing Liu Xiaoyu Mo Xiangnan He and Hanwang Zhang. 2019. Learning to compose and reason with language tree structures for visual grounding. 
IEEE Transactions on Pattern Analysis and Machine Intelligence 44 2 (2019) 684\u2013696.","DOI":"10.1109\/TPAMI.2019.2911066"},{"key":"e_1_3_1_128_2","first-page":"12136","article-title":"Deep multimodal multilinear fusion with high-order polynomial pooling","author":"Hou Ming","year":"2019","unstructured":"Ming Hou, Jiajia Tang, Jianhai Zhang, Wanzeng Kong, and Qibin Zhao. 2019. Deep multimodal multilinear fusion with high-order polynomial pooling. In NeurIPS. 12136\u201312145.","journal-title":"NeurIPS"},{"key":"e_1_3_1_129_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICBK.2019.00020"},{"key":"e_1_3_1_130_2","unstructured":"Tzu-Ming Harry Hsu Wei-Hung Weng Willie Boag Matthew McDermott and Peter Szolovits. 2018. Unsupervised multimodal representation learning across medical images and reports. arXiv preprint arXiv:1811.08615 (2018)."},{"key":"e_1_3_1_131_2","unstructured":"Wei-Ning Hsu and James Glass. 2018. Disentangling by partitioning: A representation learning framework for multimodal sensory data. arXiv preprint arXiv:1805.11264 (2018)."},{"key":"e_1_3_1_132_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00947"},{"key":"e_1_3_1_133_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2019.05.017"},{"key":"e_1_3_1_134_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.93"},{"key":"e_1_3_1_135_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.493"},{"key":"e_1_3_1_136_2","unstructured":"Wenlong Huang Pieter Abbeel Deepak Pathak and Igor Mordatch. 2022. Language models as zero-shot planners: extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207 (2022)."},{"key":"e_1_3_1_137_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2017\/263"},{"key":"e_1_3_1_138_2","article-title":"What makes multi-modal learning better than single (provably)","author":"Huang Yu","year":"2021","unstructured":"Yu Huang, Chenzhuang Du, Zihui Xue, Xuanyao Chen, Hang Zhao, and Longbo Huang. 2021. What makes multi-modal learning better than single (provably). In NeurIPS.","journal-title":"NeurIPS"},{"key":"e_1_3_1_139_2","unstructured":"Yu Huang Junyang Lin Chang Zhou Hongxia Yang and Longbo Huang. 2022. Modality competition: what makes joint training of multi-modal network fail in deep learning? (Provably). arXiv preprint arXiv:2203.12221 (2022)."},{"key":"e_1_3_1_140_2","doi-asserted-by":"publisher","DOI":"10.1109\/JIOT.2019.2940709"},{"key":"e_1_3_1_141_2","article-title":"Learning by abstraction: The neural state machine","author":"Hudson Drew","year":"2019","unstructured":"Drew Hudson and Christopher D. Manning. 2019. Learning by abstraction: The neural state machine. In NeurIPS.","journal-title":"NeurIPS"},{"key":"e_1_3_1_142_2","unstructured":"Drew A. Hudson and Christopher D. Manning. 2018. Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067 (2018)."},{"key":"e_1_3_1_143_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00686"},{"key":"e_1_3_1_144_2","unstructured":"Masha Itkina B. Ivanovic Ransalu Senanayake Mykel J. Kochenderfer and Marco Pavone. 2020. Evidential sparsification of multimodal latent spaces in conditional variational autoencoders. arXiv:2010.09164. Retrieved from https:\/\/arxiv.org\/abs\/2010.09164"},{"key":"e_1_3_1_145_2","unstructured":"Andrew Jaegle Felix Gimeno Andrew Brock Andrew Zisserman Oriol Vinyals and Joao Carreira. 2021. Perceiver: General perception with iterative attention. 
arXiv preprint arXiv:2103.03206 (2021)."},{"key":"e_1_3_1_146_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2006.10.019"},{"key":"e_1_3_1_147_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-019-01150-y"},{"key":"e_1_3_1_148_2","doi-asserted-by":"crossref","unstructured":"Anubhav Jangra Adam Jatowt Mohammad Hasanuzzaman and Sriparna Saha. 2020. Text-image-video summary generation using joint integer linear programming. In ECIR. Springer.","DOI":"10.1007\/978-3-030-45442-5_24"},{"key":"e_1_3_1_149_2","volume-title":"ICLR","author":"Jayakumar Siddhant M.","year":"2020","unstructured":"Siddhant M. Jayakumar, Wojciech M. Czarnecki, Jacob Menick, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye Teh, Tim Harley, and Razvan Pascanu. 2020. Multiplicative interactions and where to find them. In ICLR."},{"key":"e_1_3_1_150_2","first-page":"4904","volume-title":"ICML","year":"2021","unstructured":"Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, and Hieu Pham. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML. PMLR, 4904\u20134916."},{"issue":"1","key":"e_1_3_1_151_2","first-page":"1","article-title":"MIMIC-III, a freely accessible critical care database","volume":"3","year":"2016","unstructured":"Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, H. Lehman Li-Wei, Mengling Feng, and Mohammad Ghassemi. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3, 1 (2016), 1\u20139.","journal-title":"Scientific Data"},{"key":"e_1_3_1_152_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.215"},{"key":"e_1_3_1_153_2","article-title":"One model to learn them all","author":"Kaiser Lukasz","year":"2017","unstructured":"Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. 2017. One model to learn them all. arXiv:1706.05137. Retrieved from https:\/\/arxiv.org\/abs\/1706.05137","journal-title":"arXiv:1706.05137"},{"key":"e_1_3_1_154_2","article-title":"Deep fragment embeddings for bidirectional image sentence mapping","author":"Karpathy Andrej","year":"2014","unstructured":"Andrej Karpathy, Armand Joulin, and Li F. Fei-Fei. 2014. Deep fragment embeddings for bidirectional image sentence mapping. In NeurIPS.","journal-title":"NeurIPS"},{"key":"e_1_3_1_155_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00813"},{"key":"e_1_3_1_156_2","doi-asserted-by":"publisher","DOI":"10.1162\/NECO_a_00074"},{"key":"e_1_3_1_157_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2019.2945574"},{"key":"e_1_3_1_158_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2021.3054789"},{"key":"e_1_3_1_159_2","first-page":"2611","article-title":"The hateful memes challenge: Detecting hate speech in multimodal memes","author":"Kiela Douwe","year":"2020","unstructured":"Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. In NeurIPS. 2611\u20132624.","journal-title":"NeurIPS"},{"key":"e_1_3_1_160_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2015.03.003"},{"key":"e_1_3_1_161_2","doi-asserted-by":"publisher","unstructured":"Elsa A. Kirchner Stephen H. Fairclough and Frank Kirchner. 2019. Embedded multimodal interfaces in robotics: applications future trends and societal implications. Association for Computing Machinery and Morgan & Claypool 523\u2013576. 
10.1145\/3233795.3233810","DOI":"10.1145\/3233795.3233810"},{"key":"e_1_3_1_162_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV48630.2021.00028"},{"key":"e_1_3_1_163_2","first-page":"5338","volume-title":"ICML","author":"Koh Pang Wei","year":"2020","unstructured":"Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. 2020. Concept bottleneck models. In ICML. PMLR, 5338\u20135348."},{"key":"e_1_3_1_164_2","doi-asserted-by":"publisher","DOI":"10.1093\/acprof:oso\/9780199546251.003.0001"},{"key":"e_1_3_1_165_2","doi-asserted-by":"crossref","unstructured":"Satwik Kottur Jos\u00e9 MF Moura Devi Parikh Dhruv Batra and Marcus Rohrbach. 2018. Visual coreference resolution in visual dialog using neural module networks. In ECCV. 153\u2013169.","DOI":"10.1007\/978-3-030-01267-0_10"},{"key":"e_1_3_1_166_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_1_167_2","doi-asserted-by":"crossref","unstructured":"Satyapriya Krishna Tessa Han Alex Gu Javin Pombra Shahin Jabbari Steven Wu and Himabindu Lakkaraju. 2022. The disagreement problem in explainable machine learning: A practitioner\u2019s perspective. arXiv preprint arXiv:2202.01602 (2022).","DOI":"10.21203\/rs.3.rs-2963888\/v1"},{"key":"e_1_3_1_168_2","doi-asserted-by":"publisher","DOI":"10.1137\/1025045"},{"key":"e_1_3_1_169_2","doi-asserted-by":"publisher","DOI":"10.1142\/S012906570000034X"},{"key":"e_1_3_1_170_2","doi-asserted-by":"publisher","DOI":"10.1038\/sdata.2018.251"},{"key":"e_1_3_1_171_2","first-page":"2085","volume-title":"ICML","author":"Lebret R\u00e9mi","year":"2015","unstructured":"R\u00e9mi Lebret, Pedro Pinheiro, and Ronan Collobert. 2015. Phrase-based image captioning. In ICML. PMLR, 2085\u20132094."},{"key":"e_1_3_1_172_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01838"},{"key":"e_1_3_1_173_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2019.8793485"},{"key":"e_1_3_1_174_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01435"},{"key":"e_1_3_1_175_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1167"},{"key":"e_1_3_1_176_2","doi-asserted-by":"publisher","DOI":"10.1145\/3406324.3410710"},{"key":"e_1_3_1_177_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.243"},{"key":"e_1_3_1_178_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1114"},{"key":"e_1_3_1_179_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jag.2022.102926"},{"key":"e_1_3_1_180_2","unstructured":"Junnan Li Dongxu Li Silvio Savarese and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)."},{"key":"e_1_3_1_181_2","unstructured":"Liunian Harold Li Mark Yatskar Da Yin Cho-Jui Hsieh and Kai-Wei Chang. 2019. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)."},{"key":"e_1_3_1_182_2","unstructured":"Mingzhe Li Xiuying Chen Shen Gao Zhangming Chan Dongyan Zhao and Rui Yan. 2020. VMSMO: Learning to generate multimodal summary for video-based news articles. arXiv preprint arXiv:2010.05406 (2020)."},{"key":"e_1_3_1_183_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01593"},{"key":"e_1_3_1_184_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1210"},{"key":"e_1_3_1_185_2","unstructured":"Qing Li Boqing Gong Yin Cui Dan Kondratyuk and Xianzhi Du. 2021. 
Towards a unified foundation model: Jointly pre-training transformers on unpaired images and text. arXiv preprint arXiv:2112.07074 (2021)."},{"key":"e_1_3_1_186_2","unstructured":"Shuang Li Xavier Puig Yilun Du Clinton Wang Ekin Akyurek Antonio Torralba Jacob Andreas and Igor Mordatch. 2022. Pre-trained language models for interactive decision-making. arXiv preprint arXiv:2202.01771 (2022)."},{"key":"e_1_3_1_187_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i10.17026"},{"key":"e_1_3_1_188_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.353"},{"key":"e_1_3_1_189_2","unstructured":"Paul Pu Liang. 2022. Brainish: Formalizing a multimodal language for intelligence and consciousness. arXiv preprint arXiv:2205.00001 (2022)."},{"key":"e_1_3_1_190_2","volume-title":"NeurIPS","year":"2023","unstructured":"Paul Pu Liang, Yun Cheng, Xiang Fan, Chun Kai Ling, Suzanne Nie, Richard J. Chen, and Zihao Deng. 2023. Quantifying & modeling multimodal interactions: An information decomposition framework. In NeurIPS."},{"key":"e_1_3_1_191_2","volume-title":"NeurIPS","author":"Liang Paul Pu","year":"2023","unstructured":"Paul Pu Liang, Zihao Deng, Martin Ma, James Zou, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2023. Factorized contrastive learning: Going beyond multi-view redundancy. In NeurIPS."},{"key":"e_1_3_1_192_2","volume-title":"ACL\/IJCNLP (1)","year":"2021","unstructured":"Paul Pu Liang, Terrance Liu, Anna Cai, Michal Muszynski, Ryo Ishii, Nicholas Allen, and Randy Auerbach. 2021. Learning language and multimodal privacy-preserving markers of mood from mobile data. In ACL\/IJCNLP (1)."},{"key":"e_1_3_1_193_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1152"},{"key":"e_1_3_1_194_2","volume-title":"ICLR","year":"2023","unstructured":"Paul Pu Liang, Yiwei Lyu, Gunjan Chhablani, Nihal Jain, and Zihao Deng. 2023. MultiViz: Towards visualizing and understanding multimodal models. In ICLR."},{"key":"e_1_3_1_195_2","unstructured":"Paul Pu Liang Yiwei Lyu Xiang Fan Jeffrey Tsaw Yudong Liu Shentong Mo Dani Yogatama Louis-Philippe Morency and Russ Salakhutdinov. 2023. High-modality multimodal transformer: Quantifying modality & interaction heterogeneity for high-modality representation learning. Transactions on Machine Learning Research (2023)."},{"key":"e_1_3_1_196_2","volume-title":"NeurIPS Datasets and Benchmarks Track","year":"2021","unstructured":"Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, and Leslie Yufan Chen. 2021. MultiBench: Multiscale benchmarks for multimodal representation learning. In NeurIPS Datasets and Benchmarks Track."},{"key":"e_1_3_1_197_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475247"},{"key":"e_1_3_1_198_2","first-page":"17612","article-title":"Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning","author":"Liang Victor Weixin","year":"2022","unstructured":"Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y. Zou. 2022. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In NeurIPS. 17612\u201317625.","journal-title":"NeurIPS"},{"key":"e_1_3_1_199_2","unstructured":"Valerii Likhosherstov Mostafa Dehghani Anurag Arnab Krzysztof Marcin Choromanski Mario Lucic Yi Tay and Adrian Weller. 2022. 
PolyViT: Co-training Vision Transformers on Images Videos and Audio."},{"key":"e_1_3_1_200_2","doi-asserted-by":"crossref","unstructured":"Bryan Lim Sercan \u00d6 Ar\u0131k Nicolas Loeff and Tomas Pfister. 2021. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting 37 4 (2021) 1748\u20131764.","DOI":"10.1016\/j.ijforecast.2021.03.012"},{"key":"e_1_3_1_201_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1282"},{"key":"e_1_3_1_202_2","doi-asserted-by":"crossref","unstructured":"Jana Lipkova Richard J. Chen Bowen Chen Ming Y. Lu Matteo Barbieri and Daniel Shao. 2022. Artificial intelligence for multimodal data integration in oncology. Cancer Cell (2022).","DOI":"10.1016\/j.ccell.2022.09.012"},{"key":"e_1_3_1_203_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-long.215"},{"key":"e_1_3_1_204_2","unstructured":"Haotian Liu Chunyuan Li Qingyang Wu and Yong Jae Lee. 2023. Visual instruction tuning. arXiv preprint arXiv:2304.08485 (2023)."},{"key":"e_1_3_1_205_2","doi-asserted-by":"crossref","unstructured":"Ye Liu Hui Li Alberto Garcia-Duran Mathias Niepert Daniel Onoro-Rubio and David S. Rosenblum. 2019. MMKG: Multi-modal knowledge graphs. In ESWC. Springer 459\u2013474.","DOI":"10.1007\/978-3-030-21348-0_30"},{"key":"e_1_3_1_206_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1209"},{"key":"e_1_3_1_207_2","doi-asserted-by":"publisher","DOI":"10.1109\/WD.2008.4812899"},{"key":"e_1_3_1_208_2","first-page":"13","volume-title":"NeurIPS","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS. 13\u201323."},{"key":"e_1_3_1_209_2","unstructured":"Kevin Lu Aditya Grover Pieter Abbeel and Igor Mordatch. 2021. Pretrained transformers as universal computation engines. arXiv preprint arXiv:2103.05247 (2021)."},{"key":"e_1_3_1_210_2","volume-title":"NeurIPS","author":"Lu Pan","year":"2022","unstructured":"Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS."},{"key":"e_1_3_1_211_2","unstructured":"Yadong Lu Chunyuan Li Haotian Liu Jianwei Yang Jianfeng Gao and Yelong Shen. 2023. An empirical study of scaling instruct-tuned large multimodal models. arXiv preprint arXiv:2309.09958 (2023)."},{"key":"e_1_3_1_212_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2019\/880"},{"key":"e_1_3_1_213_2","doi-asserted-by":"crossref","unstructured":"Yiwei Lyu Paul Pu Liang Zihao Deng Ruslan Salakhutdinov and Louis-Philippe Morency. 2022. DIME: Fine-grained interpretations of multimodal models via disentangled local explanations. arXiv preprint arXiv:2203.02013 (2022).","DOI":"10.1145\/3514094.3534148"},{"key":"e_1_3_1_214_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01764"},{"key":"e_1_3_1_215_2","unstructured":"Mengmeng Ma Jian Ren Long Zhao Sergey Tulyakov Cathy Wu and Xi Peng. 2021. Smil: Multimodal learning with severely missing modality. arXiv preprint arXiv:2103.05677 (2021)."},{"key":"e_1_3_1_216_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.tins.2005.03.008"},{"key":"e_1_3_1_217_2","unstructured":"T. Soni Madhulatha. 2012. An overview on clustering methods. 
arXiv preprint arXiv:1205.1117 (2012)."},{"key":"e_1_3_1_218_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.778"},{"key":"e_1_3_1_219_2","first-page":"143","volume-title":"NAACL-HLT","author":"Malmaud Jonathan","year":"2015","unstructured":"Jonathan Malmaud, Jonathan Huang, Vivek Rathod, Nicholas Johnston, Andrew Rabinovich, and Kevin Murphy. 2015. What\u2019s Cookin\u2019? Interpreting cooking videos using text, speech and vision. In NAACL-HLT. 143\u2013152."},{"key":"e_1_3_1_220_2","volume-title":"ICLR","author":"Mao Jiayuan","year":"2018","unstructured":"Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. 2018. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In ICLR."},{"key":"e_1_3_1_221_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.9"},{"key":"e_1_3_1_222_2","doi-asserted-by":"crossref","unstructured":"Matthew Marge Carol Espy-Wilson Nigel G. Ward Abeer Alwan Yoav Artzi Mohit Bansal Gil Blankenship Joyce Chai Hal Daum\u00e9 III Debadeepta Dey et\u00a0al. 2022. Spoken language interaction with robots: Recommendations for future research. Computer Speech & Language 71 (2022) 101255.","DOI":"10.1016\/j.csl.2021.101255"},{"key":"e_1_3_1_223_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.10"},{"key":"e_1_3_1_224_2","doi-asserted-by":"publisher","DOI":"10.1108\/00220410310506303"},{"key":"e_1_3_1_225_2","volume-title":"AISTATS","author":"Mazzetto Alessio","year":"2021","unstructured":"Alessio Mazzetto, Dylan Sam, Andrew Park, Eli Upfal, and Stephen Bach. 2021. Semi-supervised aggregation of dependent weak supervision sources with performance guarantees. In AISTATS."},{"key":"e_1_3_1_226_2","volume-title":"SIGIR","author":"Mekhaldi Dalila","year":"2007","unstructured":"Dalila Mekhaldi. 2007. Multimodal document alignment: Towards a fully-indexed multimedia archive. In SIGIR."},{"key":"e_1_3_1_227_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1084"},{"key":"e_1_3_1_228_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-57321-8_2"},{"key":"e_1_3_1_229_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_24"},{"key":"e_1_3_1_230_2","doi-asserted-by":"publisher","DOI":"10.1145\/219717.219748"},{"key":"e_1_3_1_231_2","first-page":"71","volume-title":"Advances in Computers","author":"Mitrovi\u0107 Dalibor","year":"2010","unstructured":"Dalibor Mitrovi\u0107, Matthias Zeppelzauer, and Christian Breiteneder. 2010. Features for content-based audio retrieval. In Advances in Computers. Vol. 78. Elsevier, 71\u2013150."},{"key":"e_1_3_1_232_2","unstructured":"Shentong Mo Paul Pu Liang Russ Salakhutdinov and Louis-Philippe Morency. 2023. MultiIoT: Towards large-scale multisensory learning for the Internet of Things. arXiv:2311.06217. Retrieved from https:\/\/arxiv.org\/abs\/2311.06217"},{"key":"e_1_3_1_233_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2015.7351387"},{"key":"e_1_3_1_234_2","doi-asserted-by":"crossref","unstructured":"Ghulam Muhammad Fatima Alshehri Fakhri Karray and Abdulmotaleb El Saddik. 2021. A comprehensive survey on multimodal medical signals fusion for smart healthcare systems. 
Information Fusion 76 1 (2021) 355\u2013375.","DOI":"10.1016\/j.inffus.2021.06.007"},{"key":"e_1_3_1_235_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00020"},{"key":"e_1_3_1_236_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.232"},{"key":"e_1_3_1_237_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cortex.2017.07.006"},{"key":"e_1_3_1_238_2","doi-asserted-by":"publisher","DOI":"10.1109\/MMUL.2006.63"},{"key":"e_1_3_1_239_2","doi-asserted-by":"publisher","DOI":"10.1613\/jair.1.11263"},{"key":"e_1_3_1_240_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2955637"},{"key":"e_1_3_1_241_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01975"},{"key":"e_1_3_1_242_2","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pcbi.1007677"},{"key":"e_1_3_1_243_2","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2015.2483592"},{"key":"e_1_3_1_244_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01251"},{"key":"e_1_3_1_245_2","doi-asserted-by":"crossref","unstructured":"Zeljko Obrenovic and Dusan Starcevic. 2004. Modeling multimodal human-computer interaction. Computer 37 9 (2004) 65\u201372.","DOI":"10.1109\/MC.2004.139"},{"key":"e_1_3_1_246_2","first-page":"3918","volume-title":"ICML","year":"2018","unstructured":"Aaron Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, and Oriol Vinyals. 2018. Parallel wavenet: Fast high-fidelity speech synthesis. In ICML. PMLR, 3918\u20133926."},{"key":"e_1_3_1_247_2","unstructured":"OpenAI. 2023. GPT-4 technical report. arXiv:2303.08774. Retrieved from https:\/\/arxiv.org\/abs\/2303.08774"},{"key":"e_1_3_1_248_2","doi-asserted-by":"publisher","DOI":"10.1007\/s13735-019-00187-6"},{"key":"e_1_3_1_249_2","doi-asserted-by":"publisher","DOI":"10.1145\/319382.319398"},{"key":"e_1_3_1_250_2","doi-asserted-by":"crossref","unstructured":"Dinesh K. Pai. 2005. Multisensory interaction: Real and Virtual. In Robotics Research. The Eleventh International Symposium Paolo Dario and Raja Chatila (Eds.). Springer Berlin Heidelberg Berlin Heidelberg 489\u2013498.","DOI":"10.1007\/11008941_52"},{"key":"e_1_3_1_251_2","doi-asserted-by":"crossref","unstructured":"Shruti Palaskar Jindrich Libovick\u00fd Spandana Gella and Florian Metze. 2019. Multimodal abstractive summarization for how2 videos. arXiv preprint arXiv:1906.07901 (2019).","DOI":"10.18653\/v1\/P19-1659"},{"key":"e_1_3_1_252_2","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2003.817122"},{"key":"e_1_3_1_253_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00915"},{"key":"e_1_3_1_254_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58558-7_30"},{"key":"e_1_3_1_255_2","doi-asserted-by":"publisher","DOI":"10.1126\/science.283.5406.1272"},{"key":"e_1_3_1_256_2","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511803161"},{"key":"e_1_3_1_257_2","doi-asserted-by":"publisher","DOI":"10.1098\/rstb.2009.0186"},{"key":"e_1_3_1_258_2","doi-asserted-by":"publisher","unstructured":"Catherine Pelachaud Carlos Busso and Dirk Heylen. 2021. Multimodal behavior modeling for socially interactive agents (1 ed.). Association for Computing Machinery New York NY USA 259\u2013310. 10.1145\/3477322.3477331","DOI":"10.1145\/3477322.3477331"},{"key":"e_1_3_1_259_2","doi-asserted-by":"publisher","DOI":"10.1145\/3382507.3421165"},{"key":"e_1_3_1_260_2","unstructured":"Zhiliang Peng Wenhui Wang Li Dong Yaru Hao Shaohan Huang Shuming Ma and Furu Wei. 2023. Kosmos-2: Grounding multimodal large language models to the world. 
arXiv preprint arXiv:2306.14824 (2023)."},{"key":"e_1_3_1_261_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00713"},{"key":"e_1_3_1_262_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1359"},{"key":"e_1_3_1_263_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33016892"},{"key":"e_1_3_1_264_2","doi-asserted-by":"publisher","DOI":"10.5555\/265013"},{"key":"e_1_3_1_265_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.303"},{"key":"e_1_3_1_266_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2017.02.003"},{"key":"e_1_3_1_267_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV51458.2022.00062"},{"key":"e_1_3_1_268_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00481"},{"key":"e_1_3_1_269_2","first-page":"8748","volume-title":"ICML","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, and Sandhini Agarwal. 2021. Learning transferable visual models from natural language supervision. In ICML. PMLR, 8748\u20138763."},{"key":"e_1_3_1_270_2","unstructured":"Alec Radford Jeff Wu Rewon Child David Luan Dario Amodei and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. (2019)."},{"key":"e_1_3_1_271_2","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 140 (2020), 1\u201367.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_1_272_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.214"},{"key":"e_1_3_1_273_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46478-7_21"},{"key":"e_1_3_1_274_2","first-page":"8821","volume-title":"ICML","author":"Ramesh Aditya","year":"2021","unstructured":"Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In ICML. PMLR, 8821\u20138831."},{"key":"e_1_3_1_275_2","doi-asserted-by":"publisher","DOI":"10.1145\/1873951.1873987"},{"key":"e_1_3_1_276_2","unstructured":"Scott Reed Konrad Zolna Emilio Parisotto Sergio G\u00f3mez Colmenarejo Alexander Novikov Gabriel Barth-Maron Mai Gim\u00e9nez Yury Sulsky et\u00a0al. 2022. One Model to Learn Them All. Deepmind Technical Report."},{"key":"e_1_3_1_277_2","article-title":"Fastspeech: Fast, robust and controllable text to speech","author":"Ren Yi","year":"2019","unstructured":"Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. Fastspeech: Fast, robust and controllable text to speech. In NeurIPS.","journal-title":"NeurIPS"},{"key":"e_1_3_1_278_2","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939778"},{"key":"e_1_3_1_279_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_1_280_2","doi-asserted-by":"crossref","unstructured":"Candace Ross Boris Katz and Andrei Barbu. 2020. Measuring social biases in grounded vision and language embeddings. arXiv preprint arXiv:2002.08911 (2020).","DOI":"10.18653\/v1\/2021.naacl-main.78"},{"key":"e_1_3_1_281_2","doi-asserted-by":"publisher","DOI":"10.1006\/jvci.1999.0413"},{"key":"e_1_3_1_282_2","doi-asserted-by":"crossref","unstructured":"Natalie Ruiz Ronnie Taib and Fang Chen. 2006. 
Examining the redundancy of multimodal input. In Proceedings of the 18th Australia conference on Computer-Human Interaction: Design: Activities Artefacts and Environments. 389\u2013392.","DOI":"10.1145\/1228175.1228254"},{"key":"e_1_3_1_283_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV.2017.115"},{"key":"e_1_3_1_284_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6399"},{"key":"e_1_3_1_285_2","first-page":"3070","article-title":"Multimodal graph networks for compositional generalization in visual question answering","author":"Saqur Raeid","year":"2020","unstructured":"Raeid Saqur and Karthik Narasimhan. 2020. Multimodal graph networks for compositional generalization in visual question answering. In NeurIPS. 3070\u20133081.","journal-title":"NeurIPS"},{"key":"e_1_3_1_286_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2007.906583"},{"key":"e_1_3_1_287_2","first-page":"9339","volume-title":"ICCV","year":"2019","unstructured":"Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, and Jitendra Malik. 2019. Habitat: A platform for embodied ai research. In ICCV. 9339\u20139347."},{"key":"e_1_3_1_288_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNN.2008.2005605"},{"key":"e_1_3_1_289_2","doi-asserted-by":"publisher","DOI":"10.1145\/2733373.2809933"},{"key":"e_1_3_1_290_2","doi-asserted-by":"publisher","DOI":"10.1145\/3242969.3242985"},{"key":"e_1_3_1_291_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.74"},{"key":"e_1_3_1_292_2","doi-asserted-by":"publisher","DOI":"10.3389\/fnbot.2019.00053"},{"key":"e_1_3_1_293_2","unstructured":"Luciano Serafini and Artur d\u2019Avila Garcez. 2016. Logic tensor networks: Deep learning and logical reasoning from data and knowledge. arXiv preprint arXiv:1606.04422 (2016)."},{"key":"e_1_3_1_294_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-61807-4"},{"key":"e_1_3_1_295_2","doi-asserted-by":"publisher","DOI":"10.1002\/j.1538-7305.1948.tb01338.x"},{"key":"e_1_3_1_296_2","volume-title":"SemEval","year":"2020","unstructured":"Chhavi Sharma, William Paka, Scott, Deepesh Bhageria, Amitava Das, and Soujanya Poria. 2020. Task report: Memotion analysis 1.0 @SemEval 2020: The visuo-lingual metaphor!. In SemEval."},{"key":"e_1_3_1_297_2","doi-asserted-by":"crossref","unstructured":"Rajeev Sharma Vladimir I. Pavlovic and Thomas S. Huang. 1998. Toward multimodal human-computer interface. Proc. IEEE 86 5 (1998) 853\u2013869.","DOI":"10.1109\/5.664275"},{"key":"e_1_3_1_298_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1339"},{"key":"e_1_3_1_299_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2016.2598561"},{"key":"e_1_3_1_300_2","article-title":"Variational mixture-of-experts autoencoders for multimodal deep generative models","author":"Shi Yuge","year":"2019","unstructured":"Yuge Shi, Brooks Paige, and Philip Torr. 2019. Variational mixture-of-experts autoencoders for multimodal deep generative models. In NeurIPS.","journal-title":"NeurIPS"},{"key":"e_1_3_1_301_2","unstructured":"Karen Simonyan Andrea Vedaldi and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)."},{"key":"e_1_3_1_302_2","first-page":"74","volume-title":"Proceedings of ICML Workshop on Learning with Multiple Views.","volume":"2005","author":"Sindhwani Vikas","year":"2005","unstructured":"Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. 2005. 
A co-regularization approach to semi-supervised learning with multiple views. In Proceedings of ICML Workshop on Learning with Multiple Views. Vol. 2005, Citeseer, 74\u201379."},{"key":"e_1_3_1_303_2","unstructured":"Uriel Singer Adam Polyak Thomas Hayes Xi Yin Jie An Songyang Zhang and Qiyuan Hu. 2022. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)."},{"key":"e_1_3_1_304_2","doi-asserted-by":"crossref","unstructured":"Amanpreet Singh Ronghang Hu Vedanuj Goswami and Guillaume Couairon. 2021. FLAVA: A foundational language and vision alignment model. arXiv preprint arXiv:2112.04482 (2021).","DOI":"10.1109\/CVPR52688.2022.01519"},{"key":"e_1_3_1_305_2","article-title":"Zero-shot learning through cross-modal transfer","author":"Socher Richard","year":"2013","unstructured":"Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. In NeurIPS.","journal-title":"NeurIPS"},{"key":"e_1_3_1_306_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2017.08.003"},{"key":"e_1_3_1_307_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2021.107408"},{"key":"e_1_3_1_308_2","doi-asserted-by":"publisher","DOI":"10.1145\/1390156.1390282"},{"key":"e_1_3_1_309_2","unstructured":"Karthik Sridharan and Sham M. Kakade. 2008. An information theoretic framework for multi-view learning. (2008)."},{"key":"e_1_3_1_310_2","doi-asserted-by":"crossref","unstructured":"Tejas Srinivasan and Yonatan Bisk. 2021. Worst of both worlds: Biases compound in pre-trained vision-and-language models. arXiv preprint arXiv:2104.08666 (2021).","DOI":"10.18653\/v1\/2022.gebnlp-1.10"},{"key":"e_1_3_1_311_2","unstructured":"Bing Su Dazhao Du Zhao Yang Yujie Zhou Jiangmeng Li and Anyi Rao. 2022. A molecular multimodal foundation model associating molecule graphs with natural language. arXiv preprint arXiv:2209.05481 (2022)."},{"key":"e_1_3_1_312_2","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242667"},{"key":"e_1_3_1_313_2","unstructured":"Alane Suhr and Yoav Artzi. 2019. NLVR2 visual bias analysis. arXiv preprint arXiv:1909.10411 (2019)."},{"key":"e_1_3_1_314_2","article-title":"Multimodal engagement analysis from facial videos in the classroom","author":"S\u00fcmer \u00d6mer","year":"2021","unstructured":"\u00d6mer S\u00fcmer, Patricia Goldberg, Sidney D\u2019Mello, Peter Gerjets, Ulrich Trautwein, and Enkelejda Kasneci. 2021. Multimodal engagement analysis from facial videos in the classroom. IEEE Trans. on Affective Computing 14, 2 (2021), 1012\u20131027.","journal-title":"IEEE Trans. on Affective Computing"},{"key":"e_1_3_1_315_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00756"},{"key":"e_1_3_1_316_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00521-013-1362-6"},{"key":"e_1_3_1_317_2","volume-title":"Reinforcement Learning: An Introduction","author":"Sutton Richard S.","year":"2018","unstructured":"Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press."},{"key":"e_1_3_1_318_2","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073640"},{"key":"e_1_3_1_319_2","doi-asserted-by":"crossref","unstructured":"Riko Suzuki Hitomi Yanaka Masashi Yoshikawa Koji Mineshima and Daisuke Bekki. 2019. Multimodal logical inference system for visual-textual entailment. 
arXiv preprint arXiv:1906.03952 (2019).","DOI":"10.18653\/v1\/P19-2054"},{"key":"e_1_3_1_320_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00687"},{"key":"e_1_3_1_321_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1514"},{"key":"e_1_3_1_322_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.162"},{"key":"e_1_3_1_323_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2020.102277"},{"key":"e_1_3_1_324_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298792"},{"key":"e_1_3_1_325_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.501"},{"key":"e_1_3_1_326_2","first-page":"3477","volume-title":"IJCAI","author":"Thomason Jesse","year":"2016","unstructured":"Jesse Thomason, Jivko Sinapov, Maxwell Svetlik, Peter Stone, and Raymond J. Mooney. 2016. Learning multi-modal grounded linguistic semantics by playing \u201cI Spy\u201d. In IJCAI. 3477\u20133483."},{"key":"e_1_3_1_327_2","unstructured":"Bruce Thompson. 2000. Canonical correlation analysis. (2000)."},{"key":"e_1_3_1_328_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00517"},{"key":"e_1_3_1_329_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58621-8_45"},{"key":"e_1_3_1_330_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58580-8_26"},{"key":"e_1_3_1_331_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01216-8_16"},{"key":"e_1_3_1_332_2","first-page":"6827","article-title":"What makes for good views for contrastive learning?","author":"Tian Yonglong","year":"2020","unstructured":"Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. 2020. What makes for good views for contrastive learning? In NeurIPS. 6827\u20136839.","journal-title":"NeurIPS"},{"key":"e_1_3_1_333_2","volume-title":"ALT","author":"Tosh Christopher","year":"2021","unstructured":"Christopher Tosh, Akshay Krishnamurthy, and Daniel Hsu. 2021. Contrastive learning, multi-view redundancy, and linear models. In ALT."},{"key":"e_1_3_1_334_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.528"},{"key":"e_1_3_1_335_2","doi-asserted-by":"publisher","DOI":"10.21105\/joss.03249"},{"key":"e_1_3_1_336_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2710047"},{"key":"e_1_3_1_337_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1656"},{"key":"e_1_3_1_338_2","article-title":"Learning factorized multimodal representations","author":"Tsai Yao-Hung Hubert","year":"2019","unstructured":"Yao-Hung Hubert Tsai, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Learning factorized multimodal representations. In ICLR (2019).","journal-title":"ICLR"},{"key":"e_1_3_1_339_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.143"},{"key":"e_1_3_1_340_2","volume-title":"ICLR","author":"Tsai Yao-Hung Hubert","year":"2020","unstructured":"Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and Louis-Philippe Morency. 2020. Self-supervised learning from a multi-view perspective. In ICLR."},{"key":"e_1_3_1_341_2","volume-title":"ICLR","author":"Tsang Michael","year":"2018","unstructured":"Michael Tsang, Dehua Cheng, and Yan Liu. 2018. Detecting statistical interactions from neural network weights. In ICLR."},{"key":"e_1_3_1_342_2","article-title":"Multimodal few-shot learning with frozen language models","author":"Tsimpoukelli Maria","year":"2021","unstructured":"Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Eslami, Oriol Vinyals, and Felix Hill. 2021. 
Multimodal few-shot learning with frozen language models. In NeurIPS.","journal-title":"NeurIPS"},{"key":"e_1_3_1_343_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2013.07.003"},{"key":"e_1_3_1_344_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-005-0913-1"},{"key":"e_1_3_1_345_2","unstructured":"Len Unsworth and Chris Cl\u00e9irigh. 2014. Multimodality and Reading: The Construction of Meaning through Image-Text Interaction. Routledge."},{"key":"e_1_3_1_346_2","doi-asserted-by":"crossref","unstructured":"Shagun Uppal Sarthak Bhagat Devamanyu Hazarika and Navonil Majumder. 2022. Multimodal research in vision and language: A review of current and emerging trends. Information Fusion 77 1 (2022) 149\u2013171.","DOI":"10.1016\/j.inffus.2021.07.009"},{"key":"e_1_3_1_347_2","doi-asserted-by":"publisher","DOI":"10.1145\/1943403.1943412"},{"key":"e_1_3_1_348_2","unstructured":"Aaron Van Den Oord and Oriol Vinyals. 2017. Neural discrete representation learning. NeurIPS 30 (2017)."},{"key":"e_1_3_1_349_2","article-title":"Analyzing differentiable fuzzy logic operators","author":"Krieken Emile van","year":"2022","unstructured":"Emile van Krieken, Erman Acar, and Frank van Harmelen. 2022. Analyzing differentiable fuzzy logic operators. Artificial Intelligence (2022).","journal-title":"Artificial Intelligence"},{"key":"e_1_3_1_350_2","first-page":"5998","volume-title":"NeurIPS","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS. 5998\u20136008."},{"key":"e_1_3_1_351_2","first-page":"6428","volume-title":"ICML","author":"Vedantam Ramakrishna","year":"2019","unstructured":"Ramakrishna Vedantam, Karan Desai, Stefan Lee, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. 2019. Probabilistic neural symbolic models for interpretable visual question answering. In ICML. PMLR, 6428\u20136437."},{"key":"e_1_3_1_352_2","volume-title":"ICLR","author":"Veli\u010dkovi\u0107 Petar","year":"2018","unstructured":"Petar Veli\u010dkovi\u0107, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li\u00f2, and Yoshua Bengio. 2018. Graph attention networks. In ICLR."},{"key":"e_1_3_1_353_2","unstructured":"Ivan Vendrov Ryan Kiros Sanja Fidler and Raquel Urtasun. 2015. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361 (2015)."},{"key":"e_1_3_1_354_2","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2010.939739"},{"key":"e_1_3_1_355_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-71050-9"},{"key":"e_1_3_1_356_2","doi-asserted-by":"crossref","unstructured":"Oriol Vinyals Alexander Toshev Samy Bengio and Dumitru Erhan. 2016. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 4 (2016) 652\u2013663.","DOI":"10.1109\/TPAMI.2016.2587640"},{"key":"e_1_3_1_357_2","volume-title":"ICLR","author":"Wan Alvin","year":"2020","unstructured":"Alvin Wan, Lisa Dunlap, Daniel Ho, Jihan Yin, Scott Lee, Suzanne Petryk, Sarah Adel Bargal, and Joseph E. Gonzalez. 2020. NBDT: Neural-backed decision tree. 
In ICLR."},{"key":"e_1_3_1_358_2","doi-asserted-by":"publisher","DOI":"10.1145\/3472749.3474765"},{"key":"e_1_3_1_359_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2012.2207397"},{"key":"e_1_3_1_360_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403234"},{"key":"e_1_3_1_361_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW50498.2020.00197"},{"key":"e_1_3_1_362_2","first-page":"1083","volume-title":"ICML","author":"Wang Weiran","year":"2015","unstructured":"Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. 2015. On deep multi-view representation learning. In ICML. PMLR, 1083\u20131092."},{"key":"e_1_3_1_363_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01271"},{"key":"e_1_3_1_364_2","doi-asserted-by":"crossref","unstructured":"Xingbo Wang Jianben He Zhihua Jin Muqiao Yang Yong Wang and Huamin Qu. 2021. M2Lens: Visualizing and explaining multimodal models for sentiment analysis. IEEE Transactions on Visualization and Computer Graphics 28 1 (2021) 802\u2013812.","DOI":"10.1109\/TVCG.2021.3114794"},{"key":"e_1_3_1_365_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00679"},{"key":"e_1_3_1_366_2","doi-asserted-by":"publisher","DOI":"10.1109\/FG.2018.00015"},{"key":"e_1_3_1_367_2","doi-asserted-by":"crossref","unstructured":"Alex Wilf Qianli M. Ma Paul Pu Liang Amir Zadeh and Louis-Philippe Morency. 2022. Face-to-face contrastive learning for social intelligence question-answering. arXiv preprint arXiv:2208.01036 (2022).","DOI":"10.1109\/FG57933.2023.10042612"},{"key":"e_1_3_1_368_2","unstructured":"Paul L. Williams and Randall D. Beer. 2010. Nonnegative decomposition of multivariate information. arXiv preprint arXiv:1004.2515 (2010)."},{"key":"e_1_3_1_369_2","volume-title":"ICML","author":"Wong Eric","year":"2021","unstructured":"Eric Wong, Shibani Santurkar, and Aleksander Madry. 2021. Leveraging sparse linear layers for debuggable deep networks. In ICML."},{"key":"e_1_3_1_370_2","first-page":"7623","volume-title":"ICCV","year":"2023","unstructured":"Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, and Wynne Hsu. 2023. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV. 7623\u20137633."},{"key":"e_1_3_1_371_2","article-title":"Multimodal generative models for scalable weakly-supervised learning","author":"Wu Mike","year":"2018","unstructured":"Mike Wu and Noah Goodman. 2018. Multimodal generative models for scalable weakly-supervised learning. In NeurIPS.","journal-title":"NeurIPS"},{"key":"e_1_3_1_372_2","unstructured":"Nan Wu Stanis\u0142aw Jastrz\u0119bski Kyunghyun Cho and Krzysztof J. Geras. 2022. Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks. arXiv preprint arXiv:2202.05306 (2022)."},{"key":"e_1_3_1_373_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.500"},{"key":"e_1_3_1_374_2","unstructured":"Xindi Wu Zhiwei Deng and Olga Russakovsky. 2023. Multimodal dataset distillation for image-text retrieval. arXiv preprint arXiv:2308.07545 (2023)."},{"key":"e_1_3_1_375_2","doi-asserted-by":"crossref","unstructured":"Yi Xiao Felipe Codevilla Akhil Gurram Onay Urfalioglu and Antonio M. L\u00f3pez. 2020. Multimodal end-to-end autonomous driving. 
IEEE Transactions on Intelligent Transportation Systems 23 1 (2020) 537\u2013547.","DOI":"10.1109\/TITS.2020.3013234"},{"key":"e_1_3_1_376_2","volume-title":"NeurIPS","author":"Xing Chen","year":"2019","unstructured":"Chen Xing, Negar Rostamzadeh, Boris Oreshkin, and Pedro O O. Pinheiro. 2019. Adaptive cross-modal few-shot learning. In NeurIPS."},{"key":"e_1_3_1_377_2","volume-title":"ICML","author":"Xiong Caiming","year":"2016","unstructured":"Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In ICML."},{"key":"e_1_3_1_378_2","doi-asserted-by":"crossref","unstructured":"Chang Xu Dacheng Tao and Chao Xu. 2015. Multi-view intact space learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 12 (2015) 2531\u20132544.","DOI":"10.1109\/TPAMI.2015.2417578"},{"key":"e_1_3_1_379_2","unstructured":"Fangli Xu Lingfei Wu K. P. Thai Carol Hsu Wei Wang and Richard Tong. 2019. MUTLA: A large-scale dataset for multimodal teaching and learning analytics. arXiv preprint arXiv:1910.06078 (2019)."},{"key":"e_1_3_1_380_2","first-page":"2048","volume-title":"ICML","author":"Xu Kelvin","year":"2015","unstructured":"Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML. 2048\u20132057."},{"key":"e_1_3_1_381_2","doi-asserted-by":"crossref","unstructured":"Peng Xu Xiatian Zhu and David A. Clifton. 2023. Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 45 10 (2023) 12113\u201312132.","DOI":"10.1109\/TPAMI.2023.3275156"},{"key":"e_1_3_1_382_2","unstructured":"Zhen Xu David R. So and Andrew M. Dai. 2021. MUFASA: Multimodal fusion architecture search for electronic health records. arXiv preprint arXiv:2102.02340 (2021)."},{"key":"e_1_3_1_383_2","unstructured":"Zihui Xue Zhengqi Gao Sucheng Ren and Hang Zhao. 2022. The modality focusing hypothesis: Towards understanding crossmodal knowledge distillation. arXiv preprint arXiv:2206.06487 (2022)."},{"key":"e_1_3_1_384_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00089"},{"key":"e_1_3_1_385_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2021.03.090"},{"key":"e_1_3_1_386_2","volume-title":"NAACL-HLT","year":"2021","unstructured":"Jianing Yang, Yongxin Wang, Ruitao Yi, Yuying Zhu, Azaan Rehman, and Amir Zadeh. 2021. MTAG: Modal-temporal attention graph for unaligned human multimodal language sequences. In NAACL-HLT."},{"key":"e_1_3_1_387_2","first-page":"270","volume-title":"GIS","author":"Yang Yi","year":"2010","unstructured":"Yi Yang and Shawn Newsam. 2010. Bag-of-visual-words and spatial extensions for land-use classification. In GIS. 270\u2013279."},{"key":"e_1_3_1_388_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2019\/568"},{"key":"e_1_3_1_389_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2018\/416"},{"key":"e_1_3_1_390_2","first-page":"20744","article-title":"Webshop: Towards scalable real-world web interaction with grounded language agents","author":"Yao Shunyu","year":"2022","unstructured":"Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. Webshop: Towards scalable real-world web interaction with grounded language agents. In NeurIPS. 
20744\u201320757.","journal-title":"NeurIPS"},{"key":"e_1_3_1_391_2","unstructured":"Kexin Yi Chuang Gan Yunzhu Li Pushmeet Kohli Jiajun Wu Antonio Torralba and Joshua B. Tenenbaum. 2019. CLEVRER: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442 (2019)."},{"key":"e_1_3_1_392_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.273"},{"key":"e_1_3_1_393_2","doi-asserted-by":"crossref","unstructured":"Peter Young Alice Lai Micah Hodosh and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL 2 1 (2014) 67\u201378.","DOI":"10.1162\/tacl_a_00166"},{"key":"e_1_3_1_394_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.326"},{"key":"e_1_3_1_395_2","volume-title":"NeurIPS","author":"Yu Weijiang","year":"2019","unstructured":"Weijiang Yu, Jingwen Zhou, Weihao Yu, Xiaodan Liang, and Nong Xiao. 2019. Heterogeneous graph learning for visual commonsense reasoning. In NeurIPS."},{"key":"e_1_3_1_396_2","doi-asserted-by":"crossref","unstructured":"Jiahong Yuan Mark Liberman et\u00a0al. 2008. Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America 123 5 (2008) 3878.","DOI":"10.1121\/1.2935783"},{"key":"e_1_3_1_397_2","doi-asserted-by":"crossref","unstructured":"Amir Zadeh Minghai Chen Soujanya Poria Erik Cambria and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250 (2017).","DOI":"10.18653\/v1\/D17-1115"},{"key":"e_1_3_1_398_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12021"},{"key":"e_1_3_1_399_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2020.06.001"},{"key":"e_1_3_1_400_2","unstructured":"Amir Zadeh Rowan Zellers Eli Pincus and Louis-Philippe Morency. 2016. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259 (2016)."},{"key":"e_1_3_1_401_2","volume-title":"ACL","author":"Zadeh AmirAli Bagher","year":"2018","unstructured":"AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In ACL."},{"key":"e_1_3_1_402_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00391"},{"key":"e_1_3_1_403_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00688"},{"key":"e_1_3_1_404_2","article-title":"Merlot: Multimodal neural script knowledge models","author":"Zellers Rowan","year":"2021","unstructured":"Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. 2021. Merlot: Multimodal neural script knowledge models. In NeurIPS.","journal-title":"NeurIPS"},{"key":"e_1_3_1_405_2","unstructured":"Andy Zeng Adrian Wong Stefan Welker Krzysztof Choromanski and Federico Tombari. 2022. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598 (2022)."},{"key":"e_1_3_1_406_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.11238"},{"issue":"16","key":"e_1_3_1_407_2","first-page":"1","article-title":"Multimodal deep representation learning for protein interaction identification and protein family classification","volume":"20","author":"Zhang Da","year":"2019","unstructured":"Da Zhang and Mansur Kabuka. 2019. 
Multimodal deep representation learning for protein interaction identification and protein family classification. BMC Bioinformatics 20, 16 (2019), 1\u201314.","journal-title":"BMC Bioinformatics"},{"key":"e_1_3_1_408_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1169"},{"key":"e_1_3_1_409_2","doi-asserted-by":"publisher","DOI":"10.1145\/3617680"},{"key":"e_1_3_1_410_2","doi-asserted-by":"publisher","DOI":"10.1109\/89.917689"},{"key":"e_1_3_1_411_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2019.08.009"},{"key":"e_1_3_1_412_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01064"},{"key":"e_1_3_1_413_2","unstructured":"Shuyan Zhou Frank F. Xu Hao Zhu Xuhui Zhou Robert Lo Abishek Sridhar and Xianyi Cheng. 2023. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854 (2023)."},{"key":"e_1_3_1_414_2","unstructured":"Deyao Zhu Jun Chen Xiaoqian Shen Xiang Li and Mohamed Elhoseiny. 2023. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)."},{"key":"e_1_3_1_415_2","first-page":"2362","volume-title":"IJCAI","author":"Zhu Hao","year":"2021","unstructured":"Hao Zhu, Huaibo Huang, Yi Li, Aihua Zheng, and Ran He. 2021. Arbitrary talking face generation via attentional audio-visual coherence learning. In IJCAI. 2362\u20132368."},{"key":"e_1_3_1_416_2","unstructured":"Xiangru Zhu Zhixu Li Xiaodan Wang Xueyao Jiang Penglei Sun and Xuwu Wang. 2022. Multi-modal knowledge graph construction and application: A survey. arXiv preprint arXiv:2202.05786 (2022)."},{"key":"e_1_3_1_417_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.11"},{"key":"e_1_3_1_418_2","unstructured":"Yuke Zhu Ce Zhang Christopher R\u00e9 and Li Fei-Fei. 2015. Building a large-scale multimodal knowledge base system for answering visual queries. arXiv preprint arXiv:1507.05670 (2015)."},{"key":"e_1_3_1_419_2","unstructured":"Zachary M. Ziegler Luke Melas-Kyriazi Sebastian Gehrmann and Alexander M. Rush. 2019. Encoder-agnostic adaptation for conditional language generation. arXiv preprint arXiv:1908.06938 (2019)."},{"key":"e_1_3_1_420_2","volume-title":"Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology","author":"Zipf George Kingsley","year":"2016","unstructured":"George Kingsley Zipf. 2016. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. 
Ravenio Books."}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3656580","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3656580","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T23:57:31Z","timestamp":1750291051000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3656580"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,6,22]]},"references-count":419,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2024,10,31]]}},"alternative-id":["10.1145\/3656580"],"URL":"https:\/\/doi.org\/10.1145\/3656580","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,6,22]]},"assertion":[{"value":"2023-02-15","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-04-02","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-06-22","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}