{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,22]],"date-time":"2026-04-22T20:05:31Z","timestamp":1776888331287,"version":"3.51.2"},"reference-count":199,"publisher":"Association for Computing Machinery (ACM)","issue":"12","funder":[{"name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","award":["2022.11905.BD."],"award-info":[{"award-number":["2022.11905.BD."]}]},{"name":"JSPS Scientific Research","award":["19K11987"],"award-info":[{"award-number":["19K11987"]}]},{"name":"European Union\u2019s Horizon Europe Research and Innovation Program"},{"name":"Telecommunications and Computer Vision Convergence Tools for Research Infrastructures","award":["101094831"],"award-info":[{"award-number":["101094831"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2025,12,31]]},"abstract":"<jats:p>Audio-visual correlation learning aims at capturing and understanding natural phenomena between audio and visual data. The rapid growth of dl propelled the development of proposals that process audio-visual data and can be observed in the number of proposals in the past years. Thus encouraging the development of a comprehensive survey. Besides analyzing the models used in this context, we also discuss some tasks of definition and paradigm applied in AI multimedia. In addition, we investigate objective functions frequently used and discuss how audio-visual data is exploited in the optimization process, i.e., the different methodologies for representing knowledge in the audio-visual domain. In fact, we focus on how human-understandable mechanisms, i.e., structured knowledge that reflects comprehensible knowledge, can guide the learning process. 
Most importantly, we provide a summary of the recent progress in audio-visual correlation learning (AVCL) and discuss future research directions.<\/jats:p>","DOI":"10.1145\/3696445","type":"journal-article","created":{"date-parts":[[2025,5,14]],"date-time":"2025-05-14T07:28:21Z","timestamp":1747207701000},"page":"1-46","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning"],"prefix":"10.1145","volume":"57","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3640-7019","authenticated-orcid":false,"given":"Lu\u00eds","family":"Vila\u00e7a","sequence":"first","affiliation":[{"name":"INESC TEC","place":["Porto, Portugal"]},{"name":"National Institute of Informatics","place":["Porto, Portugal"]},{"name":"ISEP, Polytechnic of Porto","place":["Porto, Portugal"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0294-6620","authenticated-orcid":false,"given":"Yi","family":"Yu","sequence":"additional","affiliation":[{"name":"Hiroshima University Graduate School of Advanced Science and Engineering","place":["Hiroshima, Japan"]},{"name":"National Institute of Informatics","place":["Hiroshima, Japan"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8447-2360","authenticated-orcid":false,"given":"Paula","family":"Viana","sequence":"additional","affiliation":[{"name":"ISEP, Polytechnic of Porto","place":["Porto, Portugal"]},{"name":"INESC TEC","place":["Porto, Portugal"]}]}],"member":"320","published-online":{"date-parts":[[2025,7,11]]},"reference":[{"key":"e_1_3_4_2_2","unstructured":"Sami Abu-El-Haija Nisarg Kothari Joonseok Lee Paul Natsev George Toderici Balakrishnan Varadarajan and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A Large-Scale Video Classification Benchmark. arxiv:1609.08675. Retrieved from https:\/\/arxiv.org\/abs\/1609.08675"},{"key":"e_1_3_4_3_2","doi-asserted-by":"publisher","unstructured":"Triantafyllos Afouras Yuki M. 
Asano Francois Fagan Andrea Vedaldi and Florian Metze. 2021. Self-supervised object detection from audio-visual correspondence. DOI:10.48550\/ARXIV.2104.06401","DOI":"10.48550\/ARXIV.2104.06401"},{"key":"e_1_3_4_4_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58523-5_13"},{"key":"e_1_3_4_5_2","unstructured":"Hassan Akbari Liangzhe Yuan Rui Qian Wei-Hong Chuang Shih-Fu Chang Yin Cui and Boqing Gong. 2021. VATT: transformers for multimodal self-supervised learning from raw video audio and text. In Proceedings of the 35th International Conference on Neural Information Processing Systems (NIPS\u201921). Curran Associates Inc. Red Hook NY USA Article 1853 16 pages."},{"key":"e_1_3_4_6_2","unstructured":"Jean-Baptiste Alayrac Jeff Donahue Pauline Luc Antoine Miech Iain Barr Yana Hasson Karel Lenc Arthur Mensch Katherine Millican Malcolm Reynolds Roman Ring Eliza Rutherford Serkan Cabi Tengda Han Zhitao Gong Sina Samangooei Marianne Monteiro Jacob L. Menick Sebastian Borgeaud Andy Brock Aida Nematzadeh Sahand Sharifzadeh Miko\u0142 aj Bi\u0144kowski Ricardo Barreira Oriol Vinyals Andrew Zisserman and Kar\u00e9n Simonyan. 2022. Flamingo: A visual language model for few-shot learning. In Advances in Neural Information Processing Systems S. Koyejo S. Mohamed A. Agarwal D. Belgrave K. Cho and A. Oh (Eds.). Vol. 35. Curran Associates Inc. 23716\u201323736. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2022\/file\/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf"},{"key":"e_1_3_4_7_2","first-page":"25","volume-title":"Proceedings of the Advances in Neural Information Processing Systems.","volume":"33","author":"Alayrac Jean-Baptiste","year":"2020","unstructured":"Jean-Baptiste Alayrac, Adria Recasens, Rosalia Schneider, Relja Arandjelovi\u0107, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. 2020. Self-supervised multimodal versatile networks. 
In Proceedings of the Advances in Neural Information Processing Systems. H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33, Curran Associates, Inc., 25\u201337. Retrieved from https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2020\/file\/0060ef47b12160b9198302ebdb144dcf-Paper.pdf"},{"key":"e_1_3_4_8_2","unstructured":"Humam Alwassel Dhruv Mahajan Bruno Korbar Lorenzo Torresani Bernard Ghanem and Du Tran. 2020. Self-supervised learning by cross-modal audio-video clustering. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Vancouver BC Canada) (NIPS\u201920). Curran Associates Inc. Red Hook NY USA Article 818 13 pages."},{"key":"e_1_3_4_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3592097"},{"key":"e_1_3_4_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.73"},{"key":"e_1_3_4_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00676"},{"key":"e_1_3_4_12_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2019.09.106"},{"key":"e_1_3_4_13_2","doi-asserted-by":"publisher","unstructured":"Tadas Baltru\u0161aitis Chaitanya Ahuja and Louis-Philippe Morency. 2019. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 2 (2019) 423\u2013443. 10.1109\/TPAMI.2018.2798607","DOI":"10.1109\/TPAMI.2018.2798607"},{"key":"e_1_3_4_14_2","unstructured":"Hangbo Bao Li Dong Songhao Piao and Furu Wei. 2022. BEiT: BERT Pre-Training of Image Transformers. arxiv:2106.08254. Retrieved from https:\/\/arxiv.org\/abs\/2106.08254"},{"key":"e_1_3_4_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/2911996.2912000"},{"key":"e_1_3_4_16_2","unstructured":"Mathilde Caron Ishan Misra Julien Mairal Priya Goyal Piotr Bojanowski and Armand Joulin. 2020. Unsupervised learning of visual features by contrasting cluster assignments. 
In Proceedings of the 34th International Conference on Neural Information Processing Systems (Vancouver BC Canada) (NIPS\u201920). Curran Associates Inc. Red Hook NY USA Article 831 13 pages."},{"key":"e_1_3_4_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_4_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01659"},{"key":"e_1_3_4_19_2","doi-asserted-by":"publisher","unstructured":"Ke Chen Xingjian Du Bilei Zhu Zejun Ma Taylor Berg-Kirkpatrick and Shlomo Dubnov. 2022. HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection. In ICASSP 2022-2022 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP). 646\u2013650. 10.1109\/ICASSP43922.2022.9746312","DOI":"10.1109\/ICASSP43922.2022.9746312"},{"key":"e_1_3_4_20_2","doi-asserted-by":"publisher","unstructured":"Jing Liu Sihan Chen Xingjian He Longteng Guo Xinxin Zhu Weining Wang and Jinhui Tang. 2025. VALOR: Vision-audio-language omni-perception pretraining model and dataset. IEEE Transactions on Pattern Analysis and Machine Intelligence 47 2 (2025) 708\u2013724. 10.1109\/TPAMI.2024.3479776","DOI":"10.1109\/TPAMI.2024.3479776"},{"key":"e_1_3_4_21_2","unstructured":"Sanyuan Chen Yu Wu Chengyi Wang Shujie Liu Daniel Tompkins Zhuo Chen Wanxiang Che Xiangzhan Yu and Furu Wei. 2023. BEATs: Audio pre-training with acoustic tokenizers. In Proceedings of the 40th International Conference on Machine Learning (Honolulu Hawaii USA) (ICML\u201923). JMLR.org Article 203 16 pages."},{"key":"e_1_3_4_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01065"},{"key":"e_1_3_4_23_2","unstructured":"Ting Chen Simon Kornblith Mohammad Norouzi and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning (ICML\u201920). 
JMLR.org Article 149 11 pages."},{"key":"e_1_3_4_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413869"},{"key":"e_1_3_4_25_2","first-page":"3915","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Chiu Chung-Cheng","year":"2022","unstructured":"Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, and Yonghui Wu. 2022. Self-supervised learning with random-projection quantizer for speech recognition. In Proceedings of the International Conference on Machine Learning. PMLR, 3915\u20133924."},{"key":"e_1_3_4_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00718"},{"key":"e_1_3_4_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00807"},{"key":"e_1_3_4_28_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2022.103406"},{"key":"e_1_3_4_29_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00138-023-01444-9"},{"key":"e_1_3_4_30_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arxiv:1810.04805. Retrieved from https:\/\/arxiv.org\/abs\/1810.04805"},{"key":"e_1_3_4_31_2","unstructured":"Prafulla Dhariwal Heewoo Jun Christine Payne Jong Wook Kim Alec Radford and Ilya Sutskever. 2020. Jukebox: A Generative Model for Music. arxiv:2005.00341. Retrieved from https:\/\/arxiv.org\/abs\/2005.00341"},{"key":"e_1_3_4_32_2","first-page":"8780","volume-title":"Proceedings of the Advances in Neural Information Processing Systems.","volume":"34","author":"Dhariwal Prafulla","year":"2021","unstructured":"Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat GANs on image synthesis. In Proceedings of the Advances in Neural Information Processing Systems. M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34, Curran Associates, Inc., 8780\u20138794. 
Retrieved from https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2021\/file\/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf"},{"key":"e_1_3_4_33_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58558-7_35"},{"key":"e_1_3_4_34_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly Jakob Uszkoreit and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arxiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_4_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV48630.2021.00406"},{"key":"e_1_3_4_36_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58555-6_40"},{"key":"e_1_3_4_37_2","unstructured":"Leonardo A. Fanzeres and Climent Nadeu. 2022. Sound-to-Imagination: An Exploratory Study on Unsupervised Crossmodal Translation Using Diverse Audiovisual Data. arxiv:2106.01266. Retrieved from https:\/\/arxiv.org\/abs\/2106.01266"},{"key":"e_1_3_4_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00331"},{"key":"e_1_3_4_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.213"},{"key":"e_1_3_4_40_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58621-8_44"},{"key":"e_1_3_4_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01049"},{"key":"e_1_3_4_42_2","unstructured":"Ruohan Gao Rogerio Feris and Kristen Grauman. 2018. Learning to Separate Object Sounds by Watching Unlabeled Video. arxiv:1804.01665. Retrieved from https:\/\/arxiv.org\/abs\/1804.01665"},{"key":"e_1_3_4_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00398"},{"key":"e_1_3_4_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7952261"},{"key":"e_1_3_4_45_2","unstructured":"Mariana-Iuliana Georgescu Eduardo Fonseca Radu Tudor Ionescu Mario Lucic Cordelia Schmid and Anurag Arnab. 2024. 
Audiovisual Masked Autoencoders. arxiv:2212.05922. Retrieved from https:\/\/arxiv.org\/abs\/2212.05922"},{"key":"e_1_3_4_46_2","doi-asserted-by":"crossref","unstructured":"Yuan Gong Yu-An Chung and James Glass. 2021. AST: Audio Spectrogram Transformer. arxiv:2104.01778. Retrieved from https:\/\/arxiv.org\/abs\/2104.01778","DOI":"10.21437\/Interspeech.2021-698"},{"key":"e_1_3_4_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/lsp.2022.3224688"},{"key":"e_1_3_4_48_2","unstructured":"Yuan Gong Andrew Rouditchenko Alexander H. Liu David Harwath Leonid Karlinsky Hilde Kuehne and James Glass. 2023. Contrastive Audio-Visual Masked Autoencoder. arxiv:2210.07839. Retrieved from https:\/\/arxiv.org\/abs\/2210.07839"},{"key":"e_1_3_4_49_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i2.16235"},{"key":"e_1_3_4_50_2","unstructured":"Jean-Bastien Grill Florian Strub Florent Altch\u00e9 Corentin Tallec Pierre H. Richemond Elena Buchatskaya Carl Doersch Bernardo Avila Pires Zhaohan Daniel Guo Mohammad Gheshlaghi Azar Bilal Piot Koray Kavukcuoglu R\u00e9mi Munos and Michal Valko. 2020. Bootstrap your own latent: A new approach to self-supervised Learning. arxiv:2006.07733. Retrieved from https:\/\/arxiv.org\/abs\/2006.07733"},{"key":"e_1_3_4_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01043"},{"key":"e_1_3_4_52_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00306"},{"key":"e_1_3_4_53_2","unstructured":"Kaiming He Xinlei Chen Saining Xie Yanghao Li Piotr Doll\u00e1r and Ross Girshick. 2021. Masked Autoencoders Are Scalable Vision Learners. arxiv:2111.06377. 
Retrieved from https:\/\/arxiv.org\/abs\/2111.06377"},{"key":"e_1_3_4_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"e_1_3_4_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350974"},{"key":"e_1_3_4_56_2","doi-asserted-by":"publisher","DOI":"10.3390\/s19061382"},{"key":"e_1_3_4_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240601"},{"key":"e_1_3_4_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/taslp.2021.3122291"},{"key":"e_1_3_4_59_2","unstructured":"Di Hu Zheng Wang Haoyi Xiong Dong Wang Feiping Nie and Dejing Dou. 2020. Curriculum Audiovisual Learning. arxiv:2001.09414. Retrieved from https:\/\/arxiv.org\/abs\/2001.09414"},{"key":"e_1_3_4_60_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2020.10.003"},{"key":"e_1_3_4_61_2","unstructured":"Xixi Hu Ziyang Chen and Andrew Owens. 2022. Mix and Localize: Localizing Sound Sources in Mixtures. arxiv:2211.15058. Retrieved from https:\/\/arxiv.org\/abs\/2211.15058"},{"key":"e_1_3_4_62_2","unstructured":"Jingjia Huang Yinan Li Jiashi Feng Xinglong Wu Xiaoshuai Sun and Rongrong Ji. 2022. Clover: Towards A Unified Video-Language Alignment and Fusion Model. arxiv:2207.07885. Retrieved from https:\/\/arxiv.org\/abs\/2207.07885"},{"key":"e_1_3_4_63_2","unstructured":"Po-Yao Huang Vasu Sharma Hu Xu Chaitanya Ryali Haoqi Fan Yanghao Li Shang-Wen Li Gargi Ghosh Jitendra Malik and Christoph Feichtenhofer. 2023. MAViL: Masked Audio-Video Learners. arxiv:2212.08071. Retrieved from https:\/\/arxiv.org\/abs\/2212.08071"},{"key":"e_1_3_4_64_2","unstructured":"Gabriel Ilharco Yuan Zhang and Jason Baldridge. 2019. Large-scale representation learning from visually grounded untranscribed speech. arxiv:1909.08782. Retrieved from https:\/\/arxiv.org\/abs\/1909.08782"},{"key":"e_1_3_4_65_2","unstructured":"Andrew Jaegle Felix Gimeno Andrew Brock Andrew Zisserman Oriol Vinyals and Joao Carreira. 2021. Perceiver: General Perception with Iterative Attention. 
arxiv:2103.03206. Retrieved from https:\/\/arxiv.org\/abs\/2103.03206"},{"key":"e_1_3_4_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9054137"},{"key":"e_1_3_4_67_2","unstructured":"Aren Jansen Manoj Plakal Richard Channing Moore Shawn Hershey Ratheet Pandya Ryan Rifkin Jiayang Liu and Daniel Ellis. 2020. Unsupervised Learning of Semantic Audio Representations."},{"key":"e_1_3_4_68_2","unstructured":"Kumara Kahatapitiya Anurag Arnab Arsha Nagrani and Michael S. Ryoo. 2024. VicTR: Video-conditioned Text Representations for Activity Recognition. arxiv:2304.02560. Retrieved from https:\/\/arxiv.org\/abs\/2304.02560"},{"key":"e_1_3_4_69_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-68238-5_48"},{"key":"e_1_3_4_70_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2020.3030497"},{"key":"e_1_3_4_71_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00633"},{"key":"e_1_3_4_72_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2022-227"},{"key":"e_1_3_4_73_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP49357.2023.10094745"},{"key":"e_1_3_4_74_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Lee Jun-Tae","year":"2021","unstructured":"Jun-Tae Lee, Mihir Jain, Hyoungwoo Park, and Sungrack Yun. 2021. Cross-attentional audio-visual fusion for weakly-supervised action localization. In Proceedings of the International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=hWr3e3r-oH5"},{"key":"e_1_3_4_75_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU.2015.7404793"},{"key":"e_1_3_4_76_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01315"},{"key":"e_1_3_4_77_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV51458.2022.00087"},{"key":"e_1_3_4_78_2","doi-asserted-by":"publisher","DOI":"10.1109\/tkde.2018.2872063"},{"key":"e_1_3_4_79_2","unstructured":"Zhaohui Li Haitao Wang and Xinghua Jiang. 2023. 
AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes. arxiv:2308.07221. Retrieved from https:\/\/arxiv.org\/abs\/2308.07221"},{"key":"e_1_3_4_80_2","unstructured":"Paul Pu Liang Amir Zadeh and Louis-Philippe Morency. 2023. Foundations and Trends in Multimodal Machine Learning: Principles Challenges and Open Questions. arxiv:2209.03430. Retrieved from https:\/\/arxiv.org\/abs\/2209.03430"},{"key":"e_1_3_4_81_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jvcir.2022.103740"},{"key":"e_1_3_4_82_2","doi-asserted-by":"crossref","unstructured":"Wei Lin Leonid Karlinsky Nina Shvetsova Horst Possegger Mateusz Kozinski Rameswar Panda Rogerio Feris Hilde Kuehne and Horst Bischof. 2023. MAtch eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge. arxiv:2303.08914. Retrieved from https:\/\/arxiv.org\/abs\/2303.08914","DOI":"10.1109\/ICCV51070.2023.00267"},{"key":"e_1_3_4_83_2","volume-title":"Proceedings of the Asian Conference on Computer Vision","author":"Lin Yan-Bo","year":"2020","unstructured":"Yan-Bo Lin and Yu-Chiang Frank Wang. 2020. Audiovisual transformer with instance attention for audio-visual event localization. In Proceedings of the Asian Conference on Computer Vision."},{"key":"e_1_3_4_84_2","unstructured":"Yan-Bo Lin and Yu-Chiang Frank Wang. 2021. Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation. arxiv:2105.00708. Retrieved from https:\/\/arxiv.org\/abs\/2105.00708"},{"key":"e_1_3_4_85_2","unstructured":"Yinhan Liu Myle Ott Naman Goyal Jingfei Du Mandar Joshi Danqi Chen Omer Levy Mike Lewis Luke Zettlemoyer and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arxiv:1907.11692. 
Retrieved from https:\/\/arxiv.org\/abs\/1907.11692"},{"key":"e_1_3_4_86_2","doi-asserted-by":"publisher","DOI":"10.1109\/tip.2022.3142526"},{"key":"e_1_3_4_87_2","unstructured":"Dezhao Luo Chang Liu Yu Zhou Dongbao Yang Can Ma Qixiang Ye and Weiping Wang. 2020. Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning. arxiv:2001.00294. Retrieved from https:\/\/arxiv.org\/abs\/2001.00294"},{"key":"e_1_3_4_88_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612132"},{"key":"e_1_3_4_89_2","unstructured":"Shuang Ma Zhaoyang Zeng Daniel McDuff and Yale Song. 2021. Active Contrastive Learning of Audio-Visual Video Representations. arxiv:2009.09805. Retrieved from https:\/\/arxiv.org\/abs\/2009.09805"},{"key":"e_1_3_4_90_2","unstructured":"Shuang Ma Zhaoyang Zeng Daniel McDuff and Yale Song. 2021. Contrastive Self-Supervised Learning of Global-Local Audio-Visual Representations. Retrieved from https:\/\/openreview.net\/forum?id=Py4VjN6V2JX"},{"key":"e_1_3_4_91_2","unstructured":"Danny Merkx Stefan L. Frank and Mirjam Ernestus. 2019. Language learning using speech to image retrieval. arxiv:1909.03795. Retrieved from https:\/\/arxiv.org\/abs\/1909.03795"},{"key":"e_1_3_4_92_2","unstructured":"Shaobo Min Qi Dai Hongtao Xie Chuang Gan Yongdong Zhang and Jingdong Wang. 2021. Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning. arxiv:2106.06939. Retrieved from https:\/\/arxiv.org\/abs\/2106.06939"},{"key":"e_1_3_4_93_2","doi-asserted-by":"publisher","unstructured":"Rodrigo Mira Konstantinos Vougioukas Pingchuan Ma Stavros Petridis Bj\u00f6rn W. Schuller and Maja Pantic. 2022. End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks. DOI:10.1109\/TCYB.2022.3162495arxiv:2104.13332 [cs.LG]","DOI":"10.1109\/TCYB.2022.3162495"},{"key":"e_1_3_4_94_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01463"},{"key":"e_1_3_4_95_2","unstructured":"Pedro Morgado Yi Li and Nuno Vasconcelos. 2020. 
Learning Representations from Audio-Visual Spatial Alignment. arxiv:2011.01819. Retrieved from https:\/\/arxiv.org\/abs\/2011.01819"},{"key":"e_1_3_4_96_2","unstructured":"Arsha Nagrani Shan Yang Anurag Arnab Aren Jansen Cordelia Schmid and Chen Sun. 2022. Attention Bottlenecks for Multimodal Fusion. arxiv:2107.00135. Retrieved from https:\/\/arxiv.org\/abs\/2107.00135"},{"key":"e_1_3_4_97_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.232"},{"key":"e_1_3_4_98_2","first-page":"689","volume-title":"Proceedings of the 28th International Conference on Machine Learning","author":"Ngiam Jiquan","year":"2011","unstructured":"Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. 2011. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning. 689\u2013696."},{"key":"e_1_3_4_99_2","doi-asserted-by":"publisher","DOI":"10.1109\/icip40778.2020.9190769"},{"key":"e_1_3_4_100_2","unstructured":"Mandela Patrick Yuki Asano Polina Kuznetsova Ruth Fong Joao F. Henriques Geoffrey Zweig and Andrea Vedaldi. 2021. Multi-modal Self-Supervision from Generalized Data Transformations. Retrieved from https:\/\/openreview.net\/forum?id=mgVbI13p96"},{"key":"e_1_3_4_101_2","doi-asserted-by":"publisher","DOI":"10.1145\/3284750"},{"key":"e_1_3_4_102_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33016892"},{"key":"e_1_3_4_103_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01018"},{"key":"e_1_3_4_104_2","unstructured":"Laure Pretet Gael Richard and Geoffroy Peeters. 2021. Cross-Modal Music-Video Recommendation: A Study of Design Choices. arxiv:2104.14799. Retrieved from https:\/\/arxiv.org\/abs\/2104.14799"},{"key":"e_1_3_4_105_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2019.2908700"},{"key":"e_1_3_4_106_2","unstructured":"Rui Qian Yeqing Li Zheng Xu Ming-Hsuan Yang Serge Belongie and Yin Cui. 2022. 
Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models. arxiv:2207.07646. Retrieved from https:\/\/arxiv.org\/abs\/2207.07646"},{"key":"e_1_3_4_107_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00689"},{"key":"e_1_3_4_108_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01233"},{"key":"e_1_3_4_109_2","unstructured":"Alec Radford Jong Wook Kim Chris Hallacy Aditya Ramesh Gabriel Goh Sandhini Agarwal Girish Sastry Amanda Askell Pamela Mishkin Jack Clark Gretchen Krueger and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. arxiv:2103.00020. Retrieved from https:\/\/arxiv.org\/abs\/2103.00020"},{"key":"e_1_3_4_110_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9413456"},{"key":"e_1_3_4_111_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV45572.2020.9093616"},{"key":"e_1_3_4_112_2","first-page":"8821","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Ramesh Aditya","year":"2021","unstructured":"Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning. PMLR, 8821\u20138831."},{"key":"e_1_3_4_113_2","doi-asserted-by":"publisher","DOI":"10.1109\/icassp39728.2021.9414053"},{"key":"e_1_3_4_114_2","doi-asserted-by":"crossref","unstructured":"Adri\u00e0 Recasens Pauline Luc Jean-Baptiste Alayrac Luyu Wang Ross Hemsley Florian Strub Corentin Tallec Mateusz Malinowski Viorica Patraucean Florent Altch\u00e9 Michal Valko Jean-Bastien Grill A\u00e4ron van den Oord and Andrew Zisserman. 2021. Broaden Your Views for Self-Supervised Video Learning. arxiv:2103.16559. 
Retrieved from https:\/\/arxiv.org\/abs\/2103.16559","DOI":"10.1109\/ICCV48922.2021.00129"},{"key":"e_1_3_4_115_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV51458.2022.00112"},{"key":"e_1_3_4_116_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_4_117_2","doi-asserted-by":"crossref","unstructured":"Andrew Rouditchenko Angie Boggust David Harwath Brian Chen Dhiraj Joshi Samuel Thomas Kartik Audhkhasi Hilde Kuehne Rameswar Panda Rogerio Feris Brian Kingsbury Michael Picheny Antonio Torralba and James Glass. 2021. AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. arxiv:2006.09199. Retrieved from https:\/\/arxiv.org\/abs\/2006.09199","DOI":"10.21437\/Interspeech.2021-1312"},{"key":"e_1_3_4_118_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00985"},{"key":"e_1_3_4_119_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2020.3000593"},{"key":"e_1_3_4_120_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9413528"},{"key":"e_1_3_4_121_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2016-84"},{"key":"e_1_3_4_122_2","unstructured":"Pritam Sarkar and Ali Etemad. 2022. Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity. arxiv:2111.05329. Retrieved from https:\/\/arxiv.org\/abs\/2111.05329"},{"key":"e_1_3_4_123_2","unstructured":"Pritam Sarkar and Ali Etemad. 2023. XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning. arxiv:2211.13929. Retrieved from https:\/\/arxiv.org\/abs\/2211.13929"},{"key":"e_1_3_4_124_2","unstructured":"Florian Schmid Khaled Koutini and Gerhard Widmer. 2023. Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation. arxiv:2211.04772. 
Retrieved from https:\/\/arxiv.org\/abs\/2211.04772"},{"key":"e_1_3_4_125_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.3006563"},{"key":"e_1_3_4_126_2","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654919"},{"key":"e_1_3_4_127_2","unstructured":"Bowen Shi Wei-Ning Hsu Kushal Lakhotia and Abdelrahman Mohamed. 2022. Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction. arxiv:2201.02184. Retrieved from https:\/\/arxiv.org\/abs\/2201.02184"},{"key":"e_1_3_4_128_2","unstructured":"Zhaofeng Shi. 2021. A Survey on Audio Synthesis and Audio-Visual Multimodal Processing. arxiv:2108.00443. Retrieved from https:\/\/arxiv.org\/abs\/2108.00443"},{"key":"e_1_3_4_129_2","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Two-Stream Convolutional Networks for Action Recognition in Videos. arxiv:1406.2199. Retrieved from https:\/\/arxiv.org\/abs\/1406.2199"},{"key":"e_1_3_4_130_2","first-page":"2256","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Sohl-Dickstein Jascha","year":"2015","unstructured":"Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning. PMLR, 2256\u20132265."},{"key":"e_1_3_4_131_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV45572.2020.9093274"},{"key":"e_1_3_4_132_2","unstructured":"Jabeen Summaira Xi Li Amin Muhammad Shoib Songyuan Li and Jabbar Abdul. 2021. Recent Advances and Trends in Multimodal Deep Learning: A Review. arxiv:2105.11087. Retrieved from https:\/\/arxiv.org\/abs\/2105.11087"},{"key":"e_1_3_4_133_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00222"},{"key":"e_1_3_4_134_2","doi-asserted-by":"publisher","unstructured":"Li Tao Xueting Wang and Toshihiko Yamasaki. 2020. Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework. 
DOI:10.1145\/3394171.3413694arxiv:2008.02531 [cs.CV]","DOI":"10.1145\/3394171.3413694"},{"key":"e_1_3_4_135_2","unstructured":"Yapeng Tian Jing Shi Bochen Li Zhiyao Duan and Chenliang Xu. 2018. Audio-Visual Event Localization in Unconstrained Videos. arxiv:1803.08842. Retrieved from https:\/\/arxiv.org\/abs\/1803.08842"},{"key":"e_1_3_4_136_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV51458.2022.00092"},{"key":"e_1_3_4_137_2","unstructured":"Zhan Tong Yibing Song Jue Wang and Limin Wang. 2022. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. arxiv:2203.12602. Retrieved from https:\/\/arxiv.org\/abs\/2203.12602"},{"key":"e_1_3_4_138_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_4_139_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00675"},{"key":"e_1_3_4_140_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2005.1415167"},{"key":"e_1_3_4_141_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems.","volume":"30","author":"Oord Aaron van den","year":"2017","unstructured":"Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural discrete representation learning. In Proceedings of the Advances in Neural Information Processing Systems. I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, Curran Associates, Inc. Retrieved from https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2017\/file\/7a98af17e63a0ac09ce2e96d03992fbc-Paper.pdf"},{"key":"e_1_3_4_142_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez Lukasz Kaiser and Illia Polosukhin. 2023. Attention Is All You Need. arxiv:1706.03762. Retrieved from https:\/\/arxiv.org\/abs\/1706.03762"},{"key":"e_1_3_4_143_2","doi-asserted-by":"publisher","unstructured":"Sergey Verbitskiy Vladimir Berikov and Viacheslav Vyshegorodtsev. 2022. 
ERANNs: Efficient Residual Audio Neural Networks for Audio Pattern Recognition. DOI:10.1016\/j.patrec.2022.07.012 arxiv:2106.01621 [cs.SD]","DOI":"10.1016\/j.patrec.2022.07.012"},{"key":"e_1_3_4_144_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123326"},{"key":"e_1_3_4_145_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01398"},{"key":"e_1_3_4_146_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-1891"},{"key":"e_1_3_4_147_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475572"},{"key":"e_1_3_4_148_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00879"},{"key":"e_1_3_4_149_2","unstructured":"Luyu Wang Pauline Luc Adria Recasens Jean-Baptiste Alayrac and Aaron van den Oord. 2021. Multimodal Self-Supervised Learning of General Audio Representations. arxiv:2104.12807. Retrieved from https:\/\/arxiv.org\/abs\/2104.12807"},{"key":"e_1_3_4_150_2","unstructured":"Luyu Wang and Aaron van den Oord. 2021. Multi-Format Contrastive Learning of Audio Representations. arxiv:2103.06508. Retrieved from https:\/\/arxiv.org\/abs\/2103.06508"},{"key":"e_1_3_4_151_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2868668"},{"key":"e_1_3_4_152_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01432"},{"key":"e_1_3_4_153_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00611"},{"key":"e_1_3_4_154_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1011"},{"key":"e_1_3_4_155_2","unstructured":"Wenhui Wang Hangbo Bao Li Dong Johan Bjorck Zhiliang Peng Qiang Liu Kriti Aggarwal Owais Khan Mohammed Saksham Singhal Subhojit Som and Furu Wei. 2022. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks. arxiv:2208.10442. 
Retrieved from https:\/\/arxiv.org\/abs\/2208.10442"},{"key":"e_1_3_4_156_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.291"},{"key":"e_1_3_4_157_2","doi-asserted-by":"publisher","DOI":"10.1145\/3569584"},{"key":"e_1_3_4_158_2","doi-asserted-by":"publisher","DOI":"10.1109\/tpami.2020.3015894"},{"key":"e_1_3_4_159_2","unstructured":"Chen Wei Haoqi Fan Saining Xie Chao-Yuan Wu Alan Yuille and Christoph Feichtenhofer. 2023. Masked Feature Prediction for Self-Supervised Visual Pre-Training. arxiv:2112.09133. Retrieved from https:\/\/arxiv.org\/abs\/2112.09133"},{"key":"e_1_3_4_160_2","unstructured":"Yake Wei Di Hu Yapeng Tian and Xuelong Li. 2022. Learning in Audio-visual Context: A Review Analysis and New Perspective. arxiv:2208.09579. Retrieved from https:\/\/arxiv.org\/abs\/2208.09579"},{"key":"e_1_3_4_161_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00677"},{"key":"e_1_3_4_162_2","unstructured":"Wenhao Wu Zhun Sun and Wanli Ouyang. 2023. Revisiting Classifier: Transferring Vision-Language Models for Video Recognition. arxiv:2207.01297. Retrieved from https:\/\/arxiv.org\/abs\/2207.01297"},{"key":"e_1_3_4_163_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00138"},{"key":"e_1_3_4_164_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00639"},{"key":"e_1_3_4_165_2","doi-asserted-by":"publisher","DOI":"10.1145\/3122865.3122867"},{"key":"e_1_3_4_166_2","doi-asserted-by":"crossref","unstructured":"Hu Xu Gargi Ghosh Po-Yao Huang Dmytro Okhonko Armen Aghajanyan Florian Metze Luke Zettlemoyer and Christoph Feichtenhofer. 2021. VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding. arxiv:2109.14084. 
Retrieved from https:\/\/arxiv.org\/abs\/2109.14084","DOI":"10.18653\/v1\/2021.emnlp-main.544"},{"key":"e_1_3_4_167_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413581"},{"key":"e_1_3_4_168_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2023.3275156"},{"key":"e_1_3_4_169_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11280-018-0541-x"},{"key":"e_1_3_4_170_2","article-title":"Joint feature synthesis and embedding: Adversarial cross-modal retrieval revisited","author":"Xu Xing","year":"2020","unstructured":"Xing Xu, Kaiyi Lin, Yang Yang, Alan Hanjalic, and Heng Tao Shen. 2020. Joint feature synthesis and embedding: Adversarial cross-modal retrieval revisited. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_4_171_2","doi-asserted-by":"publisher","DOI":"10.1109\/tip.2021.3106814"},{"key":"e_1_3_4_172_2","doi-asserted-by":"publisher","DOI":"10.1631\/fitee.2100463"},{"key":"e_1_3_4_173_2","unstructured":"Qinghao Ye Guohai Xu Ming Yan Haiyang Xu Qi Qian Ji Zhang and Fei Huang. 2022. HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training. arxiv:2212.14546. Retrieved from https:\/\/arxiv.org\/abs\/2212.14546"},{"key":"e_1_3_4_174_2","doi-asserted-by":"publisher","DOI":"10.1145\/2393347.2396493"},{"key":"e_1_3_4_175_2","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Ng Joe Yue-Hei","year":"2015","unstructured":"Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. 2015. Beyond short snippets: Deep networks for video classification. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_4_176_2","first-page":"12310","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Zbontar Jure","year":"2021","unstructured":"Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and St\u00e9phane Deny. 2021. Barlow twins: Self-supervised learning via redundancy reduction. In Proceedings of the International Conference on Machine Learning. PMLR, 12310\u201312320."},{"key":"e_1_3_4_177_2","unstructured":"Donghuo Zeng Jianming Wu Gen Hattori Yi Yu and Rong Xu. 2021. Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval. arxiv:2110.13556. Retrieved from https:\/\/arxiv.org\/abs\/2110.13556"},{"key":"e_1_3_4_178_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISM.2018.00-21"},{"key":"e_1_3_4_179_2","doi-asserted-by":"publisher","DOI":"10.1145\/3387164"},{"key":"e_1_3_4_180_2","first-page":"60.1","article-title":"Exploiting image-trained CNN architectures for unconstrained video classification","author":"Zha S.","year":"2015","unstructured":"S. Zha, F. Luisier, W. Andrews, N. Srivastava, and R. Salakhutdinov. 2015. Exploiting image-trained CNN architectures for unconstrained video classification. 26th British Machine Vision Conference (2015), 60.1\u201360.13.","journal-title":"26th British Machine Vision Conference"},{"key":"e_1_3_4_181_2","unstructured":"Jingran Zhang Fumin Shen Xing Xu and Heng Tao Shen. 2019. Cooperative Cross-Stream Network for Discriminative Action Representation. arxiv:1908.10136. Retrieved from https:\/\/arxiv.org\/abs\/1908.10136"},{"key":"e_1_3_4_182_2","unstructured":"Jiwei Zhang Yi Yu Suhua Tang Jianming Wu and Wei Li. 2021. Variational Autoencoder with CCA for Audio-Visual Cross-Modal Retrieval. arxiv:2112.02601. 
Retrieved from https:\/\/arxiv.org\/abs\/2112.02601"},{"key":"e_1_3_4_183_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01332"},{"key":"e_1_3_4_184_2","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN.2019.8851942"},{"key":"e_1_3_4_185_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00182"},{"key":"e_1_3_4_186_2","unstructured":"Hang Zhao Chuang Gan Andrew Rouditchenko Carl Vondrick Josh McDermott and Antonio Torralba. 2018. The Sound of Pixels. arxiv:1804.03160. Retrieved from https:\/\/arxiv.org\/abs\/1804.03160"},{"key":"e_1_3_4_187_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01064"},{"key":"e_1_3_4_188_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2021.3050089"},{"key":"e_1_3_4_189_2","doi-asserted-by":"publisher","DOI":"10.1145\/3442381.3449801"},{"key":"e_1_3_4_190_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00154"},{"key":"e_1_3_4_191_2","unstructured":"Jinghao Zhou Chen Wei Huiyu Wang Wei Shen Cihang Xie Alan Yuille and Tao Kong. 2022. iBOT: Image BERT Pre-Training with Online Tokenizer. arxiv:2111.07832. Retrieved from https:\/\/arxiv.org\/abs\/2111.07832"},{"key":"e_1_3_4_192_2","unstructured":"Bin Zhu Bin Lin Munan Ning Yang Yan Jiaxi Cui HongFa Wang Yatian Pang Wenhao Jiang Junwu Zhang Zongwei Li Wancai Zhang Zhifeng Li Wei Liu and Li Yuan. 2024. LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. arxiv:2310.01852. Retrieved from https:\/\/arxiv.org\/abs\/2310.01852"},{"key":"e_1_3_4_193_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11633-021-1293-0"},{"key":"e_1_3_4_194_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01016"},{"key":"e_1_3_4_195_2","volume-title":"Proceedings of the Asian Conference on Computer Vision","author":"Zhu Lingyu","year":"2020","unstructured":"Lingyu Zhu and Esa Rahtu. 2020. Visually guided sound source separation using cascaded opponent filter network. 
In Proceedings of the Asian Conference on Computer Vision."},{"key":"e_1_3_4_196_2","doi-asserted-by":"publisher","DOI":"10.1109\/EUVIP50544.2021.9484036"},{"key":"e_1_3_4_197_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.219"},{"key":"e_1_3_4_198_2","doi-asserted-by":"crossref","unstructured":"Ye Zhu Kyle Olszewski Yu Wu Panos Achlioptas Menglei Chai Yan Yan and Sergey Tulyakov. 2022. Quantized GAN for Complex Music Generation from Dance Videos. arxiv:2204.00604. Retrieved from https:\/\/arxiv.org\/abs\/2204.00604","DOI":"10.1007\/978-3-031-19836-6_11"},{"key":"e_1_3_4_199_2","unstructured":"Ye Zhu Yu Wu Hugo Latapie Yi Yang and Yan Yan. 2021. Learning Audio-Visual Correlations from Variational Cross-Modal Generation. arxiv:2102.03424. Retrieved from https:\/\/arxiv.org\/abs\/2102.03424"},{"key":"e_1_3_4_200_2","unstructured":"Ye Zhu Yu Wu Kyle Olszewski Jian Ren Sergey Tulyakov and Yan Yan. 2023. Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation. arxiv:2206.07771. 
Retrieved from https:\/\/arxiv.org\/abs\/2206.07771"}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3696445","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,16]],"date-time":"2025-07-16T13:25:47Z","timestamp":1752672347000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3696445"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,11]]},"references-count":199,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2025,12,31]]}},"alternative-id":["10.1145\/3696445"],"URL":"https:\/\/doi.org\/10.1145\/3696445","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,11]]},"assertion":[{"value":"2022-07-28","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-08-22","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-11","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}