{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,1]],"date-time":"2026-06-01T13:07:03Z","timestamp":1780319223494,"version":"3.54.1"},"publisher-location":"New York, NY, USA","reference-count":78,"publisher":"ACM","license":[{"start":{"date-parts":[[2024,10,28]],"date-time":"2024-10-28T00:00:00Z","timestamp":1730073600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,10,28]]},"DOI":"10.1145\/3664647.3680795","type":"proceedings-article","created":{"date-parts":[[2024,10,26]],"date-time":"2024-10-26T06:59:41Z","timestamp":1729925981000},"page":"7414-7423","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":45,"title":["AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7978-0860","authenticated-orcid":false,"given":"Zhixi","family":"Cai","sequence":"first","affiliation":[{"name":"Monash University, Melbourne, Australia"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2639-8374","authenticated-orcid":false,"given":"Shreya","family":"Ghosh","sequence":"additional","affiliation":[{"name":"Curtin University, Perth, Australia"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-9585-404X","authenticated-orcid":false,"given":"Aman Pankaj","family":"Adatia","sequence":"additional","affiliation":[{"name":"Indian Institute of Technology Ropar, Ropar, India"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2706-5985","authenticated-orcid":false,"given":"Munawar","family":"Hayat","sequence":"additional","affiliation":[{"name":"Qualcomm, San Diego, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2230-1440","authenticated-orcid":false,"given":"Abhinav","family":"Dhall","sequence":"additional","affiliation":[{"name":"Flinders University, Adelaide, Australia"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8356-4909","authenticated-orcid":false,"given":"Tom","family":"Gedeon","sequence":"additional","affiliation":[{"name":"Curtin University of Technology, Perth, Australia"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0861-8660","authenticated-orcid":false,"given":"Kalin","family":"Stefanov","sequence":"additional","affiliation":[{"name":"Monash University, Melbourne, Australia"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2024,10,28]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/WIFS.2018.8630761"},{"key":"e_1_3_2_2_2_1","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision. 5178--5187","author":"Agarwal Madhav","unstructured":"Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, and C. V. Jawahar. 2023. Audio-Visual Face Reenactment. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision. 5178--5187."},{"key":"e_1_3_2_2_3_1","volume-title":"Advances in Neural Information Processing Systems","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 1877--1901."},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2023.103818"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00150"},{"key":"e_1_3_2_2_6_1","volume-title":"Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization. In 2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)","author":"Cai Zhixi","year":"2022","unstructured":"Zhixi Cai, Kalin Stefanov, Abhinav Dhall, and Munawar Hayat. 2022. Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization. In 2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA). Sydney, Australia, 1--10."},{"key":"e_1_3_2_2_7_1","volume-title":"Frederico Santos De Oliveira, Arnaldo Candido Jr., Anderson Da Silva Soares, Sandra Maria Aluisio, and Moacir Antonelli Ponti.","author":"Casanova Edresson","year":"2021","unstructured":"Edresson Casanova, Christopher Shulby, Eren G\u00f6lge, Nicolas Michael M\u00fcller, Frederico Santos De Oliveira, Arnaldo Candido Jr., Anderson Da Silva Soares, Sandra Maria Aluisio, and Moacir Antonelli Ponti. 2021. SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model. In Interspeech 2021. ISCA, 3645--3649."},{"key":"e_1_3_2_2_8_1","volume-title":"Proceedings of the 39th International Conference on Machine Learning. PMLR, 2709--2720","author":"Casanova Edresson","unstructured":"Edresson Casanova, Julian Weber, Christopher D. Shulby, Arnaldo Candido Junior, Eren G\u00f6lge, and Moacir A. Ponti. 2022. YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone. In Proceedings of the 39th International Conference on Machine Learning. PMLR, 2709--2720. ISSN: 2640--3498."},{"key":"e_1_3_2_2_9_1","unstructured":"Harrison Chase. 2022. LangChain."},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00802"},{"key":"e_1_3_2_2_11_1","volume-title":"ISCA","author":"Choi Seungwoo","year":"2020","unstructured":"Seungwoo Choi, Seungju Han, Dongyoung Kim, and Sungjoo Ha. 2020. Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding. In Interspeech 2020. ISCA, 2007--2011."},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.195"},{"key":"e_1_3_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413700"},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2018-1929"},{"key":"e_1_3_2_2_15_1","volume-title":"Image Analysis and Processing -- ICIAP 2022 (Lecture Notes in Computer Science), Stan Sclaroff, Cosimo Distante, Marco Leo, Giovanni M","author":"Coccomini Davide Alessandro","unstructured":"Davide Alessandro Coccomini, Nicola Messina, Claudio Gennaro, and Fabrizio Falchi. 2022. Combining EfficientNet and\u00a0Vision Transformers for\u00a0Video Deepfake Detection. In Image Analysis and Processing -- ICIAP 2022 (Lecture Notes in Computer Science), Stan Sclaroff, Cosimo Distante, Marco Leo, Giovanni M. Farinella, and Federico Tombari (Eds.). Springer International Publishing, Cham, 219--229."},{"key":"e_1_3_2_2_16_1","volume-title":"The DeepFake Detection Challenge (DFDC) Dataset. arXiv","author":"Dolhansky Brian","year":"2006","unstructured":"Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. 2020. The DeepFake Detection Challenge (DFDC) Dataset. arXiv: 2006.07397 [cs]."},{"key":"e_1_3_2_2_17_1","volume-title":"Interspeech","author":"D\u00e9fossez Alexandre","year":"2020","unstructured":"Alexandre D\u00e9fossez, Gabriel Synnaeve, and Yossi Adi. 2020. Real Time Speech Enhancement in the Waveform Domain. In Interspeech 2020. Shanghai, China, 3291--3295."},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01011"},{"key":"e_1_3_2_2_19_1","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV) (Lecture Notes in Computer Science), Shai Avidan, Gabriel Brostow, Moustapha Ciss\u00e9","author":"Ge Songwei","unstructured":"Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. 2022. Long Video Generation with\u00a0Time-Agnostic VQGAN and\u00a0Time-Sensitive Transformer. In Proceedings of the European Conference on Computer Vision (ECCV) (Lecture Notes in Computer Science), Shai Avidan, Gabriel Brostow, Moustapha Ciss\u00e9, Giovanni Maria Farinella, and Tal Hassner (Eds.). Springer Nature Switzerland, Cham, 102--118."},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00573"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00500"},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00434"},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10489-022-03867-9"},{"key":"e_1_3_2_2_24_1","volume-title":"Advances in Neural Information Processing Systems","volume":"30","author":"Heusel Martin","year":"2017","unstructured":"Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc."},{"key":"e_1_3_2_2_25_1","volume-title":"Applied Soft Computing","volume":"136","author":"Ilyas Hafsa","year":"2023","unstructured":"Hafsa Ilyas, Ali Javed, and Khalid Mahmood Malik. 2023. AVFakeNet: A unified end-to-end Dense Swin Transformer deep learning model for audio--visual deepfakes detection. Applied Soft Computing, Vol. 136 (March 2023), 110124."},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3528233.3530745"},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.5555\/3327345.3327360"},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00296"},{"key":"e_1_3_2_2_29_1","unstructured":"Ziyue Jiang Jinglin Liu Yi Ren Jinzheng He Chen Zhang Zhenhui Ye Pengfei Wei Chunfeng Wang Xiang Yin Zejun Ma and Zhou Zhao. 2023. Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts. arXiv:2307.07218 [cs eess]."},{"key":"e_1_3_2_2_30_1","unstructured":"Ziyue Jiang Yi Ren Zhenhui Ye Jinglin Liu Chen Zhang Qian Yang Shengpeng Ji Rongjie Huang Chunfeng Wang Xiang Yin Zejun Ma and Zhou Zhao. 2023. Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias. arXiv:2306.03509 [cs eess]."},{"key":"e_1_3_2_2_31_1","volume-title":"Woo","author":"Khalid Hasam","year":"2021","unstructured":"Hasam Khalid, Shahroz Tariq, and Simon S. Woo. 2021. FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset. arXiv: 2108.05080 [cs]."},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"crossref","unstructured":"Kevin Kilgour Mauricio Zuluaga Dominik Roblek and Matthew Sharifi. 2019. Fr\u00e9chet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms. arXiv:1812.08466 [cs eess].","DOI":"10.21437\/Interspeech.2019-2219"},{"key":"e_1_3_2_2_33_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning. PMLR, 5530--5540","author":"Kim Jaehyeon","year":"2021","unstructured":"Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. In Proceedings of the 38th International Conference on Machine Learning. PMLR, 5530--5540. ISSN: 2640--3498."},{"key":"e_1_3_2_2_34_1","unstructured":"Pavel Korshunov and Sebastien Marcel. 2018. DeepFakes: a New Threat to Face Recognition? Assessment and Detection. arXiv:1812.08685 [cs]."},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01057"},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00505"},{"key":"e_1_3_2_2_37_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops. 7.","author":"Li Yuezun","year":"2019","unstructured":"Yuezun Li and Siwei Lyu. 2019. Exposing DeepFake Videos By Detecting Face Warping Artifacts. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops. 7."},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00327"},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2023.3285283"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413570"},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00939"},{"key":"e_1_3_2_2_42_1","unstructured":"Dufou Nick and Jigsaw Andrew. 2019. Contributing Data to Deepfake Detection Research."},{"key":"e_1_3_2_2_43_1","volume-title":"BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation. In 2021 International Joint Conference on Neural Networks (IJCNN). 1--8. ISSN: 2161--4407","author":"Niizumi Daisuke","year":"2021","unstructured":"Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, and Kunio Kashino. 2021. BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation. In 2021 International Joint Conference on Neural Networks (IJCNN). 1--8. ISSN: 2161--4407."},{"key":"e_1_3_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02559"},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2023-205"},{"key":"e_1_3_2_2_46_1","volume-title":"Proceedings of the 28th ACM International Conference on Multimedia (MM '20)","author":"Prajwal K R","unstructured":"K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, and C.V. Jawahar. 2020. A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild. In Proceedings of the 28th ACM International Conference on Multimedia (MM '20). Association for Computing Machinery, New York, NY, USA, 484--492."},{"key":"e_1_3_2_2_47_1","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV) (Lecture Notes in Computer Science), Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.)","author":"Qian Yuyang","unstructured":"Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. 2020. Thinking in Frequency: Face Forgery Detection by Mining Frequency-Aware Clues. In Proceedings of the European Conference on Computer Vision (ECCV) (Lecture Notes in Computer Science), Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer International Publishing, Cham, 86--103."},{"key":"e_1_3_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.5555\/3618408.3619590"},{"key":"e_1_3_2_2_49_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 993--1000","author":"Raza Muhammad Anas","year":"2023","unstructured":"Muhammad Anas Raza and Khalid Mahmood Malik. 2023. Multimodaltrace: Deepfake Detection Using Audiovisual Representation Learning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 993--1000."},{"key":"e_1_3_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00009"},{"key":"e_1_3_2_2_51_1","first-page":"2640","volume-title":"Lip Sync Matters: A Novel Multimodal Forgery Detector. In 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). 1885--1892","author":"Shahzad Sahibzada Adil","year":"2022","unstructured":"Sahibzada Adil Shahzad, Ammarah Hashmi, Sarwar Khan, Yan-Tsung Peng, Yu Tsao, and Hsin-Min Wang. 2022. Lip Sync Matters: A Novel Multimodal Forgery Detector. In 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). 1885--1892. ISSN: 2640-0103."},{"key":"e_1_3_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2024.3367749"},{"key":"e_1_3_2_2_53_1","unstructured":"Kai Shen Zeqian Ju Xu Tan Yanqing Liu Yichong Leng Lei He Tao Qin Sheng Zhao and Jiang Bian. 2023. NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers. arXiv:2304.09116 [cs eess]."},{"key":"e_1_3_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00197"},{"key":"e_1_3_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01808"},{"key":"e_1_3_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01816"},{"key":"e_1_3_2_2_57_1","unstructured":"Uriel Singer Adam Polyak Thomas Hayes Xi Yin Jie An Songyang Zhang Qiyuan Hu Harry Yang Oron Ashual Oran Gafni Devi Parikh Sonal Gupta and Yaniv Taigman. 2022. Make-A-Video: Text-to-Video Generation without Text-Video Data. arXiv:2209.14792 [cs]."},{"key":"e_1_3_2_2_58_1","first-page":"146","article-title":"Converting video formats with FFmpeg","volume":"2006","author":"Tomar Suramya","year":"2006","unstructured":"Suramya Tomar. 2006. Converting video formats with FFmpeg. Linux Journal, Vol. 2006, 146 (June 2006), 10.","journal-title":"Linux Journal"},{"key":"e_1_3_2_2_59_1","unstructured":"Hugo Touvron Thibaut Lavril Gautier Izacard Xavier Martinet Marie-Anne Lachaux Timoth\u00e9e Lacroix Baptiste Rozi\u00e8re Naman Goyal Eric Hambro Faisal Azhar Aurelien Rodriguez Armand Joulin Edouard Grave and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs]."},{"key":"e_1_3_2_2_60_1","volume-title":"Generalized End-to-End Loss for Speaker Verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4879--4883","author":"Wan Li","year":"2018","unstructured":"Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. 2018. Generalized End-to-End Loss for Speaker Verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4879--4883. ISSN: 2379--190X."},{"key":"e_1_3_2_2_61_1","volume-title":"Exploiting Modality-Specific Features for Multi-Modal Manipulation Detection and Grounding. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4935--4939","author":"Wang Jiazhen","year":"2024","unstructured":"Jiazhen Wang, Bin Liu, Changtao Miao, Zhiwei Zhao, Wanyi Zhuang, Qi Chu, and Nenghai Yu. 2024. Exploiting Modality-Specific Features for Multi-Modal Manipulation Detection and Grounding. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4935--4939. ISSN: 2379--190X."},{"key":"e_1_3_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01408"},{"key":"e_1_3_2_2_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3512527.3531415"},{"key":"e_1_3_2_2_64_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01398"},{"key":"e_1_3_2_2_65_1","unstructured":"Yi Wang Kunchang Li Yizhuo Li Yinan He Bingkun Huang Zhiyu Zhao Hongjie Zhang Jilan Xu Yi Liu Zun Wang Sen Xing Guo Chen Junting Pan Jiashuo Yu Yali Wang Limin Wang and Yu Qiao. 2022. InternVideo: General Video Foundation Models via Generative and Discriminative Learning. arXiv:2212.03191 [cs]."},{"key":"e_1_3_2_2_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2003.819861"},{"key":"e_1_3_2_2_67_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00701"},{"key":"e_1_3_2_2_68_1","volume-title":"Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation","author":"Wu Yusong","unstructured":"Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. 2023. Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 1--5. ISSN: 2379--190X."},{"key":"e_1_3_2_2_69_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2023.3262148"},{"key":"e_1_3_2_2_70_1","volume-title":"Exposing Deep Fakes Using Inconsistent Head Poses. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 8261--8265","author":"Yang Xin","year":"2019","unstructured":"Xin Yang, Yuezun Li, and Siwei Lyu. 2019. Exposing Deep Fakes Using Inconsistent Head Poses. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 8261--8265. ISSN: 2379--190X."},{"key":"e_1_3_2_2_71_1","volume-title":"ADD 2022: the First Audio Deep Synthesis Detection Challenge. arXiv:2202","author":"Yi Jiangyan","year":"2022","unstructured":"Jiangyan Yi, Ruibo Fu, Jianhua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Ye Bai, Cunhang Fan, Shan Liang, Shiming Wang, Shuai Zhang, Xinrui Yan, Le Xu, Zhengqi Wen, Haizhou Li, Zheng Lian, and Bin Liu. 2022. ADD 2022: the First Audio Deep Synthesis Detection Challenge. arXiv:2202.08433 [cs, eess]."},{"key":"e_1_3_2_2_72_1","volume-title":"Kot","author":"Yu Yang","year":"2023","unstructured":"Yang Yu, Xiaolong Liu, Rongrong Ni, Siyuan Yang, Yao Zhao, and Alex C. Kot. 2023. PVASS-MDD: Predictive Visual-audio Alignment Self-supervision for Multimodal Deepfake Detection. IEEE Transactions on Circuits and Systems for Video Technology (2023), 1--1."},{"key":"e_1_3_2_2_73_1","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV) (Lecture Notes in Computer Science), Shai Avidan, Gabriel Brostow, Moustapha Ciss\u00e9","author":"Zhang Chen-Lin","unstructured":"Chen-Lin Zhang, Jianxin Wu, and Yin Li. 2022. ActionFormer: Localizing Moments of\u00a0Actions with\u00a0Transformers. In Proceedings of the European Conference on Computer Vision (ECCV) (Lecture Notes in Computer Science), Shai Avidan, Gabriel Brostow, Moustapha Ciss\u00e9, Giovanni Maria Farinella, and Tal Hassner (Eds.). Springer Nature Switzerland, Cham, 492--510."},{"key":"e_1_3_2_2_74_1","doi-asserted-by":"crossref","unstructured":"Hang Zhang Xin Li and Lidong Bing. 2023. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. arXiv:2306.02858 [cs eess].","DOI":"10.18653\/v1\/2023.emnlp-demo.49"},{"key":"e_1_3_2_2_75_1","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3613767"},{"key":"e_1_3_2_2_76_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.317"},{"key":"e_1_3_2_2_77_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00572"},{"key":"e_1_3_2_2_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413769"}],"event":{"name":"MM '24: The 32nd ACM International Conference on Multimedia","location":"Melbourne VIC Australia","acronym":"MM '24","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 32nd ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3664647.3680795","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3664647.3680795","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:18:07Z","timestamp":1750295887000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3664647.3680795"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,28]]},"references-count":78,"alternative-id":["10.1145\/3664647.3680795","10.1145\/3664647"],"URL":"https:\/\/doi.org\/10.1145\/3664647.3680795","relation":{},"subject":[],"published":{"date-parts":[[2024,10,28]]},"assertion":[{"value":"2024-10-28","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}