{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T20:10:21Z","timestamp":1765311021659,"version":"3.46.0"},"publisher-location":"New York, NY, USA","reference-count":38,"publisher":"ACM","funder":[{"name":"National Nature Science Foundation of China","award":["Grant 62441616, Grant 62336004, Grant 62406172, Grant 62125603."],"award-info":[{"award-number":["Grant 62441616, Grant 62336004, Grant 62406172, Grant 62125603."]}]},{"DOI":"10.13039\/501100002858","name":"China Postdoctoral Science Foundation","doi-asserted-by":"publisher","award":["No. 2023M741964"],"award-info":[{"award-number":["No. 2023M741964"]}],"id":[{"id":"10.13039\/501100002858","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Postdoctoral Fellowship Program of CPSF","award":["No. GZC20240841"],"award-info":[{"award-number":["No. GZC20240841"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,10,27]]},"DOI":"10.1145\/3746027.3755594","type":"proceedings-article","created":{"date-parts":[[2025,10,25]],"date-time":"2025-10-25T07:30:51Z","timestamp":1761377451000},"page":"11610-11618","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["WhiADD: Semantic-Acoustic Fusion for Robust Audio Deepfake Detection"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6016-5574","authenticated-orcid":false,"given":"Jianqiao","family":"Cui","sequence":"first","affiliation":[{"name":"Tsinghua University, Beijing, China and Yangtze Delta Region Institute of Tsinghua University, Zhejiang, Jiaxing, Zhejiang, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9550-6554","authenticated-orcid":false,"given":"Bingyao","family":"Yu","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-0612-7669","authenticated-orcid":false,"given":"Qihao","family":"Wang","sequence":"additional","affiliation":[{"name":"GuangXi University, Nanning, GuangXi, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2553-9470","authenticated-orcid":false,"given":"Fei","family":"Meng","sequence":"additional","affiliation":[{"name":"Yangtze Delta Region Institute of Tsinghua University, Zhejiang, Jiaxing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6121-5529","authenticated-orcid":false,"given":"Jiwen","family":"Lu","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,10,27]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111","author":"Wang Chengyi","year":"2023","unstructured":"Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023."},{"key":"e_1_3_2_1_2_1","volume-title":"Speak foreign languages with your own voice: Cross-lingual neural codec language modeling. arXiv preprint arXiv:2303.03926","author":"Zhang Ziqiang","year":"2023","unstructured":"Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. 
Speak foreign languages with your own voice: Cross-lingual neural codec language modeling. arXiv preprint arXiv:2303.03926, 2023."},{"key":"e_1_3_2_1_3_1","volume-title":"Viola: Unified codec language models for speech recognition, synthesis, and translation. arXiv preprint arXiv:2305.16107","author":"Wang Tianrui","year":"2023","unstructured":"Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur, Zhuo Chen, Jinyu Li, and Furu Wei. Viola: Unified codec language models for speech recognition, synthesis, and translation. arXiv preprint arXiv:2305.16107, 2023."},{"key":"e_1_3_2_1_4_1","volume-title":"et al. Uniaudio: An audio foundation model toward universal audio generation. arXiv preprint arXiv:2310.00704","author":"Yang Dongchao","year":"2023","unstructured":"Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, et al. Uniaudio: An audio foundation model toward universal audio generation. arXiv preprint arXiv:2310.00704, 2023."},{"key":"e_1_3_2_1_5_1","volume-title":"et al. Lauragpt: Listen, attend, understand, and regenerate audio with gpt. arXiv preprint arXiv:2310.04673","author":"Du Zhihao","year":"2023","unstructured":"Zhihao Du, Jiaming Wang, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, et al. Lauragpt: Listen, attend, understand, and regenerate audio with gpt. arXiv preprint arXiv:2310.04673, 2023."},{"key":"e_1_3_2_1_6_1","volume-title":"Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, and Takuya Yoshioka. Speechx: Neural codec language model as a versatile speech transformer","author":"Wang Xiaofei","year":"2024","unstructured":"Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, and Takuya Yoshioka. Speechx: Neural codec language model as a versatile speech transformer. IEEE\/ACM Transactions on Audio, Speech, and Language Processing, 2024."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.21437\/ASVSPOOF.2021-8"},{"key":"e_1_3_2_1_8_1","volume-title":"Proceedings of the 20th Annual Conference of the International Speech Communication Association (INTERSPEECH)","author":"Todisco Massimiliano","year":"2019","unstructured":"Massimiliano Todisco, Xin Wang, Ville Vestman, Md. Sahidullah, H\u00e9ctor Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas W. D. Evans, Tomi H. Kinnunen, and Kong Aik Lee. ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection. In Proceedings of the 20th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2019."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP43922.2022.9746939"},{"key":"e_1_3_2_1_10_1","first-page":"4213","volume-title":"Proceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH)","author":"Das Rohan Kumar","year":"2020","unstructured":"Rohan Kumar Das, Xiaohai Tian, Tomi Kinnunen, and Haizhou Li. The attacker's perspective on automatic speaker verification: An overview. In Proceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH), pages 4213-4217, Shanghai, China, 2020. ISCA."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2024-2093"},{"key":"e_1_3_2_1_12_1","volume-title":"Codecfake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems. 
arXiv preprint arXiv:2406.07237","author":"Wu Haibin","year":"2024","unstructured":"Haibin Wu, Yuan Tseng, and Hung-yi Lee. Codecfake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems. arXiv preprint arXiv:2406.07237, 2024."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLPRO.2025.3525966"},{"key":"e_1_3_2_1_14_1","volume-title":"Asvspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale. arxiv","author":"Wang X","year":"2024","unstructured":"X Wang, H Delgado, H Tak, Jw Jung, Hj Shim, M Todisco, I Kukanov, X Liu, M Sahidullah, T Kinnunen, et al. Asvspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale. arxiv 2024. arXiv preprint arXiv:2408.08739, 2024."},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2022.09.135"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2023-1537"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-61382-1_8"},{"key":"e_1_3_2_1_18_1","first-page":"28492","volume-title":"International conference on machine learning","author":"Radford Alec","year":"2023","unstructured":"Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492-28518. PMLR, 2023."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3658644.3670285"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.emnlp-main.286"},{"issue":"1","key":"e_1_3_2_1_21_1","first-page":"201","article-title":"Spoof speech detection based on speaker features","volume":"50","author":"Yuxiang ZHANG","year":"2025","unstructured":"ZHANG Yuxiang, LI Zhuo, LU Jingze, SHANG Zengqiang, CHEN Shuli, WANG Wenchao, and ZHANG Pengyuan. Spoof speech detection based on speaker features. ACTA ACUSTICA, 50(1):201-210, 2025.","journal-title":"ACTA ACUSTICA"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1142\/S0218001423500155"},{"key":"e_1_3_2_1_23_1","volume-title":"Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit (version 0.92)","author":"Yamagishi Junichi","year":"2019","unstructured":"Junichi Yamagishi, Christophe Veaux, Kirsten MacDonald, et al. Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit (version 0.92). University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2019."},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2023-2193"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2023-1505"},{"key":"e_1_3_2_1_26_1","volume-title":"Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670","author":"Ardila Rosana","year":"2019","unstructured":"Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670, 2019."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2019-2680"},{"key":"e_1_3_2_1_28_1","volume-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations. 
Advances in neural information processing systems, 33:12449-12460","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449-12460, 2020."},{"key":"e_1_3_2_1_29_1","first-page":"3790","volume-title":"Interspeech","volume":"2021","author":"Zhu Youxiang","year":"2021","unstructured":"Youxiang Zhu, Abdelrahman Obyat, Xiaohui Liang, John A Batsis, and Robert M Roth. Wavbert: Exploiting semantic and non-semantic speech using wav2vec and bert for dementia detection. In Interspeech, volume 2021, page 3790, 2021."},{"key":"e_1_3_2_1_30_1","volume-title":"Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716-23736","author":"Alayrac Jean-Baptiste","year":"2022","unstructured":"Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716-23736, 2022."},{"key":"e_1_3_2_1_31_1","volume-title":"Hubert: Self-supervised speech representation learning by masked prediction of hidden units","author":"Hsu Wei-Ning","year":"2021","unstructured":"Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE\/ACM transactions on audio, speech, and language processing, 29:3451-3460, 2021."},{"issue":"2","key":"e_1_3_2_1_32_1","first-page":"3","article-title":"Lora: Low-rank adaptation of large language models","volume":"1","author":"Hu Edward J","year":"2022","unstructured":"Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.","journal-title":"ICLR"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9052942"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2019-1768"},{"key":"e_1_3_2_1_35_1","first-page":"6367","volume-title":"ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP)","author":"Heo Hee-Soo","year":"2022","unstructured":"Jee-weon Jung, Hee-Soo Heo, Hemlata Tak, Hye-jin Shim, Joon Son Chung, Bong-Jin Lee, Ha-Jin Yu, and Nicholas Evans. Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks. In ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 6367-6371. IEEE, 2022."},{"key":"e_1_3_2_1_36_1","volume-title":"et al. Seed-tts: A family of high-quality versatile speech generation models. arXiv preprint arXiv:2406.02430","author":"Anastassiou Philip","year":"2024","unstructured":"Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al. Seed-tts: A family of high-quality versatile speech generation models. 
arXiv preprint arXiv:2406.02430, 2024."},{"key":"e_1_3_2_1_37_1","first-page":"6691","volume-title":"Proceedings of the 13th Language Resources and Evaluation Conference (LREC)","author":"Jia Ye","year":"2022","unstructured":"Ye Jia, Michelle Tadmor Ramanovich, Quan Wang, and Heiga Zen. Cvss corpus and massively multilingual speech-to-speech translation. In Proceedings of the 13th Language Resources and Evaluation Conference (LREC), pages 6691-6703, Marseille, France, 2022. European Language Resources Association (ELRA)."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP49357.2023.10096278"}],"event":{"name":"MM '25: The 33rd ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Dublin Ireland","acronym":"MM '25"},"container-title":["Proceedings of the 33rd ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3746027.3755594","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T20:06:05Z","timestamp":1765310765000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3746027.3755594"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,27]]},"references-count":38,"alternative-id":["10.1145\/3746027.3755594","10.1145\/3746027"],"URL":"https:\/\/doi.org\/10.1145\/3746027.3755594","relation":{},"subject":[],"published":{"date-parts":[[2025,10,27]]},"assertion":[{"value":"2025-10-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
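Since the record above is a Crossref works record, a short sketch of how such a record can be retrieved programmatically may be useful. This is a minimal illustration, not part of the deposited metadata: it assumes only the public Crossref REST API endpoint (api.crossref.org/works/{DOI}) and the field names visible in the record itself (title, author, container-title, issued).

import json
import urllib.request

# DOI taken from the record above; Crossref serves the matching works record as JSON.
DOI = "10.1145/3746027.3755594"

with urllib.request.urlopen(f"https://api.crossref.org/works/{DOI}") as resp:
    msg = json.load(resp)["message"]  # Crossref wraps the record in a "message" object

# Pull out a few of the fields present in the deposited metadata.
title = msg["title"][0]                    # "WhiADD: Semantic-Acoustic Fusion ..."
authors = ", ".join(f'{a["given"]} {a["family"]}' for a in msg["author"])
venue = msg["container-title"][0]          # "Proceedings of the 33rd ACM ..."
year = msg["issued"]["date-parts"][0][0]   # 2025

print(f"{authors}. {title}. {venue}, {year}. https://doi.org/{DOI}")

Running this prints a one-line citation equivalent to the record summarized above; any other deposited field (pages, funder, event) can be read out of the same "message" object in the same way.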