{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,17]],"date-time":"2026-04-17T16:18:05Z","timestamp":1776442685611,"version":"3.51.2"},"reference-count":57,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2022,5,4]],"date-time":"2022-05-04T00:00:00Z","timestamp":1651622400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],
"abstract":"<jats:p>A number of AI-generated tools are used today to clone human voices, leading to a new technology known as Audio Deepfakes (ADs). Despite being introduced to enhance human lives as audiobooks, ADs have been used to disrupt public safety. ADs have thus recently come to the attention of researchers, with Machine Learning (ML) and Deep Learning (DL) methods being developed to detect them. In this article, a review of existing AD detection methods was conducted, along with a comparative description of the available faked audio datasets. The article introduces types of AD attacks and then outlines and analyzes the detection methods and datasets for imitation- and synthetic-based Deepfakes. To the best of the authors\u2019 knowledge, this is the first review targeting imitated and synthetically generated audio detection methods. The similarities and differences of AD detection methods are summarized by providing a quantitative comparison that finds that the method type affects the performance more than the audio features themselves, in which a substantial tradeoff between the accuracy and scalability exists. Moreover, at the end of this article, the potential research directions and challenges of Deepfake detection methods are discussed to discover that, even though AD detection is an active area of research, further research is still needed to address the existing gaps. This article can be a starting point for researchers to understand the current state of the AD literature and investigate more robust detection models that can detect fakeness even if the target audio contains accented voices or real-world noises.<\/jats:p>",
"DOI":"10.3390\/a15050155","type":"journal-article","created":{"date-parts":[[2022,5,4]],"date-time":"2022-05-04T08:21:25Z","timestamp":1651652485000},"page":"155","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":147,"title":["A Review of Modern Audio Deepfake Detection Methods: Challenges and Future Directions"],"prefix":"10.3390","volume":"15",
"author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8242-9246","authenticated-orcid":false,"given":"Zaynab","family":"Almutairi","sequence":"first","affiliation":[{"name":"Information Technology Department, College of Computer and Information Sciences, King Saud University, Riyadh P.O. Box 145111, Saudi Arabia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3764-6169","authenticated-orcid":false,"given":"Hebah","family":"Elgibreen","sequence":"additional","affiliation":[{"name":"Information Technology Department, College of Computer and Information Sciences, King Saud University, Riyadh P.O. Box 145111, Saudi Arabia"},{"name":"Artificial Intelligence Center of Advanced Studies (Thakaa), King Saud University, Riyadh P.O. Box 145111, Saudi Arabia"}]}],"member":"1968","published-online":{"date-parts":[[2022,5,4]]},
"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Lyu, S. (2020). Deepfake detection: Current challenges and next steps. IEEE Comput. Soc., 1\u20136.","DOI":"10.1109\/ICMEW46912.2020.9105991"},
{"key":"ref_2","doi-asserted-by":"crossref","first-page":"2072","DOI":"10.1177\/1461444820925811","article-title":"Anticipating and addressing the ethical implications of deepfakes in the context of elections","volume":"23","author":"Diakopoulos","year":"2021","journal-title":"New Media Soc."},
{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Florez, H., and Misra, S. (2020). A machine learning model to detect fake voice. Applied Informatics, Springer International Publishing.","DOI":"10.1007\/978-3-030-61702-8"},
{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Chen, T., Kumar, A., Nagarsheth, P., Sivaraman, G., and Khoury, E. (2020, January 1\u20135). Generalization of audio deepfake detection. Proceedings of the Odyssey 2020 The Speaker and Language Recognition Workshop, Tokyo, Japan.","DOI":"10.21437\/Odyssey.2020-19"},
{"key":"ref_5","doi-asserted-by":"crossref","first-page":"115465","DOI":"10.1016\/j.eswa.2021.115465","article-title":"Deep4SNet: Deep learning for fake speech classification","volume":"184","author":"Ballesteros","year":"2021","journal-title":"Expert Syst. Appl."},
{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3072959.3073640","article-title":"Synthesizing obama: Learning lip sync from audio","volume":"36","author":"Suwajanakorn","year":"2017","journal-title":"ACM Trans. Graph. ToG"},
{"key":"ref_7","unstructured":"(2022, January 29). Catherine Stupp Fraudsters Used AI to Mimic CEO\u2019s Voice in Unusual Cybercrime Case. Available online: https:\/\/www.wsj.com\/articles\/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-11567157402."},
{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Singh, P.K., Wierzcho\u0144, S.T., Tanwar, S., Ganzha, M., and Rodrigues, J.J.P.C. (2021). Deepfake: An overview. Proceedings of Second International Conference on Computing, Communications, and Cyber-Security, Springer.","DOI":"10.1007\/978-981-16-0733-2"},
{"key":"ref_9","unstructured":"Tan, X., Qin, T., Soong, F., and Liu, T.-Y. (2021). A survey on neural speech synthesis. arXiv."},
{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Ning, Y., He, S., Wu, Z., Xing, C., and Zhang, L.-J. (2019). A Review of Deep Learning Based Speech Synthesis. Appl. Sci., 9.","DOI":"10.3390\/app9194050"},
{"key":"ref_11","unstructured":"Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T.-Y. (2020). Fastspeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv."},
{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., and Skerrv-Ryan, R. (2018). Natural Tts Synthesis by Conditioning Wavenet on Mel Spectrogram Predictions, IEEE.","DOI":"10.1109\/ICASSP.2018.8461368"},
{"key":"ref_13","unstructured":"Ping, W., Peng, K., Gibiansky, A., Arik, S.O., Kannan, A., Narang, S., Raiman, J., and Miller, J. (2017). Deep voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv."},
{"key":"ref_14","unstructured":"Khanjani, Z., Watson, G., and Janeja, V.P. (2021). How deep are the fakes? Focusing on audio deepfake: A survey. arXiv."},
{"key":"ref_15","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3351258","article-title":"Combating replay attacks against voice assistants","volume":"3","author":"Pradhan","year":"2019","journal-title":"Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."},
{"key":"ref_16","doi-asserted-by":"crossref","first-page":"105331","DOI":"10.1016\/j.dib.2020.105331","article-title":"A dataset of histograms of original and fake voice recordings (H-voice)","volume":"29","author":"Ballesteros","year":"2020","journal-title":"Data Brief"},
{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Singh, A.K., and Singh, P. (2021, January 8\u201310). Detection of ai-synthesized speech using cepstral & bispectral statistics. Proceedings of the 2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR), Tokyo, Japan.","DOI":"10.1109\/MIPR51284.2021.00076"},
{"key":"ref_18","doi-asserted-by":"crossref","first-page":"2","DOI":"10.1186\/s13635-021-00116-3","article-title":"Synthetic speech detection through short-term and long-term prediction traces","volume":"2021","author":"Borrelli","year":"2021","journal-title":"EURASIP J. Inf. Secur."},
{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Todisco, M., Wang, X., Vestman, V., Sahidullah, M., Delgado, H., Nautsch, A., Yamagishi, J., Evans, N., Kinnunen, T., and Lee, K.A. (2019). ASVspoof 2019: Future horizons in spoofed and fake audio detection. arXiv.","DOI":"10.21437\/Interspeech.2019-2249"},
{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Liu, T., Yan, D., Wang, R., Yan, N., and Chen, G. (2021). Identification of fake stereo audio using SVM and CNN. Information, 12.","DOI":"10.3390\/info12070263"},
{"key":"ref_21","unstructured":"Subramani, N., and Rao, D. (2020, January 7\u201312). Learning efficient representations for fake speech detection. Proceedings of the The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA."},
{"key":"ref_22","unstructured":"Bartusiak, E.R., and Delp, E.J. (2021, January 11\u201315). Frequency domain-based detection of generated audio. Proceedings of the Electronic Imaging; Society for Imaging Science and Technology, New York, NY, USA."},
{"key":"ref_23","doi-asserted-by":"crossref","first-page":"162","DOI":"10.1016\/j.neucom.2020.07.099","article-title":"Arabic audio clips: Identification and discrimination of authentic cantillations from imitations","volume":"418","author":"Lataifeh","year":"2020","journal-title":"Neurocomputing"},
{"key":"ref_24","doi-asserted-by":"crossref","first-page":"106503","DOI":"10.1016\/j.dib.2020.106503","article-title":"Ar-DAD: Arabic diversified audio dataset","volume":"33","author":"Lataifeh","year":"2020","journal-title":"Data Brief"},
{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Lei, Z., Yang, Y., Liu, C., and Ye, J. (2020, January 25\u201329). Siamese convolutional neural network using gaussian probability feature for spoofing speech detection. Proceedings of the INTERSPEECH, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-2723"},
{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Hofbauer, H., and Uhl, A. (2016, January 13). Calculating a boundary for the significance from the equal-error rate. Proceedings of the 2016 International Conference on Biometrics (ICB), Halmstad, Sweden.","DOI":"10.1109\/ICB.2016.7550053"},
{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Figueroa-Garc\u00eda, J.C., D\u00edaz-Gutierrez, Y., Gaona-Garc\u00eda, E.E., and Orjuela-Ca\u00f1\u00f3n, A.D. (2021). Fake speech recognition using deep learning. Applied Computer Sciences in Engineering, Springer International Publishing.","DOI":"10.1007\/978-3-030-86702-7"},
{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Reimao, R., and Tzerpos, V. (2019, January 10). For: A dataset for synthetic speech detection. Proceedings of the 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), Timisoara, Romania.","DOI":"10.1109\/SPED.2019.8906599"},
{"key":"ref_29","doi-asserted-by":"crossref","first-page":"4633","DOI":"10.1109\/TNNLS.2017.2771947","article-title":"Guo spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features","volume":"29","author":"Yu","year":"2018","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},
{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Wu, Z., Kinnunen, T., Evans, N., Yamagishi, J., Hanil\u00e7i, C., Sahidullah, M., and Sizov, A. (2015, January 6\u201310). ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge. Proceedings of the Interspeech 2015, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-462"},
{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Wang, R., Juefei-Xu, F., Huang, Y., Guo, Q., Xie, X., Ma, L., and Liu, Y. (2020, January 12\u201316). Deepsonar: Towards effective and robust detection of ai-synthesized fake voices. Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA.","DOI":"10.1145\/3394171.3413716"},
{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Wijethunga, R.L.M.A.P.C., Matheesha, D.M.K., Al Noman, A., De Silva, K.H.V.T.A., Tissera, M., and Rupasinghe, L. (2020, January 10\u201311). Rupasinghe deepfake audio detection: A deep learning based solution for group conversations. Proceedings of the 2020 2nd International Conference on Advancements in Computing (ICAC), Malabe, Sri Lanka.","DOI":"10.1109\/ICAC51239.2020.9357161"},
{"key":"ref_33","doi-asserted-by":"crossref","first-page":"1024","DOI":"10.1109\/JSTSP.2020.2999185","article-title":"Ptucha recurrent convolutional structures for audio spoof and video deepfake detection","volume":"14","author":"Chintha","year":"2020","journal-title":"IEEE J. Sel. Top. Signal. Process."},
{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Kinnunen, T., Lee, K.A., Delgado, H., Evans, N., Todisco, M., Sahidullah, M., Yamagishi, J., and Reynolds, D.A. (2018). T-DCF: A detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification. arXiv.","DOI":"10.21437\/Odyssey.2018-44"},
{"key":"ref_35","unstructured":"Shan, M., and Tsai, T. (2020). A cross-verification approach for protecting world leaders from fake and tampered audio. arXiv."},
{"key":"ref_36","unstructured":"Aravind, P.R., Nechiyil, U., and Paramparambath, N. (2020). Audio spoofing verification using deep convolutional neural networks by transfer learning. arXiv."},
{"key":"ref_37","doi-asserted-by":"crossref","first-page":"3447","DOI":"10.1007\/s13369-021-06297-w","article-title":"A deep learning framework for audio deepfake detection","volume":"47","author":"Khochare","year":"2021","journal-title":"Arab. J. Sci. Eng."},
{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Khalid, H., Kim, M., Tariq, S., and Woo, S.S. (2021, January 20). Evaluation of an audio-video multimodal deepfake dataset using unimodal and multimodal detectors. Proceedings of the 1st Workshop on Synthetic Multimedia, ACM Association for Computing Machinery, New York, NY, USA.","DOI":"10.1145\/3476099.3484315"},
{"key":"ref_39","unstructured":"Khalid, H., Tariq, S., Kim, M., and Woo, S.S. (2021, January 6\u201314). FakeAVCeleb: A novel audio-video multimodal deepfake dataset. Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, Virtual."},
{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Alzantot, M., Wang, Z., and Srivastava, M.B. (2019). Deep residual neural networks for audio spoofing detection. arXiv CoRR.","DOI":"10.21437\/Interspeech.2019-3174"},
{"key":"ref_41","doi-asserted-by":"crossref","first-page":"162857","DOI":"10.1109\/ACCESS.2021.3133134","article-title":"Voice spoofing countermeasure for logical access attacks detection","volume":"9","author":"Arif","year":"2021","journal-title":"IEEE Access"},
{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Lai, C.-I., Chen, N., Villalba, J., and Dehak, N. (2019). ASSERT: Anti-spoofing with squeeze-excitation and residual networks. arXiv.","DOI":"10.21437\/Interspeech.2019-1794"},
{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Jiang, Z., Zhu, H., Peng, L., Ding, W., and Ren, Y. (2020, January 25\u201329). Self-supervised spoofing audio detection scheme. Proceedings of the INTERSPEECH 2020, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-1760"},
{"key":"ref_44","unstructured":"(2022, March 10). Imdat Solak The M-AILABS Speech Dataset. Available online: https:\/\/www.caito.de\/2019\/01\/the-m-ailabs-speech-dataset\/."},
{"key":"ref_45","unstructured":"Arik, S.O., Chen, J., Peng, K., Ping, W., and Zhou, Y. (2018, January 2\u20138). Neural voice cloning with a few samples. Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, QC, Canada."},
{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Yi, J., Fu, R., Tao, J., Nie, S., Ma, H., Wang, C., Wang, T., Tian, Z., Bai, Y., and Fan, C. (2022, January 23\u201327). Add 2022: The first audio deep synthesis detection challenge. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Singapore.","DOI":"10.1109\/ICASSP43922.2022.9746939"},
{"key":"ref_47","unstructured":"Kinnunen, T., Sahidullah, M., Delgado, H., Todisco, M., Evans, N., Yamagishi, J., and Lee, K.A. (2021, November 05). The 2nd Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2017) Database, Version 2. Available online: https:\/\/datashare.ed.ac.uk\/handle\/10283\/3055."},
{"key":"ref_48","unstructured":"Nations, U. (2022, March 05). Official Languages. Available online: https:\/\/www.un.org\/en\/our-work\/official-languages."},
{"key":"ref_49","unstructured":"Almeman, K., and Lee, M. (2013, January 16\u201319). A comparison of arabic speech recognition for multi-dialect vs. specific dialects. Proceedings of the Seventh International Conference on Speech Technology and Human-Computer Dialogue (SpeD 2013), Cluj-Napoca, Romania."},
{"key":"ref_50","doi-asserted-by":"crossref","first-page":"88405","DOI":"10.1109\/ACCESS.2021.3089924","article-title":"An Incremental Approach to Corpus Design and Construction: Application to a Large Contemporary Saudi Corpus","volume":"9","author":"Elgibreen","year":"2021","journal-title":"IEEE Access"},
{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Asif, A., Mukhtar, H., Alqadheeb, F., Ahmad, H.F., and Alhumam, A. (2022). An approach for pronunciation classification of classical arabic phonemes using deep learning. Appl. Sci., 12.","DOI":"10.3390\/app12010238"},
{"key":"ref_52","doi-asserted-by":"crossref","first-page":"200395","DOI":"10.1109\/ACCESS.2020.3034762","article-title":"Optimizing Arabic Speech Distinctive Phonetic Features and Phoneme Recognition Using Genetic Algorithm","volume":"8","author":"Ibrahim","year":"2020","journal-title":"IEEE Access"},
{"key":"ref_53","doi-asserted-by":"crossref","first-page":"102","DOI":"10.22452\/mjcs.vol33no2.2","article-title":"Trends and patterns of text classification techniques: A systematic mapping study","volume":"33","author":"Maw","year":"2020","journal-title":"Malays. J. Comput. Sci."},
{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Rizwan, M., Odelowo, B.O., and Anderson, D.V. (2016, January 24). Word based dialect classification using extreme learning machines. Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada.","DOI":"10.1109\/IJCNN.2016.7727528"},
{"key":"ref_55","first-page":"1","article-title":"Modeling accents for automatic speech recognition","volume":"Volume 1568","author":"Najafian","year":"2013","journal-title":"Proceedings of the 23rd European Signal Proceedings (EUSIPCO)"},
{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Liu, X., Zhang, F., Hou, Z., Mian, L., Wang, Z., Zhang, J., and Tang, J. (2021). Self-supervised learning: Generative or contrastive. IEEE Trans. Knowl. Data Eng.","DOI":"10.1109\/TKDE.2021.3090866"},
{"key":"ref_57","first-page":"241","article-title":"Review paper on noise cancellation using adaptive filters","volume":"11","author":"Jain","year":"2022","journal-title":"Int. J. Eng. Res. Technol."}],
"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/15\/5\/155\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T23:05:55Z","timestamp":1760137555000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/15\/5\/155"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,5,4]]},"references-count":57,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2022,5]]}},"alternative-id":["a15050155"],"URL":"https:\/\/doi.org\/10.3390\/a15050155","relation":{},"ISSN":["1999-4893"],"issn-type":[{"value":"1999-4893","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,5,4]]}}}