{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,12]],"date-time":"2026-06-12T06:09:22Z","timestamp":1781244562206,"version":"3.54.1"},"reference-count":37,"publisher":"Springer Science and Business Media LLC","issue":"5","license":[{"start":{"date-parts":[[2025,3,11]],"date-time":"2025-03-11T00:00:00Z","timestamp":1741651200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,3,11]],"date-time":"2025-03-11T00:00:00Z","timestamp":1741651200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001638","name":"Dublin City University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100001638","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["SIViP"],"published-print":{"date-parts":[[2025,5]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Deepfake detection has become a critical challenge nowadays with the rise of sophisticated generative techniques that manipulate audio-visual data. Existing methods primarily focus on lip movement synchronization using audio and visual features, often relying on local feature extraction with Convolutional Neural Networks (CNNs). In this work, we propose an enhanced multimodal framework that integrates with local and global features for advanced deepfake detection. Our approach extends traditional pipelines by introducing additional visual features such as eye movement and facial regions, combined with audio features to model cross-modal dependencies. While CNNs capture local features, Vision Transformers (ViTs) extract global contextual relationships from both visual and audio modalities. The diffusion models are incorporated as pre-processors to refine noisy data and generate realistic augmentations, ensuring high-quality feature representation. The proposed framework achieves state-of-the-art performance, with accuracy scores of 0.9987, 0.9825, 0.9915, and 0.9812 on the FakeAVCeleb, AV-Deepfake1M, TVIL, and LAV-DF datasets, respectively. These results demonstrate significant improvements over existing methods, highlighting the framework\u2019s superior generalization and robustness in detecting subtle inconsistencies across manipulated audio-visual data.\n<\/jats:p>","DOI":"10.1007\/s11760-025-03970-7","type":"journal-article","created":{"date-parts":[[2025,3,11]],"date-time":"2025-03-11T09:19:09Z","timestamp":1741684749000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":35,"title":["Enhancing multimodal deepfake detection with local\u2013global feature integration and diffusion models"],"prefix":"10.1007","volume":"19","author":[{"given":"Muhammad","family":"Javed","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zhaohui","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Fida Hussain","family":"Dahri","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Teerath","family":"Kumar","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2025,3,11]]},"reference":[{"key":"3970_CR1","doi-asserted-by":"crossref","unstructured":"Rossler, A., et\u00a0al.: Faceforensics++: learning to detect manipulated facial images. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, vol. 2019, pp. 1\u201311, (2019)","DOI":"10.1109\/ICCV.2019.00009"},{"key":"3970_CR2","doi-asserted-by":"crossref","unstructured":"Yang, X., et\u00a0al.: Exposing deep fakes using inconsistent head poses. In: ICASSP 2022\u20132022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (2019)","DOI":"10.1109\/ICASSP.2019.8683164"},{"issue":"2","key":"3970_CR3","doi-asserted-by":"publisher","first-page":"1520","DOI":"10.1002\/widm.1520","volume":"14","author":"A Heidari","year":"2024","unstructured":"Heidari, A., et al.: Deepfake detection using deep learning methods: a systematic and comprehensive review. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 14(2), 1520 (2024)","journal-title":"Wiley Interdiscip. Rev. Data Min. Knowl. Discov."},{"key":"3970_CR4","doi-asserted-by":"crossref","unstructured":"Haliassos, A., et\u00a0al.: Lips don\u2019t lie: a generalisable and robust approach to face forgery detection. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 5037\u20135047, (2021)","DOI":"10.1109\/CVPR46437.2021.00500"},{"key":"3970_CR5","doi-asserted-by":"crossref","unstructured":"Cui, Y., Tao, Y., Bing, Z., Ren, W., Gao, X., Cao, X., Huang, K., Knoll, A.: Selective frequency network for image restoration. In: The Eleventh International Conference on Learning Representations, (2023)","DOI":"10.1109\/ICCV51070.2023.01195"},{"key":"3970_CR6","doi-asserted-by":"crossref","unstructured":"Cui, Y., Ren, W., Knoll, A.: Omni-kernel network for image restoration. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 1426\u20131434, (2024)","DOI":"10.1609\/aaai.v38i2.27907"},{"key":"3970_CR7","doi-asserted-by":"publisher","first-page":"5735","DOI":"10.1109\/LRA.2023.3300254","volume":"8","author":"Y Cui","year":"2023","unstructured":"Cui, Y., Knoll, A.: Psnet: towards efficient image restoration with self-attention. IEEE Robot. Autom. Lett. 8, 5735\u20135742 (2023)","journal-title":"IEEE Robot. Autom. Lett."},{"key":"3970_CR8","doi-asserted-by":"crossref","unstructured":"Cai, Z., et\u00a0al.: Do you really mean that? Content driven audio-visual deepfake dataset and multimodal method for temporal forgery localization. In: 2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA), (2022)","DOI":"10.1109\/DICTA56598.2022.10034605"},{"key":"3970_CR9","doi-asserted-by":"publisher","first-page":"83144","DOI":"10.1109\/ACCESS.2020.2988660","volume":"8","author":"T Jung","year":"2020","unstructured":"Jung, T., et al.: Deepvision: deepfakes detection using human eye blinking pattern. IEEE Access 8, 83144\u201383154 (2020)","journal-title":"IEEE Access"},{"key":"3970_CR10","unstructured":"Tian, M., et\u00a0al.: Unsupervised multimodal deepfake detection using intra- and cross-modal inconsistencies. arXiv [Online]. Available: arXiv:2311.17088, (2023)"},{"key":"3970_CR11","doi-asserted-by":"crossref","unstructured":"Cai, Z., et\u00a0al.: AV-Deepfake1M: a large-scale LLM-driven audio-visual deepfake dataset. In: MM \u201924: Proceedings of the 32nd ACM International Conference on Multimedia, pp. 7414\u20137423, (2024)","DOI":"10.1145\/3664647.3680795"},{"key":"3970_CR12","unstructured":"Khalid, H., et\u00a0al.: Fakeavceleb: a novel audio-video multimodal deepfake dataset, (2021). Available: arXiv:2108.05080"},{"key":"3970_CR13","doi-asserted-by":"crossref","unstructured":"Zhang, R., et\u00a0al.: Ummaformer: a universal multimodal-adaptive transformer framework for temporal forgery localization. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 8749\u20138759, (2023)","DOI":"10.1145\/3581783.3613767"},{"key":"3970_CR14","doi-asserted-by":"publisher","first-page":"110124","DOI":"10.1016\/j.asoc.2023.110124","volume":"136","author":"H Ilyas","year":"2023","unstructured":"Ilyas, H., et al.: AVFakenet: a unified end-to-end dense swin transformer deep learning model for audio-visual deepfakes detection. Appl. Soft Comput. 136, 110124 (2023)","journal-title":"Appl. Soft Comput."},{"key":"3970_CR15","doi-asserted-by":"crossref","unstructured":"Jia, S., Li, X., Lyu, S.: Model attribution of face-swap deepfake videos. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 2356\u20132360, (2022)","DOI":"10.1109\/ICIP46576.2022.9897972"},{"key":"3970_CR16","unstructured":"Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. Available: arXiv:1508.01991 (2015)"},{"issue":"2","key":"3970_CR17","doi-asserted-by":"publisher","first-page":"1","DOI":"10.9734\/ajarr\/2024\/v18i2601","volume":"18","author":"TO Oladoyinbo","year":"2024","unstructured":"Oladoyinbo, T.O., Olabanji, S.O., Olaniyi, O.O., Adebiyi, O.O., Okunleye, O.J., Alao, A.I.: Exploring the challenges of artificial intelligence in data integrity and its influence on social dynamics. Asian J. Adv. Res. Rep. 18(2), 1\u201323 (2024)","journal-title":"Asian J. Adv. Res. Rep."},{"issue":"4","key":"3970_CR18","doi-asserted-by":"publisher","first-page":"352","DOI":"10.1080\/23268743.2020.1765851","volume":"7","author":"K Kikerpill","year":"2020","unstructured":"Kikerpill, K.: Choose your stars and studs: the rise of deepfake designer porn. Porn Stud. 7(4), 352\u2013356 (2020)","journal-title":"Porn Stud."},{"issue":"42","key":"3970_CR19","first-page":"1","volume":"12","author":"EJ Aloke","year":"2023","unstructured":"Aloke, E.J., Abah, J.: Enhancing the fight against social media misinformation: an ensemble deep learning framework for detecting deepfakes. Int. J. Appl. Inf. Syst. 12(42), 1\u201314 (2023)","journal-title":"Int. J. Appl. Inf. Syst."},{"key":"3970_CR20","doi-asserted-by":"publisher","first-page":"59204","DOI":"10.1109\/ACCESS.2023.3285826","volume":"11","author":"MO Alassafi","year":"2023","unstructured":"Alassafi, M.O., et al.: A novel deep learning architecture with image diffusion for robust face presentation attack detection. IEEE Access 11, 59204\u201359216 (2023)","journal-title":"IEEE Access"},{"key":"3970_CR21","doi-asserted-by":"crossref","unstructured":"Thakur, R.: Introduction to artificial intelligence and its importance in modern business management. pp. 133\u2013165, (2023)","DOI":"10.4018\/979-8-3693-1902-4.ch009"},{"key":"3970_CR22","doi-asserted-by":"crossref","unstructured":"Katamneni, V.S., Rattani, A.: MIS-AVoiDD: modality invariant and specific representation for audio-visual deepfake detection. In: 2023 International Conference on Machine Learning and Applications (ICMLA), pp. 1371\u20131378, (2023)","DOI":"10.1109\/ICMLA58977.2023.00207"},{"issue":"5","key":"3970_CR23","doi-asserted-by":"publisher","first-page":"4925","DOI":"10.1007\/s12652-020-01932-0","volume":"12","author":"M Arun Anoop","year":"2021","unstructured":"Arun Anoop, M., Poonkuntran, S.: LPG: a novel approach for medical forgery detection in image transmission. J. Ambient. Intell. Humaniz. Comput. 12(5), 4925\u20134941 (2021)","journal-title":"J. Ambient. Intell. Humaniz. Comput."},{"issue":"19","key":"3970_CR24","doi-asserted-by":"publisher","first-page":"8842","DOI":"10.3390\/app11198842","volume":"11","author":"A Chandio","year":"2021","unstructured":"Chandio, A., et al.: AUDD: audio Urdu digits dataset for automatic audio Urdu digit recognition. Appl. Sci. 11(19), 8842 (2021)","journal-title":"Appl. Sci."},{"key":"3970_CR25","doi-asserted-by":"crossref","unstructured":"Turab, M., et\u00a0al.: Investigating multi-feature selection and ensembling for audio classification, (2022). arXiv [Online]. Available: arXiv:2206.07511","DOI":"10.5121\/ijaia.2022.13306"},{"issue":"5","key":"3970_CR26","doi-asserted-by":"publisher","first-page":"3557","DOI":"10.1109\/TPAMI.2024.3350004","volume":"46","author":"A Melnik","year":"2024","unstructured":"Melnik, A., et al.: Face generation and editing with stylegan: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 46(5), 3557\u20133576 (2024)","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"3970_CR27","doi-asserted-by":"crossref","unstructured":"Rana, M.S., Sung, A.H.: Deepfakestack: a deep ensemble-based learning technique for deepfake detection. In: 2020 7th IEEE International Conference on Cyber Security and Cloud Computing (CSCloud), pp. 70\u201375, (2020)","DOI":"10.1109\/CSCloud-EdgeCom49738.2020.00021"},{"key":"3970_CR28","doi-asserted-by":"crossref","unstructured":"Liang et\u00a0al, T.: Sdhf: spotting deepfakes with hierarchical features. In: 2020 IEEE 32nd international conference on tools with artificial intelligence (ICTAI), pp. 675\u2013680, (2020)","DOI":"10.1109\/ICTAI50040.2020.00108"},{"key":"3970_CR29","doi-asserted-by":"crossref","unstructured":"Hashmi et\u00a0al, A.: Multimodal forgery detection using ensemble learning. In: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2022, pp. 1524\u20131532, (2022)","DOI":"10.23919\/APSIPAASC55919.2022.9980255"},{"key":"3970_CR30","doi-asserted-by":"publisher","first-page":"103860","DOI":"10.1016\/j.cose.2024.103860","volume":"142","author":"K Jayashre","year":"2024","unstructured":"Jayashre, K., Amsaprabhaa, M.: Safeguarding media integrity: a hybrid optimized deep feature fusion based deepfake detection in videos. Comput. Secur. 142, 103860 (2024)","journal-title":"Comput. Secur."},{"issue":"2","key":"3970_CR31","doi-asserted-by":"publisher","first-page":"672","DOI":"10.1109\/TCDS.2021.3064679","volume":"14","author":"W Liu","year":"2022","unstructured":"Liu, W., et al.: Data-fusion-based two-stage cascade framework for multimodality face anti-spoofing. IEEE Trans. Cogn. Dev. Syst. 14(2), 672\u2013683 (2022)","journal-title":"IEEE Trans. Cogn. Dev. Syst."},{"key":"3970_CR32","doi-asserted-by":"crossref","unstructured":"Khalid et\u00a0al, H.: Evaluation of an audio-video multimodal deepfake dataset using unimodal and multimodal detectors. In: Proceedings of the 1st Workshop on Synthetic Multimedia-Audiovisual Deepfake Generation and Detection Co-located with ACM MM 2021, pp. 7\u201315, (2021)","DOI":"10.1145\/3476099.3484315"},{"issue":"5","key":"3970_CR33","doi-asserted-by":"publisher","first-page":"1231","DOI":"10.1109\/TSP.2003.810293","volume":"51","author":"Lutfiye Durak","year":"2003","unstructured":"Durak, Lutfiye, Arikan, Orhan: Short-time Fourier transform: two fundamental properties and an optimal implementation. IEEE Trans. Signal Process. 51(5), 1231\u20131242 (2003)","journal-title":"IEEE Trans. Signal Process."},{"issue":"4","key":"3970_CR34","doi-asserted-by":"publisher","first-page":"101","DOI":"10.5121\/sipij.2013.4408","volume":"4","author":"Shikha Gupta","year":"2013","unstructured":"Gupta, Shikha, Jaafar, Jafreezal, Wan Ahmad, W.F., Bansal, Arpit: Feature extraction using MFCC. Signal Image Process. Int. J. 4(4), 101\u2013108 (2013)","journal-title":"Signal Image Process. Int. J."},{"issue":"3","key":"3970_CR35","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3625231","volume":"20","author":"H Cheng","year":"2023","unstructured":"Cheng, H., et al.: Voice-face homogeneity tells deepfake. ACM Trans. Multimed. Comput. Commun. Appl. 20(3), 1\u201322 (2023)","journal-title":"ACM Trans. Multimed. Comput. Commun. Appl."},{"key":"3970_CR36","doi-asserted-by":"crossref","unstructured":"Anas Raza, M., Mahmood Malik, K.: Multimodaltrace: deepfake detection using audiovisual representation learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Work, vol. 2023-June, pp. 993\u20131000, (2023)","DOI":"10.1109\/CVPRW59228.2023.00106"},{"issue":"1","key":"3970_CR37","first-page":"407","volume":"14","author":"M Elpeltagy","year":"2023","unstructured":"Elpeltagy, M., et al.: A novel smart deepfake video detection system. Int. J. Adv. Comput. Sci. Appl. 14(1), 407\u2013419 (2023)","journal-title":"Int. J. Adv. Comput. Sci. Appl."}],"container-title":["Signal, Image and Video Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11760-025-03970-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11760-025-03970-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11760-025-03970-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,4,8]],"date-time":"2025-04-08T20:11:06Z","timestamp":1744143066000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11760-025-03970-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,11]]},"references-count":37,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2025,5]]}},"alternative-id":["3970"],"URL":"https:\/\/doi.org\/10.1007\/s11760-025-03970-7","relation":{},"ISSN":["1863-1703","1863-1711"],"issn-type":[{"value":"1863-1703","type":"print"},{"value":"1863-1711","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,3,11]]},"assertion":[{"value":"31 December 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 February 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"15 February 2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 March 2025","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}],"article-number":"400"}}