{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,18]],"date-time":"2026-06-18T16:16:33Z","timestamp":1781799393863,"version":"3.54.5"},"reference-count":82,"publisher":"Association for Computing Machinery (ACM)","issue":"11","license":[{"start":{"date-parts":[[2024,9,12]],"date-time":"2024-09-12T00:00:00Z","timestamp":1726099200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,11,30]]},"abstract":"<jats:p>Recent advances in generative models and the availability of large-scale benchmarks have made deepfake video generation and manipulation easier. Nowadays, the number of new hyper-realistic deepfake videos used for negative purposes is dramatically increasing, thus creating the need for effective deepfake detection methods. Although many existing deepfake detection approaches, particularly CNN-based methods, show promising results, they suffer from several drawbacks. In general, poor generalization results have been obtained under unseen\/new deepfake generation methods. The crucial reason for the above defect is that CNN-based methods focus on the local spatial artifacts, which are unique for every manipulation method. Therefore, it is hard to learn the general forgery traces of different manipulation methods without considering the dependencies that extend beyond the local receptive field. To address this problem, this article proposes a framework that combines Convolutional Neural Network (CNN) with Vision Transformer (ViT) to improve detection accuracy and enhance generalizability. 
Our method, named <jats:italic>HCiT<\/jats:italic>, exploits the advantages of CNNs to extract meaningful local features, as well as the ViT\u2019s self-attention mechanism to learn discriminative global contextual dependencies in a frame-level image explicitly. In this hybrid architecture, the high-level feature maps extracted from the CNN are fed into the ViT model that determines whether a specific video is fake or real. Experiments were performed on Faceforensics++, DeepFake Detection Challenge preview, Celeb datasets, and the results show that the proposed method significantly outperforms the state-of-the-art methods. In addition, the HCiT method shows a great capacity for generalization on datasets covering various techniques of deepfake generation. The source code is available at: <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/KADDAR-Bachir\/HCiT\">https:\/\/github.com\/KADDAR-Bachir\/HCiT<\/jats:ext-link><\/jats:p>","DOI":"10.1145\/3643030","type":"journal-article","created":{"date-parts":[[2024,1,23]],"date-time":"2024-01-23T12:29:46Z","timestamp":1706012986000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":43,"title":["Deepfake Detection Using Spatiotemporal Transformer"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4385-8683","authenticated-orcid":false,"given":"Bachir","family":"Kaddar","sequence":"first","affiliation":[{"name":"University of Ibn Khaldoun-Tiaret, Tiaret, Algeria"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5026-5416","authenticated-orcid":false,"given":"Sid Ahmed","family":"Fezza","sequence":"additional","affiliation":[{"name":"National Higher School of Telecommunications and ICT, Oran, 
Algeria"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0143-1756","authenticated-orcid":false,"given":"Zahid","family":"Akhtar","sequence":"additional","affiliation":[{"name":"State University of New York Polytechnic Institute, Utica NY, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6453-8588","authenticated-orcid":false,"given":"Wassim","family":"Hamidouche","sequence":"additional","affiliation":[{"name":"University of Rennes, INSA Rennes, CNRS, Rennes, France"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9092-735X","authenticated-orcid":false,"given":"Abdenour","family":"Hadid","sequence":"additional","affiliation":[{"name":"Sorbonne Center for Artificial Intelligence, Sorbonne University Abu Dhabi, Abu Dhabi, UAE"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4729-9292","authenticated-orcid":false,"given":"Joan","family":"Serra-Sagrist\u00e1","sequence":"additional","affiliation":[{"name":"Universitat Aut\u00f2noma de Barcelona, Bellaterra, Spain"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2024,9,12]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/WIFS.2018.8630761"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/BTAS.2017.8272754"},{"key":"e_1_3_1_4_2","first-page":"38","volume-title":"Proceedings of the CVPR Workshops","volume":"1","author":"Agarwal Shruti","year":"2019","unstructured":"Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li. 2019. Protecting world leaders against deep fakes. In Proceedings of the CVPR Workshops. Vol. 1, 38."},{"key":"e_1_3_1_5_2","unstructured":"Henry Ajder Giorgio Patrini Francesco Cavalli and Laurence Cullen. 2019. The State of Deepfakes: Landscape Threats and Impact. 
Deeptrace Amsterdam."},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/HST47167.2019.9033005"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2019.00152"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/IWBF.2018.8401564"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/2909827.2930786"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2018.2825953"},{"key":"e_1_3_1_11_2","unstructured":"Miko\u0142aj Bi\u0144kowski Dougal J. Sutherland Michael Arbel and Arthur Gretton. 2018. Demystifying MMD GANS. International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=r1lUOzWCW"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/1399504.1360638"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00408"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.2139\/ssrn.3213954"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.195"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/WIFS.2015.7368565"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2019.2916364"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_1_19_2","doi-asserted-by":"crossref","unstructured":"Jiankang Deng Jia Guo Evangelos Ververas Irene Kotsia and Stefanos Zafeiriou. 2020. RetinaFace: Single-shot multi-level face localisation in the wild. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR).","DOI":"10.1109\/CVPR42600.2020.00525"},{"key":"e_1_3_1_20_2","unstructured":"Brian Dolhansky Russ Howes Ben Pflaum Nicole Baram and Cristian Canton Ferrer. 2019. The deepfake detection challenge (DFDC) preview dataset. arXiv:1910.08854. 
Retrieved from https:\/\/arxiv.org\/abs\/1910.08854"},{"key":"e_1_3_1_21_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly Jakob Uszkoreit and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=YicbFdNTTy"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1088\/1742-5468\/ac9830"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/10451.001.0001"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2012.2190402"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1117\/12.2078399"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3422622"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/AVSS.2018.8639163"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2017.373"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2015.2389824"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_31_2","unstructured":"Young-Jin Heo Young-Ju Choi Young-Woon Lee and Byung-Gyu Kim. 2021. Deepfake detection scheme based on vision transformer and distillation. arXiv:2104.01353. Retrieved from https:\/\/arxiv.org\/abs\/2104.01353"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01252-6_7"},{"issue":"5","key":"e_1_3_1_33_2","first-page":"053033","article-title":"On the effectiveness of handcrafted features for deepfake video detection","volume":"32","author":"Kaddar Bachir","year":"2023","unstructured":"Bachir Kaddar, Sid Ahmed Fezza, Wassim Hamidouche, Zahid Akhtar, and Abdenour Hadid. 2023. On the effectiveness of handcrafted features for deepfake video detection. 
Journal of Electronic Imaging 32, 5 (2023), 053033\u2013053033.","journal-title":"Journal of Electronic Imaging"},{"key":"e_1_3_1_34_2","unstructured":"Tero Karras Timo Aila Samuli Laine and Jaakko Lehtinen. 2018. Progressive growing of GANs for improved quality stability and variation. International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=Hk99zCeAb"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.5555\/1577069.1755843"},{"key":"e_1_3_1_36_2","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. International Conference on Learning Representations. http:\/\/arxiv.org\/abs\/1412.6980"},{"key":"e_1_3_1_37_2","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Korshunov Pavel","year":"2019","unstructured":"Pavel Korshunov, Michael Halstead, Diego Castan, Martin Graciarena, Mitchell McLaren, Brian Burns, Aaron Lawson, and Sebastien Marcel. 2019. Tampered speaker inconsistency detection with phonetically aware audio-visual features. In Proceedings of the International Conference on Machine Learning."},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.23919\/EUSIPCO.2018.8553270"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9414258"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00505"},{"key":"e_1_3_1_41_2","doi-asserted-by":"crossref","unstructured":"Yuezun Li Ming-Ching Chang and Siwei Lyu. 2018. In ictu oculi: Exposing AI created fake videos by detecting eye blinking. IEEE International Workshop on Information Forensics and Security (WIFS\u201918) IEEE 1\u20137.","DOI":"10.1109\/WIFS.2018.8630787"},{"key":"e_1_3_1_42_2","unstructured":"Yuezun Li and Siwei Lyu. 2019. Exposing deepfake videos by detecting face warping artifacts. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops."},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00327"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2023.3293951"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461904"},{"key":"e_1_3_1_46_2","article-title":"The best (and scariest) examples of AI-enabled deepfakes","author":"Marr Bernard","year":"2019","unstructured":"Bernard Marr. 2019. The best (and scariest) examples of AI-enabled deepfakes. Forbes. Retrieved from https:\/\/cutt.ly\/vK0OcsP. Accessed 01-March-2022.","journal-title":"Forbes"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/MIPR.2018.00084"},{"key":"e_1_3_1_48_2","doi-asserted-by":"crossref","unstructured":"Momina Masood Mariam Nawaz Khalid Mahmood Malik Ali Javed Aun Irtaza and Hafiz Malik. 2023. Deepfakes generation and detection: State-of-the-art open challenges countermeasures and way forward. Applied Intelligence 53 4 (2023) 3974\u20134026.","DOI":"10.1007\/s10489-022-03766-z"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACVW.2019.00020"},{"key":"e_1_3_1_50_2","doi-asserted-by":"crossref","unstructured":"Scott McCloskey and Michael Albright. 2018. Detecting GAN-generated imagery using color cues. arXiv:1812.08247. 
Retrieved from https:\/\/arxiv.org\/abs\/1812.08247","DOI":"10.1109\/ICIP.2019.8803661"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/3425780"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/3206004.3206009"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2020.3007250"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/BTAS46853.2019.9185974"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8682602"},{"key":"e_1_3_1_56_2","doi-asserted-by":"crossref","unstructured":"Huy H. Nguyen Junichi Yamagishi and Isao Echizen. 2019. Use of a capsule network to detect fake images and videos. arXiv:1910.12467. Retrieved from https:\/\/arxiv.org\/abs\/1910.12467","DOI":"10.1109\/ICASSP.2019.8682602"},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCPhot.2012.6215223"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/WIFS.2017.8267647"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-010-0620-1"},{"key":"e_1_3_1_60_2","unstructured":"Andreas R\u00f6ssler Davide Cozzolino Luisa Verdoliva Christian Riess Justus Thies and Matthias Nie\u00dfner. 2018. Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv:1803.09179. Retrieved from https:\/\/arxiv.org\/abs\/1803.09179"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00009"},{"key":"e_1_3_1_62_2","first-page":"213","volume-title":"Handbook of Digital Face Manipulation and Detection: From DeepFakes to Morphing Attacks","author":"Roy Ritaban","year":"2022","unstructured":"Ritaban Roy, Indu Joshi, Abhijit Das, and Antitza Dantcheva. 2022. 3D CNN architectures and attention mechanisms for deepfake detection. In Handbook of Digital Face Manipulation and Detection: From DeepFakes to Morphing Attacks, Christian Rathgeb, Ruben Tolosana, Ruben Vera-Rodriguez, and Christoph Busch (Eds.). 
Springer International Publishing, Cham, 213\u2013234."},{"issue":"1","key":"e_1_3_1_63_2","first-page":"80","article-title":"Recurrent convolutional strategies for face manipulation detection in videos","volume":"3","author":"Sabir Ekraam","year":"2019","unstructured":"Ekraam Sabir, Jiaxin Cheng, Ayush Jaiswal, Wael AbdAlmageed, Iacopo Masi, and Prem Natarajan. 2019. Recurrent convolutional strategies for face manipulation detection in videos. Interfaces (GUI) 3, 1 (2019), 80\u201387.","journal-title":"Interfaces (GUI)"},{"key":"e_1_3_1_64_2","unstructured":"Rulin Shao Zhouxing Shi Jinfeng Yi Pin-Yu Chen and Cho-Jui Hsieh. 2022. On the Adversarial Robustness of Vision Transformers. https:\/\/openreview.net\/forum?id=O0g6uPDLW7"},{"key":"e_1_3_1_65_2","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations. https:\/\/arxiv.org\/abs\/1409.1556"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1007\/11941439_114"},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i2.20130"},{"key":"e_1_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.11231"},{"key":"e_1_3_1_69_2","first-page":"6105","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Tan Mingxing","year":"2019","unstructured":"Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning. 
PMLR, 6105\u20136114."},{"key":"e_1_3_1_70_2","doi-asserted-by":"publisher","DOI":"10.1145\/3197517.3201350"},{"key":"e_1_3_1_71_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2020.06.014"},{"key":"e_1_3_1_72_2","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.1145\/1390156.1390294"},{"key":"e_1_3_1_74_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01468"},{"key":"e_1_3_1_75_2","doi-asserted-by":"publisher","DOI":"10.1145\/3512527.3531415"},{"key":"e_1_3_1_76_2","unstructured":"Deressa Wodajo and Solomon Atnafu. 2021. Deepfake video detection using convolutional vision transformer. arXiv:2102.11126. Retrieved from https:\/\/arxiv.org\/abs\/2102.11126"},{"key":"e_1_3_1_77_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8683164"},{"key":"e_1_3_1_78_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00765"},{"key":"e_1_3_1_79_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00062"},{"key":"e_1_3_1_80_2","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2016.2603342"},{"key":"e_1_3_1_81_2","doi-asserted-by":"publisher","DOI":"10.1109\/SIPROCESS.2017.8124497"},{"key":"e_1_3_1_82_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2017.229"},{"key":"e_1_3_1_83_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2017.8296389"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and 
Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3643030","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3643030","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T16:31:21Z","timestamp":1750264281000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3643030"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,12]]},"references-count":82,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2024,11,30]]}},"alternative-id":["10.1145\/3643030"],"URL":"https:\/\/doi.org\/10.1145\/3643030","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,9,12]]},"assertion":[{"value":"2023-04-20","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-08","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-09-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}