{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,7]],"date-time":"2026-03-07T02:28:39Z","timestamp":1772850519947,"version":"3.50.1"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"1","funder":[{"name":"Key R&D Program Guidance Projects of Heilongjiang Province","award":["GZ20210065"],"award-info":[{"award-number":["GZ20210065"]}]},{"DOI":"10.13039\/501100005046","name":"National Science Foundation of Heilongjiang Province","doi-asserted-by":"crossref","award":["LH2019F024"],"award-info":[{"award-number":["LH2019F024"]}],"id":[{"id":"10.13039\/501100005046","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Priv. Secur."],"published-print":{"date-parts":[[2026,2,28]]},"abstract":"<jats:p>With the continuous evolution of deep learning, forgery techniques have undergone constant innovation, providing convenience to individuals and resulting in significant negative consequences. Notably, these forged videos have become remarkably realistic, nearly indistinguishable to the human eye, posing a formidable challenge in forgery detection. However, many current Deepfake detection models focus on improving evaluation metrics and model architecture design, often lacking the necessary generality and practicality. We propose a Deepfake detection method based on a hybrid network in response to these challenges. Our approach utilizes an improved EfficientNetV2S as the backbone, replacing the original Fused-Conv module with a Tok-MLP module and integrating an attention mechanism at the end of the backbone. Subsequently, the backbone's output is fed into a Vision Transformer (VIT) for classification. Extensive work in data preprocessing includes training our model on three datasets: DFDC, Celeb-DF v2, and FaceForensics++. The achieved results are exceptionally competitive. Additionally, visual analysis of DFDC dataset videos validates the practicality of our approach, yielding outstanding results. In conclusion, the relentless evolution of Deepfake technology poses challenges and opportunities. Our novel Deepfake detection method, grounded in a hybrid network, enhances the capabilities of existing models, ensuring practicality and effectiveness in real-world scenarios.<\/jats:p>","DOI":"10.1145\/3777412","type":"journal-article","created":{"date-parts":[[2025,12,12]],"date-time":"2025-12-12T10:56:47Z","timestamp":1765537007000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Deepfake Video Detection Based on Improved EfficientNetV2S and Transformer Network"],"prefix":"10.1145","volume":"29","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3996-5569","authenticated-orcid":false,"given":"Liwei","family":"Deng","sequence":"first","affiliation":[{"name":"School of automation, Harbin University of Science and Technology","place":["Harbin, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-0883-535X","authenticated-orcid":false,"given":"Yunlong","family":"Zhu","sequence":"additional","affiliation":[{"name":"School of automation, Harbin University of Science and Technology","place":["Harbin, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-2401-2159","authenticated-orcid":false,"given":"Fei","family":"Chen","sequence":"additional","affiliation":[{"name":"School of automation, Harbin University of Science and Technology","place":["Harbin, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-6886-8297","authenticated-orcid":false,"given":"Liangchao","family":"Gao","sequence":"additional","affiliation":[{"name":"China State Shipbuilding Corporation Limited 703rd Research Institute","place":["Harbin, China"]}]}],"member":"320","published-online":{"date-parts":[[2026,2,5]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1812.08685"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","unstructured":"D. Wodajo and S. Atnafu. 2021. Deepfake video detection using convolutional vision transformer. DOI:10.48550\/arXiv.2102.11126","DOI":"10.48550\/arXiv.2102.11126"},{"key":"e_1_3_1_4_2","doi-asserted-by":"crossref","first-page":"4318","DOI":"10.1145\/3394171.3413707","volume-title":"Proceedings of the 28th ACM International Conference on Multimedia","author":"Qi H.","year":"2020","unstructured":"H. Qi, Q. Guo, F. Juefei-Xu, et al. 2020. Deeprhythm: Exposing deepfakes with attentional visual heartbeat rhythms. In Proceedings of the 28th ACM International Conference on Multimedia. 4318\u20134327."},{"key":"e_1_3_1_5_2","first-page":"10096","article-title":"Efficientnetv2: Smaller models and faster training","author":"Tan M.","year":"2021","unstructured":"M. Tan and Q. Le. 2021. Efficientnetv2: Smaller models and faster training. In International Conference on Machine Learning. 10096\u201310106.","journal-title":"International Conference on Machine Learning"},{"key":"e_1_3_1_6_2","first-page":"23","article-title":"Unext: Mlp-Based rapid medical image segmentation network","author":"Valanarasu J. M. J.","year":"2022","unstructured":"J. M. J. Valanarasu and V. M. Patel. 2022. Unext: Mlp-Based rapid medical image segmentation network.In International Conference on Medical Image Computing and Computer-Assisted Intervention. 23\u201333.","journal-title":"International Conference on Medical Image Computing and Computer-Assisted Intervention"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00069"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","unstructured":"A. Dosovitskiy L. Beyer A Kolesnikov et al. 2020. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. DOI:10.48550\/arXiv.2010.11929","DOI":"10.48550\/arXiv.2010.11929"},{"key":"e_1_3_1_9_2","first-page":"27","article-title":"Generative adversarial nets","author":"Goodfellow I.","year":"2014","unstructured":"I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems. 27.","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"4","key":"e_1_3_1_10_2","doi-asserted-by":"crossref","first-page":"3313","DOI":"10.1109\/TKDE.2021.3130191","article-title":"A review on generative adversarial networks: Algorithms, theory, and applications","volume":"35","author":"Gui J.","year":"2021","unstructured":"J. Gui, Z. Sun, Y. Wen, et al. 2021. A review on generative adversarial networks: Algorithms, theory, and applications. IEEE Transactions on Knowledge and Data Engineering 35, 4 (2021), 3313\u20133332.","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"issue":"12","key":"e_1_3_1_11_2","doi-asserted-by":"crossref","first-page":"17521","DOI":"10.1007\/s11042-022-13797-w","article-title":"Image forgery detection: A survey of recent deep-learning approaches","volume":"82","author":"Zanardelli M.","year":"2023","unstructured":"M. Zanardelli, F. Guerrini, R. Leonardi, et al. 2023. Image forgery detection: A survey of recent deep-learning approaches. Multimedia Tools and Applications. 82, 12 (2023), 17521\u201317566.","journal-title":"Multimedia Tools and Applications"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.397"},{"key":"e_1_3_1_13_2","first-page":"7184","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Nirkin Y.","year":"2019","unstructured":"Y. Nirkin, Y. Keller, and T. Hassner. 2019. Fsgan: Subject agnostic face swapping and reenactment. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 7184\u20137193."},{"key":"e_1_3_1_14_2","first-page":"3404","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Gao G.","year":"2021","unstructured":"G. Gao, H. Huang, C. Fu, et al. 2021. Information bottleneck disentanglement for identity swapping. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 3404\u20133413."},{"key":"e_1_3_1_15_2","first-page":"16890","article-title":"Learning personalized high quality volumetric head avatars from monocular rgb videos","author":"Bai Z.","year":"2023","unstructured":"Z. Bai, F. Tan, Z. Huang, et al. 2023. Learning personalized high quality volumetric head avatars from monocular rgb videos. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 16890\u201316900.","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"e_1_3_1_16_2","first-page":"2387","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Thies J.","year":"2016","unstructured":"J. Thies, M. Zollhofer, M. Stamminger, et al. 2016. Face2face: Real-Time face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2387\u20132395."},{"issue":"4","key":"e_1_3_1_17_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3072959.3073640","article-title":"Synthesizing Obama: Learning lip sync from audio","volume":"36","author":"Suwajanakorn S.","year":"2017","unstructured":"S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. 2017. Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics (ToG) 36, 4 (2017), 1\u201313.","journal-title":"ACM Transactions on Graphics (ToG)"},{"issue":"4","key":"e_1_3_1_18_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3306346.3323035","article-title":"Deferred neural rendering: image synthesis using neural textures","volume":"38","author":"Thies J.","year":"2019","unstructured":"J. Thies, M. Zollh\u00f6fer, and M. Nie\u00dfner. 2019. Deferred neural rendering: image synthesis using neural textures. Acm Transactions on Graphics (TOG) 38, 4 (2019), 1\u201312.","journal-title":"Acm Transactions on Graphics (TOG)"},{"key":"e_1_3_1_19_2","first-page":"1","volume-title":"2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW)","author":"Lyu S.","year":"2020","unstructured":"S. Lyu. 2020. Deepfake detection: current challenges and next steps. In 2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 1\u20136."},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/WIFS.2018.8630787"},{"key":"e_1_3_1_21_2","doi-asserted-by":"crossref","first-page":"519","DOI":"10.1145\/3382507.3418857","volume-title":"Proceedings of the 2020 International Conference on Multimodal Interaction","author":"Gupta P.","year":"2020","unstructured":"P. Gupta, K. Chugh, A. Dhall, et al. 2020. The eyes know It: Fakeet-an Eye-tracking database to understand deepfake perception. In Proceedings of the 2020 International Conference on Multimodal Interaction. 519\u2013527."},{"key":"e_1_3_1_22_2","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1145\/3267357.3267367","volume-title":"Proceedings of the 2nd International Workshop on Multimedia Privacy and Security","author":"Tariq S.","year":"2018","unstructured":"S. Tariq, S. Lee, H. Kim, et al. 2018. Detecting both machine and human created fake face images in the wild. In Proceedings of the 2nd International Workshop on Multimedia Privacy and Security. 81\u201387. DOI:10.1145\/3267357.3267367"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/WIFS.2018.8630761"},{"key":"e_1_3_1_24_2","first-page":"2185","article-title":"Multi-Attentional deepfake detection","author":"Zhao H.","year":"2021","unstructured":"H. Zhao, W. Zhou, D. Chen, et al. 2021. Multi-Attentional deepfake detection. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 2185\u20132194.","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"e_1_3_1_25_2","first-page":"1","volume-title":"2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","author":"G\u00fcera D.","year":"2018","unstructured":"D. G\u00fcera and E. J. Delp. 2018. Deepfake video detection using recurrent neural networks. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 1\u20136. DOI:10.1109\/AVSS.2018.8639163"},{"key":"e_1_3_1_26_2","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1109\/I4Tech48345.2020.9102668","volume-title":"2020 International Conference on Industry 4.0 Technology (I4Tech)","author":"Dhere S.","year":"2020","unstructured":"S. Dhere, S. B. Rathod S. Aarankalle, et al. 2020. A review on face reenactment techniques. In 2020 International Conference on Industry 4.0 Technology (I4Tech). IEEE, 191\u2013194."},{"issue":"10","key":"e_1_3_1_27_2","doi-asserted-by":"crossref","first-page":"6111","DOI":"10.1109\/TPAMI.2021.3093446","article-title":"Deepfake detection based on discrepancies between faces and their context","volume":"44","author":"Nirkin Y.","year":"2021","unstructured":"Y. Nirkin, L. Wolf, Y. Keller, et al. 2021. Deepfake detection based on discrepancies between faces and their context. In IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 10 (2021), 6111\u20136121.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"issue":"7","key":"e_1_3_1_28_2","doi-asserted-by":"crossref","first-page":"4854","DOI":"10.1109\/TCSVT.2021.3133859","article-title":"Msta-Net: Forgery detection by generating manipulation trace based on multi-scale self-texture attention","volume":"32","author":"Yang J.","year":"2021","unstructured":"J. Yang, S. Xiao, A. Li, et al. 2021. Msta-Net: Forgery detection by generating manipulation trace based on multi-scale self-texture attention. IEEE Transactions on Circuits and Systems for Video Technology 32, 7 (2021), 4854\u20134866.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_1_29_2","first-page":"3441549","article-title":"Deepfake video detection based on efficientnet-V2 Network","volume":"1","author":"Deng L.","year":"2022","unstructured":"L. Deng, H. Suo, and D. Li. 2022. Deepfake video detection based on efficientnet-V2 Network. Computational Intelligence and Neuroscience 1 (2022), 3441549.","journal-title":"Computational Intelligence and Neuroscience"},{"issue":"5","key":"e_1_3_1_30_2","first-page":"1671","article-title":"Cascaded-Hop for deepfake videos detection","volume":"16","author":"Zhang D.","year":"2022","unstructured":"D. Zhang, P. Wu, F. Li, et al. 2022. Cascaded-Hop for deepfake videos detection. KSII Transactions on Internet and Information Systems (TIIS) 16, 5 (2022), 1671\u20131686.","journal-title":"KSII Transactions on Internet and Information Systems (TIIS)"},{"key":"e_1_3_1_31_2","first-page":"24699","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Li Y.","year":"2023","unstructured":"Y. Li, Y. Li, X. Dai, et al. 2023. Physical-world optical adversarial attacks on 3d face recognition. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 24699\u201324708."},{"key":"e_1_3_1_32_2","first-page":"1","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Rossler A.","year":"2019","unstructured":"A. Rossler, D. Cozzolino, L. Verdoliva, et al. 2019. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 1\u201311."},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00327"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","unstructured":"B. Dolhansky R. Howes B. Pflaum et al. 2019. The deepfake detection challenge (Dfdc) preview dataset. DOI:10.48550\/arXiv.1910.08854","DOI":"10.48550\/arXiv.1910.08854"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2020.06.014"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11063-023-11249-6"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3623639"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.7717\/peerj-cs.1313"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","unstructured":"O. De Lima S. Franklin S. Basu et al. 2020. Deepfake detection using spatiotemporal convolutional networks. DOI:10.48550\/arXiv.2006.14749","DOI":"10.48550\/arXiv.2006.14749"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.3390\/app11167678"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10489-022-03867-9"}],"container-title":["ACM Transactions on Privacy and Security"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3777412","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,5]],"date-time":"2026-02-05T10:38:51Z","timestamp":1770287931000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3777412"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,5]]},"references-count":40,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,2,28]]}},"alternative-id":["10.1145\/3777412"],"URL":"https:\/\/doi.org\/10.1145\/3777412","relation":{},"ISSN":["2471-2566","2471-2574"],"issn-type":[{"value":"2471-2566","type":"print"},{"value":"2471-2574","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,5]]},"assertion":[{"value":"2024-03-03","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-05","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-02-05","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}