{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,11]],"date-time":"2026-02-11T21:02:33Z","timestamp":1770843753094,"version":"3.50.1"},"reference-count":99,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2025,7,1]],"date-time":"2025-07-01T00:00:00Z","timestamp":1751328000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"<jats:p>Recent advancements in deep learning have driven the rapid proliferation of deepfake generation techniques, raising substantial concerns over digital security and trustworthiness. Most current detection methods primarily focus on spatial or frequency domain features but show limited effectiveness when dealing with compressed videos and cross-dataset scenarios. Observing that mainstream generation methods use frame-by-frame synthesis without adequate temporal consistency constraints, we introduce the Spatiotemporal Attention 3D Network (STA-3D), a novel framework that combines a lightweight spatiotemporal attention module with a 3D convolutional architecture to improve detection robustness. The proposed attention module adopts a symmetric multi-branch architecture, where each branch follows a nearly identical processing pipeline to separately model temporal-channel, temporal-spatial, and intra-spatial correlations. Our framework additionally implements Spatial Pyramid Pooling (SPP) layers along the temporal axis, enabling adaptive modeling regardless of input video length. Furthermore, we mitigate the inherent asymmetry in the quantity of authentic and forged samples by replacing standard cross entropy with focal loss for training. 
This integration facilitates the simultaneous exploitation of inter-frame temporal discontinuities and intra-frame spatial artifacts, achieving competitive performance across various benchmark datasets under different compression conditions: for the intra-dataset setting on FF++, it improves the average accuracy by 1.09 percentage points compared to existing SOTA, with a more significant gain of 1.63 percentage points under the most challenging C40 compression level (particularly for NeuralTextures, achieving an improvement of 4.05 percentage points); while for the cross-dataset setting, AUC is enhanced by 0.24 percentage points on the DFDC-P dataset.<\/jats:p>","DOI":"10.3390\/sym17071037","type":"journal-article","created":{"date-parts":[[2025,7,2]],"date-time":"2025-07-02T04:11:04Z","timestamp":1751429464000},"page":"1037","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["STA-3D: Combining Spatiotemporal Attention and 3D Convolutional Networks for Robust Deepfake Detection"],"prefix":"10.3390","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-7872-8340","authenticated-orcid":false,"given":"Jingbo","family":"Wang","sequence":"first","affiliation":[{"name":"Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 410073, China"}]},{"given":"Jun","family":"Lei","sequence":"additional","affiliation":[{"name":"Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 410073, China"}]},{"given":"Shuohao","family":"Li","sequence":"additional","affiliation":[{"name":"Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 410073, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1804-9198","authenticated-orcid":false,"given":"Jun","family":"Zhang","sequence":"additional","affiliation":[{"name":"Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 
410073, China"}]}],"member":"1968","published-online":{"date-parts":[[2025,7,1]]},"reference":[{"key":"ref_1","unstructured":"Kingma, D.P., and Welling, M. (2013). Auto-encoding variational bayes. arXiv."},{"key":"ref_2","unstructured":"Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014, January 8\u201313). Generative adversarial nets. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Afchar, D., Nozick, V., Yamagishi, J., and Echizen, I. (2018, January 11\u201313). Mesonet: A compact facial video forgery detection network. Proceedings of the 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, China.","DOI":"10.1109\/WIFS.2018.8630761"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Kumar, P., Vatsa, M., and Singh, R. (2020, January 1\u20135). Detecting face2face facial reenactment in videos. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA.","DOI":"10.1109\/WACV45572.2020.9093628"},{"key":"ref_5","unstructured":"Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., and Nie\u00dfner, M. (2019, October 27\u2013November 2). Faceforensics++: Learning to detect manipulated facial images. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_6","unstructured":"Li, Y., and Lyu, S. (2018). Exposing deepfake videos by detecting face warping artifacts. arXiv."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Qian, Y., Yin, G., Sheng, L., Chen, Z., and Shao, J. (2020, January 13\u201328). Thinking in frequency: Face forgery detection by mining frequency-aware clues. 
Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58610-2_6"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Li, L., Bao, J., Zhang, T., Yang, H., Chen, D., Wen, F., and Guo, B. (2020, January 13\u201319). Face X-ray for more general face forgery detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00505"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Nguyen, D., Mejri, N., Singh, I.P., Kuleshova, P., Astrid, M., Kacem, A., Ghorbel, E., and Aouada, D. (2024, January 17\u201322). LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.01647"},{"key":"ref_10","unstructured":"Simonyan, K., and Zisserman, A. (2014, January 8\u201313). Two-stream convolutional networks for action recognition in videos. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"G\u00fcera, D., and Delp, E.J. (2018, January 27\u201330). Deepfake video detection using recurrent neural networks. Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand.","DOI":"10.1109\/AVSS.2018.8639163"},{"key":"ref_12","first-page":"80","article-title":"Recurrent convolutional strategies for face manipulation detection in videos","volume":"3","author":"Sabir","year":"2019","journal-title":"Interfaces (GUI)"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"74","DOI":"10.1016\/j.gltp.2022.04.017","article-title":"Deepfake detection in digital media forensics","volume":"3","author":"Vamsi","year":"2022","journal-title":"Glob. Transit. 
Proc."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Amerini, I., Galteri, L., Caldelli, R., and Del Bimbo, A. (2019, January 27\u201328). Deepfake video detection through optical flow based cnn. Proceedings of the IEEE\/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea.","DOI":"10.1109\/ICCVW.2019.00152"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Saikia, P., Dholaria, D., Yadav, P., Patel, V., and Roy, M. (2022, January 18\u201323). A hybrid CNN-LSTM model for video deepfake detection by leveraging optical flow features. Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy.","DOI":"10.1109\/IJCNN55064.2022.9892905"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Haliassos, A., Vougioukas, K., Petridis, S., and Pantic, M. (2021, January 20\u201325). Lips don\u2019t lie: A generalisable and robust approach to face forgery detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00500"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"1083","DOI":"10.1007\/s00371-023-02833-x","article-title":"Local attention and long-distance interaction of rPPG for deepfake detection","volume":"40","author":"Wu","year":"2024","journal-title":"Vis. Comput."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"21434","DOI":"10.1364\/OE.16.021434","article-title":"Remote plethysmographic imaging using ambient light","volume":"16","author":"Verkruysse","year":"2008","journal-title":"Opt. Express"},{"key":"ref_19","unstructured":"Dosovitskiy, A. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"4507","DOI":"10.1109\/TIFS.2024.3381823","article-title":"Where Deepfakes Gaze at? 
Spatial-Temporal Gaze Inconsistency Analysis for Video Face Forgery Detection","volume":"19","author":"Peng","year":"2024","journal-title":"IEEE Trans. Inf. Forensics Secur."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Zheng, Y., Bao, J., Chen, D., Zeng, M., and Wen, F. (2021, January 20\u201325). Exploring temporal coherence for more general video face forgery detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/ICCV48922.2021.01477"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Xu, Y., Liang, J., Jia, G., Yang, Z., Zhang, Y., and He, R. (2023, January 1\u20136). Tall: Thumbnail layout for deepfake video detection. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.02071"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"1335","DOI":"10.1109\/TIFS.2023.3239223","article-title":"ISTVT: Interpretable spatial-temporal video transformer for deepfake detection","volume":"18","author":"Zhao","year":"2023","journal-title":"IEEE Trans. Inf. Forensics Secur."},{"key":"ref_24","unstructured":"Tran, D., Ray, J., Shou, Z., Chang, S.F., and Paluri, M. (2017). Convnet architecture search for spatiotemporal feature learning. arXiv."},{"key":"ref_25","unstructured":"Turrisi, R., Verri, A., and Barla, A. (2023). The effect of data augmentation and 3D-CNN depth on Alzheimer\u2019s Disease detection. arXiv."},{"key":"ref_26","unstructured":"De Lima, O., Franklin, S., Basu, S., Karwoski, B., and George, A. (2020). Deepfake detection using spatiotemporal convolutional networks. arXiv."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"4990","DOI":"10.1002\/int.22499","article-title":"A lightweight 3D convolutional neural network for deepfake detection","volume":"36","author":"Liu","year":"2021","journal-title":"Int. J. Intell. 
Syst."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Lu, C., Liu, B., Zhou, W., Chu, Q., and Yu, N. (2021, January 19\u201322). Deepfake video detection using 3D-attentional inception convolutional neural network. Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA.","DOI":"10.1109\/ICIP42928.2021.9506381"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Ma, Z., Mei, X., and Shen, J. (2023, January 17\u201319). 3D Attention Network for Face Forgery Detection. Proceedings of the 2023 4th Information Communication Technologies Conference (ICTC), Nanjing, China.","DOI":"10.1109\/ICTC57116.2023.10154671"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Woo, S., Park, J., Lee, J.Y., and Kweon, I.S. (2018, January 8\u201314). Cbam: Convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Wang, Z., Bao, J., Zhou, W., Wang, W., and Li, H. (2023, January 17\u201324). Altfreezing for more general video face forgery detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.00402"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Xie, S., Sun, C., Huang, J., Tu, Z., and Murphy, K. (2018, January 8\u201314). Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01267-0_19"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Misra, D., Nalamada, T., Arasanipalai, A.U., and Hou, Q. (2021, January 3\u20138). Rotate to attend: Convolutional triplet attention module. 
Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV48630.2021.00318"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Goyal, P., Girshick, R., He, K., and Doll\u00e1r, P. (2017, January 22\u201329). Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.324"},{"key":"ref_35","unstructured":"MarekKowalski (2025, January 15). 3D Face Swapping Implemented in Python. Available online: https:\/\/github.com\/MarekKowalski\/FaceSwap."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Korshunova, I., Shi, W., Dambre, J., and Theis, L. (2017, January 22\u201329). Fast face-swap using convolutional neural networks. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.397"},{"key":"ref_37","unstructured":"(2025, January 15). DeepFakes. Available online: https:\/\/www.theverge.com\/2018\/2\/7\/16982046\/reddit-deepfakes-ai-celebrity-face-swap-porn-community-ban."},{"key":"ref_38","unstructured":"Shaoanlu (2025, January 15). Faceswap-GAN. Available online: https:\/\/github.com\/shaoanlu\/faceswap-GAN."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Zhu, J.Y., Park, T., Isola, P., and Efros, A.A. (2017, January 22\u201329). Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.244"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Bao, J., Chen, D., Wen, F., Li, H., and Hua, G. (2018, January 18\u201323). Towards open-set identity preserving face synthesis. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00702"},{"key":"ref_41","unstructured":"Li, L., Bao, J., Yang, H., Chen, D., and Wen, F. (2019). Faceshifter: Towards high fidelity and occlusion aware face swapping. arXiv."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Zhu, Y., Li, Q., Wang, J., Xu, C.Z., and Sun, Z. (2021, January 20\u201325). One shot face swapping on megapixels. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00480"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. (2020, January 13\u201319). Analyzing and improving the image quality of stylegan. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00813"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Kim, J., Lee, J., and Zhang, B.T. (2022, January 18\u201324). Smooth-swap: A simple enhancement for face-swapping with smoothness. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01051"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Rosberg, F., Aksoy, E.E., Alonso-Fernandez, F., and Englund, C. (2023, January 2\u20137). Facedancer: Pose-and occlusion-aware high fidelity face swapping. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV56688.2023.00345"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C., and Nie\u00dfner, M. (2016, January 27\u201330). Face2face: Real-time face capture and reenactment of rgb videos. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.262"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3306346.3323035","article-title":"Deferred neural rendering: Image synthesis using neural textures","volume":"38","author":"Thies","year":"2019","journal-title":"Acm Trans. Graph. (TOG)"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Tripathy, S., Kannala, J., and Rahtu, E. (2020, January 1\u20135). Icface: Interpretable and controllable face reenactment using gans. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA.","DOI":"10.1109\/WACV45572.2020.9093474"},{"key":"ref_49","unstructured":"Wang, Y., Yang, D., Bremond, F., and Dantcheva, A. (2022). Latent image animator: Learning to animate images via latent space navigation. arXiv."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Pang, Y., Zhang, Y., Quan, W., Fan, Y., Cun, X., Shan, Y., and Yan, D.M. (2023, January 17\u201324). Dpe: Disentanglement of pose and expression for general video portrait editing. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.00049"},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"331","DOI":"10.1007\/s41095-022-0271-y","article-title":"Attention mechanisms in computer vision: A survey","volume":"8","author":"Guo","year":"2022","journal-title":"Comput. Vis. Media"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Hu, J., Shen, L., and Sun, G. (2018, January 18\u201323). Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Wang, X., Girshick, R., Gupta, A., and He, K. (2018, January 18\u201323). 
Non-local neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00813"},{"key":"ref_54","unstructured":"Park, J., Woo, S., Lee, J.Y., and Kweon, I.S. (2018). Bam: Bottleneck attention module. arXiv."},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Hou, Q., Zhang, L., Cheng, M.M., and Feng, J. (2020, January 13\u201319). Strip pooling: Rethinking spatial pooling for scene parsing. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00406"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Cao, Y., Xu, J., Lin, S., Wei, F., and Hu, H. (2019, January 27\u201328). Gcnet: Non-local networks meet squeeze-excitation networks and beyond. Proceedings of the IEEE\/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea.","DOI":"10.1109\/ICCVW.2019.00246"},{"key":"ref_57","unstructured":"Shen, Z., Zhang, M., Zhao, H., Yi, S., and Li, H. (2021, January 3\u20138). Efficient attention: Attention with linear complexities. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA."},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Ouyang, D., He, S., Zhang, G., Luo, M., Guo, H., Zhan, J., and Huang, Z. (2023, January 4\u201310). Efficient multi-scale attention module with cross-spatial learning. Proceedings of the ICASSP 2023\u20142023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece.","DOI":"10.1109\/ICASSP49357.2023.10096516"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Hou, Q., Zhou, D., and Feng, J. (2021, January 20\u201325). Coordinate attention for efficient mobile network design. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01350"},{"key":"ref_60","doi-asserted-by":"crossref","first-page":"129866","DOI":"10.1016\/j.neucom.2025.129866","article-title":"SCSA: Exploring the synergistic effects between spatial and channel attention","volume":"634","author":"Si","year":"2025","journal-title":"Neurocomputing"},{"key":"ref_61","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1109\/TPAMI.2012.59","article-title":"3D convolutional neural networks for human action recognition","volume":"35","author":"Ji","year":"2012","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 7\u201313). Learning spatiotemporal features with 3d convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_63","doi-asserted-by":"crossref","first-page":"1510","DOI":"10.1109\/TPAMI.2017.2712608","article-title":"Long-term temporal convolutions for action recognition","volume":"40","author":"Varol","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_64","doi-asserted-by":"crossref","unstructured":"Qiu, Z., Yao, T., and Mei, T. (2017, January 22\u201329). Learning spatio-temporal representation with pseudo-3d residual networks. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.590"},{"key":"ref_65","doi-asserted-by":"crossref","unstructured":"Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., and Paluri, M. (2018, January 18\u201323). A closer look at spatiotemporal convolutions for action recognition. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00675"},{"key":"ref_66","doi-asserted-by":"crossref","unstructured":"Carreira, J., and Zisserman, A. (2017, January 21\u201326). Quo vadis, action recognition? A new model and the kinetics dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.502"},{"key":"ref_67","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015, January 7\u201312). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_68","unstructured":"Zhang, S., Guo, S., Huang, W., Scott, M.R., and Wang, L. (2020). V4d: 4d convolutional neural networks for video-level representation learning. arXiv."},{"key":"ref_69","doi-asserted-by":"crossref","unstructured":"Wu, W., Zhao, Y., Xu, Y., Tan, X., He, D., Zou, Z., Ye, J., Li, Y., Yao, M., and Dong, Z. (2021, January 20\u201324). Dsanet: Dynamic segment aggregation network for video-level representation learning. Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China.","DOI":"10.1145\/3474085.3475344"},{"key":"ref_70","doi-asserted-by":"crossref","first-page":"3345","DOI":"10.1007\/s13042-024-02454-3","article-title":"MCANet: A lightweight action recognition network with multidimensional convolution and attention","volume":"16","author":"Tian","year":"2025","journal-title":"Int. J. Mach. Learn. Cybern."},{"key":"ref_71","doi-asserted-by":"crossref","unstructured":"Hara, K., Kataoka, H., and Satoh, Y. (2018, January 18\u201323). Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00685"},{"key":"ref_72","unstructured":"Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., and Natsev, P. (2017). The kinetics human action video dataset. arXiv."},{"key":"ref_73","doi-asserted-by":"crossref","first-page":"1904","DOI":"10.1109\/TPAMI.2015.2389824","article-title":"Spatial pyramid pooling in deep convolutional networks for visual recognition","volume":"37","author":"He","year":"2015","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_74","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016, January 27\u201330). Rethinking the inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.308"},{"key":"ref_75","doi-asserted-by":"crossref","unstructured":"Ridnik, T., Ben-Baruch, E., Zamir, N., Noy, A., Friedman, I., Protter, M., and Zelnik-Manor, L. (2021, January 10\u201317). Asymmetric loss for multi-label classification. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00015"},{"key":"ref_76","unstructured":"Dolhansky, B., Howes, R., Pflaum, B., Baram, N., and Ferrer, C.C. (2019). The deepfake detection challenge (dfdc) preview dataset. arXiv."},{"key":"ref_77","doi-asserted-by":"crossref","unstructured":"Li, Y., Yang, X., Sun, P., Qi, H., and Lyu, S. (2020, January 13\u201319). Celeb-df: A large-scale challenging dataset for deepfake forensics. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00327"},{"key":"ref_78","doi-asserted-by":"crossref","first-page":"861","DOI":"10.1016\/j.patrec.2005.10.010","article-title":"An introduction to ROC analysis","volume":"27","author":"Fawcett","year":"2006","journal-title":"Pattern Recognit. Lett."},{"key":"ref_79","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2015, January 7\u201313). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.123"},{"key":"ref_80","unstructured":"Kingma, D.P. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_81","doi-asserted-by":"crossref","first-page":"868","DOI":"10.1109\/TIFS.2012.2190402","article-title":"Rich models for steganalysis of digital images","volume":"7","author":"Fridrich","year":"2012","journal-title":"IEEE Trans. Inf. Forensics Secur."},{"key":"ref_82","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_83","doi-asserted-by":"crossref","unstructured":"Chollet, F. (2017, January 21\u201326). Xception: Deep learning with depthwise separable convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.195"},{"key":"ref_84","doi-asserted-by":"crossref","unstructured":"Masi, I., Killekar, A., Mascarenhas, R.M., Gurudatt, S.P., and AbdAlmageed, W. (2020, January 23\u201328). Two-branch recurrent network for isolating deepfakes in videos. 
Proceedings of the Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK. Proceedings, Part VII 16.","DOI":"10.1007\/978-3-030-58571-6_39"},{"key":"ref_85","doi-asserted-by":"crossref","first-page":"18461","DOI":"10.1007\/s11042-020-10420-8","article-title":"Detecting deepfake, faceswap and face2face facial forgeries using frequency cnn","volume":"80","author":"Kohli","year":"2021","journal-title":"Multimed. Tools Appl."},{"key":"ref_86","doi-asserted-by":"crossref","unstructured":"Liu, H., Li, X., Zhou, W., Chen, Y., He, Y., Xue, H., Zhang, W., and Yu, N. (2021, January 20\u201325). Spatial-phase shallow learning: Rethinking face forgery detection in frequency domain. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00083"},{"key":"ref_87","unstructured":"Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., and Paluri, M. (2014). C3D: Generic Features for Video Analysis. arXiv."},{"key":"ref_88","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019, October 27\u2013November 2). Slowfast networks for video recognition. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00630"},{"key":"ref_89","doi-asserted-by":"crossref","unstructured":"Liu, Z., Luo, D., Wang, Y., Wang, L., Tai, Y., Wang, C., Li, J., Huang, F., and Lu, T. (2020, January 7\u201312). Teinet: Towards an efficient architecture for video recognition. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA.","DOI":"10.1609\/aaai.v34i07.6836"},{"key":"ref_90","doi-asserted-by":"crossref","unstructured":"Dang, H., Liu, F., Stehouwer, J., Liu, X., and Jain, A.K. (2020, January 13\u201319). On the detection of digital face manipulation. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00582"},{"key":"ref_91","doi-asserted-by":"crossref","first-page":"3663","DOI":"10.1109\/TCSVT.2023.3239607","article-title":"MRE-Net: Multi-rate excitation network for deepfake video detection","volume":"33","author":"Pang","year":"2023","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_92","first-page":"2579","article-title":"Visualizing data using t-SNE","volume":"9","author":"Hinton","year":"2008","journal-title":"J. Mach. Learn. Res."},{"key":"ref_93","doi-asserted-by":"crossref","unstructured":"Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017, January 22\u201329). Grad-cam: Visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.74"},{"key":"ref_94","doi-asserted-by":"crossref","unstructured":"Gu, Z., Yao, T., Chen, Y., Yi, R., Ding, S., and Ma, L. (2022, January 23\u201329). Region-Aware Temporal Inconsistency Learning for DeepFake Video Detection. Proceedings of the IJCAI, Vienna, Austria.","DOI":"10.24963\/ijcai.2022\/129"},{"key":"ref_95","doi-asserted-by":"crossref","unstructured":"Yang, X., Li, Y., and Lyu, S. (2019, January 12\u201317). Exposing deep fakes using inconsistent head poses. Proceedings of the ICASSP 2019\u20142019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8683164"},{"key":"ref_96","doi-asserted-by":"crossref","unstructured":"Matern, F., Riess, C., and Stamminger, M. (2019, January 7\u201311). Exploiting visual artifacts to expose deepfakes and face manipulations. 
Proceedings of the 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), Waikoloa Village, HI, USA.","DOI":"10.1109\/WACVW.2019.00020"},{"key":"ref_97","doi-asserted-by":"crossref","unstructured":"Nguyen, H.H., Yamagishi, J., and Echizen, I. (2019, January 12\u201317). Capsule-forensics: Using capsule networks to detect forged images and videos. Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8682602"},{"key":"ref_98","doi-asserted-by":"crossref","unstructured":"Nguyen, H.H., Fang, F., Yamagishi, J., and Echizen, I. (2019, January 23\u201326). Multi-task learning for detecting and segmenting manipulated facial images and videos. Proceedings of the 2019 IEEE 10th International Conference on Biometrics Theory, Applications and Systems (BTAS), Tampa, FL, USA.","DOI":"10.1109\/BTAS46853.2019.9185974"},{"key":"ref_99","doi-asserted-by":"crossref","unstructured":"Sun, Z., Han, Y., Hua, Z., Ruan, N., and Jia, W. (2021, January 20\u201325). Improving the efficiency and robustness of deepfakes detection through precise geometric features. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00361"}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/17\/7\/1037\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:02:43Z","timestamp":1760032963000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/17\/7\/1037"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,1]]},"references-count":99,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2025,7]]}},"alternative-id":["sym17071037"],"URL":"https:\/\/doi.org\/10.3390\/sym17071037","relation":{},"ISSN":["2073-8994"],"issn-type":[{"value":"2073-8994","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,1]]}}}