{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,12]],"date-time":"2026-01-12T17:50:36Z","timestamp":1768240236347,"version":"3.49.0"},"reference-count":76,"publisher":"Association for Computing Machinery (ACM)","issue":"1","funder":[{"name":"Australian Research Council, Australia","award":["DP230100246, LP220200808, and DP250100463"],"award-info":[{"award-number":["DP230100246, LP220200808, and DP250100463"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2026,1,31]]},"abstract":"<jats:p>Deepfake techniques can now generate multimodal content comprising video and audio tracks. Compared with unimodal Deepfake images, videos or audio, multimodal Deepfake content is more deceptive and easily leads to the dissemination of hate speech, incitement to violence, and disinformation. Therefore, the detection of multimodal Deepfake has attracted much research attention recently. While cross-attention shows the promising capacity for modelling the complicated dependencies between audio and video in multimodal Deepfake detection, it fails to learn accurate cross-modal patterns if audio and video are misaligned in the temporal dimension. Besides, most current multimodal Deepfake detectors only provide a binary classification label, lacking fine-grained localization to identify significant forgery in multiple dimensions (e.g., modal, time, and spatial dimension). In this study, we propose a novel multimodal Deepfake detection framework named ForgeFinder, which goes beyond binary label prediction and achieves multi-grained forgery localization in modal and spatiotemporal dimensions. ForgeFinder incorporates both intra-modal and cross-modal inconsistencies to classify multimodal input. 
In detail, we adopt Serial Spatiotemporal Self-Attention (SSTSA) in the Intra-Modal Inconsistency Explorer (Intra-MIE), which allows the temporal self-attention to run in the original dimension without incurring unacceptable computational complexity. In the Cross-Modal Inconsistency Explorer (Cross-MIE), we propose the Offset-Shifted Cross-Attention (OSCA), which introduces a time offset term into conventional cross-attention to mitigate the inaccurate modelling of cross-modal dependencies caused by temporal misalignment. By adopting the outputs of Intra-MIE for unimodal tasks, we identify the likelihood of each modality being manipulated and localize the tampered modalities. At the same time, the attention weights of SSTSA can be visualized to pinpoint the temporal and spatial distribution of Deepfake manipulation. Therefore, for a single audio\u2013video input sample, ForgeFinder not only determines the authenticity of the overall input but also localizes the modality, temporal sequence, and spatial coordinates of significant forgery, contributing to more comprehensive forensic analysis. The results of extensive experiments indicate that ForgeFinder achieves state-of-the-art detection performance as well as accurate forgery localization in modal and spatiotemporal dimensions. 
Furthermore, experiments on content generated by Diffusion Models (DMs) show that our model also effectively recognizes DM-generated content.<\/jats:p>","DOI":"10.1145\/3778030","type":"journal-article","created":{"date-parts":[[2025,11,24]],"date-time":"2025-11-24T15:37:04Z","timestamp":1763998624000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["ForgeFinder: Perceptive Multimodal Deepfake Detection via Multi-grained Forgery Localization"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-4276-5522","authenticated-orcid":false,"given":"Baoping","family":"Liu","sequence":"first","affiliation":[{"name":"School of Computer Science, University of Technology Sydney, Ultimo, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3603-6617","authenticated-orcid":false,"given":"Bo","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Computer Science, Australian Artificial Intelligence Institute (AAII), University of Technology Sydney, Ultimo, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3690-0321","authenticated-orcid":false,"given":"Ming","family":"Ding","sequence":"additional","affiliation":[{"name":"Data61, CSIRO, Everleigh, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0702-7102","authenticated-orcid":false,"given":"Tianqing","family":"Zhu","sequence":"additional","affiliation":[{"name":"Faculty of Data Science, City University of Macau, Taipa, China"}]}],"member":"320","published-online":{"date-parts":[[2026,1,12]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/WIFS.2018.8630761"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW50498.2020.00338"},{"key":"e_1_3_1_4_2","unstructured":"Rosana Ardila Megan Branson Kelly Davis Michael Henretty Michael Kohler Josh Meyer Reuben Morais Lindsay Saunders Francis M. Tyers and Gregor Weber. 2019. 
Common voice: A massively-multilingual speech corpus. arXiv:1912.06670. Retrieved from https:\/\/arxiv.org\/abs\/1912.06670"},{"key":"e_1_3_1_5_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3612928","article-title":"Head pose estimation patterns as deepfake detectors","volume":"20","author":"Becattini Federico","year":"2023","unstructured":"Federico Becattini, Carmen Bisogni, Vincenzo Loia, Chiara Pero, and Fei Hao. 2023.Head pose estimation patterns as deepfake detectors. ACM Transactions on Multimedia Computing, Communications and Applications 20 (2023), 1\u201324.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/DICTA56598.2022.10034605"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV.2018.00097"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2022.3209336"},{"key":"e_1_3_1_9_2","doi-asserted-by":"crossref","first-page":"2003","DOI":"10.1145\/3394171.3413630","volume-title":"Proceedings of the 28th ACM International Conference on Multimedia","author":"Chen Renwang","year":"2020","unstructured":"Renwang Chen, Xuanhong Chen, Bingbing Ni, and Yanhao Ge. 2020. SimSwap: An efficient framework for high fidelity face swapping. In Proceedings of the 28th ACM International Conference on Multimedia, 2003\u20132011."},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00135"},{"key":"e_1_3_1_11_2","unstructured":"Harry Cheng Yangyang Guo Tianyi Wang Qi Li Xiaojun Chang and Liqiang Nie. 2022. Voice-face homogeneity tells deepfake. arXiv:2203.02195. Retrieved from https:\/\/arxiv.org\/abs\/2203.02195"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413700"},{"key":"e_1_3_1_13_2","unstructured":"Junyoung Chung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio. 2014. 
Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555. Retrieved from https:\/\/arxiv.org\/abs\/1412.3555"},{"key":"e_1_3_1_14_2","unstructured":"Jean-Baptiste Cordonnier Andreas Loukas and Martin Jaggi. 2019. On the relationship between self-attention and convolutional layers. arXiv:1911.03584. Retrieved from https:\/\/arxiv.org\/abs\/1911.03584"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW59228.2023.00101"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/FG52635.2021.9667026"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_1_18_2","unstructured":"Brian Dolhansky Russ Howes Ben Pflaum Nicole Baram and Cristian Canton Ferrer. 2019. The deepfake detection challenge (DFDC) preview dataset. arXiv:1910.08854. Retrieved from https:\/\/arxiv.org\/abs\/1910.08854"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00925"},{"key":"e_1_3_1_20_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3536426"},{"key":"e_1_3_1_22_2","first-page":"36 744","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Gu Zhihao","year":"2022","unstructured":"Zhihao Gu, Yang Chen, Taiping Yao, Shouhong Ding, Jilin Li, and Lizhuang Ma. 2022. Delving into the local: Dynamic inconsistency learning for deepfake video detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, 744\u2013752."},{"key":"e_1_3_1_23_2","unstructured":"Yuwei Guo Ceyuan Yang Anyi Rao Yaohui Wang Yu Qiao Dahua Lin and Bo Dai. 2023. 
AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv:2307.04725. Retrieved from https:\/\/arxiv.org\/abs\/2307.04725"},{"key":"e_1_3_1_24_2","unstructured":"Wu Haiwei Zhou Jiantao Zhang Shile and Tian Jinyu. 2022. Exploring spatial-temporal features for deepfake detection and localization. arXiv:2210.15872. Retrieved from https:\/\/arxiv.org\/abs\/2210.15872"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00500"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2022.3231480"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00434"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00745"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2022.3141262"},{"key":"e_1_3_1_30_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3592615","article-title":"Data augmentation-based novel deep learning method for deepfaked images detection","volume":"20","author":"Iqbal Farkhund","year":"2023","unstructured":"Farkhund Iqbal, Ahmed Abbasi, Abdul Rehman Javed, Ahmad Almadhor, Zunera Jalil, Sajid Anwar, and Imad Rida. 2023. Data augmentation-based novel deep learning method for deepfaked images detection. ACM Transactions on Multimedia Computing, Communications and Applications 20 (2023), 1\u201315.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications"},{"key":"e_1_3_1_31_2","unstructured":"Jee-Weon Jung Hee-Soo Heo Hemlata Tak Hye-Jin Shim Joon Son Chung Bong-Jin Lee Ha-Jin Yu and Nicholas Evans. 2021. AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks. arXiv:2110.01200. 
Retrieved from https:\/\/arxiv.org\/abs\/2110.01200"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3643030"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2023-1537"},{"key":"e_1_3_1_34_2","unstructured":"Hasam Khalid Shahroz Tariq and Simon S. Woo. 2021. FakeAVCeleb: A novel audio-video multimodal deepfake dataset. arXiv:2108.05080. Retrieved from https:\/\/arxiv.org\/abs\/2108.05080"},{"key":"e_1_3_1_35_2","first-page":"1564","article-title":"Bilinear attention networks","volume":"31","author":"Kim Jin-Hwa","year":"2018","unstructured":"Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear attention networks. In Advances in Neural Information Processing Systems, Vol. 31, 1564\u20131574.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_36_2","unstructured":"Pavel Korshunov and S\u00e9bastien Marcel. 2018. Deepfakes: A new threat to face recognition? Assessment and detection. arXiv:1812.08685. Retrieved from https:\/\/arxiv.org\/abs\/1812.08685"},{"key":"e_1_3_1_37_2","first-page":"2673","volume-title":"Proceedings of the 31st USENIX Security Symposium (USENIX Security \u201922)","author":"Li Changjiang","year":"2022","unstructured":"Changjiang Li, Li Wang, Shouling Ji, Xuhong Zhang, Zhaohan Xi, Shanqing Guo, and Ting Wang. 2022. Seeing is living? Rethinking the security of facial liveness verification in the deepfake era. In Proceedings of the 31st USENIX Security Symposium (USENIX Security \u201922), 2673\u20132690."},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3613842"},{"key":"e_1_3_1_39_2","volume-title":"Proceedings of the Asian Conference on Computer Vision","author":"Lin Yan-Bo","year":"2020","unstructured":"Yan-Bo Lin and Yu-Chiang Frank Wang. 2020. Audiovisual transformer with instance attention for audio-visual event localization. 
In Proceedings of the Asian Conference on Computer Vision."},{"key":"e_1_3_1_40_2","first-page":"4691","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"Liu Baoping","year":"2023","unstructured":"Baoping Liu, Bo Liu, Ming Ding, Tianqing Zhu, and Xin Yu. 2023. TI2Net: Temporal identity inconsistency network for deepfake detection. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, 4691\u20134700."},{"key":"e_1_3_1_41_2","unstructured":"Haohe Liu Zehua Chen Yi Yuan Xinhao Mei Xubo Liu Danilo Mandic Wenwu Wang and Mark D. Plumbley. 2023. AudioLDM: Text-to-audio generation with latent diffusion models. arXiv:2301.12503. Retrieved from https:\/\/arxiv.org\/abs\/2301.12503"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2023.3285283"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3558004"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00374"},{"key":"e_1_3_1_45_2","first-page":"10209","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Luo Zhengxiong","year":"2023","unstructured":"Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. 2023. VideoFusion: Decomposed diffusion models for High-Quality video generation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 10209\u201310218."},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV51458.2022.00283"},{"key":"e_1_3_1_47_2","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/978-3-030-29513-4_15","volume-title":"Proceedings of the Intelligent Systems and Applications: Proceedings of the 2019 Intelligent Systems Conference (IntelliSys)","volume":"2","author":"Meng Hsien-Yu","year":"2020","unstructured":"Hsien-Yu Meng and Jiangtao Wen. 2020. 
LSTM-based facial performance capture using embedding between expressions. In Proceedings of the Intelligent Systems and Applications: Proceedings of the 2019 Intelligent Systems Conference (IntelliSys), Vol. 2. Springer, 211\u2013226."},{"key":"e_1_3_1_48_2","doi-asserted-by":"crossref","first-page":"2823","DOI":"10.1145\/3394171.3413570","volume-title":"Proceedings of the 28th ACM International Conference on Multimedia","author":"Mittal Trisha","year":"2020","unstructured":"Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, and Dinesh Manocha. 2020. Emotions don\u2019t lie: An audio-visual deepfake detection method using affective cues. In Proceedings of the 28th ACM International Conference on Multimedia, 2823\u20132832."},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-74048-3_4"},{"key":"e_1_3_1_50_2","doi-asserted-by":"crossref","unstructured":"Arsha Nagrani Joon Son Chung and Andrew Zisserman. 2017. VoxCeleb: A large-scale speaker identification dataset. arXiv:1706.08612. Retrieved from https:\/\/arxiv.org\/abs\/1706.08612","DOI":"10.21437\/Interspeech.2017-950"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00728"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00521-023-09196-3"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02559"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"e_1_3_1_55_2","unstructured":"Ivan Perov Daiheng Gao Nikolay Chervoniy Kunlin Liu Sugasa Marangonda Chris Um\u00e9 Carl Shift Facenheim R. P. Luis Jian Jiang Sheng Zhang et al. 2020. DeepFaceLab: Integrated flexible and extensible face-swapping framework. arXiv:2005.05535. 
Retrieved from https:\/\/arxiv.org\/abs\/2005.05535"},{"key":"e_1_3_1_56_2","first-page":"1","volume-title":"Proceedings of the 2022 IEEE International Workshop on Information Forensics and Security (WIFS)","author":"Pianese Alessandro","year":"2022","unstructured":"Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. 2022. Deepfake audio detection by speaker verification. In Proceedings of the 2022 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 1\u20136."},{"key":"e_1_3_1_57_2","first-page":"993","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Raza Muhammad Anas","year":"2023","unstructured":"Muhammad Anas Raza and Khalid Mahmood Malik. 2023. Multimodaltrace: Deepfake detection using audiovisual representation learning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 993\u20131000."},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00009"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00985"},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.74"},{"key":"e_1_3_1_61_2","unstructured":"Jiaming Song Chenlin Meng and Stefano Ermon. 2020. Denoising diffusion implicit models. arXiv:2010.02502. Retrieved from https:\/\/arxiv.org\/abs\/2010.02502"},{"key":"e_1_3_1_62_2","first-page":"6369","volume-title":"Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Tak Hemlata","year":"2021","unstructured":"Hemlata Tak, Jose Patino, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans, and Anthony Larcher. 2021. End-to-end anti-spoofing with RawNet2. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 
IEEE, 6369\u20136373."},{"key":"e_1_3_1_63_2","first-page":"5998","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30, 5998\u20136008.","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"3","key":"e_1_3_1_64_2","doi-asserted-by":"crossref","first-page":"1466","DOI":"10.1109\/TAFFC.2020.3007531","article-title":"Phase space reconstruction driven spatio-temporal feature learning for dynamic facial expression recognition","volume":"13","author":"Wang Shanmin","year":"2020","unstructured":"Shanmin Wang, Hui Shuai, and Qingshan Liu. 2020. Phase space reconstruction driven spatio-temporal feature learning for dynamic facial expression recognition. IEEE Transactions on Affective Computing 13, 3 (2020), 1466\u20131476.","journal-title":"IEEE Transactions on Affective Computing"},{"key":"e_1_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2024.02.019"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1145\/3588574"},{"key":"e_1_3_1_67_2","unstructured":"Xiang Wang Hangjie Yuan Shiwei Zhang Dayou Chen Jiuniu Wang Yingya Zhang Yujun Shen Deli Zhao and Jingren Zhou. 2023. VideoComposer: Compositional video synthesis with motion controllability. arXiv:2306.02018. Retrieved from https:\/\/arxiv.org\/abs\/2306.02018"},{"key":"e_1_3_1_68_2","first-page":"1161","article-title":"Spatio-temporal self-attention network for video saliency prediction","author":"Wang Ziqiang","year":"2021","unstructured":"Ziqiang Wang, Zhi Liu, Gongyang Li, Yang Wang, Tianhong Zhang, Lihua Xu, and Jijun Wang. 2021. Spatio-temporal self-attention network for video saliency prediction. 
IEEE Transactions on Multimedia 25 (2021), 1161\u20131174.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_1_69_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"e_1_3_1_70_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2023.3262148"},{"key":"e_1_3_1_71_2","unstructured":"YouTube. 2018. Deepfake Video of Barack Obama. Retrieved from https:\/\/www.youtube.com\/watch?v=AmUC4m6w1wo"},{"key":"e_1_3_1_72_2","unstructured":"YouTube. 2018. Deepfake Video of Donald Trump. Retrieved from https:\/\/www.youtube.com\/watch?v=Ws5O9WASoHg"},{"key":"e_1_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.1145\/3625100"},{"key":"e_1_3_1_74_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2023.3239223"},{"key":"e_1_3_1_75_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00222"},{"key":"e_1_3_1_76_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00521-023-08271-z"},{"key":"e_1_3_1_77_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01453"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and 
Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3778030","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,12]],"date-time":"2026-01-12T14:29:31Z","timestamp":1768228171000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3778030"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,12]]},"references-count":76,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,1,31]]}},"alternative-id":["10.1145\/3778030"],"URL":"https:\/\/doi.org\/10.1145\/3778030","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,12]]},"assertion":[{"value":"2024-04-09","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-15","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-01-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}