{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,26]],"date-time":"2026-03-26T20:30:49Z","timestamp":1774557049644,"version":"3.50.1"},"reference-count":45,"publisher":"Association for Computing Machinery (ACM)","issue":"2","funder":[{"name":"Science and Technology Program of Xuzhou","award":["KC25108"],"award-info":[{"award-number":["KC25108"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62472424, 62172417, 62402306, and T2525004"],"award-info":[{"award-number":["62472424, 62172417, 62402306, and T2525004"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Opening Fund of the State Key Laboratory of Industrial Control Technology, China","award":["ICT2024B72"],"award-info":[{"award-number":["ICT2024B72"]}]},{"name":"Natural Science Foundation of Shanghai","award":["24ZR1422400"],"award-info":[{"award-number":["24ZR1422400"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2026,2,28]]},"abstract":"<jats:p>Multi-modal deception detection is a challenging yet important task, having pivotal applications in many fields such as business credibility assessment and multimedia anti-frauds. Previous methods either rely solely on spatial features or overemphasize only temporal information within or across modalities, which may overlook potential critical clues. Motivated by these observations, we propose a Spatio-Temporal Representation Disentanglement (STRD) framework for multi-modal deception detection, which uses a dual-encoder structure to learn spatial and temporal representations for each modality. Specifically, we introduce a pre-trained foundation model to act as the spatial encoder and design a lightweight network as the temporal encoder, extracting spatial semantics and capturing dynamic temporal patterns. Then, we propose a Constrained Self-Attention Block (CSAB), in which self-attention distribution of each head is regarded as spatial distribution and is constrained to attend a certain facial local region. Furthermore, we present a Cross-Modal Correlation Fusion Block (CCFB) to achieve temporal synchronization across modalities by measuring the correlations between visual and audio features. Extensive experiments show that our STRD outperforms the state-of-the-art methods on challenging DOLOS, BOL, BgOL, and RLtrial benchmarks. Particularly, STRD improves by 2.12% and 1.88% over the previous best results in terms of ACC on the DOLOS and BOL datasets, respectively. Additionally, STRD outperforms previous methods in cross-dataset testing, highlighting its superior generalization ability.<\/jats:p>","DOI":"10.1145\/3783994","type":"journal-article","created":{"date-parts":[[2025,12,12]],"date-time":"2025-12-12T02:59:55Z","timestamp":1765508395000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Spatio-Temporal Disentanglement and Constrained Self-Attention for Multi-Modal Deception Detection"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9383-8384","authenticated-orcid":false,"given":"Zhiwen","family":"Shao","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China, and Mine Digitization Engineering Research Center of the Ministry of Education, Xuzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-0945-7528","authenticated-orcid":false,"given":"Hang","family":"Yang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China, and Mine Digitization Engineering Research Center of the Ministry of Education, Xuzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5418-9879","authenticated-orcid":false,"given":"Hancheng","family":"Zhu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China, and Mine Digitization Engineering Research Center of the Ministry of Education, Xuzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2734-915X","authenticated-orcid":false,"given":"Rui","family":"Yao","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China, and Mine Digitization Engineering Research Center of the Ministry of Education, Xuzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6755-871X","authenticated-orcid":false,"given":"Lixin","family":"Zou","sequence":"additional","affiliation":[{"name":"Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-7281-2812","authenticated-orcid":false,"given":"Mengtian","family":"Li","sequence":"additional","affiliation":[{"name":"Department of Film and Television Engineering, Shanghai University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8678-2784","authenticated-orcid":false,"given":"Bin","family":"Sheng","sequence":"additional","affiliation":[{"name":"School of Computer Science, Shanghai Jiao Tong University, Shanghai, China"}]}],"member":"320","published-online":{"date-parts":[[2026,2,9]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Hammad Ud Din Ahmed Usama Ijaz Bajwa Fan Zhang and Muhammad Waqas Anwar. 2021. Deception detection in videos using the facial action coding system. arXiv:2105.13659. Retrieved from https:\/\/arxiv.org\/abs\/2105.13659"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1142\/S0129065720500689"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3349801.3349806"},{"key":"e_1_3_1_5_2","first-page":"12449","article-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems, 12449\u201312460.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_6_2","unstructured":"Shaojie Bai J. Zico Kolter and Vladlen Koltun. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271. Retrieved from https:\/\/arxiv.org\/abs\/1803.01271"},{"key":"e_1_3_1_7_2","first-page":"16664","article-title":"Adaptformer: Adapting vision transformers for scalable visual recognition","author":"Chen Shoufa","year":"2022","unstructured":"Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. 2022. Adaptformer: Adapting vision transformers for scalable visual recognition. In Advances in Neural Information Processing Systems, 16664\u201316678.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00799"},{"key":"e_1_3_1_9_2","first-page":"1","volume-title":"International Conference on Learning Representations","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 1\u201312."},{"key":"e_1_3_1_10_2","volume-title":"What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS)","author":"Ekman Paul","year":"1997","unstructured":"Paul Ekman and Erika L. Rosenberg. 1997. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford University Press."},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICEENG64546.2025.11031381"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10506-013-9140-4"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-023-01891-x"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00779"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/SSCI.2017.8285382"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2005.06.042"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.02023"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2019.00016"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-21741-8_4"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00745"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/BigData.2018.8621909"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCDS.2021.3086011"},{"key":"e_1_3_1_23_2","first-page":"1755","article-title":"Dlib-ml: A machine learning toolkit","volume":"10","author":"King Davis E.","year":"2009","unstructured":"Davis E. King. 2009. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research 10 (2009), 1755\u20131758.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_1_24_2","first-page":"87","volume-title":"International Conference on Computational Linguistics and Intelligent Text Processing","author":"Krishnamurthy Gangeshwar","year":"2018","unstructured":"Gangeshwar Krishnamurthy, Navonil Majumder, Soujanya Poria, and Erik Cambria. 2018. A deep learning approach for multimodal deception detection. In International Conference on Computational Linguistics and Intelligent Text Processing. Springer, 87\u201396."},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1214\/aoms\/1177729694"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2016-1519"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i3.16285"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-022-3783-3"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3587251"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2791608"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/CIVEMSA58715.2024.10586610"},{"key":"e_1_3_1_32_2","first-page":"1","volume-title":"IEEE International Joint Conference on Biometrics","author":"Li Zhaoxu","year":"2024","unstructured":"Zhaoxu Li, Zitong Yu, Xun Lin, Nithish Muthuchamy Selvaraj, Xiaobao Guo, Bingquan Shen, Adams Wai, Kin Kong, and Alex Kot. 2024. Flexible-modal deception detection with audio-visual adapter. In IEEE International Joint Conference on Biometrics. IEEE, 1\u201310."},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2023.3278310"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3382507.3418864"},{"key":"e_1_3_1_35_2","first-page":"26462","article-title":"St-adapter: Parameter-efficient image-to-video transfer learning","author":"Pan Junting","year":"2022","unstructured":"Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hongsheng Li. 2022. St-adapter: Parameter-efficient image-to-video transfer learning. In Advances in Neural Information Processing Systems, 26462\u201326477.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/2818346.2820758"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1281"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2020.3015684"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-020-01378-z"},{"key":"e_1_3_1_40_2","first-page":"568","article-title":"Two-stream convolutional networks for action recognition in videos","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 568\u2013576.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1175"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3588574"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i12.26701"},{"issue":"3","key":"e_1_3_1_44_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3699710","article-title":"Deepfake detection: A comprehensive survey from the reliability perspective","volume":"57","author":"Wang Tianyi","year":"2024","unstructured":"Tianyi Wang, Xin Liao, Kam Pui Chow, Xiaodong Lin, and Yinglong Wang. 2024. Deepfake detection: A comprehensive survey from the reliability perspective. ACM Computing Surveys 57, 3 (2024), 1\u201335.","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_1_45_2","first-page":"1695","article-title":"Deception detection in videos","author":"Wu Zhe","year":"2018","unstructured":"Zhe Wu, Bharat Singh, Larry Davis, and V. Subrahmanian. 2018. Deception detection in videos. In AAAI Conference on Artificial Intelligence, 1695\u20131702.","journal-title":"AAAI Conference on Artificial Intelligence"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2016.2603342"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3783994","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,9]],"date-time":"2026-02-09T14:58:11Z","timestamp":1770649091000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3783994"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,9]]},"references-count":45,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,2,28]]}},"alternative-id":["10.1145\/3783994"],"URL":"https:\/\/doi.org\/10.1145\/3783994","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,9]]},"assertion":[{"value":"2025-06-19","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-12-07","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-02-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}