{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,13]],"date-time":"2026-03-13T03:02:38Z","timestamp":1773370958445,"version":"3.50.1"},"reference-count":31,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T00:00:00Z","timestamp":1773273600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Committee of Science of the Ministry of Science and Higher Education of the Republic of Kazakhstan","award":["BR24992875"],"award-info":[{"award-number":["BR24992875"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>Multimodal human\u2013AI systems generally consider facial expressions and body motions as separate input streams, leading to disjointed interpretations and diminished emotional coherence. To overcome this issue, we offer the Engagement-Safe Expressive Alignment (ESEA) paradigm and the Unified Visual Synchrony (UVS) framework as its computational implementation. UVS models the coherence between facial expressions and gestures, offering an interpretable visual synchrony signal that can function as adaptive feedback in human\u2013AI interactions. The framework\u2019s key component is the Consistency Index for Affective Synchrony (CIAS), which correlates brief visual segments with scalar synchrony scores through a common latent representation. Facial and gestural signals are processed by modality-specific projection networks into a unified latent space, and CIAS is derived from the similarity and short-term temporal consistency of these latent trajectories. The synchrony index is regarded as an estimation of affective visual coherence within the ESEA paradigm. We formalize the UVS\/CIAS framework and conduct a comparative experimental evaluation utilizing matched and mismatched face\u2013gesture segments derived from rendered dialog footage. Utilizing ROC analysis, score distribution comparisons, temporal visualizations, and negative control tests, we illustrate that CIAS effectively captures structured face\u2013gesture alignment that surpasses similarity-based baselines, while also delivering a persistent, time-resolved synchronization signal. These findings establish CIAS as a principled and interpretable feedback signal for future affect-aware, engagement-focused multimodal agents.<\/jats:p>","DOI":"10.3390\/bdcc10030088","type":"journal-article","created":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T14:46:31Z","timestamp":1773326791000},"page":"88","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Unified Visual Synchrony: A Framework for Face\u2013Gesture Coherence in Multimodal Human\u2013AI Interaction"],"prefix":"10.3390","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3807-9530","authenticated-orcid":false,"given":"Saule","family":"Kudubayeva","sequence":"first","affiliation":[{"name":"Institute of Information and Computational Technologies CS MSHE RK, Almaty 050000, Kazakhstan"},{"name":"Faculty of Digital Sciences and Artificial Intelligence, L. N. Gumilyov Eurasian National University, Astana 010008, Kazakhstan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2863-1715","authenticated-orcid":false,"given":"Yernar","family":"Seksenbayev","sequence":"additional","affiliation":[{"name":"Faculty of Digital Sciences and Artificial Intelligence, L. N. Gumilyov Eurasian National University, Astana 010008, Kazakhstan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2013-1513","authenticated-orcid":false,"given":"Aigerim","family":"Yerimbetova","sequence":"additional","affiliation":[{"name":"Institute of Information and Computational Technologies CS MSHE RK, Almaty 050000, Kazakhstan"},{"name":"School of Engineering and Information Technology, META University, Almaty 050012, Kazakhstan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4255-5456","authenticated-orcid":false,"given":"Elmira","family":"Daiyrbayeva","sequence":"additional","affiliation":[{"name":"Institute of Information and Computational Technologies CS MSHE RK, Almaty 050000, Kazakhstan"},{"name":"Department of Software Engineering, Satbayev University, Almaty 050010, Kazakhstan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9849-6176","authenticated-orcid":false,"given":"Bakzhan","family":"Sakenov","sequence":"additional","affiliation":[{"name":"Institute of Information and Computational Technologies CS MSHE RK, Almaty 050000, Kazakhstan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1387-5351","authenticated-orcid":false,"given":"Duman","family":"Telman","sequence":"additional","affiliation":[{"name":"Institute of Information and Computational Technologies CS MSHE RK, Almaty 050000, Kazakhstan"},{"name":"School of Information Technologies and Applied Mathematics, SDU University, Kaskelen 040901, Kazakhstan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1470-3706","authenticated-orcid":false,"given":"Mussa","family":"Turdalyuly","sequence":"additional","affiliation":[{"name":"Institute of Information and Computational Technologies CS MSHE RK, Almaty 050000, Kazakhstan"},{"name":"School of Engineering and Information Technology, META University, Almaty 050012, Kazakhstan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2026,3,12]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"423","DOI":"10.1109\/TPAMI.2018.2798607","article-title":"Multimodal Machine Learning: A Survey and Taxonomy","volume":"41","author":"Ahuja","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"345","DOI":"10.1007\/s00530-010-0182-0","article-title":"Multimodal fusion for multimedia analysis: A survey","volume":"16","author":"Atrey","year":"2010","journal-title":"Multimed. Syst."},{"key":"ref_3","unstructured":"Zadeh, A., Liang, P.P., Mazumder, N., Poria, S., and Morency, L.-P. (2018). Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion. Proceedings of ACL, Association for Computational Linguistics."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"487","DOI":"10.1111\/cgf.13946","article-title":"Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows","volume":"39","author":"Alexanderson","year":"2020","journal-title":"Comput. Graph. Forum"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Liu, H., Zhu, Z., Becherini, G., Peng, Y., Su, M., Zhou, Y., Zhe, X., Iwamoto, N., Zheng, B., and Black, M.J. (2024). EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling. IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE.","DOI":"10.1109\/CVPR52733.2024.00115"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1007\/s10579-008-9076-6","article-title":"IEMOCAP: Interactive emotional dyadic motion capture database","volume":"42","author":"Busso","year":"2008","journal-title":"Lang. Resour. Eval."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Tsai, Y.-H.H., Bai, S., Yamada, M., Morency, L.-P., and Salakhutdinov, R. (2019). Multimodal Transformer for Unaligned Multimodal Language Sequences. Proceedings of ACL, Association for Computational Linguistics.","DOI":"10.18653\/v1\/P19-1656"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Praveen, R.G., and Alam, J. (2024). Recursive Joint Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition. arXiv.","DOI":"10.1109\/CVPRW63382.2024.00483"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Waligora, P., Aslam, H., Zeeshan, O., Koerich, A., Pedersoli, M., and Granger, E. (2024). Joint Multimodal Transformer for Emotion Recognition in the Wild. arXiv.","DOI":"10.1109\/CVPRW63382.2024.00465"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"529","DOI":"10.1177\/053901882021004003","article-title":"A Psychoevolutionary Theory of Emotions","volume":"21","author":"Plutchik","year":"1982","journal-title":"Soc. Sci. Inf."},{"key":"ref_11","unstructured":"Yernar, S. (2025, November 15). Multimodal Emotion and Gesture: UVS\/CIAS Implementation and Evaluation Scripts. GitHub Repository. Available online: https:\/\/github.com\/ernar65019920\/Multimodal_emoution_and_gesture."},{"key":"ref_12","unstructured":"McNeill, D. (1992). Hand and Mind: What Gestures Reveal About Thought, University of Chicago Press."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Wang, Y., and Neff, M. (2013). The Influence of Prosody on the Requirements for Gesture\u2013Text Alignment. Intelligent Virtual Agents (IVA 2013), Springer.","DOI":"10.1007\/978-3-642-40415-3_16"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Kendon, A. (2004). Gesture: Visible Action as Utterance, Cambridge University Press.","DOI":"10.1017\/CBO9780511807572"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Grewe, C.M., Fehrenbach, J., and Hornecker, E. (2021). Statistical Learning of Facial Expressions Improves the Perceived Realism of Virtual Characters. Front. Virtual Real., 2.","DOI":"10.3389\/frvir.2021.619811"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1016\/0022-3956(67)90004-0","article-title":"A segmentation of behavior","volume":"5","author":"Condon","year":"1967","journal-title":"J. Psychiatr. Res."},{"key":"ref_17","unstructured":"Fu, D., Liu, Y., Delaherche, E., and Chetouani, M. (2021). Interpersonal Physiological Synchrony for Detecting Social Interaction Quality. Front. Psychol., 12."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Wu, Y., Zhang, L., Chen, H., and Li, Y. (2025). A Comprehensive Review of Multimodal Emotion Recognition. Biomimetics, 10.","DOI":"10.3390\/biomimetics10070418"},{"key":"ref_19","unstructured":"Mehrabian, A. (1971). Silent Messages, Wadsworth."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"242","DOI":"10.1038\/nrn1872","article-title":"Towards the neurobiology of emotional body language","volume":"7","year":"2006","journal-title":"Nat. Rev. Neurosci."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"349","DOI":"10.1037\/amp0000488","article-title":"What the Face Displays: Mapping 28 Emotions Conveyed by Naturalistic Expression","volume":"75","author":"Cowen","year":"2020","journal-title":"Am. Psychol."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"1225","DOI":"10.1126\/science.1224313","article-title":"Body cues, not facial expressions, discriminate between intense positive and negative emotions","volume":"338","author":"Aviezer","year":"2012","journal-title":"Science"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"724","DOI":"10.1111\/j.1467-9280.2008.02148.x","article-title":"Angry, disgusted, or afraid? Studies on the malleability of emotion perception","volume":"19","author":"Aviezer","year":"2008","journal-title":"Psychol. Sci."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"169","DOI":"10.1007\/s00221-010-2220-8","article-title":"Social context influences recognition of bodily expressions","volume":"203","author":"Kret","year":"2010","journal-title":"Exp. Brain Res."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"981","DOI":"10.1093\/cercor\/bhr156","article-title":"Biological motion processing as a hallmark of social cognition","volume":"22","author":"Pavlova","year":"2012","journal-title":"Cereb. Cortex"},{"key":"ref_26","unstructured":"Argyle, M. (1988). Bodily Communication, Methuen. [2nd ed.]."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Harrigan, J.A., Rosenthal, R., and Scherer, K.R. (2005). The New Handbook of Methods in Nonverbal Behavior Research, Oxford University Press.","DOI":"10.1093\/oso\/9780198529613.001.0001"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"1307","DOI":"10.1080\/02699930902928969","article-title":"The dynamic architecture of emotion: Evidence for the component process model","volume":"23","author":"Scherer","year":"2009","journal-title":"Cogn. Emot."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"5449","DOI":"10.1109\/TPAMI.2024.3366769","article-title":"From Simple to Complex Scenes: Learning Robust Feature Representations for Accurate Human Parsing","volume":"46","author":"Liu","year":"2024","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"5861","DOI":"10.1109\/TDSC.2025.3576223","article-title":"Light-Field Image Multiple Reversible Robust Watermarking Against Geometric Attacks","volume":"22","author":"Wang","year":"2025","journal-title":"IEEE Trans. Dependable Secur. Comput."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Ekman, P., and Friesen, W. (1978). Facial Action Coding System (FACS), Consulting Psychologists Press.","DOI":"10.1037\/t27734-000"}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/10\/3\/88\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T16:00:28Z","timestamp":1773331228000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/10\/3\/88"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,12]]},"references-count":31,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2026,3]]}},"alternative-id":["bdcc10030088"],"URL":"https:\/\/doi.org\/10.3390\/bdcc10030088","relation":{},"ISSN":["2504-2289"],"issn-type":[{"value":"2504-2289","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,12]]}}}