{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,11]],"date-time":"2026-04-11T07:41:30Z","timestamp":1775893290728,"version":"3.50.1"},"reference-count":60,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2023,3,2]],"date-time":"2023-03-02T00:00:00Z","timestamp":1677715200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Comput. Sci."],"abstract":"<jats:p>Recently, engagement has emerged as a key variable explaining the success of conversation. From the perspective of human-machine interaction, an automatic assessment of engagement becomes crucial to better understand the dynamics of an interaction and to design socially-aware robots. This paper presents a predictive model of the level of engagement in conversations. It shows in particular the interest of using a rich multimodal set of features, outperforming the existing models in this domain. In terms of methodology, the study is based on two audio-visual corpora of naturalistic face-to-face interactions. These resources have been enriched with various annotations of verbal and nonverbal behaviors, such as smiles, head nods, and feedbacks. In addition, we manually annotated gesture intensity. Based on a review of previous works in psychology and human-machine interaction, we propose a new definition of the notion of engagement, adequate for the description of this phenomenon both in natural and mediated environments. This definition has been implemented in our annotation scheme. In our work, engagement is studied at the turn level, known to be crucial for the organization of the conversation. Even though there is still a lack of consensus around the precise definition of turns, we have developed a turn detection tool. 
A multimodal characterization of engagement is performed using a multi-level classification of turns. We claim a set of multimodal cues, involving prosodic, mimo-gestural and morpho-syntactic information, is relevant to characterize the level of engagement of speakers in conversation. Our results significantly outperform the baseline and reach state-of-the-art level (0.76 weighted F-score). The most contributing modalities are identified by testing the performance of a two-layer perceptron when trained on unimodal feature sets and on combinations of two to four modalities. These results support our claim about multimodality: combining features related to the speech fundamental frequency and energy with mimo-gestural features leads to the best performance.<\/jats:p>","DOI":"10.3389\/fcomp.2023.1062342","type":"journal-article","created":{"date-parts":[[2023,3,2]],"date-time":"2023-03-02T05:05:00Z","timestamp":1677733500000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["A multimodal approach for modeling engagement in conversation"],"prefix":"10.3389","volume":"5","author":[{"given":"Arthur","family":"Pellet-Rostaing","sequence":"first","affiliation":[]},{"given":"Roxane","family":"Bertrand","sequence":"additional","affiliation":[]},{"given":"Auriane","family":"Boudin","sequence":"additional","affiliation":[]},{"given":"St\u00e9phane","family":"Rauzy","sequence":"additional","affiliation":[]},{"given":"Philippe","family":"Blache","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2023,3,2]]},"reference":[{"key":"B1","article-title":"\u201cA study of gestural feedback expressions,\u201d","volume-title":"First Nordic Symposium on Multimodal Communication","author":"Allwood","year":"2003"},{"key":"B2","article-title":"\u201cSmiling for negotiating topic transitions in French 
conversation,\u201d","author":"Amoyal","year":"2019","journal-title":"GESPIN-Gesture and Speech in Interaction"},{"key":"B3","article-title":"\u201cPaco: A corpus to analyze the impact of common ground in spontaneous face-to-face interaction,\u201d","author":"Amoyal","year":"2020","journal-title":"Language Resources and Evaluation Conference"},{"key":"B4","doi-asserted-by":"publisher","DOI":"10.1007\/s12369-015-0298-7","article-title":"Evaluating the engagement with social robots","author":"Anzalone","year":"2015","journal-title":"Int. J. Soc. Robot"},{"key":"B5","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-08786-3_25","article-title":"\u201cExtending log-based affect detection to a multi-user virtual environment for science,\u201d","author":"Baker","year":"2014","journal-title":"International Conference on User Modeling, Adaptation, and Personalization"},{"key":"B6","doi-asserted-by":"publisher","DOI":"10.1145\/2401836.2401846","article-title":"\u201cConversational engagement in multiparty video conversation: an annotation scheme and classification of high and low levels of engagement,\u201d","author":"Bednarik","year":"2012","journal-title":"Workshop on Eye Gaze in Intelligent Human Machine Interaction"},{"key":"B7","doi-asserted-by":"publisher","first-page":"815","DOI":"10.1007\/s12369-019-00591-2","article-title":"On-the-fly detection of user engagement decrease in spontaneous human-robot interaction using recurrent and deep neural networks","volume":"11","author":"Ben-Youssef","year":"2019","journal-title":"Int. J Soc. Robot."},{"key":"B8","doi-asserted-by":"publisher","first-page":"648","DOI":"10.1080\/08839514.2010.492259","article-title":"Engagement in long-term interventions with relational agents","volume":"24","author":"Bickmore","year":"2010","journal-title":"Appl. Artif. 
Intell"},{"key":"B9","first-page":"1748","article-title":"\u201cSppas: a tool for the phonetic segmentations of speech,\u201d","author":"Bigi","year":"2012","journal-title":"The eighth international conference on Language Resources and Evaluation"},{"key":"B10","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.coling-main.431","article-title":"\u201cTwo-level classification for dialogue act recognition in task-oriented dialogues,\u201d","author":"Blache","year":"2020","journal-title":"Proceedings of COLING-2020"},{"key":"B11","author":"Boersma","year":"1996","journal-title":"Praat, a System for Doing Phonetics by Computer, Version 3.4"},{"key":"B12","doi-asserted-by":"publisher","DOI":"10.3115\/1708376.1708411","article-title":"\u201cTo predict engagement with a spoken dialog system in open-world settings,\u201d","author":"Bohus","year":"2009","journal-title":"Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)"},{"key":"B13","doi-asserted-by":"publisher","DOI":"10.1109\/SocialCom-PASSAT.2012.110","article-title":"\u201cHow do we react to context? 
Annotation of individual and group engagement in a video corpus,\u201d","author":"Bonin","year":"2012","journal-title":"Privacy, Security, Risk and Trust (PASSAT), International Conference on Social Computing (SocialCom)"},{"key":"B14","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-83527-9_46","article-title":"\u201cA multimodal model for predicting conversational feedbacks,\u201d","author":"Boudin","year":"","journal-title":"International Conference on Text, Speech, and Dialogue"},{"key":"B15","doi-asserted-by":"publisher","DOI":"10.1145\/1647314.1647336","article-title":"\u201cDetecting user engagement with a robot companion using task and social interaction-based features,\u201d","author":"Castellano","year":"2009","journal-title":"Proceedings of the International Conference on Multimodal Interfaces"},{"key":"B16","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511620539","volume-title":"Using Language","author":"Clark","year":"1996"},{"key":"B17","article-title":"\u201cAnalysis to modeling of engagement as sequences of multimodal behaviors,\u201d","author":"Dermouche","year":"2018","journal-title":"Language, Resources and Evaluation Conference (LREC)"},{"key":"B18","doi-asserted-by":"publisher","DOI":"10.1145\/3340555.3353765","article-title":"\u201cEngagement modeling in dyadic interaction,\u201d","author":"Dermouche","year":"2019","journal-title":"International Conference on Multimodal Interaction (ICMI '19)"},{"key":"B19","doi-asserted-by":"publisher","DOI":"10.1109\/ACII.2017.8273571","article-title":"\u201cAutomated mood-aware engagement prediction,\u201d","author":"Dhamija","year":"2017","journal-title":"Seventh International Conference on Affective Computing and Intelligent Interaction"},{"key":"B20","doi-asserted-by":"publisher","first-page":"2394","DOI":"10.1587\/transinf.E92.D.2394","article-title":"Humans with humor : a dialogue system that users want to interact with","author":"Dybala","year":"2009","journal-title":"IEICE Trans. Inf. 
Syst."},{"key":"B21","doi-asserted-by":"publisher","DOI":"10.1145\/3279810.3279842","article-title":"\u201cMultimodal approach to engagement and disengagement detection with highly imbalanced in-the-wild data,\u201d","author":"Fedotov","year":"2018","journal-title":"Workshop on Modeling Cognitive Processes from Multimodal Data"},{"key":"B22","article-title":"\u201cIntrinsic and extrinsic evaluation of an automatic user disengagement detector for an uncertainty-adaptive spoken dialogue system,\u201d","author":"Forbes-Riley","year":"2012","journal-title":"Conference of the North American Chapter of the Association for Computational Linguistics"},{"key":"B23","doi-asserted-by":"publisher","first-page":"659","DOI":"10.1007\/s12369-017-0414-y","article-title":"Automatically classifying user engagement for dynamic multi-party human-robot interaction","volume":"9","author":"Foster","year":"2017","journal-title":"Int. J. Social Robot."},{"key":"B24","doi-asserted-by":"publisher","first-page":"944","DOI":"10.1109\/ACII.2015.7344688","article-title":"\u201cDefinitions of engagement in human-agent interaction,\u201d","author":"Glas","year":"","journal-title":"2015 International Conference on Affective Computing and Intelligent Interaction (ACII)"},{"key":"B25","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W15-4725","article-title":"\u201cTopic transition strategies for an information-giving agent,\u201d","author":"Glas","year":"","journal-title":"European Workshop on Natural Language Generation"},{"key":"B26","doi-asserted-by":"publisher","first-page":"601","DOI":"10.1016\/j.csl.2010.10.003","article-title":"Turn-taking cues in task-oriented dialogue","volume":"25","author":"Gravano","year":"2011","journal-title":"Comput. 
Speech Lang."},{"key":"B27","article-title":"\u201cRecognizing continuous social engagement level in dyadic conversation by using turn-taking and speech emotion patterns,\u201d","author":"Hsiao","year":"2012","journal-title":"Workshop on Activity Context Representation - Techniques and Languages (ACR12)"},{"key":"B28","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2016-846","article-title":"\u201cEngagement recognition using auditory and visual cues,\u201d","author":"Huang","year":"2016","journal-title":"Interspeech 2016."},{"key":"B29","doi-asserted-by":"publisher","first-page":"249980","DOI":"10.1145\/2499474.2499480","article-title":"Gaze awareness in conversational agents: estimating a user's conversational engagement from eye gaze","volume":"3","author":"Ishii","year":"2013","journal-title":"ACM Trans. Interact. Intell. Syst."},{"key":"B30","doi-asserted-by":"publisher","DOI":"10.1145\/3472306.3478360","article-title":"\u201cMultimodal and multitask approach to listener's backchannel prediction: Can prediction of turn-changing and turn-management willingness improve backchannel modeling?\u201d","author":"Ishii","year":"2021","journal-title":"Proceedings of the 21st ACM International Conference on Intelligent Virtual Agents"},{"key":"B31","doi-asserted-by":"publisher","first-page":"e2810","DOI":"10.1609\/aimag.v39i3.2810","article-title":"Alexa prize \u2013 state of the art in conversational ai","volume":"39","author":"Khatri","year":"2018","journal-title":"AI Mag."},{"key":"B32","doi-asserted-by":"publisher","first-page":"291","DOI":"10.1007\/s12369-013-0178-y","article-title":"Social robots for long-term interaction: a survey","volume":"5","author":"Leite","year":"2013","journal-title":"Int. J. Soc. 
Robot."},{"key":"B33","doi-asserted-by":"publisher","DOI":"10.1145\/2696454.2696466","article-title":"\u201cComparing models of disengagement in individual and group interactions,\u201d","author":"Leite","year":"2015","journal-title":"International Conference on Human-Robot Interaction (HRI)."},{"key":"B34","doi-asserted-by":"publisher","DOI":"10.3389\/fpsyg.2015.00731","article-title":"Timing in turn-taking and its implications for processing models of language","author":"Levinson","year":"2015","journal-title":"Front. Psychol."},{"key":"B35","article-title":"\u201cEngagement breakdown in hri using thin-slices of facial expressions,\u201d","author":"Liu","year":"2018","journal-title":"Thirty-Second AAAI Conference on Artificial Intelligence"},{"key":"B36","doi-asserted-by":"publisher","DOI":"10.1109\/AMC.2006.1631755","article-title":"\u201cA spatial model of engagement for a social robot\u201d","author":"Michalowski","year":"2006","journal-title":"9th IEEE International Workshop on Advanced Motion Control, 2006"},{"key":"B37","doi-asserted-by":"publisher","DOI":"10.1109\/ROMAN.2007.4415249","article-title":"\u201cInvestigating implicit cues for user state estimation in human-robot interaction using physiological measurements,\u201d","author":"Mower","year":"2007","journal-title":"International Symposium on Robot and Human Interactive Communication (RO-MAN)"},{"key":"B38","doi-asserted-by":"publisher","DOI":"10.1145\/1719970.1719990","article-title":"\u201cEstimating user's engagement from eye-gaze behaviors in human-agent conversations,\u201d","author":"Nakano","year":"2010","journal-title":"Conference on Intelligent User Interfaces (IUI)"},{"key":"B39","doi-asserted-by":"publisher","first-page":"131","DOI":"10.1007\/s12193-009-0026-4","article-title":"Hmm modeling of user engagement in advice-giving dialogues","volume":"3","author":"Novielli","year":"2009","journal-title":"J. 
Multimodal User Interface"},{"key":"B40","doi-asserted-by":"publisher","first-page":"2385","DOI":"10.1016\/j.pragma.2009.12.016","article-title":"User attitude towards an embodied conversational agent: Effects of the interaction mode","volume":"42","author":"Novielli","year":"2010","journal-title":"J. Pragm."},{"key":"B41","doi-asserted-by":"publisher","first-page":"92","DOI":"10.3389\/frobt.2020.00092","article-title":"Engagement in human-agent interaction: an overview","volume":"7","author":"Oertel","year":"2020","journal-title":"Front. Robot. AI"},{"key":"B42","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-25775-9_16","article-title":"\u201cTowards the automatic detection of involvement in conversation,\u201d","author":"Oertel","year":"2011","journal-title":"Analysis of Verbal and Nonverbal Communication and Enactment. The Processing Issues"},{"key":"B43","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-23974-8_29","article-title":"\u201cEstimating a user's conversational engagement based on head pose information,\u201d","author":"Ooko","year":"2011","journal-title":"10th International Conference on Intelligent Virtual Agents, IVA'11"},{"key":"B44","doi-asserted-by":"publisher","DOI":"10.1145\/1655260.1655269","article-title":"\u201cAn exploration of user engagement in HCI,\u201d","author":"Peters","year":"2009","journal-title":"International Workshop on Affective-Aware Virtual Agents and Social Robots"},{"key":"B45","doi-asserted-by":"publisher","DOI":"10.1007\/11550617_20","article-title":"\u201cA model of attention and interest using gaze behavior,\u201d","author":"Peters","year":"","journal-title":"Conference on Intelligent Virtual Agents (IVA)"},{"key":"B46","article-title":"\u201cEngagement capabilities for ECAS,\u201d","author":"Peters","year":"","journal-title":"AAMAS Workshop Creating Bonds with ACAs"},{"key":"B47","volume-title":"Mind, hands, face and body: a goal and belief view of multimodal 
communication","author":"Poggi","year":"2007"},{"key":"B48","article-title":"\u201cCheese!: a corpus of face-to-face french interactions. a case study for analyzing smiling and conversational humor,\u201d","author":"Priego-Valverde","year":"2020","journal-title":"Language, Resources and Evaluation (LREC)"},{"key":"B49","article-title":"\u201cSmad: a tool for automatically annotating the smile intensity along a video record,\u201d","author":"Rauzy","year":"2020"},{"key":"B50","article-title":"\u201cMarsatag, a tagger for french written texts and speech transcriptions,\u201d","author":"Rauzy","year":"2014"},{"key":"B51","doi-asserted-by":"publisher","first-page":"696","DOI":"10.1353\/lan.1974.0010","article-title":"A simplest systematics for the organization of turn-taking for conversation","volume":"50","author":"Sacks","year":"1974","journal-title":"Language"},{"key":"B52","author":"Scheffer","year":"1999","journal-title":"Error estimation and model selection"},{"key":"B53","doi-asserted-by":"publisher","DOI":"10.1109\/ICMI.2002.1166980","article-title":"\u201cHuman-robot interaction: Engagement between humans and robots for hosting activities,\u201d","author":"Sidner","year":"2002","journal-title":"International Conference on Multimodal Interfaces"},{"key":"B54","doi-asserted-by":"publisher","DOI":"10.1145\/964442.964458","article-title":"\u201cWhere to look: a study of human-robot engagement,\u201d","author":"Sidner","year":"2004","journal-title":"International Conference on Intelligent User Interfaces"},{"key":"B55","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1016\/j.artint.2005.03.005","article-title":"Explorations in engagement for humans and robots","volume":"166","author":"Sidner","year":"2005","journal-title":"Artif. 
Intell."},{"key":"B56","doi-asserted-by":"publisher","first-page":"3134301","DOI":"10.1145\/3134301","article-title":"A survey of presence and related concepts","volume":"50","author":"Skarbez","year":"2017","journal-title":"ACM Comput. Surv."},{"key":"B57","doi-asserted-by":"publisher","first-page":"285","DOI":"10.1207\/s15327965pli0104_1","article-title":"The nature of rapport and its nonverbal correlates","volume":"1","author":"Tickle-Degnen","year":"1990","journal-title":"Psychol. Inquiry"},{"key":"B58","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.1801.03625","article-title":"On evaluating and comparing open domain dialog systems","author":"Venkatesh","year":"2018","journal-title":"arXiv: Comput. Lang"},{"key":"B59","doi-asserted-by":"publisher","first-page":"225","DOI":"10.1162\/105474698565686","article-title":"Measuring presence in virtual environments: A presence questionnaire","volume":"7","author":"Witmer","year":"1998","journal-title":"Presence Teleoper. Virtual Environ."},{"key":"B60","doi-asserted-by":"publisher","author":"Yu","year":"2004","DOI":"10.21437\/Interspeech.2004-327"}],"container-title":["Frontiers in Computer Science"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fcomp.2023.1062342\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,3,2]],"date-time":"2023-03-02T05:05:11Z","timestamp":1677733511000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fcomp.2023.1062342\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3,2]]},"references-count":60,"alternative-id":["10.3389\/fcomp.2023.1062342"],"URL":"https:\/\/doi.org\/10.3389\/fcomp.2023.1062342","relation":{},"ISSN":["2624-9898"],"issn-type":[{"value":"2624-9898","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,3,2]]},"article-number":"1062342"}}