{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T21:05:43Z","timestamp":1761599143420,"version":"3.41.0"},"reference-count":65,"publisher":"Association for Computing Machinery (ACM)","issue":"1s","license":[{"start":{"date-parts":[[2015,10,21]],"date-time":"2015-10-21T00:00:00Z","timestamp":1445385600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2015,10,21]]},"abstract":"<jats:p>The need for human-centered, affective multimedia interfaces has motivated research in automatic emotion recognition. In this article, we focus on facial emotion recognition. Specifically, we target a domain in which speakers produce emotional facial expressions while speaking. The main challenge of this domain is the presence of modulations due to both emotion and speech. For example, an individual's mouth movement may be similar when he smiles and when he pronounces the phoneme \/IY\/, as in \u201ccheese\u201d. The result of this confusion is a decrease in performance of facial emotion recognition systems. In our previous work, we investigated the joint effects of emotion and speech on facial movement. We found that it is critical to employ proper temporal segmentation and to leverage knowledge of spoken content to improve classification performance. In the current work, we investigate the temporal characteristics of specific regions of the face, such as the forehead, eyebrow, cheek, and mouth. We present methodology that uses the temporal patterns of specific regions of the face in the context of a facial emotion recognition system. We test our proposed approaches on two emotion datasets, the IEMOCAP and SAVEE datasets. 
Our results demonstrate that the combination of emotion recognition systems based on different facial regions improves overall accuracy compared to systems that do not leverage different characteristics of individual regions.<\/jats:p>","DOI":"10.1145\/2808204","type":"journal-article","created":{"date-parts":[[2015,10,24]],"date-time":"2015-10-24T18:27:12Z","timestamp":1445711232000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":20,"title":["Emotion Recognition During Speech Using Dynamics of Multiple Regions of the Face"],"prefix":"10.1145","volume":"12","author":[{"given":"Yelin","family":"Kim","sequence":"first","affiliation":[{"name":"University of Michigan, Ann Arbor, MI"}]},{"given":"Emily Mower","family":"Provost","sequence":"additional","affiliation":[{"name":"University of Michigan, Ann Arbor, MI"}]}],"member":"320","published-online":{"date-parts":[[2015,10,21]]},"reference":[{"volume-title":"Proceedings of the International Conference on Spoken Language Processing. 1931--1934","year":"1994","author":"Arons Barry","key":"e_1_2_2_1_1"},{"key":"e_1_2_2_2_1","unstructured":"Douglas Bates Martin Maechler and Ben Bolker. 2007. lme4: Linear mixed-effects models using S4 classes (R package version 0.9975-11)."},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1002\/cav.v15:3\/4"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2502081.2508119"},{"volume-title":"Proceedings of the International Workshop on Content-Based Multimedia Indexing (CBMI'08)","author":"Bigot Benjamin","key":"e_1_2_2_5_1"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1007977618277"},{"key":"e_1_2_2_7_1","doi-asserted-by":"crossref","unstructured":"Marisa Boston John Hale Reinhold Kliegl Umesh Patil and Shravan Vasishth. 2008. 
Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus. The Mind Research Repository (beta) 1.","DOI":"10.16910\/jemr.2.1.1"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10579-008-9076-6"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2007.905145"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/T-AFFC.2010.1"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/MIS.2013.30"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pcbi.1000436"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0031-3203(02)00128-0"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654935"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/FG.2011.5771366"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1162\/089976698300017197"},{"volume-title":"Friesen","year":"1977","author":"Ekman Paul","key":"e_1_2_2_17_1"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2010.09.020"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/NCC.2013.6487987"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00521-011-0643-1"},{"volume-title":"Speech and Audio Signal Processing: Processing and Perception of Speech and Music","author":"Gold Ben","key":"e_1_2_2_21_1","doi-asserted-by":"crossref","DOI":"10.1002\/9781118142882"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/FG.2011.5771357"},{"volume-title":"Jackson","year":"2010","author":"Haq Sanaul","key":"e_1_2_2_23_1"},{"volume-title":"Calvo","year":"2014","author":"Hussain M. 
Sazzad","key":"e_1_2_2_24_1"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2656408"},{"key":"e_1_2_2_26_1","first-page":"1","article-title":"Fusion of audio-visual features using hierarchical classifier systems for the recognition of affective states and the state of depression","volume":"1","author":"K\u00e4chele Markus","year":"2014","journal-title":"Depression"},{"volume-title":"Proceedings of INTERSPEECH.","year":"2012","author":"Kalinli Ozlem","key":"e_1_2_2_27_1"},{"key":"e_1_2_2_28_1","doi-asserted-by":"crossref","unstructured":"Joseph Keshet Shai Shalev-Shwartz and Yoram Singer. 2005. Phoneme alignment based on discriminative learning. http:\/\/u.cs.biu.ac.il\/~jkeshet\/papers\/KeshetShSiCh05.pdf.","DOI":"10.21437\/Interspeech.2005-129"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654934"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACII.2009.5349544"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/T-AFFC.2012.16"},{"volume-title":"Proceedings of the International Conference on Devices and Communications. IEEE, 1--5.","author":"Koolagudi Shashidhar G.","key":"e_1_2_2_32_1"},{"volume-title":"Proceedings of INTERSPEECH. 320--323","year":"2009","author":"Lee Chi-Chun","key":"e_1_2_2_33_1"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2011.06.004"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSA.2004.838534"},{"volume-title":"Proceedings of INTERSPEECH. 205--211","year":"2004","author":"Lee Chul Min","key":"e_1_2_2_36_1"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1247480.1247546"},{"volume-title":"Proceedings of the Australian International Conference on Speech Science & Technology. 
265--270","year":"2004","author":"Lucey Patrick","key":"e_1_2_2_38_1"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/FG.2013.6553752"},{"volume-title":"Affective Computing and Intelligent Interaction","author":"Meng Hongying","key":"e_1_2_2_40_1"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2010.5494893"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2012.08.018"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/T-AFFC.2011.40"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2009.2021722"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2010.2076804"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2011.5946960"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2013.6638345"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2012.2236291"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/2388676.2388783"},{"volume-title":"Face Recognition","author":"Pantic Maja","key":"e_1_2_2_50_1"},{"volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 3989--3992","year":"2008","author":"Qiao Yu","key":"e_1_2_2_51_1"},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/354384.354443"},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/2512530.2512534"},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/FG.2011.5771434"},{"key":"e_1_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2388676.2388781"},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2011.01.011"},{"volume-title":"Proceedings of INTERSPEECH. 
1818--1821","year":"2006","author":"Schuller Bj\u00f6rn","key":"e_1_2_2_57_1"},{"key":"e_1_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.csl.2012.02.005"},{"key":"e_1_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACII.2013.15"},{"key":"e_1_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2008.08.005"},{"key":"e_1_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSA.2003.813579"},{"key":"e_1_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.csl.2012.11.003"},{"key":"e_1_2_2_63_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2013.11.025"},{"key":"e_1_2_2_64_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2010.08.013"},{"key":"e_1_2_2_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2007.1110"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2808204","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2808204","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T06:12:40Z","timestamp":1750227160000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2808204"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,10,21]]},"references-count":65,"journal-issue":{"issue":"1s","published-print":{"date-parts":[[2015,10,21]]}},"alternative-id":["10.1145\/2808204"],"URL":"https:\/\/doi.org\/10.1145\/2808204","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2015,10,21]]},"assertion":[{"value":"2015-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publica
tion History"}},{"value":"2015-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-10-21","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}