{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,30]],"date-time":"2026-04-30T04:19:38Z","timestamp":1777522778200,"version":"3.51.4"},"reference-count":61,"publisher":"SAGE Publications","issue":"5","license":[{"start":{"date-parts":[[2016,10,1]],"date-time":"2016-10-01T00:00:00Z","timestamp":1475280000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["Adaptive Behavior"],"published-print":{"date-parts":[[2016,10]]},"abstract":"<jats:p>A robot capable of understanding emotion expressions can increase its own capability of solving problems by using emotion expressions as part of its own decision-making, in a similar way to humans. Evidence shows that the perception of human interaction starts with an innate perception mechanism, where the interaction between different entities is perceived and categorized into two very clear directions: positive or negative. While the person is developing during childhood, the perception evolves and is shaped based on the observation of human interaction, creating the capability to learn different categories of expressions. In the context of human\u2013robot interaction, we propose a model that simulates the innate perception of audio\u2013visual emotion expressions with deep neural networks, that learns new expressions by categorizing them into emotional clusters with a self-organizing layer. The proposed model is evaluated with three different corpora: The Surrey Audio\u2013Visual Expressed Emotion (SAVEE) database, the visual Bi-modal Face and Body benchmark (FABO) database, and the multimodal corpus of the Emotion Recognition in the Wild (EmotiW) challenge. 
We use these corpora to evaluate the performance of the model to recognize emotional expressions, and compare it to state-of-the-art research.<\/jats:p>","DOI":"10.1177\/1059712316664017","type":"journal-article","created":{"date-parts":[[2016,10,11]],"date-time":"2016-10-11T12:10:05Z","timestamp":1476187805000},"page":"373-396","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":53,"title":["Developing crossmodal expression recognition based on a deep neural model"],"prefix":"10.1177","volume":"24","author":[{"given":"Pablo","family":"Barros","sequence":"first","affiliation":[{"name":"Department of Informatics, University of Hamburg, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Stefan","family":"Wermter","sequence":"additional","affiliation":[{"name":"Department of Informatics, University of Hamburg, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"179","published-online":{"date-parts":[[2016,10,10]]},"reference":[{"key":"e_1_3_4_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2014.2339736"},{"key":"e_1_3_4_3_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0959-4388(02)00301-X"},{"key":"e_1_3_4_4_1","doi-asserted-by":"crossref","unstructured":"Afzal S. Robinson P. (2009). Natural affect data\u2014Collection & annotation in a learning context. In 3rd international conference on affective computing and intelligent interaction (pp. 1\u20137). Piscataway NJ: IEEE Press. Available at: http:\/\/ieeexplore.ieee.org\/xpls\/abs_all.jsp?arnumber=5349537 (accessed 19 August 2016).","DOI":"10.1109\/ACII.2009.5349537"},{"key":"e_1_3_4_5_1","unstructured":"Banda N. Robinson P. (2011). Noise analysis in audio-visual emotion recognition. In 13th international conference on multimodal interaction (ICMI \u201811) (pp. 1\u20134). New York: ACM Press. 
Available at: http:\/\/citeseerx.ist.psu.edu\/viewdoc\/summary?doi=10.1.1.228.6522 (accessed 19 August 2016)."},{"key":"e_1_3_4_6_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2015.09.009"},{"key":"e_1_3_4_7_1","doi-asserted-by":"crossref","unstructured":"Barros P. Weber C. Wermter S. (2015b). Emotional expression recognition with a cross-channel convolutional neural network for human-robot interaction. In 15th IEEE-RAS international conference on humanoid robots (pp. 646\u2013651). Piscataway NJ: IEEE Press. Available at: http:\/\/ieeexplore.ieee.org\/xpls\/abs_all.jsp?arnumber=7363421 (accessed 19 August 2016).","DOI":"10.1109\/HUMANOIDS.2015.7363421"},{"key":"e_1_3_4_8_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0376-6357(02)00078-5"},{"key":"e_1_3_4_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-85099-1_8"},{"key":"e_1_3_4_10_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2012.06.014"},{"key":"e_1_3_4_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/79.911197"},{"key":"e_1_3_4_12_1","doi-asserted-by":"crossref","unstructured":"Dhall A. Goecke R. Joshi J. Sikka K. Gedeon T. (2014). Emotion recognition in the wild challenge 2014: Baseline data and protocol. 16th international conference on multimodal interaction (ICMI \u201814) (pp. 461\u2013466). New York: ACM Press. Available at: http:\/\/dl.acm.org\/citation.cfm?id=2666275 (accessed 19 August 2016).","DOI":"10.1145\/2663204.2666275"},{"key":"e_1_3_4_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/MMUL.2012.26"},{"key":"e_1_3_4_14_1","unstructured":"Ekman P. (2007). Emotions revealed: Recognizing faces and feelings to improve communication and emotional life. Macmillan. 
Available at: http:\/\/psycnet.apa.org\/psycinfo\/2003-88051-000 (accessed 19 August 2016)."},{"key":"e_1_3_4_15_1","doi-asserted-by":"publisher","DOI":"10.1037\/h0030377"},{"key":"e_1_3_4_16_1","first-page":"625","article-title":"Why does unsupervised pre-training help deep learning?","volume":"11","author":"Erhan D.","year":"2010","unstructured":"Erhan D., Bengio Y., Courville A., Manzagol P.-A., Vincent P., Bengio S. (2010). Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, 625\u2013660.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_4_17_1","doi-asserted-by":"publisher","DOI":"10.1016\/0896-6273(94)90455-3"},{"key":"e_1_3_4_18_1","doi-asserted-by":"publisher","DOI":"10.1111\/j.1467-9280.2009.02400.x"},{"issue":"4","key":"e_1_3_4_19_1","first-page":"441","article-title":"Shunting inhibition, a silent step in visual cortical computation","volume":"97","author":"Fregnac Y.","year":"2003","unstructured":"Fregnac Y., Monier C., Chavane F., Baudot P., Graham L. (2003). Shunting inhibition, a silent step in visual cortical computation. Journal of Physiology, 97(4), 441\u2013451.","journal-title":"Journal of Physiology"},{"key":"e_1_3_4_20_1","volume-title":"Facial action coding system: A technique for the measurement of facial movement","author":"Friesen E.","year":"1978","unstructured":"Friesen E., Ekman P. (1978). Facial action coding system: A technique for the measurement of facial movement. Palo Alto, CA: Consulting Psychologists Press."},{"key":"e_1_3_4_21_1","doi-asserted-by":"publisher","DOI":"10.5430\/air.v4n2p61"},{"key":"e_1_3_4_22_1","unstructured":"Glorot X. Bordes A. Bengio Y. (2011). Deep sparse rectifier neural networks. In 14th international conference on artificial intelligence and statistics (AISTATS-11) (Vol. 15 pp. 315\u2013323). 
Available at: http:\/\/www.jmlr.org\/proceedings\/papers\/v15\/glorot11a.html (accessed 19 August 2016)."},{"key":"e_1_3_4_23_1","doi-asserted-by":"publisher","DOI":"10.1016\/0166-2236(92)90344-8"},{"key":"e_1_3_4_24_1","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/4934.001.0001"},{"key":"e_1_3_4_25_1","doi-asserted-by":"crossref","unstructured":"Gunes H. Piccardi M. (2006). A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior. In 18th international conference on pattern recognition (ICPR) (Vol. 1 pp. 1148\u20131153). Available at: http:\/\/ieeexplore.ieee.org\/xpls\/abs_all.jsp?arnumber=1699093 (accessed 19 August 2016).","DOI":"10.1109\/ICPR.2006.39"},{"key":"e_1_3_4_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSMCB.2008.927269"},{"key":"e_1_3_4_27_1","doi-asserted-by":"publisher","DOI":"10.1177\/0963721412470687"},{"key":"e_1_3_4_28_1","doi-asserted-by":"publisher","DOI":"10.4018\/978-1-61520-919-4.ch017"},{"key":"e_1_3_4_29_1","unstructured":"Haq S. Jackson P. J. Edge J. (2009). Speaker-dependent audio-visual emotion recognition. In 2009 international conference on audio-visual speech processing (AVSP) (pp. 53\u201358). Available at: https:\/\/scholar.google.de\/scholar?cluster=5579645476741846741&hl=de&as_sdt=0 5 (accessed 19 August 2016)."},{"key":"e_1_3_4_30_1","doi-asserted-by":"publisher","DOI":"10.1037\/0012-1649.23.3.388"},{"key":"e_1_3_4_31_1","unstructured":"Hau D. Chen K. (2011). Exploring hierarchical speech representations with a deep convolutional neural network. In 11th UK workshop on computational intelligence (UKCI\u201811) (p. 37). 
Available at: https:\/\/scholar.google.de\/scholar?cluster=18130383993448916657&hl=de&as_sdt=0 5 (accessed 19 August 2016)."},{"key":"e_1_3_4_32_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcomdis.2012.06.004"},{"key":"e_1_3_4_33_1","doi-asserted-by":"publisher","DOI":"10.1113\/jphysiol.1959.sp006308"},{"key":"e_1_3_4_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.59"},{"key":"e_1_3_4_35_1","doi-asserted-by":"crossref","unstructured":"Jin Q. Li C. Chen S. Wu H. (2015). Speech emotion recognition with acoustic and lexical features. In 2015 IEEE international conference on acoustics speech and signal processing (ICASSP) (pp. 4749\u20134753). Piscataway NJ: IEEE Press. Available at: http:\/\/ieeexplore.ieee.org\/xpls\/abs_all.jsp?arnumber=7178872 (accessed 19 August 2016).","DOI":"10.1109\/ICASSP.2015.7178872"},{"key":"e_1_3_4_36_1","doi-asserted-by":"crossref","unstructured":"Kahou S. E. Pal C. Bouthillier X. Froumenty P. G\u00fcl\u00e7ehre C. Memisevic R. \u2026 Wu Z. (2013). Combining modality specific deep neural networks for emotion recognition in video. In 15th international conference on multimodal interaction (ICMI \u201813) (pp. 543\u2013550). New York: ACM Press. Available at: http:\/\/dl.acm.org\/citation.cfm?id=2531745 (accessed 19 August 2016).","DOI":"10.1145\/2522848.2531745"},{"key":"e_1_3_4_37_1","doi-asserted-by":"crossref","unstructured":"Karnowski T. P. Arel I. Rose D. (2010). Deep spatiotemporal feature learning with application to image classification. In 9th international conference on machine learning and applications (ICMLA) (pp. 883\u2013888). Available at: http:\/\/ieeexplore.ieee.org\/xpls\/abs_all.jsp?arnumber=5708961 (accessed 19 August 2016).","DOI":"10.1109\/ICMLA.2010.138"},{"key":"e_1_3_4_38_1","doi-asserted-by":"crossref","unstructured":"Khalil-Hani M. Sung L. S. (2014). A convolutional neural network approach for face verification. 
In 2014 international conference on high performance computing simulation (HPCS) (pp. 707\u2013714). Available at: http:\/\/ieeexplore.ieee.org\/xpls\/abs_all.jsp?arnumber=6903759 (accessed 19 August 2016).","DOI":"10.1109\/HPCSim.2014.6903759"},{"key":"e_1_3_4_39_1","first-page":"62","volume-title":"NAACL HLT 2010 workshop on computational approaches to analysis and generation of emotion in text","author":"Kim S. M.","year":"2010","unstructured":"Kim S. M., Valitutti A., Calvo R. A. (2010). Evaluation of unsupervised emotion models to textual affect recognition. In NAACL HLT 2010 workshop on computational approaches to analysis and generation of emotion in text (pp. 62\u201370). Association for Computational Linguistics."},{"key":"e_1_3_4_40_1","doi-asserted-by":"crossref","unstructured":"Kohonen T. (1990). The self-organizing map. Proceedings of the IEEE 78 1464\u20131480. Available at: http:\/\/dl.acm.org\/citation.cfm?id=1860639 (accessed 19 August 2016).","DOI":"10.1109\/5.58325"},{"key":"e_1_3_4_41_1","doi-asserted-by":"crossref","unstructured":"Kret M. E. Roelofs K. Stekelenburg J. J. de Gelder B. (2013). Emotional signals from faces bodies and scenes influence observers\u2019 face expressions fixations and pupil-size. Frontiers in Human Neuroscience 7 810. Available at: http:\/\/journal.frontiersin.org\/article\/10.3389\/fnhum.2013.00810\/full (accessed 19 August 2016).","DOI":"10.3389\/fnhum.2013.00810"},{"key":"e_1_3_4_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/72.554195"},{"key":"#cr-split#-e_1_3_4_43_1.1","doi-asserted-by":"crossref","unstructured":"Lecun Y. Bottou L. Bengio Y. Haffner P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86","DOI":"10.1109\/5.726791"},{"key":"#cr-split#-e_1_3_4_43_1.2","unstructured":"(11) 2278-2324. Available at: http:\/\/ieeexplore.ieee.org\/xpls\/abs_all.jsp?arnumber=726791 (accessed 19 August 2016)."},{"key":"e_1_3_4_44_1","unstructured":"Lewis M. (2012). 
Children\u2019s emotions and moods: Developmental theory and measurement. Springer. Available at: https:\/\/scholar.google.de\/scholar?q=Children%E2%80%99s+emotions+and+moods%3A+Develop-+mental+theory+and+measurement&btnG=&hl=de&as_sdt=0%2C5 (accessed 19 August 2016)."},{"key":"e_1_3_4_45_1","unstructured":"Li T. L. Chan A. B. Chun A. H. (2010). Automatic musical pattern feature extraction using convolutional neural network. In International multiconference of engineers and computer scientists (IMECS 2010) (Vol. 1). Available at: http:\/\/citeseerx.ist.psu.edu\/viewdoc\/summary?doi=10.1.1.302.7795 (accessed 19 August 2016)."},{"key":"e_1_3_4_46_1","doi-asserted-by":"crossref","unstructured":"Liu M. Chen H. Li Y. Zhang F. (2015). Emotional tone-based audio continuous emotion recognition. In He X. Luo S. Tao D. Xu C. Yang J. Hasan M. (Eds.) Multimedia modeling (pp. 470\u2013480). Springer. Available at: http:\/\/link.springer.com\/chapter\/10.1007\/978-3-319-14442-9_52 (accessed 19 August 2016).","DOI":"10.1007\/978-3-319-14442-9_52"},{"key":"e_1_3_4_47_1","doi-asserted-by":"crossref","unstructured":"Liu M. Wang R. Li S. Shan S. Huang Z. Chen X. (2014). Combining multiple kernel methods on Riemannian manifold for emotion recognition in the wild. In 16th international conference on multimodal interaction (pp. 494\u2013501). New York: ACM Press. Available at: http:\/\/dl.acm.org\/citation.cfm?id=2666274 (accessed 19 August 2016).","DOI":"10.1145\/2663204.2666274"},{"key":"e_1_3_4_48_1","unstructured":"MacQueen J. (1967). Some methods for classification and analysis of multivariate observations. In 5th Berkeley symposium on mathematical statistics and probability (Vol. 1 pp. 281\u2013297). Oakland CA. 
Available at: https:\/\/scholar.google.de\/scholar?start=0&hl=de&as_sdt=0 5&cluster=14924728719521477429 (accessed 19 August 2016)."},{"key":"e_1_3_4_49_1","doi-asserted-by":"publisher","DOI":"10.1080\/17405620344000022"},{"key":"e_1_3_4_50_1","doi-asserted-by":"crossref","unstructured":"Ringeval F. Amiriparian S. Eyben F. Scherer K. Schuller B. (2014). Emotion recognition in the wild: Incorporating voice and lip activity in multimodal decision-level fusion. In 16th international conference on multimodal interaction (ICMI \u201814) (pp. 473\u2013480). New York: ACM Press. Available at: http:\/\/dl.acm.org\/citation.cfm?id=2666271 (accessed 19 August 2016).","DOI":"10.1145\/2663204.2666271"},{"key":"e_1_3_4_51_1","doi-asserted-by":"publisher","DOI":"10.1037\/0033-295X.110.1.145"},{"key":"e_1_3_4_52_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2014.08.005"},{"key":"e_1_3_4_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2014.2366127"},{"key":"e_1_3_4_54_1","doi-asserted-by":"crossref","unstructured":"Schluter J. Bock S. (2014). Improved musical onset detection with convolutional neural networks. In 2014 IEEE international conference on acoustics speech and signal processing (ICASSP 2014) (pp. 6979\u20136983). Piscataway NJ: IEEE Press. Available at: http:\/\/ieeexplore.ieee.org\/xpls\/abs_all.jsp?arnumber=6854953 (accessed 19 August 2016).","DOI":"10.1109\/ICASSP.2014.6854953"},{"key":"e_1_3_4_55_1","first-page":"177","article-title":"Beyond shallow models of emotion","volume":"2","author":"Sloman A.","year":"2001","unstructured":"Sloman A. (2001). Beyond shallow models of emotion. Cognitive Processing, 2, 177\u2013198.","journal-title":"Cognitive Processing"},{"key":"e_1_3_4_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSA.2002.800560"},{"key":"e_1_3_4_57_1","unstructured":"Ultsch A. (2003). U-Matrix: A tool to visualize clusters in high dimensional data. 
Report University of Marburg Germany December."},{"key":"e_1_3_4_58_1","doi-asserted-by":"publisher","DOI":"10.3233\/IDA-1999-3203"},{"key":"e_1_3_4_59_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:VISI.0000013087.49260.fb"},{"key":"e_1_3_4_60_1","doi-asserted-by":"crossref","unstructured":"Zeiler M. D. Fergus R. (2014). Visualizing and understanding convolutional networks. In 13th european conference of computer vision (ECCV 2014) (pp. 818\u2013833). Berlin Germany: Springer. Available at: http:\/\/link.springer.com\/chapter\/10.1007\/978-3-319-10590-1_53 (accessed 19 August 2016).","DOI":"10.1007\/978-3-319-10590-1_53"},{"key":"e_1_3_4_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2008.52"}],"container-title":["Adaptive Behavior"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1059712316664017","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/1059712316664017","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1059712316664017","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,28]],"date-time":"2026-04-28T16:18:38Z","timestamp":1777393118000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1059712316664017"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,10]]},"references-count":61,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2016,10]]}},"alternative-id":["10.1177\/1059712316664017"],"URL":"https:\/\/doi.org\/10.1177\/1059712316664017","relation":{},"ISSN":["1059-7123","1741-2633"],"issn-type":[{"value":"1059-7123","type":"print"},{"value":"1741-2633","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,10]]}}}