{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T23:31:48Z","timestamp":1775086308453,"version":"3.50.1"},"reference-count":42,"publisher":"Springer Science and Business Media LLC","issue":"24","license":[{"start":{"date-parts":[[2022,6,22]],"date-time":"2022-06-22T00:00:00Z","timestamp":1655856000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,6,22]],"date-time":"2022-06-22T00:00:00Z","timestamp":1655856000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001215","name":"La Trobe University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100001215","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Multimed Tools Appl"],"published-print":{"date-parts":[[2022,10]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The advancements of the Internet of Things (IoT) and voice-based multimedia applications have resulted in the generation of big data consisting of patterns, trends and associations capturing and representing many features of human behaviour. The latent representations of many aspects and the basis of human behaviour is naturally embedded within the expression of emotions found in human speech. This signifies the importance of mining audio data collected from human conversations for extracting human emotion. Ability to capture and represent human emotions will be an important feature in next-generation artificial intelligence, with the expectation of closer interaction with humans. Although the textual representations of human conversations have shown promising results for the extraction of emotions, the acoustic feature-based emotion detection from audio still lags behind in terms of accuracy. This paper proposes a novel approach for feature extraction consisting of Bag-of-Audio-Words (BoAW) based feature embeddings for conversational audio data. A Recurrent Neural Network (RNN) based state-of-the-art emotion detection model is proposed that captures the conversation-context and individual party states when making real-time categorical emotion predictions. The performance of the proposed approach and the model is evaluated using two benchmark datasets along with an empirical evaluation on real-time prediction capability. The proposed approach reported 60.87% weighted accuracy and 60.97% unweighted accuracy for six basic emotions for IEMOCAP dataset, significantly outperforming current state-of-the-art models.<\/jats:p>","DOI":"10.1007\/s11042-022-13363-4","type":"journal-article","created":{"date-parts":[[2022,6,22]],"date-time":"2022-06-22T03:26:31Z","timestamp":1655868391000},"page":"35173-35194","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":72,"title":["A voice-based real-time emotion detection technique using recurrent neural network empowered feature modelling"],"prefix":"10.1007","volume":"81","author":[{"given":"Sadil","family":"Chamishka","sequence":"first","affiliation":[]},{"given":"Ishara","family":"Madhavi","sequence":"additional","affiliation":[]},{"given":"Rashmika","family":"Nawaratne","sequence":"additional","affiliation":[]},{"given":"Damminda","family":"Alahakoon","sequence":"additional","affiliation":[]},{"given":"Daswin","family":"De Silva","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5396-8897","authenticated-orcid":false,"given":"Naveen","family":"Chilamkurti","sequence":"additional","affiliation":[]},{"given":"Vishaka","family":"Nanayakkara","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,6,22]]},"reference":[{"key":"13363_CR1","doi-asserted-by":"publisher","unstructured":"Abeysinghe S et al. (2018) Enhancing decision making capacity in tourism domain using social media analytics. 2018 18th International Conference on Advances in ICT for Emerging Regions (ICTer), pp 369\u2013375. https:\/\/doi.org\/10.1109\/ICTER.2018.8615462","DOI":"10.1109\/ICTER.2018.8615462"},{"key":"13363_CR2","doi-asserted-by":"publisher","unstructured":"Adikari A, Alahakoon D (2021) Understanding citizens\u2019 emotional pulse in a smart city using artificial intelligence. IEEE Trans Ind Inf 17(4):2743\u20132751.\u00a0https:\/\/doi.org\/10.1109\/TII.2020.3009277","DOI":"10.1109\/TII.2020.3009277"},{"key":"13363_CR3","doi-asserted-by":"crossref","unstructured":"Adikari A, Burnett D, Sedera D, de Silva D, Alahakoon D (2021) Value co-creation for open innovation: An evidence-based study of the data driven paradigm of social media using machine learning. Int J Inf Manag Data Insights 1(2):100022","DOI":"10.1016\/j.jjimei.2021.100022"},{"issue":"4","key":"13363_CR4","doi-asserted-by":"publisher","first-page":"e27341","DOI":"10.2196\/27341","volume":"23","author":"A Adikari","year":"2021","unstructured":"Adikari A, Nawaratne R, De Silva D, Ranasinghe S, Alahakoon O, Alahakoon D (2021) Emotions of COVID-19: Content analysis of self-reported information using artificial intelligence. J Med Internet Res 23(4):e27341","journal-title":"J Med Internet Res"},{"key":"13363_CR5","doi-asserted-by":"publisher","first-page":"302","DOI":"10.1016\/j.future.2020.10.028","volume":"116","author":"A Adikari","year":"2021","unstructured":"Adikari A, Gamage G, de Silva D, Mills N, Wong S, Alahakoon D (2021) A self structuring artificial intelligence framework for deep emotions modeling and analysis on the social web. Futur Gener Comput Syst 116:302\u2013315","journal-title":"Futur Gener Comput Syst"},{"key":"13363_CR6","doi-asserted-by":"publisher","DOI":"10.1007\/s10796-020-10056-x","author":"D Alahakoon","year":"2020","unstructured":"Alahakoon D, Nawaratne R, Xu Y, De Silva D, Sivarajah U, Gupta B (2020)Self-building artificial intelligence and machine learning to empower big data analytics in smart cities. Inform Syst Front. https:\/\/doi.org\/10.1007\/s10796-020-10056-x","journal-title":"Inform Syst Front"},{"key":"13363_CR7","doi-asserted-by":"crossref","unstructured":"Alvi S, Afzal B, Shah G, Atzori L, Mahmood W (2015) Internet of multimedia things: Vision and challenges. Ad Hoc Networks 33:87\u2013111","DOI":"10.1016\/j.adhoc.2015.04.006"},{"key":"13363_CR8","unstructured":"Arthur D, Vassilvitskii S (2007) k-means++: The advantages of careful seeding. In: Proc. of the 18th annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, pp 1027\u20131035"},{"key":"13363_CR9","unstructured":"Baevski A, Zhou H, Mohamed A, Auli M (2021) wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv.org"},{"issue":"4","key":"13363_CR10","doi-asserted-by":"publisher","first-page":"335","DOI":"10.1007\/s10579-008-9076-6","volume":"42","author":"C Busso","year":"2008","unstructured":"Busso C, Bulut M, Lee C-C, Kazemzadeh A, Mower E, Kim S, Chang JN, Lee S, Narayanan SS (2008) IEMOCAP: Interactive emotional dyadic motion capture database. Lang Resour Eval 42(4):335","journal-title":"Lang Resour Eval"},{"issue":"10","key":"13363_CR11","doi-asserted-by":"publisher","first-page":"1440","DOI":"10.1109\/LSP.2018.2860246","volume":"25","author":"M Chen","year":"2018","unstructured":"Chen M, He X, Yang J, Zhang H (2018)3-D Convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Process Lett 25(10):1440\u20131444. https:\/\/doi.org\/10.1109\/LSP.2018.2860246","journal-title":"IEEE Signal Process Lett"},{"key":"13363_CR12","doi-asserted-by":"publisher","unstructured":"Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. https:\/\/doi.org\/10.48550\/arXiv.1412.3555","DOI":"10.48550\/arXiv.1412.3555"},{"key":"13363_CR13","unstructured":"Converting Video (2020) Formats with FFmpeg | Linux Journal. Linuxjournal.com"},{"key":"13363_CR14","doi-asserted-by":"crossref","unstructured":"Devamanyu Hazarika S, Poria A, Zadeh E, Cambria L-P, Morency, Zimmermann R (2018) Conversational memory network for emotion recognition in dyadic dialogue videos. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume1 (Long Papers), vol 1, pp 2122\u20132132","DOI":"10.18653\/v1\/N18-1193"},{"key":"13363_CR15","doi-asserted-by":"publisher","unstructured":"Ekman P (1992) An argument for basic emotions. Cognit Emot 6(3\u20134):169\u2013200.\u00a0https:\/\/doi.org\/10.1080\/02699939208411068","DOI":"10.1080\/02699939208411068"},{"key":"13363_CR16","doi-asserted-by":"publisher","unstructured":"Florian Eyben F, Weninger F, Gross B (2013) Schuller: Recent Developments in open SMILE, the Munich Open-Source Multimedia Feature Extractor. In: Proc. ACM Multimedia (MM), Barcelona, Spain, ACM, ISBN 978-1-4503-2404-5, pp 835\u2013838.\u00a0https:\/\/doi.org\/10.1145\/2502081.2502224","DOI":"10.1145\/2502081.2502224"},{"key":"13363_CR17","doi-asserted-by":"crossref","unstructured":"Ghosal D, Majumder N, Poria S, Chhaya N, Gelbukh A (2019) Dialoguegcn: A graph convolutional neural network for emotion recognition in conversation. arXiv preprint arXiv:1908.11540","DOI":"10.18653\/v1\/D19-1015"},{"key":"13363_CR18","unstructured":"Han K, Yu D, Tashev I (2020) Speech emotion recognition using deep neural network and extreme learning machine. Microsoft Research"},{"key":"13363_CR19","doi-asserted-by":"crossref","unstructured":"Hazarika D, Poria S, Zadeh A, Cambria E, Morency L-P, Zimmermann R (2018) Conversational memory network for emotion recognition in dyadic dialogue videos. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol 1 (Long Papers), pp 2122\u20132132","DOI":"10.18653\/v1\/N18-1193"},{"key":"13363_CR20","doi-asserted-by":"publisher","unstructured":"Hazarika D, Poria S, Mihalcea R, Cambria E, Zimmermann R (2020) ICoN: Interactive conversational memory network for multimodal emotion detection. Proc. 2018 Conf. Empir. Methods Nat. Lang. Process. EMNLP 2018, pp 2594\u20132604. https:\/\/doi.org\/10.18653\/v1\/d18-1280","DOI":"10.18653\/v1\/d18-1280"},{"key":"13363_CR21","unstructured":"De Barros PVA (2016) Modeling affection mechanisms using deep and self-organizing neural networks. Staats-und Universit\u00e4tsbibliothek Hamburg Carl von Ossietzky"},{"key":"13363_CR22","first-page":"1","volume-title":"Human emotions","author":"C Izard","year":"2013","unstructured":"Izard C (2013) Human emotions. Springer, New York, pp 1\u20134"},{"key":"13363_CR23","doi-asserted-by":"crossref","unstructured":"Jiao W, Lyu MR, King I (2019)Real-time emotion recognition via attention gated hierarchical memory network. arXiv preprint arXiv:1911.09075","DOI":"10.1609\/aaai.v34i05.6309"},{"key":"13363_CR24","doi-asserted-by":"publisher","unstructured":"Keren G, Schuller B (2016) Convolutional RNN: An enhanced model for extracting features from sequential data. Proc. Int. Jt. Conf. Neural Networks, vol. 2016-October, pp 3412\u20133419. https:\/\/doi.org\/10.1109\/IJCNN.2016.7727636","DOI":"10.1109\/IJCNN.2016.7727636"},{"issue":"9\u201310","key":"13363_CR25","doi-asserted-by":"publisher","first-page":"1162","DOI":"10.1016\/j.specom.2011.06.004","volume":"53","author":"C-C Lee","year":"2011","unstructured":"Lee C-C, Mower E, Busso C, Lee S, Narayanan S (2011) Emotion recognition using a hierarchical binary decision tree approach. Speech Commun 53(9\u201310):1162\u20131171","journal-title":"Speech Commun"},{"issue":"10","key":"13363_CR26","doi-asserted-by":"publisher","first-page":"1163","DOI":"10.3390\/electronics10101163","volume":"10","author":"E Lieskovsk\u00e1","year":"2021","unstructured":"Lieskovsk\u00e1 E, Jakubec M, Jarina R, Chmul\u00edk M (2021) A review on speech emotion recognition using deep learning and attention mechanism. Electronics 10(10):1163","journal-title":"Electronics"},{"key":"13363_CR27","doi-asserted-by":"publisher","unstructured":"Madhavi I, Chamishka S, Nawaratne R, Nanayakkara V, Alahakoon D, De Silva D (2020) A deep learning approach for work related stress detection from audio streams in cyber physical environments. 2020 25th IEEE International Conference on Emerging Technologies and Automation F (ETFA), pp 929\u2013936. https:\/\/doi.org\/10.1109\/ETFA46521.2020.9212098","DOI":"10.1109\/ETFA46521.2020.9212098"},{"key":"13363_CR28","doi-asserted-by":"publisher","unstructured":"Majumder N, Poria S, Hazarika D, Mihalcea R, Gelbukh A, Cambria E (2019) DialogueRNN: An Attentive RNN for Emotion Detection in Conversations. Proceedings of the AAAI Conference on Artificial Intelligence, vol 33, pp 6818\u20136825. Available: https:\/\/doi.org\/10.1609\/aaai.v33i01.33016818","DOI":"10.1609\/aaai.v33i01.33016818"},{"key":"13363_CR29","doi-asserted-by":"publisher","unstructured":"Mirsamadi S, Barsoum E, Zhang C (2017) Automatic speech emotion recognition using recurrent neural networks with local attention center for robust speech systems. The University of Texas at Dallas, Richardson, TX 75080, USA Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA. IEEE Int. Conf. Acoust. Speech, Signal Process, pp 2227\u20132231.\u00a0https:\/\/doi.org\/10.1109\/ICASSP.2017.7952552","DOI":"10.1109\/ICASSP.2017.7952552"},{"issue":"4","key":"13363_CR30","doi-asserted-by":"publisher","first-page":"344","DOI":"10.1511\/2001.4.344","volume":"89","author":"R Plutchik","year":"2001","unstructured":"Plutchik R (2001) The Nature of Emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. Am Sci 89(4):344\u2013350","journal-title":"Am Sci"},{"key":"13363_CR31","doi-asserted-by":"crossref","unstructured":"Poria S, Cambria E, Hazarika D, Majumder N, Zadeh A, Morency L-P(2017)Context-dependent sentiment analysis in user-generated videos. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (vol 1: Long Papers), pp 873\u2013 883","DOI":"10.18653\/v1\/P17-1081"},{"key":"13363_CR32","doi-asserted-by":"crossref","unstructured":"Poria S, Hazarika D, Majumder N, Naik G, Cambria E, Mihalcea R (2019) MELD: A multimodal multi-party dataset for emotion recognition in conversations. ACL, pp 527\u2013536","DOI":"10.18653\/v1\/P19-1050"},{"key":"13363_CR33","unstructured":"Rathnayaka P, Abeysinghe S, Samarajeewa C, Manchanayake I, Walpola M, Nawaratne R, Bandaragoda T, Alahakoon D (2019) Gated recurrent neural network approach for multilabel emotion detection in microblogs. 2012:2012\u20132017. http:\/\/arxiv.org\/abs\/1907.07653"},{"issue":"1","key":"13363_CR34","doi-asserted-by":"publisher","first-page":"11","DOI":"10.1109\/T-AFFC.2010.10","volume":"1","author":"WP Rosalind","year":"2010","unstructured":"Rosalind WP (2010) Affective computing: from laughter to IEEE. IEEE Trans Affect Comput 1(1):11\u201317","journal-title":"IEEE Trans Affect Comput"},{"key":"13363_CR35","doi-asserted-by":"crossref","unstructured":"Ruusuvuori J (2013) Emotion, affect and conversation. The handbook of conversation analysis, pp 330\u2013349","DOI":"10.1002\/9781118325001.ch16"},{"key":"13363_CR36","doi-asserted-by":"publisher","unstructured":"Satt A, Rozenberg S, Hoory R (2017) Efficient emotion recognition from speech using deep learning on spectrograms,. Proc. Annu. Conf. Int. Speech Commun. Assoc. INTERSPEECH, vol 2017-August, pp 1089\u20131093. https:\/\/doi.org\/10.21437\/Interspeech.2017-200","DOI":"10.21437\/Interspeech.2017-200"},{"issue":"96","key":"13363_CR37","first-page":"1","volume":"18","author":"M Schmitt","year":"2017","unstructured":"Schmitt M, Schuller B (2017) openXBOW - Introducing the passau open-source crossmodal bag-of-words toolkit. J Mach Learn Res 18(96):1\u20135","journal-title":"J Mach Learn Res"},{"key":"13363_CR38","doi-asserted-by":"crossref","unstructured":"Schmitt F, Ringeval, Schuller B (2016) At the border of acous-tics and linguistics: Bag-of-audio-words for the recognition of emotions in speech. Proc of Interspeech, pp 495\u2013499","DOI":"10.21437\/Interspeech.2016-1124"},{"key":"13363_CR39","doi-asserted-by":"crossref","unstructured":"Schuller B, Steidl S, Batliner A, Epps J, Eyben F, Ringeval F, Marchi E, Zhang Y (2014) The INTERSPEECH 2014 Computational Paralinguistics Challenge: Cognitive & Physical Load. In: Proceedings INTERSPEECH 2014. 15th Annual Conference of the International Speech Communication Association, (Singapore, Singapore), ISCA, ISCA","DOI":"10.21437\/Interspeech.2014-104"},{"key":"13363_CR40","unstructured":"Tripathi S, Kumar A, Ramesh A, Singh C, Yenigalla P (2019) Deep learning based emotion recognition system using speech features and transcriptions, pp 1\u201312"},{"key":"13363_CR41","doi-asserted-by":"publisher","unstructured":"Yoon S, Byun S, Jung K (2019) Multimodal speech emotion recognition using audio and text. 2018 IEEE Spok. Lang. Technol. Work. SLT 2018 - Proc., no. December, pp 112\u2013118.\u00a0https:\/\/doi.org\/10.1109\/SLT.2018.8639583","DOI":"10.1109\/SLT.2018.8639583"},{"key":"13363_CR42","doi-asserted-by":"crossref","unstructured":"Yoon S, Byun S, Dey S, Jung K (2019) Speech emotion recognition using multi-hop attention mechanism. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2822\u20132826","DOI":"10.1109\/ICASSP.2019.8683483"}],"container-title":["Multimedia Tools and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11042-022-13363-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11042-022-13363-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11042-022-13363-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,9,22]],"date-time":"2022-09-22T10:55:21Z","timestamp":1663844121000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11042-022-13363-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,22]]},"references-count":42,"journal-issue":{"issue":"24","published-print":{"date-parts":[[2022,10]]}},"alternative-id":["13363"],"URL":"https:\/\/doi.org\/10.1007\/s11042-022-13363-4","relation":{},"ISSN":["1380-7501","1573-7721"],"issn-type":[{"value":"1380-7501","type":"print"},{"value":"1573-7721","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,6,22]]},"assertion":[{"value":"4 October 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 March 2022","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 June 2022","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 June 2022","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The Authors declare that there is no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflicts of interests"}}]}}