{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T23:35:23Z","timestamp":1761176123395,"version":"build-2065373602"},"reference-count":0,"publisher":"IOS Press","isbn-type":[{"value":"9781643686318","type":"electronic"}],"license":[{"start":{"date-parts":[[2025,10,21]],"date-time":"2025-10-21T00:00:00Z","timestamp":1761004800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,10,21]]},"abstract":"<jats:p>Human speech possesses a great variety of conveying information. Identification of the transmitted emotions is crucial for effective communication, social, and human-computer interactions. Developing an efficient model is demanding issue due to subtle emotions differences, subjective assessment or sound characteristics for specific language. A wide range of deep learning methods are developed for this challenging task. We present Explainable Multimodal Hybrid Vision Transformers (EM-H-ViT), a unified framework that fuses four orthogonal feature spaces: Fuzzy-Transform energy maps, discrete Wavelet coefficients, complex Fourier spectrograms, and Mel spectrum coefficients, within a lightweight CNN\u2013ViT backbone. Each modality is first projected to an image-like tensor. Modality-specific convolutional branches capture local patterns, while a shared Vision Transformer aggregates long-range speech context. A cross-modal attention gate learns data-driven fusion weights and simultaneously produces pixel-level saliency maps, enabling post-hoc interpretation. We evaluate the EM-H-ViT on four benchmark corpora containing recordings of emotional speech in the following languages: Polish, English, German and Danish, using speaker-independent splits. The proposed model reaches 95.2%, 98.7%, 97.6%, and 95.1% accuracy, respectively. Ablation studies show that removing any single transform degrades performance by 3.4%-6.8%, confirming their complementarity. Obtained results demonstrate that the model can deliver, language independent, both superior accuracy and transparent reasoning.<\/jats:p>","DOI":"10.3233\/faia250823","type":"book-chapter","created":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T09:43:13Z","timestamp":1761126193000},"source":"Crossref","is-referenced-by-count":0,"title":["Explainable Multimodal Hybrid Vision Transformers for Emotional Speech Recognition"],"prefix":"10.3233","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5705-4785","authenticated-orcid":false,"given":"Pawel","family":"Powroznik","sequence":"first","affiliation":[{"name":"Department of Computer Science, Lublin University of Technology"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0760-7126","authenticated-orcid":false,"given":"Maria","family":"Skublewska-Paszkowska","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Lublin University of Technology"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-3458-5464","authenticated-orcid":false,"given":"Krzysztof","family":"Dziedzic","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Lublin University of Technology"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4909-9980","authenticated-orcid":false,"given":"Marcin","family":"Barszcz","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Lublin University of Technology"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9061-4414","authenticated-orcid":false,"given":"Kinga","family":"Chwaleba","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Lublin University of Technology"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-6164-5862","authenticated-orcid":false,"given":"Weronika","family":"Wach","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Lublin University of Technology"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6911-8056","authenticated-orcid":false,"given":"Vimala","family":"Nunavath","sequence":"additional","affiliation":[{"name":"Department of Science and Industry systems, University of South-Eastern Norway"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"7437","container-title":["Frontiers in Artificial Intelligence and Applications","ECAI 2025"],"original-title":[],"link":[{"URL":"https:\/\/ebooks.iospress.nl\/pdf\/doi\/10.3233\/FAIA250823","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T09:43:13Z","timestamp":1761126193000},"score":1,"resource":{"primary":{"URL":"https:\/\/ebooks.iospress.nl\/doi\/10.3233\/FAIA250823"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,21]]},"ISBN":["9781643686318"],"references-count":0,"URL":"https:\/\/doi.org\/10.3233\/faia250823","relation":{},"ISSN":["0922-6389","1879-8314"],"issn-type":[{"value":"0922-6389","type":"print"},{"value":"1879-8314","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,21]]}}}