{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T02:10:47Z","timestamp":1760148647597,"version":"build-2065373602"},"reference-count":43,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2023,5,30]],"date-time":"2023-05-30T00:00:00Z","timestamp":1685404800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"NRDI Office of the Hungarian Ministry of Innovation and Technology","award":["TKP2021-NVA-09","RRF-2.3.1-21-2022-00004"],"award-info":[{"award-number":["TKP2021-NVA-09","RRF-2.3.1-21-2022-00004"]}]},{"name":"Artificial Intelligence National Laboratory Program","award":["TKP2021-NVA-09","RRF-2.3.1-21-2022-00004"],"award-info":[{"award-number":["TKP2021-NVA-09","RRF-2.3.1-21-2022-00004"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>The field of computational paralinguistics emerged from automatic speech processing, and it covers a wide range of tasks involving different phenomena present in human speech. It focuses on the non-verbal content of human speech, including tasks such as spoken emotion recognition, conflict intensity estimation and sleepiness detection from speech, showing straightforward application possibilities for remote monitoring with acoustic sensors. The two main technical issues present in computational paralinguistics are (1) handling varying-length utterances with traditional classifiers and (2) training models on relatively small corpora. In this study, we present a method that combines automatic speech recognition and paralinguistic approaches, which is able to handle both of these technical issues. That is, we trained a HMM\/DNN hybrid acoustic model on a general ASR corpus, which was then used as a source of embeddings employed as features for several paralinguistic tasks. 
To convert the local embeddings into utterance-level features, we experimented with five different aggregation methods, namely mean, standard deviation, skewness, kurtosis and the ratio of non-zero activations. Our results show that the proposed feature extraction technique consistently outperforms the widely used x-vector method used as the baseline, independently of the actual paralinguistic task investigated. Furthermore, the aggregation techniques could be combined effectively as well, leading to further improvements depending on the task and the layer of the neural network serving as the source of the local embeddings. Overall, based on our experimental results, the proposed method can be considered as a competitive and resource-efficient approach for a wide range of computational paralinguistic tasks.<\/jats:p>","DOI":"10.3390\/s23115208","type":"journal-article","created":{"date-parts":[[2023,5,31]],"date-time":"2023-05-31T02:57:10Z","timestamp":1685501830000},"page":"5208","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Using Hybrid HMM\/DNN Embedding Extractor Models in Computational Paralinguistic Tasks"],"prefix":"10.3390","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3914-2036","authenticated-orcid":false,"given":"Mercedes","family":"Vetr\u00e1b","sequence":"first","affiliation":[{"name":"Institute of Informatics, University of Szeged, H-6720 Szeged, Hungary"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2864-6466","authenticated-orcid":false,"given":"G\u00e1bor","family":"Gosztolya","sequence":"additional","affiliation":[{"name":"Institute of Informatics, University of Szeged, H-6720 Szeged, Hungary"},{"name":"ELKH-SZTE Research Group on Artificial Intelligence, H-6720 Szeged, 
Hungary"}]}],"member":"1968","published-online":{"date-parts":[[2023,5,30]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1590","DOI":"10.1109\/TASL.2008.2002085","article-title":"Strategies to Improve the Robustness of Agglomerative Hierarchical Clustering Under Data Source Variation for Speaker Diarization","volume":"16","author":"Han","year":"2008","journal-title":"IEEE Trans. Audio, Speech, Lang. Process."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Lin, Y.C., Hsu, Y.T., Fu, S.W., Tsao, Y., and Kuo, T.W. (2019, January 15\u201319). IA-NET: Acceleration and Compression of Speech Enhancement Using Integer-Adder Deep Neural Network. Proceedings of the Interspeech 2019, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-1207"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Van Segbroeck, M., Travadi, R., Vaz, C., Kim, J., Black, M.P., Potamianos, A., and Narayanan, S.S. (2014, January 14\u201318). Classification of Cognitive Load from Speech Using an i-Vector Framework. Proceedings of the Fifteenth Annual Conference of the Interspeech 2014, Singapore.","DOI":"10.21437\/Interspeech.2014-114"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Gosztolya, G., Gr\u00f3sz, T., Busa-Fekete, R., and T\u00f3th, L. (2014, January 14\u201318). Detecting the intensity of cognitive and physical load using AdaBoost and Deep Rectifier Neural Networks. Proceedings of the Fifteenth Annual Conference of the Interspeech 2014, Singapore.","DOI":"10.21437\/Interspeech.2014-109"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"578369","DOI":"10.3389\/fninf.2021.578369","article-title":"X-Vectors: New Quantitative Biomarkers for Early Parkinson\u2019s Disease Detection From Speech","volume":"15","author":"Jeancolas","year":"2021","journal-title":"Front. Neuroinform."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"V\u00e1squez-Correa, J., Orozco-Arroyave, J.R., and N\u00f6th, E. 
(2017, January 20\u201324). Convolutional Neural Network to Model Articulation Impairments in Patients with Parkinson\u2019s Disease. Proceedings of the Interspeech 2017, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-1078"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Kadiri, S., Kethireddy, R., and Alku, P. (2020, January 25\u201329). Parkinson\u2019s Disease Detection from Speech Using Single Frequency Filtering Cepstral Coefficients. Proceedings of the Interspeech 2020, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-3197"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Pappagari, R., Cho, J., Joshi, S., Moro-Vel\u00e1zquez, L., \u017belasko, P., Villalba, J., and Dehak, N. (September, January 30). Automatic Detection and Assessment of Alzheimer Disease Using Speech and Language Technologies in Low-Resource Scenarios. Proceedings of the Interspeech 2021, Brno, Czech Republic.","DOI":"10.21437\/Interspeech.2021-1850"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Chen, J., Ye, J., Tang, F., and Zhou, J. (September, January 30). Automatic Detection of Alzheimer\u2019s Disease Using Spontaneous Speech Only. Proceedings of the Interspeech 2021, Brno, Czech Republic.","DOI":"10.21437\/Interspeech.2021-2002"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"P\u00e9rez-Toro, P., Klumpp, P., Hernandez, A., Arias, T., Lillo, P., Slachevsky, A., Garc\u00eda, A., Schuster, M., Maier, A., and N\u00f6th, E. (2022, January 18\u201322). Alzheimer\u2019s Detection from English to Spanish Using Acoustic and Linguistic Embeddings. 
Proceedings of the Interspeech 2022, Incheon, Republic of Korea.","DOI":"10.21437\/Interspeech.2022-10883"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1759","DOI":"10.1007\/s00405-009-1003-y","article-title":"Vocal Symptoms and Acoustic Changes in Relation to the Expanded Disability Status Scale, Duration and Stage of Disease in Patients with Multiple Sclerosis","volume":"266","author":"Yamout","year":"2009","journal-title":"Eur. Arch. Otorhinolaryngol"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Egas-L\u00f3pez, J.V., Kiss, G., Sztah\u00f3, D., and Gosztolya, G. (2022, January 23\u201327). Automatic Assessment of the Degree of Clinical Depression from Speech Using X-Vectors. Proceedings of the ICASSP 2022\u20142022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore.","DOI":"10.1109\/ICASSP43922.2022.9746068"},{"key":"ref_13","first-page":"3","article-title":"GMM-Based Speaker Age and Gender Classification in Czech and Slovak","volume":"68","year":"2017","journal-title":"J. Electr. Eng."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"and Kwon, S. (2020). CLSTM: Deep Feature-Based Speech Emotion Recognition Using the Hierarchical ConvLSTM Network. Mathematics, 8.","DOI":"10.3390\/math8122133"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Zhao, Z., Bao, Z., Zhang, Z., Cummins, N., Wang, H., and Schuller, B. (2019, January 15\u201319). Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition. Proceedings of the Interspeech 2019, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-1649"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Gosztolya, G., Beke, A., and Neuberger, T. (2019, January 20\u201325). Differentiating laughter types via HMM\/DNN and probabilistic sampling. 
Proceedings of the Speech and Computer: 21st International Conference, SPECOM 2019, Istanbul, Turkey.","DOI":"10.1007\/978-3-030-26061-3_13"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Egas-L\u00f3pez, J.V., and Gosztolya, G. (2021, January 6\u201311). Deep Neural Network Embeddings for the Estimation of the Degree of Sleepiness. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9413589"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Grezes, F., Richards, J., and Rosenberg, A. (2013, January 25\u201329). Let me finish: Automatic conflict detection using speaker overlap. Proceedings of the Interspeech 2013, Lyon, France.","DOI":"10.21437\/Interspeech.2013-67"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Bone, D., Black, M.P., Li, M., Metallinou, A., Lee, S., and Narayanan, S. (2011, January 28\u201331). Intoxicated speech detection by fusion of speaker normalized hierarchical features and GMM supervectors. Proceedings of the Twelfth Annual Conference of Interspeech 2011, Florence, Italy.","DOI":"10.21437\/Interspeech.2011-805"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Schuller, B., Steidl, S., and Batliner, A. (2009, January 6\u201310). The INTERSPEECH 2009 Emotion Challenge. Proceedings of the Computational Paralinguistics Challenge (ComParE), Interspeech 2009, Brighton, UK.","DOI":"10.21437\/Interspeech.2009-103"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Schuller, B., Steidl, S., Batliner, A., Hantke, S., H\u00f6nig, F., Orozco-Arroyave, J.R., N\u00f6th, E., Zhang, Y., and Weninger, F. (2015, January 6\u201310). The INTERSPEECH 2015 computational paralinguistics challenge: Nativeness, Parkinson\u2019s & eating condition. 
Proceedings of the Computational Paralinguistics Challenge (ComParE), Interspeech 2015, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-179"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Schuller, B.W., Batliner, A., Bergler, C., Mascolo, C., Han, J., Lefter, I., Kaya, H., Amiriparian, S., Baird, A., and Stappen, L. (September, January 30). The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation Primates. Proceedings of the Computational Paralinguistics Challenge (ComParE), Interspeech 2021, Brno, Czech Republic.","DOI":"10.21437\/Interspeech.2021-19"},{"key":"ref_23","unstructured":"Rabiner, L., and Juang, B.H. (1993). Fundamentals of Speech Recognition, Pearson College Div."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"533","DOI":"10.1038\/323533a0","article-title":"Learning representations by back-propagating errors","volume":"323","author":"Rumelhart","year":"1986","journal-title":"Nature"},{"key":"ref_25","unstructured":"Cox, S. (1988). Hidden Markov Models for Automatic Speech Recognition: Theory and Application, Royal Signals & Radar Establishment."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1109\/MSP.2012.2205597","article-title":"Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups","volume":"29","author":"Hinton","year":"2012","journal-title":"IEEE Signal Process. Mag."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Boser, B., Guyon, I., and Vapnik, V. (1992, January 27\u201329). A Training Algorithm for Optimal Margin Classifier. 
Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, Pittsburgh, PA, USA.","DOI":"10.1145\/130385.130401"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"101377","DOI":"10.1016\/j.csl.2022.101377","article-title":"Automatic Screening of Mild Cognitive Impairment and Alzheimer\u2019s Disease by Means of Posterior-Thresholding Hesitation Representation","volume":"75","author":"Balogh","year":"2022","journal-title":"Comput. Speech Lang."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"104943","DOI":"10.1016\/j.knosys.2019.104943","article-title":"Posterior-Thresholding Feature Extraction for Paralinguistic Speech Classification","volume":"186","author":"Gosztolya","year":"2019","journal-title":"Knowl.-Based Syst."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"291","DOI":"10.1109\/89.279278","article-title":"Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains","volume":"2","author":"Gauvain","year":"1994","journal-title":"IEEE Trans. Speech Audio Process."},{"key":"ref_31","unstructured":"Morgan, N., and Bourlard, H. (1990, January 3\u20136). Continuous speech recognition using multilayer perceptrons with hidden Markov models. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, NM, USA."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long Short-term Memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comp."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014, January 26\u201328). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. 
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1179"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Panzner, M., and Cimiano, P. (2016, January 26\u201329). Comparing Hidden Markov Models and Long Short Term Memory Neural Networks for Learning Action Representations. Proceedings of the Second International Workshop of Machine Learning, Optimization, and Big Data, MOD 2016, Volterra, Italy.","DOI":"10.1007\/978-3-319-51469-7_8"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Schmitt, M., Cummins, N., and Schuller, B. (2019, January 15\u201319). Continuous Emotion Recognition in Speech\u2014Do We Need Recurrence?. Proceedings of the Interspeech 2019, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-2710"},{"key":"ref_36","unstructured":"Steidl, S. (2009). Automatic Classification of Emotion Related User States in Spontaneous Children\u2019s Speech, Logos."},{"key":"ref_37","unstructured":"Krajewski, J., Schieder, S., and Batliner, A. (2017, January 20\u201324). Description of the Upper Respiratory Tract Infection Corpus (URTIC). Proceedings of the Interspeech 2017, Stockholm, Sweden."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Hantke, S., Weninger, F., Kurle, R., Ringeval, F., Batliner, A., Mousa, A., and Schuller, B. (2016). I Hear You Eat and Speak: Automatic Recognition of Eating Condition and Food Type, Use-Cases, and Impact on ASR Performance. PLoS ONE, 11.","DOI":"10.1371\/journal.pone.0154486"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Neuberger, T., Gyarmathy, D., Gr\u00e1czi, T.E., Horv\u00e1th, V., G\u00f3sy, M., and Beke, A. (2014, January 8\u201312). Development of a Large Spontaneous Speech Database of Agglutinative Hungarian Language. 
Proceedings of the 17th International Conference, TSD 2014, Brno, Czech Republic.","DOI":"10.1007\/978-3-319-10816-2_51"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Deng, L., Droppo, J., and Acero, A. (2002, January 13\u201317). A Bayesian approach to speech feature enhancement using the dynamic cepstral prior. Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, USA.","DOI":"10.1109\/ICASSP.2002.5743867"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"27","DOI":"10.1145\/1961189.1961199","article-title":"LIBSVM: A library for support vector machines","volume":"2","author":"Chang","year":"2011","journal-title":"ACM Trans. Intell. Syst. Technol."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Metze, F., Batliner, A., Eyben, F., Polzehl, T., Schuller, B., and Steidl, S. (2010, January 26\u201330). Emotion Recognition using Imperfect Speech Recognition. Proceedings of the Interspeech 2010, Chiba, Japan.","DOI":"10.21437\/Interspeech.2010-202"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. (2018, January 15\u201320). X-Vectors: Robust DNN Embeddings for Speaker Recognition. 
Proceedings of the 2018 IEEE international conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461375"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/11\/5208\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:45:25Z","timestamp":1760125525000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/11\/5208"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,30]]},"references-count":43,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2023,6]]}},"alternative-id":["s23115208"],"URL":"https:\/\/doi.org\/10.3390\/s23115208","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2023,5,30]]}}}