{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,6]],"date-time":"2026-04-06T10:11:30Z","timestamp":1775470290289,"version":"3.50.1"},"reference-count":37,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2022,4,8]],"date-time":"2022-04-08T00:00:00Z","timestamp":1649376000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"<jats:p>Recent studies have reported that the performance of Automatic Speech Recognition (ASR) technologies designed for normal speech notably deteriorates when it is evaluated by whispered speech. Therefore, the detection of whispered speech is useful in order to attenuate the mismatch between training and testing situations. This paper proposes two new Glottal Flow (GF)-based features, namely, GF-based Mel-Frequency Cepstral Coefficient (GF-MFCC) as a magnitude-based feature and GF-based relative phase (GF-RP) as a phase-based feature for whispered speech detection. The main contribution of the proposed features is to extract magnitude and phase information obtained by the GF signal. In the GF-MFCC, Mel-frequency cepstral coefficient (MFCC) feature extraction is modified using the estimated GF signal derived from the iterative adaptive inverse filtering as the input to replace the raw speech signal. In a similar way, the GF-RP feature is the modification of the relative phase (RP) feature extraction by using the GF signal instead of the raw speech signal. The whispered speech production provides lower amplitude from the glottal source than normal speech production, thus, the whispered speech via Discrete Fourier Transformation (DFT) provides the lower magnitude and phase information, which make it different from a normal speech. Therefore, it is hypothesized that two types of our proposed features are useful for whispered speech detection. In addition, using the individual GF-MFCC\/GF-RP feature, the feature-level and score-level combination are also proposed to further improve the detection performance. The performance of the proposed features and combinations in this study is investigated using the CHAIN corpus. The proposed GF-MFCC outperforms MFCC, while GF-RP has a higher performance than the RP. Further improved results are obtained via the feature-level combination of MFCC and GF-MFCC (MFCC&amp;GF-MFCC)\/RP and GF-RP(RP&amp;GF-RP) compared with using either one alone. In addition, the combined score of MFCC&amp;GF-MFCC and RP&amp;GF-RP gives the best frame-level accuracy of 95.01% and the utterance-level accuracy of 100%.<\/jats:p>","DOI":"10.3390\/sym14040777","type":"journal-article","created":{"date-parts":[[2022,4,10]],"date-time":"2022-04-10T06:02:54Z","timestamp":1649570574000},"page":"777","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":9,"title":["Whispered Speech Detection Using Glottal Flow-Based Features"],"prefix":"10.3390","volume":"14","author":[{"given":"Khomdet","family":"Phapatanaburi","sequence":"first","affiliation":[{"name":"Department of Telecommunication Engineering, Faculty of Engineering and Technology, Rajamangala University of Technology Isan (RMUTI), Nakhon Ratchasima 30000, Thailand"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wongsathon","family":"Pathonsuwan","sequence":"additional","affiliation":[{"name":"School of Telecommunication Engineering, Suranaree University of Technology, Nakhon Ratchasima 30000, Thailand"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Longbiao","family":"Wang","sequence":"additional","affiliation":[{"name":"Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Patikorn","family":"Anchuen","sequence":"additional","affiliation":[{"name":"Navaminda Kasatriyadhiraj Royal Air Force Academy, Bangkok 10220, Thailand"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Talit","family":"Jumphoo","sequence":"additional","affiliation":[{"name":"School of Telecommunication Engineering, Suranaree University of Technology, Nakhon Ratchasima 30000, Thailand"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Prawit","family":"Buayai","sequence":"additional","affiliation":[{"name":"Graduate Faculty of Interdisciplinary Research, University of Yamanashi, Kofu 400-8511, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Monthippa","family":"Uthansakul","sequence":"additional","affiliation":[{"name":"School of Telecommunication Engineering, Suranaree University of Technology, Nakhon Ratchasima 30000, Thailand"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Peerapong","family":"Uthansakul","sequence":"additional","affiliation":[{"name":"School of Telecommunication Engineering, Suranaree University of Technology, Nakhon Ratchasima 30000, Thailand"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2022,4,8]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Wang, D., Wang, X., and Lv, S. (2019). An overview of end-to-end automatic speech recognition. Symmetry, 11.","DOI":"10.3390\/sym11081018"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"196","DOI":"10.1109\/MSP.2017.2697179","article-title":"How biometric authentication poses new challenges to our security and privacy [in the spotlight]","volume":"34","author":"Memon","year":"2017","journal-title":"IEEE Signal Process. Mag."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Heigold, G., Moreno, I., Bengio, S., and Shazeer, N. (2016, January 20\u201325). End-to-end text-dependent speaker verification. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472652"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"2313","DOI":"10.1109\/TASLP.2017.2738559","article-title":"Whispered speech recognition using deep denoising autoencoder and inverse filtering","volume":"25","year":"2017","journal-title":"IEEE\/ACM Trans. Audio Speech Lang."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Jin, Q., Jou, S.S., and Schultz, T. (2007, January 2\u20135). Whispering speaker identification. Proceedings of the IEEE IEEE International Conference on Multimedia and Expo (ICME), Beijing, China.","DOI":"10.1109\/ICME.2007.4284828"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Yang, C., Brown, G., Lu, L., Yamagishi, J., and King, S. (2012, January 5\u20138). Noise-robust whispered speech recognition using a non-audible-murmur microphone with VTS compensation. Proceedings of the 8th International Symposium on Chinese Spoken Language Processing, Hong Kong, China.","DOI":"10.1109\/ISCSLP.2012.6423522"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"139","DOI":"10.1016\/j.specom.2003.10.005","article-title":"Analysis and recognition of whispered speech","volume":"45","author":"Ito","year":"2005","journal-title":"Speech Commun."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/1687-6180-2012-157","article-title":"Significance of parametric spectral ratio methods in detection and recognition of whispered speech","volume":"2012","author":"Mathur","year":"2012","journal-title":"EURASIP J. Adv. Signal Process."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Khoria, K., Kamble, M.R., and Patil, H.A. (2021, January 18\u201322). Teager energy cepstral coefficients for classification of normal vs. whisper speech. Proceedings of the 28th European Signal Processing Conference (EUSIPCO), Virtual.","DOI":"10.23919\/Eusipco47968.2020.9287634"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"373","DOI":"10.1109\/10.486257","article-title":"Direct speech feature estimation using an iterative EM algorithm for vocal fold pathology detection","volume":"43","author":"Hansen","year":"1996","journal-title":"IEEE Trans. Biomed. Eng."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"985","DOI":"10.1016\/S0030-6665(20)31062-8","article-title":"The spectrum of vocal dysfunction","volume":"24","author":"Koufman","year":"1991","journal-title":"Otolaryngol. Clin. N. Am."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"300","DOI":"10.1109\/10.661155","article-title":"A nonlinear operator-based speech feature analysis method with application to vocal fold pathology assessment","volume":"45","author":"Hansen","year":"1998","journal-title":"IEEE Trans. Biomed. Eng."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Thakur, N., and Han, C. (2021). An ambient intelligence-based human behavior monitoring framework for ubiquitous environments. Information, 12.","DOI":"10.3390\/info12020081"},{"key":"ref_14","first-page":"142","article-title":"Whispered speech detection in noise using auditory-inspired modulation spectrum features","volume":"20","author":"Falk","year":"2013","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_15","unstructured":"Kinnunen, T., Lee, K.A., and Li, H. (2008, January 21\u201324). Dimension reduction of the modulation spectrogram for speaker verification. Proceedings of the Odyssey 2008: The Speaker and Language Recognition Workshop, Stellenbosch, South Africa."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1859","DOI":"10.1109\/LSP.2015.2439514","article-title":"Robust whisper activity detection using long-term log energy variation of sub-band signal","volume":"22","author":"Meenakshi","year":"2015","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"883","DOI":"10.1109\/TASL.2010.2066967","article-title":"Whisper-island detection based on unsupervised segmentation with entropy-based speech feature processing","volume":"19","author":"Zhang","year":"2010","journal-title":"IEEE Trans. Audio Speech Lang."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Raeesy, Z., Gillespie, K., Ma, C., Drugman, T., Gu, J., Maas, R., Rastrow, A., and Hoffmeister, B. (2018, January 18\u201321). LSTM-based whisper detection. Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Athens, Greece.","DOI":"10.1109\/SLT.2018.8639614"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Shah, N.J., Shaik, M.A.B., Periyasamy, P., Patil, H.A., and Vij, V. (2021, January 23\u201327). Exploiting phase-based features for whisper vs. speech classification. Proceedings of the 29th European Signal Processing Conference (EUSIPCO), Virtual.","DOI":"10.23919\/EUSIPCO54536.2021.9616337"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Wang, L., Phapatanaburi, K., Oo, Z., Nakagawa, S., Iwahashi, M., and Dang, J. (2017, January 10\u201314). Phase aware deep neural network for noise robust voice activity detection. Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China.","DOI":"10.1109\/ICME.2017.8019414"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"153","DOI":"10.1109\/TASL.2010.2045239","article-title":"HMM-based speech synthesis utilizing glottal inverse filtering","volume":"19","author":"Raitio","year":"2010","journal-title":"IEEE Trans. Audio Speech Lang."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"109","DOI":"10.1016\/0167-6393(92)90005-R","article-title":"Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering","volume":"11","author":"Alku","year":"1992","journal-title":"Speech Commun."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"1329","DOI":"10.1016\/S1388-2457(99)00088-7","article-title":"A method for generating natural-sounding speech stimuli for cognitive brain research","volume":"110","author":"Alku","year":"1999","journal-title":"Clin. Neurophysiol."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"350","DOI":"10.1109\/TASSP.1979.1163260","article-title":"Least squares glottal inverse filtering from the acoustic speech waveform","volume":"27","author":"Wong","year":"1979","journal-title":"IEEE Trans. Audio Speech Lang."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"15","DOI":"10.1016\/j.specom.2005.01.007","article-title":"Estimation of the vocal tract transfer function with application to glottal wave analysis","volume":"46","author":"Akande","year":"2005","journal-title":"Speech Commun."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"492","DOI":"10.1109\/TSA.2005.857807","article-title":"Robust glottal source estimation based on joint source-filter model optimization","volume":"14","author":"Fu","year":"2006","journal-title":"IEEE Trans. Audio Speech Lang."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"118","DOI":"10.1016\/j.specom.2021.11.005","article-title":"Learning affective representations based on magnitude and dynamic relative phase information for speech emotion recognition","volume":"136","author":"Guo","year":"2022","journal-title":"Speech Commun."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"1085","DOI":"10.1109\/TASL.2011.2172422","article-title":"Speaker identification and verification by combining MFCC and phase information","volume":"20","author":"Nakagawa","year":"2011","journal-title":"IEEE\/ACM Trans. Audio Speech Lang."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"2397","DOI":"10.1587\/transinf.E93.D.2397","article-title":"Speaker recognition by combining MFCC and phase information in noisy conditions","volume":"93","author":"Wang","year":"2010","journal-title":"IEICE Trans. Inf. Syst."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13636-019-0151-2","article-title":"Replay attack detection with auditory filter-based relative phase features","volume":"2019","author":"Oo","year":"2019","journal-title":"Eurasip J. Audio Speech Music Process."},{"key":"ref_31","unstructured":"Cummins, F., Grimaldi, M., Leonard, T., and Simko, J. (2006, January 25\u201329). The chains corpus: Characterizing individual speakers. Proceedings of the Sixteenth Annual Conference of the International Conference on Speech and Computer, Saint Petersburg, Russian."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Degottex, G., Kane, J., Drugman, T., Raitio, T., and Scherer, S. (2014, January 4\u20139). COVAREP\u2014A collaborative voice analysis repository for speech technologies. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Florence, Italy.","DOI":"10.1109\/ICASSP.2014.6853739"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Wang, L., Yoshida, Y., Kawakami, Y., and Nakagawa, S. (2015, January 6\u201310). Relative phase information for detecting human speech and spoofed speech. Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-473"},{"key":"ref_34","first-page":"5","article-title":"Deep learning: From speech recognition to language and multimodal processing","volume":"2016","author":"Deng","year":"2016","journal-title":"APSIPA Trans. Signal Inf. Process."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Vedaldi, A., and Fulkerson, B. (2010, January 25\u201329). VLFeat: An open and portable library of computer vision algorithms. Proceedings of the 18th ACM International Conference on Multimedia, New York, NY, USA.","DOI":"10.1145\/1873951.1874249"},{"key":"ref_36","first-page":"3029","article-title":"Brainwave classification for character-writing application using emd-based GMM and KELM approaches","volume":"66","author":"Phapatanaburi","year":"2021","journal-title":"CMC-Comput. Mater. Contin."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Naini, A.R., Satyapriya, M., and Ghosh, P.K. (2020, January 14\u201318). Whisper activity detection using CNN-LSTM based attention pooling network trained for a speaker identification Task. Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-3217"}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/14\/4\/777\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T22:50:28Z","timestamp":1760136628000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/14\/4\/777"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,4,8]]},"references-count":37,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2022,4]]}},"alternative-id":["sym14040777"],"URL":"https:\/\/doi.org\/10.3390\/sym14040777","relation":{},"ISSN":["2073-8994"],"issn-type":[{"value":"2073-8994","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,4,8]]}}}