{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,20]],"date-time":"2026-03-20T23:44:31Z","timestamp":1774050271982,"version":"3.50.1"},"reference-count":34,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2020,9,24]],"date-time":"2020-09-24T00:00:00Z","timestamp":1600905600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,9,24]],"date-time":"2020-09-24T00:00:00Z","timestamp":1600905600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100010665","name":"H2020 Marie Sklodowska-Curie Actions","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100010665","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Convocatoria Doctorado Nacional COLCIENCIAS","award":["785"],"award-info":[{"award-number":["785"]}]},{"DOI":"10.13039\/501100005278","name":"Universidad de Antioquia","doi-asserted-by":"publisher","award":["2018-23541"],"award-info":[{"award-number":["2018-23541"]}],"id":[{"id":"10.13039\/501100005278","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100005722","name":"Ludwig-Maximilians-Universit\u00e4t M\u00fcnchen","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100005722","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Pattern Anal Applic"],"published-print":{"date-parts":[[2021,5]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Time\u2013frequency representations of the speech signals provide dynamic information about how the frequency component changes with time. In order to process this information, deep learning models with convolution layers can be used to obtain feature maps. 
In many speech processing applications, the time\u2013frequency representations are obtained by applying the short-time Fourier transform and using single-channel input tensors to feed the models. However, this may limit the potential of convolutional networks to learn different representations of the audio signal. In this paper, we propose a methodology to combine three different time\u2013frequency representations of the signals by computing continuous wavelet transform, Mel-spectrograms, and Gammatone spectrograms and combining them into 3D-channel spectrograms to analyze speech in two different applications: (1) automatic detection of speech deficits in cochlear implant users and (2) phoneme class recognition to extract phone-attribute features. For this, two different deep learning-based models are considered: convolutional neural networks and recurrent neural networks with convolution layers.<\/jats:p>","DOI":"10.1007\/s10044-020-00921-5","type":"journal-article","created":{"date-parts":[[2020,9,24]],"date-time":"2020-09-24T12:02:39Z","timestamp":1600948959000},"page":"423-431","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":79,"title":["Multi-channel spectrograms for speech processing applications using deep learning methods"],"prefix":"10.1007","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9405-4154","authenticated-orcid":false,"given":"T.","family":"Arias-Vergara","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"P.","family":"Klumpp","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"J. C.","family":"Vasquez-Correa","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"E.","family":"N\u00f6th","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"J. 
R.","family":"Orozco-Arroyave","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"M.","family":"Schuster","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2020,9,24]]},"reference":[{"issue":"2","key":"921_CR1","doi-asserted-by":"publisher","first-page":"206","DOI":"10.1109\/JSTSP.2019.2908700","volume":"13","author":"H Purwins","year":"2019","unstructured":"Purwins H, Li B, Virtanen T, Schl\u00fcter J, Chang S, Sainath T (2019) Deep learning for audio signal processing. IEEE J Sel Top Signal Process 13(2):206\u2013219","journal-title":"IEEE J Sel Top Signal Process"},{"key":"921_CR2","doi-asserted-by":"crossref","unstructured":"V\u00e1squez-Correa JC, Orozco-Arroyave JR, N\u00f6th E (2017) Convolutional neural network to model articulation impairments in patients with Parkinson\u2019s disease. In: Proceedings of the eighteenth annual conference of the international speech communication association, pp 314\u2013318","DOI":"10.21437\/Interspeech.2017-1078"},{"key":"921_CR3","doi-asserted-by":"crossref","unstructured":"Wu H, Soraghan J, Lowit A, Di Caterina G (2018) A deep learning method for pathological voice detection using convolutional deep belief networks. In: Proceedings of the nineteenth annual conference of the international speech communication association, pp 446\u2013450","DOI":"10.21437\/Interspeech.2018-1351"},{"key":"921_CR4","doi-asserted-by":"publisher","first-page":"41034","DOI":"10.1109\/ACCESS.2018.2856238","volume":"6","author":"M Alhussein","year":"2018","unstructured":"Alhussein M, Muhammad G (2018) Voice pathology detection using deep learning on mobile healthcare framework. 
IEEE Access 6:41034\u201341041","journal-title":"IEEE Access"},{"issue":"10","key":"921_CR5","doi-asserted-by":"publisher","first-page":"1533","DOI":"10.1109\/TASLP.2014.2339736","volume":"22","author":"O Abdel-Hamid","year":"2014","unstructured":"Abdel-Hamid O, Mohamed A, Jiang H, Deng L, Penn G, Yu D (2014) Convolutional neural networks for speech recognition. IEEE\/ACM Trans Audio Speech Lang Process 22(10):1533\u20131545","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"921_CR6","doi-asserted-by":"crossref","unstructured":"Han K, He Y, Bagchi D, Fosler-Lussier E, Wang D (2015) Deep neural network based spectral feature mapping for robust speech recognition. In: Sixteenth annual conference of the international speech communication association, pp 2484\u20132488","DOI":"10.21437\/Interspeech.2015-536"},{"key":"921_CR7","doi-asserted-by":"crossref","unstructured":"Wei\u00dfkirchen N, Bock R, Wendemuth A (2017) Recognition of emotional speech with convolutional neural networks by means of spectral estimates. In: 2017 seventh international conference on affective computing and intelligent interaction workshops and demos (ACIIW), pp 50\u201355","DOI":"10.1109\/ACIIW.2017.8272585"},{"key":"921_CR8","doi-asserted-by":"crossref","unstructured":"Adavanne S, Politis A, Virtanen T (2018) Multichannel sound event detection using 3D convolutional neural networks for learning inter-channel features. In: 2018 international joint conference on neural networks (IJCNN), pp 1\u20137","DOI":"10.1109\/IJCNN.2018.8489542"},{"key":"921_CR9","doi-asserted-by":"crossref","unstructured":"Xu K, Feng D, Mi H, Zhu B, Wang D, Zhang L, Cai H, Liu S (2018) Mixup-based acoustic scene classification using multi-channel convolutional neural network. 
In: Pacific Rim conference on multimedia, pp 14\u201323","DOI":"10.1007\/978-3-030-00764-5_2"},{"key":"921_CR10","doi-asserted-by":"crossref","unstructured":"Ganapathy S, Peddinti V (2018) 3-D CNN models for far-field multi-channel speech recognition. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 5499\u20135503","DOI":"10.1109\/ICASSP.2018.8461580"},{"key":"921_CR11","doi-asserted-by":"crossref","unstructured":"Fu S, Hu T, Tsao Y, Lu X (2017) Complex spectrogram enhancement by convolutional neural network with multi-metrics learning. In: 2017 IEEE 27th international workshop on machine learning for signal processing (MLSP), pp 1\u20136","DOI":"10.1109\/MLSP.2017.8168119"},{"key":"921_CR12","doi-asserted-by":"crossref","unstructured":"Arias-Vergara T, Vasquez-Correa JC, Gollwitzer S, Orozco-Arroyave JR, Schuster M, N\u00f6th E (2019) Multi-channel convolutional neural networks for automatic detection of speech deficits in cochlear implant users. In: Iberoamerican congress on pattern recognition, pp 679\u2013687","DOI":"10.1007\/978-3-030-33904-3_64"},{"key":"921_CR13","first-page":"429","volume-title":"Complex sounds and auditory images","author":"RD Patterson","year":"1992","unstructured":"Patterson RD, Robinson K, Holdsworth J, McKeown D, Zhang C, Allerhand M (1992) Complex sounds and auditory images. Elsevier, Amsterdam, pp 429\u2013446"},{"key":"921_CR14","doi-asserted-by":"crossref","unstructured":"Virtanen T, Vincent E, Gannot S (2018) Time-frequency processing-spectral properties. In: Audio source separation and speech enhancement, pp 15\u201329","DOI":"10.1002\/9781119279860.ch2"},{"key":"921_CR15","unstructured":"Slaney M, et al (1993) An efficient implementation of the Patterson\u2013Holdsworth auditory filter bank. 
Apple Computer, Perception Group, Technical Report 35(8)"},{"key":"921_CR16","unstructured":"Latif S, Rana R, Khalifa S, Jurdak R, Qadir J, Schuller B (2020) Deep representation learning in speech processing: challenges, recent advances, and future trends. arXiv:2001.00378"},{"key":"921_CR17","doi-asserted-by":"crossref","unstructured":"Palaz D, Collobert RN, et al (2015) Analysis of CNN-based speech recognition system using raw speech as input. Technical Reports, Idiap","DOI":"10.21437\/Interspeech.2015-3"},{"key":"921_CR18","doi-asserted-by":"crossref","unstructured":"Graves A (2012) Supervised sequence labelling with recurrent neural networks, vol 385","DOI":"10.1007\/978-3-642-24797-2"},{"key":"921_CR19","unstructured":"Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic differentiation in PyTorch"},{"issue":"11","key":"921_CR20","doi-asserted-by":"publisher","first-page":"2278","DOI":"10.1109\/5.726791","volume":"86","author":"Y LeCun","year":"1998","unstructured":"LeCun Y, Bottou Ln, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278\u20132324","journal-title":"Proc IEEE"},{"key":"921_CR21","unstructured":"Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: International conference on learning representation (ICLR)"},{"key":"921_CR22","doi-asserted-by":"crossref","unstructured":"Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process 45(11):2673\u20132681","DOI":"10.1109\/78.650093"},{"key":"921_CR23","doi-asserted-by":"crossref","unstructured":"Cernak M, Tong S (2018) Nasal speech sounds detection using connectionist temporal classification. 
In: IEEE, pp 5574\u20135578","DOI":"10.1109\/ICASSP.2018.8462149"},{"key":"921_CR24","doi-asserted-by":"crossref","unstructured":"V\u00e1squez-Correa JC, Klumpp P, Orozco-Arroyave JR, N\u00f6th E (2019) Phonet: a tool based on gated recurrent neural networks to extract phonological posteriors from speech, pp 549\u2013553","DOI":"10.21437\/Interspeech.2019-1405"},{"key":"921_CR25","unstructured":"Hudgins CV, Numbers FC (1942) An investigation of the intelligibility of the speech of the deaf. In: Genetic psychology monographs"},{"issue":"S02","key":"921_CR26","first-page":"11435","volume":"98","author":"T Arias-Vergara","year":"2019","unstructured":"Arias-Vergara T, Gollwitzer S, Orozco-Arroyave JR, Vasquez-Correa JC, N\u00f6th E, H\u00f6gerle C, Schuster M (2019) Speech differences between CI users with pre-and postlingual onset of deafness detected by speech processing methods on voiceless to voice transitions. Laryngo-Rhino-Otologie 98(S02):11435","journal-title":"Laryngo-Rhino-Otologie"},{"key":"921_CR27","doi-asserted-by":"crossref","unstructured":"Arias-Vergara T, Orozco-Arroyave JR, Gollwitzer S, Schuster M, N\u00f6th E (2019) Consonant-to-vowel\/vowel-to-consonant transitions to analyze the speech of cochlear implant users. In: International conference on text, speech, and dialogue, pp 299\u2013306","DOI":"10.1007\/978-3-030-27947-9_25"},{"key":"921_CR28","volume-title":"Analysis of speech of people with Parkinson\u2019s disease","author":"JR Orozco-Arroyave","year":"2016","unstructured":"Orozco-Arroyave JR (2016) Analysis of speech of people with Parkinson\u2019s disease. Logos Verlag, Berlin"},{"key":"921_CR29","unstructured":"Huerta JM, Stern RM (1998) Speech recognition from GSM codec parameters. In: Fifth international conference on spoken language processing, pp 1\u20134"},{"key":"921_CR30","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa F et al (2011) Scikit-learn: machine learning in python. 
J Mach Learn Res 12:2825\u20132830","journal-title":"J Mach Learn Res"},{"key":"921_CR31","doi-asserted-by":"crossref","unstructured":"Arora V, Lahiri A, Reetz H (2017) Phonological feature based mispronunciation detection and diagnosis using multi-task DNNs and active learning","DOI":"10.21437\/Interspeech.2017-1350"},{"key":"921_CR32","doi-asserted-by":"crossref","unstructured":"Garcia-Ospina N, Arias-Vergara T, V\u00e1squez-Correa JC, Orozco-Arroyave JR, Cernak M, N\u00f6th E (2018) Phonological I-vectors to detect Parkinson\u2019s disease. In: International conference on text, speech, and dialogue, pp 462\u2013470","DOI":"10.1007\/978-3-030-00794-2_50"},{"key":"921_CR33","doi-asserted-by":"crossref","unstructured":"Arias-Vergara T, Orozco-Arroyave JR, Cernak M, Gollwitzer S, Schuster M, N\u00f6th E (2019) Phone-attribute posteriors to evaluate the speech of cochlear implant users. In: Proceedings of the 20th annual conference of the international speech communication association, pp 3108\u20133112","DOI":"10.21437\/Interspeech.2019-2144"},{"key":"921_CR34","volume-title":"Verbmobil: foundations of speech-to-speech translation","author":"W Wahlster","year":"2013","unstructured":"Wahlster W (2013) Verbmobil: foundations of speech-to-speech translation. 
Springer, Berlin"}],"container-title":["Pattern Analysis and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10044-020-00921-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10044-020-00921-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10044-020-00921-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,10,19]],"date-time":"2021-10-19T03:20:28Z","timestamp":1634613628000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10044-020-00921-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,9,24]]},"references-count":34,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2021,5]]}},"alternative-id":["921"],"URL":"https:\/\/doi.org\/10.1007\/s10044-020-00921-5","relation":{},"ISSN":["1433-7541","1433-755X"],"issn-type":[{"value":"1433-7541","type":"print"},{"value":"1433-755X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,9,24]]},"assertion":[{"value":"7 February 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 September 2020","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 September 2020","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Compliance with ethical standards"}},{"value":"The authors declare that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of 
interest"}}]}}