{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T13:17:48Z","timestamp":1740143868805,"version":"3.37.3"},"reference-count":34,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,5,17]],"date-time":"2021-05-17T00:00:00Z","timestamp":1621209600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,5,17]],"date-time":"2021-05-17T00:00:00Z","timestamp":1621209600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62071039"],"award-info":[{"award-number":["62071039"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Nature Science Foundation of China","doi-asserted-by":"crossref","award":["61620106002"],"award-info":[{"award-number":["61620106002"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J AUDIO SPEECH MUSIC PROC."],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Recently, the non-intrusive speech quality assessment method has attracted a lot of attention since it does not require the original reference signals. At the same time, neural networks began to be applied to speech quality assessment and achieved good performance. To improve the performance of non-intrusive speech quality assessment, this paper proposes a neural network-based assessment method using attention pooling function. The proposed systems are based on the convolutional neural networks (CNNs), bidirectional long short-term memory (BLSTM), and CNN-LSTM structure. Comparing four types of pooling functions both theoretically and experimentally, we find the attention pooling function performs the best among the four. Experiments are conducted in a dataset containing various degraded speech signals with corresponding subjective quality scores. The results show that the proposed CNN-LSTM model using attention pooling function achieves state-of-the-art correlation coefficient (R) and root-mean-square error (RMSE) of 0.967 and 0.269, outperforming the performance of standardization ITU-T P.563 and autoencoder-support vector regression method.<\/jats:p>","DOI":"10.1186\/s13636-021-00209-4","type":"journal-article","created":{"date-parts":[[2021,5,17]],"date-time":"2021-05-17T11:02:56Z","timestamp":1621249376000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Neural network-based non-intrusive speech quality assessment using attention pooling function"],"prefix":"10.1186","volume":"2021","author":[{"given":"Miao","family":"Liu","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3653-9951","authenticated-orcid":false,"given":"Jing","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Weiming","family":"Yi","sequence":"additional","affiliation":[]},{"given":"Fang","family":"Liu","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,5,17]]},"reference":[{"issue":"1","key":"209_CR1","doi-asserted-by":"publisher","first-page":"229","DOI":"10.1109\/TASL.2007.911054","volume":"16","author":"Y. Hu","year":"2008","unstructured":"Y. Hu, P. C. Loizou, Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 16(1), 229\u2013238 (2008). https:\/\/doi.org\/10.1109\/TASL.2007.911054.","journal-title":"IEEE Trans. Audio Speech Lang. Process"},{"doi-asserted-by":"publisher","unstructured":"G. Mittag, S. M\u00f6ller, in Proc. Interspeech 2020. Deep learning based assessment of synthetic speech naturalness, (2020), pp. 1748\u20131752. https:\/\/doi.org\/10.21437\/Interspeech.2020-2382.","key":"209_CR2","DOI":"10.21437\/Interspeech.2020-2382"},{"unstructured":"I. T. Union, Single ended method for objective speech quality assessment in narrow-band telephony applications. ITU-T Recommendation P.563 (2004). Geneva.","key":"209_CR3"},{"doi-asserted-by":"publisher","unstructured":"H. Yang, K. Byun, H. Kang, Y. Kwak, in 2016 IEEE International Conference on Digital Signal Processing (DSP). Parametric-based non-intrusive speech quality assessment by deep neural network, (2016), pp. 99\u2013103. https:\/\/doi.org\/10.1109\/ICDSP.2016.7868524.","key":"209_CR4","DOI":"10.1109\/ICDSP.2016.7868524"},{"doi-asserted-by":"publisher","unstructured":"M. Hakami, W. B. Kleijn, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Machine learning based non-intrusive quality estimation with an augmented feature set, (2017), pp. 5105\u20135109. https:\/\/doi.org\/10.1109\/ICASSP.2017.7953129.","key":"209_CR5","DOI":"10.1109\/ICASSP.2017.7953129"},{"doi-asserted-by":"publisher","unstructured":"S. -w. Fu, Y. Tsao, H. -T. Hwang, H. -M. Wang, in Proc. Interspeech 2018. Quality-net: an end-to-end non-intrusive speech quality assessment model based on blstm, (2018), pp. 1873\u20131877. https:\/\/doi.org\/10.21437\/Interspeech.2018-1802.","key":"209_CR6","DOI":"10.21437\/Interspeech.2018-1802"},{"doi-asserted-by":"publisher","unstructured":"C. -C. Lo, S. -W. Fu, W. -C. Huang, X. Wang, J. Yamagishi, Y. Tsao, H. -M. Wang, in Proc. Interspeech 2019. MOSNet: deep learning-based objective assessment for voice conversion, (2019), pp. 1541\u20131545. https:\/\/doi.org\/10.21437\/Interspeech.2019-2003.","key":"209_CR7","DOI":"10.21437\/Interspeech.2019-2003"},{"key":"209_CR8","doi-asserted-by":"publisher","first-page":"316","DOI":"10.1109\/ICASSP.2018.8461392","volume-title":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Q. Kong","year":"2018","unstructured":"Q. Kong, Y. Xu, W. Wang, M. D. Plumbley, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Audio set classification with attention model: a probabilistic perspective (IEEECalgary, 2018), pp. 316\u2013320. https:\/\/doi.org\/10.1109\/ICASSP.2018.8461392."},{"unstructured":"I. T. Union, Methods for subjective determination of transmission quality. ITU-T Recommendation P.800 (1996). Geneva.","key":"209_CR9"},{"unstructured":"I. T. Union, Perceptual evaluation of speech quality (PESQ): an objective method for end-to end speech quality assessment of narrow-band telephone networks and speech codecs. ITU-T Recommendation P.862 (2001). Geneva.","key":"209_CR10"},{"unstructured":"I. T. Union, Perceptual objective listening quality assessment (POLQA). ITU-T Recommendation P.863 (2011). Geneva.","key":"209_CR11"},{"issue":"5","key":"209_CR12","doi-asserted-by":"publisher","first-page":"821","DOI":"10.1109\/TSA.2005.851924","volume":"13","author":"D. -S. Kim","year":"2005","unstructured":"D. -S. Kim, Anique: an auditory model for single-ended speech quality estimation. IEEE Trans. Speech Audio Process.13(5), 821\u201331 (2005). https:\/\/doi.org\/10.1109\/TSA.2005.851924.","journal-title":"IEEE Trans. Speech Audio Process."},{"doi-asserted-by":"publisher","unstructured":"T. H. Falk, Q. Xu, W. -Y. Chan, in Proceedings. (ICASSP \u201905). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 vol. 1. Non-intrusive GMM-based speech quality measurement, (2005), pp. 125\u20131281. https:\/\/doi.org\/10.1109\/ICASSP.2005.1415066.","key":"209_CR13","DOI":"10.1109\/ICASSP.2005.1415066"},{"issue":"5","key":"209_CR14","doi-asserted-by":"publisher","first-page":"821","DOI":"10.1109\/TSA.2005.851924","volume":"13","author":"D. -S. Kim","year":"2005","unstructured":"D. -S. Kim, Anique: an auditory model for single-ended speech quality estimation. IEEE Trans. Speech Audio Process. 13(5), 821\u2013831 (2005). https:\/\/doi.org\/10.1109\/TSA.2005.851924.","journal-title":"IEEE Trans. Speech Audio Process"},{"doi-asserted-by":"publisher","unstructured":"M. H. Soni, H. A. Patil, in 2017 25th European Signal Processing Conference (EUSIPCO). Effectiveness of ideal ratio mask for non-intrusive quality assessment of noise suppressed speech, (2017), pp. 573\u2013577. https:\/\/doi.org\/10.23919\/EUSIPCO.2017.8081272.","key":"209_CR15","DOI":"10.23919\/EUSIPCO.2017.8081272"},{"key":"209_CR16","doi-asserted-by":"publisher","first-page":"13","DOI":"10.1016\/j.specom.2019.04.002","volume":"110","author":"J. Wang","year":"2019","unstructured":"J. Wang, Y. Shan, X. Xie, J. Kuang, Output-based speech quality assessment using autoencoder and support vector regression. Speech Commun. 110:, 13\u201320 (2019).","journal-title":"Speech Commun"},{"issue":"3","key":"209_CR17","doi-asserted-by":"publisher","first-page":"199","DOI":"10.1023\/B:STCO.0000035301.49549.88","volume":"14","author":"A. J. Smola","year":"2004","unstructured":"A. J. Smola, B. Schlkopf, A tutorial on support vector regression. Stats Comput. 14(3), 199\u2013222 (2004).","journal-title":"Stats Comput"},{"key":"209_CR18","first-page":"1097","volume-title":"Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1","author":"A. Krizhevsky","year":"2012","unstructured":"A. Krizhevsky, I. Sutskever, G. Hinton, in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1. Imagenet classification with deep convolutional neural networks (Curran Associates Inc.Red Hook, 2012), pp. 1097\u20131105."},{"doi-asserted-by":"publisher","unstructured":"T. N. Sainath, O. Vinyals, A. Senior, H. Sak, in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Convolutional, long short-term memory, fully connected deep neural networks, (2015), pp. 4580\u20134584. https:\/\/doi.org\/10.1109\/ICASSP.2015.7178838.","key":"209_CR19","DOI":"10.1109\/ICASSP.2015.7178838"},{"doi-asserted-by":"publisher","unstructured":"A. R. Avila, H. Gamper, C. Reddy, R. Cutler, I. Tashev, J. Gehrke, in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Non-intrusive speech quality assessment using neural networks, (2019), pp. 631\u2013635. https:\/\/doi.org\/10.1109\/ICASSP.2019.8683175.","key":"209_CR20","DOI":"10.1109\/ICASSP.2019.8683175"},{"key":"209_CR21","volume-title":"Rectified linear units improve restricted Boltzmann machines","author":"V. Nair","year":"2010","unstructured":"V. Nair, G. E. Hinton, Rectified linear units improve restricted Boltzmann machines (Omnipress, Madison, 2010)."},{"unstructured":"S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR. abs\/1502.03167: (2015). http:\/\/arxiv.org\/abs\/1502.03167. Accessed: 20 Sept 2020.","key":"209_CR22"},{"key":"209_CR23","first-page":"1045","volume":"2","author":"T. Mikolov","year":"2010","unstructured":"T. Mikolov, M. Karafi\u00e1t, L. Burget, J. C\u011brnocky, S. Khudanpur, Recurrent neural network based language model. Proc. 11th Ann. Conf. Int. Speech Commun. Assoc. INTERSPEECH 2010. 2:, 1045\u20131048 (2010).","journal-title":"Proc. 11th Ann. Conf. Int. Speech Commun. Assoc. INTERSPEECH 2010"},{"issue":"8","key":"209_CR24","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S. Hochreiter","year":"1997","unstructured":"S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735\u20131780 (1997).","journal-title":"Neural Comput"},{"doi-asserted-by":"publisher","unstructured":"G. Mittag, S. M\u00f6ller, in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Full-reference speech quality estimation with attentional siamese neural networks, (2020), pp. 346\u2013350. https:\/\/doi.org\/10.1109\/ICASSP40776.2020.9053951.","key":"209_CR25","DOI":"10.1109\/ICASSP40776.2020.9053951"},{"unstructured":"A. Shah, A. Kumar, A. G. Hauptmann, B. Raj, A closer look at weak label learning for audio events. CoRR. abs\/1804.09288: (2018). http:\/\/arxiv.org\/abs\/1804.09288. Accessed: 22 Sept 2020.","key":"209_CR26"},{"key":"209_CR27","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1109\/ICASSP.2019.8682847","volume-title":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Y. Wang","year":"2019","unstructured":"Y. Wang, J. Li, F. Metze, in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling (IEEEBrighton, 2019), pp. 31\u201335. https:\/\/doi.org\/10.1109\/ICASSP.2019.8682847."},{"doi-asserted-by":"publisher","unstructured":"Y. Xu, Q. Kong, W. Wang, M. D. Plumbley, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Large-scale weakly supervised audio classification using gated convolutional neural network, (2018), pp. 121\u2013125. https:\/\/doi.org\/10.1109\/ICASSP.2018.8461975.","key":"209_CR28","DOI":"10.1109\/ICASSP.2018.8461975"},{"doi-asserted-by":"crossref","unstructured":"S. Hong, Y. Zou, W. Wang, M. Cao, in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Weakly labelled audio tagging via convolutional networks with spatial and channel-wise attention, (2020), pp. 296\u2013300.","key":"209_CR29","DOI":"10.1109\/ICASSP40776.2020.9053427"},{"key":"209_CR30","doi-asserted-by":"publisher","first-page":"2880","DOI":"10.1109\/TASLP.2020.3030497","volume":"28","author":"Q. Kong","year":"2020","unstructured":"Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, M. D. Plumbley, Panns: large-scale pretrained audio neural networks for audio pattern recognition. IEEE\/ACM Trans Audio Speech Lang Process. 28:, 2880\u20132894 (2020). https:\/\/doi.org\/10.1109\/TASLP.2020.3030497.","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"doi-asserted-by":"publisher","unstructured":"Y. Shan, J. Wang, X. Xie, L. Meng, J. Kuang, in 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP). Non-intrusive speech quality assessment using deep belief network and backpropagation neural network, (2018), pp. 71\u201375. https:\/\/doi.org\/10.1109\/ISCSLP.2018.8706696.","key":"209_CR31","DOI":"10.1109\/ISCSLP.2018.8706696"},{"key":"209_CR32","doi-asserted-by":"publisher","first-page":"2880","DOI":"10.1109\/TASLP.2020.3030497","volume":"28","author":"Q. Kong","year":"2020","unstructured":"Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, M. D. Plumbley, Panns: large-scale pretrained audio neural networks for audio pattern recognition. IEEE\/ACM Trans. Audio Speech Lang. Process. 28:, 2880\u20132894 (2020). https:\/\/doi.org\/10.1109\/TASLP.2020.3030497.","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process"},{"doi-asserted-by":"publisher","unstructured":"D. S. Park, W. Chan, Y. Zhang, C. -C. Chiu, B. Zoph, E. D. Cubuk, Q. V. Le, Specaugment: a simple data augmentation method for automatic speech recognition. Interspeech 2019 (2019). https:\/\/doi.org\/10.21437\/interspeech.2019-2680.","key":"209_CR33","DOI":"10.21437\/interspeech.2019-2680"},{"unstructured":"D. P. Kingma, J. Ba, in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, ed. by Y. Bengio, Y. LeCun. Adam: a method for stochastic optimization, (2015). http:\/\/arxiv.org\/abs\/1412.6980. Accessed: 19 Sept 2020.","key":"209_CR34"}],"container-title":["EURASIP Journal on Audio, Speech, and Music Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-021-00209-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13636-021-00209-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-021-00209-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,5,17]],"date-time":"2021-05-17T11:13:16Z","timestamp":1621249996000},"score":1,"resource":{"primary":{"URL":"https:\/\/asmp-eurasipjournals.springeropen.com\/articles\/10.1186\/s13636-021-00209-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,5,17]]},"references-count":34,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["209"],"URL":"https:\/\/doi.org\/10.1186\/s13636-021-00209-4","relation":{},"ISSN":["1687-4722"],"issn-type":[{"type":"electronic","value":"1687-4722"}],"subject":[],"published":{"date-parts":[[2021,5,17]]},"assertion":[{"value":"29 November 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 April 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 May 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare that they have no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"20"}}