{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,16]],"date-time":"2026-05-16T16:19:45Z","timestamp":1778948385260,"version":"3.51.4"},"reference-count":33,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2022,10,22]],"date-time":"2022-10-22T00:00:00Z","timestamp":1666396800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,10,22]],"date-time":"2022-10-22T00:00:00Z","timestamp":1666396800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J AUDIO SPEECH MUSIC PROC."],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Voice activity detection (VAD) based on deep neural networks (DNN) have demonstrated good performance in adverse acoustic environments. Current DNN-based VAD optimizes a surrogate function, e.g., minimum cross-entropy or minimum squared error, at a given decision threshold. However, VAD usually works on-the-fly with a dynamic decision threshold, and the receiver operating characteristic (ROC) curve is a global evaluation metric for VAD at all possible decision thresholds. In this paper, we propose to maximize the area under the ROC curve (MaxAUC) by DNN, which can maximize the performance of VAD in terms of the entire ROC curve. However, the objective of the AUC maximization is nondifferentiable. To overcome this difficulty, we relax the nondifferentiable loss function to two differentiable approximation functions\u2014sigmoid loss and hinge loss. To study the effectiveness of the proposed MaxAUC-DNN VAD, we take either a standard feedforward neural network or a bidirectional long short-term memory network as the DNN model with either the state-of-the-art multi-resolution cochleagram or short-term Fourier transform as the acoustic feature. We conducted noise-independent training to all comparison methods. Experimental results show that taking AUC as the optimization objective results in higher performance than the common objectives of the minimum squared error and minimum cross-entropy. The experimental conclusion is consistent across different DNN structures, acoustic features, noise scenarios, training sets, and languages.<\/jats:p>","DOI":"10.1186\/s13636-022-00260-9","type":"journal-article","created":{"date-parts":[[2022,10,22]],"date-time":"2022-10-22T10:03:44Z","timestamp":1666433024000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":20,"title":["AUC optimization for deep learning-based voice activity detection"],"prefix":"10.1186","volume":"2022","author":[{"given":"Xiao-Lei","family":"Zhang","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Menglong","family":"Xu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2022,10,22]]},"reference":[{"issue":"4","key":"260_CR1","doi-asserted-by":"publisher","first-page":"377","DOI":"10.1049\/ip-i-2.1992.0052","volume":"139","author":"R Tucker","year":"1992","unstructured":"R. Tucker, Tucker, Voice activity detection using a periodicity measure. IEE Proc. I (Commun. Speech Vis.). 139(4), 377\u2013380 (1992)","journal-title":"IEE Proc. I (Commun. Speech Vis.)."},{"key":"260_CR2","doi-asserted-by":"crossref","unstructured":"J.-C. Junqua, H. Wakita, in Acoustics, Speech, and Signal Processing, 1989. ICASSP-89., 1989 International Conference On. A comparative study of cepstral lifters and distance measures for all pole models of speech in noise (IEEE, 1989), pp. 476\u2013479","DOI":"10.1109\/ICASSP.1989.266467"},{"issue":"3","key":"260_CR3","doi-asserted-by":"publisher","first-page":"217","DOI":"10.1109\/89.905996","volume":"9","author":"E Nemer","year":"2001","unstructured":"E. Nemer, R. Goubran, S. Mahmoud, Robust voice activity detection using higher-order statistics in the LPC residual domain. IEEE Trans. Speech Audio Process. 9(3), 217\u2013231 (2001)","journal-title":"IEEE Trans. Speech Audio Process."},{"issue":"1","key":"260_CR4","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/97.736233","volume":"6","author":"J Sohn","year":"1999","unstructured":"J. Sohn, N.S. Kim, W. Sung, A statistical model-based voice activity detection. IEEE Signal Process. Lett. 6(1), 1\u20133 (1999)","journal-title":"IEEE Signal Process. Lett."},{"issue":"10","key":"260_CR5","doi-asserted-by":"publisher","first-page":"689","DOI":"10.1109\/LSP.2005.855551","volume":"12","author":"J Ram\u00edrez","year":"2005","unstructured":"J. Ram\u00edrez, J.C. Segura, C. Ben\u00edtez, L. Garc\u00eda, A. Rubio, Statistical voice activity detection using a multiple observation likelihood ratio test. IEEE Signal Process. Lett. 12(10), 689\u2013692 (2005)","journal-title":"IEEE Signal Process. Lett."},{"issue":"7","key":"260_CR6","doi-asserted-by":"publisher","first-page":"632","DOI":"10.1049\/el:20030392","volume":"39","author":"J-H Chang","year":"2003","unstructured":"J.-H. Chang, N.S. Kim, Voice activity detection based on complex laplacian model. Electron. Lett. 39(7), 632\u2013634 (2003)","journal-title":"Electron. Lett."},{"issue":"3","key":"260_CR7","doi-asserted-by":"publisher","first-page":"258","DOI":"10.1109\/LSP.2004.840869","volume":"12","author":"JW Shin","year":"2005","unstructured":"J.W. Shin, J.-H. Chang, N.S. Kim, Statistical modeling of speech signals based on generalized gamma distribution. IEEE Signal Process. Lett. 12(3), 258\u2013261 (2005)","journal-title":"IEEE Signal Process. Lett."},{"issue":"6","key":"260_CR8","doi-asserted-by":"publisher","first-page":"1965","DOI":"10.1109\/TSP.2006.874403","volume":"54","author":"JH Chang","year":"2006","unstructured":"J.H. Chang, N.S. Kim, S.K. Mitra, Voice activity detection based on multiple statistical models. IEEE Trans. Signal Process. 54(6), 1965\u20131976 (2006)","journal-title":"IEEE Trans. Signal Process."},{"key":"260_CR9","doi-asserted-by":"crossref","unstructured":"J. Padrell, D. Macho, C. Nadeu, in Acoustics, Speech, and Signal Processing, 2005. Proceedings.(ICASSP\u201905). IEEE International Conference On. Robust speech activity detection using lda applied to ff parameters, vol. 1 (IEEE, 2005), p. 557","DOI":"10.1109\/ICASSP.2005.1415174"},{"issue":"8","key":"260_CR10","doi-asserted-by":"publisher","first-page":"466","DOI":"10.1109\/LSP.2011.2159374","volume":"18","author":"J Wu","year":"2011","unstructured":"J. Wu, X.L. Zhang, Efficient multiple kernel support vector machine based voice activity detection. IEEE Signal Process. Lett. 18(8), 466\u2013499 (2011)","journal-title":"IEEE Signal Process. Lett."},{"issue":"6","key":"260_CR11","doi-asserted-by":"publisher","first-page":"1322","DOI":"10.1109\/TASLP.2017.2690568","volume":"25","author":"D Dov","year":"2017","unstructured":"D. Dov, R. Talmon, I. Cohen, Multimodal kernel method for activity detection of sound sources. IEEE\/ACM Trans. Audio Speech Lang. Process. 25(6), 1322\u20131334 (2017)","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"issue":"5","key":"260_CR12","doi-asserted-by":"publisher","first-page":"475","DOI":"10.1109\/LSP.2013.2252615","volume":"20","author":"P Teng","year":"2013","unstructured":"P. Teng, Y. Jia, Voice activity detection via noise reducing using non-negative sparse coding. IEEE Signal Process. Lett. 20(5), 475\u2013478 (2013)","journal-title":"IEEE Signal Process. Lett."},{"issue":"4","key":"260_CR13","doi-asserted-by":"publisher","first-page":"1228","DOI":"10.1016\/j.dsp.2013.03.005","volume":"23","author":"S-W Deng","year":"2013","unstructured":"S.-W. Deng, J.-Q. Han, Statistical voice activity detection based on sparse representation over learned dictionary. Digit. Signal Process. 23(4), 1228\u20131232 (2013)","journal-title":"Digit. Signal Process."},{"issue":"4","key":"260_CR14","doi-asserted-by":"publisher","first-page":"697","DOI":"10.1109\/TASL.2012.2229986","volume":"21","author":"X-L Zhang","year":"2013","unstructured":"X.-L. Zhang, J. Wu, Deep belief networks based voice activity detection. IEEE Trans. Audio Speech Lang. Process. 21(4), 697\u2013710 (2013)","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"260_CR15","doi-asserted-by":"crossref","unstructured":"X.-L. Zhang, J. Wu, in the 38th IEEE International Conference on Acoustic, Speech, and Signal Processing. Denoising deep neural networks based voice activity detection (2013), pp. 853\u2013857","DOI":"10.1109\/ICASSP.2013.6637769"},{"key":"260_CR16","doi-asserted-by":"crossref","unstructured":"T. Hughes, K. Mierle, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Recurrent neural networks for voice activity detection (2013). pp. 7378\u20137382","DOI":"10.1109\/ICASSP.2013.6639096"},{"key":"260_CR17","doi-asserted-by":"crossref","unstructured":"F. Eyben, F. Weninger, S. Squartini, B. Schuller, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Real-life voice activity detection with lstm recurrent neural networks and an application to hollywood movies (IEEE, 2013) pp. 483\u2013487","DOI":"10.1109\/ICASSP.2013.6637694"},{"issue":"2","key":"260_CR18","doi-asserted-by":"publisher","first-page":"252","DOI":"10.1109\/TASLP.2015.2505415","volume":"24","author":"X-L Zhang","year":"2016","unstructured":"X.-L. Zhang, D. Wang, Boosting contextual information for deep neural network based voice activity detection. IEEE\/ACM Trans. Audio Speech Lang. Process. 24(2), 252\u2013264 (2016)","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"260_CR19","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.csl.2015.11.003","volume":"38","author":"I Hwang","year":"2016","unstructured":"I. Hwang, H.-M. Park, J.-H. Chang, Ensemble of deep neural networks using acoustic environment classification for statistical model-based voice activity detection. Comput. Speech Lang. 38, 1\u201312 (2016)","journal-title":"Comput. Speech Lang."},{"key":"260_CR20","doi-asserted-by":"crossref","unstructured":"Q. Wang, J. Du, X. Bao, Z.-R. Wang, L.-R. Dai, C.-H. Lee, In: Sixteenth Annual Conference of the International Speech Communication Association. A universal vad based on jointly trained deep neural networks (2015)","DOI":"10.21437\/Interspeech.2015-442"},{"key":"260_CR21","unstructured":"L. Wang, K. Phapatanaburi, Z. Go, S. Nakagawa, M. Iwahashi, J. Dang, in Proceedings of ICME. Limiting numerical precision of neural networks to achieve real-time voice activity detection (2018), pp. 1087\u20131092"},{"key":"260_CR22","unstructured":"Y. Tachioka, in Proceedings of ICASSP. Limiting numerical precision of neural networks to achieve real-time voice activity detection (2018), pp. 2236\u20132240"},{"key":"260_CR23","doi-asserted-by":"crossref","unstructured":"Y. Tachioka, in Proceedings of ICASSP. Dnn-based voice activity detection using auxiliary speech models in noisy environments (2018). pp. 5529\u20135533","DOI":"10.1109\/ICASSP.2018.8461551"},{"key":"260_CR24","doi-asserted-by":"crossref","unstructured":"W.A. Jassim, N. Harte, in Proceedings of ICASSP. Voice activity detection using neurograms (2018), pp. 5524\u20135528","DOI":"10.1109\/ICASSP.2018.8461952"},{"key":"260_CR25","doi-asserted-by":"crossref","unstructured":"Y. Jung, Y. Kim, Y. Choi, H. Kim, in Interspeech. Joint learning using denoising variational autoencoders for voice activity detection (2018), pp. 1210\u20131214","DOI":"10.21437\/Interspeech.2018-1151"},{"key":"260_CR26","doi-asserted-by":"crossref","unstructured":"T. Xu, H. Zhang, X. Zhang, in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Joint training rescnn-based voice activity detection with speech enhancement (IEEE, 2019), pp. 1157\u20131162","DOI":"10.1109\/APSIPAASC47483.2019.9023101"},{"issue":"9","key":"260_CR27","doi-asserted-by":"publisher","first-page":"3230","DOI":"10.3390\/app10093230","volume":"10","author":"GW Lee","year":"2020","unstructured":"G.W. Lee, H.K. Kim, Multi-task learning u-net for single-channel speech enhancement and mask-based voice activity detection. Appl. Sci. 10(9), 3230 (2020)","journal-title":"Appl. Sci."},{"key":"260_CR28","doi-asserted-by":"crossref","unstructured":"Y. Zhuang, S. Tong, M. Yin, Y. Qian, K. Yu, in 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP). Multi-task joint-learning for robust voice activity detection (IEEE, 2016), pp. 1\u20135","DOI":"10.1109\/ISCSLP.2016.7918383"},{"key":"260_CR29","doi-asserted-by":"crossref","unstructured":"X. Tan, X.-L. Zhang, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Speech enhancement aided end-to-end multi-task learning for voice activity detection (IEEE, 2021), pp. 6823\u20136827","DOI":"10.1109\/ICASSP39728.2021.9414445"},{"key":"260_CR30","unstructured":"Y. Chen, S. Wang, Y. Qian, K. Yu, End-to-end speaker-dependent voice activity detection. arXiv preprint arXiv:2009.09906 (2020)"},{"key":"260_CR31","doi-asserted-by":"crossref","unstructured":"Z.-C. Fan, Z. Bai, X.-L. Zhang, S. Rahardja, J. Chen, in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Auc optimization for deep learning based voice activity detection (IEEE, 2019), pp. 6760\u20136764","DOI":"10.1109\/ICASSP.2019.8682803"},{"key":"260_CR32","doi-asserted-by":"crossref","unstructured":"H.B. Mann, D.R. Whitney, On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 50\u201360 (1947)","DOI":"10.1214\/aoms\/1177730491"},{"issue":"2","key":"260_CR33","doi-asserted-by":"publisher","first-page":"252","DOI":"10.1109\/TASLP.2015.2505415","volume":"24","author":"X-L Zhang","year":"2015","unstructured":"X.-L. Zhang, D. Wang, Boosting contextual information for deep neural network based voice activity detection. IEEE\/ACM Trans Audio Speech Lang. Process. 24(2), 252\u2013264 (2015)","journal-title":"IEEE\/ACM Trans Audio Speech Lang. Process."}],"container-title":["EURASIP Journal on Audio, Speech, and Music Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-022-00260-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13636-022-00260-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-022-00260-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,6]],"date-time":"2024-10-06T10:03:29Z","timestamp":1728209009000},"score":1,"resource":{"primary":{"URL":"https:\/\/asmp-eurasipjournals.springeropen.com\/articles\/10.1186\/s13636-022-00260-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,22]]},"references-count":33,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2022,12]]}},"alternative-id":["260"],"URL":"https:\/\/doi.org\/10.1186\/s13636-022-00260-9","relation":{},"ISSN":["1687-4722"],"issn-type":[{"value":"1687-4722","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,10,22]]},"assertion":[{"value":"26 September 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 October 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 October 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"All authors agree to the publication in this journal.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"27"}}