{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T04:29:42Z","timestamp":1772166582679,"version":"3.50.1"},"reference-count":36,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,7,23]],"date-time":"2021-07-23T00:00:00Z","timestamp":1626998400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,7,23]],"date-time":"2021-07-23T00:00:00Z","timestamp":1626998400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"e Natural Science Foundation of Guangdong Province","award":["2019A1515011940"],"award-info":[{"award-number":["2019A1515011940"]}]},{"name":"Science and Technology Program of Guangzhou","award":["2019050001, 202002030353"],"award-info":[{"award-number":["2019050001, 202002030353"]}]},{"name":"e Science and Technology Planning Project of Guangdong Province","award":["2017B030308009"],"award-info":[{"award-number":["2017B030308009"]}]},{"name":"Special Project for Youth Top-Notch Scholars of Guangdong Province","award":["2016TQ03X100"],"award-info":[{"award-number":["2016TQ03X100"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["EURASIP J. Adv. Signal Process."],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Text-independent speaker recognition is widely used in identity recognition that has a wide spectrum of applications, such as criminal investigation, payment certification, and interest-based customer services. In order to improve the recognition ability of log filter bank feature vectors, a method of text-independent speaker recognition based on deep residual networks model was proposed in this paper. The deep residual network was composed of a residual network (ResNet) and a convolutional attention statistics pooling (CASP) layer. The CASP layer could aggregate frame-level features from the ResNet into an utterance-level features. Extracting speech features for each speaker using deep residual networks was a promising direction to explore, and a straightforward solution was to train the discriminative feature extraction network by using a margin-based loss function. However, a margin-based loss function often has certain limitations, such as the margins between different categories were set to be the same and fixed. Thus, we used an adaptive curriculum learning loss (ACLL) to address the problem and introduce two different margin-based losses for this problem, i.e., AM-Softmax and AAM-Softmax. The proposed method was applied to a large-scale VoxCeleb2 dataset for extensive text-independent speaker recognition experiments, and average equal error rate (EER) could achieve 1.76% on VoxCeleb1 test dataset, 1.91% on VoxCeleb1-E test dataset, and 3.24% on VoxCeleb1-H test dataset. 
Compared with related speaker recognition methods, the EER was improved by 1.11% on the VoxCeleb1 test dataset, 1.04% on the VoxCeleb1-E test dataset, and 1.69% on the VoxCeleb1-H test dataset.<\/jats:p>","DOI":"10.1186\/s13634-021-00762-2","type":"journal-article","created":{"date-parts":[[2021,7,23]],"date-time":"2021-07-23T07:03:02Z","timestamp":1627023782000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["Text-independent speaker recognition based on adaptive course learning loss and deep residual network"],"prefix":"10.1186","volume":"2021","author":[{"given":"Qinghua","family":"Zhong","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7354-6608","authenticated-orcid":false,"given":"Ruining","family":"Dai","sequence":"additional","affiliation":[]},{"given":"Han","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Yongsheng","family":"Zhu","sequence":"additional","affiliation":[]},{"given":"Guofu","family":"Zhou","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,7,23]]},"reference":[{"issue":"9","key":"762_CR1","doi-asserted-by":"publisher","first-page":"1437","DOI":"10.1109\/5.628714","volume":"85","author":"J. P. Campbell","year":"1997","unstructured":"J. P. Campbell, Speaker recognition: a tutorial. Proc. IEEE. 85(9), 1437\u20131462 (1997).","journal-title":"Proc. IEEE"},{"issue":"6","key":"762_CR2","doi-asserted-by":"publisher","first-page":"74","DOI":"10.1109\/MSP.2015.2462851","volume":"32","author":"J. Hansen","year":"2015","unstructured":"J. Hansen, T. Hasan, Speaker recognition by machines and humans: a tutorial review. IEEE Signal Proc. Mag.32(6), 74\u201399 (2015).","journal-title":"IEEE Signal Proc. Mag."},{"issue":"9","key":"762_CR3","doi-asserted-by":"publisher","first-page":"1633","DOI":"10.1109\/TASLP.2018.2831456","volume":"26","author":"Z. Chunlei","year":"2018","unstructured":"Z. Chunlei, K. Kazuhito, J. H. L. Hansen, Text-independent speaker verification based on triplet convolutional neural network embeddings. IEEE\/ACM Trans. Audio Speech Lang. Process.26(9), 1633\u20131644 (2018).","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"issue":"2","key":"762_CR4","doi-asserted-by":"publisher","first-page":"23","DOI":"10.1109\/MCAS.2011.941079","volume":"11","author":"R. Togneri","year":"2011","unstructured":"R. Togneri, D. Pullella, An overview of speaker identification: accuracy and robustness issues. IEEE Circ. Syst. Mag.11(2), 23\u201361 (2011).","journal-title":"IEEE Circ. Syst. Mag."},{"issue":"3","key":"762_CR5","doi-asserted-by":"publisher","first-page":"56","DOI":"10.1016\/j.specom.2014.03.001","volume":"60","author":"A. Larcher","year":"2014","unstructured":"A. Larcher, K. A. Lee, B. Ma, H. Li, Text-dependent speaker verification: classifiers, databases and rsr2015. Speech Commun.60(3), 56\u201377 (2014).","journal-title":"Speech Commun."},{"key":"762_CR6","doi-asserted-by":"publisher","first-page":"22","DOI":"10.1016\/j.csl.2019.06.002","volume":"59","author":"J. Rohdin","year":"2020","unstructured":"J. Rohdin, A. Silnova, M. Diez, O. Plchot, P. Mat\u011bjka, L. Burget, O. Glembek, End-to-end dnn based text-independent speaker recognition for long and short utterances. Comput. Speech Lang.59:, 22\u201335 (2020).","journal-title":"Comput. Speech Lang."},{"key":"762_CR7","doi-asserted-by":"publisher","first-page":"10","DOI":"10.1016\/j.specom.2020.02.003","volume":"118","author":"Z. 
Bai","year":"2020","unstructured":"Z. Bai, X. -L. Zhang, J. Chen, Cosine metric learning based speaker verification. Speech Commun.118:, 10\u201320 (2020).","journal-title":"Speech Commun."},{"key":"762_CR8","doi-asserted-by":"crossref","unstructured":"C. Zhang, K. Koishida, in Interspeech 2017. End-to-end text-independent speaker verification with triplet loss on short utterances (ISCA, 2017), pp. 1487\u20131491.","DOI":"10.21437\/Interspeech.2017-1608"},{"key":"762_CR9","doi-asserted-by":"crossref","unstructured":"H. Bredin, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Tristounet: triplet loss for speaker turn embedding (IEEE, 2017), pp. 5430\u20135434. corrabs \/ 1609.04301.","DOI":"10.1109\/ICASSP.2017.7953194"},{"issue":"11","key":"762_CR10","doi-asserted-by":"publisher","first-page":"1686","DOI":"10.1109\/TASLP.2019.2928128","volume":"27","author":"S. Wang","year":"2019","unstructured":"S. Wang, Z. Huang, Y. Qian, K. Yu, Discriminative neural embedding learning for short-duration text-independent speaker verification. IEEE\/ACM Trans. Audio Speech Lang. Process.27(11), 1686\u20131696 (2019).","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"762_CR11","doi-asserted-by":"publisher","first-page":"101027","DOI":"10.1016\/j.csl.2019.101027","volume":"60","author":"A. Nagrani","year":"2020","unstructured":"A. Nagrani, J. S. Chung, W. Xie, A. Zisserman, Voxceleb: large-scale speaker verification in the wild. Comput. Speech Lang.60:, 101027 (2020).","journal-title":"Comput. Speech Lang."},{"key":"762_CR12","doi-asserted-by":"crossref","unstructured":"W. Xie, A. Nagrani, J. S. Chung, A. Zisserman, in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Utterance-level aggregation for speaker recognition in the wild (IEEE, 2019), pp. 5791\u20135795. corrabs \/ 1902.10107.","DOI":"10.1109\/ICASSP.2019.8683120"},{"key":"762_CR13","doi-asserted-by":"publisher","first-page":"751","DOI":"10.1016\/j.future.2019.05.057","volume":"100","author":"Z. Zhao","year":"2019","unstructured":"Z. Zhao, H. Duan, G. Min, Y. Wu, Z. Huang, X. Zhuang, H. Xi, M. Fu, A lighten cnn-lstm model for speaker verification on embedded devices. Futur. Gener. Comput. Syst.100:, 751\u2013758 (2019).","journal-title":"Futur. Gener. Comput. Syst."},{"key":"762_CR14","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1016\/j.neucom.2019.08.046","volume":"368","author":"T. Bian","year":"2019","unstructured":"T. Bian, F. Chen, L. Xu, Self-attention based speaker recognition using cluster-range loss. Neurocomputing. 368:, 59\u201368 (2019).","journal-title":"Neurocomputing"},{"issue":"10","key":"762_CR15","doi-asserted-by":"publisher","first-page":"1671","DOI":"10.1109\/LSP.2015.2420092","volume":"22","author":"F. Richardson","year":"2015","unstructured":"F. Richardson, D. Reynolds, N. Dehak, Deep neural network approaches to speaker and language recognition. IEEE Sig. Process Lett.22(10), 1671\u20131675 (2015).","journal-title":"IEEE Sig. Process Lett."},{"key":"762_CR16","doi-asserted-by":"publisher","first-page":"85327","DOI":"10.1109\/ACCESS.2019.2917470","volume":"7","author":"N. N. An","year":"2019","unstructured":"N. N. An, N. Q. Thanh, Y. Liu, Deep CNNs with self-attention for speaker identification. IEEE Access. 7:, 85327\u201385337 (2019).","journal-title":"IEEE Access"},{"key":"762_CR17","doi-asserted-by":"publisher","first-page":"1293","DOI":"10.1109\/TASLP.2020.2986896","volume":"28","author":"H. 
Taherian","year":"2020","unstructured":"H. Taherian, Z. -Q. Wang, J. Chang, D. Wang, Robust speaker recognition based on single-channel and multi-channel speech enhancement. IEEE\/ACM Trans. Audio Speech Lang. Process.28:, 1293\u20131302 (2020).","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"762_CR18","doi-asserted-by":"crossref","unstructured":"K. Okabe, T. Koshinaka, K. Shinoda, Attentive statistics pooling for deep speaker embedding. arXiv preprint arXiv:1803.10963, 2252\u20132256 (2018).","DOI":"10.21437\/Interspeech.2018-993"},{"key":"762_CR19","doi-asserted-by":"publisher","first-page":"54","DOI":"10.1016\/j.sysarc.2018.05.010","volume":"88","author":"O. Boujelben","year":"2018","unstructured":"O. Boujelben, M. Bahoura, Efficient fpga-based architecture of an automatic wheeze detector using a combination of MFCC and SVM algorithms. J. Syst. Archit.88:, 54\u201364 (2018).","journal-title":"J. Syst. Archit."},{"key":"762_CR20","doi-asserted-by":"publisher","first-page":"267","DOI":"10.1016\/j.procs.2018.10.395","volume":"143","author":"A. Sithara","year":"2018","unstructured":"A. Sithara, A. Thomas, D. Mathew, Study of MFCC and IHC feature extraction methods with probabilistic acoustic models for speaker biometric applications. Procedia Comput. Sci.143:, 267\u2013276 (2018).","journal-title":"Procedia Comput. Sci."},{"key":"762_CR21","doi-asserted-by":"crossref","unstructured":"K. He, X. Zhang, S. Ren, J. Sun, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Deep residual learning for image recognition (IEEE, 2016), pp. 770\u2013778. corrabs \/ 1512.03385.","DOI":"10.1109\/CVPR.2016.90"},{"issue":"99","key":"762_CR22","first-page":"1","volume":"PP","author":"R. Jahangir","year":"2020","unstructured":"R. Jahangir, W. T. Ying, N. A. Memon, G. Mujtaba, I. Ali, Text-independent speaker identification through feature fusion and deep neural network. IEEE Access. PP(99), 1 (2020).","journal-title":"IEEE Access"},{"key":"762_CR23","doi-asserted-by":"crossref","unstructured":"J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B. -J. Lee, I. Han, In defence of metric learning for speaker recognition. arXiv preprint arXiv:2003.11982, 2977\u20132981 (2020).","DOI":"10.21437\/Interspeech.2020-1064"},{"issue":"7","key":"762_CR24","doi-asserted-by":"publisher","first-page":"926","DOI":"10.1109\/LSP.2018.2822810","volume":"25","author":"F. Wang","year":"2018","unstructured":"F. Wang, J. Cheng, W. Liu, H. Liu, Additive margin softmax for face verification. IEEE Signal Proc. Lett.25(7), 926\u2013930 (2018).","journal-title":"IEEE Signal Proc. Lett."},{"key":"762_CR25","doi-asserted-by":"crossref","unstructured":"J. Deng, J. Guo, N. Xue, S. Zafeiriou, in Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. Arcface: additive angular margin loss for deep face recognition (IEEE, 2019), pp. 4690\u20134699. corr ABS \/ 1801.07698.","DOI":"10.1109\/CVPR.2019.00482"},{"key":"762_CR26","unstructured":"W. Liu, Y. Wen, Z. Yu, M. Yang, in ICML, 2. Large-margin softmax loss for convolutional neural networks (corrabs \/ 1612.02295, 2016), p. 7."},{"key":"762_CR27","doi-asserted-by":"crossref","unstructured":"Y. Huang, Y. Wang, Y. Tai, X. Liu, P. Shen, S. Li, J. Li, F. Huang, in Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. Curricularface: adaptive curriculum learning loss for deep face recognition (IEEE, 2020), pp. 5901\u20135910. 
CoRR abs\/2004.00288.","DOI":"10.1109\/CVPR42600.2020.00594"},{"key":"762_CR28","doi-asserted-by":"crossref","unstructured":"J. S. Chung, A. Nagrani, A. Zisserman, Voxceleb2: deep speaker recognition. arXiv preprint arXiv:1806.05622, 1086\u20131090 (2018).","DOI":"10.21437\/Interspeech.2018-1929"},{"key":"762_CR29","doi-asserted-by":"crossref","unstructured":"A. Nagrani, J. S. Chung, A. Zisserman, Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612, 2616\u20132620 (2017).","DOI":"10.21437\/Interspeech.2017-950"},{"key":"762_CR30","doi-asserted-by":"publisher","first-page":"109","DOI":"10.1016\/j.neucom.2018.06.046","volume":"314","author":"F. Li","year":"2018","unstructured":"F. Li, J. M. Zurada, W. Wu, Smooth group l1\/2 regularization for input layer of feedforward neural networks. Neurocomputing. 314:, 109\u2013119 (2018). https:\/\/doi.org\/10.1016\/j.neucom.2018.06.046.","journal-title":"Neurocomputing"},{"key":"762_CR31","unstructured":"D. P. Kingma, J. Ba, Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 1\u201315 (2014)."},{"key":"762_CR32","doi-asserted-by":"crossref","unstructured":"X. Xiang, S. Wang, H. Huang, Y. Qian, K. Yu, in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Margin matters: towards more discriminative deep neural network embeddings for speaker recognition (IEEE, 2019), pp. 1652\u20131656. CoRR abs\/1906.07317.","DOI":"10.1109\/APSIPAASC47483.2019.9023039"},{"key":"762_CR33","doi-asserted-by":"crossref","unstructured":"S. M. Kye, Y. Jung, H. B. Lee, S. J. Hwang, H. Kim, Meta-learning for short utterance speaker recognition with imbalance length pairs. arXiv preprint arXiv:2004.02863, 1652\u20131656 (2020).","DOI":"10.21437\/Interspeech.2020-1283"},{"key":"762_CR34","doi-asserted-by":"publisher","first-page":"394","DOI":"10.1016\/j.neucom.2020.06.045","volume":"410","author":"J. Xu","year":"2020","unstructured":"J. Xu, X. Wang, B. Feng, W. Liu, Deep multi-metric learning for text-independent speaker verification. Neurocomputing. 410:, 394\u2013400 (2020).","journal-title":"Neurocomputing"},{"key":"762_CR35","unstructured":"Y. -Q. Yu, L. Fan, W. -J. Li, in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Ensemble additive margin softmax for speaker verification (IEEE, 2019), pp. 6046\u20136050."},{"key":"762_CR36","doi-asserted-by":"crossref","unstructured":"Y. Jung, Y. Kim, H. Lim, Y. Choi, H. Kim, Spatial pyramid encoding with convex length normalization for text-independent speaker verification. 
arXiv preprint arXiv:1906.08333, 2982\u20132986 (2019).","DOI":"10.21437\/Interspeech.2019-2177"}],"container-title":["EURASIP Journal on Advances in Signal Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13634-021-00762-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13634-021-00762-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13634-021-00762-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,7,23]],"date-time":"2021-07-23T07:14:52Z","timestamp":1627024492000},"score":1,"resource":{"primary":{"URL":"https:\/\/asp-eurasipjournals.springeropen.com\/articles\/10.1186\/s13634-021-00762-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,23]]},"references-count":36,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["762"],"URL":"https:\/\/doi.org\/10.1186\/s13634-021-00762-2","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-206450\/v1","asserted-by":"object"}]},"ISSN":["1687-6180"],"issn-type":[{"value":"1687-6180","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,7,23]]},"assertion":[{"value":"4 February 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 July 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 July 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare that they have no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"45"}}