{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,5]],"date-time":"2025-11-05T17:28:18Z","timestamp":1762363698930,"version":"build-2065373602"},"reference-count":43,"publisher":"Institution of Engineering and Technology (IET)","issue":"1","license":[{"start":{"date-parts":[[2024,3,22]],"date-time":"2024-03-22T00:00:00Z","timestamp":1711065600000},"content-version":"vor","delay-in-days":81,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["U21A20449","61971066"],"award-info":[{"award-number":["U21A20449","61971066"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["ietresearch.onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["IET Biometrics"],"published-print":{"date-parts":[[2024,1]]},"abstract":"<jats:p>Text\u2010independent speaker verification (TI\u2010SV) is a crucial task in speaker recognition, as it involves verifying an individual\u2019s claimed identity from speech of arbitrary content without any human intervention. The target for TI\u2010SV is to design a discriminative network to learn deep speaker embedding for speaker idiosyncrasy. In this paper, we propose a deep speaker embedding learning approach of a hybrid deep neural network (DNN) for TI\u2010SV in FM broadcasting. Not only acoustic features are utilized, but also phoneme features are introduced as prior knowledge to collectively learn deep speaker embedding. The hybrid DNN consists of a convolutional neural network architecture for generating acoustic features and a multilayer perceptron architecture for extracting phoneme features sequentially, which represent significant pronunciation attributes. 
The extracted acoustic and phoneme features are concatenated to form deep embedding descriptors for speaker identity. The hybrid DNN demonstrates not only the complementarity between acoustic and phoneme features but also the temporality of phoneme features in a sequence. Our experiments show that the hybrid DNN outperforms existing methods and delivers a remarkable performance in FM broadcasting TI\u2010SV.<\/jats:p>","DOI":"10.1049\/2024\/6694481","type":"journal-article","created":{"date-parts":[[2024,3,22]],"date-time":"2024-03-22T23:35:05Z","timestamp":1711150505000},"update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Learning Deep Embedding with Acoustic and Phoneme Features for Speaker Recognition in FM Broadcasting"],"prefix":"10.1049","volume":"2024","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-8580-4177","authenticated-orcid":false,"given":"Xiao","family":"Li","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-2181-0084","authenticated-orcid":false,"given":"Xiao","family":"Chen","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-9919-0340","authenticated-orcid":false,"given":"Rui","family":"Fu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-2128-3382","authenticated-orcid":false,"given":"Xiao","family":"Hu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-6243-7349","authenticated-orcid":false,"given":"Mintong","family":"Chen","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1877-5982","authenticated-orcid":false,"given":"Kun","family":"Niu","sequence":"additional","affilia
tion":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"265","published-online":{"date-parts":[[2024,3,22]]},"reference":[{"key":"e_1_2_10_1_2","doi-asserted-by":"crossref","unstructured":"LiX. ChenX. WangD. GuoZ. andNiuK. Deep speaker embedding with multi-part information aggregation in frequency-time domain for ASV 2022 IEEE 46th Annual Computers Software and Applications Conference (COMPSAC) June 2022 Los Alamitos CA USA IEEE 8\u201313 https:\/\/doi.org\/10.1109\/COMPSAC54236.2022.00011.","DOI":"10.1109\/COMPSAC54236.2022.00011"},{"key":"e_1_2_10_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2008.931100"},{"key":"e_1_2_10_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2021.3082303"},{"key":"e_1_2_10_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2017.2767587"},{"key":"e_1_2_10_5_2","doi-asserted-by":"publisher","DOI":"10.1006\/dspr.1999.0361"},{"key":"e_1_2_10_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2010.2064307"},{"key":"e_1_2_10_7_2","doi-asserted-by":"crossref","unstructured":"Garcia-RomeroD.andEspy-WilsonC. Y. Analysis of i-vector length normalization in speaker recognition systems INTERSPEECH 2011 12th Annual Conference of the International Speech Communication Association 2011 Florence Italy ISCA 27\u201331 https:\/\/doi.org\/10.21437\/Interspeech.2011-53.","DOI":"10.21437\/Interspeech.2011-53"},{"key":"e_1_2_10_8_2","unstructured":"AbadiM. AgarwalA. BarhamP. BrevdoE. ChenZ. CitroC. CorradoG. S. DavisA. DeanJ. DevinM. GhemawatS. GoodfellowI. HarpA. IrvingG. IsardM. JiaY. JozefowiczR. KaiserL. KudlurM. LevenbergJ. ManeD. MongaR. MooreS. MurrayD. OlahC. SchusterM. ShlensJ. SteinerB. SutskeverI. TalwarK. TuckerP. VanhouckeV. VasudevanV. ViegasF. VinyalsO. WardenP. WattenbergM. WickeM. YuY. andZhengX. Tensorflow: large-scale machine learning on heterogeneous distributed systems 2016 https:\/\/arxiv.org\/pdf\/1603.04467.pdf."},{"key":"e_1_2_10_9_2","unstructured":"PaszkeA. GrossS. ChintalaS. ChananG. 
YangE. DeVitoZ. LinZ. DesmaisonA. AntigaL. andLererA. Automatic differentiation in PyTorch 31st Conference on Neural Information Processing Systems (NIPS 2017) 2017 Long Beach CA USA."},{"key":"e_1_2_10_10_2","doi-asserted-by":"crossref","unstructured":"VedaldiA.andLencK. MatConvNet: convolutional neural networks for MATLAB MM \u201915: Proceedings of the 23rd ACM International Conference on Multimedia October 2015 Brisbane Australia Association for Computing Machinery 689\u2013692 https:\/\/doi.org\/10.1145\/2733373.2807412 2-s2.0-84962815548.","DOI":"10.1145\/2733373.2807412"},{"key":"e_1_2_10_11_2","doi-asserted-by":"crossref","unstructured":"NagraniA. ChungJ. S. andZissermanA. VoxCeleb: a large-scale speaker identification dataset INTERSPEECH August 2017 Stockholm Sweden ISCA 20\u201324 https:\/\/doi.org\/10.21437\/Interspeech.2017-950 2-s2.0-85039159334.","DOI":"10.21437\/Interspeech.2017-950"},{"key":"e_1_2_10_12_2","doi-asserted-by":"crossref","unstructured":"BuH. DuJ. NaX. WuB. andZhengH. AISHELL-1: an open-source Mandarin speech corpus and a speech recognition baseline 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I\/O Systems and Assessment (O-COCOSDA) November 2017 Seoul Korea (South) IEEE 1\u20135 https:\/\/doi.org\/10.1109\/ICSDA.2017.8384449 2-s2.0-85040707435.","DOI":"10.1109\/ICSDA.2017.8384449"},{"key":"e_1_2_10_13_2","doi-asserted-by":"crossref","unstructured":"ChungJ. S. NagraniA. andZissermanA. VoxCeleb2: deep speaker recognition INTERSPEECH September 2018 Graz Austria ISCA 1086\u20131090 https:\/\/doi.org\/10.21437\/Interspeech.2018-1929 2-s2.0-85054964925.","DOI":"10.21437\/Interspeech.2018-1929"},{"key":"e_1_2_10_14_2","doi-asserted-by":"crossref","unstructured":"VarianiE. LeiX. McDermottE. MorenoI. L. andGonzalez-DominguezJ. 
Deep neural networks for small footprint text-dependent speaker verification 2014 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) May 2014 Florence Italy IEEE 4052\u20134056 https:\/\/doi.org\/10.1109\/ICASSP.2014.6854363 2-s2.0-84905252894.","DOI":"10.1109\/ICASSP.2014.6854363"},{"key":"e_1_2_10_15_2","doi-asserted-by":"crossref","unstructured":"SnyderD. Garcia-RomeroD. SellG. PoveyD. andKhudanpurS. X-Vectors: robust DNN embeddings for speaker recognition 2018 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) ICASSP April 2018 Calgary AB Canada IEEE 5329\u20135333 https:\/\/doi.org\/10.1109\/ICASSP.2018.8461375 2-s2.0-85054205942.","DOI":"10.1109\/ICASSP.2018.8461375"},{"key":"e_1_2_10_16_2","unstructured":"ZeinaliH. WangS. SilnovaA. MatejkaP. andPlchotO. But system description to voxceleb speaker recognition challenge 2019 https:\/\/doi.org\/10.48550\/arXiv.1910.12592."},{"key":"e_1_2_10_17_2","doi-asserted-by":"crossref","unstructured":"TawaraN. OgawaA. IwataT. DelcroixM. andOgawaT. Frame-level phoneme-invariant speaker embedding for text-independent speaker recognition on extremely short utterances ICASSP 2020 - 2020 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) May 2020 Barcelona Spain IEEE 6799\u20136803 https:\/\/doi.org\/10.1109\/ICASSP40776.2020.9053871.","DOI":"10.1109\/ICASSP40776.2020.9053871"},{"key":"e_1_2_10_18_2","doi-asserted-by":"crossref","unstructured":"YunL. SchefferN. FerrerL. andMcLarenM. A novel scheme for speaker recognition using a phonetically-aware deep neural network 2014 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) May 2014 Florence Italy IEEE 1695\u20131699 https:\/\/doi.org\/10.1109\/ICASSP.2014.6853887 2-s2.0-84905252132.","DOI":"10.1109\/ICASSP.2014.6853887"},{"key":"e_1_2_10_19_2","doi-asserted-by":"crossref","unstructured":"WangS. RohdinJ. BurgetL. PlchotO. QianY. YuK. and\u010cernock\u00fdJ. 
On the usage of phonetic information for text-independent speaker embedding extraction INTERSPEECH September 2019 Graz Austria ISCA 1148\u20131152 https:\/\/doi.org\/10.21437\/Interspeech.2019-3036.","DOI":"10.21437\/Interspeech.2019-3036"},{"key":"e_1_2_10_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSA.2004.840940"},{"key":"e_1_2_10_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2019.2941773"},{"key":"e_1_2_10_22_2","doi-asserted-by":"crossref","unstructured":"LiW. QuanW. PapirA. andMorenoI. L. Generalized end-to-end loss for speaker verification 2018 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) April 2018 Calgary AB Canada IEEE Press 4879\u20134883 https:\/\/doi.org\/10.1109\/ICASSP.2018.8462665 2-s2.0-85054234395.","DOI":"10.1109\/ICASSP.2018.8462665"},{"key":"e_1_2_10_23_2","doi-asserted-by":"crossref","unstructured":"TangY. DingG. HuangJ. HeX. andZhouB. Deep speaker embedding learning with multi-level pooling for text-independent speaker verification ICASSP 2019\u20142019 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) May 2019 Brighton UK IEEE 6116\u20136120 https:\/\/doi.org\/10.1109\/ICASSP.2019.8682712 2-s2.0-85067112298.","DOI":"10.1109\/ICASSP.2019.8682712"},{"key":"e_1_2_10_24_2","doi-asserted-by":"crossref","unstructured":"TorfiA. DawsonJ. andNasrabadiN. M. Text-independent speaker verification using 3d convolutional neural networks 2018 IEEE International Conference on Multimedia and Expo (ICME) 2018 Los Alamitos CA USA IEEE Computer Society 1\u20136 https:\/\/doi.org\/10.1109\/ICME.2018.8486441 2-s2.0-85061429993.","DOI":"10.1109\/ICME.2018.8486441"},{"key":"e_1_2_10_25_2","doi-asserted-by":"crossref","unstructured":"SnyderD. Garcia-RomeroD. PoveyD. andKhudanpurS. 
Deep neural network embeddings for text-independent speaker verification INTERSPEECH August 2017 Stockholm Sweden ISCA 999\u20131003 https:\/\/doi.org\/10.21437\/Interspeech.2017-620 2-s2.0-85039167983.","DOI":"10.21437\/Interspeech.2017-620"},{"key":"e_1_2_10_26_2","doi-asserted-by":"crossref","unstructured":"KimS.-H.andParkY.-H. Adaptive convolutional neural network for text-independent speaker recognition INTERSPEECH 2021 Brno Czechia ISCA 66\u201370 https:\/\/doi.org\/10.21437\/Interspeech.2021-65.","DOI":"10.21437\/Interspeech.2021-65"},{"key":"e_1_2_10_27_2","doi-asserted-by":"crossref","unstructured":"HanS. ByunJ. andShinJ. W. Time-domain speaker verification using temporal convolutional networks ICASSP 2021\u20142021 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) June 2021 Toronto ON Canada IEEE 6688\u20136692 https:\/\/doi.org\/10.1109\/ICASSP39728.2021.9414765.","DOI":"10.1109\/ICASSP39728.2021.9414765"},{"key":"e_1_2_10_28_2","doi-asserted-by":"crossref","unstructured":"CaiW. ChenJ. andLiM. Exploring the encoding layer and loss function in end-to-end speaker and language recognition system The Speaker and Language Recognition Workshop (Odyssey 2018) June 2018 Les Sables d\u2019Olonne France ISCA 74\u201381 https:\/\/doi.org\/10.21437\/Odyssey.2018-11.","DOI":"10.21437\/Odyssey.2018-11"},{"key":"e_1_2_10_29_2","doi-asserted-by":"crossref","unstructured":"WangZ. YaoK. LiX. andFangS. Multi-resolution multi-head attention in deep speaker embedding ICASSP 2020\u20142020 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) May 2020 Barcelona Spain IEEE 6464\u20136468 https:\/\/doi.org\/10.1109\/ICASSP40776.2020.9053217.","DOI":"10.1109\/ICASSP40776.2020.9053217"},{"key":"e_1_2_10_30_2","doi-asserted-by":"crossref","unstructured":"MaX. LiangT. ZhangS. HuangS. andHeL. 
Improved Light CNN with attention modules for ASV spoofing detection 2021 IEEE International Conference on Multimedia and Expo (ICME) July 2021 Shenzhen China IEEE 1\u20136 https:\/\/doi.org\/10.1109\/ICME51207.2021.9428313.","DOI":"10.1109\/ICME51207.2021.9428313"},{"key":"e_1_2_10_31_2","doi-asserted-by":"crossref","unstructured":"LiX.andWuX. Modeling speaker variability using long short-term memory networks for speech recognition INTERSPEECH September 2015 Dresden Germany ISCA 1086\u20131090 https:\/\/doi.org\/10.21437\/Interspeech.2015-287.","DOI":"10.21437\/Interspeech.2015-287"},{"key":"e_1_2_10_32_2","doi-asserted-by":"crossref","unstructured":"LiuY. HeL. LiuJ. andJohnsonM. T. Speaker embedding extraction with phonetic information INTERSPEECH September 2018 Hyderabad India ISCA 2247\u20132251 https:\/\/doi.org\/10.21437\/Interspeech.2018-1226 2-s2.0-85054953560.","DOI":"10.21437\/Interspeech.2018-1226"},{"key":"e_1_2_10_33_2","doi-asserted-by":"crossref","unstructured":"PironkovG. DupontS. andDutoitT. Speaker-aware long short-term memory multi-task learning for speech recognition 2016 24th European Signal Processing Conference (EUSIPCO) August 2016 Budapest Hungary IEEE 1911\u20131915 https:\/\/doi.org\/10.1109\/EUSIPCO.2016.7760581 2-s2.0-85006041106.","DOI":"10.1109\/EUSIPCO.2016.7760581"},{"key":"e_1_2_10_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2016.2639323"},{"key":"e_1_2_10_35_2","doi-asserted-by":"crossref","unstructured":"SaonG. SoltauH. NahamooD. andPichenyM. Speaker adaptation of neural network acoustic models using i-vectors 2013 IEEE Workshop on Automatic Speech Recognition and Understanding December 2014 Olomouc Czech Republic IEEE 55\u201359 https:\/\/doi.org\/10.1109\/ASRU.2013.6707705 2-s2.0-84893691530.","DOI":"10.1109\/ASRU.2013.6707705"},{"key":"e_1_2_10_36_2","doi-asserted-by":"crossref","unstructured":"KaranasouP. WangY. GalesM. J. F. andWoodlandP. C. 
Adaptation of deep neural network acoustic models using factorised i-vectors Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH) September 2014 Singapore ISCA 2184\u20132180 https:\/\/doi.org\/10.21437\/Interspeech.2014-488.","DOI":"10.21437\/Interspeech.2014-488"},{"key":"e_1_2_10_37_2","doi-asserted-by":"crossref","unstructured":"SilnovaA. MatejkaP. GlembekO. PlchotO. NovotnyO. GrezlF. SchwarzP. BurgetL. andCernockyJ. BUT\/phonexia bottleneck feature extractor The Speaker and Language Recognition Workshop (Odyssey 2018) June 2018 Les Sables d\u2019Olonne France ISCA 283\u2013287 https:\/\/doi.org\/10.21437\/Odyssey.2018-40.","DOI":"10.21437\/Odyssey.2018-40"},{"key":"e_1_2_10_38_2","doi-asserted-by":"crossref","unstructured":"SchneiderS. BaevskiA. CollobertR. andAuliM. wav2vec: unsupervised pre-training for speech recognition 2019 https:\/\/doi.org\/10.48550\/arXiv.1904.05862.","DOI":"10.21437\/Interspeech.2019-1873"},{"key":"e_1_2_10_39_2","doi-asserted-by":"crossref","unstructured":"LiX. HuX. ChenX. PanH. andNiuK. Deep speaker embedding using hybrid network of multi-feature aggregation and multi-loss fusion for TI-SV 2022 26th International Conference on Pattern Recognition (ICPR) ICPR August 2022 Montreal QC Canada IEEE 506\u2013512 https:\/\/doi.org\/10.1109\/ICPR56361.2022.9956552.","DOI":"10.1109\/ICPR56361.2022.9956552"},{"key":"e_1_2_10_40_2","doi-asserted-by":"crossref","unstructured":"XiangX. WangS. HuangH. QianY. andYuK. Margin matters: towards more discriminative deep neural network embeddings for speaker recognition 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) November 2019 Lanzhou China IEEE 1652\u20131656 https:\/\/doi.org\/10.1109\/APSIPAASC47483.2019.9023039.","DOI":"10.1109\/APSIPAASC47483.2019.9023039"},{"key":"e_1_2_10_41_2","unstructured":"LamelL. F. KasselR. H. andSeneffS. 
Speech database development: design and analysis of the acoustic-phonetic corpus Speech Input\/Output Assessment and Speech Databases September 1989 The Netherlands ISCA 161\u2013170."},{"key":"e_1_2_10_42_2","doi-asserted-by":"crossref","unstructured":"ZhuY.andMakB. Orthogonal training for text-independent speaker verification ICASSP 2020\u20142020 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) May 2020 Barcelona Spain IEEE 6584\u20136588 https:\/\/doi.org\/10.1109\/ICASSP40776.2020.9053198.","DOI":"10.1109\/ICASSP40776.2020.9053198"},{"key":"e_1_2_10_43_2","doi-asserted-by":"crossref","unstructured":"ZhangH. WangL. LeeK. A. LiuM. DangJ. andChenH. Meta-learning for cross-channel speaker verification ICASSP 2021\u20142021 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) June 2021 Toronto ON Canada IEEE 5839\u20135843 https:\/\/doi.org\/10.1109\/ICASSP39728.2021.9413978.","DOI":"10.1109\/ICASSP39728.2021.9413978"}],"container-title":["IET 
Biometrics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/downloads.hindawi.com\/journals\/ietbm\/2024\/6694481.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/downloads.hindawi.com\/journals\/ietbm\/2024\/6694481.xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/pdf\/10.1049\/2024\/6694481","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,5]],"date-time":"2025-11-05T17:24:27Z","timestamp":1762363467000},"score":1,"resource":{"primary":{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/10.1049\/2024\/6694481"}},"subtitle":[],"editor":[{"given":"Ahilan","family":"Kanagasundaram","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2024,1]]},"references-count":43,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,1]]}},"alternative-id":["10.1049\/2024\/6694481"],"URL":"https:\/\/doi.org\/10.1049\/2024\/6694481","archive":["Portico"],"relation":{},"ISSN":["2047-4938","2047-4946"],"issn-type":[{"type":"print","value":"2047-4938"},{"type":"electronic","value":"2047-4946"}],"subject":[],"published":{"date-parts":[[2024,1]]},"assertion":[{"value":"2023-07-12","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-03-01","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-03-22","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}],"article-number":"6694481"}}