{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,2]],"date-time":"2026-05-02T06:51:27Z","timestamp":1777704687335,"version":"3.51.4"},"reference-count":47,"publisher":"SAGE Publications","issue":"5","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["IFS"],"published-print":{"date-parts":[[2023,11,4]]},"abstract":"<jats:p>The speaker diarization task pertains to the automated differentiation of speakers within an audio recording, while lacking any prior information regarding the speakers. The introduction of the self-attention mechanism in End-to-End Neural Speaker Diarization (EEND) has elegantly resolved the issue of overlapping speakers. The Transformer model equipped with self-attention mechanism has shown great potential in collecting global information, yielding remarkable outcomes in various tasks. However, the individual speaker characteristics are predominantly reflected in the contextual information, which conventional self-attention would not adequately address. In this study, we propose a hierarchical encoders model to augment the encoders\u2019 acquisition of speaker information in two distinct ways: (1) Constraining the perceptual field of the self-attentive mechanism with left-right windows or Gaussian weights to highlight contextual information; (2) Utilizing a pre-trained time-delay neural network based speaker embedding extractor to alleviate the shortcomings of speaker feature extraction ability. We evaluate the proposed methods on a simulated dataset of two speakers and a real conversation dataset. The model with the most favorable outcomes among the proposed enhancements achieves a diarization error rate of 7.74% on the simulated dataset and 21.92% on MagicData-RAMC after adaptation. 
These results compellingly demonstrate the efficacy of the proposed methods.<\/jats:p>","DOI":"10.3233\/jifs-230249","type":"journal-article","created":{"date-parts":[[2023,5,30]],"date-time":"2023-05-30T11:17:31Z","timestamp":1685445451000},"page":"9169-9180","source":"Crossref","is-referenced-by-count":1,"title":["Speaker diarization with variants of self-attention and joint speaker embedding extractor"],"prefix":"10.1177","volume":"45","author":[{"given":"Pengbin","family":"Fu","sequence":"first","affiliation":[{"name":"Faculty of Information Technology, Beijing university of technology, Xidawang Road, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yuchen","family":"Ma","sequence":"additional","affiliation":[{"name":"Faculty of Information Technology, Beijing university of technology, Xidawang Road, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Huirong","family":"Yang","sequence":"additional","affiliation":[{"name":"Faculty of Information Technology, Beijing university of technology, Xidawang Road, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"179","reference":[{"key":"10.3233\/JIFS-230249_ref1","doi-asserted-by":"crossref","first-page":"101317","DOI":"10.1016\/j.csl.2021.101317","article-title":"A review of speaker diarization: Recent advances with deep learning","volume":"72","author":"Park","year":"2022","journal-title":"Computer Speech & Language"},{"key":"10.3233\/JIFS-230249_ref2","doi-asserted-by":"crossref","unstructured":"Kanda N. 
, et al., Acoustic modeling for distant multi-talker speech recognition with single-and multi-channel branches, in ICASSP 2019\u20132019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, IEEE.","DOI":"10.1109\/ICASSP.2019.8682273"},{"key":"10.3233\/JIFS-230249_ref3","first-page":"2019","article-title":"Joint speech recognition and speaker diarization via sequence transduction[J]","author":"Shafey","journal-title":"arXiv preprint arXiv:1907.05337"},{"key":"10.3233\/JIFS-230249_ref4","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1109\/ASRU46091.2019.9004009","article-title":"Simultaneous speech recognition and speaker diarization for monaural dialogue recordings with target-speaker acoustic models[C]","author":"Kanda","year":"2019","journal-title":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)"},{"key":"10.3233\/JIFS-230249_ref5","doi-asserted-by":"crossref","first-page":"101254","DOI":"10.1016\/j.csl.2021.101254","article-title":"Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: Theory, implementation and analysis on standard tasks","volume":"71","author":"Landini","year":"2022","journal-title":"Computer Speech & Language"},{"issue":"4","key":"10.3233\/JIFS-230249_ref6","first-page":"561","article-title":"Metaheuristic adapted convolutional neural network for Telugu speaker diarization[J]","volume":"15","author":"Prasad","year":"2021","journal-title":"Intelligent Decision Technologies"},{"key":"10.3233\/JIFS-230249_ref7","doi-asserted-by":"crossref","unstructured":"Wang Q. , et al., Speaker diarization with LSTM, in2018 IEEE International conference on acoustics, speech and signal processing (ICASSP), 2018. 
IEEE.","DOI":"10.1109\/ICASSP.2018.8462628"},{"key":"10.3233\/JIFS-230249_ref8","first-page":"2019","article-title":"End-to-End Neural Speaker Diarization with Self-Attention","author":"Fujita","journal-title":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)"},{"key":"10.3233\/JIFS-230249_ref9","first-page":"2020","article-title":"End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors","author":"Horiguchi","journal-title":"arXiv preprint arXiv:2005.09921"},{"key":"10.3233\/JIFS-230249_ref10","first-page":"2021","article-title":"End-to-end neural diarization: From transformer to conformer","author":"Liu","journal-title":"arXiv preprint arXiv:2106.07167"},{"issue":"1","key":"10.3233\/JIFS-230249_ref11","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13636-022-00260-9","article-title":"AUC optimization for deep learning-based voice activity detection","volume":"2022","author":"Zhang","year":"2022","journal-title":"EURASIP Journal on Audio, Speech, and Music Processing"},{"key":"10.3233\/JIFS-230249_ref12","first-page":"2020","article-title":"Target-speaker voice activity detection: a novel approach for multi-speaker diarization in a dinner party scenario","author":"Medennikov","journal-title":"arXiv preprint arXiv:2005.07272"},{"issue":"8","key":"10.3233\/JIFS-230249_ref13","doi-asserted-by":"crossref","first-page":"1181","DOI":"10.1109\/LSP.2018.2811740","article-title":"Voice activity detection using an adaptive context attention model","volume":"25","author":"Kim","year":"2018","journal-title":"IEEE Signal Processing Letters"},{"key":"10.3233\/JIFS-230249_ref14","first-page":"2022","article-title":"Incorporating End-to-End Framework Into Target-Speaker Voice Activity Detection","author":"Wang","journal-title":"2022\u20132022 IEEE International Conference on Acoustics, Speech and Signal Processing 
(ICASSP)"},{"key":"10.3233\/JIFS-230249_ref15","first-page":"1999","article-title":"Speaker-based segmentation for audio data indexing","author":"Delacourt","journal-title":"ESCA Tutorial and Research Workshop (ETRW) on Accessing Information in Spoken Audio"},{"key":"10.3233\/JIFS-230249_ref16","doi-asserted-by":"crossref","unstructured":"Malegaonkar A. , et al., Unsupervised speaker change detection using probabilistic pattern matching, 13(8) (2006), 509\u2013512.","DOI":"10.1109\/LSP.2006.873656"},{"issue":"4","key":"10.3233\/JIFS-230249_ref17","doi-asserted-by":"crossref","first-page":"788","DOI":"10.1109\/TASL.2010.2064307","article-title":"Front-end factor analysis for speaker verification","volume":"19","author":"Dehak","year":"2010","journal-title":"IEEE Transactions on Audio, Speech, and Language Processing"},{"issue":"4","key":"10.3233\/JIFS-230249_ref18","doi-asserted-by":"crossref","first-page":"4937","DOI":"10.3233\/JIFS-181359","article-title":"Speaker identification using fuzzy I-vector tree[J]","volume":"37","author":"Ga\u0142ka","year":"2019","journal-title":"Journal of Intelligent & Fuzzy Systems"},{"key":"10.3233\/JIFS-230249_ref19","first-page":"2013","article-title":"Speaker adaptation of neural network acoustic models using i-vectors","author":"Saon","journal-title":"2013 IEEE Workshop on Automatic Speech Recognition and Understanding"},{"key":"10.3233\/JIFS-230249_ref20","first-page":"2017","article-title":"Speaker diarization using deep neural network embeddings","author":"Garcia-Romero","journal-title":"2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)"},{"key":"10.3233\/JIFS-230249_ref21","first-page":"2016","article-title":"Deep neural network-based speaker embeddings for end-to-end speaker verification","author":"Snyder","journal-title":"2016 IEEE Spoken Language Technology Workshop (SLT)"},{"key":"10.3233\/JIFS-230249_ref22","first-page":"2018","article-title":"X-Vectors: Robust DNN Embeddings for 
Speaker Recognition","author":"Snyder","journal-title":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)"},{"key":"10.3233\/JIFS-230249_ref23","first-page":"2019","article-title":"Speaker diarization for multi-speaker conversations via x-vectors","author":"Zhang","journal-title":"2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP)"},{"key":"10.3233\/JIFS-230249_ref24","unstructured":"J. Luque and J. Hernando, On the use of agglomerative and spectral clustering in speaker diarization of meetings, 2012."},{"key":"10.3233\/JIFS-230249_ref25","doi-asserted-by":"crossref","unstructured":"Q. Lin, et al., LSTM based similarity measurement with spectral clustering for speaker diarization, 2019.","DOI":"10.21437\/Interspeech.2019-1388"},{"key":"10.3233\/JIFS-230249_ref26","doi-asserted-by":"crossref","unstructured":"K.J. Han and S.S. Narayanan, A robust stopping criterion for agglomerative hierarchical clustering in a speaker diarization system, 2007.","DOI":"10.21437\/Interspeech.2007-516"},{"key":"10.3233\/JIFS-230249_ref27","unstructured":"J.E. Rougui, et al., Fast incremental clustering of gaussian mixture speaker models for scaling up retrieval in on-line broadcast, 2006. IEEE."},{"key":"10.3233\/JIFS-230249_ref28","doi-asserted-by":"crossref","unstructured":"S. Novoselov, et al., Speaker diarization with deep speaker embeddings for DIHARD challenge II, 2019.","DOI":"10.21437\/Interspeech.2019-2757"},{"key":"10.3233\/JIFS-230249_ref29","first-page":"2018","article-title":"The first DIHARD speech diarization challenge","author":"Ryant","journal-title":"Proceedings of Interspeech 2018"},{"key":"10.3233\/JIFS-230249_ref30","first-page":"2019","article-title":"The second dihard diarization challenge: Dataset, task, and baselines","author":"Ryant","journal-title":"arXiv preprint arXiv:1906.07839"},{"key":"10.3233\/JIFS-230249_ref31","doi-asserted-by":"crossref","unstructured":"Watanabe S. 
, et al., CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings, 2020.","DOI":"10.21437\/CHiME.2020-1"},{"issue":"17","key":"10.3233\/JIFS-230249_ref32","first-page":"6000","article-title":"Attention is All You Need","author":"Vaswani","journal-title":"Nips\u20192017"},{"key":"10.3233\/JIFS-230249_ref33","unstructured":"J. Devlin, et al., Bert: Pre-training of deep bidirectional transformers for language understanding, 2018."},{"key":"10.3233\/JIFS-230249_ref34","first-page":"2021","article-title":"Swin transformer: Hierarchical vision transformer using shifted windows","author":"Liu","journal-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision"},{"key":"10.3233\/JIFS-230249_ref35","first-page":"2020","article-title":"Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition","author":"Dong","journal-title":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)"},{"key":"10.3233\/JIFS-230249_ref36","first-page":"2017","article-title":"A structured self-attentive sentence embedding,","author":"Lin","journal-title":"arXiv preprint arXiv:1703.03130"},{"key":"10.3233\/JIFS-230249_ref37","first-page":"2020","article-title":"Longformer: The long-document transformer","author":"Beltagy","journal-title":"arXiv preprint arXiv:2004.05150"},{"key":"10.3233\/JIFS-230249_ref38","doi-asserted-by":"crossref","unstructured":"Guo Maosheng , Zhang Yu and Liu Ting , Gaussian transformer: a lightweight approach for natural language inference, Proceedings of the AAAI Conference on Artificial Intelligence 33(01) (2019).","DOI":"10.1609\/aaai.v33i01.33016489"},{"key":"10.3233\/JIFS-230249_ref39","first-page":"2021","article-title":"ECAPA-TDNN embeddings for speaker diarization","author":"Dawalatabad","journal-title":"arXiv preprint arXiv:2104.01466"},{"key":"10.3233\/JIFS-230249_ref40","first-page":"2020","article-title":"Ecapa-tdnn: Emphasized channel attention, 
propagation and aggregation in tdnn based speaker verification","author":"Desplanques","journal-title":"arXiv preprint arXiv:2005.07143"},{"key":"10.3233\/JIFS-230249_ref41","first-page":"2022","article-title":"Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset","author":"Yang","journal-title":"arXiv preprint arXiv:2203.16844"},{"key":"10.3233\/JIFS-230249_ref42","first-page":"2015","article-title":"Musan: A music, speech, and noise corpus","author":"Snyder","journal-title":"arXiv preprint arXiv:1510.08484"},{"key":"10.3233\/JIFS-230249_ref43","first-page":"2017","article-title":"A study on data augmentation of reverberant speech for robust speech recognition","author":"Ko","journal-title":"2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)"},{"key":"10.3233\/JIFS-230249_ref44","unstructured":"Kingma Diederik P. and Ba Jimmy , Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014)."},{"issue":"1","key":"10.3233\/JIFS-230249_ref45","doi-asserted-by":"crossref","first-page":"e0245230","DOI":"10.1371\/journal.pone.0245230","article-title":"Multi-view classification with convolutional neural networks[J]","volume":"16","author":"Seeland","year":"2021","journal-title":"Plos One"},{"issue":"21","key":"10.3233\/JIFS-230249_ref46","doi-asserted-by":"crossref","first-page":"18473","DOI":"10.1007\/s00521-022-07454-4","article-title":"Adam or Eve? 
Automatic users\u2019 gender classification via gestures analysis on touch devices[J]","volume":"34","author":"Guarino","year":"2022","journal-title":"Neural Computing and Applications"},{"key":"10.3233\/JIFS-230249_ref47","doi-asserted-by":"crossref","first-page":"15803","DOI":"10.1007\/s11042-020-10446-y","article-title":"Techno-regulation and intelligent safeguards: Analysis of touch gestures for online child protection[J]","volume":"80","author":"Zaccagnino","year":"2021","journal-title":"Multimedia Tools and Applications"}],"container-title":["Journal of Intelligent & Fuzzy Systems"],"original-title":[],"link":[{"URL":"https:\/\/content.iospress.com\/download?id=10.3233\/JIFS-230249","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T09:41:47Z","timestamp":1777455707000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/full\/10.3233\/JIFS-230249"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,11,4]]},"references-count":47,"journal-issue":{"issue":"5"},"URL":"https:\/\/doi.org\/10.3233\/jifs-230249","relation":{},"ISSN":["1064-1246","1875-8967"],"issn-type":[{"value":"1064-1246","type":"print"},{"value":"1875-8967","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,11,4]]}}}