{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,9]],"date-time":"2026-03-09T00:19:34Z","timestamp":1773015574330,"version":"3.50.1"},"reference-count":37,"publisher":"Wiley","issue":"3","license":[{"start":{"date-parts":[[2026,2,1]],"date-time":"2026-02-01T00:00:00Z","timestamp":1769904000000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"},{"start":{"date-parts":[[2026,2,1]],"date-time":"2026-02-01T00:00:00Z","timestamp":1769904000000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/doi.wiley.com\/10.1002\/tdm_license_1.1"}],"funder":[{"DOI":"10.13039\/501100005247","name":"University of British Columbia","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100005247","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["Expert Systems"],"published-print":{"date-parts":[[2026,3]]},"abstract":"<jats:title>ABSTRACT<\/jats:title>\n                  <jats:p>Although speaker diarization has evolved to be more robust and more refined, including incorporating modern automatic speech recognition (ASR), current systems still suffer from several disruptive factors, like noise. We comprehensively evaluate the limitations of current diarization systems to uncover the underlying causes that hinder accuracy. Five open\u2010source diarization pipelines\u2014both diarization\u2010only and joint ASR and diarization systems\u2014are assessed on a set of heterogeneous benchmark data sets. We compare the performance of joint pipelines against those of diarization\u2010only systems, and analyse which audio characteristics hinder speaker discrimination, as well as the impact of using the speaker count as an input parameter. 
Our results indicate that diarization\u2010only and joint approaches are competitive with each other in unsupervised scenarios, and that providing the speaker count does not improve performance consistently. We also identify short audio duration and low speech\u2010to\u2010noise ratio (SNR) as the most impairing properties. We recommend using speech representation learning to further uncover underlying factors that affect diarization, pre\u2010processing techniques to remove noise, and performing hyper\u2010parameter tuning on, for example, the speech window length, and speech detection thresholds.<\/jats:p>","DOI":"10.1111\/exsy.70221","type":"journal-article","created":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T00:40:34Z","timestamp":1769992834000},"update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["On the Limitations of Speaker Diarization"],"prefix":"10.1111","volume":"43","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1359-2095","authenticated-orcid":false,"given":"Joana","family":"Amorim","sequence":"first","affiliation":[{"name":"Faculty of Computer Science Dalhousie University  Halifax Nova Scotia Canada"},{"name":"Vector Institute for Artificial Intelligence  Toronto Ontario Canada"}]},{"given":"Jo\u00e3o","family":"Pimentel","sequence":"additional","affiliation":[{"name":"Faculty of Computer Science Dalhousie University  Halifax Nova Scotia Canada"},{"name":"Vector Institute for Artificial Intelligence  Toronto Ontario Canada"}]},{"given":"Frank","family":"Rudzicz","sequence":"additional","affiliation":[{"name":"Faculty of Computer Science Dalhousie University  Halifax Nova Scotia Canada"},{"name":"Vector Institute for Artificial Intelligence  Toronto Ontario Canada"}]}],"member":"311","published-online":{"date-parts":[[2026,2,1]]},"reference":[{"key":"e_1_2_11_2_1","volume-title":"CALLHOME American English Speech","author":"Alexandra Canavan D. 
G.","year":"1997"},{"key":"e_1_2_11_3_1","doi-asserted-by":"crossref","unstructured":"Bain M., J. Huh, T. Han, and A. Zisserman. 2023. \u201cWhisperX: Time\u2010Accurate Speech Transcription of Long\u2010Form Audio.\u201d In Proc. Interspeech 2023 4489\u20134493.","DOI":"10.21437\/Interspeech.2023-78"},{"key":"e_1_2_11_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2013.50"},{"key":"e_1_2_11_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9052974"},{"key":"e_1_2_11_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/11677482_3"},{"key":"e_1_2_11_7_1","doi-asserted-by":"crossref","unstructured":"Chung J. S., J. Huh, A. Nagrani, T. Afouras, and A. Zisserman. 2020. \u201cSpot the Conversation: Speaker Diarisation in the Wild.\u201d In Proc. Interspeech 2020 299\u2013303.","DOI":"10.21437\/Interspeech.2020-2337"},{"key":"e_1_2_11_8_1","doi-asserted-by":"crossref","unstructured":"Cui C., I. Sheikh, M. Sadeghi, and E. Vincent. 2024. \u201cImproving Speaker Assignment in Speaker\u2010Attributed ASR for Real Meeting Applications.\u201d In The Speaker and Language Recognition Workshop (Odyssey 2024) 99\u2013106.","DOI":"10.21437\/odyssey.2024-15"},{"key":"e_1_2_11_9_1","doi-asserted-by":"crossref","unstructured":"Del Rio M., N. Delworth, R. Westerman, et\u00a0al. 2021. \u201cEarnings\u201021: A Practical Benchmark for ASR in the Wild.\u201d In Proc. Interspeech 2021 3465\u20133469.","DOI":"10.21437\/Interspeech.2021-1915"},{"key":"e_1_2_11_10_1","doi-asserted-by":"crossref","unstructured":"Desplanques B., J. Thienpondt, and K. Demuynck. 2020. \u201cECAPA\u2010TDNN: Emphasized Channel Attention Propagation and Aggregation in TDNN Based Speaker Verification.\u201d In Proc. Interspeech 2020 3830\u20133834.","DOI":"10.21437\/Interspeech.2020-2650"},{"key":"e_1_2_11_11_1","doi-asserted-by":"crossref","unstructured":"Fu Y., 
L. Cheng, S. Lv, et\u00a0al. 2021. \u201cAISHELL\u20104: An Open Source Dataset for Speech Enhancement Separation Recognition and Speaker Diarization in Conference Scenario.\u201d In Proc. Interspeech 2021 3665\u20133669.","DOI":"10.21437\/Interspeech.2021-1397"},{"key":"e_1_2_11_12_1","doi-asserted-by":"crossref","unstructured":"Gauvain J. L., L. F. Lamel, and G. Adda. 1998. \u201cPartitioning and Transcription of Broadcast News Data.\u201d In Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998) paper 0084.","DOI":"10.21437\/ICSLP.1998-618"},{"key":"e_1_2_11_13_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0144610"},{"key":"e_1_2_11_14_1","unstructured":"Grauman K., A. Westbury, E. Byrne, et\u00a0al. 2022. \u201cEgo4D: Around the World in 3 000 Hours of Egocentric Video.\u201d In 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 18973\u201318990."},{"key":"e_1_2_11_15_1","doi-asserted-by":"crossref","unstructured":"Jia F., S. Majumdar, and B. Ginsburg. 2021. \u201cMarbleNet: Deep 1D Time\u2010Channel Separable Convolutional Neural Network for Voice Activity Detection.\u201d In 2021 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) 6818\u20136822.","DOI":"10.1109\/ICASSP39728.2021.9414470"},{"key":"e_1_2_11_16_1","doi-asserted-by":"crossref","unstructured":"Kanda N., X. Xiao, Y. Gaur, et\u00a0al. 2022. \u201cTranscribe\u2010to\u2010Diarize: Neural Speaker Diarization for Unlimited Number of Speakers Using End\u2010to\u2010End Speaker\u2010Attributed ASR.\u201d In 2022 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) 8082\u20138086.","DOI":"10.1109\/ICASSP43922.2022.9746225"},{"key":"e_1_2_11_17_1","doi-asserted-by":"crossref","unstructured":"Khare A., 
E. Han, Y. Yang, and A. Stolcke. 2022. \u201cASR\u2010Aware End\u2010to\u2010End Neural Diarization.\u201d In 2022 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) 8092\u20138096.","DOI":"10.1109\/ICASSP43922.2022.9746964"},{"key":"e_1_2_11_18_1","doi-asserted-by":"crossref","unstructured":"Koluguri N. R., T. Park, and B. Ginsburg. 2022. \u201cTitaNet: Neural Model for Speaker Representation With 1D Depth\u2010Wise Separable Convolutions and Global Context.\u201d In 2022 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) 8102\u20138106.","DOI":"10.1109\/ICASSP43922.2022.9746806"},{"key":"e_1_2_11_19_1","unstructured":"Kuchaiev O., J. Li, H. Nguyen, et\u00a0al. 2019. \u201cNeMo: A Toolkit for Building AI Applications Using Neural Modules.\u201d https:\/\/doi.org\/10.48550\/arXiv.1909.09577."},{"key":"e_1_2_11_20_1","doi-asserted-by":"crossref","unstructured":"Landini F., A. Lozano\u2010Diez, M. Diez, and L. Burget. 2022. \u201cFrom Simulated Mixtures to Simulated Conversations as Training Data for End\u2010to\u2010End Neural Diarization.\u201d In Proc. Interspeech 2022 5095\u20135099.","DOI":"10.21437\/Interspeech.2022-10451"},{"key":"e_1_2_11_21_1","doi-asserted-by":"crossref","unstructured":"Lavechin M., M. M\u00e9tais, H. Titeux, et\u00a0al. 2023. \u201cBrouhaha: Multi\u2010Task Training for Voice Activity Detection Speech\u2010To\u2010Noise Ratio and C50 Room Acoustics Estimation.\u201d arXiv. https:\/\/doi.org\/10.48550\/arXiv.2210.13248.","DOI":"10.1109\/ASRU57964.2023.10389718"},{"key":"e_1_2_11_22_1","doi-asserted-by":"crossref","unstructured":"Liu T., S. Fan, X. Xiang, et\u00a0al. 2022. \u201cMSDWild: Multi\u2010Modal Speaker Diarization Dataset in the Wild.\u201d In Proc. Interspeech 2022 1476\u20131480.","DOI":"10.21437\/Interspeech.2022-10466"},{"key":"e_1_2_11_23_1","doi-asserted-by":"crossref","unstructured":"Mao H. H., S. Li, J. McAuley, and G. 
W. Cottrell. 2020. \u201cSpeech Recognition and Multi\u2010Speaker Diarization of Long Conversations.\u201d In Proc. Interspeech 2020 691\u2013695.","DOI":"10.21437\/Interspeech.2020-3039"},{"key":"e_1_2_11_24_1","doi-asserted-by":"crossref","unstructured":"Medennikov I., M. Korenevsky, T. Prisyach, et\u00a0al. 2020. \u201cThe STC System for the CHiME\u20106 Challenge.\u201d In Proc. 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020) 36\u201341.","DOI":"10.21437\/CHiME.2020-9"},{"key":"e_1_2_11_25_1","doi-asserted-by":"crossref","unstructured":"Meng L., J. Kang, M. Cui, H. Wu, X. Wu, and H. Meng. 2023. \u201cUnified Modeling of Multi\u2010Talker Overlapped Speech Recognition and Diarization With a Sidecar Separator.\u201d In Interspeech 2023 3467\u20133471.","DOI":"10.21437\/Interspeech.2023-1422"},{"key":"e_1_2_11_26_1","doi-asserted-by":"crossref","unstructured":"Park T. J., K. J. Han, J. Huang, et\u00a0al. 2019. \u201cSpeaker Diarization With Lexical Information.\u201d In Proc. Interspeech 2019 391\u2013395.","DOI":"10.21437\/Interspeech.2019-1947"},{"key":"e_1_2_11_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/lsp.2019.2961071"},{"key":"e_1_2_11_28_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.csl.2021.101317"},{"key":"e_1_2_11_29_1","doi-asserted-by":"crossref","unstructured":"Park T. J., N. R. Koluguri, J. Balam, and B. Ginsburg. 2022. \u201cMulti\u2010Scale Speaker Diarization With Dynamic Scale Weighting.\u201d In Proc. Interspeech 2022 5080\u20135084.","DOI":"10.21437\/Interspeech.2022-991"},{"key":"e_1_2_11_30_1","doi-asserted-by":"publisher","DOI":"10.1111\/exsy.13118"},{"key":"e_1_2_11_31_1","doi-asserted-by":"crossref","unstructured":"Plaquet A., and H. Bredin. 2023. \u201cPowerset Multi\u2010Class Cross Entropy Loss for Neural Speaker Diarization.\u201d In Proc. Interspeech 2023 3222\u20133226.","DOI":"10.21437\/Interspeech.2023-205"},{"key":"e_1_2_11_32_1","doi-asserted-by":"crossref","unstructured":"Ryant N., 
P. Singh, V. Krishnamohan, et\u00a0al. 2021. \u201cThe Third DIHARD Diarization Challenge.\u201d In Proc. Interspeech 2021 3570\u20133574.","DOI":"10.21437\/Interspeech.2021-1208"},{"key":"e_1_2_11_33_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.csl.2023.101534"},{"key":"e_1_2_11_34_1","doi-asserted-by":"crossref","unstructured":"Shafey L. E., H. Soltau, and I. Shafran. 2019. \u201cJoint Speech Recognition and Speaker Diarization via Sequence Transduction.\u201d In Proc. Interspeech 2019 396\u2013400.","DOI":"10.21437\/Interspeech.2019-1943"},{"key":"e_1_2_11_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.1998.675375"},{"key":"e_1_2_11_36_1","doi-asserted-by":"crossref","unstructured":"Xia W., H. Lu, Q. Wang, et\u00a0al. 2022. \u201cTurn\u2010to\u2010Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection.\u201d In 2022 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) 8077\u20138081.","DOI":"10.1109\/ICASSP43922.2022.9746531"},{"key":"e_1_2_11_37_1","doi-asserted-by":"crossref","unstructured":"Yang Z., Y. Chen, L. Luo, et\u00a0al. 2022. \u201cOpen Source MagicData\u2010RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset.\u201d In Proc. 
Interspeech 2022 1736\u20131740.","DOI":"10.21437\/Interspeech.2022-729"},{"key":"e_1_2_11_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP43922.2022.9746465"}],"container-title":["Expert Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1111\/exsy.70221","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/full-xml\/10.1111\/exsy.70221","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1111\/exsy.70221","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,8]],"date-time":"2026-03-08T23:20:50Z","timestamp":1773012050000},"score":1,"resource":{"primary":{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/10.1111\/exsy.70221"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,1]]},"references-count":37,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2026,3]]}},"alternative-id":["10.1111\/exsy.70221"],"URL":"https:\/\/doi.org\/10.1111\/exsy.70221","archive":["Portico"],"relation":{},"ISSN":["0266-4720","1468-0394"],"issn-type":[{"value":"0266-4720","type":"print"},{"value":"1468-0394","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,1]]},"assertion":[{"value":"2025-05-23","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-01-23","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-02-01","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}],"article-number":"e70221"}}