{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,30]],"date-time":"2025-07-30T14:18:35Z","timestamp":1753885115036,"version":"3.41.2"},"reference-count":32,"publisher":"Wiley","issue":"1","license":[{"start":{"date-parts":[[2021,9,26]],"date-time":"2021-09-26T00:00:00Z","timestamp":1632614400000},"content-version":"vor","delay-in-days":268,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100009919","name":"Army Research Institute for the Behavioral and Social Sciences","doi-asserted-by":"publisher","award":["W911NF-17-1-0221"],"award-info":[{"award-number":["W911NF-17-1-0221"]}],"id":[{"id":"10.13039\/100009919","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["Computational Intelligence and Neuroscience"],"published-print":{"date-parts":[[2021,1]]},"abstract":"<jats:p>Utterance clustering is one of the actively researched topics in audio signal processing and machine learning. This study aims to improve the performance of utterance clustering by processing multichannel (stereo) audio signals. Processed audio signals were generated by combining left\u2010 and right\u2010channel audio signals in a few different ways and then by extracting the embedded features (also called <jats:italic>d<\/jats:italic>\u2010vectors) from those processed audio signals. This study applied the Gaussian mixture model for supervised utterance clustering. In the training phase, a parameter\u2010sharing Gaussian mixture model was obtained to train the model for each speaker. In the testing phase, the speaker with the maximum likelihood was selected as the detected speaker. Results of experiments with real audio recordings of multiperson discussion sessions showed that the proposed method that used multichannel audio signals achieved significantly better performance than a conventional method with mono\u2010audio signals in more complicated conditions.<\/jats:p>","DOI":"10.1155\/2021\/6151651","type":"journal-article","created":{"date-parts":[[2021,9,26]],"date-time":"2021-09-26T19:42:10Z","timestamp":1632685330000},"update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Utterance Clustering Using Stereo Audio Channels"],"prefix":"10.1155","volume":"2021","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1935-1105","authenticated-orcid":false,"given":"Yingjun","family":"Dong","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8478-8530","authenticated-orcid":false,"given":"Neil G.","family":"MacLaren","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2392-3700","authenticated-orcid":false,"given":"Yiding","family":"Cao","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Francis J.","family":"Yammarino","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shelley D.","family":"Dionne","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Michael D.","family":"Mumford","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2686-6836","authenticated-orcid":false,"given":"Shane","family":"Connelly","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2670-5864","authenticated-orcid":false,"given":"Hiroki","family":"Sayama","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Gregory A.","family":"Ruark","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"311","published-online":{"date-parts":[[2021,9,26]]},"reference":[{"key":"e_1_2_11_1_2","doi-asserted-by":"crossref","unstructured":"MenneT. SklyarI. Schl\u00fcterR. andNeyH. Analysis of deep clustering as preprocessing for automatic speech recognition of sparsely overlapping speech Proceedings of the Interspeech 2019 Graz Austria https:\/\/doi.org\/10.21437\/interspeech.2019-1728.","DOI":"10.21437\/Interspeech.2019-1728"},{"key":"e_1_2_11_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/tasl.2011.2125954"},{"key":"e_1_2_11_3_2","doi-asserted-by":"crossref","unstructured":"von NeumannT. KinoshitaK. DelcroixM. ArakiS. NakataniT. andHaeb-UmbachR. All-neural online source separation counting and diarization for meeting analysis Proceedings of the ICASSP 2019\u20132019 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) 2019 Brighton UK IEEE 91\u201395 https:\/\/doi.org\/10.1109\/icassp.2019.8682572 2-s2.0-85068957436.","DOI":"10.1109\/ICASSP.2019.8682572"},{"key":"e_1_2_11_4_2","doi-asserted-by":"crossref","unstructured":"ChenZ. XiaoX. YoshiokaT. ErdoganH. LiJ. andGongY. Multi-channel overlapped speech recognition with location guided speech extraction network Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT) 2018 Athens Greece IEEE 558\u2013565 https:\/\/doi.org\/10.1109\/slt.2018.8639593 2-s2.0-85063076403.","DOI":"10.1109\/SLT.2018.8639593"},{"key":"e_1_2_11_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/tassp.1980.1163420"},{"key":"e_1_2_11_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2010.2064307"},{"key":"e_1_2_11_7_2","doi-asserted-by":"crossref","unstructured":"SnyderD. Garcia-RomeroD. SellG. PoveyD. andKhudanpurS. X-vectors: robust DNN embeddings for speaker recognition Proceedings of the 2018 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) 2018 Calgary Canada IEEE 5329\u20135333.","DOI":"10.1109\/ICASSP.2018.8461375"},{"key":"e_1_2_11_8_2","doi-asserted-by":"crossref","unstructured":"VarianiE. LeiX. McDermottE. MorenoI. L. andGonzalez-DominguezJ. Deep neural networks for small footprint text-dependent speaker verification Proceedings of the 2014 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) 2014 Florence Italy IEEE 4052\u20134056 https:\/\/doi.org\/10.1109\/icassp.2014.6854363 2-s2.0-84905252894.","DOI":"10.1109\/ICASSP.2014.6854363"},{"key":"e_1_2_11_9_2","doi-asserted-by":"crossref","unstructured":"DongY. MacLarenN. G. CaoY. YammarinoF. J. DionneS. D. MumfordM. D. ConnellyS. SayamaH. andRuarkG. A. Utterance clustering using stereo audio channels 2021 http:\/\/arxiv.org\/abs\/2009.05076.","DOI":"10.1155\/2021\/6151651"},{"key":"e_1_2_11_10_2","doi-asserted-by":"publisher","DOI":"10.1155\/2017\/1735698"},{"key":"e_1_2_11_11_2","doi-asserted-by":"crossref","unstructured":"WanL. WangQ. PapirA. andMorenoI. L. Generalized end-to-end loss for speaker verification Proceedings of the 2018 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) 2018 Calgary Canada IEEE 4879\u20134883.","DOI":"10.1109\/ICASSP.2018.8462665"},{"key":"e_1_2_11_12_2","doi-asserted-by":"publisher","DOI":"10.1155\/2020\/4748606"},{"key":"e_1_2_11_13_2","doi-asserted-by":"publisher","DOI":"10.1155\/2020\/8810901"},{"key":"e_1_2_11_14_2","doi-asserted-by":"publisher","DOI":"10.1155\/2015\/598610"},{"key":"e_1_2_11_15_2","doi-asserted-by":"publisher","DOI":"10.1016\/s0167-6393(00)00027-3"},{"key":"e_1_2_11_16_2","doi-asserted-by":"publisher","DOI":"10.1155\/2014\/628516"},{"key":"e_1_2_11_17_2","doi-asserted-by":"publisher","DOI":"10.1155\/2017\/6986391"},{"key":"e_1_2_11_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/tasl.2013.2264673"},{"key":"e_1_2_11_19_2","doi-asserted-by":"crossref","unstructured":"Zaj\u00edcZ. Hr\u00fazM. andM\u00fcllerL. Speaker diarization using convolutional neural network for statistics accumulation refinement Proceedings of the INTERSPEECH 2017 Stockholm Sweden 3562\u20133566 https:\/\/doi.org\/10.21437\/interspeech.2017-51 2-s2.0-85028634312.","DOI":"10.21437\/Interspeech.2017-51"},{"key":"e_1_2_11_20_2","doi-asserted-by":"crossref","unstructured":"WangQ. DowneyC. WanL. MansfieldP. A. andMorenoI. L. Speaker diarization with LSTM Proceedings of the 2018 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) 2018 Calgary Canada IEEE 5239\u20135243.","DOI":"10.1109\/ICASSP.2018.8462628"},{"key":"e_1_2_11_21_2","doi-asserted-by":"crossref","unstructured":"ZhangA. WangQ. ZhuZ. PaisleyJ. andWangC. Fully supervised speaker diarization Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) 2019 Brighton UK IEEE 6301\u20136305 https:\/\/doi.org\/10.1109\/icassp.2019.8683892 2-s2.0-85069451910.","DOI":"10.1109\/ICASSP.2019.8683892"},{"key":"e_1_2_11_22_2","doi-asserted-by":"crossref","unstructured":"McFeeB. RaffelC. LiangD.et al. librosa: audio and music signal analysis in python Proceedings of the 14th Python in Science Conference 2015 Austin TX USA 18\u201325.","DOI":"10.25080\/Majora-7b98e3ed-003"},{"key":"e_1_2_11_23_2","unstructured":"JemineC. Real-time-voice-cloning 2019 University of Li\u00e9ge Li\u00e9ge Belgium Master\u2019s thesis https:\/\/github.com\/CorentinJ\/Real-Time-Voice-Cloning."},{"key":"e_1_2_11_24_2","doi-asserted-by":"crossref","unstructured":"PanayotovV. ChenG. PoveyD. andKhudanpurS. Librispeech: an ASR corpus based on public domain audio books Proceedings of the 2015 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) 2015 Brisbane Australia IEEE 5206\u20135210.","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"e_1_2_11_25_2","doi-asserted-by":"crossref","unstructured":"NagraniA. ChungJ. S. andZissermanA. Voxceleb: a large-scale speaker identification dataset Proceedings of the INTERSPEECH 2017 Stockholm Sweden https:\/\/doi.org\/10.21437\/interspeech.2017-950 2-s2.0-85039159334.","DOI":"10.21437\/Interspeech.2017-950"},{"key":"e_1_2_11_26_2","doi-asserted-by":"crossref","unstructured":"ChungJ. S. NagraniA. andZissermanA. Voxceleb2: deep speaker recognition Proceedings of the INTERSPEECH 2018 Hyderabad India https:\/\/doi.org\/10.21437\/interspeech.2018-1929 2-s2.0-85054964925.","DOI":"10.21437\/Interspeech.2018-1929"},{"volume-title":"Automatic Speech Recognition","year":"2016","author":"Yu D.","key":"e_1_2_11_27_2"},{"key":"e_1_2_11_28_2","article-title":"A gentle tutorial of the em algorithm and its application to parameter estimation for Gaussian mixture and hidden markov models","volume":"4","author":"Bilmes J. A.","year":"1998","journal-title":"International Computer Science Institute"},{"key":"e_1_2_11_29_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.leaqua.2020.101409"},{"key":"e_1_2_11_30_2","unstructured":"FFmpeg Developers FFmpeg Tool https:\/\/ffmpeg.org\/."},{"key":"e_1_2_11_31_2","first-page":"2825","article-title":"Scikit-learn: machine learning in Python","volume":"12","author":"Pedregosa F.","year":"2011","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_11_32_2","first-page":"2579","article-title":"Visualizing data using t-sne","volume":"9","author":"Maaten L. v. d.","year":"2008","journal-title":"Journal of Machine Learning Research"}],"container-title":["Computational Intelligence and Neuroscience"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/downloads.hindawi.com\/journals\/cin\/2021\/6151651.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/downloads.hindawi.com\/journals\/cin\/2021\/6151651.xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1155\/2021\/6151651","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,6]],"date-time":"2024-08-06T12:04:04Z","timestamp":1722945844000},"score":1,"resource":{"primary":{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/10.1155\/2021\/6151651"}},"subtitle":[],"editor":[{"given":"Carlos M.","family":"Travieso-Gonz\u00e1lez","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2021,1]]},"references-count":32,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,1]]}},"alternative-id":["10.1155\/2021\/6151651"],"URL":"https:\/\/doi.org\/10.1155\/2021\/6151651","archive":["Portico"],"relation":{},"ISSN":["1687-5265","1687-5273"],"issn-type":[{"type":"print","value":"1687-5265"},{"type":"electronic","value":"1687-5273"}],"subject":[],"published":{"date-parts":[[2021,1]]},"assertion":[{"value":"2021-04-15","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-09-13","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-09-26","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}],"article-number":"6151651"}}