{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,22]],"date-time":"2026-04-22T20:23:46Z","timestamp":1776889426092,"version":"3.51.2"},"reference-count":75,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T00:00:00Z","timestamp":1775606400000},"content-version":"vor","delay-in-days":97,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,4,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Recent advancements in large language models (LLMs) like GPT-4o have enabled real-time speech interactions through LLM-based voice assistants, offering an improved user experience over text-based interactions. However, a suitable benchmark to rigorously evaluate such speech interaction systems is currently lacking. To bridge this gap, we introduce VoiceBench, the first benchmark specifically designed to assess LLM-based voice assistants. VoiceBench comprises 6,783 synthetic and real spoken instructions recorded from diverse speakers across eight distinct tasks. These instructions are meticulously crafted to assess three crucial capability areas: general knowledge, instruction-following, and safety compliance. Furthermore, VoiceBench systematically incorporates realistic variations common in spoken interactions, including differences in speaker characteristics (e.g., accents), heterogeneous environmental conditions (e.g., reverberation), and content complexities such as mispronunciations. 
Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.<\/jats:p>","DOI":"10.1162\/tacl.a.628","type":"journal-article","created":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T13:13:12Z","timestamp":1775653992000},"page":"378-398","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":8,"title":["VoiceBench: Benchmarking LLM-Based Voice Assistants"],"prefix":"10.1162","volume":"14","author":[{"given":"Yiming","family":"Chen","sequence":"first","affiliation":[{"name":"National University of Singapore, Singapore. yiming.chen@u.nus.edu"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xianghu","family":"Yue","sequence":"additional","affiliation":[{"name":"Tianjin University, China. yuexianghu@tju.edu.cn"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chen","family":"Zhang","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore. chen_zhang@u.nus.edu"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiaoxue","family":"Gao","sequence":"additional","affiliation":[{"name":"I2R, Agency for Science, Technology, and Research (A*STAR), Singapore. 
Gao_Xiaoxue@a-star.edu.sg"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Robby T.","family":"Tan","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Haizhou","family":"Li","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China"},{"name":"Shenzhen Research Institute of Big Data, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"281","published-online":{"date-parts":[[2026,4,1]]},"reference":[{"key":"2026040809130721400_bib1","article-title":"GPT-4 technical report","author":"Achiam","year":"2023","journal-title":"arXiv preprint arXiv:2303.08774"},{"key":"2026040809130721400_bib2","unstructured":"Anthropic. 2024. Claude 3.5 sonnet\n                        model card addendum. Online; accessed October\n                        2024."},{"key":"2026040809130721400_bib3","first-page":"4218","article-title":"Common voice: A massively-multilingual speech\n                        corpus","volume-title":"Proceedings of the Twelfth Language\n                        Resources and Evaluation Conference","author":"Ardila","year":"2020"},{"key":"2026040809130721400_bib4","doi-asserted-by":"publisher","first-page":"1561","DOI":"10.21437\/Interspeech.2018-1768","article-title":"The fifth \u2018chime\u2019 speech\n                        separation and recognition challenge: Dataset, task and\n                        baselines","author":"Barker","year":"2018","journal-title":"Interspeech"},{"issue":"2","key":"2026040809130721400_bib5","doi-asserted-by":"publisher","first-page":"707","DOI":"10.1016\/j.cognition.2007.04.005","article-title":"Perceptual adaptation to non-native 
speech","volume":"106","author":"Bradlow","year":"2008","journal-title":"Cognition"},{"key":"2026040809130721400_bib6","doi-asserted-by":"publisher","first-page":"2144","DOI":"10.18653\/v1\/2020.coling-main.195","article-title":"Grammatical error detection in\n                        transcriptions of spoken English","volume-title":"Proceedings of\n                        the 28th International Conference on Computational Linguistics","author":"Caines","year":"2020"},{"issue":"2","key":"2026040809130721400_bib7","doi-asserted-by":"publisher","first-page":"141","DOI":"10.1093\/applin\/16.2.141","article-title":"Grammar and the spoken\n                        language","volume":"16","author":"Carter","year":"1995","journal-title":"Applied Linguistics"},{"key":"2026040809130721400_bib8","doi-asserted-by":"publisher","first-page":"5455","DOI":"10.1109\/CVPR52734.2025.00513","article-title":"Emova: Empowering language models to see,\n                        hear and speak with vivid emotions","volume-title":"Proceedings\n                        of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition\n                        (CVPR)","author":"Chen","year":"2025"},{"key":"2026040809130721400_bib9","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2025.emnlp-main.511","article-title":"Recent advances in large langauge model benchmarks against\n                        data contamination: From static to dynamic evaluation","author":"Chen","year":"2025","journal-title":"arXiv preprint arXiv:2502.17521"},{"key":"2026040809130721400_bib10","doi-asserted-by":"publisher","first-page":"7164","DOI":"10.18653\/v1\/2023.acl-long.395","article-title":"Dynamic transformers provide a false sense of\n                        efficiency","volume-title":"Proceedings of the 61st Annual\n                        Meeting of the Association for Computational Linguistics (Volume 1: Long\n                        
Papers)","author":"Chen","year":"2023"},{"key":"2026040809130721400_bib11","doi-asserted-by":"publisher","first-page":"10917","DOI":"10.18653\/v1\/2024.findings-emnlp.640","article-title":"Beyond single-audio: Advancing multi-audio processing in\n                        audio large language models","volume-title":"Findings of the\n                        Association for Computational Linguistics: EMNLP 2024","author":"Chen","year":"2024"},{"key":"2026040809130721400_bib12","doi-asserted-by":"publisher","first-page":"1359","DOI":"10.18653\/v1\/2024.findings-acl.80","article-title":"Unveiling the achilles\u2019 heel of NLG evaluators: A\n                        unified adversarial framework driven by large language\n                        models","volume-title":"Findings of the Association for\n                        Computational Linguistics: ACL 2024","author":"Chen","year":"2024"},{"issue":"12","key":"2026040809130721400_bib13","doi-asserted-by":"publisher","first-page":"220101","DOI":"10.1007\/s11432-024-4231-5","article-title":"How far are we to GPT-4V? 
Closing the gap to\n                        commercial multimodal models with open-source suites","volume":"67","author":"Chen","year":"2024","journal-title":"Science China Information Sciences"},{"key":"2026040809130721400_bib14","article-title":"Qwen2-audio technical report","author":"Chu","year":"2024","journal-title":"arXiv\n                        preprint arXiv:2407.10759"},{"key":"2026040809130721400_bib15","article-title":"Qwen-audio: Advancing universal audio understanding via\n                        unified large-scale audio-language models","author":"Chu","year":"2023","journal-title":"arXiv\n                        preprint arXiv:2311.07919"},{"key":"2026040809130721400_bib16","doi-asserted-by":"publisher","first-page":"454","DOI":"10.1162\/tacl_a_00317","article-title":"TyDi QA: A benchmark for\n                        information-seeking question answering in typologically diverse\n                        languages","volume":"8","author":"Clark","year":"2020","journal-title":"Transactions of the Association for\n                        Computational Linguistics"},{"key":"2026040809130721400_bib17","article-title":"Moshi: A speech-text foundation model for\n                        real-time dialogue","author":"D\u00e9fossez","year":"2024","journal-title":"arXiv preprint arXiv:\n                        2410.00037"},{"issue":"6","key":"2026040809130721400_bib18","doi-asserted-by":"publisher","first-page":"611","DOI":"10.1016\/S0022-5371(81)90202-4","article-title":"Stages in sentence production: An analysis of\n                        speech error data","volume":"20","author":"Dell","year":"1981","journal-title":"Journal of Verbal Learning and\n                        Verbal Behavior"},{"key":"2026040809130721400_bib19","article-title":"CosyVoice: A scalable multilingual zero-shot text-to-speech\n                        synthesizer based on supervised semantic tokens","author":"Zhihao","year":"2024","journal-title":"arXiv preprint 
arXiv:2407.05407"},{"key":"2026040809130721400_bib20","doi-asserted-by":"publisher","first-page":"8304","DOI":"10.18653\/v1\/2023.findings-emnlp.557","article-title":"Automatic pronunciation assessment - a\n                        review","volume-title":"Findings of the Association for\n                        Computational Linguistics: EMNLP 2023","author":"El Kheir","year":"2023"},{"key":"2026040809130721400_bib21","doi-asserted-by":"publisher","first-page":"3296","DOI":"10.18653\/v1\/2021.findings-emnlp.281","article-title":"SD-QA: Spoken dialectal question answering\n                        for the real world","volume-title":"Findings of the Association\n                        for Computational Linguistics: EMNLP 2021","author":"Faisal","year":"2021"},{"key":"2026040809130721400_bib22","article-title":"LLaMA-Omni: Seamless speech interaction with large language\n                        models","volume-title":"the Thirteenth International Conference\n                        on Learning Representations","author":"Fang","year":"2025"},{"issue":"6","key":"2026040809130721400_bib23","doi-asserted-by":"publisher","first-page":"709","DOI":"10.1006\/jmla.1995.1032","article-title":"The effects of false starts and\n                        repetitions on the processing of subsequent words in spontaneous\n                        speech","volume":"34","author":"Fox Tree","year":"1995","journal-title":"Journal of Memory and Language"},{"key":"2026040809130721400_bib24","article-title":"VITA: Towards open-source interactive omni multimodal\n                        LLM","author":"Chaoyou","year":"2024","journal-title":"arXiv preprint\n                    arXiv:2408.05211"},{"key":"2026040809130721400_bib25","doi-asserted-by":"publisher","first-page":"6556","DOI":"10.18653\/v1\/2024.naacl-long.365","article-title":"GPTScore: Evaluate as you desire","volume-title":"Proceedings of the 2024 Conference of the North American Chapter of\n                        the Association for 
Computational Linguistics: Human Language Technologies\n                        (Volume 1: Long Papers)","author":"Jinlan","year":"2024"},{"key":"2026040809130721400_bib26","doi-asserted-by":"publisher","first-page":"693","DOI":"10.1109\/TASLPRO.2025.3533357","article-title":"TTslow: Slow down text-to-speech with\n                        efficiency robustness evaluations","volume":"33","author":"Gao","year":"2025","journal-title":"IEEE Transactions\n                        on Audio, Speech and Language Processing"},{"key":"2026040809130721400_bib27","doi-asserted-by":"publisher","first-page":"2200","DOI":"10.1109\/LSP.2024.3443711","article-title":"Transferable adversarial attacks against ASR","volume":"31","author":"Gao","year":"2024","journal-title":"IEEE Signal Processing Letters"},{"key":"2026040809130721400_bib28","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/ICASSP49660.2025.10888737","article-title":"EMO-DPO: Controllable emotional speech\n                        synthesis through direct preference optimization","volume-title":"ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech\n                        and Signal Processing (ICASSP)","author":"Gao","year":"2025"},{"key":"2026040809130721400_bib29","article-title":"MultiGen: Child-friendly multilingual speech\n                        generator with LLMs","author":"Gao","year":"2025","journal-title":"arXiv preprint\n                        arXiv:2508.08715"},{"key":"2026040809130721400_bib30","article-title":"Prompt-unseen-emotion: Zero-shot expressive\n                        speech synthesis with prompt-LLM contextual knowledge for mixed\n                        emotions","author":"Gao","year":"2025","journal-title":"arXiv preprint\n                    arXiv:2506.02742"},{"key":"2026040809130721400_bib31","article-title":"Listen, think, and understand","volume-title":"the Twelfth International Conference on Learning\n                        
Representations","author":"Gong","year":"2024"},{"issue":"6","key":"2026040809130721400_bib32","doi-asserted-by":"publisher","first-page":"111","DOI":"10.1109\/MSP.2019.2918706","article-title":"Speech processing for digital home\n                        assistants: Combining signal processing with deep-learning\n                        techniques","volume":"36","author":"Haeb-Umbach","year":"2019","journal-title":"IEEE Signal Processing\n                    Magazine"},{"key":"2026040809130721400_bib33","doi-asserted-by":"publisher","first-page":"7876","DOI":"10.18653\/v1\/2025.acl-long.388","article-title":"Distilling an end-to-end voice assistant without instruction\n                        training data","volume-title":"Proceedings of the 63rd Annual\n                        Meeting of the Association for Computational Linguistics (Volume 1: Long\n                        Papers)","author":"Held","year":"2025"},{"key":"2026040809130721400_bib34","doi-asserted-by":"publisher","first-page":"12136","DOI":"10.1109\/ICASSP48485.2024.10448257","article-title":"Dynamic-Superb: Towards a dynamic, collaborative, and\n                        comprehensive instruction-tuning benchmark for speech","volume-title":"ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech\n                        and Signal Processing (ICASSP)","author":"Huang","year":"2024"},{"key":"2026040809130721400_bib35","doi-asserted-by":"publisher","first-page":"2051","DOI":"10.18653\/v1\/2020.findings-emnlp.186","article-title":"End-to-end speech recognition and\n                        disfluency removal","volume-title":"Findings of the Association\n                        for Computational Linguistics: EMNLP 2020","author":"Lou","year":"2020"},{"key":"2026040809130721400_bib36","article-title":"BeaverTails: Towards improved safety alignment of LLM via a\n                        human-preference dataset","author":"Ji","year":"2024","journal-title":"Advances in Neural\n                       
 Information Processing Systems"},{"key":"2026040809130721400_bib37","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2025.acl-long.937","article-title":"CodeJudgeBench: Benchmarking LLM-as-a-judge for coding\n                        tasks","author":"Jiang","year":"2025","journal-title":"arXiv preprint arXiv: 2507.10535"},{"key":"2026040809130721400_bib38","first-page":"19112","article-title":"UniCodec: Unified audio codec with single domain-adaptive\n                        codebook","volume-title":"Proceedings of the 63rd Annual Meeting\n                        of the Association for Computational Linguistics","author":"Jiang","year":"2024"},{"key":"2026040809130721400_bib39","doi-asserted-by":"publisher","first-page":"61","DOI":"10.1007\/978-3-540-49127-9_4","article-title":"Perception of speech and\n                    sound","author":"Kollmeier","year":"2008","journal-title":"Springer Handbook of Speech Processing"},{"issue":"1","key":"2026040809130721400_bib40","doi-asserted-by":"publisher","first-page":"362","DOI":"10.1121\/1.1635842","article-title":"Acoustic properties of naturally produced\n                        clear speech at normal speaking rates","volume":"115","author":"Krause","year":"2004","journal-title":"Journal of\n                        the Acoustical Society of America"},{"issue":"4","key":"2026040809130721400_bib41","doi-asserted-by":"publisher","first-page":"745","DOI":"10.1109\/TASLP.2014.2304637","article-title":"An overview of noise-robust automatic\n                        speech recognition","volume":"22","author":"Li","year":"2014","journal-title":"IEEE\/ACM Transactions on Audio,\n                        Speech, and Language Processing"},{"key":"2026040809130721400_bib42","unstructured":"Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. AlpacaEval: An automatic evaluator of instruction-following models. https:\/\/github.com\/tatsu-lab\/alpaca_eval"},{"key":"2026040809130721400_bib43","article-title":"Baichuan-Omni technical\n                        report","author":"Li","year":"2024","journal-title":"arXiv preprint\n                    arXiv:2410.08565"},{"key":"2026040809130721400_bib44","doi-asserted-by":"publisher","first-page":"9373","DOI":"10.18653\/v1\/2024.findings-emnlp.548","article-title":"PEDANTS: Cheap but effective and\n                        interpretable answer equivalence","volume-title":"Findings of the\n                        Association for Computational Linguistics: EMNLP 2024","author":"Li","year":"2024"},{"key":"2026040809130721400_bib45","article-title":"Visual instruction tuning","volume":"36","author":"Liu","year":"2024","journal-title":"Advances\n                        in Neural Information Processing Systems"},{"key":"2026040809130721400_bib46","article-title":"Trustworthy LLMs: A survey and guideline for evaluating large\n                        language models\u2019 alignment","volume-title":"Socially\n                        Responsible Language Modelling Research","author":"Liu","year":"2023"},{"key":"2026040809130721400_bib47","doi-asserted-by":"publisher","first-page":"11479","DOI":"10.18653\/v1\/2023.findings-acl.728","article-title":"Disfluency generation for more robust dialogue\n                        systems","volume-title":"Findings of the Association for\n                        Computational Linguistics: ACL 2023","author":"Marie","year":"2023"},{"issue":"3","key":"2026040809130721400_bib48","doi-asserted-by":"publisher","first-page":"207","DOI":"10.1093\/elt\/49.3.207","article-title":"Spoken grammar: What is it and how can we\n                        teach 
it?","volume":"49","author":"McCarthy","year":"1995","journal-title":"ELT Journal"},{"issue":"11","key":"2026040809130721400_bib49","doi-asserted-by":"publisher","first-page":"e79279","DOI":"10.1371\/journal.pone.0079279","article-title":"Speech recognition in natural background\n                        noise","volume":"8","author":"Meyer","year":"2013","journal-title":"PloS One"},{"key":"2026040809130721400_bib50","doi-asserted-by":"publisher","first-page":"2381","DOI":"10.18653\/v1\/D18-1260","article-title":"Can a suit of armor conduct electricity? A\n                        new dataset for open book question answering","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural\n                        Language Processing","author":"Mihaylov","year":"2018"},{"key":"2026040809130721400_bib51","article-title":"Open-LLM-Leaderboard: From multi-choice to open-style\n                        questions for LLMs evaluation, benchmark, and arena","author":"Myrzakhan","year":"2024","journal-title":"arXiv preprint arXiv:2406.07545"},{"key":"2026040809130721400_bib52","article-title":"Qwen2.5-Omni technical\n                        report","author":"Qwen","year":"2025","journal-title":"arXiv preprint\n                    arXiv:2503.20215"},{"key":"2026040809130721400_bib53","first-page":"28492","article-title":"Robust speech recognition via large-scale\n                        weak supervision","volume-title":"International Conference on\n                        Machine Learning","author":"Radford","year":"2023"},{"key":"2026040809130721400_bib54","article-title":"Gemini 1.5: Unlocking multimodal understanding\n                        across millions of tokens of context","author":"Reid","year":"2024","journal-title":"arXiv preprint\n                        arXiv:2403.05530"},{"key":"2026040809130721400_bib55","first-page":"31210","article-title":"Large language models can be easily distracted by irrelevant\n                        
context","volume-title":"Proceedings of the 40th International\n                        Conference on Machine Learning","author":"Shi","year":"2023"},{"key":"2026040809130721400_bib56","unstructured":"Elizabeth Ellen Shriberg. 1994. Preliminaries to a theory of speech disfluencies. Ph.D. thesis, Citeseer."},{"key":"2026040809130721400_bib57","article-title":"SpokenWOZ: A large-scale speech-text benchmark for spoken\n                        task-oriented dialogue agents","volume-title":"Thirty-seventh\n                        Conference on Neural Information Processing Systems Datasets and Benchmarks\n                        Track","author":"Si","year":"2023"},{"key":"2026040809130721400_bib58","doi-asserted-by":"publisher","first-page":"13003","DOI":"10.18653\/v1\/2023.findings-acl.824","article-title":"Challenging BIG-bench tasks and whether chain-of-thought can\n                        solve them","volume-title":"Findings of the Association for\n                        Computational Linguistics: ACL 2023","author":"Suzgun","year":"2023"},{"key":"2026040809130721400_bib59","article-title":"SALMONN: Towards generic hearing abilities for\n                        large language models","volume-title":"the Twelfth International\n                        Conference on Learning Representations","author":"Tang","year":"2024"},{"key":"2026040809130721400_bib60","doi-asserted-by":"publisher","first-page":"11939","DOI":"10.18653\/v1\/2024.findings-emnlp.697","article-title":"Resilience of large language models for noisy\n                        instructions","volume-title":"Findings of the Association for\n                        Computational Linguistics: EMNLP 2024","author":"Wang","year":"2024"},{"key":"2026040809130721400_bib61","article-title":"MMLU-Pro: A more robust and challenging\n                        multi-task language understanding 
benchmark","volume-title":"the\n                        Thirty-eight Conference on Neural Information Processing Systems Datasets\n                        and Benchmarks Track","author":"Wang","year":"2024"},{"key":"2026040809130721400_bib62","doi-asserted-by":"publisher","first-page":"1137","DOI":"10.1109\/SLT61566.2024.10832300","article-title":"Just ASR + LLM? A study on speech large\n                        language models\u2019 ability to identify and understand speaker in spoken\n                        dialogue","volume-title":"2024 IEEE Spoken Language Technology\n                        Workshop (SLT)","author":"Junkai","year":"2024"},{"key":"2026040809130721400_bib63","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2017.2756440","article-title":"Mini-Omni: Language models can hear, talk\n                        while thinking in streaming","author":"Xie","year":"2024","journal-title":"arXiv preprint arXiv:\n                        2408.16725"},{"key":"2026040809130721400_bib64","article-title":"Mini-Omni2: Towards open-source GPT-4o\n                        model with vision, speech and duplex","author":"Xie","year":"2024","journal-title":"arXiv preprint\n                        arXiv:2410.11190"},{"issue":"12","key":"2026040809130721400_bib65","doi-asserted-by":"crossref","first-page":"2410","DOI":"10.1109\/TASLP.2017.2756440","article-title":"Toward human parity in conversational speech\n                        recognition","volume":"25","author":"Xiong","year":"2017","journal-title":"IEEE\/ACM Transactions on Audio, Speech,\n                        and Language Processing"},{"key":"2026040809130721400_bib66","doi-asserted-by":"publisher","first-page":"5587","DOI":"10.18653\/v1\/2024.acl-long.303","article-title":"SafeDecoding: Defending against jailbreak\n                        attacks via safety-aware decoding","volume-title":"Proceedings of\n                        the 62nd Annual Meeting of the Association for Computational Linguistics\n        
                (Volume 1: Long Papers)","author":"Zhangchen","year":"2024"},{"key":"2026040809130721400_bib67","doi-asserted-by":"publisher","first-page":"1979","DOI":"10.18653\/v1\/2024.acl-long.109","article-title":"AIR-Bench: Benchmarking large audio-language models via\n                        generative comprehension","volume-title":"Proceedings of the 62nd\n                        Annual Meeting of the Association for Computational Linguistics (Volume 1:\n                        Long Papers)","author":"Yang","year":"2024"},{"key":"2026040809130721400_bib68","doi-asserted-by":"publisher","first-page":"3255","DOI":"10.1109\/TASLPRO.2025.3587467","article-title":"COAVT: A cognition-inspired unified audio-visual-text\n                        pre-training model for multimodal processing","volume":"33","author":"Yue","year":"2025","journal-title":"IEEE\n                        Transactions on Audio, Speech and Language Processing"},{"key":"2026040809130721400_bib69","article-title":"Evaluating large language models at evaluating instruction\n                        following","volume-title":"the Twelfth International Conference\n                        on Learning Representations","author":"Zeng","year":"2024"},{"key":"2026040809130721400_bib70","doi-asserted-by":"crossref","DOI":"10.1609\/aaai.v38i17.29923","article-title":"A comprehensive analysis of the effectiveness of large\n                        language models as automatic dialogue evaluators","volume-title":"Proceedings of the Thirty-Eighth AAAI Conference on Artificial\n                        Intelligence and Thirty-Sixth Conference on Innovative Applications of\n                        Artificial Intelligence and Fourteenth Symposium on Educational Advances in\n                        Artificial Intelligence","author":"Zhang","year":"2024"},{"key":"2026040809130721400_bib71","doi-asserted-by":"publisher","first-page":"15757","DOI":"10.18653\/v1\/2023.findings-emnlp.1055","article-title":"SpeechGPT: 
Empowering large language models with intrinsic\n                        cross-modal conversational abilities","volume-title":"Findings of\n                        the Association for Computational Linguistics: EMNLP 2023","author":"Zhang","year":"2023"},{"key":"2026040809130721400_bib72","doi-asserted-by":"publisher","first-page":"10921","DOI":"10.1109\/ICASSP48485.2024.10447169","article-title":"A chat about boring problems: Studying\n                        GPT-based text normalization","volume-title":"ICASSP 2024-2024\n                        IEEE International Conference on Acoustics, Speech and Signal Processing\n                        (ICASSP)","author":"Zhang","year":"2024"},{"key":"2026040809130721400_bib73","article-title":"Melotts: High-quality multi-lingual multi-accent\n                        text-to-speech","author":"Zhao","year":"2023"},{"key":"2026040809130721400_bib74","article-title":"Instruction-following evaluation for large\n                        language models","author":"Zhou","year":"2023","journal-title":"arXiv preprint\n                        arXiv:2311.07911"},{"key":"2026040809130721400_bib75","article-title":"Universal and transferable adversarial\n                        attacks on aligned language models","author":"Zou","year":"2023","journal-title":"arXiv preprint\n                        arXiv:2307.15043"}],"container-title":["Transactions of the Association for Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/TACL.a.628\/2593104\/tacl.a.628.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/TACL.a.628\/2593104\/tacl.a.628.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T13:13:18Z","timestamp":1775653998000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/TACL.a.628\/136245\/VoiceBench-Benchmarking-LLM-Based-Voice-Assistants"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026]]},"references-count":75,"URL":"https:\/\/doi.org\/10.1162\/tacl.a.628","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2026]]},"published":{"date-parts":[[2026]]}}}