{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,20]],"date-time":"2026-03-20T15:40:34Z","timestamp":1774021234377,"version":"3.50.1"},"reference-count":53,"publisher":"MDPI AG","issue":"10","license":[{"start":{"date-parts":[[2025,9,23]],"date-time":"2025-09-23T00:00:00Z","timestamp":1758585600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>Responsiveness\u2014the speed at which a text-to-speech (TTS) system produces audible output\u2014is critical for real-time voice assistants yet has received far less attention than perceptual quality metrics. Existing evaluations often touch on latency but do not establish reproducible, open-source standards that capture responsiveness as a first-class dimension. This work introduces a baseline benchmark designed to fill that gap. Our framework unifies latency distribution, tail latency, and intelligibility within a transparent and dataset-diverse pipeline, enabling a fair and replicable comparison across 13 widely used open-source TTS models. By grounding evaluation in structured input sets ranging from single words to sentence-length utterances and adopting a methodology inspired by standardized inference benchmarks, we capture both typical and worst-case user experiences. Unlike prior studies that emphasize closed or proprietary systems, our focus is on establishing open, reproducible baselines rather than ranking against commercial references. The results reveal substantial variability across architectures, with some models delivering near-instant responses while others fail to meet interactive thresholds. 
By centering evaluation on responsiveness and reproducibility, this study provides an infrastructural foundation for benchmarking TTS systems and lays the groundwork for more comprehensive assessments that integrate both fidelity and speed.<\/jats:p>","DOI":"10.3390\/computers14100406","type":"journal-article","created":{"date-parts":[[2025,9,23]],"date-time":"2025-09-23T13:48:37Z","timestamp":1758635317000},"page":"406","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Benchmarking the Responsiveness of Open-Source Text-to-Speech Systems"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-1605-3332","authenticated-orcid":false,"given":"Ha Pham Thien","family":"Dinh","sequence":"first","affiliation":[{"name":"Faculty of Science, Engineering and Built Environment, School of Information Technology, Deakin University, Burwood, VIC 3125, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6397-0974","authenticated-orcid":false,"given":"Rutherford Agbeshi","family":"Patamia","sequence":"additional","affiliation":[{"name":"Faculty of Science, Engineering and Built Environment, School of Information Technology, Deakin University, Burwood, VIC 3125, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2160-6111","authenticated-orcid":false,"given":"Ming","family":"Liu","sequence":"additional","affiliation":[{"name":"Faculty of Science, Engineering and Built Environment, School of Information Technology, Deakin University, Burwood, VIC 3125, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4203-6477","authenticated-orcid":false,"given":"Akansel","family":"Cosgun","sequence":"additional","affiliation":[{"name":"Faculty of Science, Engineering and Built Environment, School of Information Technology, Deakin University, Burwood, VIC 3125, Australia"}]}],"member":"1968","published-online":{"date-parts":[[2025,9,23]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1080\/02763869.2018.1404391","article-title":"Alexa, Siri, Cortana, and More: An Introduction to Voice Assistants","volume":"37","author":"Hoy","year":"2018","journal-title":"Med. Ref. Serv. Q."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Sainath, T.N., and Parada, C. (2015, January 6\u201310). Convolutional Neural Networks for Small-Footprint Keyword Spotting. Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH), Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-352"},{"key":"ref_3","first-page":"355","article-title":"How Voice Assistants Are Taking Over Our Lives\u2014A Review","volume":"6","author":"Singh","year":"2019","journal-title":"J. Emerg. Technol. Innov. Res."},{"key":"ref_4","unstructured":"D\u00e9fossez, A., Mazar\u00e9, L., Orsini, M., Royer, A., P\u00e9rez, P., J\u00e9gou, H., Grave, E., and Zeghidour, N. (2024). Moshi: A speech-text foundation model for real-time dialogue. arXiv."},{"key":"ref_5","unstructured":"Srivastava, A., Rastogi, A., Rao, A., Shoeb, A.A.M., Abid, A., Fisch, A., Brown, A.R., Santoro, A., Gupta, A., and Garriga-Alonso, A. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv."},{"key":"ref_6","unstructured":"Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. (2020). Measuring massive multitask language understanding. 
arXiv."},{"key":"ref_7","unstructured":"Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.D.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., and Brockman, G. (2021). Evaluating large language models trained on code. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Reddi, V.J., Cheng, C., Kanter, D., Mattson, P., Schmuelling, G., Wu, C.J., Anderson, B., Breughe, M., Charlebois, M., and Chou, W. (June, January 30). Mlperf inference benchmark. Proceedings of the 2020 ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Virtual.","DOI":"10.1109\/ISCA45697.2020.00045"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Spangher, L., Li, T., Arnold, W.F., Masiewicki, N., Dotiwalla, X., Parusmathi, R., Grabowski, P., Ie, E., and Gruhl, D. (2024). Project MPG: Towards a generalized performance benchmark for LLM capabilities. arXiv.","DOI":"10.18653\/v1\/2025.naacl-industry.77"},{"key":"ref_10","unstructured":"Banerjee, D., Singh, P., Avadhanam, A., and Srivastava, S. (2023). Benchmarking LLM powered chatbots: Methods and metrics. arXiv."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Bai, G., Liu, J., Bu, X., He, Y., Liu, J., Zhou, Z., Lin, Z., Su, W., Ge, T., and Zheng, B. (2024). Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. arXiv.","DOI":"10.18653\/v1\/2024.acl-long.401"},{"key":"ref_12","unstructured":"Malode, V.M. (2024). Benchmarking Public Large Language Model. [Ph.D. Thesis, Technische Hochschule Ingolstadt]."},{"key":"ref_13","unstructured":"Jacovi, A., Wang, A., Alberti, C., Tao, C., Lipovetz, J., Olszewska, K., Haas, L., Liu, M., Keating, N., and Bloniarz, A. (2025). The FACTS Grounding Leaderboard: Benchmarking LLMs\u2019 Ability to Ground Responses to Long-Form Input. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Wang, B., Zou, X., Lin, G., Sun, S., Liu, Z., Zhang, W., Liu, Z., Aw, A., and Chen, N.F. (2024). Audiobench: A universal benchmark for audio large language models. arXiv.","DOI":"10.18653\/v1\/2025.naacl-long.218"},{"key":"ref_15","unstructured":"Gandhi, S., Von Platen, P., and Rush, A.M. (2022). Esb: A benchmark for multi-domain end-to-end speech recognition. arXiv."},{"key":"ref_16","unstructured":"Fang, Y., Sun, H., Liu, J., Zhang, T., Zhou, Z., Chen, W., Xing, X., and Xu, X. (2025). S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models. arXiv."},{"key":"ref_17","unstructured":"Chen, Y., Yue, X., Zhang, C., Gao, X., Tan, R.T., and Li, H. (2024). Voicebench: Benchmarking llm-based voice assistants. arXiv."},{"key":"ref_18","unstructured":"Alberts, L., Ellis, B., Lupu, A., and Foerster, J. (2024). CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants. arXiv."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Fnu, N., and Bansal, A. (2024, January 14\u201315). Understanding the architecture of vision transformer and its variants: A review. Proceedings of the 2024 1st International Conference on Innovative Engineering Sciences and Technological Research (ICIESTR), Muscat, Oman.","DOI":"10.1109\/ICIESTR60916.2024.10798341"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1016\/j.csl.2003.12.001","article-title":"Measuring speech quality for text-to-speech systems: Development and assessment of a modified mean opinion score (MOS) scale","volume":"19","author":"Viswanathan","year":"2005","journal-title":"Comput. 
Speech Lang."},{"key":"ref_21","unstructured":"Srivastav, V., Fourrier, C., Pouget, L., Lacombe, Y., and Gandhi, S. (2025, August 01). Text to Speech Arena. Available online: https:\/\/huggingface.co\/spaces\/TTS-AGI\/TTS-Arena."},{"key":"ref_22","unstructured":"Srivastav, V., Fourrier, C., Pouget, L., Lacombe, Y., Gandhi, S., Passos, A., and Cuenca, P. (2025, August 01). TTS Arena 2.0: Benchmarking Text-to-Speech Models in the Wild. Available online: https:\/\/huggingface.co\/spaces\/TTS-AGI\/TTS-Arena-V2."},{"key":"ref_23","unstructured":"Picovoice (2025, August 01). Picovoice TTS Latency Benchmark. Available online: https:\/\/github.com\/Picovoice\/tts-latency-benchmark."},{"key":"ref_24","unstructured":"Artificial Analysis (2025, August 01). Text-to-Speech Benchmarking Methodology. Available online: https:\/\/artificialanalysis.ai\/text-to-speech."},{"key":"ref_25","unstructured":"Labelbox (2025, August 01). Evaluating Leading Text-to-Speech Models. Available online: https:\/\/labelbox.com\/guides\/evaluating-leading-text-to-speech-models\/."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Minixhofer, C., Klejch, O., and Bell, P. (2024, January 2\u20135). TTSDS-Text-to-Speech Distribution Score. Proceedings of the 2024 IEEE Spoken Language Technology Workshop (SLT), Macau, China.","DOI":"10.1109\/SLT61566.2024.10832178"},{"key":"ref_27","unstructured":"Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2023, January 23\u201329). Robust speech recognition via large-scale weak supervision. Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA."},{"key":"ref_28","unstructured":"Griffies, S.M., Perrie, W.A., and Hull, G. (2013). Elements of style for writing scientific journal articles. Publishing Connect, Elsevier."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Wallwork, A. (2016). English for Writing Research Papers, Springer.","DOI":"10.1007\/978-3-319-26094-5"},{"key":"ref_30","unstructured":"Carnegie Mellon University Speech Group (2025, July 23). The Carnegie Mellon Pronouncing Dictionary (CMUdict). 1993\u20132014. Available online: http:\/\/www.speech.cs.cmu.edu\/cgi-bin\/cmudict."},{"key":"ref_31","unstructured":"Day, D.L. (2025, July 23). CMU Pronouncing Dictionary Python Package. Available online: https:\/\/pypi.org\/project\/cmudict\/."},{"key":"ref_32","unstructured":"Davies, M. (2025, July 23). Word Frequency Data from the Corpus of Contemporary American English (COCA). Available online: https:\/\/www.wordfrequency.info."},{"key":"ref_33","unstructured":"Davies, M. (2025, July 23). The iWeb Corpus: 14 Billion Words of English from the Web. Available online: https:\/\/www.english-corpora.org\/iweb\/."},{"key":"ref_34","unstructured":"Suno-AI (2025, March 10). Bark: A Transformer-Based Text-to-Audio Model. Available online: https:\/\/github.com\/suno-ai\/bark."},{"key":"ref_35","unstructured":"D\u00e9fossez, A., Copet, J., Synnaeve, G., and Adi, Y. (2022). High fidelity neural audio compression. arXiv."},{"key":"ref_36","unstructured":"Zhang, Z., Zhou, L., Wang, C., Chen, S., Wu, Y., Liu, S., Chen, Z., Liu, Y., Wang, H., and Li, J. (2023). Speak foreign languages with your own voice: Cross-lingual neural codec language modeling. arXiv."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Mehta, S., Sz\u00e9kely, \u00c9., Beskow, J., and Henter, G.E. (2022, January 23\u201327). Neural HMMs are all you need (for high-quality attention-free TTS). 
Proceedings of the ICASSP 2022\u20142022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore.","DOI":"10.1109\/ICASSP43922.2022.9746686"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Su, J., Jin, Z., and Finkelstein, A. (2021, January 17\u201320). HiFi-GAN-2: Studio-quality speech enhancement via generative adversarial networks conditioned on acoustic features. Proceedings of the 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA.","DOI":"10.1109\/WASPAA52581.2021.9632770"},{"key":"ref_39","unstructured":"Ito, K., and Johnson, L. (2025, August 01). The LJ Speech Dataset. Available online: https:\/\/keithito.com\/LJ-Speech-Dataset\/."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Mehta, S., Kirkland, A., Lameris, H., Beskow, J., Sz\u00e9kely, \u00c9., and Henter, G.E. (2022). OverFlow: Putting flows on top of neural transducers for better TTS. arXiv.","DOI":"10.21437\/Interspeech.2023-1996"},{"key":"ref_41","unstructured":"rany2 (2025, August 06). Edge-tts: Microsoft Edge Text-to-Speech Library. Available online: https:\/\/pypi.org\/project\/edge-tts\/."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"\u0141a\u0144cucki, A. (2021, January 6\u201311). Fastpitch: Parallel text-to-speech with pitch prediction. Proceedings of the ICASSP 2021\u20142021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9413889"},{"key":"ref_43","unstructured":"Kim, J., Kong, J., and Son, J. (2021, January 18\u201324). Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. Proceedings of the International Conference on Machine Learning. PMLR, Virtual."},{"key":"ref_44","first-page":"8067","article-title":"Glow-tts: A generative flow for text-to-speech via monotonic alignment search","volume":"33","author":"Kim","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Yang, G., Yang, S., Liu, K., Fang, P., Chen, W., and Xie, L. (2021, January 19\u201322). Multi-band melgan: Faster waveform generation for high-quality text-to-speech. Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Online.","DOI":"10.1109\/SLT48900.2021.9383551"},{"key":"ref_46","unstructured":"Battenberg, E., Mariooryad, S., Stanton, D., Skerry-Ryan, R., Shannon, M., Kao, D., and Bagby, T. (2019). Effective use of variational embedding capacity in expressive end-to-end speech synthesis. arXiv."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"King, S., and Karaiskos, V. (2013, January 3). The Blizzard Challenge 2013. Proceedings of the Blizzard Challenge Workshop, Barcelona, Spain.","DOI":"10.21437\/Blizzard.2013-1"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., and Skerrv-Ryan, R. (2018, January 15\u201320). Natural TTS Synthesis by Conditioning Wavenet on Mel Spectrogram Predictions. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"ref_49","unstructured":"Durette, P.N. (2025, March 10). gTTS: Documentation. Available online: https:\/\/gtts.readthedocs.io\/."},{"key":"ref_50","unstructured":"Microsoft Corporation (2025, March 10). 
Speech API 5.4 Documentation. Available online: https:\/\/learn.microsoft.com\/en-us\/previous-versions\/windows\/desktop\/ee125663(v=vs.85)."},{"key":"ref_51","unstructured":"NVIDIA (2025, August 01). How to Deploy Real-Time Text-to-Speech Applications on GPUs Using TensorRT. Available online: https:\/\/developer.nvidia.com\/blog\/how-to-deploy-real-time-text-to-speech-applications-on-gpus-using-tensorrt\/."},{"key":"ref_52","unstructured":"Milvus AI (2025, August 01). What Are the Challenges of Deploying TTS on Embedded Systems?. Available online: https:\/\/blog.milvus.io\/ai-quick-reference\/what-are-the-challenges-of-deploying-tts-on-embedded-systems."},{"key":"ref_53","first-page":"1","article-title":"Blockwise parallel decoding for deep autoregressive models","volume":"31","author":"Stern","year":"2018","journal-title":"Adv. Neural Inf. Process. Syst."}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/10\/406\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:47:55Z","timestamp":1760035675000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/10\/406"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,23]]},"references-count":53,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2025,10]]}},"alternative-id":["computers14100406"],"URL":"https:\/\/doi.org\/10.3390\/computers14100406","relation":{},"ISSN":["2073-431X"],"issn-type":[{"value":"2073-431X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,23]]}}}
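
The abstract above describes a benchmark that unifies latency distribution and tail latency over structured inputs ranging from single words to sentence-length utterances. As a minimal illustrative sketch only, the Python below shows one way such responsiveness metrics can be computed; the synthesize placeholder, the example inputs, and the nearest-rank percentile helper are assumptions for the demo and do not reproduce the paper's actual pipeline or code.

import math
import random
import statistics
import time

def synthesize(text: str) -> bytes:
    # Hypothetical placeholder for a real engine call (e.g. one of the open-source
    # TTS models the paper benchmarks). It sleeps for a length-dependent interval
    # so this demo runs end to end without any TTS engine installed.
    time.sleep(0.01 + 0.002 * len(text) + random.uniform(0.0, 0.01))
    return b""

def benchmark(inputs, runs_per_input=5):
    # Collect one wall-clock latency sample per synthesis call.
    samples = []
    for text in inputs:
        for _ in range(runs_per_input):
            start = time.perf_counter()
            synthesize(text)  # blocks until the engine returns audio
            samples.append(time.perf_counter() - start)
    samples.sort()

    def percentile(p):
        # Classic nearest-rank percentile over the sorted samples.
        rank = max(1, math.ceil(p / 100 * len(samples)))
        return samples[rank - 1]

    return {
        "n": len(samples),
        "median_s": statistics.median(samples),
        "p95_s": percentile(95),  # tail latency: near-worst-case user experience
        "p99_s": percentile(99),
    }

if __name__ == "__main__":
    # Inputs spanning the single-word-to-sentence range the abstract describes;
    # these example strings are illustrative, not the paper's dataset.
    inputs = ["hi", "benchmark", "Please read this sentence aloud at a natural pace."]
    print(benchmark(inputs))

Nearest-rank percentiles are used here so the reported p95/p99 values are actual observed samples rather than interpolated ones, which keeps tail-latency figures conservative for small sample counts.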