{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T00:47:47Z","timestamp":1775177267002,"version":"3.50.1"},"reference-count":18,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2023,9,20]],"date-time":"2023-09-20T00:00:00Z","timestamp":1695168000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Big Data"],"abstract":"<jats:sec><jats:title>Introduction<\/jats:title><jats:p>Speech to text (STT) technology has seen increased usage in recent years for automating transcription of spoken language. To choose the most suitable tool for a given task, it is essential to evaluate the performance and quality of both open source and paid STT services.<\/jats:p><\/jats:sec><jats:sec><jats:title>Methods<\/jats:title><jats:p>In this paper, we conduct a benchmarking study of open source and paid STT services, with a specific focus on assessing their performance concerning the variety of input text. We utilize six datasets obtained from diverse sources, including interviews, lectures, and speeches, as input for the STT tools. The evaluation of the instruments employs the Word Error Rate (WER), a standard metric for STT evaluation.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>Our analysis of the results demonstrates significant variations in the performance of the STT tools based on the input text. Certain tools exhibit superior performance on specific types of audio samples compared to others. 
Our study provides insights into STT tool performance when handling substantial data volumes, as well as the challenges and opportunities posed by the multimedia nature of the data.<\/jats:p><\/jats:sec><jats:sec><jats:title>Discussion<\/jats:title><jats:p>Although paid services generally demonstrate better accuracy and speed compared to open source alternatives, their performance remains dependent on the input text. The study highlights the need for considering specific requirements and characteristics of the audio samples when selecting an appropriate STT tool.<\/jats:p><\/jats:sec>","DOI":"10.3389\/fdata.2023.1210559","type":"journal-article","created":{"date-parts":[[2023,9,21]],"date-time":"2023-09-21T07:40:29Z","timestamp":1695282029000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":14,"title":["Benchmarking open source and paid services for speech to text: an analysis of quality and input variety"],"prefix":"10.3389","volume":"6","author":[{"given":"Antonino","family":"Ferraro","sequence":"first","affiliation":[]},{"given":"Antonio","family":"Galli","sequence":"additional","affiliation":[]},{"given":"Valerio","family":"La Gatta","sequence":"additional","affiliation":[]},{"given":"Marco","family":"Postiglione","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2023,9,20]]},"reference":[{"key":"B1","doi-asserted-by":"crossref","first-page":"20","DOI":"10.18653\/v1\/P18-2004","author":"Ali","year":"2018","journal-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)"},{"key":"B2","first-page":"4218","article-title":"\u201cCommon voice: a massively-multilingual speech corpus,\u201d","volume-title":"Proceedings of the Twelfth Language Resources and Evaluation 
Conference","author":"Ardila","year":"2020"},{"key":"B3","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2006.11477","article-title":"wav2vec 2.0: a framework for self-supervised learning of speech representations","author":"Baevski","year":"2020","journal-title":"arXiv preprint arXiv:2006.11477"},{"key":"B4","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2303.00747","article-title":"Whisperx: time-accurate speech transcription of long-form audio","author":"Bain","year":"2023","journal-title":"arXiv preprint arXiv:2303.00747"},{"key":"B5","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1803.10609","article-title":"The fifth \u2018CHIME' speech separation and recognition challenge: dataset, task and baselines","author":"Barker","year":"2018","journal-title":"arXiv preprint arXiv:1803.10609"},{"key":"B6","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2104.02133","article-title":"SpeechStew: simply mix all available speech recognition data to train one large neural network","author":"Chan","year":"2021","journal-title":"arXiv preprint arXiv:2104.02133"},{"key":"B7","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1910.13934","article-title":"SMS-WSJ: database, performance measures, and baseline recipe for multi-channel source separation and recognition","author":"Drude","year":"2019","journal-title":"arXiv preprint arXiv:1910.13934"},{"key":"B8","first-page":"5036","author":"Gulati","year":"2020","journal-title":"Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event"},{"key":"B9","doi-asserted-by":"publisher","first-page":"124","DOI":"10.1109\/JPROC.2020.3018668","article-title":"Far-field automatic speech recognition","volume":"109","author":"Haeb-Umbach","year":"2020","journal-title":"Proc. 
IEEE"},{"key":"B10","doi-asserted-by":"crossref","first-page":"999","DOI":"10.1109\/SLT54892.2023.10023181","article-title":"\u201cBenchmarking evaluation metrics for code-switching automatic speech recognition,\u201d","volume-title":"2022 IEEE Spoken Language Technology Workshop (SLT)","author":"Hamed","year":"2023"},{"key":"B11","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1412.5567","article-title":"Deep speech: scaling up end-to-end speech recognition","author":"Hannun","year":"2014","journal-title":"arXiv preprint arXiv:1412.5567"},{"key":"B12","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1805.04699","article-title":"TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation","author":"Hernandez","year":"2018","journal-title":"arXiv preprint arXiv:1805.04699"},{"key":"B13","doi-asserted-by":"publisher","first-page":"3451","DOI":"10.1109\/TASLP.2021.3122291","article-title":"Hubert: self-supervised speech representation learning by masked prediction of hidden units","volume":"29","author":"Hsu","year":"2021","journal-title":"IEEE ACM Trans. Audio Speech Lang. 
Process."},{"key":"B14","doi-asserted-by":"publisher","first-page":"9411","DOI":"10.1007\/s11042-020-10073-7","article-title":"Automatic speech recognition: a survey","volume":"80","author":"Malik","year":"2021","journal-title":"Multimedia Tools Appl."},{"key":"B15","first-page":"5206","article-title":"\u201cLibrispeech: an ASR corpus based on public domain audio books,\u201d","volume-title":"2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015","author":"Panayotov","year":"2015"},{"key":"B16","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2106.04624","article-title":"SpeechBrain: a general-purpose speech toolkit","author":"Ravanelli","year":"2021","journal-title":"arXiv preprint arXiv:2106.04624"},{"key":"B17","first-page":"125","article-title":"\u201cTED-LIUM: an automatic speech recognition dedicated corpus,\u201d","volume-title":"Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)","author":"Rousseau","year":"2012"},{"key":"B18","doi-asserted-by":"crossref","first-page":"351","DOI":"10.1016\/0167-6393(90)90010-7","article-title":"Speech database development at MIT: timit and beyond","volume":"9","author":"Zue","year":"1990","journal-title":"Speech Commun."}],"container-title":["Frontiers in Big 
Data"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fdata.2023.1210559\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,21]],"date-time":"2023-09-21T07:40:48Z","timestamp":1695282048000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fdata.2023.1210559\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9,20]]},"references-count":18,"alternative-id":["10.3389\/fdata.2023.1210559"],"URL":"https:\/\/doi.org\/10.3389\/fdata.2023.1210559","relation":{},"ISSN":["2624-909X"],"issn-type":[{"value":"2624-909X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,9,20]]},"article-number":"1210559"}}