{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T21:35:17Z","timestamp":1776116117939,"version":"3.50.1"},"reference-count":60,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2024,8,22]],"date-time":"2024-08-22T00:00:00Z","timestamp":1724284800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."],"published-print":{"date-parts":[[2024,8,22]]},"abstract":"<jats:p>We introduce StethoSpeech, a silent speech interface that transforms flesh-conducted vibrations behind the ear into speech. This innovation is designed to improve social interactions for those with voice disorders, and furthermore enable discreet public communication. Unlike prior efforts, StethoSpeech does not require (a) paired-speech data for recorded vibrations and (b) a specialized device for recording vibrations, as it can work with an off-the-shelf clinical stethoscope. The novelty of our framework lies in the overall design, simulation of the ground-truth speech, and a sequence-to-sequence translation network, which works in the latent space. We present comprehensive experiments on the existing CSTR NAM TIMIT Plus corpus and our proposed StethoText: a large-scale synchronized database of non-audible murmur and text for speech research. Our results show that StethoSpeech provides natural-sounding and intelligible speech, significantly outperforming existing methods on several quantitative and qualitative metrics. Additionally, we showcase its capacity to extend its application to speakers not encountered during training and its effectiveness in challenging, noisy environments. Speech samples are available at https:\/\/stethospeech.github.io\/StethoSpeech\/.<\/jats:p>","DOI":"10.1145\/3678515","type":"journal-article","created":{"date-parts":[[2024,9,9]],"date-time":"2024-09-09T14:36:21Z","timestamp":1725892581000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["StethoSpeech: Speech Generation Through a Clinical Stethoscope Attached to the Skin"],"prefix":"10.1145","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7517-3673","authenticated-orcid":false,"given":"Neil","family":"Shah","sequence":"first","affiliation":[{"name":"International Institute of Information Technology, Hyderabad, Hyderabad, Telangana, India and TCS Research, Pune, Maharashtra, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-1101-8701","authenticated-orcid":false,"given":"Neha","family":"Sahipjohn","sequence":"additional","affiliation":[{"name":"International Institute of Information Technology, Hyderabad, Hyderabad, Telangana, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-1396-1106","authenticated-orcid":false,"given":"Vishal","family":"Tambrahalli","sequence":"additional","affiliation":[{"name":"International Institute of Information Technology, Hyderabad, Hyderabad, Telangana, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9441-7074","authenticated-orcid":false,"given":"Ramanathan","family":"Subramanian","sequence":"additional","affiliation":[{"name":"University of Canberra, Bruce, Canberra, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8861-7731","authenticated-orcid":false,"given":"Vineet","family":"Gandhi","sequence":"additional","affiliation":[{"name":"International Institute of Information Technology, Hyderabad, Hyderabad, Telangana, India"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,9,9]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Garnett (Eds.)","volume":"31","author":"Arik Sercan","year":"2018","unstructured":"Sercan Arik, Jitong Chen, Kainan Peng, Wei Ping, and Yanqi Zhou. 2018. Neural Voice Cloning with a Few Samples. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Curran Associates, Inc. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2018\/file\/4559912e7a94a9c32b09d894f2bc3c82-Paper.pdf"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9054224"},{"key":"e_1_2_1_3_1","volume-title":"Proceedings of the 34th International Conference on Neural Information Processing Systems","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: a framework for self-supervised learning of speech representations. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS '20). Curran Associates Inc., Red Hook, NY, USA, Article 1044, 12 pages. https:\/\/dl.acm.org\/doi\/abs\/10.5555\/3495724.3496768"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2023.3288409"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1055\/s-0039-3402497"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2022.3188113"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcomdis.2015.06.008"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/N19-1423"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01268"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1044\/jslhr.4106.1253"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.6028\/NIST.IR.4930"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2017.2757263"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1143844.1143891"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1093\/ejo\/19.6.647"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1155\/2007\/94068"},{"key":"e_1_2_1_16_1","volume-title":"The Japan China Joint Conference of Acoustics","volume":"100","author":"Hirahara Tatsuya","year":"2007","unstructured":"Tatsuya Hirahara, Shota Shimizu, and Makoto Otani. 2007. Acoustic characteristics of non-audible murmur. In The Japan China Joint Conference of Acoustics, Vol. 100. 4000. https:\/\/www.researchgate.net\/profile\/Tatsuya-Hirahara\/publication\/251737500_ACOUSTIC_CHARACTERISTICS_OF_NON-AUDIBLE_MURMUR\/links\/00b7d52a80765613f2000000\/ACOUSTIC-CHARACTERISTICS-OF-NON-AUDIBLE-MURMUR.pdf"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2021.3122291"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01802"},{"key":"e_1_2_1_19_1","unstructured":"Keith Ito and Linda Johnson. 2017. The LJ Speech Dataset. https:\/\/keithito.com\/LJ-Speech-Dataset\/."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2014.6854066"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3172944.3172977"},{"key":"e_1_2_1_22_1","volume-title":"International Conference","author":"Kikuchi Yoshinobu","year":"2004","unstructured":"Yoshinobu Kikuchi and Hideki Kasuya. 2004. Development and evaluation of pitch adjustable electrolarynx. In Speech Prosody 2004, International Conference (Nara, Japan). https:\/\/www.isca-archive.org\/speechprosody_2004\/kikuchi04_speechprosody.pdf"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3411763.3451552"},{"key":"e_1_2_1_24_1","volume-title":"Psychogenic aphonia: no fixation even after a lengthy period of aphonia. Swiss medical weekly 140, 1--2","author":"Kollbrunner Juerg","year":"2010","unstructured":"Juerg Kollbrunner, Anne-Dorine Menet, and Eberhard Seifert. 2010. Psychogenic aphonia: no fixation even after a lengthy period of aphonia. Swiss medical weekly 140, 1--2 (2010), 12--17. https:\/\/boris.unibe.ch\/1509\/1\/smw-12776.pdf"},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of the 34th International Conference on Neural Information Processing Systems","author":"Kong Jungil","year":"2020","unstructured":"Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS '20). Curran Associates Inc., Red Hook, NY, USA, Article 1428, 12 pages. https:\/\/dl.acm.org\/doi\/abs\/10.5555\/3495724.3497152"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.emnlp-main.769"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9052966"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3427384.3427392"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2003.1200069"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1121\/1.4781262"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2023-286"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-4009"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.23919\/EUSIPCO.2019.8902961"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2021-475"},{"key":"e_1_2_1_35_1","doi-asserted-by":"crossref","unstructured":"K R Prajwal Rudrabha Mukhopadhyay Vinay P. Namboodiri and C.V. Jawahar. 2020. Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https:\/\/openaccess.thecvf.com\/content_CVPR_2020\/papers\/Prajwal_Learning_Individual_Speaking_Styles_for_Accurate_Lip_to_Speech_Synthesis_CVPR_2020_paper.pdf","DOI":"10.1109\/CVPR42600.2020.01381"},{"key":"e_1_2_1_36_1","unstructured":"Thomas F. Quatieri. 2002. Discrete-time speech signal processing: principles and practice. Pearson Education India. https:\/\/books.google.co.in\/books\/about\/Discrete_time_Speech_Signal_Processing.html?id=5KYeAQAAIAAJ"},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research","volume":"28518","author":"Radford Alec","year":"2023","unstructured":"Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine Mcleavey, and Ilya Sutskever. 2023. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 28492--28518. https:\/\/proceedings.mlr.press\/v202\/radford23a.html"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544548.3580706"},{"key":"e_1_2_1_39_1","volume-title":"International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=piLPYqxtWuA","author":"Ren Yi","year":"2021","unstructured":"Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2021. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=piLPYqxtWuA"},{"key":"e_1_2_1_40_1","volume-title":"FastSpeech: fast, robust and controllable text to speech","author":"Ren Yi","unstructured":"Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. FastSpeech: fast, robust and controllable text to speech. Curran Associates Inc., Red Hook, NY, USA. https:\/\/dl.acm.org\/doi\/abs\/10.5555\/3454287.3454572"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/APSIPAASC58517.2023.10317357"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2019-1873"},{"key":"e_1_2_1_43_1","volume-title":"Findings of the Association for Computational Linguistics: EACL 2024","author":"Shah Neil","year":"2024","unstructured":"Neil Shah, Saiteja Kosgi, Vishal Tambrahalli, Neha S, Anil Nelakanti, and Vineet Gandhi. 2024. ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations. In Findings of the Association for Computational Linguistics: EACL 2024. Association for Computational Linguistics, St. Julian's, Malta, 79--91. https:\/\/aclanthology.org\/2024.findings-eacl.6"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2018-1565"},{"key":"e_1_2_1_45_1","unstructured":"Nirmesh J Shah Mihir Parmar Neil Shah and Hemant A Patil. 2018. Novel MMSE DiscoGAN for cross-domain whisper-to-speech conversion. In Machine Learning in Speech and Language Processing (MLSLP) Workshop. Google Office 1--3. https:\/\/www.researchgate.net\/publication\/326668619_Novel_MMSE_DiscoGAN_for_Cross-Domain_Whisper-to-Speech_Conversion"},{"key":"e_1_2_1_46_1","volume-title":"Frequency characteristics of several non-audible murmur (NAM) microphones. Acoustical science and technology 30, 2","author":"Shimizu Shota","year":"2009","unstructured":"Shota Shimizu, Makoto Otani, and Tatsuya Hirahara. 2009. Frequency characteristics of several non-audible murmur (NAM) microphones. Acoustical science and technology 30, 2 (2009), 139--142. https:\/\/www.jstage.jst.go.jp\/article\/ast\/30\/2\/30_2_139\/_pdf"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/OJSP.2024.3379092"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544548.3581465"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2007.907344"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2009.4960405"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2005-611"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2009.11.005"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2018-1078"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295378"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_2_1_56_1","volume-title":"Last accessed on","author":"von Platen Patrick","year":"2023","unstructured":"Patrick von Platen. 2022. Fine-tuning Wav2Vec2 for English ASR. https:\/\/colab.research.google.com\/github\/patrickvonplaten\/notebooks\/blob\/master\/Fine_tuning_Wav2Vec2_for_English_ASR.ipynb Google Colab Notebook, Last accessed on 06 September 2023."},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/3369812"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCSLP.2012.6423522"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3580838"},{"key":"e_1_2_1_60_1","unstructured":"zeta chicken. 2017. toWhisper. https:\/\/github.com\/zeta-chicken\/toWhisper 2017 github."}],"container-title":["Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3678515","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3678515","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T14:45:04Z","timestamp":1755787504000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3678515"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8,22]]},"references-count":60,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,8,22]]}},"alternative-id":["10.1145\/3678515"],"URL":"https:\/\/doi.org\/10.1145\/3678515","relation":{},"ISSN":["2474-9567"],"issn-type":[{"value":"2474-9567","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,8,22]]},"assertion":[{"value":"2024-09-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}