{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T21:09:03Z","timestamp":1776114543834,"version":"3.50.1"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2022,9,6]],"date-time":"2022-09-06T00:00:00Z","timestamp":1662422400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."],"published-print":{"date-parts":[[2022,9,6]]},"abstract":"<jats:p>Speech enhancement can benefit lots of practical voice-based interaction applications, where the goal is to generate clean speech from noisy ambient conditions. This paper presents a practical design, namely UltraSpeech, to enhance speech by exploring the correlation between the ultrasound (profiled articulatory gestures) and speech. UltraSpeech uses a commodity smartphone to emit the ultrasound and collect the composed acoustic signal for analysis. We design a complex masking framework to deal with complex-valued spectrograms, incorporating the magnitude and phase rectification of speech simultaneously. We further introduce an interaction module to share information between ultrasound and speech two branches and thus enhance their discrimination capabilities. 
Extensive experiments demonstrate that UltraSpeech increases the Scale-Invariant SDR by 12 dB, effectively improves speech intelligibility and quality, and generalizes to unknown speakers.<\/jats:p>","DOI":"10.1145\/3550303","type":"journal-article","created":{"date-parts":[[2022,9,7]],"date-time":"2022-09-07T14:54:27Z","timestamp":1662562467000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":28,"title":["UltraSpeech"],"prefix":"10.1145","volume":"6","author":[{"given":"Han","family":"Ding","sequence":"first","affiliation":[{"name":"Xi'an Jiaotong University, China"}]},{"given":"Yizhan","family":"Wang","sequence":"additional","affiliation":[{"name":"Xi'an Jiaotong University, China"}]},{"given":"Hao","family":"Li","sequence":"additional","affiliation":[{"name":"Xi'an Jiaotong University, China"}]},{"given":"Cui","family":"Zhao","sequence":"additional","affiliation":[{"name":"Xi'an Jiaotong University, China"}]},{"given":"Ge","family":"Wang","sequence":"additional","affiliation":[{"name":"Xi'an Jiaotong University, China"}]},{"given":"Wei","family":"Xi","sequence":"additional","affiliation":[{"name":"Xi'an Jiaotong University, China"}]},{"given":"Jizhong","family":"Zhao","sequence":"additional","affiliation":[{"name":"Xi'an Jiaotong University, China"}]}],"member":"320","published-online":{"date-parts":[[2022,9,7]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"https:\/\/en.wikipedia.org\/wiki\/Amazon_Echo. Wikipedia","author":"Echo Amazon","year":"2022","unstructured":"[n.d.]. Amazon Echo. https:\/\/en.wikipedia.org\/wiki\/Amazon_Echo. Wikipedia, 2022."},{"key":"e_1_2_1_2_1","volume-title":"An Android APP that Emits Sounds of User-Specified Frequencies. https:\/\/github.com\/dtczhl\/dtc-frequency-player","year":"2019","unstructured":"[n.d.]. An Android APP that Emits Sounds of User-Specified Frequencies. https:\/\/github.com\/dtczhl\/dtc-frequency-player. 
2019."},{"key":"e_1_2_1_3_1","volume-title":"Speech Recognition on Android with Wav2Vec2. https:\/\/github.com\/pytorch\/android-demo-app\/tree\/master\/SpeechRecognition","year":"2021","unstructured":"[n.d.]. Speech Recognition on Android with Wav2Vec2. https:\/\/github.com\/pytorch\/android-demo-app\/tree\/master\/SpeechRecognition. 2021."},{"key":"e_1_2_1_4_1","volume-title":"Andrew Senior, Oriol Vinyals, and Andrew Zisserman.","author":"Afouras Triantafyllos","year":"2018","unstructured":"Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. 2018. Deep Audio-Visual Speech Recognition. IEEE transactions on Pattern Analysis and Machine Intelligence (2018)."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2019-3114"},{"key":"e_1_2_1_6_1","volume-title":"A Framework for Self-supervised Learning of Speech Representations. arXiv preprint arXiv:2006.11477","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. Wav2vec 2.0: A Framework for Self-supervised Learning of Speech Representations. arXiv preprint arXiv:2006.11477 (2020)."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSMCA.2010.2041656"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.367"},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of PMLR ICML.","author":"Fu Szu-Wei","year":"2019","unstructured":"Szu-Wei Fu, Chien-Feng Liao, Yu Tsao, and Shou-De Lin. 2019. Metricgan: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement. In Proceedings of PMLR ICML."},{"key":"e_1_2_1_10_1","first-page":"29","article-title":"DARPA TIMIT Acoustic-Phonetic Speech Database","volume":"15","author":"Garofolo John S","year":"1988","unstructured":"John S Garofolo et al. 1988. DARPA TIMIT Acoustic-Phonetic Speech Database. 
National Institute of Standards and Technology (NIST) 15 (1988), 29--50.","journal-title":"National Institute of Standards and Technology (NIST)"},{"key":"e_1_2_1_11_1","volume-title":"End-to-End Multi-Channel Speech Separation. arXiv preprint arXiv:1905.06286","author":"Gu Rongzhi","year":"2019","unstructured":"Rongzhi Gu, Jian Wu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, and Dong Yu. 2019. End-to-End Multi-Channel Speech Separation. arXiv preprint arXiv:1905.06286 (2019)."},{"key":"e_1_2_1_12_1","volume-title":"Google Home Specifications","author":"Help Google Home","year":"2017","unstructured":"Google Home Help. [n.d.]. Google Home Specifications. Google Inc., 2017."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.5555\/2209820.2210675"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-2537"},{"key":"e_1_2_1_15_1","volume-title":"ITU-Telecommunication Standardization Sector","author":"Recommendation ITU-T P ITU.","year":"2007","unstructured":"Recommendation ITU-T P ITU. [n.d.]. 862.2: Wideband Extension to Recommendation P. 862 for the Assessment of Wideband Telephone Networks and Speech Codecs. ITU-Telecommunication Standardization Sector, 2007."},{"key":"e_1_2_1_16_1","volume-title":"Ma\u00ebva Garnier and John Smith","author":"Bernardoni Joe Wolfe Nathalie Henrich","year":"2020","unstructured":"Nathalie Henrich Bernardoni Joe Wolfe, Ma\u00ebva Garnier and John Smith. 2020. The Mechanics and Acoustics of the Singing Voice. 
Routledge."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2002.5745591"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300376"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9053214"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3311823.3311831"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/SAHCN.2018.8397099"},{"key":"e_1_2_1_22_1","volume-title":"Multimedia analysis, processing and communications","author":"Loizou Philipos C","unstructured":"Philipos C Loizou. 2011. Speech Quality Assessment. In Multimedia analysis, processing and communications. Springer, 623--654."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3397320"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8462116"},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of Citeseer ICML.","author":"Maas Andrew L","year":"2013","unstructured":"Andrew L Maas, Awni Y Hannun, Andrew Y Ng, et al. 2013. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of Citeseer ICML."},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of ACM IMWUT\/UbiComp","author":"Cordourier Maruri H\u00e9ctor A","year":"2018","unstructured":"H\u00e9ctor A Cordourier Maruri, Paulo Lopez-Meyer, Jonathan Huang, Willem Marco Beltman, Lama Nachman, and Hong Lu. 2018. V-Speech: Noise-Robust Speech Capturing Glasses using Vibration Sensors. Proceedings of ACM IMWUT\/UbiComp (2018)."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2016.2580946"},{"key":"e_1_2_1_28_1","volume-title":"A Beginner's Guide","author":"Cognitive Neuroscience Fundamentals","year":"2013","unstructured":"Fundamentals of Cognitive Neuroscience: A Beginner's Guide. 2013. Chapter 11 - Language. 
Academic Press."},{"key":"e_1_2_1_29_1","volume-title":"Human-centric Interfaces for Ambient Intelligence","author":"Paliwal Kuldip K","unstructured":"Kuldip K Paliwal and Kaisheng Yao. 2010. Robust Speech Recognition Under Noisy Ambient Conditions. In Human-centric Interfaces for Ambient Intelligence. Elsevier, 135--162."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8683634"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2017-1428"},{"key":"e_1_2_1_32_1","unstructured":"Markku Pukkila. 2000. Channel Estimation Modeling."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAU.1969.1162058"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2634317.2634322"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3242587.3242599"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447993.3448626"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2010.5495701"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/IWAENC.2018.8521383"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9053723"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2018-1405"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2016.2537202"},{"key":"e_1_2_1_42_1","volume-title":"Proceedings of ASSTA ICPhS.","author":"Teplansky Kristin J","year":"2019","unstructured":"Kristin J Teplansky, Brian Y Tsang, and Jun Wang. 2019. Tongue and Lip Motion Patterns in Voiced, Whispered, and Silent Vowel Production. 
In Proceedings of ASSTA ICPhS."},{"key":"e_1_2_1_43_1","volume-title":"Soroush Mehri, Negar Rostamzadeh, Yoshua Bengio, and Christopher J Pal.","author":"Trabelsi Chiheb","year":"2017","unstructured":"Chiheb Trabelsi, Olexa Bilaniuk, Ying Zhang, Dmitriy Serdyuk, Sandeep Subramanian, Joao Felipe Santos, Soroush Mehri, Negar Rostamzadeh, Yoshua Bengio, and Christopher J Pal. 2017. Deep Complex Networks. arXiv preprint arXiv:1705.09792 (2017)."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV45572.2020.9093345"},{"key":"e_1_2_1_45_1","volume-title":"On Training Targets for Supervised Speech Separation","author":"Wang Yuxuan","year":"2014","unstructured":"Yuxuan Wang, Arun Narayanan, and DeLiang Wang. 2014. On Training Targets for Supervised Speech Separation. IEEE\/ACM transactions on Audio, Speech, and Language Processing 22, 12 (2014), 1849--1858."},{"key":"e_1_2_1_46_1","volume-title":"Complex Ratio Masking for Monaural Speech Separation","author":"Williamson Donald S","year":"2015","unstructured":"Donald S Williamson, Yuxuan Wang, and DeLiang Wang. 2015. Complex Ratio Masking for Monaural Speech Separation. 
IEEE\/ACM transactions on audio, speech, and language processing 24, 3 (2015), 483--492."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8682245"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01444"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6489"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3478093"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3381008"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3478116"}],"container-title":["Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3550303","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3550303","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,14]],"date-time":"2025-07-14T04:44:48Z","timestamp":1752468288000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3550303"}},"subtitle":["Speech Enhancement by Interaction between Ultrasound and Speech"],"short-title":[],"issued":{"date-parts":[[2022,9,6]]},"references-count":52,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2022,9,6]]}},"alternative-id":["10.1145\/3550303"],"URL":"https:\/\/doi.org\/10.1145\/3550303","relation":{},"ISSN":["2474-9567"],"issn-type":[{"value":"2474-9567","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,9,6]]},"assertion":[{"value":"2022-09-07","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}