{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,4]],"date-time":"2025-12-04T10:10:08Z","timestamp":1764843008238,"version":"3.44.0"},"reference-count":112,"publisher":"Association for Computing Machinery (ACM)","issue":"2","funder":[{"DOI":"10.13039\/501100001809","name":"NSFC","doi-asserted-by":"crossref","award":["62222216"],"award-info":[{"award-number":["62222216"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Hong Kong UGC","award":["GRF Grant No. 17209822"],"award-info":[{"award-number":["GRF Grant No. 17209822"]}]},{"name":"Hong Kong RGC","award":["ECS Grant No. 27204522, GRF Grant No. 17212224"],"award-info":[{"award-number":["ECS Grant No. 27204522, GRF Grant No. 17212224"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."],"published-print":{"date-parts":[[2025,6,9]]},"abstract":"<jats:p>Speech enhancement is crucial for ubiquitous human-computer interaction. Recently, ultrasound-based acoustic sensing has emerged as an attractive choice for speech enhancement because of its superior ubiquity and performance. However, due to inevitable interference from unexpected and unintended sources during audio-ultrasound data acquisition, existing solutions rely heavily on human effort for data collection and processing. This leads to significant data scarcity that limits the full potential of ultrasound-based speech enhancement. To address this, we propose USPEECH, a cross-modal ultrasound synthesis framework for speech enhancement with minimal human effort. At its core is a two-stage framework that establishes the correspondence between visual and ultrasonic modalities by leveraging audio as a bridge. This approach overcomes challenges from the lack of paired video-ultrasound datasets and the inherent heterogeneity between video and ultrasound data. Our framework incorporates contrastive video-audio pre-training to project modalities into a shared semantic space and employs an audio-ultrasound encoder-decoder for ultrasound synthesis. We then present a speech enhancement network that enhances speech in the time-frequency domain and recovers the clean speech waveform via a neural vocoder. Comprehensive experiments show USpeech achieves remarkable performance using synthetic ultrasound data comparable to physical data, outperforming state-of-the-art ultrasound-based speech enhancement baselines. USPEECH is open-sourced at https:\/\/github.com\/aiot-lab\/USpeech\/.<\/jats:p>","DOI":"10.1145\/3729462","type":"journal-article","created":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T21:21:56Z","timestamp":1750281716000},"page":"1-31","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["USpeech: Ultrasound-Enhanced Speech with Minimal Human Effort via Cross-Modal Synthesis"],"prefix":"10.1145","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-3964-5874","authenticated-orcid":false,"given":"Luca Jiang-Tao","family":"Yu","sequence":"first","affiliation":[{"name":"The University of Hong Kong, Hong Kong SAR, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2496-3429","authenticated-orcid":false,"given":"Running","family":"Zhao","sequence":"additional","affiliation":[{"name":"The University of Hong Kong, Hong Kong SAR, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6615-1982","authenticated-orcid":false,"given":"Sijie","family":"Ji","sequence":"additional","affiliation":[{"name":"The University of Hong Kong, Hong Kong SAR, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3454-8731","authenticated-orcid":false,"given":"Edith C.H.","family":"Ngai","sequence":"additional","affiliation":[{"name":"The University of Hong Kong, Hong Kong SAR, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9700-4627","authenticated-orcid":false,"given":"Chenshu","family":"Wu","sequence":"additional","affiliation":[{"name":"The University of Hong Kong, Hong Kong SAR, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,6,18]]},"reference":[{"key":"e_1_2_2_1_1","volume-title":"Joon Son Chung, and Andrew Zisserman","author":"Afouras Triantafyllos","year":"2018","unstructured":"Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. 2018. The conversation: Deep audio-visual speech enhancement. arXiv preprint arXiv:1804.04121 (2018)."},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3411764.3445138"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/89.326615"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.comnet.2020.107447"},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCSP.2019.8697923"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAES.2007.4441755"},{"key":"e_1_2_2_7_1","unstructured":"Crystal Boyd. 2024. Happiness is a Journey. https:\/\/happinessisajourney.com\/ 2024-01-27."},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0095-4470(19)30777-6"},{"key":"e_1_2_2_9_1","volume-title":"Kah Phooi Seng, and Li-Minn Ang","author":"Chin Siew Wen","year":"2012","unstructured":"Siew Wen Chin, Kah Phooi Seng, and Li-Minn Ang. 2012. Audio-visual speech processing for human computer interaction. In Advances in robotics and virtual reality. Springer, 135--165."},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00718"},{"volume-title":"Lip Reading in the Wild. In Asian Conference on Computer Vision.","author":"Chung J. S.","key":"e_1_2_2_11_1","unstructured":"J. S. Chung and A. Zisserman. 2016. Lip Reading in the Wild. In Asian Conference on Computer Vision."},{"key":"e_1_2_2_12_1","volume-title":"Joon Son Chung, and Hong-Goo Kang","author":"Chung Soo-Whan","year":"2020","unstructured":"Soo-Whan Chung, Soyeon Choe, Joon Son Chung, and Hong-Goo Kang. 2020. Facefilter: Audio-visual speech separation using still images. arXiv preprint arXiv:2005.07074 (2020)."},{"key":"e_1_2_2_13_1","doi-asserted-by":"crossref","unstructured":"F. L. Darley A. E. Aronson and J. R. Brown. 1975. Motor Speech Disorders (3 ed.). W.B. Saunders Company Philadelphia PA.","DOI":"10.3109\/asl2.1975.3.issue-1.03"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_2_2_15_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3550303","article-title":"UltraSpeech: Speech Enhancement by Interaction between Ultrasound and Speech","volume":"6","author":"Ding Han","year":"2022","unstructured":"Han Ding, Yizhan Wang, Hao Li, Cui Zhao, Ge Wang, Wei Xi, and Jizhong Zhao. 2022. UltraSpeech: Speech Enhancement by Interaction between Ultrasound and Speech. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 3 (2022), 1--25.","journal-title":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3613904.3642095"},{"key":"e_1_2_2_17_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3631447"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASSP.1984.1164453"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/89.397090"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7953127"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178061"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/MAES.2015.7119820"},{"key":"e_1_2_2_24_1","unstructured":"Grant Fairbanks. 1960. Voice and articulation drillbook. (1960)."},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM53939.2023.10229085"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00630"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.4236\/jsip.2018.94015"},{"key":"e_1_2_2_28_1","first-page":"1170","article-title":"Visual Speech Enhancement","volume":"2018","author":"Gabbay Aviv","year":"2018","unstructured":"Aviv Gabbay, Asaph Shamir, and Shmuel Peleg. 2018. Visual Speech Enhancement. In Proc. Interspeech 2018.1170-1174.","journal-title":"Proc. Interspeech"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/89.701367"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01524"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3411830"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.6028\/NIST.IR.4930"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7952261"},{"volume-title":"Articulatory phonetics","author":"Gick Bryan","key":"e_1_2_2_34_1","unstructured":"Bryan Gick, Ian Wilson, and Donald Derrick. 2013. Articulatory phonetics. John Wiley & Sons."},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3379337.3415901"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2022.3224688"},{"key":"e_1_2_2_37_1","volume-title":"2012 9th European Radar Conference. IEEE, 198--201","author":"Groot SR","year":"2012","unstructured":"SR Groot, AG Yarovoy, RIA Harmanny, and JN Driessen. 2012. Model-based classification of human motion: Particle filtering applied to the micro-Doppler spectrum. In 2012 9th European Radar Conference. IEEE, 198--201."},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3581791.3596832"},{"key":"e_1_2_2_39_1","volume-title":"Comma gets a cure. diagnostic passage","author":"Honorof Douglas N","year":"2000","unstructured":"Douglas N Honorof, Jill McCullough, and Barbara Somerville. 2000. Comma gets a cure. diagnostic passage (2000)."},{"key":"e_1_2_2_40_1","unstructured":"Wei-Ning Hsu Tal Remez Bowen Shi Jacob Donley and Yossi Adi. 2022. ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement. arXiv:2212.11377 [eess.AS] https:\/\/arxiv.org\/abs\/2212.11377"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.5555\/2209820.2210675"},{"key":"e_1_2_2_42_1","volume-title":"DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement. arXiv preprint arXiv:2008.00264","author":"Hu Yanxin","year":"2020","unstructured":"Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, and Lei Xie. 2020. DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement. arXiv preprint arXiv:2008.00264 (2020)."},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/HSI.2018.8431232"},{"key":"e_1_2_2_44_1","unstructured":"International Telecommunication Union. 2007. ITU-T Recommendation P.862.2: Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs. https:\/\/www.itu.int\/rec\/T-REC-P.862.2. Accessed: [23 MAR 2024]."},{"key":"e_1_2_2_45_1","unstructured":"Keith Ito and Linda Johnson. 2017. The LJ Speech Dataset. https:\/\/keithito.com\/LJ-Speech-Dataset\/."},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3534613"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2002.5745591"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/LGRS.2015.2452311"},{"key":"e_1_2_2_49_1","first-page":"2758","article-title":"Lip to speech synthesis with visual context attentional gan","volume":"34","author":"Kim Minsu","year":"2021","unstructured":"Minsu Kim, Joanna Hong, and Yong Man Ro. 2021. Lip to speech synthesis with visual context attentional gan. Advances in Neural Information Processing Systems 34 (2021), 2758--2770.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2020.3030497"},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33012588"},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3411841"},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3586183.3606775"},{"key":"e_1_2_2_54_1","volume-title":"Conv-tasnet: Surpassing ideal time-frequency magnitude masking for speech separation","author":"Luo Yi","year":"2019","unstructured":"Yi Luo and Nima Mesgarani. 2019. Conv-tasnet: Surpassing ideal time-frequency magnitude masking for speech separation. IEEE\/ACM transactions on audio, speech, and language processing 27, 8 (2019), 1256--1266."},{"key":"e_1_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2973750.2973755"},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-021-11291-3"},{"key":"e_1_2_2_57_1","unstructured":"R\u00d8DE Microphones. Accessed: 2025-01-12. Wireless GO II Dual Wireless Microphone System. https:\/\/rode.com\/en-us\/microphones\/wireless\/wirelessgoii?variant_sku=WIGOII."},{"key":"e_1_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3027063.3027086"},{"key":"e_1_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSA.2005.860774"},{"key":"e_1_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/2858036.2858580"},{"key":"e_1_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2013.6639038"},{"key":"e_1_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2014.2305833"},{"key":"e_1_2_2_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2016.2580946"},{"key":"e_1_2_2_64_1","volume-title":"Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748","author":"van den Oord Aaron","year":"2018","unstructured":"Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)."},{"key":"e_1_2_2_65_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01231-1_39"},{"key":"e_1_2_2_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2023.3250846"},{"key":"e_1_2_2_67_1","volume-title":"The importance of phase in speech enhancement. speech communication 53, 4","author":"Paliwal Kuldip","year":"2011","unstructured":"Kuldip Paliwal, Kamil W\u00f3jcicki, and Benjamin Shannon. 2011. The importance of phase in speech enhancement. speech communication 53, 4 (2011), 465--494."},{"key":"e_1_2_2_68_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8683634"},{"key":"e_1_2_2_69_1","volume-title":"SEGAN: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452","author":"Pascual Santiago","year":"2017","unstructured":"Santiago Pascual, Antonio Bonafonte, and Joan Serra. 2017. SEGAN: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452 (2017)."},{"key":"e_1_2_2_70_1","doi-asserted-by":"publisher","DOI":"10.1145\/2733373.2806390"},{"key":"e_1_2_2_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/3DV.2017.00011"},{"volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Prajwal K R","key":"e_1_2_2_72_1","unstructured":"K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, and C.V. Jawahar. 2020. Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)."},{"volume-title":"Fundamentals of speech recognition","author":"Rabiner Lawrence R","key":"e_1_2_2_73_1","unstructured":"Lawrence R Rabiner and Biing-Hwang Juang. 1999. Fundamentals of speech recognition. Tsinghua University Press."},{"key":"e_1_2_2_74_1","doi-asserted-by":"publisher","DOI":"10.1049\/sil2.12233"},{"key":"e_1_2_2_75_1","volume-title":"U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention-MICCAI 2015: 18th international conference","author":"Ronneberger Olaf","year":"2015","unstructured":"Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention-MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer, 234--241."},{"key":"e_1_2_2_76_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAU.1969.1162058"},{"key":"e_1_2_2_77_1","unstructured":"Seeing Speech. n.d.. How UTI Works. https:\/\/www.seeingspeech.ac.uk\/how-uti-works\/. Accessed: 2024-03-31."},{"volume-title":"Diversified radar micro-Doppler simulations as training data for deep residual neural networks. In 2018 IEEE radar Conference (radarConf18)","author":"Seyfioglu Mehmet S","key":"e_1_2_2_78_1","unstructured":"Mehmet S Seyfioglu, Baris Erol, Sevgi Z Gurbuz, and Moeness G Amin. 2018. Diversified radar micro-Doppler simulations as training data for deep residual neural networks. In 2018 IEEE radar Conference (radarConf18). IEEE, 0612--0617."},{"key":"e_1_2_2_79_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAES.2018.2883847"},{"key":"e_1_2_2_80_1","volume-title":"International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=Z1Qlm11uOM","author":"Shi Bowen","year":"2022","unstructured":"Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, and Abdelrahman Mohamed. 2022. Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=Z1Qlm11uOM"},{"key":"e_1_2_2_81_1","volume-title":"A guide to analysing tongue motion from ultrasound images. Clinical linguistics & phonetics 19, 6--7","author":"Stone Maureen","year":"2005","unstructured":"Maureen Stone. 2005. A guide to analysing tongue motion from ultrasound images. Clinical linguistics & phonetics 19, 6--7 (2005), 455--501."},{"key":"e_1_2_2_82_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9413901"},{"key":"e_1_2_2_83_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447993.3448626"},{"key":"e_1_2_2_84_1","doi-asserted-by":"publisher","DOI":"10.1145\/3241539.3241568"},{"key":"e_1_2_2_85_1","doi-asserted-by":"publisher","DOI":"10.1109\/TBCAS.2020.2988121"},{"key":"e_1_2_2_86_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2010.5495701"},{"key":"e_1_2_2_87_1","volume-title":"SEANet: A multi-modal speech enhancement network. arXiv preprint arXiv:2009.02095","author":"Tagliasacchi Marco","year":"2020","unstructured":"Marco Tagliasacchi, Yunpeng Li, Karolis Misiunas, and Dominik Roblek. 2020. SEANet: A multi-modal speech enhancement network. arXiv preprint arXiv:2009.02095 (2020)."},{"key":"e_1_2_2_88_1","volume-title":"Proc. International Congress of Phonetic Sciences. 1--5.","author":"Teplansky Kristin J","year":"2019","unstructured":"Kristin J Teplansky, Brian Y Tsang, and Jun Wang. 2019. Tongue and lip motion patterns in voiced, whispered, and silent vowel production. In Proc. International Congress of Phonetic Sciences. 1--5."},{"key":"e_1_2_2_89_1","unstructured":"University of Wisconsin - Dialect Research. 2023. Arthur the Rat. https:\/\/dare.wisc.edu\/audio\/arthur-the-rat\/. Accessed: 2023-09-30."},{"key":"e_1_2_2_90_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.492"},{"key":"e_1_2_2_91_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_2_2_92_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-662-04230-4"},{"key":"e_1_2_2_93_1","doi-asserted-by":"publisher","DOI":"10.5555\/3232296.3232303"},{"key":"e_1_2_2_94_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2022.3209943"},{"key":"e_1_2_2_95_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV45572.2020.9093345"},{"key":"e_1_2_2_96_1","volume-title":"Image quality assessment: from error visibility to structural similarity","author":"Wang Zhou","year":"2004","unstructured":"Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13, 4 (2004), 600--612."},{"key":"e_1_2_2_97_1","volume-title":"Complex ratio masking for monaural speech separation","author":"Williamson Donald S","year":"2015","unstructured":"Donald S Williamson, Yuxuan Wang, and DeLiang Wang. 2015. Complex ratio masking for monaural speech separation. IEEE\/ACM transactions on audio, speech, and language processing 24, 3 (2015), 483--492."},{"volume-title":"Year of Publication. The North Wind and the Sun","author":"Winter Milo","key":"e_1_2_2_98_1","unstructured":"Milo Winter. Year of Publication. The North Wind and the Sun. http:\/\/mythfolklore.net\/aesopica\/milowinter\/141.htm. 2024--01-27."},{"key":"e_1_2_2_99_1","volume-title":"Development","author":"Wolfe Joe","year":"2020","unstructured":"Joe Wolfe, Ma\u00ebva Garnier, Nathalie Henrich Bernardoni, and John Smith. 2020. The Mechanics and Acoustics of the Singing Voice: Registers, Resonances and the Source-Filter Interaction. In The Routledge Companion to Interdisciplinary Studies in Singing, Volume I: Development. Routledge, 64--78."},{"key":"e_1_2_2_100_1","volume-title":"An experimental study on speech enhancement based on deep neural networks","author":"Xu Yong","year":"2013","unstructured":"Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee. 2013. An experimental study on speech enhancement based on deep neural networks. IEEE Signal processing letters 21, 1 (2013), 65--68."},{"key":"e_1_2_2_101_1","volume-title":"A regression approach to speech enhancement based on deep neural networks","author":"Xu Yong","year":"2014","unstructured":"Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee. 2014. A regression approach to speech enhancement based on deep neural networks. IEEE\/ACM transactions on audio, speech, and language processing 23, 1 (2014), 7--19."},{"key":"e_1_2_2_102_1","unstructured":"Junichi Yamagishi. 2012. English multi-speaker corpus for CSTR voice cloning toolkit. http:\/\/homepages.inf.ed.ac.uk\/jyamagis\/page3\/page58\/page58.html"},{"key":"e_1_2_2_103_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9053795"},{"key":"e_1_2_2_104_1","doi-asserted-by":"publisher","DOI":"10.1145\/3625687.3625792"},{"key":"e_1_2_2_105_1","volume-title":"LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading. arXiv preprint arXiv:2306.03258","author":"Yemini Yochai","year":"2023","unstructured":"Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot, and Ethan Fetaya. 2023. LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading. arXiv preprint arXiv:2306.03258 (2023)."},{"key":"e_1_2_2_106_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6489"},{"key":"e_1_2_2_107_1","doi-asserted-by":"publisher","DOI":"10.1145\/3081333.3081356"},{"key":"e_1_2_2_108_1","doi-asserted-by":"publisher","DOI":"10.1145\/3494990"},{"key":"e_1_2_2_109_1","doi-asserted-by":"publisher","DOI":"10.1145\/3594738.3611365"},{"key":"e_1_2_2_110_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544548.3580801"},{"key":"e_1_2_2_111_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2022-738"},{"key":"e_1_2_2_112_1","doi-asserted-by":"publisher","DOI":"10.1145\/3610873"}],"container-title":["Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3729462","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T12:23:26Z","timestamp":1755865406000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3729462"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,9]]},"references-count":112,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,6,9]]}},"alternative-id":["10.1145\/3729462"],"URL":"https:\/\/doi.org\/10.1145\/3729462","relation":{},"ISSN":["2474-9567"],"issn-type":[{"type":"electronic","value":"2474-9567"}],"subject":[],"published":{"date-parts":[[2025,6,9]]},"assertion":[{"value":"2025-06-18","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}