{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,9]],"date-time":"2026-05-09T04:29:15Z","timestamp":1778300955336,"version":"3.51.4"},"reference-count":61,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2023,3,27]],"date-time":"2023-03-27T00:00:00Z","timestamp":1679875200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62272213"],"award-info":[{"award-number":["62272213"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61872173"],"award-info":[{"award-number":["61872173"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."],"published-print":{"date-parts":[[2023,3,27]]},"abstract":"<jats:p>Silent speech recognition (SSR) allows users to speak to the device without making a sound, avoiding being overheard or disturbing others. Compared to the video-based approach, wireless signal-based SSR can work when the user is wearing a mask and has fewer privacy concerns. However, previous wireless-based systems are still far from well-studied, e.g. they are only evaluated in corpus with highly limited size, making them only feasible for interaction with dozens of deterministic commands. In this paper, we present mSilent, a millimeter-wave (mmWave) based SSR system that can work in the general corpus containing thousands of daily conversation sentences. 
With the strong recognition capability, mSilent not only supports the more complex interaction with assistants, but also enables more general applications in daily life such as communication and input. To extract fine-grained articulatory features, we build a signal processing pipeline that uses a clustering-selection algorithm to separate articulatory gestures and generates a multi-scale detrended spectrogram (MSDS). To handle the complexity of the general corpus, we design an end-to-end deep neural network that consists of a multi-branch convolutional front-end and a Transformer-based sequence-to-sequence back-end. We collect a general corpus dataset of 1,000 daily conversation sentences that contains 21K samples of bi-modality data (mmWave and video). Our evaluation shows that mSilent achieves a 9.5% average word error rate (WER) at a distance of 1.5m, which is comparable to the performance of the state-of-the-art video-based approach. We also explore deploying mSilent in two typical scenarios of text entry and in-car assistant, and the less than 6% average WER demonstrates the potential of mSilent in general daily applications.<\/jats:p>","DOI":"10.1145\/3580838","type":"journal-article","created":{"date-parts":[[2023,3,28]],"date-time":"2023-03-28T14:57:51Z","timestamp":1680015471000},"page":"1-28","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":28,"title":["mSilent"],"prefix":"10.1145","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2495-7109","authenticated-orcid":false,"given":"Shang","family":"Zeng","sequence":"first","affiliation":[{"name":"Nanjing University, Nanjing, Jiangsu, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9726-195X","authenticated-orcid":false,"given":"Haoran","family":"Wan","sequence":"additional","affiliation":[{"name":"Nanjing University, Nanjing, Jiangsu, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4841-8388","authenticated-orcid":false,"given":"Shuyu","family":"Shi","sequence":"additional","affiliation":[{"name":"Nanjing University, Nanjing, Jiangsu, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9882-2090","authenticated-orcid":false,"given":"Wei","family":"Wang","sequence":"additional","affiliation":[{"name":"Nanjing University, Nanjing, Jiangsu, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,3,28]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Andrew Senior, Oriol Vinyals, and Andrew Zisserman.","author":"Afouras Triantafyllos","year":"2018","unstructured":"Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. 2018. Deep Audio-Visual Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018)."},{"key":"e_1_2_1_2_1","volume-title":"Lipnet: End-to-end Sentence-level Lipreading. arXiv preprint arXiv:1611.01599","author":"Assael Yannis M","year":"2016","unstructured":"Yannis M Assael, Brendan Shillingford, Shimon Whiteson, and Nando De Freitas. 2016. Lipnet: End-to-end Sentence-level Lipreading. 
arXiv preprint arXiv:1611.01599 (2016)."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1093\/llc\/7.1.1"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/SP46214.2022.9833568"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3550325"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447993.3483251"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1121\/1.2229005"},{"key":"e_1_2_1_8_1","volume-title":"Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL '17)","author":"Eric Mihail","unstructured":"Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. 2017. Key-Value Retrieval Networks for Task-Oriented Dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL '17). 37--49."},{"key":"e_1_2_1_9_1","volume-title":"International Conference on Machine Learning (ICML '15)","author":"Ganin Yaroslav","year":"2015","unstructured":"Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised Domain Adaptation by Backpropagation. In International Conference on Machine Learning (ICML '15). PMLR, 1180--1189."},{"key":"e_1_2_1_10_1","first-page":"1","article-title":"EchoWhisper: Exploring an Acoustic-based Silent Speech Interface for Smartphone Users","volume":"4","author":"Gao Yang","year":"2020","unstructured":"Yang Gao, Yincheng Jin, Jiyang Li, Seokmin Choi, and Zhanpeng Jin. 2020. EchoWhisper: Exploring an Acoustic-based Silent Speech Interface for Smartphone Users. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 4, 3 (2020), 1--27.","journal-title":"Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/NAECON.2015.7443084"},{"key":"e_1_2_1_12_1","unstructured":"Google. 2022. Google Soli Products. 
https:\/\/www.atap.google.com\/soli\/products\/"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-3015"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3372224.3419982"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1016\/0167-9457(83)90004-0"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_17_1","unstructured":"Sepp Hochreiter Yoshua Bengio Paolo Frasconi J\u00fcrgen Schmidhuber et al. 2001. Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-term Dependencies."},{"key":"e_1_2_1_18_1","volume-title":"The Fundamentals of Millimeter Wave Sensors. Texas Instruments","author":"Iovescu Cesar","year":"2017","unstructured":"Cesar Iovescu and Sandeep Rao. 2017. The Fundamentals of Millimeter Wave Sensors. Texas Instruments (2017), 1--8."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/IMS30576.2020.9223838"},{"key":"e_1_2_1_20_1","first-page":"1","article-title":"EarCommand: \"Hearing\" Your Silent Speech Commands In Ear","volume":"6","author":"Jin Yincheng","year":"2022","unstructured":"Yincheng Jin, Yang Gao, Xuhai Xu, Seokmin Choi, Jiyang Li, Feng Liu, Zhengxiong Li, and Zhanpeng Jin. 2022. EarCommand: \"Hearing\" Your Silent Speech Commands In Ear. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 6, 2 (2022), 1--28.","journal-title":"Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR 22)","author":"Kaplan Jared","year":"2022","unstructured":"Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2022. Scaling Laws for Neural Language Models. 
In Proceedings of the International Conference on Learning Representations (ICLR 22)."},{"key":"e_1_2_1_22_1","volume-title":"Proceedings of the Third Conference on Empirical Methods for Natural Language Processing (EMNLP '98)","author":"Kilgarriff Adam","year":"1998","unstructured":"Adam Kilgarriff and Tony Rose. 1998. Measures for Corpus Similarity and Homogeneity. In Proceedings of the Third Conference on Empirical Methods for Natural Language Processing (EMNLP '98). 46--52."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7953075"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1139"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3491102.3502015"},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR '15)","author":"Kingma Diederik P","year":"2015","unstructured":"Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR '15)."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3498361.3538926"},{"key":"e_1_2_1_28_1","volume-title":"Network in Network. arXiv preprint arXiv:1312.4400","author":"Lin Min","year":"2013","unstructured":"Min Lin, Qiang Chen, and Shuicheng Yan. 2013. Network in Network. arXiv preprint arXiv:1312.4400 (2013)."},{"key":"e_1_2_1_29_1","volume-title":"Yonina C Eldar, and Stefano Buzzi.","author":"Liu Fan","year":"2022","unstructured":"Fan Liu, Yuanhao Cui, Christos Masouros, Jie Xu, Tony Xiao Han, Yonina C Eldar, and Stefano Buzzi. 2022. Integrated Sensing and Communications: Towards dual-Functional Wireless Networks for 6G and beyond. 
IEEE Journal on Selected Areas in Communications (2022)."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNET.2019.2891733"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9414567"},{"key":"e_1_2_1_32_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR '21)","author":"Ma Shuang","year":"2021","unstructured":"Shuang Ma, Zhaoyang Zeng, Daniel McDuff, and Yale Song. 2021. Active Contrastive Learning of Audio-Visual Video Representations. In Proceedings of the International Conference on Learning Representations (ICLR '21)."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR.2006.479"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3411764.3445430"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2019-2680"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00510"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-015-2774-3"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2634317.2634322"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3381010"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2018-1565"},{"key":"e_1_2_1_41_1","volume-title":"International Conference on Learning Representations (ICLR '22)","author":"Shi Bowen","year":"2022","unstructured":"Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, and Abdelrahman Mohamed. 2022. Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction. In International Conference on Learning Representations (ICLR '22)."},{"key":"e_1_2_1_42_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR '15)","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. 
Very Deep Convolutional Networks for Large-scale Image Recognition. In Proceedings of the International Conference on Learning Representations (ICLR '15)."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.367"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/642611.642632"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3242587.3242599"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447993.3448626"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.nlposs-1.9"},{"key":"e_1_2_1_48_1","volume-title":"Proceedings of the International Congress of Phonetic Sciences (ICPhS '19)","author":"Teplansky Kristin J","year":"2019","unstructured":"Kristin J Teplansky, Brian Y Tsang, and Jun Wang. 2019. Tongue and Lip Motion Patterns in Voiced, Whispered, and Silent Vowel Production. In Proceedings of the International Congress of Phonetic Sciences (ICPhS '19). 1--5."},{"key":"e_1_2_1_49_1","volume-title":"Advances in Neural Information Processing Systems (NeurIPS '17)","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Advances in Neural Information Processing Systems (NeurIPS '17)."},{"key":"e_1_2_1_50_1","doi-asserted-by":"crossref","unstructured":"Michael Wand, Christopher Schulte, Matthias Janke, and Tanja Schultz. 2013. Array-based Electromyographic Silent Speech Interface. In Biosignals. 89--96.","DOI":"10.5220\/0004252400890096"},{"key":"e_1_2_1_51_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3534592","article-title":"Wavesdropper: Through-wall Word Detection of Human Speech via Commercial mmWave Devices","volume":"6","author":"Wang Chao","year":"2022","unstructured":"Chao Wang, Feng Lin, Zhongjie Ba, Fan Zhang, Wenyao Xu, and Kui Ren. 2022. 
Wavesdropper: Through-wall Word Detection of Human Speech via Commercial mmWave Devices. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 6, 2 (2022), 1--26.","journal-title":"Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."},{"key":"e_1_2_1_52_1","volume-title":"Proceedings of the 20th Annual International Conference on Mobile Computing and Networking (MobiCom '14)","author":"Wang Guanhua","unstructured":"Guanhua Wang, Yongpan Zou, Zimu Zhou, Kaishun Wu, and Lionel M. Ni. 2014. We Can Hear You with Wi-Fi!. In Proceedings of the 20th Annual International Conference on Mobile Computing and Networking (MobiCom '14). 593--604."},{"key":"e_1_2_1_53_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3369812","article-title":"RFID Tattoo: A Wireless Platform for Speech Recognition","volume":"3","author":"Wang Jingxian","year":"2019","unstructured":"Jingxian Wang, Chengfeng Pan, Haojian Jin, Vaibhav Singh, Yash Jain, Jason I Hong, Carmel Majidi, and Swarun Kumar. 2019. RFID Tattoo: A Wireless Platform for Speech Recognition. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 3, 4 (2019), 1--24.","journal-title":"Proc. ACM Interact. Mob. 
Wearable Ubiquitous Technol."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3534601"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/IMS30576.2020.9223988"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM41043.2020.9155293"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/3307334.3326073"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3133956.3133962"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3494990"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/3494990"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3432192"}],"container-title":["Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3580838","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3580838","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,14]],"date-time":"2025-07-14T04:42:31Z","timestamp":1752468151000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3580838"}},"subtitle":["Towards General Corpus Silent Speech Recognition Using COTS mmWave Radar"],"short-title":[],"issued":{"date-parts":[[2023,3,27]]},"references-count":61,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,3,27]]}},"alternative-id":["10.1145\/3580838"],"URL":"https:\/\/doi.org\/10.1145\/3580838","relation":{},"ISSN":["2474-9567"],"issn-type":[{"value":"2474-9567","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,3,27]]},"assertion":[{"value":"2023-03-28","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}