{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,6]],"date-time":"2026-02-06T00:17:33Z","timestamp":1770337053501,"version":"3.49.0"},"reference-count":77,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2021,3,19]],"date-time":"2021-03-19T00:00:00Z","timestamp":1616112000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."],"published-print":{"date-parts":[[2021,3,19]]},"abstract":"<jats:p>As a natural and convenient interaction modality, voice input has now become indispensable to smart devices (e.g. mobile phones and smart appliances). However, voice input is strongly constrained by surroundings and may raise privacy leakage in public areas. In this paper, we present SoundLip, an end-to-end interaction system enabling users to interact with smart devices via silent voice input. The key insight is to use inaudible acoustic signals to capture the lip movements of users when they issue commands. Previous works have considered lip reading as a naive classification task and thus can only recognize individual words. In contrast, our proposed system enables lip reading at both word and sentence levels, which are more suitable for daily-life use. We exploit the built-in speakers and microphones of smart devices to emit acoustic signals and listen to their reflections, respectively. In order to better abstract representations from multi-frequency and multi-modality acoustic signals, we elaborate a hierarchical convolutional neural network (HCNN) to serve as the front-end as well as recognize individual word commands. 
Then, for the sentence-level recognition, we exploit a multi-task encoder-decoder network to get around temporal segmentation and output sentences in an end-to-end way. We evaluate SoundLip on 20 individual words and 70 sentences from 12 participants. Our system achieves an accuracy of 91.2% at word-level and a word error rate of 7.1% at sentence-level in both user-independent and environment-independent settings. Given its innovative solution and promising performance, we believe that SoundLip has made a significant contribution to the advancement of silent voice input technology.<\/jats:p>","DOI":"10.1145\/3448087","type":"journal-article","created":{"date-parts":[[2021,3,30]],"date-time":"2021-03-30T18:56:41Z","timestamp":1617130601000},"page":"1-28","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":24,"title":["SoundLip"],"prefix":"10.1145","volume":"5","author":[{"given":"Qian","family":"Zhang","sequence":"first","affiliation":[{"name":"Shanghai Jiao Tong University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dong","family":"Wang","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Run","family":"Zhao","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yinggang","family":"Yu","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,3,30]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Andrew Senior, Oriol Vinyals, and Andrew Zisserman.","author":"Afouras Triantafyllos","year":"2018","unstructured":"Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. 2018. Deep Audio-visual Speech Recognition. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2018), 1--1."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3302506.3310391"},{"key":"e_1_2_1_3_1","volume-title":"Chang","author":"Anumanchipalli Gopala K.","year":"2019","unstructured":"Gopala K. Anumanchipalli, Josh Chartier, and Edward F. Chang. 2019. Speech synthesis from neural decoding of spoken sentences. Nature 568, 7753 (2019), 493--498."},{"key":"e_1_2_1_4_1","volume-title":"Lipnet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599","author":"Assael Yannis M","year":"2016","unstructured":"Yannis M Assael, Brendan Shillingford, Shimon Whiteson, and Nando De Freitas. 2016. Lipnet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599 (2016)."},{"key":"e_1_2_1_5_1","volume-title":"Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473","author":"Bahdanau Dzmitry","year":"2014","unstructured":"Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014)."},{"key":"e_1_2_1_6_1","volume-title":"2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4945--4949","author":"Bahdanau D.","unstructured":"D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio. 2016. End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4945--4949."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2015.310"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSMCA.2010.2041656"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2010.01.001"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2016.7472621"},{"key":"e_1_2_1_11_1","volume-title":"Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv: Neural and Evolutionary Computing","author":"Chung Junyoung","year":"2014","unstructured":"Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. 
arXiv: Neural and Evolutionary Computing (2014)."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.367"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2009.08.002"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341162.3343756"},{"key":"e_1_2_1_15_1","volume-title":"Proceedings of the 16th International Conference on Human Interface and the Management of Information. Information and Knowledge in Applications and Services -","volume":"8522","author":"Moorthy Aarthi Easwara","unstructured":"Aarthi Easwara Moorthy and Kim-Phuong L. Vu. 2014. Voice Activated Personal Assistant: Acceptability of Use in the Public Space. In Proceedings of the 16th International Conference on Human Interface and the Management of Information. Information and Knowledge in Applications and Services - Volume 8522. Springer-Verlag, Berlin, Heidelberg, 324--334. https:\/\/doi.org\/10.1007\/978-3-319-07863-2_32"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3241539.3241559"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3242587.3242603"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1143844.1143891"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2207676.2208331"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2906388.2906396"},{"key":"e_1_2_1_21_1","volume-title":"Deep Residual Learning for Image Recognition. 
In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_22_1","unstructured":"Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861 [cs.CV]"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/2209820.2210675"},{"key":"e_1_2_1_24_1","unstructured":"Google Inc. [n.d.]. Cloud Speech-to-Text. https:\/\/cloud.google.com\/speech-to-text\/"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2017.2738568"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3172944.3172977"},{"key":"e_1_2_1_27_1","volume-title":"2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4835--4839","author":"Kim S.","unstructured":"S. Kim, T. Hori, and S. Watanabe. 2017. 
Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4835--4839."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300376"},{"key":"e_1_2_1_29_1","volume-title":"Adam: A Method for Stochastic Optimization. arXiv: Learning","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv: Learning (2014)."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2015-711"},{"key":"e_1_2_1_31_1","first-page":"2579","article-title":"Visualizing Data using t-SNE","volume":"9","author":"Maaten Laurens Van Der","year":"2008","unstructured":"Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9, 2605 (2008), 2579--2605.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_1_32_1","volume-title":"UltraGesture: Fine-Grained Gesture Sensing and Recognition. In 2018 15th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON). 1--9.","author":"Ling K.","unstructured":"K. Ling, H. Dai, Y. Liu, and A. X. Liu. 2018. UltraGesture: Fine-Grained Gesture Sensing and Recognition. In 2018 15th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON). 
1--9."},{"key":"e_1_2_1_33_1","volume-title":"Proceedings of the 34th International Conference on Machine Learning -","volume":"70","author":"Liu Hairong","year":"2017","unstructured":"Hairong Liu, Zhenyao Zhu, Xiangang Li, and Sanjeev Satheesh. 2017. Gram-CTC: Automatic Unit Selection and Target Decomposition for Sequence Labelling. In Proceedings of the 34th International Conference on Machine Learning - Volume 70. 2188--2197."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNET.2019.2891733"},{"key":"e_1_2_1_35_1","volume-title":"Manning","author":"Luong Minh Thang","year":"2015","unstructured":"Minh Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. (2015)."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2973750.2973755"},{"key":"e_1_2_1_37_1","volume-title":"RNN-Based Room Scale Hand Motion Tracking. In The 25th Annual International Conference on Mobile Computing and Networking","author":"Mao Wenguang","year":"2019","unstructured":"Wenguang Mao, Mei Wang, Wei Sun, Lili Qiu, Swadhin Pradhan, and Yi-Chao Chen. 2019. RNN-Based Room Scale Hand Motion Tracking. 
In The 25th Annual International Conference on Mobile Computing and Networking (Los Cabos, Mexico) (MobiCom '19). Association for Computing Machinery, New York, NY, USA, Article 38, 16 pages. https:\/\/doi.org\/10.1145\/3300061.3345439"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3081333.3081362"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3287058"},{"key":"e_1_2_1_40_1","volume-title":"Hearing lips and seeing voices. Nature 264, 5588","author":"McGurk Harry","year":"1976","unstructured":"Harry McGurk and John MacDonald. 1976. Hearing lips and seeing voices. Nature 264, 5588 (1976), 746--748."},{"key":"e_1_2_1_41_1","unstructured":"Tomas Mikolov, Kai Chen, Greg S Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. (2013)."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1080\/10447318.2014.986642"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2742647.2742674"},{"key":"e_1_2_1_44_1","volume-title":"Proceedings of the 28th International Conference on International Conference on Machine Learning. 689--696","author":"Ngiam Jiquan","unstructured":"Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. 2011. Multimodal Deep Learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning. 
689--696."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2019-2680"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/1322263.1322265"},{"key":"e_1_2_1_47_1","volume-title":"C G\u00f3rriz, and R Ramirez Camacho.","author":"Valiente A Rodr\u00edguez","year":"2014","unstructured":"A Rodr\u00edguez Valiente, A Trinidad, JR Garc\u00eda Berrocal, C G\u00f3rriz, and R Ramirez Camacho. 2014. Extended high-frequency (9-20 kHz) audiometry reference thresholds in 645 healthy subjects. International journal of audiology 53, 8 (2014), 531--545."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/2971648.2971736"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2017.2752365"},{"key":"e_1_2_1_50_1","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. (2014)."},{"key":"e_1_2_1_51_1","unstructured":"Kihyuk Sohn, Wenling Shang, and Honglak Lee. 2014. Improved Multimodal Deep Learning with Variation of Information. In Advances in Neural Information Processing Systems 27. 2141--2149."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3242587.3242599"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3241539.3241568"},{"key":"e_1_2_1_54_1","unstructured":"Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. 
Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 3104--3112."},{"key":"e_1_2_1_55_1","doi-asserted-by":"crossref","unstructured":"Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. (2016) 2818--2826.","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM.2017.8057099"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/3191768"},{"key":"e_1_2_1_58_1","volume-title":"Fundamentals of wireless communication","author":"Tse David","unstructured":"David Tse and Pramod Viswanath. 2005. Fundamentals of wireless communication. Cambridge university press."},{"key":"e_1_2_1_59_1","unstructured":"voicebot.ai. 2019. Voice Assistant Consumer Adoption Report. https:\/\/voicebot.ai\/wp-content\/uploads\/2019\/01\/voice-assistant-consumer-adoption-report-2018-voicebot.pdf"},{"key":"e_1_2_1_60_1","doi-asserted-by":"crossref","unstructured":"Michael Wand, Tanja Schultz, and J\u00fcrgen Schmidhuber. 2018. 
Domain-Adversarial Training for Session Independent EMG-based Speech Recognition. In Interspeech. 3167--3171.","DOI":"10.21437\/Interspeech.2018-2318"},{"key":"e_1_2_1_61_1","volume-title":"Proceedings of the 20th Annual International Conference on Mobile Computing and Networking (MobiCom '14)","author":"Wang Guanhua","unstructured":"Guanhua Wang, Yongpan Zou, Zimu Zhou, Kaishun Wu, and Lionel M. Ni. 2014. We Can Hear You with Wi-Fi!. In Proceedings of the 20th Annual International Conference on Mobile Computing and Networking (MobiCom '14). 593--604."},{"key":"e_1_2_1_62_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3369812","article-title":"RFID Tattoo: A Wireless Platform for Speech Recognition","volume":"3","author":"Wang Jingxian","year":"2019","unstructured":"Jingxian Wang, Chengfeng Pan, Haojian Jin, Vaibhav Singh, Yash Jain, Jason I Hong, Carmel Majidi, and Swarun Kumar. 2019. RFID Tattoo: A Wireless Platform for Speech Recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3, 4 (2019), 1--24.","journal-title":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"},{"key":"e_1_2_1_63_1","volume-title":"Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking. 82--94","author":"Wang Wei","year":"2016","unstructured":"Wei Wang, Alex X Liu, and Ke Sun. 2016. Device-free gesture tracking using acoustic signals. 
In Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking. 82--94."},{"key":"e_1_2_1_64_1","volume-title":"Push the Limit of Acoustic Gesture Recognition. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications. 566--575","author":"Wang Y.","unstructured":"Y. Wang, J. Shen, and Y. Zheng. 2020. Push the Limit of Acoustic Gesture Recognition. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications. 566--575."},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2019.2944058"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/3307334.3326073"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM.2017.8057022"},{"key":"e_1_2_1_68_1","unstructured":"Xiangyu Xu, Jiadi Yu, Yingying Chen, Yanmin Zhu, Linghe Kong, and Minglu Li. 2019. BreathListener: Fine-grained Breathing Monitoring in Driving Environments Utilizing Acoustic Signals. (2019) 54--66."},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1145\/3332165.3347950"},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1145\/3359266"},{"key":"e_1_2_1_71_1","unstructured":"Fisher Yu and Vladlen Koltun. 2016. Multi-Scale Context Aggregation by Dilated Convolutions. 
(2016)."},{"key":"e_1_2_1_72_1","first-page":"1","article-title":"An Indirect Eavesdropping Attack of Keystrokes on Touch Screen through Acoustic Sensing","volume":"99","author":"Yu Jiadi","year":"2019","unstructured":"Jiadi Yu, Li Lu, Yingying Chen, Yanmin Zhu, and Linghe Kong. 2019. An Indirect Eavesdropping Attack of Keystrokes on Touch Screen through Acoustic Sensing. IEEE Transactions on Mobile Computing PP, 99 (2019), 1--1.","journal-title":"IEEE Transactions on Mobile Computing PP"},{"key":"e_1_2_1_73_1","volume-title":"Strata: Fine-Grained Acoustic-based Device-Free Tracking.","author":"Yun Sangki","year":"2017","unstructured":"Sangki Yun, Yichao Chen, Huihuang Zheng, Lili Qiu, and Wenguang Mao. 2017. Strata: Fine-Grained Acoustic-based Device-Free Tracking. (2017), 15--28."},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/3133956.3133962"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1145\/3381008"},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1145\/3241539.3241575"},{"key":"e_1_2_1_77_1","volume-title":"A review of recent advances in visual speech decoding. Image and vision computing 32, 9","author":"Zhou Ziheng","year":"2014","unstructured":"Ziheng Zhou, Guoying Zhao, Xiaopeng Hong, and Matti Pietik\u00e4inen. 2014. A review of recent advances in visual speech decoding. 
Image and vision computing 32, 9 (2014), 590--605."}],"container-title":["Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3448087","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3448087","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:24:59Z","timestamp":1750195499000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3448087"}},"subtitle":["Enabling Word and Sentence-level Lip Interaction for Smart Devices"],"short-title":[],"issued":{"date-parts":[[2021,3,19]]},"references-count":77,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,3,19]]}},"alternative-id":["10.1145\/3448087"],"URL":"https:\/\/doi.org\/10.1145\/3448087","relation":{},"ISSN":["2474-9567"],"issn-type":[{"value":"2474-9567","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,3,19]]},"assertion":[{"value":"2021-03-30","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}