{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T21:09:03Z","timestamp":1776114543252,"version":"3.50.1"},"reference-count":79,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2023,9,27]],"date-time":"2023-09-27T00:00:00Z","timestamp":1695772800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."],"published-print":{"date-parts":[[2023,9,27]]},"abstract":"<jats:p>Millimeter wave (mmWave) based speech recognition provides more possibility for audio-related applications, such as conference speech transcription and eavesdropping. However, considering the practicality in real scenarios, latency and recognizable vocabulary size are two critical factors that cannot be overlooked. In this paper, we propose Radio2Text, the first mmWave-based system for streaming automatic speech recognition (ASR) with a vocabulary size exceeding 13,000 words. Radio2Text is based on a tailored streaming Transformer that is capable of effectively learning representations of speech-related features, paving the way for streaming ASR with a large vocabulary. To alleviate the deficiency of streaming networks unable to access entire future inputs, we propose the Guidance Initialization that facilitates the transfer of feature knowledge related to the global context from the non-streaming Transformer to the tailored streaming Transformer through weight inheritance. Further, we propose a cross-modal structure based on knowledge distillation (KD), named cross-modal KD, to mitigate the negative effect of low quality mmWave signals on recognition performance. 
In the cross-modal KD, the audio streaming Transformer provides feature and response guidance that inherit fruitful and accurate speech information to supervise the training of the tailored radio streaming Transformer. The experimental results show that our Radio2Text can achieve a character error rate of 5.7% and a word error rate of 9.4% for the recognition of a vocabulary consisting of over 13,000 words.<\/jats:p>","DOI":"10.1145\/3610873","type":"journal-article","created":{"date-parts":[[2023,9,27]],"date-time":"2023-09-27T15:45:03Z","timestamp":1695829503000},"page":"1-28","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":18,"title":["Radio2Text"],"prefix":"10.1145","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2496-3429","authenticated-orcid":false,"given":"Running","family":"Zhao","sequence":"first","affiliation":[{"name":"The University of Hong Kong, Hong Kong SAR, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-3964-5874","authenticated-orcid":false,"given":"Jiangtao","family":"Yu","sequence":"additional","affiliation":[{"name":"Shanghai Qi Zhi Institute, Shanghai, China and IIIS, Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1928-7841","authenticated-orcid":false,"given":"Hang","family":"Zhao","sequence":"additional","affiliation":[{"name":"IIIS, Tsinghua University, Beijing, China and Shanghai Qi Zhi Institute, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3454-8731","authenticated-orcid":false,"given":"Edith C.H.","family":"Ngai","sequence":"additional","affiliation":[{"name":"The University of Hong Kong, Hong Kong SAR, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,9,27]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1126"},{"key":"e_1_2_2_3_1","volume-title":"Garnett (Eds.)","volume":"29","author":"Aytar Yusuf","year":"2016","unstructured":"Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. SoundNet: Learning Sound Representations from Unlabeled Video. In Advances in Neural Information Processing Systems (NeurIPS), D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29."},{"key":"e_1_2_2_4_1","volume-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems (NeurIPS) 33","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems (NeurIPS) 33 (2020), 12449--12460."},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/SP46214.2022.9833568"},{"key":"e_1_2_2_6_1","volume-title":"Interpreting and explaining deep neural networks for classification of audio signals. arXiv preprint arXiv:1807.03418","author":"Becker S\u00f6ren","year":"2018","unstructured":"S\u00f6ren Becker, Marcel Ackermann, Sebastian Lapuschkin, Klaus-Robert M\u00fcller, and Wojciech Samek. 2018. Interpreting and explaining deep neural networks for classification of audio signals. arXiv preprint arXiv:1807.03418 (2018)."},{"key":"e_1_2_2_7_1","volume-title":"Representation learning: A review and new perspectives","author":"Bengio Yoshua","year":"2013","unstructured":"Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. 
IEEE transactions on pattern analysis and machine intelligence 35, 8 (2013), 1798--1828."},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2007.02.006"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/78.489029"},{"key":"e_1_2_2_10_1","volume-title":"Learning efficient object detection models with knowledge distillation. Advances in neural information processing systems (NeurIPS) 30","author":"Chen Guobin","year":"2017","unstructured":"Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. 2017. Learning efficient object detection models with knowledge distillation. Advances in neural information processing systems (NeurIPS) 30 (2017)."},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","unstructured":"Xie Chen Yu Wu Zhenghao Wang Shujie Liu and Jinyu Li. 2021. Developing Real-Time Streaming Transformer Transducer for Speech Recognition on Large-Scale Dataset. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP). 5904--5908. https:\/\/doi.org\/10.1109\/ICASSP39728.2021.9413535","DOI":"10.1109\/ICASSP39728.2021.9413535"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1121\/1.1907229"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-4828"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2601097.2601119"},{"key":"e_1_2_2_15_1","volume-title":"HARVARD speech corpus--audio recording","author":"Demonte Philippa","year":"2019","unstructured":"Philippa Demonte. 2019. HARVARD speech corpus--audio recording 2019. University of Salford Collection (2019)."},{"key":"e_1_2_2_16_1","volume-title":"Computer Vision--ECCV 2020: 16th European Conference. 105--123.","author":"Fan Lijie","unstructured":"Lijie Fan, Tianhong Li, Yuan Yuan, and Dina Katabi. 2020. In-home daily-life captioning using radio signals. In Computer Vision--ECCV 2020: 16th European Conference. 
105--123."},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM53939.2023.10229085"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM53939.2023.10229095"},{"key":"e_1_2_2_19_1","volume-title":"InertiEAR: Automatic and Device-independent IMU-based Eavesdropping on Smartphones. In IEEE INFOCOM 2022-IEEE Conference on Computer Communications. IEEE, 1129--1138","author":"Gao Ming","year":"2022","unstructured":"Ming Gao, Yajie Liu, Yike Chen, Yimin Li, Zhongjie Ba, Xian Xu, and Jinsong Han. 2022. InertiEAR: Automatic and Device-independent IMU-based Eavesdropping on Smartphones. In IEEE INFOCOM 2022-IEEE Conference on Computer Communications. IEEE, 1129--1138."},{"key":"e_1_2_2_20_1","volume-title":"Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711","author":"Graves Alex","year":"2012","unstructured":"Alex Graves. 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711 (2012)."},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1143844.1143891"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-3015"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.309"},{"key":"e_1_2_2_24_1","volume-title":"First-pass large vocabulary continuous speech recognition using bi-directional recurrent dnns. arXiv preprint arXiv:1408.2873","author":"Hannun Awni Y","year":"2014","unstructured":"Awni Y Hannun, Andrew L Maas, Daniel Jurafsky, and Andrew Y Ng. 2014. First-pass large vocabulary continuous speech recognition using bi-directional recurrent dnns. arXiv preprint arXiv:1408.2873 (2014)."},{"key":"e_1_2_2_25_1","volume-title":"NIPS Deep Learning and Representation Learning Workshop. http:\/\/arxiv.org\/abs\/1503","author":"Hinton Geoffrey","year":"2015","unstructured":"Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the Knowledge in a Neural Network. 
In NIPS Deep Learning and Representation Learning Workshop. http:\/\/arxiv.org\/abs\/1503.02531"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2021.3122291"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9054663"},{"key":"e_1_2_2_28_1","volume-title":"2023 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 836--852","author":"Hu Pengfei","year":"2022","unstructured":"Pengfei Hu, Wenhao Li, Riccardo Spolaor, and Xiuzhen Cheng. 2022. mmEcho: A mmWave-based Acoustic Eavesdropping Method. In 2023 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 836--852."},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM48880.2022.9796940"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/SP46214.2022.9833716"},{"key":"e_1_2_2_31_1","unstructured":"Texas Instruments Incorporated. 2020. AWR1642: Single-chip 76-GHz to 81-GHz automotive radar sensor integrating DSP and MCU. https:\/\/www.ti.com\/product\/AWR1642"},{"key":"e_1_2_2_32_1","unstructured":"Texas Instruments Incorporated. 2020. DCA1000EVM: Real-time data-capture adapter for radar sensing evaluation module. https:\/\/www.ti.com\/tool\/DCA1000EVM"},{"key":"e_1_2_2_33_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Islam Md Amirul","unstructured":"Md Amirul Islam, Sen Jia, and Neil D. B. Bruce. 2020. How much Position Information Do Convolutional Neural Networks Encode?. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_34_1","unstructured":"Keith Ito and Linda Johnson. 2017. The LJ Speech Dataset. 
https:\/\/keithito.com\/LJ-Speech-Dataset\/."},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/MAES.2019.180130"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3372224.3419202"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3384419.3430733"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7953075"},{"key":"e_1_2_2_39_1","volume-title":"Proceedings of the 36th International Conference on Machine Learning (ICML)","volume":"97","author":"Kornblith Simon","year":"2019","unstructured":"Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. Similarity of Neural Network Representations Revisited. In Proceedings of the 36th International Conference on Machine Learning (ICML), Vol. 97. PMLR, 3519--3529."},{"key":"e_1_2_2_40_1","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP). Association for Computational Linguistics","author":"Kudo Taku","year":"2018","unstructured":"Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP). Association for Computational Linguistics, Brussels, Belgium, 66--71."},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2019.00008"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/COMST.2019.2934489"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00271"},{"key":"e_1_2_2_44_1","volume-title":"23rd USENIX Security Symposium (USENIX Security 14)","author":"Michalevsky Yan","year":"2014","unstructured":"Yan Michalevsky, Dan Boneh, and Gabi Nakibly. 2014. Gyrophone: Recognizing speech from gyroscope signals. In 23rd USENIX Security Symposium (USENIX Security 14). 
1053--1067."},{"key":"e_1_2_2_45_1","volume-title":"Streaming Automatic Speech Recognition with the Transformer Model. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 6074--6078","author":"Moritz Niko","year":"2020","unstructured":"Niko Moritz, Takaaki Hori, and Jonathan Le. 2020. Streaming Automatic Speech Recognition with the Transformer Model. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 6074--6078."},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2023.3250846"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"e_1_2_2_48_1","unstructured":"Vineel Pratap Andros Tjandra Bowen Shi Paden Tomasello Arun Babu Sayani Kundu Ali Elkahky Zhaoheng Ni Apoorv Vyas Maryam Fazel-Zarandi et al. 2023. Scaling speech technology to 1 000+ languages. arXiv preprint arXiv:2305.13516 (2023)."},{"key":"e_1_2_2_49_1","volume-title":"FitNets: Hints for Thin Deep Nets. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http:\/\/arxiv.org\/abs\/1412","author":"Romero Adriana","year":"2015","unstructured":"Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2015. FitNets: Hints for Thin Deep Nets. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http:\/\/arxiv.org\/abs\/1412.6550"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/2906388.2906415"},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3384419.3430781"},{"key":"e_1_2_2_52_1","volume-title":"a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108","author":"Sanh Victor","year":"2019","unstructured":"Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. 
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)."},{"key":"e_1_2_2_53_1","unstructured":"Petre Stoica Randolph L Moses et al. 2005. Spectral analysis of signals. Vol. 452. Pearson Prentice Hall Upper Saddle River NJ."},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1441"},{"key":"e_1_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2020.2978507"},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3530811"},{"key":"e_1_2_2_57_1","volume-title":"\u0141ukasz Kaiser, and Illia Polosukhin","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 30."},{"key":"e_1_2_2_58_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3534592","article-title":"Wavesdropper: Through-wall Word Detection of Human Speech via Commercial mmWave Devices","volume":"6","author":"Wang Chao","year":"2022","unstructured":"Chao Wang, Feng Lin, Zhongjie Ba, Fan Zhang, Wenyao Xu, and Kui Ren. 2022. Wavesdropper: Through-wall Word Detection of Human Speech via Commercial mmWave Devices. 
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 2 (2022), 1--26.","journal-title":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"},{"key":"e_1_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM48880.2022.9796806"},{"key":"e_1_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/3495243.3560543"},{"key":"e_1_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-1292"},{"key":"e_1_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2018.2842159"},{"key":"e_1_2_2_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9053712"},{"key":"e_1_2_2_64_1","volume-title":"Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209","author":"Warden Pete","year":"2018","unstructured":"Pete Warden. 2018. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209 (2018)."},{"key":"e_1_2_2_65_1","first-page":"2207","article-title":"ESPnet","volume":"2018","author":"Watanabe Shinji","year":"2018","unstructured":"Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. 2018. ESPnet: End-to-End Speech Processing Toolkit. In Proc. Interspeech 2018. 2207--2211.","journal-title":"End-to-End Speech Processing Toolkit. In Proc. Interspeech"},{"key":"e_1_2_2_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/2789168.2790119"},{"key":"e_1_2_2_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/3307334.3326073"},{"key":"e_1_2_2_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458864.3467679"},{"key":"e_1_2_2_69_1","volume-title":"The Modality Focusing Hypothesis: Towards Understanding Crossmodal Knowledge Distillation. 
In The Eleventh International Conference on Learning Representations (ICLR).","author":"Xue Zihui","year":"2022","unstructured":"Zihui Xue, Zhengqi Gao, Sucheng Ren, and Hang Zhao. 2022. The Modality Focusing Hypothesis: Towards Understanding Crossmodal Knowledge Distillation. In The Eleventh International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_70_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01200"},{"key":"e_1_2_2_71_1","first-page":"21300","article-title":"Towards efficient 3d object detection with knowledge distillation","volume":"35","author":"Yang Jihan","year":"2022","unstructured":"Jihan Yang, Shaoshuai Shi, Runyu Ding, Zhe Wang, and Xiaojuan Qi. 2022. Towards efficient 3d object detection with knowledge distillation. Advances in Neural Information Processing Systems (NeurIPS) 35 (2022), 21300--21313.","journal-title":"Advances in Neural Information Processing Systems (NeurIPS)"},{"key":"e_1_2_2_72_1","volume-title":"5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.","author":"Zagoruyko Sergey","year":"2017","unstructured":"Sergey Zagoruyko and Nikos Komodakis. 2017. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings."},{"key":"e_1_2_2_73_1","doi-asserted-by":"publisher","DOI":"10.1109\/COMST.2023.3298300"},{"key":"e_1_2_2_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/3569482"},{"key":"e_1_2_2_75_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM48880.2022.9796905"},{"key":"e_1_2_2_76_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 
7356--7365","author":"Zhao Mingmin","year":"2018","unstructured":"Mingmin Zhao, Tianhong Li, Mohammad Abu Alsheikh, Yonglong Tian, Hang Zhao, Antonio Torralba, and Dina Katabi. 2018. Throughwall human pose estimation using radio signals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 7356--7365."},{"key":"e_1_2_2_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3230543.3230579"},{"key":"e_1_2_2_78_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSEN.2020.3040865"},{"key":"e_1_2_2_79_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2022-738"}],"container-title":["Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3610873","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3610873","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,28]],"date-time":"2025-07-28T16:27:21Z","timestamp":1753720041000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3610873"}},"subtitle":["Streaming Speech Recognition Using mmWave Radio Signals"],"short-title":[],"issued":{"date-parts":[[2023,9,27]]},"references-count":79,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,9,27]]}},"alternative-id":["10.1145\/3610873"],"URL":"https:\/\/doi.org\/10.1145\/3610873","relation":{},"ISSN":["2474-9567"],"issn-type":[{"value":"2474-9567","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,9,27]]},"assertion":[{"value":"2023-09-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}