{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T15:05:10Z","timestamp":1775228710935,"version":"3.50.1"},"reference-count":71,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,12,19]],"date-time":"2023-12-19T00:00:00Z","timestamp":1702944000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100006374","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["No.62032017, No.62272368"],"award-info":[{"award-number":["No.62032017, No.62272368"]}],"id":[{"id":"10.13039\/501100006374","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."],"published-print":{"date-parts":[[2023,12,19]]},"abstract":"<jats:p>Streaming speech recognition aims to transcribe speech to text in a streaming manner, providing real-time speech interaction for smartphone users. However, it is not trivial to develop a high-performance streaming speech recognition system purely running on mobile platforms, due to the complex real-world acoustic environments and the limited computational resources of smartphones. Most existing solutions lack the generalization to unseen environments and have difficulty to work with streaming speech. In this paper, we design AdaStreamLite, an environment-adaptive streaming speech recognition tool for smartphones. AdaStreamLite interacts with its surroundings to capture the characteristics of the current acoustic environment to improve the robustness against ambient noise in a lightweight manner. We design an environment representation extractor to model acoustic environments with compact feature vectors, and construct a representation lookup table to improve the generalization of AdaStreamLite to unseen environments. We train our system using large speech datasets publicly available covering different languages. We conduct experiments in a large range of real acoustic environments with different smartphones. 
The results show that AdaStreamLite outperforms the state-of-the-art methods in terms of recognition accuracy, computational resource consumption and robustness against unseen environments.<\/jats:p>","DOI":"10.1145\/3631460","type":"journal-article","created":{"date-parts":[[2024,1,12]],"date-time":"2024-01-12T12:52:04Z","timestamp":1705063924000},"page":"1-29","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["AdaStreamLite"],"prefix":"10.1145","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6606-3392","authenticated-orcid":false,"given":"Yuheng","family":"Wei","sequence":"first","affiliation":[{"name":"Xidian University, Xi'an, Shaanxi, China and Engineering Research Center of Blockchain Technology Application and Evaluation, Ministry of Education, Xi'an, Shaanxi, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5396-4554","authenticated-orcid":false,"given":"Jie","family":"Xiong","sequence":"additional","affiliation":[{"name":"Microsoft Research Asia, Shanghai, China and University of Massachusetts Amherst, Amherst, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4423-459X","authenticated-orcid":false,"given":"Hui","family":"Liu","sequence":"additional","affiliation":[{"name":"Xidian University, Xi'an, Shaanxi, China and Engineering Research Center of Blockchain Technology Application and Evaluation, Ministry of Education, Xi'an, Shaanxi, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-5244-0514","authenticated-orcid":false,"given":"Yingtao","family":"Yu","sequence":"additional","affiliation":[{"name":"Xidian University, Xi'an, Shaanxi, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-8014-1295","authenticated-orcid":false,"given":"Jiangtao","family":"Pan","sequence":"additional","affiliation":[{"name":"Xidian University, Xi'an, Shaanxi, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8105-3224","authenticated-orcid":false,"given":"Junzhao","family":"Du","sequence":"additional","affiliation":[{"name":"Xidian University, Xi'an, Shaanxi, China and Engineering Research Center of Blockchain Technology Application and Evaluation, Ministry of Education, Xi'an, Shaanxi, China"}]}],"member":"320","published-online":{"date-parts":[[2024,1,12]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2021.03.004"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASSP.1979.1163209"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSDA.2017.8384449"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2022.3196168"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.628714"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2016.7472621"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP43922.2022.9747888"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","unstructured":"Xie Chen Yu Wu Zhenghao Wang Shujie Liu and Jinyu Li. 2021. Developing Real-Time Streaming Transformer Transducer for Speech Recognition on Large-Scale Dataset. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP). 5904--5908. 
https:\/\/doi.org\/10.1109\/ICASSP39728.2021.9413535","DOI":"10.1109\/ICASSP39728.2021.9413535"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8462105"},{"key":"e_1_2_1_10_1","volume-title":"Garnett (Eds.)","volume":"28","author":"Chorowski Jan K","year":"2015","unstructured":"Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-Based Models for Speech Recognition. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. Curran Associates, Inc. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2015\/file\/1068c6e4c8051cfd4e9ea8072e3189e2-Paper.pdf"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2011.2134090"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3087709"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2020.2976475"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-2650"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3550303"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASSP.1985.1164550"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3197517.3201357"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3131895"},{"key":"e_1_2_1_19_1","volume-title":"Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711","author":"Graves Alex","year":"2012","unstructured":"Alex Graves. 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711 (2012)."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1143844.1143891"},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the 31st International Conference on Machine Learning (Proceedings of Machine Learning Research","volume":"1772","author":"Graves Alex","year":"2014","unstructured":"Alex Graves and Navdeep Jaitly. 2014. Towards End-To-End Speech Recognition with Recurrent Neural Networks. In Proceedings of the 31st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 32), Eric P. Xing and Tony Jebara (Eds.). PMLR, Bejing, China, 1764--1772. https:\/\/proceedings.mlr.press\/v32\/graves14.html"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2013.6638947"},{"key":"e_1_2_1_23_1","volume-title":"Neural turing machines. arXiv preprint arXiv:1410.5401","author":"Graves Alex","year":"2014","unstructured":"Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401 (2014)."},{"key":"e_1_2_1_24_1","volume-title":"Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100","author":"Gulati Anmol","year":"2020","unstructured":"Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. 2020. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100 (2020)."},{"key":"e_1_2_1_25_1","volume-title":"Unified hypersphere embedding for speaker recognition. arXiv preprint arXiv:1807.08312","author":"Hajibabaei Mahdi","year":"2018","unstructured":"Mahdi Hajibabaei and Dengxin Dai. 2018. Unified hypersphere embedding for speaker recognition. 
arXiv preprint arXiv:1807.08312 (2018)."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8682336"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8462624"},{"key":"e_1_2_1_28_1","volume-title":"Retrieved","author":"Intelligence Insider","year":"2023","unstructured":"Insider Intelligence. 2023. Voice Assistants in 2023: Usage, growth, and future of the AI voice assistant market. Retrieved July 08, 2023 from https:\/\/www.insiderintelligence.com\/insights\/voice-assistants\/"},{"key":"e_1_2_1_29_1","volume-title":"International conference on machine learning. pmlr, 448--456","author":"Ioffe Sergey","year":"2015","unstructured":"Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning. pmlr, 448--456."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/PROC.1976.10159"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU46091.2019.9004027"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP43922.2022.9747166"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2020.2975749"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/PROC.1979.11540"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448125"},{"key":"e_1_2_1_36_1","volume-title":"Speech enhancement: theory and practice","author":"Loizou Philipos C","unstructured":"Philipos C Loizou. 2013. Speech enhancement: theory and practice. CRC press."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2013-130"},{"key":"e_1_2_1_38_1","article-title":"The discrete fourier transform, part 4: spectral leakage","volume":"8","author":"Lyon Douglas A","year":"2009","unstructured":"Douglas A Lyon. 2009. The discrete fourier transform, part 4: spectral leakage. Journal of object technology 8, 7 (2009).","journal-title":"Journal of object technology"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2020.2987752"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9413395"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2018-1209"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2018-993"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"e_1_2_1_44_1","volume-title":"SEGAN: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452","author":"Pascual Santiago","year":"2017","unstructured":"Santiago Pascual, Antonio Bonafonte, and Joan Serra. 2017. SEGAN: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452 (2017)."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.18626"},{"key":"e_1_2_1_46_1","volume-title":"Searching for activation functions. arXiv preprint arXiv:1710.05941","author":"Ramachandran Prajit","year":"2017","unstructured":"Prajit Ramachandran, Barret Zoph, and Quoc V Le. 2017. Searching for activation functions. arXiv preprint arXiv:1710.05941 (2017)."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/PROC.1976.10158"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","unstructured":"Leda Sar\u0131 Niko Moritz Takaaki Hori and Jonathan Le Roux. 2020. Unsupervised Speaker Adaptation Using Attention-Based Speaker Memory for End-to-End ASR. 
In ICASSP 2020 - 2020 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP). 7384--7388. https:\/\/doi.org\/10.1109\/ICASSP40776.2020.9054249","DOI":"10.1109\/ICASSP40776.2020.9054249"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","unstructured":"Hendrik Schroter Alberto N. Escalante-B Tobias Rosenkranz and Andreas Maier. 2022. Deepfilternet: A Low Complexity Speech Enhancement Framework for Full-Band Audio Based On Deep Filtering. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP). 7407--7411. https:\/\/doi.org\/10.1109\/ICASSP43922.2022.9747055","DOI":"10.1109\/ICASSP43922.2022.9747055"},{"key":"e_1_2_1_50_1","volume-title":"Retrieved","author":"Schwartz Eric Hal","year":"2021","unstructured":"Eric Hal Schwartz. 2021. EU Publishes Privacy Guidelines for Voice Assistants for Comment. Retrieved April 27, 2023 from https:\/\/voicebot.ai\/2021\/03\/23\/eu-publishes-privacy-guidelines-for-voice-assistants-for-comment\/"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2013.6639100"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461375"},{"key":"e_1_2_1_53_1","volume-title":"Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15, 1","author":"Srivastava Nitish","year":"2014","unstructured":"Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15, 1 (2014), 1929--1958."},{"key":"e_1_2_1_54_1","volume-title":"Proc. Meetings Acoust. 1--6.","author":"Thiemann Joachim","year":"2013","unstructured":"Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent. 2013. DEMAND: a collection of multi-channel recordings of acoustic noise in diverse environments. In Proc. Meetings Acoust. 1--6."},{"key":"e_1_2_1_55_1","article-title":"Visualizing data using t-SNE","volume":"9","author":"der Maaten Laurens Van","year":"2008","unstructured":"Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).","journal-title":"Journal of machine learning research"},{"key":"e_1_2_1_56_1","volume-title":"Oh (Eds.)","volume":"35","author":"Variani Ehsan","year":"2022","unstructured":"Ehsan Variani, Ke Wu, Michael D Riley, David Rybach, Matt Shannon, and Cyril Allauzen. 2022. Global Normalization for Streaming Speech Recognition in a Modular Framework. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 4257--4269. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2022\/file\/1b4839ff1f843b6be059bd0e8437e975-Paper-Conference.pdf"},{"key":"e_1_2_1_57_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8462665"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2014.2352935"},{"key":"e_1_2_1_60_1","volume-title":"International Conference on Machine Learning. 
PMLR, 5180--5189","author":"Wang Yuxuan","year":"2018","unstructured":"Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ-Skerry Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Ye Jia, Fei Ren, and Rif A Saurous. 2018. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International Conference on Machine Learning. PMLR, 5180--5189."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2017.2763455"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-2538"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2014.2364452"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP43922.2022.9746682"},{"key":"e_1_2_1_65_1","volume-title":"More productive end-to-end speech recognition toolkit. arXiv preprint arXiv:2203.15455","author":"Zhang Binbin","year":"2022","unstructured":"Binbin Zhang, Di Wu, Zhendong Peng, Xingchen Song, Zhuoyuan Yao, Hang Lv, Lei Xie, Chao Yang, Fuping Pan, and Jianwei Niu. 2022. Wenet 2.0: More productive end-to-end speech recognition toolkit. arXiv preprint arXiv:2203.15455 (2022)."},{"key":"e_1_2_1_66_1","volume-title":"Unified streaming and non-streaming two-pass end-to-end model for speech recognition. arXiv preprint arXiv:2012.05481","author":"Zhang Binbin","year":"2020","unstructured":"Binbin Zhang, Di Wu, Zhuoyuan Yao, Xiong Wang, Fan Yu, Chao Yang, Liyong Guo, Yaguang Hu, Lei Xie, and Xin Lei. 2020. Unified streaming and non-streaming two-pass end-to-end model for speech recognition. arXiv preprint arXiv:2012.05481 (2020)."},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","unstructured":"Qian Zhang Han Lu Hasim Sak Anshuman Tripathi Erik McDermott Stephen Koo and Shankar Kumar. 2020. Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP). 7829--7833. 
https:\/\/doi.org\/10.1109\/ICASSP40776.2020.9053896","DOI":"10.1109\/ICASSP40776.2020.9053896"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/3478093"},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2022-563"},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178115"},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2019.2918951"}],"container-title":["Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3631460","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3631460","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,27]],"date-time":"2025-08-27T17:00:59Z","timestamp":1756314059000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3631460"}},"subtitle":["Environment-adaptive Streaming Speech Recognition on Mobile Devices"],"short-title":[],"issued":{"date-parts":[[2023,12,19]]},"references-count":71,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,12,19]]}},"alternative-id":["10.1145\/3631460"],"URL":"https:\/\/doi.org\/10.1145\/3631460","relation":{},"ISSN":["2474-9567"],"issn-type":[{"value":"2474-9567","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,12,19]]},"assertion":[{"value":"2024-01-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
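The record above is a single "work" response from the public Crossref REST API (note the "status":"ok" / "message-type":"work" envelope). As a minimal sketch of how such a record can be fetched and post-processed, the Python below retrieves the same metadata by DOI; the User-Agent contact address is a placeholder to replace with your own, per Crossref's politeness guidelines, and the printed fields are exactly the ones present in the record above.

```python
import json
import urllib.request

# Crossref REST API endpoint for a single work record; the DOI is the one
# in the record above (AdaStreamLite, Proc. ACM IMWUT 7(4)).
DOI = "10.1145/3631460"
URL = f"https://api.crossref.org/works/{DOI}"

# Crossref asks polite clients to identify themselves; the mailto address
# below is a placeholder, not a real contact.
req = urllib.request.Request(
    URL, headers={"User-Agent": "example-client/0.1 (mailto:you@example.org)"}
)

with urllib.request.urlopen(req) as resp:
    record = json.load(resp)

# The envelope matches the record above: status "ok", message-type "work".
assert record["status"] == "ok" and record["message-type"] == "work"
msg = record["message"]

# Pull out a few of the fields shown in the record above.
print(msg["title"][0])            # AdaStreamLite
print(msg["subtitle"][0])         # Environment-adaptive Streaming ...
print(msg["container-title"][0])  # Proceedings of the ACM on Interactive, ...
print(msg["reference-count"], "references")
print(", ".join(f'{a["given"]} {a["family"]}' for a in msg["author"]))
```

Run as-is, this should print the title "AdaStreamLite", its subtitle, the journal name, "71 references", and the six authors listed in the record.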