{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:25:44Z","timestamp":1750220744486,"version":"3.41.0"},"reference-count":24,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2020,5,15]],"date-time":"2020-05-15T00:00:00Z","timestamp":1589500800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["GetMobile: Mobile Comp. and Comm."],"published-print":{"date-parts":[[2020,5,15]]},"abstract":"<jats:p>We have reached an important milestone in Automatic Speech Recognition (ASR) technology, with major industrial AI companies, such as Samsung, Google, Apple, and Amazon releasing high-quality ASR models that run completely on-device, e.g., on consumer smartphones. This is the consequence of giant strides in technological advancements: from making commercial grade ASR systems feasible; to large scale cloud deployments; to the present day state-of-the-art models that run on resource constrained devices.<\/jats:p>","DOI":"10.1145\/3400713.3400715","type":"journal-article","created":{"date-parts":[[2020,5,24]],"date-time":"2020-05-24T05:13:02Z","timestamp":1590297182000},"page":"5-9","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Learning to Listen... 
On-Device"],"prefix":"10.1145","volume":"23","author":[{"given":"Ravichander","family":"Vipperla","sequence":"first","affiliation":[{"name":"Samsung AI Center, Cambridge"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Samin","family":"Ishtiaq","sequence":"additional","affiliation":[{"name":"Samsung AI Center, Cambridge"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rui","family":"Li","sequence":"additional","affiliation":[{"name":"Samsung AI Center, Cambridge"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sourav","family":"Bhattacharya","sequence":"additional","affiliation":[{"name":"Samsung AI Center, Cambridge"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ilias","family":"Leontiadis","sequence":"additional","affiliation":[{"name":"Samsung AI Center, Cambridge"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nicholas D.","family":"Lane","sequence":"additional","affiliation":[{"name":"Samsung AI Center, Cambridge; University of Oxford, England"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2020,5,18]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"The HTK Book Version 3.4","author":"Young S.J.","year":"2006","unstructured":"S.J. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book Version 3.4. Cambridge University Press, 2006."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1006\/csla.2001.0184"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1006\/csla.2001.0185"},
{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the 33rd International Conference on Machine Learning","volume":"48","author":"Amodei D.","year":"2016","unstructured":"D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin, in Proceedings of the 33rd International Conference on Machine Learning, Volume 48, ser. ICML'16. JMLR.org, 2016, 173--182."},{"key":"#cr-split#-e_1_2_1_5_1.1","doi-asserted-by":"crossref","unstructured":"D. S. Park W. Chan Y. Zhang C.-C. Chiu B. Zoph E. D. Cubuk and Q. V. Le. September 2019. SpecAugment: A simple data augmentation method for Automatic Speech Recognition Interspeech http:\/\/dx.doi.org\/10.21437\/Interspeech.2019-2680","DOI":"10.21437\/Interspeech.2019-2680"},{"key":"#cr-split#-e_1_2_1_5_1.2","doi-asserted-by":"crossref","unstructured":"D. S. Park W. Chan Y. Zhang C.-C. Chiu B. Zoph E. D. Cubuk and Q. V. Le. September 2019. SpecAugment: A simple data augmentation method for Automatic Speech Recognition Interspeech http:\/\/dx.doi.org\/10.21437\/Interspeech.2019-2680","DOI":"10.21437\/Interspeech.2019-2680"},{"key":"e_1_2_1_6_1","unstructured":"G. Synnaeve Q. Xu J. Kahn E. Grave T. Likhomanenko V. Pratap A. Sriram V. Liptchinsky and R. Collobert. 2019. End-to-end ASR: From supervised to semi-supervised learning with modern architectures arXiv preprint arXiv:1911.08460."},{"key":"e_1_2_1_7_1","doi-asserted-by":"crossref","unstructured":"K. Kim K. Lee D. 
Gowda J. Park S. Kim E. S. Kim Y.-Y. Lee J. Yeo D. Kim S. Jung J. Lee M. Han and C. Kim. 2019. Attention based on-device streaming speech recognition with large speech corpus ASRU.","DOI":"10.1109\/ASRU46091.2019.9004027"},{"key":"e_1_2_1_8_1","unstructured":"V. Pratap and R. Collobert. 2020. Online speech recognition with wav2letter@anywhere. https:\/\/ai.facebook.com\/blog\/online-speech-recognition-with-wav2letteranywhere"},{"key":"e_1_2_1_9_1","doi-asserted-by":"crossref","unstructured":"Y. He T. N. Sainath R. Prabhavalkar I. McGraw R. Alvarez D. Zhao D. Rybach A. Kannan Y. Wu R. Pang Q. Liang D. Bhatia Y. Shangguan B. Li G. Pundak K. C. Sim T. Bagby S. Chang K. Rao and A. Gruenstein. 2019. Streaming End-to-end Speech Recognition for Mobile Devices in ICASSP 6381--6385.","DOI":"10.1109\/ICASSP.2019.8682336"},{"key":"e_1_2_1_10_1","unstructured":"J. Huang Y. Zhang B. Ginsburg and P. Chitale. 2019. Develop smaller speech recognition models with NVIDIA's NeMo framework. https:\/\/devblogs.nvidia.com\/develop-smaller-speech-recognition-models-with-nvidias-nemo-framework\/"},
{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2019.00048"},{"key":"e_1_2_1_13_1","doi-asserted-by":"crossref","unstructured":"Lukasz Dudziak M.S. Abdelfattah R. Vipperla S. Laskaridis and N.D. Lane. 2019. ShrinkML: End-to-End ASR model compression using reinforcement learning Interspeech.","DOI":"10.21437\/Interspeech.2019-2811"},{"key":"e_1_2_1_14_1","volume-title":"Why are eight bits enough for deep neural networks?","author":"Warden P.","year":"2015","unstructured":"P. Warden, Why are eight bits enough for deep neural networks? 2015. https:\/\/petewarden.com\/2015\/05\/23\/why-are-eight-bits-enough-for-deep-neural-networks\/"},{"key":"e_1_2_1_15_1","doi-asserted-by":"crossref","unstructured":"R. Alvarez R. Prabhavalkar and A. Bakhtin. 2016. On the efficient representation and execution of deep acoustic models. arXiv:1607.04683.","DOI":"10.21437\/Interspeech.2016-128"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/72.129422"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00012"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/79173.79181"},{"key":"e_1_2_1_19_1","unstructured":"M. Andreessen \"Why software is eating the world\" Wall Street Journal August 2011."},
{"key":"e_1_2_1_20_1","volume-title":"All Hail the Hardware Gods","author":"Clebsch S.","year":"2017","unstructured":"S. Clebsch, \"We Software People are not Worthy: All Hail the Hardware Gods,\" 2017, keynote talk at ICOOOLPS 2017."},{"volume-title":"Proceedings of IEEE INFOCOM, 1423--1431","author":"Hu C.","key":"e_1_2_1_21_1","unstructured":"C. Hu, W. Bao, D. Wang, and F. Liu. April 2019. Dynamic adaptive DNN surgery for inference acceleration on the edge. Proceedings of IEEE INFOCOM, 1423--1431."},{"volume-title":"International Conference on Parallel and Distributed Systems.","author":"Li H.","key":"e_1_2_1_22_1","unstructured":"H. Li, C. Hu, J. Jiang, Z. Wang, Y. Wen, and W. Zhu. 2019. JALAD: Joint accuracy- and latency-aware deep structure decoupling for edge-cloud execution, in International Conference on Parallel and Distributed Systems."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3325413.3329793"}],
"container-title":["GetMobile: Mobile Computing and Communications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3400713.3400715","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3400713.3400715","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:38:43Z","timestamp":1750199923000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3400713.3400715"}},"subtitle":["Present and future perspectives of on-device ASR"],"short-title":[],"issued":{"date-parts":[[2020,5,15]]},"references-count":24,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2020,5,15]]}},"alternative-id":["10.1145\/3400713.3400715"],"URL":"https:\/\/doi.org\/10.1145\/3400713.3400715","relation":{},"ISSN":["2375-0529","2375-0537"],"issn-type":[{"type":"print","value":"2375-0529"},{"type":"electronic","value":"2375-0537"}],"subject":[],"published":{"date-parts":[[2020,5,15]]},"assertion":[{"value":"2020-05-18","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}