{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:13:09Z","timestamp":1750219989719,"version":"3.41.0"},"reference-count":37,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2022,12,27]],"date-time":"2022-12-27T00:00:00Z","timestamp":1672099200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2023,3,31]]},"abstract":"<jats:p>\n            In this paper, we propose a novel Low-Power Feature-Attention Chinese Keyword Spotting Framework based on a\n            <jats:bold>depthwise separable convolution neural network (DSCNN)<\/jats:bold>\n            with distillation learning to recognize speech signals of Chinese wake-up words. The framework consists of a low-power feature-attention acoustic model and its learning methods. Different from the existing model, the proposed acoustic model based on\n            <jats:bold>connectionist temporal classification (CTC)<\/jats:bold>\n            focuses on the reduction of power consumption by reducing model network parameters and\n            <jats:bold>multiply-accumulate (MAC)<\/jats:bold>\n            operations through our designed feature-attention network and DSCNN. In particular, the feature-attention network is specially designed to extract effective syllable features from a large number of MFCC features. This could refine MFCC features by selectively focusing on important speech signal features and removing invalid speech signal features to reduce the number of speech signal features, which helps to significantly reduce the parameters and MAC operations of the whole acoustic model. 
Moreover, DSCNN, which has fewer parameters and MAC operations than traditional convolutional neural networks, is adopted to extract effective high-dimensional features from the syllable features. Furthermore, we apply a distillation learning algorithm to efficiently train the proposed low-power acoustic model by utilizing the knowledge of the trained large acoustic model. Experimental results verify the effectiveness of our model and show that the proposed acoustic model achieves higher accuracy than other acoustic models while having the lowest power consumption and latency, as measured on an NVIDIA Jetson TX2. It has only 14.524\n            <jats:italic>KB<\/jats:italic>\n            of parameters, consumes only 0.141\n            <jats:italic>J<\/jats:italic>\n            of energy per query, and incurs 17.9\n            <jats:italic>ms<\/jats:italic>\n            of latency on the platform, making it hardware-friendly.\n          <\/jats:p>","DOI":"10.1145\/3558002","type":"journal-article","created":{"date-parts":[[2022,8,17]],"date-time":"2022-08-17T12:05:56Z","timestamp":1660737956000},"page":"1-14","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Low-Power Feature-Attention Chinese Keyword Spotting Framework with Distillation Learning"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7456-6486","authenticated-orcid":false,"given":"Lei","family":"Lei","sequence":"first","affiliation":[{"name":"Institute of Microelectronics of Chinese Academy of Sciences, Beijing, China and University of Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8690-7276","authenticated-orcid":false,"given":"Guoshun","family":"Yuan","sequence":"additional","affiliation":[{"name":"Institute of Microelectronics of Chinese Academy of Sciences, Beijing, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0779-5905","authenticated-orcid":false,"given":"Tianle","family":"Zhang","sequence":"additional","affiliation":[{"name":"Institute of Automation, Chinese Academy of Sciences, China and School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0527-3676","authenticated-orcid":false,"given":"Hongjiang","family":"Yu","sequence":"additional","affiliation":[{"name":"Institute of Microelectronics of Chinese Academy of Sciences, Beijing, China and University of Chinese Academy of Sciences, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2022,12,27]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"2021. Jetson TX2 module. https:\/\/developer.nvidia.com\/embedded\/jetson-tx2."},{"key":"e_1_3_1_3_2","article-title":"Convolutional recurrent neural networks for small-footprint keyword spotting","author":"Arik Sercan O.","year":"2017","unstructured":"Sercan O. Arik, Markus Kliegl, Rewon Child, Joel Hestness, Andrew Gibiansky, Chris Fougner, Ryan Prenger, and Adam Coates. 2017. Convolutional recurrent neural networks for small-footprint keyword spotting. 
arXiv preprint arXiv:1703.05390 (2017).","journal-title":"arXiv preprint arXiv:1703.05390"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3439800"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.21437\/ICSLP.2000-436"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2014.6854370"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178970"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/3453651"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.5555\/1778066.1778092"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2003.821662"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.1993.319343"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/1143844.1143891"},{"key":"e_1_3_1_13_2","article-title":"Distilling the knowledge in a neural network","author":"Hinton Geoffrey","year":"2015","unstructured":"Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).","journal-title":"arXiv preprint arXiv:1503.02531"},{"key":"e_1_3_1_14_2","article-title":"MobileNets: Efficient convolutional neural networks for mobile vision applications","author":"Howard Andrew G.","year":"2017","unstructured":"Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).","journal-title":"arXiv preprint arXiv:1704.04861"},{"key":"e_1_3_1_15_2","article-title":"Bidirectional LSTM-CRF models for sequence tagging","author":"Huang Zhiheng","year":"2015","unstructured":"Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. 
arXiv preprint arXiv:1508.01991 (2015).","journal-title":"arXiv preprint arXiv:1508.01991"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.14257\/ijsip.2016.9.4.34"},{"key":"e_1_3_1_17_2","article-title":"Adam: A method for stochastic optimization","author":"Kingma Diederik P.","year":"2014","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).","journal-title":"arXiv preprint arXiv:1412.6980"},{"key":"e_1_3_1_18_2","article-title":"An end-to-end architecture for keyword spotting and voice activity detection","author":"Lengerich Chris","year":"2016","unstructured":"Chris Lengerich and Awni Hannun. 2016. An end-to-end architecture for keyword spotting and voice activity detection. arXiv preprint arXiv:1611.09405 (2016).","journal-title":"arXiv preprint arXiv:1611.09405"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICTA50426.2020.9332057"},{"key":"e_1_3_1_20_2","first-page":"1","volume-title":"ISMIR","author":"Logan Beth","year":"2000","unstructured":"Beth Logan et\u00a0al. 2000. Mel frequency cepstral coefficients for music modeling. In ISMIR, Vol. 270. Citeseer, 1\u201311."},{"key":"e_1_3_1_21_2","unstructured":"Xiaoyi Qin, Hui Bu, and Ming Li. 2019. HI-MIA: A Far-field Text-Dependent Speaker Verification Database and the Baselines. 
arxiv:cs.SD\/1912.01231."},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2014.6855122"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.1989.266505"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.1990.115555"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2015-352"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC19947.2020.9063000"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461995"},{"key":"e_1_3_1_28_2","first-page":"1878","volume-title":"Interspeech","author":"Tucker George","year":"2016","unstructured":"George Tucker, Minhua Wu, Ming Sun, Sankaran Panchapagesan, Gengshen Fu, and Shiv Vitaladevuni. 2016. Model compression applied to small-footprint keyword spotting. In Interspeech. 1878\u20131882."},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCSLP.2018.8706631"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC42613.2021.9365816"},{"key":"e_1_3_1_31_2","article-title":"Speech commands: A dataset for limited-vocabulary speech recognition","author":"Warden Pete","year":"2018","unstructured":"Pete Warden. 2018. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209 (2018).","journal-title":"arXiv preprint arXiv:1804.03209"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2007-481"},{"key":"e_1_3_1_33_2","article-title":"Depthwise separable convolutional ResNet with squeeze-and-excitation blocks for small-footprint keyword spotting","author":"Xu Menglong","year":"2020","unstructured":"Menglong Xu and Xiao-Lei Zhang. 2020. Depthwise separable convolutional ResNet with squeeze-and-excitation blocks for small-footprint keyword spotting. 
arXiv preprint arXiv:2004.12200 (2020).","journal-title":"arXiv preprint arXiv:2004.12200"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9054618"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/JAS.2017.7510508"},{"key":"e_1_3_1_36_2","article-title":"Comparison of decoding strategies for CTC acoustic models","author":"Zenkel Thomas","year":"2017","unstructured":"Thomas Zenkel, Ramon Sanabria, Florian Metze, Jan Niehues, Matthias Sperber, Sebastian St\u00fcker, and Alex Waibel. 2017. Comparison of decoding strategies for CTC acoustic models. arXiv preprint arXiv:1708.04469 (2017).","journal-title":"arXiv preprint arXiv:1708.04469"},{"key":"e_1_3_1_37_2","article-title":"Hello edge: Keyword spotting on microcontrollers","author":"Zhang Yundong","year":"2017","unstructured":"Yundong Zhang, Naveen Suda, Liangzhen Lai, and Vikas Chandra. 2017. Hello edge: Keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128 (2017).","journal-title":"arXiv preprint arXiv:1711.07128"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3178115"}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information 
Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3558002","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3558002","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:32Z","timestamp":1750182572000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3558002"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,27]]},"references-count":37,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,3,31]]}},"alternative-id":["10.1145\/3558002"],"URL":"https:\/\/doi.org\/10.1145\/3558002","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"type":"print","value":"2375-4699"},{"type":"electronic","value":"2375-4702"}],"subject":[],"published":{"date-parts":[[2022,12,27]]},"assertion":[{"value":"2021-07-31","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-08-14","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-12-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}