{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:14:36Z","timestamp":1750220076772,"version":"3.41.0"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2022,7,31]],"date-time":"2022-07-31T00:00:00Z","timestamp":1659225600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Emerg. Technol. Comput. Syst."],"published-print":{"date-parts":[[2022,7,31]]},"abstract":"<jats:p>Deep Neural Network (DNN) is gaining popularity thanks to its ability to attain high accuracy and performance in various security-crucial scenarios. However, recent research shows that DNN-based Automatic Speech Recognition (ASR) systems are vulnerable to adversarial attacks. Specifically, these attacks mainly focus on formulating a process of adversarial example generation as iterative, optimization-based attacks. Although these attacks make significant progress, they still take large generation time to produce adversarial examples, which makes them difficult to be launched in real-world scenarios. In this article, we propose a real-time attack framework that utilizes the neural network trained by the gradient approximation method to generate adversarial examples on Keyword Spotting (KWS) systems. The experimental results show that these generated adversarial examples can easily fool a black-box KWS system to output incorrect results with only one inference. In comparison to previous works, our attack can achieve a higher success rate with less than 0.004 s. We also extend our work by presenting a novel ensemble audio adversarial attack and testing the attack on KWS systems equipped with existing defense mechanisms. The efficacy of the proposed attack is well supported by promising experimental results.<\/jats:p>","DOI":"10.1145\/3491220","type":"journal-article","created":{"date-parts":[[2022,3,25]],"date-time":"2022-03-25T13:06:25Z","timestamp":1648213585000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Generation of Black-box Audio Adversarial Examples Based on Gradient Approximation and Autoencoders"],"prefix":"10.1145","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6214-2977","authenticated-orcid":false,"given":"Po-Hao","family":"Huang","sequence":"first","affiliation":[{"name":"Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9134-206X","authenticated-orcid":false,"given":"Honggang","family":"Yu","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, University of Florida, Gainesville, Florida, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2849-7197","authenticated-orcid":false,"given":"Max","family":"Panoff","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, University of Florida, Gainesville, Florida, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3435-0418","authenticated-orcid":false,"given":"Ting-Chi","family":"Wang","sequence":"additional","affiliation":[{"name":"Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,8,2]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Speech commands dataset. Retrieved from https:\/\/research.googleblog.com\/2017\/08\/launching-speech-commands-dataset.html."},{"key":"e_1_3_1_3_2","unstructured":"Moustafa Alzantot Bharathan Balaji and Mani B. Srivastava. 2018. Did you hear that? Adversarial examples against automatic speech recognition. Retrieved from https:\/\/arxiv.org\/abs\/1801.00554."},{"key":"e_1_3_1_4_2","unstructured":"Chakraborty Anirban Alam Manaar Dey Vishal Chattopadhyay Anupam and Mukhopadhyay Debdeep. 2018. Adversarial attacks and defences: A survey. Retrieved from https:\/\/arxiv.org\/abs\/1810.00069."},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2018.07.023"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2017.49"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2017.49"},{"key":"e_1_3_1_8_2","doi-asserted-by":"crossref","unstructured":"Nicholas Carlini and David A. Wagner. 2018. Audio adversarial examples: Targeted attacks on speech-to-text. Retrieved from https:\/\/arxiv.org\/abs\/1801.01944.","DOI":"10.1109\/SPW.2018.00009"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASP-DAC47756.2020.9045597"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3128572.3140448"},{"key":"e_1_3_1_11_2","volume-title":"Proceedings of the Neural Information Processing Systems (NIPS\u201919)","author":"Chen Xiangyi","year":"2019","unstructured":"Xiangyi Chen, Sijia Liu, Kaidi Xu, Xingguo Li, Xue Lin, Mingyi Hong, and David Cox. 2019. ZO-AdaMM: Zeroth-order adaptive momentum method for black-box optimization. In Proceedings of the Neural Information Processing Systems (NIPS\u201919)."},{"key":"e_1_3_1_12_2","unstructured":"Kevin Eykholt Ivan Evtimov Earlence Fernandes Bo Li Amir Rahmati Florian Tramer Atul Prakash Tadayoshi Kohno and Dawn Song. 2018. Physical adversarial examples for object detectors. Retrieved from https:\/\/arxiv.org\/abs\/1807.07769."},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-74695-9_23"},{"key":"e_1_3_1_14_2","unstructured":"M. A. Ganaie Minghui Hu M. Tanveer and P. N. Suganthan. 2021. Ensemble deep learning: A review. Retrieved from https:\/\/arxiv.org\/abs\/2104.02395."},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2019\/649"},{"key":"e_1_3_1_16_2","unstructured":"Ian J. Goodfellow Jean Pouget-Abadie Mehdi Mirza Bing Xu David Warde-Farley Sherjil Ozair Aaron Courville and Yoshua Bengio. 2014. Generative adversarial networks. Retrieved from https:\/\/arxiv.org\/abs\/1406.2661."},{"key":"e_1_3_1_17_2","unstructured":"Ian J. Goodfellow Jonathon Shlens and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. Retrieved from https:\/\/arxiv.org\/abs\/1412.6572."},{"key":"e_1_3_1_18_2","unstructured":"Awni Hannun Carl Case Jared Casper Bryan Catanzaro Greg Diamos Erich Elsen Ryan Prenger Sanjeev Satheesh Shubho Sengupta Adam Coates and Andrew Y. Ng. 2014. Deep speech: Scaling up end-to-end speech recognition. Retrieved from https:\/\/arxiv.org\/abs\/1412.5567."},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3243734.3243757"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01214"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.21236\/ADA613971"},{"key":"e_1_3_1_22_2","volume-title":"Proceedings of the International Conference on Machine Learning (ICML\u201920)","author":"Liu Sijia","year":"2020","unstructured":"Sijia Liu, Songtao Lu, Xiangyi Chen, Yao Feng, Kaidi Xu, Abdullah Al Dujaili, Minyi Hong, and Una-May O\u2019Reilly. 2020. Min-max optimization without gradients: Convergence and applications to black-box evasion and poisoning attacks. In Proceedings of the International Conference on Machine Learning (ICML\u201920)."},{"key":"e_1_3_1_23_2","article-title":"Towards deep learning models resistant to adversarial attacks","author":"Madry Aleksander","year":"2018","unstructured":"Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards deep learning models resistant to adversarial attacks. In Proceedings of the 6th International Conference on Learning Representations (ICLR\u201918).","journal-title":"Proceedings of the 6th International Conference on Learning Representations (ICLR\u201918)"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2015-351"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/EuroSP.2016.36"},{"key":"e_1_3_1_26_2","volume-title":"Proceedings of the International Conference on Machine Learning (ICML\u201919)","author":"Qin Yao","year":"2019","unstructured":"Yao Qin, Nicholas Carlini, Ian Goodfellow, Garrison Cottrell, and Colin Raffel. 2019. Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. In Proceedings of the International Conference on Machine Learning (ICML\u201919)."},{"key":"e_1_3_1_27_2","volume-title":"Proceedings of the Conference on Computational Linguistics and Speech Processing (ROCLING\u201918)","author":"Rajaratnam Krishan","year":"2018","unstructured":"Krishan Rajaratnam, Kunal Shah, and Jugal Kalita. 2018. Isolated and ensemble audio preprocessing methods for detecting adversarial examples against automatic speech recognition. In Proceedings of the Conference on Computational Linguistics and Speech Processing (ROCLING\u201918)."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.1990.115555"},{"key":"e_1_3_1_29_2","doi-asserted-by":"crossref","unstructured":"Vinod Subramanian Emmanouil Benetos Ning Xu SKoT McDonald and Mark Sandler. 2019. Adversarial attacks in sound event classification. Retrieved from https:\/\/arxiv.org\/abs\/1907.02477.","DOI":"10.33682\/sp9n-qk06"},{"key":"e_1_3_1_30_2","doi-asserted-by":"crossref","unstructured":"Rohan Taori Amog Kamsetty Brenton Chu and Nikita Vemuri. 2018. Targeted adversarial examples for black box audio systems. Retrieved from https:\/\/arxiv.org\/abs\/1805.07820.","DOI":"10.1109\/SPW.2019.00016"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/TAU.1967.1161911"},{"key":"e_1_3_1_32_2","doi-asserted-by":"crossref","first-page":"256","DOI":"10.1007\/978-3-642-14706-7_20","volume-title":"Computer Network Security","author":"Teufl Peter","year":"2010","unstructured":"Peter Teufl, Udo Payer, and Guenter Lackner. 2010. From NLP (natural language processing) to MLP (machine language processing). In Computer Network Security, Igor Kotenko and Victor Skormin (Eds.). Springer, Berlin, 256\u2013269."},{"key":"e_1_3_1_33_2","unstructured":"Florian Tramer Nicholas Carlini Wieland Brendel and Aleksander Madry. 2020. On adaptive attacks to adversarial example defenses. Retrieved from https:\/\/arxiv.org\/abs\/2002.08347."},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.3301742"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2016-1393"},{"key":"e_1_3_1_36_2","unstructured":"Jon Vadillo and Roberto Santana. 2019. Universal adversarial examples in speech command classification. Retrieved from https:\/\/arxiv.org\/abs\/1911.10182."},{"key":"e_1_3_1_37_2","first-page":"1870","article-title":"Automatic recognition of keywords in unconstrained speech using hidden Markov models","volume":"38","author":"Wilpon Jay","year":"1990","unstructured":"Jay Wilpon, Lawrence Rabiner, Chin-Hui Lee, and E. R. Goldman. 1990. Automatic recognition of keywords in unconstrained speech using hidden Markov models. IEEE Trans. Audio Electroacoust. 38 (1990), 1870\u20131878.","journal-title":"IEEE Trans. Audio Electroacoust."},{"key":"e_1_3_1_38_2","doi-asserted-by":"crossref","unstructured":"Hiromu Yakura and Jun Sakuma. 2018. Robust audio adversarial example for a physical attack. Retrieved from https:\/\/arxiv.org\/abs\/1810.11793.","DOI":"10.24963\/ijcai.2019\/741"},{"key":"e_1_3_1_39_2","unstructured":"Jiancheng Yang Qiang Zhang Rongyao Fang Bingbing Ni Jinxian Liu and Qi Tian. 2019. Adversarial attack and defense on point sets. Retrieved from https:\/\/arxiv.org\/abs\/1902.10899."},{"key":"e_1_3_1_40_2","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201918)","author":"Yang Zhuolin","year":"2018","unstructured":"Zhuolin Yang, Bo Li, Pin-Yu Chen, and Dawn Song. 2018. Toward mitigating audio adversarial perturbations. In Proceedings of the International Conference on Learning Representations (ICLR\u201918)."},{"key":"e_1_3_1_41_2","unstructured":"Yundong Zhang Naveen Suda Liangzhen Lai and Vikas Chandra. 2017. Hello edge: Keyword spotting on microcontrollers. Retrieved from https:\/\/arxiv.org\/abs\/1711.07128."}],"container-title":["ACM Journal on Emerging Technologies in Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3491220","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3491220","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:09:19Z","timestamp":1750183759000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3491220"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,7,31]]},"references-count":40,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2022,7,31]]}},"alternative-id":["10.1145\/3491220"],"URL":"https:\/\/doi.org\/10.1145\/3491220","relation":{},"ISSN":["1550-4832","1550-4840"],"issn-type":[{"type":"print","value":"1550-4832"},{"type":"electronic","value":"1550-4840"}],"subject":[],"published":{"date-parts":[[2022,7,31]]},"assertion":[{"value":"2020-12-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-08-02","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}