{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,9]],"date-time":"2026-04-09T13:37:37Z","timestamp":1775741857614,"version":"3.50.1"},"reference-count":57,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,8,5]],"date-time":"2023-08-05T00:00:00Z","timestamp":1691193600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,8,5]],"date-time":"2023-08-05T00:00:00Z","timestamp":1691193600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Cybersecurity"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Recently, studies show that deep learning-based automatic speech recognition (ASR) systems are vulnerable to adversarial examples (AEs), which add a small amount of noise to the original audio examples. These AE attacks pose new challenges to deep learning security and have raised significant concerns about deploying ASR systems and devices. The existing defense methods are either limited in application or only defend on results, but not on process. In this work, we propose a novel method to infer the adversary intent and discover audio adversarial examples based on the AEs generation process. The insight of this method is based on the observation: many existing audio AE attacks utilize query-based methods, which means the adversary must send continuous and similar queries to target ASR models during the audio AE generation process. Inspired by this observation, We propose a memory mechanism by adopting audio fingerprint technology to analyze the similarity of the current query with a certain length of memory query. Thus, we can identify when a sequence of queries appears to be suspectable to generate audio AEs. 
Through extensive evaluation on four state-of-the-art audio AE attacks, we demonstrate that on average our defense identifies the adversary\u2019s intent with over <jats:inline-formula><jats:alternatives><jats:tex-math>$$90\\%$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">
                  <mml:mrow>
                    <mml:mn>90<\/mml:mn>
                    <mml:mo>%<\/mml:mo>
                  <\/mml:mrow>
                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> accuracy. With careful regard for robustness evaluation, we also analyze our proposed defense and its strength in withstanding two adaptive attacks. Finally, our scheme is available out of the box and is directly compatible with any ensemble of ASR defense models to uncover audio AE attacks effectively without model retraining.<\/jats:p>","DOI":"10.1186\/s42400-023-00177-6","type":"journal-article","created":{"date-parts":[[2023,8,5]],"date-time":"2023-08-05T01:02:09Z","timestamp":1691197329000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Towards the universal defense for query-based audio adversarial attacks on speech recognition system"],"prefix":"10.1186","volume":"6","author":[{"given":"Feng","family":"Guo","sequence":"first","affiliation":[]},{"given":"Zheng","family":"Sun","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0000-5031","authenticated-orcid":false,"given":"Yuxuan","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Lei","family":"Ju","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,8,5]]},"reference":[{"key":"177_CR1","doi-asserted-by":"crossref","unstructured":"Abdullah H, Garcia W, Peeters C, Traynor P, Butler KR, Wilson J (2019) Practical hidden voice attacks against speech and speaker recognition systems. 
arxiv: abs\/1904.05734","DOI":"10.14722\/ndss.2019.23362"},{"key":"177_CR2","doi-asserted-by":"crossref","unstructured":"Abdullah H, Rahman MS, Garcia W, Warren K, Yadav AS, Shrimpton T, Traynor P (2021) Hear \u201cno evil\u201d, see \u201ckenansville\u201d*: efficient and transferable black-box attacks on speech recognition and voice identification systems. In: 2021 IEEE symposium on security and privacy (SP). IEEE, pp 712\u2013729","DOI":"10.1109\/SP40001.2021.00009"},{"key":"177_CR3","unstructured":"Abdullah H, Rahman MS, Peeters C, Gibson C, Garcia W, Bindschaedler V, Shrimpton T, Traynor P (2021) Beyond l$${}_{\\text{p}}$$ clipping: equalization-based psychoacoustic attacks against ASRs. arxiv: abs\/2110.13250"},{"key":"177_CR4","doi-asserted-by":"crossref","unstructured":"Afchar D, Melchiorre AB, Schedl M, Hennequin R, Epure EV, Moussallam M (2022) Explainability in music recommender systems. arxiv: abs\/2201.10528","DOI":"10.1002\/aaai.12056"},{"key":"177_CR5","unstructured":"Akinwande V, Cintas C, Speakman S, Sridharan S (2020) Identifying audio adversarial examples via anomalous pattern detection. arxiv: abs\/2002.05463"},{"key":"177_CR6","doi-asserted-by":"crossref","unstructured":"Byun J, Go H, Kim C (2022) On the effectiveness of small input noise for defending against query-based black-box attacks. In: Proceedings of the IEEE\/CVF winter conference on applications of computer vision, pp 3051\u20133060","DOI":"10.1109\/WACV51458.2022.00387"},{"key":"177_CR7","unstructured":"Carlini N, Athalye A, Papernot N, Brendel W, Rauber J, Tsipras D, Goodfellow IJ, Madry A, Kurakin A (2019) On evaluating adversarial robustness. arxiv: abs\/1902.06705"},{"key":"177_CR8","doi-asserted-by":"crossref","unstructured":"Carlini N, Wagner D (2018) Audio adversarial examples: targeted attacks on speech-to-text. 
IEEE","DOI":"10.1109\/SPW.2018.00009"},{"key":"177_CR9","doi-asserted-by":"crossref","unstructured":"Chang K-H, Huang P-H, Yu H, Jin Y, Wang T-C (2020) Audio adversarial examples generation with recurrent neural networks. In: 2020 25th Asia and South pacific design automation conference (ASP-DAC). IEEE, pp 488\u2013493","DOI":"10.1109\/ASP-DAC47756.2020.9045597"},{"key":"177_CR10","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2022.3220673","author":"G Chen","year":"2022","unstructured":"Chen G, Zhao Z, Song F, Chen S, Fan L, Wang F, Wang J (2022) Towards understanding and mitigating audio adversarial examples for speaker recognition. IEEE Trans Dependable Secur Comput. https:\/\/doi.org\/10.1109\/TDSC.2022.3220673","journal-title":"IEEE Trans Dependable Secur Comput"},{"key":"177_CR11","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2022.3189397","author":"G Chen","year":"2022","unstructured":"Chen G, Zhao Z, Song F, Chen S, Fan L, Liu Y (2022) As2t: arbitrary source-to-target adversarial attack on speaker recognition systems. IEEE Trans Dependable Secur Comput. https:\/\/doi.org\/10.1109\/TDSC.2022.3189397","journal-title":"IEEE Trans Dependable Secur Comput"},{"key":"177_CR12","doi-asserted-by":"crossref","unstructured":"Chen S, Carlini N, Wagner D (2019) Stateful detection of black-box adversarial attacks","DOI":"10.1145\/3385003.3410925"},{"key":"177_CR13","doi-asserted-by":"crossref","unstructured":"Chen G, Chenb S, Fan L, Du X, Zhao Z, Song F, Liu Y (2021) Who is real bob? Adversarial attacks on speaker recognition systems. In: 2021 IEEE Symposium on Security and Privacy (SP). IEEE, pp 694\u2013711","DOI":"10.1109\/SP40001.2021.00004"},{"key":"177_CR14","unstructured":"Cheng S, Dong Y, Pang T, Su H, Zhu J (2019) Improving black-box adversarial attacks with a transfer-based prior. 
In: Wallach HM, Larochelle H, Beygelzimer A, d\u2019Alch\u00e9-Buc F, Fox EB, Garnett R (eds) Advances in neural information processing systems 32: annual conference on neural information processing systems 2019, NeurIPS 2019, December 8\u201314, 2019, Vancouver, BC, Canada, pp 10932\u201310942"},{"key":"177_CR15","unstructured":"Chen Y, Yuan X, Zhang J, Zhao Y, Zhang S, Chen K, Wang X (2020) Devil\u2019s whisper: a general approach for physical adversarial attacks against commercial black-box speech recognition devices. In: USENIX security symposium, pp 2667\u20132684"},{"key":"177_CR16","unstructured":"Cohen J, Rosenfeld E, Kolter Z (2019) Certified adversarial robustness via randomized smoothing. In: International conference on machine learning. PMLR, pp 1310\u20131320"},{"key":"177_CR17","doi-asserted-by":"crossref","unstructured":"Du T, Ji S, Li J, Gu Q, Wang T, Beyah RA (2020) Sirenattack: generating adversarial audio for end-to-end acoustic systems. In: Proceedings of the 15th ACM Asia conference on computer and communications security","DOI":"10.1145\/3320269.3384733"},{"key":"177_CR18","unstructured":"Goodfellow IJ, Shlens J, Szegedy C (2015) Explaining and harnessing adversarial examples. In: Bengio Y, LeCun Y (eds) 3rd International conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7\u20139, 2015, conference track proceedings . arxiv:1412.6572"},{"key":"177_CR19","unstructured":"Goyal S, Raghunathan A, Jain M, Simhadri HV, Jain P (2020) DROCC: deep robust one-class classification. In: International conference on machine learning. PMLR, pp 3711\u20133721"},{"key":"177_CR20","doi-asserted-by":"crossref","unstructured":"Guo Q, Ye J, Hu Y, Zhang G, Li H (2020) MultiPAD: a multivariant partition based method for audio adversarial examples detection. IEEE Access (99):1\u20131","DOI":"10.1109\/ACCESS.2020.2985231"},{"key":"177_CR21","unstructured":"Haitsma J, Kalker T (2002) A highly robust audio fingerprinting system. 
In: ISMIR 2002, 3rd International conference on music information retrieval, Paris, France, October 13\u201317, 2002, Proceedings"},{"key":"177_CR22","doi-asserted-by":"crossref","unstructured":"Han JK, Kim H, Woo SS (2019) Nickel to LEGO: minimal information examples to fool google cloud speech-to-text API. In: Proceedings of the 2019 ACM SIGSAC conference on computer and communications security, pp 2593\u20132595","DOI":"10.1145\/3319535.3363264"},{"key":"177_CR23","unstructured":"Huang Z, Zhang T (2019) Black-box adversarial attack with transferable model-based embedding"},{"key":"177_CR24","unstructured":"Hussain S, Neekhara P, Dubnov S, McAuley J, Koushanfar F (2021) Waveguard: understanding and mitigating audio adversarial examples. arXiv:2103.03344"},{"key":"177_CR25","unstructured":"Ilyas A, Santurkar S, Tsipras D, Engstrom L, Tran B, Madry A (2019) Adversarial examples are not bugs, they are features. In: Wallach HM, Larochelle H, Beygelzimer A, d\u2019Alch\u00e9-Buc F, Fox EB, Garnett R (eds) Advances in Neural information processing systems 32: annual conference on neural information processing systems 2019, NeurIPS 2019, December 8\u201314, 2019, Vancouver, BC, Canada, pp 125\u2013136"},{"key":"177_CR26","unstructured":"Joshi S, Villalba J, \u017belasko P, Moro-Vel\u00e1zquez L, Dehak N (2021) Adversarial attacks and defenses for speaker identification systems. arXiv e-prints, 2101"},{"key":"177_CR27","doi-asserted-by":"crossref","unstructured":"Khare S, Aralikatte R, Mani S (2018) Adversarial black-box attacks on automatic speech recognition systems using multi-objective evolutionary optimization. arXiv:1811.01312","DOI":"10.21437\/Interspeech.2019-2420"},{"key":"177_CR28","unstructured":"Kurakin A, Goodfellow I, Bengio S (2016) Adversarial examples in the physical world"},{"key":"177_CR29","unstructured":"Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A (2017) Towards deep learning models resistant to adversarial attacks. 
arXiv:1706.06083"},{"issue":"1","key":"177_CR30","doi-asserted-by":"publisher","first-page":"56","DOI":"10.1177\/0013164409344534","volume":"70","author":"F Marin-Martinez","year":"2010","unstructured":"Marin-Martinez F, S\u00e1nchez-Meca J (2010) Weighting by inverse variance or by sample size in random-effects meta-analysis. Educ Psychol Meas 70(1):56\u201373","journal-title":"Educ Psychol Meas"},{"key":"177_CR31","doi-asserted-by":"crossref","unstructured":"Moosavi-Dezfooli S-M, Fawzi A, Frossard P (2016) Deepfool: a simple and accurate method to fool deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2574\u20132582","DOI":"10.1109\/CVPR.2016.282"},{"key":"177_CR32","doi-asserted-by":"publisher","first-page":"878","DOI":"10.1016\/j.ins.2021.12.123","volume":"589","author":"LNH Nam","year":"2022","unstructured":"Nam LNH (2022) Towards comprehensive approaches for the rating prediction phase in memory-based collaborative filtering recommender systems. Inf Sci 589:878\u2013910","journal-title":"Inf Sci"},{"key":"177_CR33","doi-asserted-by":"crossref","unstructured":"Pang R, Zhang X, Ji S, Luo X, Wang T (2020) Advmind: Inferring adversary intent of black-box attacks. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp 1899\u20131907","DOI":"10.1145\/3394486.3403241"},{"key":"177_CR34","unstructured":"Qin Y, Carlini N, Cottrell G, Goodfellow I, Raffel C (2019) Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. In: International Conference on machine learning. PMLR, pp 5231\u20135240"},{"key":"177_CR35","doi-asserted-by":"crossref","unstructured":"Rajaratnam K, Kalita J (2018) Noise flooding for detecting audio adversarial examples against automatic speech recognition. 
IEEE","DOI":"10.1109\/ISSPIT.2018.8642623"},{"key":"177_CR36","doi-asserted-by":"crossref","unstructured":"Richards LE, Nguyen A, Capps R, Forsyth S, Matuszek C, Raff E (2021) Adversarial transfer attacks with unknown data and class overlap. In: Proceedings of the 14th ACM workshop on artificial intelligence and security, pp 13\u201324","DOI":"10.1145\/3474369.3486862"},{"key":"177_CR37","doi-asserted-by":"crossref","unstructured":"Samizade S, Tan Z-H, Shen C, Guan X (2020) Adversarial example detection by classification for deep speech recognition. In: ICASSP 2020\u20132020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 3102\u20133106","DOI":"10.1109\/ICASSP40776.2020.9054750"},{"key":"177_CR38","doi-asserted-by":"crossref","unstructured":"Sch\u00f6nherr L, Kohls K, Zeiler S, Holz T, Kolossa D (2018) Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. arXiv:1808.05665 x","DOI":"10.14722\/ndss.2019.23288"},{"key":"177_CR39","unstructured":"Shafiloo R, Kaedi M, Pourmiri A (2021) Considering user dynamic preferences for mitigating negative effects of long tail in recommender systems. arxiv: abs\/2112.02406"},{"key":"177_CR40","doi-asserted-by":"crossref","unstructured":"Song J, Chang C, Sun F, Chen Z, Hu G, Jiang P (2021) Graph attention collaborative similarity embedding for recommender system. In: Database systems for advanced applications: 26th international conference, DASFAA 2021, Taipei, Taiwan, April 11\u201314, 2021, proceedings, Part III 26. Springer, pp 165\u2013178","DOI":"10.1007\/978-3-030-73200-4_11"},{"issue":"11","key":"177_CR41","doi-asserted-by":"publisher","first-page":"1826","DOI":"10.1109\/TASLP.2019.2933146","volume":"27","author":"S Su","year":"2019","unstructured":"Su S, Guo P, Xie L, Hwang MY (2019) Adversarial regularization for attention based end-to-end robust speech recognition. 
Audio Speech Lang Process IEEE\/ACM Trans 27(11):1826\u20131838","journal-title":"Audio Speech Lang Process IEEE\/ACM Trans"},{"key":"177_CR42","doi-asserted-by":"crossref","unstructured":"Sun S, Yeh C-F, Ostendorf M, Hwang M-Y, Xie L (2018) Training augmentation with adversarial examples for robust speech recognition. arXiv:1806.02782","DOI":"10.21437\/Interspeech.2018-1247"},{"key":"177_CR43","doi-asserted-by":"crossref","unstructured":"Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818\u20132826","DOI":"10.1109\/CVPR.2016.308"},{"key":"177_CR44","doi-asserted-by":"crossref","unstructured":"Tamura K, Omagari A, Hashida S (2019) Novel defense method against audio adversarial example for speech-to-text transcription neural networks. In: 2019 IEEE 11th international workshop on computational intelligence and applications (IWCIA)","DOI":"10.1109\/IWCIA47330.2019.8955062"},{"key":"177_CR45","unstructured":"Taori R, Dave A, Shankar V, Carlini N, Recht B, Schmidt L (2020) Measuring robustness to natural distribution shifts in image classification. In: Larochelle H, Ranzato M, Hadsell R, Balcan M, Lin H (eds) Advances in neural information processing systems 33: annual conference on neural information processing systems 2020, NeurIPS 2020, December 6\u201312, 2020, Virtual"},{"key":"177_CR46","doi-asserted-by":"crossref","unstructured":"Taori R, Kamsetty A, Chu B, Vemuri N (2019) Targeted adversarial examples for black box audio systems. In: 2019 IEEE security and privacy workshops (SPW). IEEE 6:15\u201320","DOI":"10.1109\/SPW.2019.00016"},{"key":"177_CR47","unstructured":"Tsipras D, Santurkar S, Engstrom L, Turner A, Madry A (2018) Robustness may be at odds with accuracy. 
arXiv:1805.12152"},{"key":"177_CR48","unstructured":"Vaidya T, Zhang Y, Sherr M, Shields C (2015) Cocaine noodles: exploiting the gap between human and machine speech recognition. In: Proceedings of the 9th USENIX conference on offensive technologies. WOOT\u201915, p. 16. USENIX Association, USA"},{"key":"177_CR49","unstructured":"Wang A (2003) An industrial-strength audio search algorithm. In: ISMIR 2003, 4th international conference on music information retrieval, Baltimore, Maryland, USA, October 27\u201330, 2003, Proceedings"},{"key":"177_CR50","unstructured":"Wang A et al. (2003) An industrial strength audio search algorithm. In: Ismir, vol. 2003, pp 7\u201313. Citeseer"},{"key":"177_CR51","doi-asserted-by":"publisher","first-page":"896","DOI":"10.1109\/TIFS.2020.3026543","volume":"16","author":"Q Wang","year":"2021","unstructured":"Wang Q, Zheng B, Li Q, Shen C, Ba Z (2021) Towards query-efficient adversarial attacks against automatic speech recognition systems. IEEE Trans Inf Forens Secur 16:896\u2013908. https:\/\/doi.org\/10.1109\/TIFS.2020.3026543","journal-title":"IEEE Trans Inf Forens Secur"},{"key":"177_CR52","doi-asserted-by":"crossref","unstructured":"Xu W, Evans D, Qi Y (2017) Feature squeezing: detecting adversarial examples in deep neural networks. arxiv: abs\/1704.01155","DOI":"10.14722\/ndss.2018.23198"},{"key":"177_CR53","unstructured":"Yang Z, Li B, Chen P-Y, Song D (2018) Characterizing audio adversarial examples using temporal dependency. arXiv:1809.10875"},{"key":"177_CR54","unstructured":"Yuan X, Chen Y, Zhao Y, Long Y, Liu X, Chen K, Zhang S, Huang H, Wang X, Gunter CA (2018) $$\\{$$CommanderSong$$\\}$$: a systematic approach for practical adversarial voice recognition. In: 27th USENIX security symposium (USENIX Security 18), pp 49\u201364"},{"key":"177_CR55","doi-asserted-by":"crossref","unstructured":"Zhang Y, Jiang Z, Villalba J, Dehak N (2020) Black-box attacks on spoofing countermeasures using transferability of adversarial examples. 
In: Interspeech, pp 4238\u20134242","DOI":"10.21437\/Interspeech.2020-2834"},{"key":"177_CR56","doi-asserted-by":"crossref","unstructured":"Zhang J, Zhang B, Zhang B (2019) Defending adversarial attacks on cloud-aided automatic speech recognition systems. In: Proceedings of the seventh international workshop on security in cloud computing, pp 23\u201331","DOI":"10.1145\/3327962.3331456"},{"key":"177_CR57","doi-asserted-by":"crossref","unstructured":"Zheng B, Jiang P, Wang Q, Li Q, Shen C, Wang C, Ge Y, Teng Q, Zhang S (2021) Black-box adversarial attacks on commercial speech platforms with minimal information. In: Proceedings of the 2021 ACM SIGSAC conference on computer and communications security, pp 86\u2013107","DOI":"10.1145\/3460120.3485383"}],"container-title":["Cybersecurity"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s42400-023-00177-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s42400-023-00177-6\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s42400-023-00177-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,8,5]],"date-time":"2023-08-05T01:06:03Z","timestamp":1691197563000},"score":1,"resource":{"primary":{"URL":"https:\/\/cybersecurity.springeropen.com\/articles\/10.1186\/s42400-023-00177-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8,5]]},"references-count":57,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["177"],"URL":"https:\/\/doi.org\/10.1186\/s42400-023-00177-6","relation":{},"ISSN":["2523-3246"],"issn-type":[{"value":"2523-3246","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,8,5]
]},"assertion":[{"value":"19 April 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 July 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 August 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competinf Interests"}}],"article-number":"40"}}