{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T18:42:23Z","timestamp":1755801743756,"version":"3.44.0"},"reference-count":72,"publisher":"Association for Computing Machinery (ACM)","issue":"8","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["11726632"],"award-info":[{"award-number":["11726632"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Key Research and Development Program of Hunan Province","award":["2024JK2011"],"award-info":[{"award-number":["2024JK2011"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2025,8,31]]},"abstract":"<jats:p>Audio is an important medium in people\u2019s daily lives, and secret information can be embedded into it for covert communication. However, traditional audio information hiding techniques cannot achieve large hiding capacity and good imperceptibility at the same time, and they rely on complex encryption, which limits their applicability in resource-constrained Internet of Things (IoT) environments. In this article, we propose a new audio information hiding method, named AdvAudio, which achieves high capacity as well as good imperceptibility, without reliance on cryptographic encryption. Specifically, AdvAudio leverages adversarial example techniques to train a well-designed perturbation for the cover audio, so that the secret information can only be extracted by the private automatic speech recognition (ASR) model. To achieve this, we implement two adversarial example algorithms tailored for both online transmission and physical-world transmission scenarios. 
In particular, our embedding algorithm dynamically adjusts the addition of simulated environmental noise depending on whether the audio is intended to propagate in the physical world. The iterative optimization process is guided by targeted adversarial attack objectives, ensuring that the private ASR model decodes the embedded secret information accurately. Taking DeepSpeech as the private model, we implement a prototype of AdvAudio, which achieves a high embedding capacity of 383.8 bps with excellent imperceptibility, yielding a Perceptual Evaluation of Speech Quality (PESQ) score of 2.351. Furthermore, it offers robust security, achieving a 100% defense success rate against both internal and external attacks. In the physical world, AdvAudio remains effective across six different types of noise, retaining 82% accuracy even under sudden loud noises. Additionally, the secret information can be extracted only in the target environment, with a success rate of 26%, compared with 0% in non-target environments. 
In the future, we aim to enhance the steganalysis resistance of AdvAudio and to explore its potential applications in various environments and with alternative ASR models.<\/jats:p>","DOI":"10.1145\/3748309","type":"journal-article","created":{"date-parts":[[2025,7,10]],"date-time":"2025-07-10T15:45:57Z","timestamp":1752162357000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["AdvAudio: A New Information Hiding Method via Fooling Automatic Speech Recognition Model"],"prefix":"10.1145","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2291-9443","authenticated-orcid":false,"given":"Xiangqi","family":"Wang","sequence":"first","affiliation":[{"name":"School of Science, Nanjing University of Posts and Telecommunications","place":["Nanjing, China"]},{"name":"School of Mathematics and Statistics, Hunan First Normal University","place":["Nanjing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5179-4336","authenticated-orcid":false,"given":"Yehao","family":"Kong","sequence":"additional","affiliation":[{"name":"College of Computer Science and Electronic Engineering, Hunan University","place":["Changsha, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4777-122X","authenticated-orcid":false,"given":"Luyuan","family":"Xie","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6820-6361","authenticated-orcid":false,"given":"Shengfang","family":"Zhai","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-8752-3137","authenticated-orcid":false,"given":"Tairui","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Software and Microelectronics, Peking University","place":["Beijing, 
China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-6212-7822","authenticated-orcid":false,"given":"Boyan","family":"Chen","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-9070-1950","authenticated-orcid":false,"given":"Junkai","family":"Liang","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4185-7214","authenticated-orcid":false,"given":"Xin","family":"Zhang","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]}],"member":"320","published-online":{"date-parts":[[2025,8,20]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/3606696"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3453651"},{"key":"e_1_3_1_4_2","first-page":"1","volume-title":"ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Zhai Shengfang","year":"2023","unstructured":"Shengfang Zhai, Qingni Shen, Xiaoyi Chen, Weilong Wang, Cong Li, Yuejian Fang, and Zhonghai Wu. 2023. Ncl: Textual backdoor defense using noise-augmented contrastive learning. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1\u20135."},{"key":"e_1_3_1_5_2","first-page":"1577","volume-title":"ACM International Conference on Multimedia","author":"Zhai Shengfang","year":"2023","unstructured":"Shengfang Zhai, Yinpeng Dong, Qingni Shen, Shi Pu, Yuejian Fang, and Hang Su. 2023. Text-to-image diffusion models can be easily backdoored through multimodal data poisoning. In ACM International Conference on Multimedia. 1577\u20131587."},{"key":"e_1_3_1_6_2","unstructured":"S. Zhai H. Chen Y. Dong J. Li Q. Shen Y. Gao H. Su and Y. Liu. 2025. Membership inference on text-to-image diffusion models via conditional likelihood discrepancy. 
In Proceedings of the 38th International Conference on Neural Information Processing Systems Ser. NIPS\u201924. Red Hook NY USA: Curran Associates Inc. 2025."},{"key":"e_1_3_1_7_2","first-page":"23","volume-title":"International Conference on Medical Image Computing and Computer-Assisted Intervention","author":"Xie Luyuan","year":"2023","unstructured":"Luyuan Xie, Cong Li, Zirui Wang, Xin Zhang, Boyan Chen, Qingni Shen, and Zhonghai Wu. 2023. Shisrcnet: Super-resolution and classification network for low-resolution breast cancer histopathology image. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 23\u201332."},{"issue":"2","key":"e_1_3_1_8_2","article-title":"Robust reversible audio watermarking scheme for telemedicine and privacy protection.","volume":"71","author":"Zhang Xiaorui","year":"2022","unstructured":"Xiaorui Zhang, Xun Sun, Xingming Sun, Wei Sun, and Sunil Kumar Jha. 2022. Robust reversible audio watermarking scheme for telemedicine and privacy protection. Computers, Materials & Continua 71, 2 (2022).","journal-title":"Computers, Materials & Continua"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2017.2782487"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.sigpro.2022.108561"},{"issue":"4","key":"e_1_3_1_11_2","article-title":"An information hiding method based on audio technology.","author":"Xia-ii TIAN","year":"2022","unstructured":"TIAN Xia-ii. 2022. An information hiding method based on audio technology. 
Study on Optical Communications\/Guangtongxin Yanjiu4 (2022).","journal-title":"Study on Optical Communications\/Guangtongxin Yanjiu"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.bea.2022.100048"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2022.116879"},{"key":"e_1_3_1_14_2","first-page":"219","volume-title":"International Conference on Artificial Neural Networks","author":"Heggan Calum","year":"2022","unstructured":"Calum Heggan, Sam Budgett, Timothy Hospedales, and Mehrdad Yaghoobi. 2022. Metaaudio: A few-shot audio classification benchmark. In International Conference on Artificial Neural Networks. Springer, 219\u2013230."},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACES.2014.6807979"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISDFS.2018.8355361"},{"issue":"02","key":"e_1_3_1_17_2","doi-asserted-by":"crossref","first-page":"2258002","DOI":"10.1142\/S0218001422580022","article-title":"Private data hiding system using state-switch DWT coefficients quantization on digital signal","volume":"36","author":"Zhao Ming","year":"2022","unstructured":"Ming Zhao, Meng Li, Xindi Tong, Jie Li, and Shuo-Tsung Chen. 2022. Private data hiding system using state-switch DWT coefficients quantization on digital signal. International Journal of Pattern Recognition and Artificial Intelligence 36, 02 (2022), 2258002.","journal-title":"International Journal of Pattern Recognition and Artificial Intelligence"},{"key":"e_1_3_1_18_2","doi-asserted-by":"crossref","unstructured":"H. Abdullah K. Warren V. Bindschaedler N. Papernot and P. Traynor. 2021. Sok: The faults in our asrs: An overview of attacks against automatic speech recognition and speaker identification systems. In 2021 IEEE Symposium on Security and Privacy (SP). 
IEEE 730\u2013747.","DOI":"10.1109\/SP40001.2021.00014"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/SOCPAR.2013.7054098"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCITECHN.2018.8631918"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2011.2173678"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2014.2387385"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/CECNet.2012.6201965"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-016-3934-9"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/IEMECON.2017.8079588"},{"key":"e_1_3_1_26_2","doi-asserted-by":"crossref","unstructured":"J. Zhu R. Kaplan J. Johnson and L. Fei-Fei. 2018. Hidden: Hiding data with deep networks. In Proceedings of the European Conference on Computer Vision (ECCV\u201918). 657\u2013672.","DOI":"10.1007\/978-3-030-01267-0_40"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/GCCE46687.2019.9015498"},{"key":"e_1_3_1_28_2","unstructured":"D. Ye S. Jiang and J. Huang. 2019. Heard more than heard: An audio steganography method based on GAN."},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9054397"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-19586-0_2"},{"key":"e_1_3_1_31_2","unstructured":"A. Y. Hannun C. Case J. Casper B. Catanzaro G. Diamos E. Elsen R. Prenger S. Satheesh S. Sengupta A. Coates and A. Y. Ng. 2014. Deep speech: Scaling up end-to-end speech recognition. CoRR abs\/1412.5567 (2014)."},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/78.650093"},{"key":"e_1_3_1_33_2","article-title":"Sequence Modeling with CTC","author":"Hannun Awni","year":"2017","unstructured":"Awni Hannun. 2017. Sequence Modeling with CTC. 
Distill (2017).","journal-title":"Distill"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-74296-6_20"},{"key":"e_1_3_1_35_2","article-title":"Comparing deep learning models for low-light natural scene image enhancement and their impact on object detection and classification: Overview, empirical evaluation, and challenges","author":"Sobbahi Rayan Al","year":"2022","unstructured":"Rayan Al Sobbahi and Joe Tekli. 2022. Comparing deep learning models for low-light natural scene image enhancement and their impact on object detection and classification: Overview, empirical evaluation, and challenges. Image Communication (2022).","journal-title":"Image Communication"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2022.3231480"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3610290"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3519386"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2023.3246766"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2019.2933524"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICMLA52953.2021.00135"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3548606.3560660"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cose.2023.103168"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cose.2021.102495"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3386263.3407599"},{"key":"e_1_3_1_46_2","first-page":"1","volume-title":"2020 IEEE 29th Asian Test Symposium (ATS)","author":"Zhang Jiliang","year":"2020","unstructured":"Jiliang Zhang, Shuang Peng, Yupeng Hu, Fei Peng, Wei Hu, Jinmei Lai, Jing Ye, and Xiangqi Wang. 2020. HRAE: Hardware-assisted randomization against adversarial example attacks. In 2020 IEEE 29th Asian Test Symposium (ATS). 
1\u20136."},{"key":"e_1_3_1_47_2","volume-title":"3rd International Conference on Learning Representations (ICLR)","author":"Goodfellow Ian J.","year":"2015","unstructured":"Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2017.49"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3326285.3329073"},{"key":"e_1_3_1_50_2","doi-asserted-by":"crossref","unstructured":"Yupeng Hu Wenxin Kuang Zheng Qin Kenli Li Jiliang Zhang Yansong Gao Wenjia Li and Keqin Li. 2021. Artificial intelligence security: Threats and countermeasures. ACM Comput. Surv. 55 1 (2021) 1\u201336.","DOI":"10.1145\/3487890"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/3453688.3461751"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/SPW.2018.00009"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1145\/3510582"},{"key":"e_1_3_1_54_2","first-page":"49","volume-title":"Proceedings of the 27th USENIX Conference on Security Symposium (SEC\u201918)","author":"Yuan Xuejing","year":"2018","unstructured":"Xuejing Yuan, Yuxuan Chen, Yue Zhao, Yunhui Long, Xiaokang Liu, Kai Chen, Shengzhi Zhang, Heqing Huang, XiaoFeng Wang, and Carl A. Gunter. 2018. Commandersong: A Systematic Approach for Practical Adversarial Voice Recognition. In Proceedings of the 27th USENIX Conference on Security Symposium (SEC\u201918). USENIX Association, Berkeley, CA, USA, 49\u201364."},{"key":"e_1_3_1_55_2","first-page":"5334","volume-title":"Proceedings of the 28th International Joint Conference on Artificial Intelligence","author":"Yakura Hiromu","year":"2019","unstructured":"Hiromu Yakura and Jun Sakuma. 2019. Robust Audio Adversarial Example for a Physical Attack. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. 
5334\u20135341."},{"key":"e_1_3_1_56_2","unstructured":"Xiaomi \u201cXiaomi AI Speaker.\u201d [Online]. Available: https:\/\/www.mi.com\/aispeaker\/"},{"key":"e_1_3_1_57_2","unstructured":"Google \u201cGoogle Home.\u201d [Online]. Available: https:\/\/store.google.com\/product\/google_home"},{"key":"e_1_3_1_58_2","unstructured":"Alibaba \u201cTmall Genie.\u201d [Online]. Available: https:\/\/bot.tmall.com\/"},{"key":"e_1_3_1_59_2","unstructured":"Amazon \u201cAmazon Echo.\u201d [Online]. Available: https:\/\/www.amazon.com\/dp\/B06XCM9LJ4\/"},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/COMST.2016.2548426"},{"key":"e_1_3_1_61_2","first-page":"2440","article-title":"Reverberation robust acoustic modeling using i-vectors with time delay neural networks","author":"Peddinti Vijayaditya","year":"2015","unstructured":"Vijayaditya Peddinti, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Reverberation robust acoustic modeling using i-vectors with time delay neural networks. Proc. Interspeech (2015), 2440\u20132444.","journal-title":"Proc. Interspeech"},{"key":"e_1_3_1_62_2","first-page":"284","volume-title":"Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research)","volume":"80","author":"Athalye Anish","year":"2018","unstructured":"Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. 2018. Synthesizing robust adversarial examples. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research). Jennifer Dy and Andreas Krause (Eds.), Vol. 80. PMLR, Stockholmsm\u00e4ssan, Stockholm Sweden, 284\u2013293. Retrieved from http:\/\/proceedings.mlr.press\/v80\/athalye18b.html"},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/5.771065"},{"key":"e_1_3_1_64_2","doi-asserted-by":"publisher","DOI":"10.1109\/49.668969"},{"key":"e_1_3_1_65_2","unstructured":"Mozilla \u201cCommon Voice.\u201d [Online]. 
Available: https:\/\/voice.mozilla.org\/datasets"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7953152"},{"key":"e_1_3_1_67_2","unstructured":"Google \u201cGoogle Cloud Speech-to-Text.\u201d [Online]. Available: https:\/\/cloud.google.com\/speech-to-text\/"},{"key":"e_1_3_1_68_2","unstructured":"IBM \u201cIBM Watson Speech-to-Text.\u201d [Online]. Available: https:\/\/speech-to-text-demo.ng.bluemix.net\/"},{"key":"e_1_3_1_69_2","unstructured":"IFlytek \u201ciFlytek Speech-to-Text.\u201d [Online]. Available: https:\/\/www.iflyrec.com\/html\/addMachineOrder.html"},{"key":"e_1_3_1_70_2","first-page":"672","volume-title":"Proceedings of The 13th Asian Conference on Machine Learning","author":"Abdullah Hadi","unstructured":"Hadi Abdullah, Muhammad Sajidur Rahman, Christian Peeters, Cassidy Gibson, Washington Garcia, Vincent Bindschaedler, Thomas Shrimpton, and Patrick Traynor. Beyond \\(L_{p}\\) clipping: Equalization based Psychoacoustic attacks against ASRs. In Proceedings of The 13th Asian Conference on Machine Learning. 672\u2013688."},{"key":"e_1_3_1_71_2","first-page":"3799","volume-title":"32nd USENIX Security Symposium (USENIX Security 23)","author":"Yu Zhiyuan","year":"2023","unstructured":"Zhiyuan Yu, Yuanhaur Chang, Ning Zhang, and Chaowei Xiao. 2023. SMACK: Semantically Meaningful Adversarial Audio Attack. In 32nd USENIX Security Symposium (USENIX Security 23). 3799\u20133816."},{"key":"e_1_3_1_72_2","first-page":"247","volume-title":"32nd USENIX Security Symposium","author":"Wu Xinghui","year":"2023","unstructured":"Xinghui Wu, Shiqing Ma, Chao Shen, Chenhao Lin, Qian Wang, Qi Li, and Yuan Rao. 2023. KENKU: Towards Efficient and Stealthy Black-box Adversarial Attacks against ASR Systems. In 32nd USENIX Security Symposium. 
247\u2013264."},{"key":"e_1_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.1109\/SP54263.2024.00111"}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3748309","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,20]],"date-time":"2025-08-20T18:07:56Z","timestamp":1755713276000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3748309"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,20]]},"references-count":72,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2025,8,31]]}},"alternative-id":["10.1145\/3748309"],"URL":"https:\/\/doi.org\/10.1145\/3748309","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"type":"print","value":"2375-4699"},{"type":"electronic","value":"2375-4702"}],"subject":[],"published":{"date-parts":[[2025,8,20]]},"assertion":[{"value":"2023-02-26","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-12-31","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-08-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}