{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,24]],"date-time":"2025-12-24T12:09:31Z","timestamp":1766578171030,"version":"3.48.0"},"reference-count":47,"publisher":"Oxford University Press (OUP)","issue":"12","license":[{"start":{"date-parts":[[2025,7,29]],"date-time":"2025-07-29T00:00:00Z","timestamp":1753747200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/pages\/standard-publication-reuse-rights"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,12,24]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Modality discrepancies pose persistent challenges in Automated Audio Captioning (AAC) and, more broadly, across multimodal domains. Helping models comprehend text information plays a pivotal role in establishing a seamless connection between the text and audio modalities. While recent research has sought to close the gap between the two modalities through contrastive learning, a simple contrastive loss alone struggles to bridge it. This paper introduces Enhanced Depth of Text Comprehension (EDTC), which strengthens the model\u2019s understanding of text information from three perspectives. First, a combined Local-Global feature Fusion module is introduced to fuse heterogeneous audio features, enabling the extraction of high-level semantic information and the discovery of latent inter-sample relationships. Next, a novel representation module, TRANSLATOR, constructs a twin-branch structure based on the conventional dual-stream model, mapping features from both modalities into a shared high-dimensional audio-text space. 
Finally, contrastive learning is integrated with momentum-based weight updates, allowing the system to effectively capture shared high-level semantic representations across the audio and text modalities.<\/jats:p>","DOI":"10.1093\/comjnl\/bxaf087","type":"journal-article","created":{"date-parts":[[2025,7,24]],"date-time":"2025-07-24T11:44:52Z","timestamp":1753357492000},"page":"1957-1966","source":"Crossref","is-referenced-by-count":1,"title":["EDTC: enhanced depth of text comprehension in automated audio captioning"],"prefix":"10.1093","volume":"68","author":[{"given":"Liwen","family":"Tan","sequence":"first","affiliation":[{"name":"School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, No. 2 Chongwen Road, Nan\u2019an District, Chongqing 400065,","place":["China"]}]},{"given":"Yi","family":"Zhou","sequence":"additional","affiliation":[{"name":"School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, No. 2 Chongwen Road, Nan\u2019an District, Chongqing 400065,","place":["China"]}]},{"given":"Yin","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, No. 
2 Chongwen Road, Nan\u2019an District, Chongqing 400065,","place":["China"]}]},{"given":"Yin","family":"Cao","sequence":"additional","affiliation":[{"name":"Department of Intelligent Science, Xi\u2019an Jiaotong-Liverpool University, 111 Ren\u2019ai Road, Dushu Lake Science and Education Innovation District, Suzhou Industrial Park, Suzhou, Jiangsu 215123,","place":["China"]}]}],"member":"286","published-online":{"date-parts":[[2025,7,29]]},"reference":[{"key":"2025122407061016200_ref1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13636-022-00259-2","article-title":"Automated audio captioning: an overview of recent progress and new challenges","volume":"2022","author":"Mei","year":"2022","journal-title":"EURASIP J Audio Speech Music Process"},{"article-title":"A comprehensive survey of automated audio captioning","volume-title":"IEEE\/ACM Transactions on Audio, Speech, and Language Processing","author":"Xu","key":"2025122407061016200_ref2"},{"key":"2025122407061016200_ref3","first-page":"736","article-title":"Clotho: an audio captioning dataset","volume-title":"ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Drossos","year":"2020"},{"key":"2025122407061016200_ref4","first-page":"119","article-title":"AudioCaps: generating captions for audios in the wild","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Kim","year":"2019"},{"key":"2025122407061016200_ref5","doi-asserted-by":"crossref","first-page":"2880","DOI":"10.1109\/TASLP.2020.3030497","article-title":"PANNs: large-scale pretrained audio neural networks for audio pattern recognition","volume":"28","author":"Kong","year":"2020","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"2025122407061016200_ref6","first-page":"646","article-title":"HTS-AT: a hierarchical 
token-semantic audio transformer for sound classification and detection","volume-title":"ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Chen","year":"2022"},{"key":"2025122407061016200_ref7","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford","journal-title":"OpenAI blog"},{"key":"2025122407061016200_ref8","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.703","article-title":"BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Lewis"},{"key":"2025122407061016200_ref9","doi-asserted-by":"crossref","DOI":"10.21437\/Interspeech.2023-1614","article-title":"Enhance temporal relations in audio captioning with sound event detection.","volume-title":"Proc. Interspeech 2023","author":"Xie"},{"key":"2025122407061016200_ref10","doi-asserted-by":"crossref","first-page":"4983","DOI":"10.1109\/ACCESS.2023.3235733","article-title":"Automated audio captioning with topic modeling","volume":"11","author":"\u00d6zkaya Eren","year":"2023","journal-title":"IEEE Access"},{"article-title":"An encoder\u2013decoder based audio captioning system with transfer and reinforcement learning","volume-title":"Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop","author":"Mei","key":"2025122407061016200_ref11"},{"key":"2025122407061016200_ref12","doi-asserted-by":"crossref","DOI":"10.1109\/TASLP.2024.3419446","article-title":"WavCaps: a chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research","volume-title":"IEEE\/ACM Transactions on Audio, Speech, and Language 
Processing","author":"Mei"},{"key":"2025122407061016200_ref13","first-page":"1","article-title":"Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation","volume-title":"ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Yusong","year":"2023"},{"key":"2025122407061016200_ref14","first-page":"8748","article-title":"Learning transferable visual models from natural language supervision","volume-title":"International Conference on Machine Learning","author":"Radford","year":"2021"},{"key":"2025122407061016200_ref15","first-page":"1","article-title":"Clap learning audio concepts from natural language supervision","volume-title":"ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Elizalde","year":"2023"},{"key":"2025122407061016200_ref16","doi-asserted-by":"crossref","DOI":"10.1109\/ICASSP48485.2024.10447215","article-title":"Improving audio captioning models with fine-grained audio features, text embedding supervision, and llm mix-up augmentation.","volume-title":"ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Wu"},{"key":"2025122407061016200_ref17","first-page":"11976","article-title":"A convNet for the 2020s","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Liu","year":"2022"},{"article-title":"One embedder, any task: instruction-finetuned text embeddings.","volume-title":"Findings of the Association for Computational Linguistics: ACL 2023","author":"Hongjin","key":"2025122407061016200_ref18"},{"key":"2025122407061016200_ref19","doi-asserted-by":"crossref","DOI":"10.21437\/Interspeech.2022-10510","article-title":"Interactive audio-text representation for automated audio captioning with contrastive learning","volume-title":"23rd Annual Conference of the International Speech 
Communication Association","author":"Chen"},{"key":"2025122407061016200_ref20","doi-asserted-by":"crossref","DOI":"10.1109\/ICASSP48485.2024.10448115","article-title":"Training audio captioning models without audio.","volume-title":"ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Deshmukh"},{"key":"2025122407061016200_ref21","first-page":"1","article-title":"Semantic embedding guided attention with explicit visual feature fusion for video captioning","volume":"19","author":"Dong","year":"2023","journal-title":"ACM Trans Multimedia Comput Commun Appl."},{"key":"2025122407061016200_ref22","doi-asserted-by":"crossref","first-page":"10535","DOI":"10.1109\/TPAMI.2023.3261282","article-title":"Visible and infrared image fusion using deep learning","volume":"45","author":"Zhang","year":"2023","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"2025122407061016200_ref23","doi-asserted-by":"crossref","first-page":"3845","DOI":"10.1109\/TIP.2020.2966075","article-title":"Unsupervised deep image fusion with structure tensor representations","volume":"29","author":"Jung","year":"2020","journal-title":"IEEE Trans Image Process."},{"key":"2025122407061016200_ref24","doi-asserted-by":"crossref","first-page":"12484","DOI":"10.1609\/aaai.v34i07.6936","article-title":"FusionDN: a unified densely connected network for image fusion","volume":"34","author":"Han","year":"2020","journal-title":"Proceedings of the AAAI conference on artificial intelligence."},{"key":"2025122407061016200_ref25","first-page":"502","article-title":"U2Fusion: a unified unsupervised image fusion network","volume":"44","author":"Han","year":"2020","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"2025122407061016200_ref26","first-page":"11039","article-title":"HAAV: hierarchical aggregation of augmented views for image captioning","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern 
Recognition.","author":"Kuo"},{"key":"2025122407061016200_ref27","first-page":"4634","article-title":"Attention on attention for image captioning","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"Huang","year":"2019"},{"key":"2025122407061016200_ref28","first-page":"499","article-title":"Recurrent fusion network for image captioning","volume-title":"Proceedings of the European conference on computer vision (ECCV)","author":"Jiang","year":"2018"},{"key":"2025122407061016200_ref29","first-page":"9729","article-title":"Momentum contrast for unsupervised visual representation learning","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"He","year":"2020"},{"key":"2025122407061016200_ref30","first-page":"1597","article-title":"A simple framework for contrastive learning of visual representations","volume-title":"International Conference on Machine Learning","author":"Chen","year":"2020"},{"key":"2025122407061016200_ref31","doi-asserted-by":"crossref","first-page":"776","DOI":"10.1109\/ICASSP.2017.7952261","article-title":"Audio set: an ontology and human-labeled dataset for audio events","volume-title":"2017 IEEE international conference on acoustics, speech and signal processing (ICASSP)","author":"Gemmeke","year":"2017"},{"key":"2025122407061016200_ref32","doi-asserted-by":"crossref","DOI":"10.1109\/TASLP.2023.3293015","article-title":"ACTUAL: audio captioning with caption feature space regularization","volume":"31","author":"Zhang","year":"2023","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"article-title":"Improving the performance of automated audio captioning via integrating the acoustic and semantic information","volume-title":"Workshop on Detection and Classification of Acoustic Scenes and 
Events","author":"Ye","key":"2025122407061016200_ref33"},{"key":"2025122407061016200_ref34","doi-asserted-by":"crossref","DOI":"10.1109\/ICASSP48485.2024.10448085","article-title":"Audio difference learning for audio captioning.","volume-title":"ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Komatsu"},{"key":"2025122407061016200_ref35","doi-asserted-by":"crossref","DOI":"10.21437\/Interspeech.2023-914","article-title":"Visually-aware audio captioning with adaptive audio-visual attention","volume-title":"Proc. Interspeech 2023, 24th Intl. Speech Communication Association Conf","author":"Liu"},{"key":"2025122407061016200_ref36","first-page":"1","article-title":"Prefix tuning for automated audio captioning","volume-title":"ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Kim","year":"2023"},{"article-title":"Pengi: an audio language model for audio tasks","volume-title":"Advances in Neural Information Processing Systems","author":"Deshmukh","key":"2025122407061016200_ref37"},{"article-title":"Exploring train and test-time augmentations for audio-language learning","year":"2023","author":"Kim","key":"2025122407061016200_ref38"},{"key":"2025122407061016200_ref39","first-page":"311","article-title":"BLEU: a method for automatic evaluation of machine translation","volume-title":"Proceedings of the 40th annual meeting of the Association for Computational Linguistics","author":"Papineni","year":"2002"},{"article-title":"ROUGE: a package for automatic evaluation of summaries","volume-title":"Proceedings of the Workshop on Text Summarization Branches Out","author":"Lin","key":"2025122407061016200_ref40"},{"key":"2025122407061016200_ref41","first-page":"65","article-title":"METEOR: an automatic metric for MT evaluation with improved correlation with human judgments","volume-title":"Proceedings of the ACL workshop on intrinsic and extrinsic evaluation 
measures for machine translation and\/or summarization","author":"Banerjee","year":"2005"},{"key":"2025122407061016200_ref42","first-page":"4566","article-title":"CIDEr: consensus-based image description evaluation","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"Vedantam","year":"2015"},{"key":"2025122407061016200_ref43","doi-asserted-by":"crossref","first-page":"382","DOI":"10.1007\/978-3-319-46454-1_24","article-title":"SPICE: semantic propositional image caption evaluation","volume-title":"Computer Vision\u2013ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14","author":"Anderson","year":"2016"},{"key":"2025122407061016200_ref44","first-page":"873","article-title":"Improved image captioning via policy gradient optimization of spider","volume-title":"Proceedings of the IEEE international conference on computer vision","author":"Liu","year":"2017"},{"key":"2025122407061016200_ref45","first-page":"981","article-title":"Can audio captions be evaluated with image caption metrics?","volume-title":"ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Zhou","year":"2022"},{"key":"2025122407061016200_ref46","doi-asserted-by":"crossref","DOI":"10.21437\/Interspeech.2019-2680","article-title":"SpecAugment: a simple data augmentation method for automatic speech recognition.","volume-title":"Proc. 
Interspeech 2019","author":"Park"},{"article-title":"BEATs: audio pre-training with acoustic tokenizers","volume-title":"International Conference on Machine Learning","author":"Chen","key":"2025122407061016200_ref47"}],"container-title":["The Computer Journal"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/comjnl\/article-pdf\/68\/12\/1957\/63877313\/bxaf087.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/comjnl\/article-pdf\/68\/12\/1957\/63877313\/bxaf087.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,24]],"date-time":"2025-12-24T12:06:24Z","timestamp":1766577984000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/comjnl\/article\/68\/12\/1957\/8216887"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,29]]},"references-count":47,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2025,7,29]]},"published-print":{"date-parts":[[2025,12,24]]}},"URL":"https:\/\/doi.org\/10.1093\/comjnl\/bxaf087","relation":{},"ISSN":["0010-4620","1460-2067"],"issn-type":[{"type":"print","value":"0010-4620"},{"type":"electronic","value":"1460-2067"}],"subject":[],"published-other":{"date-parts":[[2025,12]]},"published":{"date-parts":[[2025,7,29]]}}}