{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,19]],"date-time":"2026-02-19T07:04:48Z","timestamp":1771484688323,"version":"3.50.1"},"reference-count":65,"publisher":"Association for Computing Machinery (ACM)","issue":"10","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61972192, 62172208, 61906085"],"award-info":[{"award-number":["61972192, 62172208, 61906085"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Collaborative Innovation Center of Novel Software Technology and Industrialization, and Fundamental Research Funds for the Central Universities","award":["14380001"],"award-info":[{"award-number":["14380001"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,10,31]]},"abstract":"<jats:p>\n            Zero-Shot Cross-Modal Retrieval (ZS-CMR) aims to perform cross-modal retrieval on data of unseen classes, where a key challenge is how to address the modality-gap and domain-shift problems simultaneously. Existing methods tackle this challenge mainly by embracing a sample-label alignment paradigm, which aligns samples of different modalities but of the same class with the word embedding of their class label. However, these methods only focus on the class-level alignment and overlook the alignment of rich fine-grained semantic information in samples, incurring coarse understanding of sample matching and poor generalization on unseen classes. In this article, we propose a novel Fine-Grained Alignment Network, an end-to-end framework that learns representation with two fine-grained alignment strategies, yielding representation space that can be better generalized to unseen classes. 
Specifically, we extract two kinds of fine-grained representations, region embedding and label distribution, respectively, from aspects of both feature and label. To optimize the region embedding, we propose a Fine-Grained Contrastive Learning (FGCL) strategy to simultaneously conduct class-level alignment and model the intra-class discrepancy. To optimize the label distribution, we propose a Fine-Grained Label Alignment (FGLA) strategy to align diverse fine-grained semantic information among samples, rather than merely label information. Finally, both region embedding and label distribution are utilized together to perform ZS-CMR at a finer granularity. Experimental results on three widely used datasets demonstrate that our method outperforms the state-of-the-art methods by a large margin. Detailed ablation studies have also been carried out, which confirm the advantage of each component we propose. Our code will be available at\n            <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/ShipingGe\/FGAN\">https:\/\/github.com\/ShipingGe\/FGAN<\/jats:ext-link>\n            .\n          <\/jats:p>","DOI":"10.1145\/3722223","type":"journal-article","created":{"date-parts":[[2025,3,10]],"date-time":"2025-03-10T12:21:36Z","timestamp":1741609296000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Fine-Grained Alignment Network for Zero-Shot Cross-Modal Retrieval"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9198-5324","authenticated-orcid":false,"given":"Shiping","family":"Ge","sequence":"first","affiliation":[{"name":"State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5243-4992","authenticated-orcid":false,"given":"Zhiwei","family":"Jiang","sequence":"additional","affiliation":[{"name":"State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9497-6244","authenticated-orcid":false,"given":"Yafeng","family":"Yin","sequence":"additional","affiliation":[{"name":"State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0916-7803","authenticated-orcid":false,"given":"Cong","family":"Wang","sequence":"additional","affiliation":[{"name":"State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8486-2614","authenticated-orcid":false,"given":"Zifeng","family":"Cheng","sequence":"additional","affiliation":[{"name":"State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1112-790X","authenticated-orcid":false,"given":"Qing","family":"Gu","sequence":"additional","affiliation":[{"name":"State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,10,14]]},"reference":[{"key":"e_1_3_1_2_2","first-page":"15535","article-title":"Learning representations by maximizing mutual information across views","volume":"32","author":"Bachman Philip","year":"2019","unstructured":"Philip Bachman, R. Devon Hjelm, and William Buchwalter. 2019. Learning representations by maximizing mutual information across views. 
Advances in Neural Information Processing Systems 32 (2019), 15535\u201315545.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_3_2","first-page":"9508","article-title":"Learning with differentiable perturbed optimizers","volume":"33","author":"Berthet Quentin","year":"2020","unstructured":"Quentin Berthet, Mathieu Blondel, Olivier Teboul, Marco Cuturi, Jean-Philippe Vert, and Francis Bach. 2020. Learning with differentiable perturbed optimizers. Advances in Neural Information Processing Systems 33 (2020), 9508\u20139519.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939812"},{"key":"e_1_3_1_5_2","first-page":"1597","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Chen Ting","year":"2020","unstructured":"Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning. PMLR, 1597\u20131607."},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01549"},{"key":"e_1_3_1_7_2","doi-asserted-by":"crossref","unstructured":"Jingze Chi and Yuxin Peng. 2018. Dual adversarial networks for zero-shot cross-media retrieval. In Proceedings of the International Joint Conference on Artificial Intelligence 663\u2013669.","DOI":"10.24963\/ijcai.2018\/92"},{"issue":"4","key":"e_1_3_1_8_2","first-page":"1173","article-title":"Zero-shot cross-media embedding learning with dual adversarial distribution network","volume":"30","author":"Chi Jingze","year":"2019","unstructured":"Jingze Chi and Yuxin Peng. 2019. Zero-shot cross-media embedding learning with dual adversarial distribution network. 
IEEE Transactions on Circuits and Systems for Video Technology 30, 4 (2019), 1173\u20131187.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/1646396.1646452"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00238"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_1_12_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https:\/\/arxiv.org\/abs\/1810.04805"},{"key":"e_1_3_1_13_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/2185520.2185540"},{"key":"e_1_3_1_15_2","first-page":"17292","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Fang Kaipeng","year":"2024","unstructured":"Kaipeng Fang, Jingkuan Song, Lianli Gao, Pengpeng Zeng, Zhi-Qi Cheng, Xiyao Li, and Heng Tao Shen. 2024. Pros: Prompting-to-simulate generalized knowledge for universal cross-domain retrieval. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 17292\u201317301."},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206772"},{"key":"e_1_3_1_17_2","unstructured":"Tianyu Gao Xingcheng Yao and Danqi Chen. 2021. Simcse: Simple contrastive learning of sentence embeddings. arXiv:2104.08821. 
Retrieved from https:\/\/arxiv.org\/abs\/2104.08821"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3422622"},{"key":"e_1_3_1_19_2","unstructured":"Alexander Hermans Lucas Beyer and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv:1703.07737. Retrieved from https:\/\/arxiv.org\/abs\/1703.07737"},{"key":"e_1_3_1_20_2","unstructured":"R. Devon Hjelm Alex Fedorov Samuel Lavoie-Marchildon Karan Grewal Phil Bachman Adam Trischler and Yoshua Bengio. 2018. Learning deep representations by mutual information estimation and maximization. arXiv:1808.06670. Retrieved from https:\/\/arxiv.org\/abs\/1808.06670"},{"key":"e_1_3_1_21_2","first-page":"e14947","volume-title":"Computer Graphics Forum","author":"Ho Yi-Hsuan","year":"2023","unstructured":"Yi-Hsuan Ho, Der-Lor Way, and Zen-Chung Shih. 2023. Sharing model framework for zero-shot sketch-based image retrieval. In Computer Graphics Forum, Vol. 42. Wiley Online Library, e14947."},{"key":"e_1_3_1_22_2","unstructured":"Prannay Khosla Piotr Teterwak Chen Wang Aaron Sarna Yonglong Tian Phillip Isola Aaron Maschinot Ce Liu and Dilip Krishnan. 2020. Supervised contrastive learning. arXiv:2004.11362. Retrieved from https:\/\/arxiv.org\/abs\/2004.11362"},{"key":"e_1_3_1_23_2","unstructured":"Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv:1312.6114. Retrieved from https:\/\/arxiv.org\/abs\/1312.6114"},{"key":"e_1_3_1_24_2","first-page":"1188","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Le Quoc","year":"2014","unstructured":"Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning. 
PMLR, 1188\u20131196."},{"key":"e_1_3_1_25_2","first-page":"309","article-title":"Predicting what you already know helps: Provable self-supervised learning","volume":"34","author":"Lee Jason D.","year":"2021","unstructured":"Jason D. Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. 2021. Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems 34 (2021), 309\u2013323.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_26_2","first-page":"24564","article-title":"Prototype-based aleatoric uncertainty quantification for cross-modal retrieval","volume":"36","author":"Li Hao","year":"2024","unstructured":"Hao Li, Jingkuan Song, Lianli Gao, Xiaosu Zhu, and Hengtao Shen. 2024. Prototype-based aleatoric uncertainty quantification for cross-modal retrieval. Advances in Neural Information Processing Systems 36 (2024), 24564\u201324585.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_27_2","first-page":"623","volume-title":"Proceedings of the International Conference on Database Systems for Advanced Applications","author":"Li Kun","year":"2022","unstructured":"Kun Li, Meng Lin, Songlin Hu, and Ruixuan Li. 2022. CLZT: A contrastive learning based framework for zero-shot text classification. In Proceedings of the International Conference on Database Systems for Advanced Applications. Springer, 623\u2013630."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6817"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.247"},{"key":"e_1_3_1_30_2","unstructured":"Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. arXiv:1803.02893. Retrieved from https:\/\/arxiv.org\/abs\/1803.02893"},{"key":"e_1_3_1_31_2","unstructured":"Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv:1711.05101. 
Retrieved from https:\/\/arxiv.org\/abs\/1711.05101"},{"key":"e_1_3_1_32_2","unstructured":"Adam Paszke Sam Gross Francisco Massa Adam Lerer James Bradbury Gregory Chanan Trevor Killeen Zeming Lin Natalia Gimelshein Luca Antiga et al. 2019. Pytorch: An imperative style high-performance deep learning library. arXiv:1912.01703. Retrieved from https:\/\/arxiv.org\/abs\/arXiv:1912.01703"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2017.2705068"},{"key":"e_1_3_1_34_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 8748\u20138763."},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/1873951.1873987"},{"key":"e_1_3_1_36_2","doi-asserted-by":"crossref","first-page":"2160","DOI":"10.1109\/CVPR.2012.6247923","volume-title":"Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition","author":"Sharma Abhishek","year":"2012","unstructured":"Abhishek Sharma, Abhishek Kumar, Hal Daume, and David W. Jacobs. 2012. Generalized multiview analysis: A discriminative latent space. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2160\u20132167."},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2023.3275071"},{"key":"e_1_3_1_38_2","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. 
Retrieved from https:\/\/arxiv.org\/abs\/1409.1556"},{"key":"e_1_3_1_39_2","first-page":"1857","article-title":"Improved deep metric learning with multi-class n-pair loss objective","volume":"29","author":"Sohn Kihyuk","year":"2016","unstructured":"Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. Advances in Neural Information Processing Systems 29 (2016), 1857\u20131865.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_40_2","first-page":"52","volume-title":"Proceedings of the International Conference on Artificial Neural Networks","author":"Su Hanwen","year":"2024","unstructured":"Hanwen Su, Ge Song, Kai Huang, Jiyan Wang, and Ming Yang. 2024. Cross-modal attention alignment network with auxiliary text description for zero-shot sketch-based image retrieval. In Proceedings of the International Conference on Artificial Neural Networks. Springer, 52\u201365."},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3477495.3532028"},{"key":"e_1_3_1_42_2","first-page":"6827","article-title":"What makes for good views for contrastive learning","volume":"33","author":"Tian Yonglong","year":"2020","unstructured":"Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. 2020. What makes for good views for contrastive learning? Advances in Neural Information Processing Systems 33, 6827\u20136839.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/1148170.1148176"},{"key":"e_1_3_1_44_2","first-page":"11","article-title":"Visualizing data using t-SNE","volume":"9","author":"Van der Maaten Laurens","year":"2008","unstructured":"Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. 
Journal of Machine Learning Research 9 (2008), 11.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_1_45_2","first-page":"6000","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017), 6000\u20136010.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123326"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICTAI.2015.45"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.261"},{"key":"e_1_3_1_49_2","first-page":"1","volume-title":"Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME)","author":"Wang Kai","year":"2022","unstructured":"Kai Wang, Yifan Wang, Xing Xu, Zuo Cao, and Xunliang Cai. 2022. Instance-level semantic alignment for zero-shot cross-modal retrieval. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1\u20136."},{"key":"e_1_3_1_50_2","unstructured":"Kaiye Wang Qiyue Yin Wei Wang Shu Wu and Liang Wang. 2016. A comprehensive survey on cross-modal retrieval. arXiv:1607.06215. Retrieved from https:\/\/arxiv.org\/abs\/1607.06215"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00100"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.320"},{"issue":"2","key":"e_1_3_1_53_2","first-page":"449","article-title":"Cross-modal retrieval with CNN visual features: A new baseline","volume":"47","author":"Wei Yunchao","year":"2016","unstructured":"Yunchao Wei, Yao Zhao, Canyi Lu, Shikui Wei, Luoqi Liu, Zhenfeng Zhu, and Shuicheng Yan. 2016. 
Cross-modal retrieval with CNN visual features: A new baseline. IEEE Transactions on Cybernetics 47, 2 (2016), 449\u2013460.","journal-title":"IEEE Transactions on Cybernetics"},{"key":"e_1_3_1_54_2","doi-asserted-by":"crossref","unstructured":"Thomas Wolf Lysandre Debut Victor Sanh Julien Chaumond Clement Delangue Anthony Moi Pierric Cistac Tim Rault R\u00e9mi Louf Morgan Funtowicz et al. 2019. Huggingface\u2019s transformers: State-of-the-art natural language processing. arXiv:1910.03771. Retrieved from https:\/\/arxiv.org\/abs\/1910.03771","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401149"},{"issue":"6","key":"e_1_3_1_56_2","first-page":"2400","article-title":"Ternary adversarial networks with self-supervision for zero-shot cross-modal retrieval","volume":"50","author":"Xu Xing","year":"2019","unstructured":"Xing Xu, Huimin Lu, Jingkuan Song, Yang Yang, Heng Tao Shen, and Xuelong Li. 2019. Ternary adversarial networks with self-supervision for zero-shot cross-modal retrieval. IEEE Transactions on Cybernetics 50, 6 (2019), 2400\u20132413.","journal-title":"IEEE Transactions on Cybernetics"},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/3206025.3206033"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/3424341"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298966"},{"key":"e_1_3_1_60_2","first-page":"12589","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"34","author":"Yang Fan","year":"2020","unstructured":"Fan Yang, Zheng Wang, Jing Xiao, and Shin\u2019ichi Satoh. 2020. Mining on heterogeneous manifolds for zero-shot cross-modal image retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 
34, 12589\u201312596."},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.engappai.2024.108197"},{"key":"e_1_3_1_62_2","first-page":"5812","article-title":"Graph contrastive learning with augmentations","volume":"33","author":"You Yuning","year":"2020","unstructured":"Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems 33 (2020), 5812\u20135823.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_63_2","unstructured":"Jiahui Yu Zirui Wang Vijay Vasudevan Legg Yeung Mojtaba Seyedhosseini and Yonghui Wu. 2022. Coca: Contrastive captioners are image-text foundation models. arXiv:2205.01917. Retrieved from https:\/\/arxiv.org\/abs\/2205.01917"},{"key":"e_1_3_1_64_2","first-page":"1","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP \u201923)","author":"Zhang Haoxiang","year":"2023","unstructured":"Haoxiang Zhang, He Jiang, Ziqiang Wang, and Deqiang Cheng. 2023. Ontology-aware network for zero-shot sketch-based image retrieval. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP \u201923). IEEE, 1\u20135."},{"key":"e_1_3_1_65_2","first-page":"11954","volume-title":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"34","author":"Zhang Haonan","year":"2024","unstructured":"Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, and Heng Tao Shen. 2024. UMP: Unified modality-aware prompt tuning for text-video retrieval. IEEE Transactions on Circuits and Systems for Video Technology 34, 11 (Nov. 
2024), 11954\u201311964."},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01064"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3722223","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,14]],"date-time":"2025-10-14T21:25:09Z","timestamp":1760477109000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3722223"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,14]]},"references-count":65,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2025,10,31]]}},"alternative-id":["10.1145\/3722223"],"URL":"https:\/\/doi.org\/10.1145\/3722223","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,14]]},"assertion":[{"value":"2024-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-02-27","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-14","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}