{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,22]],"date-time":"2026-04-22T20:15:52Z","timestamp":1776888952835,"version":"3.51.2"},"reference-count":69,"publisher":"Association for Computing Machinery (ACM)","issue":"3","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Intell. Syst. Technol."],"published-print":{"date-parts":[[2025,6,30]]},"abstract":"<jats:p>\n            Image-Text Matching (ITM) is a fundamental problem in computer vision. The key issue lies in jointly learning the visual and textual representation to estimate their similarity accurately. Most existing methods focus on feature enhancement within modality or feature interaction across modalities, which, however, neglects the contextual information of the object representation based on the inter-object relationships that match the corresponding sentences with rich contextual semantics. In this article, we propose a Hybrid-modal Interaction with multiple Relational Enhancements (termed\n            <jats:italic>Hire<\/jats:italic>\n            ) for ITM, which correlates the intra- and inter-modal semantics between objects and words with implicit and explicit relationship modeling. In particular, the explicit intra-modal spatial-semantic graph-based reasoning network is designed to improve the contextual representation of visual objects with salient spatial and semantic relational connectivities, guided by the explicit relationships of the objects\u2019 spatial positions and their scene graph. We use implicit relationship modeling for potential relationship interactions before explicit modeling to improve the fault tolerance of explicit relationship detection. Then the visual and textual semantic representations are refined jointly via inter-modal interactive attention and cross-modal alignment. 
To correlate the context of objects with the textual context, we further refine the visual semantic representation via cross-level object-sentence and word-image-based interactive attention. Extensive experiments validate that the proposed hybrid-modal interaction with implicit and explicit modeling is more beneficial for ITM. The proposed\n            <jats:italic>Hire<\/jats:italic>\n            obtains new state-of-the-art results on MS-COCO and Flickr30K benchmarks.\n          <\/jats:p>","DOI":"10.1145\/3714431","type":"journal-article","created":{"date-parts":[[2025,1,23]],"date-time":"2025-01-23T14:40:31Z","timestamp":1737643231000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["<i>Hire<\/i>\n            : Hybrid-Modal Interaction with Multiple Relational Enhancements for Image-Text Matching"],"prefix":"10.1145","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3925-4951","authenticated-orcid":false,"given":"Xuri","family":"Ge","sequence":"first","affiliation":[{"name":"School of Artificial Intelligence, Shandong University, Jinan, China and University of Glasgow, Glasgow, United Kingdom of Great Britain and Northern Ireland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5441-5998","authenticated-orcid":false,"given":"Fuhai","family":"Chen","sequence":"additional","affiliation":[{"name":"College of Computer and Data Science, Fuzhou University, Fuzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-5735-8674","authenticated-orcid":false,"given":"Songpei","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Computing Science, University of Glasgow, Glasgow, United Kingdom of Great Britain and Northern Ireland"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-5729-9014","authenticated-orcid":false,"given":"Fuxiang","family":"Tao","sequence":"additional","affiliation":[{"name":"Department of Computer Science, The University of Sheffield, 
Sheffield, United Kingdom of Great Britain and Northern Ireland"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-7117-8580","authenticated-orcid":false,"given":"Jie","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Computing Science, University of Glasgow, Glasgow, United Kingdom of Great Britain and Northern Ireland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9228-1759","authenticated-orcid":false,"given":"Joemon M.","family":"Jose","sequence":"additional","affiliation":[{"name":"University of Glasgow, Glasgow, United Kingdom of Great Britain and Northern Ireland"}]}],"member":"320","published-online":{"date-parts":[[2025,6,11]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"crossref","unstructured":"Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR 6077\u20136086.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_3_2","doi-asserted-by":"crossref","unstructured":"Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP 740\u2013750.","DOI":"10.3115\/v1\/D14-1082"},{"key":"e_1_3_2_4_2","unstructured":"Fuhai Chen Rongrong Ji Jiayi Ji Xiaoshuai Sun Baochang Zhang Xuri Ge Yongjian Wu Feiyue Huang and Yan Wang. 2019. Variational structured semantic inference for diverse image captioning. In NeurIPS 1931\u20131941."},{"key":"e_1_3_2_5_2","doi-asserted-by":"crossref","unstructured":"Hui Chen Guiguang Ding Xudong Liu Zijia Lin Ji Liu and Jungong Han. 2020. Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In CVPR 12655\u201312663.","DOI":"10.1109\/CVPR42600.2020.01267"},{"key":"e_1_3_2_6_2","doi-asserted-by":"crossref","unstructured":"Jiacheng Chen Hexiang Hu Hao Wu Yuning Jiang and Changhu Wang. 2021. Learning the best pooling strategy for visual semantic embedding. 
In CVPR 15789\u201315798.","DOI":"10.1109\/CVPR46437.2021.01553"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3487403"},{"key":"e_1_3_2_8_2","unstructured":"Jan K. Chorowski Dzmitry Bahdanau Dmitriy Serdyuk Kyunghyun Cho and Yoshua Bengio. 2015. Attention-based models for speech recognition. In NeurIPS."},{"key":"e_1_3_2_9_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. In ACL."},{"key":"e_1_3_2_10_2","doi-asserted-by":"crossref","first-page":"1218","DOI":"10.1609\/aaai.v35i2.16209","article-title":"Similarity reasoning and filtration for image-text matching","volume":"35","author":"Diao Haiwen","year":"2021","unstructured":"Haiwen Diao, Ying Zhang, Lin Ma, and Huchuan Lu. 2021. Similarity reasoning and filtration for image-text matching. In AAAI, Vol. 35, 1218\u20131226.","journal-title":"AAAI"},{"key":"e_1_3_2_11_2","doi-asserted-by":"crossref","unstructured":"Martin Engilberge Louis Chevallier Patrick P\u00e9rez and Matthieu Cord. 2018. Finding beans in burgers: Deep semantic-visual embedding with localization. In CVPR 3984\u20133993.","DOI":"10.1109\/CVPR.2018.00419"},{"key":"e_1_3_2_12_2","unstructured":"Fartash Faghri David J. Fleet Jamie Ryan Kiros and Sanja Fidler. 2018. Vse++: Improving visual-semantic embeddings with hard negatives. In BMVC."},{"key":"e_1_3_2_13_2","doi-asserted-by":"crossref","unstructured":"Hao Fang Saurabh Gupta Forrest Iandola Rupesh K. Srivastava Li Deng Piotr Doll\u00e1r Jianfeng Gao Xiaodong He Margaret Mitchell John C. Platt et al. 2015. From captions to visual concepts and back. In CVPR 1473\u20131482.","DOI":"10.1109\/CVPR.2015.7298754"},{"key":"e_1_3_2_14_2","unstructured":"Andrea Frome Greg S. Corrado Jon Shlens Samy Bengio Jeff Dean Marc\u2019Aurelio Ranzato and Tomas Mikolov. 2013. Devise: A deep visual-semantic embedding model. 
In NeurIPS 2121\u20132129."},{"key":"e_1_3_2_15_2","doi-asserted-by":"crossref","unstructured":"Junchen Fu Xuri Ge Xin Xin Alexandros Karatzoglou Ioannis Arapakis Jie Wang and Joemon M. Jose. 2024. IISAN: Efficiently adapting multimodal representation for sequential recommendation with decoupled PEFT. In SIGIR 687\u2013697.","DOI":"10.1145\/3626772.3657725"},{"key":"e_1_3_2_16_2","unstructured":"Junchen Fu Xuri Ge Xin Xin Alexandros Karatzoglou Ioannis Arapakis Kaiwen Zheng Yongxin Ni and Joemon M. Jose. 2024. Efficient and effective adaptation of multimodal foundation models in sequential recommendation. arXiv:2411.02992. Retrieved from https:\/\/arxiv.org\/abs\/2411.02992"},{"key":"e_1_3_2_17_2","unstructured":"Xuri Ge Fuhai Chen Joemon M. Jose Zhilong Ji Zhongqin Wu and Xiao Liu. 2021. Structured multi-modal feature embedding and alignment for image-sentence retrieval. In ACM MM 5185\u20135193."},{"key":"e_1_3_2_18_2","first-page":"356","volume-title":"ICME","author":"Ge Xuri","year":"2019","unstructured":"Xuri Ge, Fuhai Chen, Chen Shen, and Rongrong Ji. 2019. Colloquial image captioning. In ICME. IEEE, 356\u2013361."},{"key":"e_1_3_2_19_2","unstructured":"Xuri Ge Fuhai Chen Songpei Xu Fuxiang Tao and Joemon M. Jose. 2023. Cross-modal semantic enhanced interaction for image-sentence retrieval. In WACV."},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2024.103716"},{"key":"e_1_3_2_21_2","doi-asserted-by":"crossref","unstructured":"Yan Huang Qi Wu Chunfeng Song and Liang Wang. 2018. Learning semantic concepts and order for image and sentence matching. In CVPR 6163\u20136171.","DOI":"10.1109\/CVPR.2018.00645"},{"key":"e_1_3_2_22_2","unstructured":"Zhong Ji Haoran Wang Jungong Han and Yanwei Pang. 2019. Saliency-guided attention network for image-sentence matching. 
In ICCV 5754\u20135763."},{"key":"e_1_3_2_23_2","first-page":"4904","volume-title":"ICML","author":"Jia Chao","year":"2021","unstructured":"Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML. PMLR, 4904\u20134916."},{"key":"e_1_3_2_24_2","doi-asserted-by":"crossref","unstructured":"Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR 3128\u20133137.","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_3_2_25_2","unstructured":"Andrej Karpathy Armand Joulin and Li F. Fei-Fei. 2014. Deep fragment embeddings for bidirectional image sentence mapping. In NeurIPS 1889\u20131897."},{"key":"e_1_3_2_26_2","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR."},{"key":"e_1_3_2_27_2","unstructured":"Ryan Kiros Ruslan Salakhutdinov and Richard S. Zemel. 2015. Unifying visual-semantic embeddings with multimodal neural language models. Trans. Assoc. Comput. Linguist (2015)."},{"key":"e_1_3_2_28_2","unstructured":"Kuang-Huei Lee Xi Chen Gang Hua Houdong Hu and Xiaodong He. 2018. Stacked cross attention for image-text matching. In ECCV 201\u2013216."},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6795"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i2.20020"},{"key":"e_1_3_2_31_2","unstructured":"Kunpeng Li Yulun Zhang Kai Li Yuanyuan Li and Yun Fu. 2019. Visual semantic reasoning for image-text matching. In ICCV 4654\u20134662."},{"issue":"1","key":"e_1_3_2_32_2","first-page":"641","article-title":"Image-text embedding learning via visual and textual semantic reasoning","volume":"45","author":"Li Kunpeng","year":"2022","unstructured":"Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. 2022. 
Image-text embedding learning via visual and textual semantic reasoning. IEEE Trans. Pattern Anal. Mach. Intell. 45, 1 (2022), 641\u2013656.","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell"},{"key":"e_1_3_2_33_2","doi-asserted-by":"crossref","unstructured":"Wenrui Li Zhengyu Ma Liang-Jian Deng Penghong Wang Jinqiao Shi and Xiaopeng Fan. 2023. Reservoir computing transformer for image-text retrieval. In ACM MM 5605\u20135613.","DOI":"10.1145\/3581783.3611758"},{"key":"e_1_3_2_34_2","doi-asserted-by":"crossref","unstructured":"Lizi Liao Xiangnan He Bo Zhao Chong-Wah Ngo and Tat-Seng Chua. 2018. Interpretable multimodal retrieval for fashion products. In ACM MM 1571\u20131579.","DOI":"10.1145\/3240508.3240646"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_36_2","doi-asserted-by":"crossref","unstructured":"Chunxiao Liu Zhendong Mao An-An Liu Tianzhu Zhang Bin Wang and Yongdong Zhang. 2019. Focus your attention: A bidirectional focal attention network for image-text matching. In ACM MM 3\u201311.","DOI":"10.1145\/3343031.3350869"},{"key":"e_1_3_2_37_2","unstructured":"Chunxiao Liu Zhendong Mao Tianzhu Zhang Hongtao Xie Bin Wang and Yongdong Zhang. 2020. Graph structured network for image-text matching. In CVPR 10921\u201310930."},{"key":"e_1_3_2_38_2","doi-asserted-by":"crossref","unstructured":"Meng Liu Xiang Wang Liqiang Nie Xiangnan He Baoquan Chen and Tat-Seng Chua. 2018. Attentive moment retrieval in videos. In ACM SIGIR 15\u201324.","DOI":"10.1145\/3209978.3210003"},{"key":"e_1_3_2_39_2","doi-asserted-by":"crossref","unstructured":"Siqu Long Soyeon Caren Han Xiaojun Wan and Josiah Poon. 2022. Gradual: Graph-based dual-modal representation for image-text matching. In WACV 3459\u20133468.","DOI":"10.1109\/WACV51458.2022.00252"},{"key":"e_1_3_2_40_2","doi-asserted-by":"crossref","unstructured":"Zijun Long Xuri Ge Richard McCreadie and Joemon M. Jose. 2024. 
CFIR: Fast and effective long-text to image retrieval for large corpora. In SIGIR 2188\u20132198.","DOI":"10.1145\/3626772.3657741"},{"key":"e_1_3_2_41_2","unstructured":"Junhua Mao Wei Xu Yi Yang Jiang Wang Zhiheng Huang and Alan Yuille. 2014. Deep captioning with multimodal recurrent neural networks (M-RNN). arXiv:1412.6632. Retrieved from https:\/\/arxiv.org\/abs\/1412.6632"},{"key":"e_1_3_2_42_2","unstructured":"Hyeonseob Nam Jung-Woo Ha and Jeonghee Kim. 2017. Dual attention networks for multimodal reasoning and matching. In CVPR 299\u2013307."},{"key":"e_1_3_2_43_2","doi-asserted-by":"crossref","unstructured":"Manh-Duy Nguyen Binh T. Nguyen and Cathal Gurrin. 2021. A deep local and global scene-graph matching for image-text retrieval. arXiv:2106.02400. Retrieved from https:\/\/arxiv.org\/abs\/2106.02400","DOI":"10.3233\/FAIA210049"},{"key":"e_1_3_2_44_2","doi-asserted-by":"crossref","unstructured":"Zhengxin Pan Fangyu Wu and Bailing Zhang. 2023. Fine-grained image-text matching by cross-modal hard aligning network. In CVPR 19275\u201319284.","DOI":"10.1109\/CVPR52729.2023.01847"},{"key":"e_1_3_2_45_2","unstructured":"Leigang Qu Meng Liu Da Cao Liqiang Nie and Qi Tian. 2020. Context-aware multi-view summarization network for image-text matching. In ACM MM 1047\u20131055."},{"key":"e_1_3_2_46_2","doi-asserted-by":"crossref","unstructured":"Leigang Qu Meng Liu Jianlong Wu Zan Gao and Liqiang Nie. 2021. Dynamic modality interaction modeling for image-text retrieval. In ACM SIGIR 1104\u20131113.","DOI":"10.1145\/3404835.3462829"},{"key":"e_1_3_2_47_2","first-page":"8748","volume-title":"ICML","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In ICML. 
PMLR, 8748\u20138763."},{"key":"e_1_3_2_48_2","unstructured":"Shaoqing Ren Kaiming He Ross Girshick and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS 91\u201399."},{"key":"e_1_3_2_49_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS 5998\u20136008."},{"key":"e_1_3_2_50_2","unstructured":"Ivan Vendrov Ryan Kiros Sanja Fidler and Raquel Urtasun. 2016. Order-embeddings of images and language. In ICLR."},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58586-0_2"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2797921"},{"key":"e_1_3_2_53_2","doi-asserted-by":"crossref","unstructured":"Liwei Wang Yin Li and Svetlana Lazebnik. 2016. Learning deep structure-preserving image-text embeddings. In CVPR 5005\u20135013.","DOI":"10.1109\/CVPR.2016.541"},{"key":"e_1_3_2_54_2","doi-asserted-by":"crossref","unstructured":"Sijin Wang Ruiping Wang Ziwei Yao Shiguang Shan and Xilin Chen. 2020. Cross-modal scene graph matching for relationship-aware image-text retrieval. In WACV 1508\u20131517.","DOI":"10.1109\/WACV45572.2020.9093614"},{"key":"e_1_3_2_55_2","doi-asserted-by":"crossref","unstructured":"Zihao Wang Xihui Liu Hongsheng Li Lu Sheng Junjie Yan Xiaogang Wang and Jing Shao. 2019. Camp: Cross-modal adaptive message passing for text-image retrieval. In ICCV 5764\u20135773.","DOI":"10.1109\/ICCV.2019.00586"},{"key":"e_1_3_2_56_2","doi-asserted-by":"crossref","unstructured":"Xi Wei Tianzhu Zhang Yan Li Yongdong Zhang and Feng Wu. 2020. Multi-modality cross attention network for image and sentence matching. 
In CVPR 10941\u201310950.","DOI":"10.1109\/CVPR42600.2020.01095"},{"issue":"7","key":"e_1_3_2_57_2","first-page":"2866","article-title":"Learning dual semantic relations with graph attention for image-text matching","volume":"31","author":"Wen Keyu","year":"2020","unstructured":"Keyu Wen, Xiaodong Gu, and Qingrong Cheng. 2020. Learning dual semantic relations with graph attention for image-text matching. IEEE Trans. Circuits Syst. Video Technol. 31, 7 (2020), 2866\u20132879.","journal-title":"IEEE Trans. Circuits Syst. Video Technol"},{"issue":"6","key":"e_1_3_2_58_2","first-page":"1367","article-title":"Image captioning and visual question answering based on attributes and external knowledge","volume":"40","author":"Wu Qi","year":"2017","unstructured":"Qi Wu, Chunhua Shen, Peng Wang, Anthony Dick, and Anton Van Den Hengel. 2017. Image captioning and visual question answering based on attributes and external knowledge. IEEE Trans. Pattern Anal. Mach. Intell. 40, 6 (2017), 1367\u20131381.","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell"},{"key":"e_1_3_2_59_2","unstructured":"Yiling Wu Shuhui Wang Guoli Song and Qingming Huang. 2019. Learning fragment self-attention embeddings for image-text matching. In ACM MM 2088\u20132096."},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2022.103154"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2020.2967597"},{"key":"e_1_3_2_62_2","doi-asserted-by":"crossref","unstructured":"Xun Yang Jianfeng Dong Yixin Cao Xun Wang Meng Wang and Tat-Seng Chua. 2020. Tree-augmented cross-modal encoding for complex-query video retrieval. In ACM SIGIR 1339\u20131348.","DOI":"10.1145\/3397271.3401151"},{"key":"e_1_3_2_63_2","doi-asserted-by":"crossref","unstructured":"Xun Yang Fuli Feng Wei Ji Meng Wang and Tat-Seng Chua. 2021. Deconfounded video moment retrieval with causal intervention. 
In ACM SIGIR 1\u201310.","DOI":"10.1145\/3404835.3462823"},{"key":"e_1_3_2_64_2","doi-asserted-by":"crossref","unstructured":"Ting Yao Yingwei Pan Yehao Li Zhaofan Qiu and Tao Mei. 2017. Boosting image captioning with attributes. In ICCV 4894\u20134902.","DOI":"10.1109\/ICCV.2017.524"},{"key":"e_1_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00166"},{"key":"e_1_3_2_66_2","doi-asserted-by":"crossref","unstructured":"Rowan Zellers Mark Yatskar Sam Thomson and Yejin Choi. 2018. Neural motifs: Scene graph parsing with global context. In CVPR 5831\u20135840.","DOI":"10.1109\/CVPR.2018.00611"},{"key":"e_1_3_2_67_2","doi-asserted-by":"crossref","unstructured":"Kun Zhang Zhendong Mao Quan Wang and Yongdong Zhang. 2022. Negative-aware attention framework for image-text matching. In CVPR 15661\u201315670.","DOI":"10.1109\/CVPR52688.2022.01521"},{"key":"e_1_3_2_68_2","doi-asserted-by":"crossref","unstructured":"Qi Zhang Zhen Lei Zhaoxiang Zhang and Stan Z. Li. 2020. Context-aware attention network for image-text retrieval. In CVPR 3536\u20133545.","DOI":"10.1109\/CVPR42600.2020.00359"},{"key":"e_1_3_2_69_2","doi-asserted-by":"crossref","unstructured":"Liangli Zhen Peng Hu Xu Wang and Dezhong Peng. 2019. Deep supervised cross-modal retrieval. In CVPR 10394\u201310403.","DOI":"10.1109\/CVPR.2019.01064"},{"key":"e_1_3_2_70_2","doi-asserted-by":"crossref","unstructured":"Yaoxin Zhuo and Baoxin Li. 2024. FELGA: Unsupervised fragment embedding for fine-grained cross-modal association. 
In WACV 5635\u20135645.","DOI":"10.1109\/WACV57701.2024.00554"}],"container-title":["ACM Transactions on Intelligent Systems and Technology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3714431","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,11]],"date-time":"2025-06-11T10:22:59Z","timestamp":1749637379000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3714431"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,11]]},"references-count":69,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,6,30]]}},"alternative-id":["10.1145\/3714431"],"URL":"https:\/\/doi.org\/10.1145\/3714431","relation":{},"ISSN":["2157-6904","2157-6912"],"issn-type":[{"value":"2157-6904","type":"print"},{"value":"2157-6912","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,11]]},"assertion":[{"value":"2024-03-19","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-12-26","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-06-11","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}