{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T15:42:27Z","timestamp":1772120547973,"version":"3.50.1"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2024,3,8]],"date-time":"2024-03-08T00:00:00Z","timestamp":1709856000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62222203, 61976049, and 62072080"],"award-info":[{"award-number":["62222203, 61976049, and 62072080"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,6,30]]},"abstract":"<jats:p>In this article, we study the challenging cross-modal image retrieval task, <jats:italic>Composed Query-Based Image Retrieval (CQBIR)<\/jats:italic>, in which the query is not a single text query but a composed query, i.e., a reference image, and a modification text. Compared with the conventional cross-modal image-text retrieval task, the CQBIR is more challenging as it requires properly preserving and modifying the specific image region according to the multi-level semantic information learned from the multi-modal query. Most recent works focus on extracting preserved and modified information and compositing it into a unified representation. However, we observe that the preserved regions learned by the existing methods contain redundant modified information, inevitably degrading the overall retrieval performance. 
To this end, we propose a novel method termed <jats:italic><jats:bold>C<\/jats:bold>ross-<jats:bold>M<\/jats:bold>odal <jats:bold>A<\/jats:bold>ttention <jats:bold>P<\/jats:bold>reservation (CMAP)<\/jats:italic>. Specifically, we first leverage the cross-level interaction to fully account for multi-granular semantic information, which aims to supplement the high-level semantics for effective image retrieval. Furthermore, different from conventional contrastive learning, our method introduces self-contrastive learning into learning preserved information, to prevent the model from confusing the attention for the preserved part with the modified part. Extensive experiments on three widely used CQBIR datasets, i.e., FashionIQ, Shoes, and Fashion200k, demonstrate that our proposed CMAP method significantly outperforms the current state-of-the-art methods on all the datasets. The anonymous implementation code of our CMAP method is available at https:\/\/github.com\/CFM-MSG\/Code_CMAP.<\/jats:p>","DOI":"10.1145\/3639469","type":"journal-article","created":{"date-parts":[[2024,1,9]],"date-time":"2024-01-09T15:11:36Z","timestamp":1704813096000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":13,"title":["Cross-Modal Attention Preservation with Self-Contrastive Learning for Composed Query-Based Image Retrieval"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6340-012X","authenticated-orcid":false,"given":"Shenshen","family":"Li","sequence":"first","affiliation":[{"name":"School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5685-3123","authenticated-orcid":false,"given":"Xing","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2209-651X","authenticated-orcid":false,"given":"Xun","family":"Jiang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7303-3231","authenticated-orcid":false,"given":"Fumin","family":"Shen","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6531-0769","authenticated-orcid":false,"given":"Zhe","family":"Sun","sequence":"additional","affiliation":[{"name":"Juntendo University, Tokyo, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8364-7226","authenticated-orcid":false,"given":"Andrzej","family":"Cichocki","sequence":"additional","affiliation":[{"name":"Systems Research Institute of Polish Academy of Science, Warszawa, Poland and Tensor Learning Lab, Riken AIP, Tokyo, Japan"}]}],"member":"320","published-online":{"date-parts":[[2024,3,8]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV48630.2021.00118"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.02080"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-15549-9_48"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/3375786"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00307"},{"key":"e_1_3_1_7_2","article-title":"Composed image retrieval with text feedback via multi-grained uncertainty regularization","volume":"2211","author":"Chen Yiyang","year":"2022","unstructured":"Yiyang Chen, Zhedong Zheng, Wei Ji, Leigang Qu, and Tat-Seng Chua. 2022. Composed image retrieval with text feedback via multi-grained uncertainty regularization. 
CoRR abs\/2211.07394.","journal-title":"CoRR"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612349"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3499027"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1179"},{"key":"e_1_3_1_11_2","article-title":"ARTEMIS: Attention-based retrieval with text-explicit matching and implicit similarity","volume":"2203","author":"Delmas Ginger","year":"2022","unstructured":"Ginger Delmas, Rafael Sampaio de Rezende, Gabriela Csurka, and Diane Larlus. 2022. ARTEMIS: Attention-based retrieval with text-explicit matching and implicit similarity. Computing Research Repository abs\/2203.08101.","journal-title":"Computing Research Repository"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00482"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01371"},{"key":"e_1_3_1_14_2","article-title":"CompoDiff: Versatile composed image retrieval with latent diffusion","volume":"2303","author":"Gu Geonmo","year":"2023","unstructured":"Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, and Sangdoo Yun. 2023. CompoDiff: Versatile composed image retrieval with latent diffusion. CoRR abs\/2303.11916.","journal-title":"CoRR"},{"key":"e_1_3_1_15_2","first-page":"676","volume-title":"Advances in Neural Information Processing Systems","author":"Guo Xiaoxiao","year":"2018","unstructured":"Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, and Rogerio Feris. 2018. Dialog-based interactive image retrieval. In Advances in Neural Information Processing Systems. 
676\u2013686."},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.163"},{"key":"e_1_3_1_17_2","article-title":"FashionViL: Fashion-focused vision-and-language representation learning","volume":"2207","author":"Han Xiao","year":"2022","unstructured":"Xiao Han, Licheng Yu, Xiatian Zhu, Li Zhang, Yi-Zhe Song, and Tao Xiang. 2022. FashionViL: Fashion-focused vision-and-language representation learning. Computing Research Repository abs\/2207.08150.","journal-title":"Computing Research Repository"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00365"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV51458.2022.00067"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i2.16271"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01225-0_13"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00086"},{"key":"e_1_3_1_25_2","article-title":"Data roaming and early fusion for composed image retrieval","volume":"2303","author":"Levy Matan","year":"2023","unstructured":"Matan Levy, Rami Ben-Ari, Nir Darshan, and Dani Lischinski. 2023. Data roaming and early fusion for composed image retrieval. CoRR abs\/2303.09429.","journal-title":"CoRR"},{"key":"e_1_3_1_26_2","first-page":"1","article-title":"Multi-grained attention network with mutual exclusion for composed query-based image retrieval","author":"Li Shenshen","year":"2023","unstructured":"Shenshen Li, Xing Xu, Xun Jiang, Fumin Shen, Xin Liu, and Heng Tao Shen. 2023. Multi-grained attention network with mutual exclusion for composed query-based image retrieval. 
IEEE Transactions on Circuits and Systems for Video Technology (2023), 1\u20131.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_1_27_2","doi-asserted-by":"crossref","unstructured":"Shenshen Li Xing Xu Yang Yang Fumin Shen Yijun Mo Yujie Li and Heng Tao Shen. 2023. DCEL: Deep cross-modal evidential learning for text-based person retrieval. In Proceedings of the 31st ACM International Conference on Multimedia (MM\u201923 Ottawa ON Canada 29 October 2023- 3 November 2023) 6292\u20136300.","DOI":"10.1145\/3581783.3612244"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00376"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00213"},{"key":"e_1_3_1_30_2","volume-title":"International Conference on Learning Representations","author":"Loshchilov Ilya","year":"2019","unstructured":"Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations."},{"key":"e_1_3_1_31_2","article-title":"Representation learning with contrastive predictive coding","author":"Oord Aaron van den","year":"2018","unstructured":"Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.","journal-title":"arXiv preprint arXiv:1807.03748"},{"key":"e_1_3_1_32_2","article-title":"Fair contrastive learning for facial attribute classification","volume":"2203","author":"Park Sungho","year":"2022","unstructured":"Sungho Park, Jewook Lee, Pilhyeon Lee, Sunhee Hwang, Dohyung Kim, and Hyeran Byun. 2022. Fair contrastive learning for facial attribute classification. Computing Research Repository abs\/2203.16209.","journal-title":"Computing Research Repository"},{"key":"e_1_3_1_33_2","first-page":"1532","volume-title":"EMNLP","author":"Pennington Jeffrey","year":"2014","unstructured":"Jeffrey Pennington, Richard Socher, and Christopher D. 
Manning. 2014. Glove: Global vectors for word representation. In EMNLP. 1532\u20131543."},{"key":"e_1_3_1_34_2","first-page":"8748","volume-title":"Proceedings of the 38th International Conference on Machine Learning (ICML)","volume":"139","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), Vol. 139. 8748\u20138763."},{"key":"e_1_3_1_35_2","unstructured":"Shaoqing Ren Kaiming He Ross B. Girshick and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 91\u201399."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00591"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00660"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123326"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01376"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00504"},{"key":"e_1_3_1_43_2","doi-asserted-by":"crossref","unstructured":"Xiaohan Wang Linchao Zhu Zhedong Zheng Mingliang Xu and Yi Yang. 2023. Align and tell: Boosting text-video retrieval with local alignment and fine-grained supervision. 
IEEE Transactions on Multimedia 25 (2023) 6079\u20136089.","DOI":"10.1109\/TMM.2022.3204444"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3404835.3462967"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01115"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCYB.2020.3009004"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2023.3299791"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3572844"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3138302"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/3478642"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475659"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475369"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1145\/3584703"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and 
Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3639469","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3639469","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:53:37Z","timestamp":1750287217000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3639469"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,8]]},"references-count":52,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2024,6,30]]}},"alternative-id":["10.1145\/3639469"],"URL":"https:\/\/doi.org\/10.1145\/3639469","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,8]]},"assertion":[{"value":"2023-05-20","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-12-17","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-03-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}