{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,18]],"date-time":"2026-01-18T21:34:46Z","timestamp":1768772086162,"version":"3.49.0"},"reference-count":46,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2025,2,28]],"date-time":"2025-02-28T00:00:00Z","timestamp":1740700800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,2,28]],"date-time":"2025-02-28T00:00:00Z","timestamp":1740700800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100020884","name":"Agencia Nacional de Investigaci\u00f3n y Desarrollo","doi-asserted-by":"publisher","award":["IT21I0019"],"award-info":[{"award-number":["IT21I0019"]}],"id":[{"id":"10.13039\/501100020884","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100020884","name":"Agencia Nacional de Investigaci\u00f3n y Desarrollo","doi-asserted-by":"publisher","award":["IT21I0019"],"award-info":[{"award-number":["IT21I0019"]}],"id":[{"id":"10.13039\/501100020884","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100020884","name":"Agencia Nacional de Investigaci\u00f3n y Desarrollo","doi-asserted-by":"publisher","award":["IT21I0019"],"award-info":[{"award-number":["IT21I0019"]}],"id":[{"id":"10.13039\/501100020884","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100014374","name":"Universitat Polit\u00e8cnica de Catalunya","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100014374","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach Learn"],"published-print":{"date-parts":[[2025,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Cross-modal retrieval requires building a common latent 
space that captures and correlates information from different data modalities, usually images and texts. Cross-modal training based on the triplet loss with hard negative mining is a state-of-the-art technique to address this problem. This paper shows that such an approach is not always effective in handling intra-modal similarities. Specifically, we found that this method can lead to inconsistent similarity orderings in the latent space, where intra-modal pairs with unknown ground-truth similarity are ranked higher than cross-modal pairs representing the same concept. To address this problem, we propose two novel loss functions that leverage intra-modal similarity constraints available in a training triplet but not used by the original formulation. Additionally, this paper explores the application of this framework to unsupervised image retrieval problems, where cross-modal training can provide the supervisory signals that are otherwise missing in the absence of category labels. To the best of our knowledge, we are the first to evaluate cross-modal training for intra-modal retrieval without labels. We present comprehensive experiments on MS-COCO and Flickr30k, demonstrating the advantages and limitations of the proposed methods in cross-modal and intra-modal retrieval tasks in terms of performance and novelty measures. We also conduct a case study on the ROCO dataset to assess the performance of our method on medical images and present an ablation study on one of our approaches to understand the impact of the different components of the proposed loss function. 
Our code is publicly available on GitHub <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/MariodotR\/FullHN.git\" ext-link-type=\"uri\">https:\/\/github.com\/MariodotR\/FullHN.git<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s10994-024-06710-z","type":"journal-article","created":{"date-parts":[[2025,2,28]],"date-time":"2025-02-28T18:51:17Z","timestamp":1740768677000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Intramodal consistency in triplet-based cross-modal learning for image retrieval"],"prefix":"10.1007","volume":"114","author":[{"given":"Mario","family":"Mallea","sequence":"first","affiliation":[]},{"given":"Ricardo","family":"\u00d1anculef","sequence":"additional","affiliation":[]},{"given":"Mauricio","family":"Araya","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,2,28]]},"reference":[{"key":"6710_CR1","doi-asserted-by":"crossref","unstructured":"Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). Spice: Semantic propositional image caption evaluation. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), Computer Vision - ECCV 2016 (pp. 382\u2013398). Cham: Springer.","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"6710_CR2","doi-asserted-by":"crossref","first-page":"456","DOI":"10.1016\/j.patrec.2020.02.006","volume":"131","author":"U Chaudhuri","year":"2020","unstructured":"Chaudhuri, U., Banerjee, B., Bhattacharya, A., & Datcu, M. (2020). Cmir-net: A deep learning based model for cross-modal retrieval in remote sensing. Pattern recognition letters, 131, 456\u2013462.","journal-title":"Pattern recognition letters"},{"key":"6710_CR3","doi-asserted-by":"publisher","unstructured":"Clarke, C.L.A., Kolla, M., Cormack, G.V., Vechtomova, O., Ashkan, A., B\u00fcttcher, S., & MacKinnon, I. (2008). Novelty and diversity in information retrieval evaluation. 
In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR \u201908, pp. 659\u2013666. Association for Computing Machinery, New York, NY, USA. https:\/\/doi.org\/10.1145\/1390334.1390446","DOI":"10.1145\/1390334.1390446"},{"key":"6710_CR4","doi-asserted-by":"crossref","unstructured":"Desai, K., & Johnson, J. (2020). Virtex: Learning visual representations from textual annotations. 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11157\u201311168","DOI":"10.1109\/CVPR46437.2021.01101"},{"key":"6710_CR5","doi-asserted-by":"crossref","unstructured":"Do, T.-T., Tran, T., Reid, I., et al. (2019). A theoretically sound upper bound on the triplet loss for improving the efficiency of deep distance metric learning. In: IEEE CVPR, pp. 10404\u201310413","DOI":"10.1109\/CVPR.2019.01065"},{"key":"6710_CR6","doi-asserted-by":"crossref","first-page":"2687","DOI":"10.1109\/TCSVT.2021.3080920","volume":"32","author":"SR Dubey","year":"2020","unstructured":"Dubey, S. R. (2020). A decade survey of content based image retrieval using deep learning. IEEE Transactions on Circuits and Systems for Video Technology, 32, 2687\u20132704.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"6710_CR7","unstructured":"Eslami, S., Melo, G., & Meinel, C. (2021). Does clip benefit visual question answering in the medical domain as much as it does in the general domain? ArXiv abs\/2112.13906"},{"key":"6710_CR8","unstructured":"Faghri, F., Fleet, D.J., Kiros, J.R., & Fidler, S. (2017). Vse++: Improving visual-semantic embeddings with hard negatives. In: British Machine Vision Conference. 
https:\/\/api.semanticscholar.org\/CorpusID:6095318"},{"key":"6710_CR9","doi-asserted-by":"crossref","first-page":"272","DOI":"10.1007\/978-3-030-01231-1_17","volume-title":"Computer Vision - ECCV 2018","author":"W Ge","year":"2018","unstructured":"Ge, W., Huang, W., Dong, D., & Scott, M. R. (2018). Deep metric learning with hierarchical triplet loss. In V. Ferrari, M. Hebert, C. Sminchisescu, & Y. Weiss (Eds.), Computer Vision - ECCV 2018 (pp. 272\u2013288). Cham: Springer."},{"issue":"4","key":"6710_CR10","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3447755","volume":"54","author":"FL Gewers","year":"2021","unstructured":"Gewers, F. L., Ferreira, G. R., Arruda, H. F. D., Silva, F. N., Comin, C. H., Amancio, D. R., & Costa, L. D. F. (2021). Principal component analysis. ACM Computing Surveys, 54(4), 1\u201334. https:\/\/doi.org\/10.1145\/3447755","journal-title":"ACM Computing Surveys"},{"key":"6710_CR11","doi-asserted-by":"crossref","DOI":"10.1016\/j.patcog.2022.109272","volume":"137","author":"Y Gong","year":"2023","unstructured":"Gong, Y., & Cosma, G. (2023). Improving visual-semantic embeddings by learning semantically-enhanced hard negatives for cross-modal information retrieval. Pattern Recognition, 137, 109272.","journal-title":"Pattern Recognition"},{"key":"6710_CR12","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3458754","volume":"3","author":"Y Gu","year":"2021","unstructured":"Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., & Poon, H. (2021). Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare, 3, 1.","journal-title":"ACM Transactions on Computing for Healthcare"},{"key":"6710_CR13","unstructured":"Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q.V., Sung, Y.-H., Li, Z., & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. 
ArXiv abs\/2102.05918"},{"issue":"4","key":"6710_CR14","doi-asserted-by":"crossref","first-page":"664","DOI":"10.1109\/TPAMI.2016.2598339","volume":"39","author":"A Karpathy","year":"2017","unstructured":"Karpathy, A., & Fei-Fei, L. (2017). Deep visual-semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 664\u2013676.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"6710_CR15","first-page":"1066","volume":"11","author":"M Kaya","year":"2019","unstructured":"Kaya, M., & Bilge, H. S. (2019). Deep metric learning: A survey. Symmetry, 11, 1066.","journal-title":"Symmetry"},{"key":"6710_CR16","unstructured":"Kingma, D.P., & Ba, J. (2017). Adam: A Method for Stochastic Optimization"},{"key":"6710_CR17","unstructured":"Lin, C.-Y. (2004). Rouge: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74\u201381"},{"key":"6710_CR18","doi-asserted-by":"crossref","unstructured":"Lin, T.-Y., Maire, M., Belongie, S., et al. (2014). Microsoft coco: Common objects in context. In: ECCV, pp. 740\u2013755","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"6710_CR19","doi-asserted-by":"crossref","first-page":"675","DOI":"10.1016\/j.neucom.2020.07.139","volume":"452","author":"X Li","year":"2021","unstructured":"Li, X., Yang, J., & Ma, J. (2021). Recent developments of content-based image retrieval (cbir). Neurocomputing, 452, 675\u2013689.","journal-title":"Neurocomputing"},{"key":"6710_CR20","doi-asserted-by":"crossref","unstructured":"Ma, H., Zhao, H., Lin, Z., et al. (2022). Ei-clip: Entity-aware interventional contrastive learning for e-commerce cross-modal retrieval. In: CVPR, pp. 18051\u201318061","DOI":"10.1109\/CVPR52688.2022.01752"},{"key":"6710_CR21","doi-asserted-by":"crossref","unstructured":"Messina, N., Falchi, F., Esuli, A., & Amato, G. (2021). 
Transformer reasoning network for image-text matching and retrieval. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 5222\u20135229. IEEE","DOI":"10.1109\/ICPR48806.2021.9413172"},{"key":"6710_CR22","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3451390","volume":"45","author":"N Messina","year":"2021","unstructured":"Messina, N., Amato, G., & Esuli, A. (2021). Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders. ACM Transactions on Multimedia Computing, Communications, and Applications, 45, 1\u201323.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"key":"6710_CR23","doi-asserted-by":"crossref","unstructured":"Molina, G., Mendoza, M., Loayza, I., et al. (2022). A new content-based image retrieval system for sars-cov-2 computer-aided diagnosis. In: MICAD 2021, pp. 316\u2013324","DOI":"10.1007\/978-981-16-3880-0_33"},{"key":"6710_CR24","unstructured":"Pelka, O., Koitka, S., R\u00fcckert, J., Nensa, F., & Friedrich, C. M. (2018). Radiology objects in context (roco): A multimodal image dataset. In D. Stoyanov, Z. Taylor, S. Balocco, R. Sznitman, A. Martel, L. Maier-Hein, L. Duong, G. Zahnd, S. Demirci, S. Albarqouni, S.-L. Lee, S. Moriconi, V. Cheplygina, D. Mateus, E. Trucco, E. Granger, & P. Jannin (Eds.), Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis (pp. 180\u2013189). Cham: Springer."},{"key":"6710_CR25","doi-asserted-by":"crossref","unstructured":"Ren, S., He, K., Girshick, R., & Sun, J. (2016). Faster r-cnn: Towards real-time object detection with region proposal networks arXiv:1506.01497 [cs.CV]","DOI":"10.1109\/TPAMI.2016.2577031"},{"key":"6710_CR26","doi-asserted-by":"crossref","unstructured":"Ren, R., Lv, S., Qu, Y., et al. (2021). Pair: Leveraging passage-centric similarity relation for improving dense passage retrieval, pp. 
2173\u20132183","DOI":"10.18653\/v1\/2021.findings-acl.191"},{"key":"6710_CR27","doi-asserted-by":"crossref","unstructured":"Schubert, E. (2021). A triangle inequality for cosine similarity","DOI":"10.1007\/978-3-030-89657-7_3"},{"key":"6710_CR28","doi-asserted-by":"crossref","unstructured":"Song, Y., & Soleymani, M. (2019). Polysemous visual-semantic embedding for cross-modal retrieval. In: CVPR, pp. 1979\u20131988","DOI":"10.1109\/CVPR.2019.00208"},{"key":"6710_CR29","doi-asserted-by":"crossref","unstructured":"Song, H.O., Xiang, Y., Jegelka, S., & Savarese, S. (2016). Deep metric learning via lifted structured feature embedding. In: IEEE CVPR, pp. 4004\u20134012","DOI":"10.1109\/CVPR.2016.434"},{"key":"6710_CR30","first-page":"16857","volume":"33","author":"K Song","year":"2020","unstructured":"Song, K., Tan, X., Qin, T., Lu, J., & Liu, T.-Y. (2020). Mpnet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33, 16857\u201316867.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"6710_CR31","doi-asserted-by":"publisher","unstructured":"Sotomayor, C.G., Mendoza, M., Casta\u00f1eda, V., Far\u00edas, H., Molina, G., Pereira, G., H\u00e4rtel, S., Solar, M., & Araya, M. (2021). Content-based medical image retrieval and intelligent interactive visual browser for medical education, research and care. Diagnostics 11(8) https:\/\/doi.org\/10.3390\/diagnostics11081470","DOI":"10.3390\/diagnostics11081470"},{"key":"6710_CR32","unstructured":"Tan, M., & Le, Q.V. (2021). Efficientnetv2: Smaller models and faster training. CoRR abs\/2104.00298"},{"key":"6710_CR33","doi-asserted-by":"crossref","unstructured":"Tian, Y., Yu, X., Fan, B., et al. (2019). Sosnet: Second order similarity regularization for local descriptor learning, pp. 11008\u201311017","DOI":"10.1109\/CVPR.2019.01127"},{"key":"6710_CR34","unstructured":"Ng, T., Balntas, V., Tian, Y., & Mikolajczyk, K. 
(2020). Solar: Second-order loss and attention for image retrieval. ArXiv"},{"key":"6710_CR35","doi-asserted-by":"crossref","unstructured":"Wang, Z., Wang, Y., Dong, B., et al. (2020). Adaptive margin based deep adversarial metric learning. IEEE BigDataSecurity\/HPSC\/IDS, 2020, 100\u2013108.","DOI":"10.1109\/BigDataSecurity-HPSC-IDS49724.2020.00028"},{"key":"6710_CR36","doi-asserted-by":"crossref","unstructured":"Chen, W., Chen, X., Zhang, J., & Huang, K. (2017). Beyond triplet loss: A deep quadruplet network for person re-identification. IEEE CVPR, 1320\u20131329","DOI":"10.1109\/CVPR.2017.145"},{"key":"6710_CR37","doi-asserted-by":"crossref","unstructured":"Wu, Y., Wang, S., & Huang, Q. (2017). Online asymmetric similarity learning for cross-modal retrieval. In: IEEE CVPR, pp. 3984\u20133993","DOI":"10.1109\/CVPR.2017.424"},{"issue":"5","key":"6710_CR38","doi-asserted-by":"crossref","first-page":"1310","DOI":"10.1109\/TMM.2019.2942494","volume":"22","author":"Y Wu","year":"2020","unstructured":"Wu, Y., Wang, S., & Huang, Q. (2020). Online fast adaptive low-rank similarity learning for cross-modal retrieval. IEEE Transactions on Multimedia, 22(5), 1310\u20131322.","journal-title":"IEEE Transactions on Multimedia"},{"key":"6710_CR39","doi-asserted-by":"crossref","unstructured":"Xuan, H., Stylianou, A., Liu, X., & Pless, R. (2020). Hard negative examples are hard, but useful. In: ECCV, pp. 126\u2013142","DOI":"10.1007\/978-3-030-58568-6_8"},{"key":"6710_CR40","doi-asserted-by":"crossref","unstructured":"Yang, J., Duan, J., Tran, S., et al. (2022). Vision-language pre-training with triple contrastive learning. In: 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15650\u201315659","DOI":"10.1109\/CVPR52688.2022.01522"},{"issue":"6","key":"6710_CR41","doi-asserted-by":"crossref","first-page":"2872","DOI":"10.1109\/TPAMI.2021.3054775","volume":"44","author":"M Ye","year":"2021","unstructured":"Ye, M., Shen, J., & Lin, G. (2021). 
Deep learning for person re-identification: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6), 2872\u20132893.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"6710_CR42","doi-asserted-by":"crossref","first-page":"67","DOI":"10.1162\/tacl_a_00166","volume":"2","author":"P Young","year":"2014","unstructured":"Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2, 67\u201378.","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"6710_CR43","doi-asserted-by":"crossref","unstructured":"Yuan, X., Lin, Z.L., Kuen, J., Zhang, J., Wang, Y., Maire, M., Kale, A., & Faieta, B. (2021). Multimodal contrastive training for visual representation learning. 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6991\u20137000","DOI":"10.1109\/CVPR46437.2021.00692"},{"key":"6710_CR44","unstructured":"Zhang, S., Xu, Y., Usuyama, N., Bagga, J.K., Tinn, R., Preston, S., Rao, R.N., Wei, M.-H., Valluri, N., Wong, C., Lungren, M.P., Naumann, T., & Poon, H. (2023). Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. https:\/\/api.semanticscholar.org\/CorpusID:257280046"},{"issue":"12","key":"6710_CR45","doi-asserted-by":"crossref","first-page":"3180","DOI":"10.1109\/TMM.2020.2972125","volume":"22","author":"C Zhao","year":"2020","unstructured":"Zhao, C., Lv, X., & Zhang, Z. (2020). Deep fusion feature representation learning with hard mining center-triplet loss for person re-identification. 
IEEE Transactions on Multimedia, 22(12), 3180\u20133195.","journal-title":"IEEE Transactions on Multimedia"},{"key":"6710_CR46","doi-asserted-by":"crossref","first-page":"4511","DOI":"10.1073\/pnas.1000488107","volume":"107","author":"T Zhou","year":"2010","unstructured":"Zhou, T., Kuscsik, Z., & Liu, J. (2010). Solving the apparent diversity-accuracy dilemma of recommender systems. Proceedings of the National Academy of Sciences, 107, 4511\u20134515.","journal-title":"Proceedings of the National Academy of Sciences"}],"container-title":["Machine Learning"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-024-06710-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10994-024-06710-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-024-06710-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,30]],"date-time":"2025-03-30T15:08:38Z","timestamp":1743347318000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10994-024-06710-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,28]]},"references-count":46,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,4]]}},"alternative-id":["6710"],"URL":"https:\/\/doi.org\/10.1007\/s10994-024-06710-z","relation":{},"ISSN":["0885-6125","1573-0565"],"issn-type":[{"value":"0885-6125","type":"print"},{"value":"1573-0565","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,28]]},"assertion":[{"value":"25 March 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 October 
2024","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"31 October 2024","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 February 2025","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"110"}}