{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,16]],"date-time":"2025-12-16T12:29:35Z","timestamp":1765888175031,"version":"3.41.0"},"reference-count":53,"publisher":"Association for Computing Machinery (ACM)","issue":"2s","license":[{"start":{"date-parts":[[2019,4,30]],"date-time":"2019-04-30T00:00:00Z","timestamp":1556582400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61725203, 61732008, 61876058"],"award-info":[{"award-number":["61725203, 61732008, 61876058"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"National Key Research and Development Program of China","award":["2018YFB0804200"],"award-info":[{"award-number":["2018YFB0804200"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2019,4,30]]},"abstract":"<jats:p>As an indispensable process of cross-media analyzing, comprehending heterogeneous data faces challenges in the fields of visual question answering (VQA), visual captioning, and cross-modality retrieval. Bridging the semantic gap between the two modalities is still difficult. In this article, to address the problem in cross-modality retrieval, we propose a cross-modal learning model with joint correlative calculation learning. First, an auto-encoder is used to embed the visual features by minimizing the error of feature reconstruction and a multi-layer perceptron (MLP) is utilized to model the textual features embedding. Then we design a joint loss function to optimize both the intra- and the inter-correlations among the image-sentence pairs, i.e., the reconstruction loss of visual features, the relevant similarity loss of paired samples, and the triplet relation loss between positive and negative examples. In the proposed method, we optimize the joint loss based on a batch score matrix and utilize all mutual mismatched paired samples to enhance its performance. Our experiments in the retrieval tasks demonstrate the effectiveness of the proposed method. It achieves comparable performance to the state-of-the-art on three benchmarks, i.e., Flickr8k, Flickr30k, and MS-COCO.<\/jats:p>","DOI":"10.1145\/3314577","type":"journal-article","created":{"date-parts":[[2019,7,3]],"date-time":"2019-07-03T13:47:53Z","timestamp":1562161673000},"page":"1-16","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":19,"title":["Cross-Modality Retrieval by Joint Correlation Learning"],"prefix":"10.1145","volume":"15","member":"320","published-online":{"date-parts":[[2019,7,3]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2017.06.007"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/2810133.2810136"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2005.177"},{"key":"e_1_2_1_4_1","volume-title":"Snoek","author":"Dong Jianfeng","year":"2016","unstructured":"Jianfeng Dong, Xirong Li, and Cees G. M. Snoek. 2016. Word2VisualVec: Cross-media retrieval by visual feature prediction. arXiv preprint arXiv:1604.06838 (2016)."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298754"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654902"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2007.4408839"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1016\/S1352-2310(97)00447-0"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46466-4_15"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.560"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.5555\/2566972.2566993"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2016.2614132"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2043612.2043613"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/TBDATA.2016.2515640"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.767"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.5555\/3045118.3045167"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_2_1_19_1","volume-title":"Zemel","author":"Kiros Ryan","year":"2014","unstructured":"Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014)."},{"key":"e_1_2_1_20_1","volume-title":"Fisher vectors derived from hybrid Gaussian-laplacian mixture models for image annotation. arXiv preprint arXiv:1411.7399","author":"Klein Benjamin","year":"2014","unstructured":"Benjamin Klein, Guy Lev, Gil Sadeh, and Lior Wolf. 2014. Fisher vectors derived from hybrid Gaussian-laplacian mixture models for image annotation. arXiv preprint arXiv:1411.7399 (2014)."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299073"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.5555\/2999134.2999257"},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the Australasian Conference on Robotics and Automation","volume":"322","author":"Ledwich Luke","year":"2004","unstructured":"Luke Ledwich and Stefan Williams. 2004. Reduced SIFT features for image retrieval and indoor localisation. In Proceedings of the Australasian Conference on Robotics and Automation, Vol. 322. 3."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.483"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46466-4_50"},{"volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201914)","author":"Lin Tsung-Yi","key":"e_1_2_1_26_1","unstructured":"Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV\u201914). 740--755."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46475-6_17"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00898"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.5555\/850924.851523"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.301"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/3045118.3045138"},{"key":"e_1_2_1_32_1","first-page":"11","article-title":"Visualizing data using t-SNE","volume":"9","author":"van der Maaten Laurens","year":"2008","unstructured":"Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 11 (Nov. 2008), 2579--2605.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_2_1_33_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201915)","author":"Mao Junhua","year":"2015","unstructured":"Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2015. Deep captioning with multimodal recurrent neural networks (m-RNN). In Proceedings of the International Conference on Learning Representations (ICLR\u201915)."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.5555\/2999792.2999959"},{"key":"e_1_2_1_35_1","volume-title":"A randomized algorithm for CCA. arXiv preprint arXiv:1411.3409","author":"Mineiro Paul","year":"2014","unstructured":"Paul Mineiro and Nikos Karampatziakis. 2014. A randomized algorithm for CCA. arXiv preprint arXiv:1411.3409 (2014)."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.378"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/79.543975"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.208"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.303"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.466"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/1873951.1873987"},{"volume-title":"Stan Z. Li and Anil Jain (Eds.)","author":"Reynolds Douglas","key":"e_1_2_1_42_1","unstructured":"Douglas Reynolds. 2015. Gaussian mixture models. Encyclopedia of Biometrics, Stan Z. Li and Anil Jain (Eds.). Springer, 827--832."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"e_1_2_1_44_1","first-page":"1","article-title":"Texture and color features based color image retrieval using canonical correlation","volume":"15","author":"Seetharaman K.","year":"2016","unstructured":"K. Seetharaman and Bachala Shyam Kumar. 2016. Texture and color features based color image retrieval using canonical correlation. Global J. Res. Eng. 15, 6 (2016), 1--9.","journal-title":"Global J. Res. Eng."},{"key":"e_1_2_1_45_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201915)","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR\u201915)."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"volume-title":"L. G. Grimm and P. R. Yarnold","author":"Thompson Bruce","key":"e_1_2_1_47_1","unstructured":"Bruce Thompson. 2000. Canonical correlation analysis. Reading and Understanding More Multivariate Statistics, L. G. Grimm and P. R. Yarnold. American Psychological Association, 285--316."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.541"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240671"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2017.05.001"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298966"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2017.2765836"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3314577","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3314577","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T01:02:03Z","timestamp":1750208523000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3314577"}},"subtitle":[],"editor":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4881-9344","authenticated-orcid":false,"given":"Shuo","family":"Wang","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2594-254X","authenticated-orcid":false,"given":"Dan","family":"Guo","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0748-3669","authenticated-orcid":false,"given":"Xin","family":"Xu","sequence":"additional","affiliation":[]},{"given":"Li","family":"Zhuo","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3094-7735","authenticated-orcid":false,"given":"Meng","family":"Wang","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2019,4,30]]},"references-count":53,"journal-issue":{"issue":"2s","published-print":{"date-parts":[[2019,4,30]]}},"alternative-id":["10.1145\/3314577"],"URL":"https:\/\/doi.org\/10.1145\/3314577","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2019,4,30]]},"assertion":[{"value":"2018-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-02-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-07-03","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}