{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,6]],"date-time":"2026-02-06T05:11:39Z","timestamp":1770354699766,"version":"3.49.0"},"reference-count":61,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2019,2,7]],"date-time":"2019-02-07T00:00:00Z","timestamp":1549497600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"crossref","award":["2018JBZ001"],"award-info":[{"award-number":["2018JBZ001"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61532005, 61332012, and 61572065"],"award-info":[{"award-number":["61532005, 61332012, and 61572065"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Joint Fund of Ministry of Education of China and China Mobile","award":["MCM20160102"],"award-info":[{"award-number":["MCM20160102"]}]},{"DOI":"10.13039\/501100012166","name":"National Key Research and Development of China","doi-asserted-by":"crossref","award":["2016YFB0800404"],"award-info":[{"award-number":["2016YFB0800404"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2019,2,28]]},"abstract":"<jats:p>\n            Performing direct matching among different modalities (like image and text) can benefit many tasks in computer vision, multimedia, information retrieval, and information fusion. 
Most existing works focus on class-level image-text matching, called\n            <jats:italic>cross-modal retrieval<\/jats:italic>\n            , which attempts to propose a uniform model for matching images with all types of texts, for example, tags, sentences, and articles (long texts). Although cross-modal retrieval alleviates the heterogeneous gap between visual and textual information, it can provide only a rough correspondence between the two modalities. In this article, we propose a more precise image-text embedding method, image-sentence matching, which can provide heterogeneous matching at the instance level. The key issue for image-text embedding is how to make the distributions of the two modalities consistent in the embedding space. To address this problem, some previous works on the cross-modal retrieval task have attempted to pull the two distributions closer by employing adversarial learning. However, the effectiveness of adversarial learning on image-sentence matching has not been proven, and an effective method is still lacking. Inspired by these works, we propose to learn a modality-invariant image-text embedding for image-sentence matching by incorporating adversarial learning. On top of a triplet loss--based baseline, we design a modality classification network with an adversarial loss, which classifies an embedding into either the image or text modality. In addition, the multi-stage training procedure is carefully designed so that the proposed network not only imposes the image-text similarity constraints through ground-truth labels, but also enforces the image and text embedding distributions to be similar through adversarial learning. 
Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields stable accuracy improvement over the baseline model and that our results compare favorably to the state-of-the-art methods.\n          <\/jats:p>","DOI":"10.1145\/3300939","type":"journal-article","created":{"date-parts":[[2019,2,7]],"date-time":"2019-02-07T15:33:18Z","timestamp":1549553598000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":24,"title":["Modality-Invariant Image-Text Embedding for Image-Sentence Matching"],"prefix":"10.1145","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4679-3179","authenticated-orcid":false,"given":"Ruoyu","family":"Liu","sequence":"first","affiliation":[{"name":"Beijing Jiaotong University, Beijing, P. R., China"}]},{"given":"Yao","family":"Zhao","sequence":"additional","affiliation":[{"name":"Beijing Jiaotong University, Beijing, P. R., China"}]},{"given":"Shikui","family":"Wei","sequence":"additional","affiliation":[{"name":"Beijing Jiaotong University, Beijing, P. R., China"}]},{"given":"Liang","family":"Zheng","sequence":"additional","affiliation":[{"name":"Australian National University, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0512-880X","authenticated-orcid":false,"given":"Yi","family":"Yang","sequence":"additional","affiliation":[{"name":"University of Technology Sydney, Ultimo NSW, Australia"}]}],"member":"320","published-online":{"date-parts":[[2019,2,7]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Mind\u2019s eye: A recurrent visual representation for image caption generation","author":"Chen Xinlei"},{"key":"e_1_2_1_2_1","volume-title":"Dzmitry Bahdanau, and Yoshua Bengio.","author":"Cho Kyunghyun","year":"2014"},{"key":"e_1_2_1_3_1","volume-title":"Empirical evaluation of gated recurrent neural networks on sequence modeling. 
Arxiv Preprint Arxiv:1412.3555","author":"Chung Junyoung","year":"2014"},{"key":"e_1_2_1_4_1","volume-title":"Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell.","author":"Donahue Jeffrey","year":"2015"},{"key":"e_1_2_1_5_1","volume-title":"Linking image and text with 2-way nets. Arxiv Preprint Arxiv:1608.07973","author":"Eisenschtat Aviv","year":"2016"},{"key":"e_1_2_1_6_1","volume-title":"Jamie Ryan Kiros, and Sanja Fidler","author":"Faghri Fartash","year":"2017"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2808205"},{"key":"e_1_2_1_8_1","volume-title":"Devise: A deep visual-semantic embedding model","author":"Frome Andrea","year":"2013"},{"key":"e_1_2_1_9_1","unstructured":"Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In ICML. ACM 1180--1189."},{"key":"e_1_2_1_10_1","first-page":"1","article-title":"Domain-adversarial training of neural networks","volume":"17","author":"Ganin Yaroslav","year":"2016","journal-title":"JMLR"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-013-0658-4"},{"key":"e_1_2_1_12_1","volume-title":"Generative adversarial nets","author":"Goodfellow Ian"},{"key":"e_1_2_1_13_1","volume-title":"imagine and match: Improving textual-visual cross-modal retrieval with generative models. 
Arxiv Preprint Arxiv:1711.06420","author":"Gu Jiuxiang","year":"2017"},{"key":"e_1_2_1_14_1","volume-title":"Deep residual learning for image recognition","author":"He Kaiming"},{"key":"e_1_2_1_15_1","volume-title":"Unsupervised cross-modal retrieval through adversarial learning","author":"He Li"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.5555\/2566972.2566993"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/28.3-4.321"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2017.2760101"},{"key":"e_1_2_1_19_1","volume-title":"Instance-aware image and sentence matching with selective multimodal LSTM. Arxiv Preprint Arxiv:1611.05588","author":"Huang Yan","year":"2016"},{"key":"e_1_2_1_20_1","volume-title":"Learning semantic concepts and order for image and sentence matching. Arxiv Preprint Arxiv:1712.02036","author":"Huang Yan","year":"2017"},{"key":"e_1_2_1_21_1","doi-asserted-by":"crossref","volume-title":"Deep visual-semantic alignments for generating image descriptions","author":"Karpathy Andrej","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_2_1_22_1","volume-title":"Fei-Fei","author":"Karpathy Andrej","year":"2014"},{"key":"e_1_2_1_23_1","volume-title":"Adam: A method for stochastic optimization. 
Arxiv Preprint Arxiv:1412.6980","author":"Kingma Diederik","year":"2014"},{"key":"e_1_2_1_24_1","volume-title":"Zemel","author":"Kiros Ryan","year":"2014"},{"key":"e_1_2_1_25_1","volume-title":"Skip-thought vectors","author":"Kiros Ryan"},{"key":"e_1_2_1_26_1","volume-title":"Associating neural word embeddings with deep image representations using Fisher vectors","author":"Klein Benjamin"},{"key":"e_1_2_1_27_1","volume-title":"Hinton","author":"Krizhevsky Alex","year":"2012"},{"key":"e_1_2_1_28_1","doi-asserted-by":"crossref","volume-title":"RNN Fisher vectors for action recognition and image annotation","author":"Lev Guy","DOI":"10.1007\/978-3-319-46466-4_50"},{"key":"e_1_2_1_29_1","volume-title":"Microsoft COCO: Common objects in context","author":"Lin Tsung-Yi"},{"key":"e_1_2_1_30_1","volume-title":"Leveraging visual question answering for image-caption ranking","author":"Lin Xiao"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/MMUL.2018.112142537"},{"key":"e_1_2_1_32_1","doi-asserted-by":"crossref","volume-title":"Cross-media hashing with centroid approaching","author":"Liu Ruoyu","DOI":"10.1109\/ICME.2015.7177473"},{"key":"e_1_2_1_33_1","volume-title":"A new evaluation protocol and benchmarking results for extendable cross-media retrieval. 
Arxiv Preprint Arxiv:1703.03567","author":"Liu Ruoyu","year":"2017"},{"key":"e_1_2_1_34_1","volume-title":"Lew","author":"Liu Yu","year":"2017"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.301"},{"key":"e_1_2_1_36_1","first-page":"2579","article-title":"Visualizing data using t-SNE","author":"van der Maaten Laurens","year":"2008","journal-title":"JMLR 9"},{"key":"e_1_2_1_37_1","volume-title":"Yuille","author":"Mao Junhua","year":"2014"},{"key":"e_1_2_1_38_1","first-page":"234","article-title":"Context dependent recurrent neural network language model","volume":"12","author":"Mikolov Tomas","year":"2012","journal-title":"SLT"},{"key":"e_1_2_1_39_1","volume-title":"Dual attention networks for multimodal reasoning and matching. Arxiv Preprint Arxiv:1611.00471","author":"Nam Hyeonseob","year":"2016"},{"key":"e_1_2_1_40_1","volume-title":"Hierarchical multimodal LSTM for dense visual-semantic embedding","author":"Niu Zhenxing","year":"1881"},{"key":"e_1_2_1_41_1","volume-title":"Deep metric learning via lifted structured feature embedding","author":"Song Hyun Oh"},{"key":"e_1_2_1_42_1","volume-title":"Image-text multi-modal representation learning by adversarial backpropagation. Arxiv Preprint Arxiv:1612.08354","author":"Park Gwangbeen","year":"2016"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2017.2705068"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1631\/FITEE.1601787"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/1873951.1873987"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_2_1_47_1","volume-title":"FaceNet: A unified embedding for face recognition and clustering","author":"Schroff Florian"},{"key":"e_1_2_1_48_1","volume-title":"Jacobs","author":"Sharma Abhishek","year":"2012"},{"key":"e_1_2_1_49_1","volume-title":"Very deep convolutional networks for large-scale image recognition. 
Arxiv Preprint Arxiv:1409.1556","author":"Simonyan Karen","year":"2014"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00177"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.5555\/2627435.2670313"},{"key":"e_1_2_1_52_1","volume-title":"Show and tell: A neural image caption generator","author":"Vinyals Oriol"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123326"},{"key":"e_1_2_1_54_1","doi-asserted-by":"crossref","volume-title":"Learning deep structure-preserving image-text embeddings","author":"Wang Liwei","DOI":"10.1109\/CVPR.2016.541"},{"key":"e_1_2_1_55_1","first-page":"449","article-title":"Cross-modal retrieval with CNN visual features: A new baseline","volume":"47","author":"Wei Yunchao","year":"2017","journal-title":"IEEE Trans. on Cybernetics"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2967231"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2017.2676345"},{"key":"e_1_2_1_58_1","doi-asserted-by":"crossref","volume-title":"Deep correlation for matching images and text","author":"Yan Fei","DOI":"10.1109\/CVPR.2015.7298966"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00166"},{"key":"e_1_2_1_60_1","volume-title":"Unsupervised generative adversarial cross-modal hashing. Arxiv Preprint Arxiv:1712.00358","author":"Zhang Jian","year":"2017"},{"key":"e_1_2_1_61_1","volume-title":"Dual-path convolutional image-text embedding. 
Arxiv Preprint Arxiv:1711.05535","author":"Zheng Zhedong","year":"2017"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3300939","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3300939","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:25:23Z","timestamp":1750206323000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3300939"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,2,7]]},"references-count":61,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2019,2,28]]}},"alternative-id":["10.1145\/3300939"],"URL":"https:\/\/doi.org\/10.1145\/3300939","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,2,7]]},"assertion":[{"value":"2018-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-02-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}