{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,16]],"date-time":"2026-05-16T03:14:16Z","timestamp":1778901256444,"version":"3.51.4"},"reference-count":62,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2019,2,7]],"date-time":"2019-02-07T00:00:00Z","timestamp":1549497600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61771025 and 61532005"],"award-info":[{"award-number":["61771025 and 61532005"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2019,2,28]]},"abstract":"<jats:p>\n            It is known that the inconsistent distributions and representations of different modalities, such as image and text, cause the heterogeneity gap, which makes it very challenging to correlate heterogeneous data and measure their similarities. Recently, generative adversarial networks (GANs) have been proposed and have shown their strong ability to model data distribution and learn discriminative representation. It has also been shown that adversarial learning can be fully exploited to learn discriminative common representations for bridging the heterogeneity gap. Inspired by this, we aim to effectively correlate large-scale heterogeneous data of different modalities with the power of GANs to model cross-modal joint distribution. In this article, we propose Cross-modal Generative Adversarial Networks (CM-GANs) with the following contributions. 
First, a\n            <jats:italic>cross-modal GAN architecture<\/jats:italic>\n            is proposed to model joint distribution over the data of different modalities. The inter-modality and intra-modality correlation can be explored simultaneously in generative and discriminative models. Both compete with each other to promote cross-modal correlation learning. Second, the\n            <jats:italic>cross-modal convolutional autoencoders with weight-sharing constraint<\/jats:italic>\n            are proposed to form the generative model. They not only exploit the cross-modal correlation for learning the common representations but also preserve reconstruction information for capturing the semantic consistency within each modality. Third, a\n            <jats:italic>cross-modal adversarial training mechanism<\/jats:italic>\n            is proposed, which uses two kinds of discriminative models to simultaneously conduct intra-modality and inter-modality discrimination. They can mutually boost to make the generated common representations more discriminative by the adversarial training process. In summary, our proposed CM-GAN approach can use GANs to perform cross-modal common representation learning by which the heterogeneous data can be effectively correlated. 
Extensive experiments are conducted to verify the performance of CM-GANs on cross-modal retrieval compared with 13 state-of-the-art methods on 4 cross-modal datasets.\n          <\/jats:p>","DOI":"10.1145\/3284750","type":"journal-article","created":{"date-parts":[[2019,2,7]],"date-time":"2019-02-07T15:33:18Z","timestamp":1549553598000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":254,"title":["CM-GANs"],"prefix":"10.1145","volume":"15","author":[{"given":"Yuxin","family":"Peng","sequence":"first","affiliation":[{"name":"Peking University, Beijing, China"}]},{"given":"Jinwei","family":"Qi","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2019,2,7]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"International Conference on Machine Learning (ICML\u201913)","author":"Andrew Galen","year":"2013"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1236471.1236472"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/1646396.1646452"},{"key":"e_1_2_1_4_1","unstructured":"Emily L. Denton Soumith Chintala Rob Fergus et al. 2015. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems (NIPS\u201915). MIT Press 1486--1494."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654902"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2808205"},{"key":"e_1_2_1_7_1","volume-title":"Advances in Neural Information Processing Systems (NIPS\u201916)","author":"Finn Chelsea"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-013-0658-4"},{"key":"e_1_2_1_9_1","volume-title":"Advances in Neural Information Processing Systems (NIPS\u201914)","author":"Goodfellow Ian"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00750"},{"key":"e_1_2_1_11_1","doi-asserted-by":"crossref","unstructured":"Agrim Gupta Justin Johnson Li Fei-Fei Silvio Savarese and Alexandre Alahi. 2018. Social GAN: Socially acceptable trajectories with generative adversarial networks. arXiv preprint arXiv:1803.10892 (2018).","DOI":"10.1109\/CVPR.2018.00240"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1162\/0899766042321814"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICME.2017.8019549"},{"key":"e_1_2_1_14_1","doi-asserted-by":"crossref","unstructured":"Harold Hotelling. 1936. Relations between two sets of variates. Biometrika (1936) 321--377.","DOI":"10.1093\/biomet\/28.3-4.321"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2015.2390499"},{"key":"e_1_2_1_16_1","volume-title":"International Committee on Computational Linguistic (ICCL\u201912)","author":"Kim Jungi"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1181"},{"key":"e_1_2_1_18_1","unstructured":"A. Krizhevsky I. Sutskever and G. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS\u201912). MIT Press 1106--1114."},{"key":"e_1_2_1_19_1","doi-asserted-by":"crossref","unstructured":"Christian Ledig Lucas Theis Ferenc Husz\u00e1r Jose Caballero Andrew Cunningham Alejandro Acosta Andrew Aitken Alykhan Tejani Johannes Totz Zehan Wang et al. 2016. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802 (2016).","DOI":"10.1109\/CVPR.2017.19"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00446"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2017.12.023"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/957013.957143"},{"key":"e_1_2_1_23_1","unstructured":"Jianan Li Xiaodan Liang Yunchao Wei Tingfa Xu Jiashi Feng and Shuicheng Yan. 2017. Perceptual generative adversarial networks for small object detection. arXiv preprint arXiv:1706.05274 (2017)."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3152126"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICME.2017.8019356"},{"key":"e_1_2_1_26_1","doi-asserted-by":"crossref","unstructured":"Harry McGurk and John MacDonald. 1976. Hearing lips and seeing voices. Nature 264 5588 (1976) 746--748.","DOI":"10.1038\/264746a0"},{"key":"e_1_2_1_27_1","volume-title":"Advances in Neural Information Processing Systems (NIPS\u201913)","author":"Mikolov Tomas"},{"key":"e_1_2_1_28_1","unstructured":"Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)."},{"key":"e_1_2_1_29_1","volume-title":"International Conference on Machine Learning (ICML\u201911)","author":"Ngiam Jiquan"},{"key":"e_1_2_1_30_1","unstructured":"Augustus Odena Christopher Olah and Jonathon Shlens. 2016. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585 (2016)."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2015.2482228"},{"key":"e_1_2_1_32_1","volume-title":"International Joint Conference on Artificial Intelligence (IJCAI\u201916)","author":"Peng Yuxin","year":"2016"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2017.2705068"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2017.2742704"},{"key":"e_1_2_1_35_1","doi-asserted-by":"crossref","unstructured":"Yuxin Peng Wenwu Zhu Yao Zhao Changsheng Xu Qingming Huang Hanqing Lu Qinghua Zheng Tiejun Huang and Wen Gao. 2017. Cross-media analysis and reasoning: Advances and directions. Frontiers of Information Technology & Electronic Engineering 18 1 (2017) 44--57.","DOI":"10.1631\/FITEE.1601787"},{"key":"e_1_2_1_36_1","unstructured":"Alec Radford Luke Metz and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.466"},{"key":"e_1_2_1_38_1","volume-title":"Mechanical Turk. In NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon\u2019s Mechanical Turk. ACL, 139--147","author":"Rashtchian Cyrus","year":"2010"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/1873951.1873987"},{"key":"e_1_2_1_40_1","volume-title":"International Conference on Machine Learning (ICML\u201916)","author":"Reed Scott","year":"2016"},{"key":"e_1_2_1_41_1","volume-title":"Advances in Neural Information Processing Systems (NIPS\u201916)","author":"Reed Scott E."},{"key":"e_1_2_1_42_1","volume-title":"Advances in Neural Information Processing Systems (NIPS\u201915)","author":"Ren Shaoqing"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2006.884618"},{"key":"e_1_2_1_44_1","volume-title":"International Conference on Learning Representations (ICLR\u201914)","author":"Simonyan Karen","year":"2014"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2010.5540112"},{"key":"e_1_2_1_46_1","volume-title":"International Conference on Machine Learning (ICML\u201912)","author":"Srivastava Nitish","year":"2012"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123326"},{"key":"e_1_2_1_48_1","volume-title":"International Joint Conference on Artificial 
Intelligence (IJCAI\u201915)","author":"Wang Daixin","year":"2015"},{"key":"e_1_2_1_49_1","volume-title":"AAAI Conference on Artificial Intelligence (AAAI\u201918)","author":"Wang Hongwei","year":"2018"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3077136.3080786"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2015.2505311"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46493-0_20"},{"key":"e_1_2_1_53_1","first-page":"449","article-title":"Cross-modal retrieval with CNN visual features: A new baseline","volume":"47","author":"Wei Yunchao","year":"2017","journal-title":"IEEE Transactions on Cybernetics"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/2775109"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2502081.2502097"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2967231"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2017.2676345"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298966"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2008.917359"},{"key":"e_1_2_1_60_1","volume-title":"AAAI Conference on Artificial Intelligence (AAAI\u201913)","author":"Zhai Xiaohua","year":"2013"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2013.2276704"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.629"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and 
Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3284750","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3284750","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:43:34Z","timestamp":1750207414000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3284750"}},"subtitle":["Cross-modal Generative Adversarial Networks for Common Representation Learning"],"short-title":[],"issued":{"date-parts":[[2019,2,7]]},"references-count":62,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2019,2,28]]}},"alternative-id":["10.1145\/3284750"],"URL":"https:\/\/doi.org\/10.1145\/3284750","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,2,7]]},"assertion":[{"value":"2018-04-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-02-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}