{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,15]],"date-time":"2026-01-15T23:50:58Z","timestamp":1768521058514,"version":"3.49.0"},"reference-count":78,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2022,3,4]],"date-time":"2022-03-04T00:00:00Z","timestamp":1646352000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Key Research and Development Program of China","award":["2018AAA0100604"],"award-info":[{"award-number":["2018AAA0100604"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62036012, 61720106006, 62002355, 61721004, 61832002, 62072455, U1705262, and U1836220"],"award-info":[{"award-number":["62036012, 61720106006, 62002355, 61721004, 61832002, 62072455, U1705262, and U1836220"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Key Research Program of Frontier Sciences of CAS","award":["QYZDJ-SSW- JSC039"],"award-info":[{"award-number":["QYZDJ-SSW- JSC039"]}]},{"DOI":"10.13039\/501100012152","name":"National Postdoctoral Program for Innovative Talents","doi-asserted-by":"crossref","award":["BX20190367"],"award-info":[{"award-number":["BX20190367"]}],"id":[{"id":"10.13039\/501100012152","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Beijing Natural Science Foundation","award":["L201001"],"award-info":[{"award-number":["L201001"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2022,5,31]]},"abstract":"<jats:p>\n            Composing Text and Image to Image Retrieval (\n            <jats:italic>CTI-IR<\/jats:italic>\n            ) is an emerging task in computer vision, which allows retrieving images relevant to a query image with text describing desired modifications to the query image. Most conventional cross-modal retrieval approaches usually take one modality data as the query to retrieve relevant data of another modality. Different from the existing methods, in this article, we propose an end-to-end trainable network for simultaneous image generation and\n            <jats:italic>CTI-IR<\/jats:italic>\n            . The proposed model is based on Generative Adversarial Network (GAN) and enjoys several merits. First, it can learn a generative and discriminative feature for the query (a query image with text description) by jointly training a generative model and a retrieval model. Second, our model can automatically manipulate the visual features of the reference image in terms of the text description by the adversarial learning between the synthesized image and target image. Third, global-local collaborative discriminators and attention-based generators are exploited, allowing our approach to focus on both the global and local differences between the query image and the target image. As a result, the semantic consistency and fine-grained details of the generated images can be better enhanced in our model. The generated image can also be used to interpret and empower our retrieval model. Quantitative and qualitative evaluations on three benchmark datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.\n          <\/jats:p>","DOI":"10.1145\/3478642","type":"journal-article","created":{"date-parts":[[2022,3,4]],"date-time":"2022-03-04T09:53:20Z","timestamp":1646387600000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":20,"title":["Tell, Imagine, and Search: End-to-end Learning for Composing Text and Image to Image Retrieval"],"prefix":"10.1145","volume":"18","author":[{"given":"Feifei","family":"Zhang","sequence":"first","affiliation":[{"name":"NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China"}]},{"given":"Mingliang","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Information Engineering, Zhengzhou University, Henan Province, China"}]},{"given":"Changsheng","family":"Xu","sequence":"additional","affiliation":[{"name":"NLPR, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Peng Cheng Laboratory, Shenzhen, China"}]}],"member":"320","published-online":{"date-parts":[[2022,3,4]]},"reference":[{"key":"e_1_3_1_2_2","first-page":"7708","volume-title":"CVPR","author":"Ak Kenan E.","year":"2018","unstructured":"Kenan E. Ak, Ashraf A. Kassim, Joo Hwee Lim, and Jo Yew Tham. 2018. Learning attribute representations with localization for flexible fashion search. In CVPR. 7708\u20137717."},{"key":"e_1_3_1_3_2","first-page":"10541","volume-title":"ICCV","author":"Ak Kenan E.","year":"2019","unstructured":"Kenan E. Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf A. Kassim. 2019. Attribute manipulation generative adversarial networks for fashion images. In ICCV. 10541\u201310550."},{"key":"e_1_3_1_4_2","first-page":"12652","article-title":"IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval","author":"Chen H.","year":"2020","unstructured":"H. Chen, G. Ding, Xudong Liu, Zijia Lin, J. Liu, and J. Han. 2020. IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In CVPR. 12652\u201312660.","journal-title":"CVPR"},{"key":"e_1_3_1_5_2","first-page":"791","volume-title":"CVPR","author":"Chen Jiaxin","year":"2019","unstructured":"Jiaxin Chen, Jie Qin, Li Liu, Fan Zhu, Fumin Shen, Jin Xie, and Ling Shao. 2019. Deep sketch-shape hashing with segmented 3D stochastic viewing. In CVPR. 791\u2013800."},{"key":"e_1_3_1_6_2","volume-title":"CVPR","author":"Chen Shizhe","year":"2020","unstructured":"Shizhe Chen, Yida Zhao, Qin Jin, and Qi Wu. 2020. Fine-grained video-text retrieval with hierarchical graph reasoning. In CVPR."},{"key":"e_1_3_1_7_2","article-title":"Empirical evaluation of gated recurrent neural networks on sequence modeling","volume":"1412","author":"Chung Junyoung","year":"2014","unstructured":"Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs\/1412.3555 (2014).","journal-title":"CoRR"},{"key":"e_1_3_1_8_2","first-page":"2179","volume-title":"CVPR","author":"Dey Sounak","year":"2019","unstructured":"Sounak Dey, Pau Riba, Anjan Dutta, Josep Llados, and Yi-Zhe Song. 2019. Doodle to search: Practical zero-shot sketch-based image retrieval. In CVPR. 2179\u20132188."},{"key":"e_1_3_1_9_2","first-page":"9346","volume-title":"CVPR","author":"Dong Jianfeng","year":"2019","unstructured":"Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, and Xun Wang. 2019. Dual encoding for zero-example video retrieval. In CVPR. 9346\u20139355."},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1007\/s13735-018-0151-5"},{"key":"e_1_3_1_11_2","first-page":"5089","volume-title":"CVPR","author":"Dutta Anjan","year":"2019","unstructured":"Anjan Dutta and Zeynep Akata. 2019. Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval. In CVPR. 5089\u20135098."},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3390891"},{"key":"e_1_3_1_13_2","first-page":"1","volume-title":"ICCV","author":"Ferecatu Marin","year":"2007","unstructured":"Marin Ferecatu and Donald Geman. 2007. Interactive search for image categories by mental matching. In ICCV. 1\u20138."},{"key":"e_1_3_1_14_2","first-page":"2672","volume-title":"NIPS","author":"Goodfellow Ian","year":"2014","unstructured":"Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NIPS. 2672\u20132680."},{"key":"e_1_3_1_15_2","first-page":"7181","volume-title":"CVPR","author":"Gu Jiuxiang","year":"2018","unstructured":"Jiuxiang Gu, Jianfei Cai, Shafiq R. Joty, Li Niu, and Gang Wang. 2018. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In CVPR. 7181\u20137189."},{"key":"e_1_3_1_16_2","first-page":"678","volume-title":"NIPS","author":"Guo Xiaoxiao","year":"2018","unstructured":"Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, and Rogerio Feris. 2018. Dialog-based interactive image retrieval. In NIPS. 678\u2013688."},{"key":"e_1_3_1_17_2","first-page":"1463","volume-title":"ICCV","author":"Han Xintong","year":"2017","unstructured":"Xintong Han, Zuxuan Wu, Phoenix X. Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, and Larry S. Davis. 2017. Automatic spatially aware fashion concept discovery. In ICCV. 1463\u20131471."},{"key":"e_1_3_1_18_2","first-page":"770","article-title":"Deep residual learning for image recognition","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.770\u2013778.","journal-title":"I"},{"key":"e_1_3_1_19_2","first-page":"10753","volume-title":"CVPR","author":"Heim Eric","year":"2019","unstructured":"Eric Heim. 2019. Constrained generative adversarial networks for interactive image generation. In CVPR. 10753\u201310761."},{"key":"e_1_3_1_20_2","article-title":"MobileNets: Efficient convolutional neural networks for mobile vision applications","volume":"1704","author":"Howard Andrew G.","year":"2017","unstructured":"Andrew G. Howard, Menglong Zhu, Bo Chen, D. Kalenichenko, Weijun Wang, Tobias Weyand, M. Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. ArXiv abs\/1704.04861 (2017).","journal-title":"ArXiv"},{"key":"e_1_3_1_21_2","first-page":"1383","volume-title":"CVPR","author":"Isola Phillip","year":"2015","unstructured":"Phillip Isola, Joseph J. Lim, and Edward H. Adelson. 2015. Discovering states and transformations in image collections. In CVPR. 1383\u20131391."},{"key":"e_1_3_1_22_2","first-page":"69","volume-title":"ACM MM","author":"Jiang Xinyang","year":"2015","unstructured":"Xinyang Jiang, Fei Wu, Xi Li, Zhou Zhao, Weiming Lu, Siliang Tang, and Yueting Zhuang. 2015. Deep compositional cross-modal learning to rank via local-global alignment. In ACM MM. 69\u201378."},{"key":"e_1_3_1_23_2","first-page":"2901","volume-title":"CVPR","author":"Johnson Justin","year":"2017","unstructured":"Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR. 2901\u20132910."},{"key":"e_1_3_1_24_2","first-page":"1889","volume-title":"NIPS","author":"Karpathy Andrej","year":"2014","unstructured":"Andrej Karpathy, Armand Joulin, and Li F. Fei-Fei. 2014. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS. 1889\u20131897."},{"key":"e_1_3_1_25_2","first-page":"4401","volume-title":"CVPR","author":"Karras Tero","year":"2019","unstructured":"Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In CVPR. 4401\u20134410."},{"key":"e_1_3_1_26_2","first-page":"1771","volume-title":"AAAI","author":"Kim Jongseok","year":"2021","unstructured":"Jongseok Kim, Young-Sun Yu, Hoeseong Kim, and Gunhee Kim. 2021. Dual compositional learning in interactive image retrieval. In AAAI. 1771\u20131779."},{"key":"e_1_3_1_27_2","first-page":"361","volume-title":"NIPS","author":"Kim Jin-Hwa","year":"2016","unstructured":"Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. 2016. Multimodal residual learning for visual QA. In NIPS. 361\u2013369."},{"key":"e_1_3_1_28_2","first-page":"1","volume-title":"ICLR","author":"Kingma Diederik P.","year":"2015","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR. 1\u201315."},{"key":"e_1_3_1_29_2","first-page":"5041","volume-title":"CVPR","author":"Klein Benjamin","year":"2019","unstructured":"Benjamin Klein and Lior Wolf. 2019. End-to-end supervised product quantization for image search and retrieval. In CVPR. 5041\u20135050."},{"key":"e_1_3_1_30_2","first-page":"2973","volume-title":"CVPR","author":"Kovashka Adriana","year":"2012","unstructured":"Adriana Kovashka, Devi Parikh, and Kristen Grauman. 2012. WhittleSearch: Image search with relative attribute feedback. In CVPR. 2973\u20132980."},{"key":"e_1_3_1_31_2","first-page":"7567","volume-title":"ICCV","author":"Lao Qicheng","year":"2019","unstructured":"Qicheng Lao, Mohammad Havaei, Ahmad Pesaranghader, Francis Dutil, Lisa Di Jorio, and Thomas Fevens. 2019. Dual adversarial inference for text-to-image synthesis. In ICCV. 7567\u20137576."},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2912714"},{"key":"e_1_3_1_33_2","first-page":"4654","volume-title":"ICCV","author":"Li Kunpeng","year":"2019","unstructured":"Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. 2019. Visual semantic reasoning for image-text matching. In ICCV. 4654\u20134662."},{"key":"e_1_3_1_34_2","first-page":"665","volume-title":"ACM MM","author":"Liu Jiawei","year":"2019","unstructured":"Jiawei Liu, Zheng-Jun Zha, Richang Hong, Meng Wang, and Yongdong Zhang. 2019. Deep adversarial graph attention convolution network for text-based person search. In ACM MM. 665\u2013673."},{"key":"e_1_3_1_35_2","article-title":"MTFH: A matrix tri-factorization hashing framework for efficient cross-modal retrieval","author":"Liu Xin","year":"2021","unstructured":"Xin Liu, Zhikai Hu, Haibin Ling, and Yiu-ming Cheung. 2021. MTFH: A matrix tri-factorization hashing framework for efficient cross-modal retrieval. IEEE Trans. Pattern Anal. Machine Intell. 43, 3 (2021), 964\u2013981.","journal-title":"IEEE Trans. Pattern Anal. Machine Intell."},{"key":"e_1_3_1_36_2","first-page":"1429","volume-title":"CVPR","author":"Mao Qi","year":"2019","unstructured":"Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and Ming-Hsuan Yang. 2019. Mode seeking generative adversarial networks for diverse image synthesis. In CVPR. 1429\u20131437."},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2016.2639382"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2020.2974326"},{"key":"e_1_3_1_39_2","article-title":"Conditional generative adversarial nets","volume":"1411","author":"Mirza Mehdi","year":"2014","unstructured":"Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. CoRR abs\/1411.1784 (2014).","journal-title":"CoRR"},{"key":"e_1_3_1_40_2","first-page":"6429","volume-title":"CVPR","author":"Murrugarra-Llerena Nils","year":"2019","unstructured":"Nils Murrugarra-Llerena and Adriana Kovashka. 2019. Cross-modality personalization for retrieval. In CVPR. 6429\u20136438."},{"key":"e_1_3_1_41_2","first-page":"169","volume-title":"ECCV","author":"Nagarajan Tushar","year":"2018","unstructured":"Tushar Nagarajan and Kristen Grauman. 2018. Attributes as operators: Factorizing unseen attribute-object compositions. In ECCV. 169\u2013185."},{"key":"e_1_3_1_42_2","first-page":"4467","volume-title":"CVPR","author":"Nguyen Anh","year":"2017","unstructured":"Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. 2017. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR. 4467\u20134477."},{"key":"e_1_3_1_43_2","first-page":"30","volume-title":"CVPR","author":"Noh Hyeonwoo","year":"2016","unstructured":"Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. 2016. Image question answering using convolutional neural network with dynamic parameter prediction. In CVPR. 30\u201338."},{"key":"e_1_3_1_44_2","first-page":"1","volume-title":"BMVC","author":"Pang Kaiyue","year":"2017","unstructured":"Kaiyue Pang, Yi-Zhe Song, Tony Xiang, and Timothy M Hospedales. 2017. Cross-domain generative learning for fine-grained sketch-based image retrieval. In BMVC. 1\u201312."},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3284750"},{"key":"e_1_3_1_46_2","first-page":"3942","volume-title":"AAAI","author":"Perez Ethan","year":"2018","unstructured":"Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. 2018. FiLM: Visual reasoning with a general conditioning layer. In AAAI. 3942\u20133951."},{"key":"e_1_3_1_47_2","first-page":"1505","volume-title":"CVPR","author":"Qiao Tingting","year":"2019","unstructured":"Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. 2019. MirrorGAN: Learning text-to-image generation by redescription. In CVPR. 1505\u20131514."},{"key":"e_1_3_1_48_2","first-page":"1060","volume-title":"ICML","author":"Reed Scott","year":"2016","unstructured":"Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text-to-image synthesis. In ICML. 1060\u20131069."},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_1_50_2","first-page":"4967","volume-title":"NIPS","author":"Santoro Adam","year":"2017","unstructured":"Adam Santoro, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. 2017. A simple neural network module for relational reasoning. In NIPS. 4967\u20134976."},{"key":"e_1_3_1_51_2","first-page":"5813","volume-title":"ICCV","author":"Sarafianos Nikolaos","year":"2019","unstructured":"Nikolaos Sarafianos, Xiang Xu, and Ioannis A. Kakadiaris. 2019. Adversarial representation learning for text-to-image matching. In ICCV. 5813\u20135823."},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/78.650093"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2018.2877122"},{"key":"e_1_3_1_54_2","first-page":"1979","volume-title":"CVPR","author":"Song Yale","year":"2019","unstructured":"Yale Song and Mohammad Soleymani. 2019. Polysemous visual-semantic embedding for cross-modal retrieval. In CVPR. 1979\u20131988."},{"key":"e_1_3_1_55_2","first-page":"2818","volume-title":"CVPR","author":"Szegedy Christian","year":"2016","unstructured":"Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In CVPR. 2818\u20132826."},{"key":"e_1_3_1_56_2","first-page":"10501","volume-title":"ICCV","author":"Tan Hongchen","year":"2019","unstructured":"Hongchen Tan, Xiuping Liu, Xin Li, Yi Zhang, and Baocai Yin. 2019. Semantics-enhanced adversarial nets for text-to-image synthesis. In ICCV. 10501\u201310510."},{"key":"e_1_3_1_57_2","first-page":"2579","article-title":"Visualizing data using t-sne","volume":"9","author":"Maaten Laurens Van der","year":"2008","unstructured":"Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. J. Mach. Learn. Res. 9, Nov. (2008), 2579\u20132605.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_3_1_58_2","first-page":"3156","volume-title":"CVPR","author":"Vinyals Oriol","year":"2015","unstructured":"Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In CVPR. 3156\u20133164."},{"key":"e_1_3_1_59_2","first-page":"6439","volume-title":"CVPR","author":"Vo Nam","year":"2019","unstructured":"Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. 2019. Composing text and image for image retrieval-an empirical odyssey. In CVPR. 6439\u20136448."},{"key":"e_1_3_1_60_2","first-page":"154","volume-title":"ACM MM","author":"Wang Bokun","year":"2017","unstructured":"Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, and Heng Tao Shen. 2017. Adversarial cross-modal retrieval. In ACM MM. 154\u2013162."},{"key":"e_1_3_1_61_2","first-page":"11572","volume-title":"CVPR","author":"Wang Hao","year":"2019","unstructured":"Hao Wang, Doyen Sahoo, Chenghao Liu, Ee-peng Lim, and Steven C. H. Hoi. 2019. Learning cross-modal embeddings with adversarial networks for cooking recipes and food images. In CVPR. 11572\u201311581."},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2797921"},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.2967584"},{"key":"e_1_3_1_64_2","first-page":"4269","volume-title":"CVPR","author":"Wu Yiling","year":"2017","unstructured":"Yiling Wu, Shuhui Wang, and Qingming Huang. 2017. Online asymmetric similarity learning for cross-modal retrieval. In CVPR. 4269\u20134278."},{"key":"e_1_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2942494"},{"key":"e_1_3_1_66_2","first-page":"1316","volume-title":"CVPR","author":"Xu Tao","year":"2018","unstructured":"Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. 2018. AttnGAN: Fine-grained text-to-image generation with attentional generative adversarial networks. In CVPR. 1316\u20131324."},{"key":"e_1_3_1_67_2","first-page":"3441","volume-title":"CVPR","author":"Yan Fei","year":"2015","unstructured":"Fei Yan and Krystian Mikolajczyk. 2015. Deep correlation for matching images and text. In CVPR. 3441\u20133450."},{"key":"e_1_3_1_68_2","first-page":"2327","volume-title":"CVPR","author":"Yin Guojun","year":"2019","unstructured":"Guojun Yin, Bin Liu, Lu Sheng, Nenghai Yu, Xiaogang Wang, and Jing Shao. 2019. Semantics disentangling for text-to-image generation. In CVPR. 2327\u20132336."},{"key":"e_1_3_1_69_2","first-page":"1247","volume-title":"CVPR","author":"Zhang Da","year":"2019","unstructured":"Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, and Larry S. Davis. 2019. MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In CVPR. 1247\u20131257."},{"key":"e_1_3_1_70_2","first-page":"3367","volume-title":"ACM MM","author":"Zhang Feifei","year":"2020","unstructured":"Feifei Zhang, Mingliang Xu, Qirong Mao, and Changsheng Xu. 2020. Joint attribute manipulation and modality alignment learning for composing text and image to image retrieval. In ACM MM. 3367\u20133376."},{"key":"e_1_3_1_71_2","first-page":"5907","volume-title":"ICCV","author":"Zhang Han","year":"2017","unstructured":"Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. 2017. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV. 5907\u20135915."},{"key":"e_1_3_1_72_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2856256"},{"key":"e_1_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2922128"},{"key":"e_1_3_1_74_2","first-page":"297","volume-title":"ECCV","author":"Zhang Jingyi","year":"2018","unstructured":"Jingyi Zhang, Fumin Shen, Li Liu, Fan Zhu, Mengyang Yu, Ling Shao, Heng Tao Shen, and Luc Van Gool. 2018. Generative domain-migration hashing for sketch-to-image retrieval. In ECCV. 297\u2013314."},{"key":"e_1_3_1_75_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2931352"},{"key":"e_1_3_1_76_2","first-page":"1520","volume-title":"CVPR","author":"Zhao Bo","year":"2017","unstructured":"Bo Zhao, Jiashi Feng, Xiao Wu, and Shuicheng Yan. 2017. Memory-augmented attribute manipulation networks for interactive fashion search. In CVPR. 1520\u20131528."},{"key":"e_1_3_1_77_2","first-page":"11394","volume-title":"CVPR","author":"Zhen Liangli","year":"2019","unstructured":"Liangli Zhen, Peng Hu, Xu Wang, and Dezhong Peng. 2019. Deep supervised cross-modal retrieval. In CVPR. 11394\u201310403."},{"key":"e_1_3_1_78_2","first-page":"11477","volume-title":"CVPR","author":"Zhu Bin","year":"2019","unstructured":"Bin Zhu, Chong-Wah Ngo, Jingjing Chen, and Yanbin Hao. 2019. R2GAN: Cross-modal recipe retrieval with generative adversarial network. In CVPR. 11477\u201311486."},{"key":"e_1_3_1_79_2","first-page":"8743","article-title":"ActBERT: Learning global-local video-text representations","author":"Zhu Linchao","year":"2020","unstructured":"Linchao Zhu and Y. Yang. 2020. ActBERT: Learning global-local video-text representations. CVPR. 8743\u20138752.","journal-title":"CVPR"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3478642","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3478642","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:11:41Z","timestamp":1750191101000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3478642"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,3,4]]},"references-count":78,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,5,31]]}},"alternative-id":["10.1145\/3478642"],"URL":"https:\/\/doi.org\/10.1145\/3478642","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,3,4]]},"assertion":[{"value":"2020-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-03-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}