{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T16:18:14Z","timestamp":1775578694774,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":49,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Natural Science Foundation of China award award","award":["No. U19A2073, No. 62103454"],"award-info":[{"award-number":["No. U19A2073, No. 62103454"]}]},{"name":"the Shenzhen Municipal Basic Research Project for Natural Science Foundation award award","award":["No.JCYJ20190806143408992"],"award-info":[{"award-number":["No.JCYJ20190806143408992"]}]},{"name":"Shenzhen Fundamental Research Program award award","award":["No. JCYJ20190807154211365"],"award-info":[{"award-number":["No. JCYJ20190807154211365"]}]},{"name":"National Key R&D Program of China award award","award":["Grant No. 2020AAA0109700"],"award-info":[{"award-number":["Grant No. 
2020AAA0109700"]}]},{"name":"Guangdong Outstanding Youth Fund award award","award":["No.2021B1515020061"],"award-info":[{"award-number":["No.2021B1515020061"]}]},{"name":"Guangdong Province Basic and Applied Basic Research award award","award":["No.2019B1515120039, No.2019A1515110680"],"award-info":[{"award-number":["No.2019B1515120039, No.2019A1515110680"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3548230","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:42:35Z","timestamp":1665416555000},"page":"4525-4535","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":20,"title":["ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal Fashion Design"],"prefix":"10.1145","author":[{"given":"Xujie","family":"Zhang","sequence":"first","affiliation":[{"name":"Shenzhen Campus of Sun Yat-Sen University, shenzhen, China"}]},{"given":"Yu","family":"Sha","sequence":"additional","affiliation":[{"name":"Shenzhen Campus of Sun Yat-Sen University, shenzhen, China"}]},{"given":"Michael C.","family":"Kampffmeyer","sequence":"additional","affiliation":[{"name":"UiT The Arctic University of Norway, Troms\u00f8, Norway"}]},{"given":"Zhenyu","family":"Xie","sequence":"additional","affiliation":[{"name":"Shenzhen Campus of Sun Yat-Sen University,ByteDance, shenzhen, China"}]},{"given":"Zequn","family":"Jie","sequence":"additional","affiliation":[{"name":"Meituan Inc., shenzhen, China"}]},{"given":"Chengwen","family":"Huang","sequence":"additional","affiliation":[{"name":"Shidi Inc., shenzhen, China"}]},{"given":"Jianqing","family":"Peng","sequence":"additional","affiliation":[{"name":"Shenzhen Campus of Sun Yat-Sen University, shenzhen, China"}]},{"given":"Xiaodan","family":"Liang","sequence":"additional","affiliation":[{"name":"Shenzhen Campus of Sun 
Yat-Sen University, Shenzhen, China"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[
{"key":"e_1_3_2_2_1_1","volume-title":"Towards cross-modal organ translation and segmentation: A cycle-and shape-consistent generative adversarial network. Medical image analysis 52","author":"Cai Jinzheng","year":"2019","unstructured":"Jinzheng Cai, Zizhao Zhang, Lei Cui, Yefeng Zheng, and Lin Yang. 2019. Towards cross-modal organ translation and segmentation: A cycle-and shape-consistent generative adversarial network. Medical image analysis 52 (2019), 174--184."},
{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.1986.4767851"},
{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_7"},
{"key":"e_1_3_2_2_4_1","volume-title":"Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems 34","author":"Ding Ming","year":"2021","unstructured":"Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. 2021. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems 34 (2021)."},
{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2017.8296635"},
{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01268"},
{"key":"e_1_3_2_2_7_1","volume-title":"Ronan Le Bras, and Yejin Choi","author":"Hessel Jack","year":"2021","unstructured":"Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. 2021. CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021)."},
{"key":"e_1_3_2_2_8_1","volume-title":"Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30","author":"Heusel Martin","year":"2017","unstructured":"Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)."},
{"key":"e_1_3_2_2_9_1","volume-title":"Semantic object accuracy for generative text-to-image synthesis. arXiv preprint arXiv:1910.13321","author":"Hinz Tobias","year":"2019","unstructured":"Tobias Hinz, Stefan Heinrich, and Stefan Wermter. 2019. Semantic object accuracy for generative text-to-image synthesis. arXiv preprint arXiv:1910.13321 (2019)."},
{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00833"},
{"key":"e_1_3_2_2_11_1","volume-title":"An introduction to image synthesis with generative adversarial nets. arXiv preprint arXiv:1803.04469","author":"Huang He","year":"2018","unstructured":"He Huang, Philip S Yu, and Changhu Wang. 2018. An introduction to image synthesis with generative adversarial nets. arXiv preprint arXiv:1803.04469 (2018)."},
{"key":"e_1_3_2_2_12_1","volume-title":"International Conference on Machine Learning. PMLR, 4904--4916","author":"Jia Chao","year":"2021","unstructured":"Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning. PMLR, 4904--4916."},
{"key":"e_1_3_2_2_13_1","volume-title":"Transgan: Two transformers can make one strong gan. arXiv preprint arXiv:2102.07074 1, 3","author":"Jiang Yifan","year":"2021","unstructured":"Yifan Jiang, Shiyu Chang, and Zhangyang Wang. 2021. Transgan: Two transformers can make one strong gan. arXiv preprint arXiv:2102.07074 1, 3 (2021)."},
{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00453"},
{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP40778.2020.9191228"},
{"key":"e_1_3_2_2_16_1","volume-title":"Fahad Shahbaz Khan, and Mubarak Shah.","author":"Khan Salman","year":"2021","unstructured":"Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2021. Transformers in vision: A survey. arXiv preprint arXiv:2101.01169 (2021)."},
{"key":"e_1_3_2_2_17_1","volume-title":"International Conference on Machine Learning. PMLR, 5583--5594","author":"Kim Wonjae","year":"2021","unstructured":"Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning. PMLR, 5583--5594."},
{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00766"},
{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00790"},
{"key":"e_1_3_2_2_20_1","volume-title":"Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557","author":"Li Liunian Harold","year":"2019","unstructured":"Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)."},
{"key":"e_1_3_2_2_21_1","unstructured":"Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. 2021. Grounded Language-Image Pre-training. arXiv preprint arXiv:2112.03857 (2021)."},
{"key":"e_1_3_2_2_22_1","volume-title":"Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409","author":"Li Wei","year":"2020","unstructured":"Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, and Haifeng Wang. 2020. Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409 (2020)."},
{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01245"},
{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01086"},
{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.124"},
{"key":"e_1_3_2_2_26_1","volume-title":"Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32 (2019)."},
{"key":"e_1_3_2_2_27_1","volume-title":"Neural discrete representation learning. arXiv preprint arXiv:1711.00937","author":"van den Oord Aaron","year":"2017","unstructured":"Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural discrete representation learning. arXiv preprint arXiv:1711.00937 (2017)."},
{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00209"},
{"key":"e_1_3_2_2_29_1","volume-title":"Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al.","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021)."},
{"key":"e_1_3_2_2_30_1","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9."},
{"key":"e_1_3_2_2_31_1","volume-title":"Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092","author":"Ramesh Aditya","year":"2021","unstructured":"Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092 (2021)."},
{"key":"e_1_3_2_2_32_1","volume-title":"Aaron Van den Oord, and Oriol Vinyals","author":"Razavi Ali","year":"2019","unstructured":"Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. 2019. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems 32 (2019)."},
{"key":"e_1_3_2_2_33_1","volume-title":"International Conference on Machine Learning. PMLR, 1060--1069","author":"Reed Scott","year":"2016","unstructured":"Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text to image synthesis. In International Conference on Machine Learning. PMLR, 1060--1069."},
{"key":"e_1_3_2_2_34_1","volume-title":"Improved techniques for training gans. Advances in neural information processing systems 29","author":"Salimans Tim","year":"2016","unstructured":"Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training gans. Advances in neural information processing systems 29 (2016)."},
{"key":"e_1_3_2_2_35_1","volume-title":"Cross-modal Semantic Matching Generative Adversarial Networks for Text-to-Image Synthesis","author":"Tan Hongchen","year":"2021","unstructured":"Hongchen Tan, Xiuping Liu, Baocai Yin, and Xin Li. 2021. Cross-modal Semantic Matching Generative Adversarial Networks for Text-to-Image Synthesis. IEEE Transactions on Multimedia (2021)."},
{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475226"},
{"key":"e_1_3_2_2_37_1","unstructured":"Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. 2019. Detectron2. https:\/\/github.com\/facebookresearch\/detectron2."},
{"key":"e_1_3_2_2_38_1","volume-title":"TediGAN: Text-Guided Diverse Face Image Generation and Manipulation. arXiv preprint arXiv:2012.03308","author":"Xia Weihao","year":"2020","unstructured":"Weihao Xia, Yujiu Yang, Jing-Hao Xue, and Baoyuan Wu. 2020. TediGAN: Text-Guided Diverse Face Image Generation and Manipulation. arXiv preprint arXiv:2012.03308 (2020)."},
{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00143"},
{"key":"e_1_3_2_2_40_1","volume-title":"FILIP: Fine-grained Interactive Language-Image Pre-Training. arXiv:2111.07783 [cs.CV]","author":"Yao Lewei","year":"2021","unstructured":"Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. 2021. FILIP: Fine-grained Interactive Language-Image Pre-Training. arXiv:2111.07783 [cs.CV]"},
{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240559"},
{"key":"e_1_3_2_2_42_1","volume-title":"Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers. arXiv preprint arXiv:2111.03481","author":"Zeng Yanhong","year":"2021","unstructured":"Yanhong Zeng, Huan Yang, Hongyang Chao, Jianbo Wang, and Jianlong Fu. 2021. Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers. arXiv preprint arXiv:2111.03481 (2021)."},
{"key":"e_1_3_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00089"},
{"key":"e_1_3_2_2_44_1","volume-title":"Stackgan: Realistic image synthesis with stacked generative adversarial networks","author":"Zhang Han","year":"2018","unstructured":"Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. 2018. Stackgan: Realistic image synthesis with stacked generative adversarial networks. IEEE transactions on pattern analysis and machine intelligence 41, 8 (2018), 1947--1962."},
{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00553"},
{"key":"e_1_3_2_2_46_1","volume-title":"UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis. arXiv preprint arXiv:2105.14211","author":"Zhang Zhu","year":"2021","unstructured":"Zhu Zhang, Jianxin Ma, Chang Zhou, Rui Men, Zhikang Li, Ming Ding, Jie Tang, Jingren Zhou, and Hongxia Yang. 2021. UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis. arXiv preprint arXiv:2105.14211 (2021)."},
{"key":"e_1_3_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00649"},
{"key":"e_1_3_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00556"},
{"key":"e_1_3_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN52387.2021.9534074"}
],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa, Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548230","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3548230","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:20Z","timestamp":1750186820000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548230"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":49,"alternative-id":["10.1145\/3503161.3548230","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3548230","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}