{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,7]],"date-time":"2026-05-07T16:07:52Z","timestamp":1778170072886,"version":"3.51.4"},"reference-count":80,"publisher":"Springer Science and Business Media LLC","issue":"28","license":[{"start":{"date-parts":[[2023,10,6]],"date-time":"2023-10-06T00:00:00Z","timestamp":1696550400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,10,6]],"date-time":"2023-10-06T00:00:00Z","timestamp":1696550400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100008205","name":"Auckland University of Technology","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100008205","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Comput &amp; Applic"],"published-print":{"date-parts":[[2024,10]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Recently, diffusion models have been proven to perform remarkably well in text-to-image synthesis tasks in a number of studies, immediately presenting new study opportunities for image generation. Google\u2019s Imagen follows this research trend and outperforms DALLE2 as the best model for text-to-image generation. However, Imagen merely uses a T5 language model for text processing, which cannot ensure learning the semantic information of the text. Furthermore, the Efficient UNet leveraged by Imagen is not the best choice in image processing. To address these issues, we propose the Swinv2-Imagen, a novel text-to-image diffusion model based on a Hierarchical Visual Transformer and a Scene Graph incorporating a semantic layout. In the proposed model, the feature vectors of entities and relationships are extracted and involved in the diffusion model, effectively improving the quality of generated images. On top of that, we also introduce a Swin-Transformer-based UNet architecture, called Swinv2-Unet, which can address the problems stemming from the CNN convolution operations. Extensive experiments are conducted to evaluate the performance of the proposed model by using three real-world datasets, i.e. MSCOCO, CUB and MM-CelebA-HQ. The experimental results show that the proposed Swinv2-Imagen model outperforms several popular state-of-the-art methods.<\/jats:p>","DOI":"10.1007\/s00521-023-09021-x","type":"journal-article","created":{"date-parts":[[2023,10,6]],"date-time":"2023-10-06T16:01:35Z","timestamp":1696608095000},"page":"17245-17260","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":16,"title":["Swinv2-Imagen: hierarchical vision transformer diffusion models for text-to-image generation"],"prefix":"10.1007","volume":"36","author":[{"given":"Ruijun","family":"Li","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9215-4979","authenticated-orcid":false,"given":"Weihua","family":"Li","sequence":"additional","affiliation":[]},{"given":"Yi","family":"Yang","sequence":"additional","affiliation":[]},{"given":"Hanyu","family":"Wei","sequence":"additional","affiliation":[]},{"given":"Jianhua","family":"Jiang","sequence":"additional","affiliation":[]},{"given":"Quan","family":"Bai","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,10,6]]},"reference":[{"key":"9021_CR1","doi-asserted-by":"publisher","first-page":"153113","DOI":"10.1109\/ACCESS.2020.3017881","volume":"8","author":"D Kim","year":"2020","unstructured":"Kim D, Joo D, Kim J (2020) Tivgan: text to image to video generation with step-by-step evolutionary generator. IEEE Access 8:153113\u2013153122","journal-title":"IEEE Access"},{"issue":"12","key":"9021_CR2","doi-asserted-by":"publisher","first-page":"3075","DOI":"10.1109\/TMM.2020.2972856","volume":"22","author":"R Li","year":"2020","unstructured":"Li R, Wang N, Feng F, Zhang G, Wang X (2020) Exploring global and local linguistic representations for text-to-image synthesis. IEEE Trans Multimed 22(12):3075\u20133087","journal-title":"IEEE Trans Multimed"},{"key":"9021_CR3","doi-asserted-by":"crossref","unstructured":"Mathesul S, Bhutkar G, Rambhad A (2021) Attngan: realistic text-to-image synthesis with attentional generative adversarial networks. In: IFIP conference on human-computer interaction, pp 397\u2013403. Springer","DOI":"10.1007\/978-3-030-98388-8_35"},{"key":"9021_CR4","unstructured":"Park DH, Azadi S, Liu X, Darrell T, Rohrbach A (2021) Benchmark for compositional text-to-image synthesis. In: NeurIPS datasets and benchmarks"},{"key":"9021_CR5","unstructured":"Ramesh A, Dhariwal P, Nichol A, Chu C, Chen M (2022) Hierarchical text-conditional image generation with clip latents. ArXiv arXiv:2204.06125"},{"key":"9021_CR6","doi-asserted-by":"crossref","unstructured":"Saharia C, Chan W, Saxena S, Li L, Whang J, Denton EL, Ghasemipour SKS, Ayan BK, Mahdavi SS, Lopes RG, Salimans T, Ho J, Fleet DJ, Norouzi M (2022) Photorealistic text-to-image diffusion models with deep language understanding. ArXiv arXiv:2205.11487","DOI":"10.1145\/3528233.3530757"},{"key":"9021_CR7","unstructured":"Raffel C, Shazeer NM, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu PJ (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv arXiv:1910.10683"},{"key":"9021_CR8","doi-asserted-by":"crossref","unstructured":"Li W, Zhang P, Zhang L, Huang Q, He X, Lyu S, Gao J (2019) Object-driven text-to-image synthesis via adversarial training. In: 2019 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 12166\u201312174","DOI":"10.1109\/CVPR.2019.01245"},{"key":"9021_CR9","doi-asserted-by":"crossref","unstructured":"Ganar AN, Gode C, Jambhulkar SM (2014) Enhancement of image retrieval by using colour, texture and shape features. In: 2014 International Conference on Electronic Systems, Signal Processing and Computing Technologies, pp. 251\u2013255. IEEE","DOI":"10.1109\/ICESC.2014.48"},{"key":"9021_CR10","unstructured":"Kauderer-Abrams E (2017) Quantifying translation-invariance in convolutional neural networks. arXiv preprint arXiv:1801.01450"},{"key":"9021_CR11","unstructured":"Chidester B, Do MN, Ma J (2018) Rotation equivariance and invariance in convolutional neural networks. arXiv preprint arXiv:1805.12301"},{"issue":"11","key":"9021_CR12","doi-asserted-by":"publisher","first-page":"3212","DOI":"10.1109\/TNNLS.2018.2876865","volume":"30","author":"Z-Q Zhao","year":"2019","unstructured":"Zhao Z-Q, Zheng P, Xu S-T, Wu X (2019) Object detection with deep learning: a review. IEEE Trans Neural Netw Learn Syst 30(11):3212\u20133232","journal-title":"IEEE Trans Neural Netw Learn Syst"},{"key":"9021_CR13","unstructured":"Li J, Yan Y, Liao S, Yang X, Shao L (2021) Local-to-global self-attention in vision transformers. arXiv preprint arXiv:2107.04735"},{"key":"9021_CR14","unstructured":"Liang C, Wang W, Zhou T, Miao J, Luo Y, Yang Y (2022) Local-global context aware transformer for language-guided video segmentation. arXiv preprint arXiv:2203.09773"},{"key":"9021_CR15","doi-asserted-by":"crossref","unstructured":"Johnson J, Gupta A, Fei-Fei L (2018) Image generation from scene graphs. 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 1219\u20131228","DOI":"10.1109\/CVPR.2018.00133"},{"key":"9021_CR16","doi-asserted-by":"crossref","unstructured":"Liu Z, Hu H, Lin Y, Yao Z, Xie Z, Wei Y, Ning J, Cao Y, Zhang Z, Dong L, Wei F, Guo B (2022) Swin transformer v2: Scaling up capacity and resolution. In: 2022 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), 11999\u201312009","DOI":"10.1109\/CVPR52688.2022.01170"},{"key":"9021_CR17","doi-asserted-by":"crossref","unstructured":"Zhu M, Pan P, Chen W, Yang Y (2019) Dm-gan: dynamic memory generative adversarial networks for text-to-image synthesis. In: 2019 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), 5795\u20135803","DOI":"10.1109\/CVPR.2019.00595"},{"key":"9021_CR18","doi-asserted-by":"crossref","unstructured":"Zhu B, Ngo C-W (2020) Cookgan: Causality based text-to-image synthesis. 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5518\u20135526","DOI":"10.1109\/CVPR42600.2020.00556"},{"key":"9021_CR19","doi-asserted-by":"publisher","first-page":"1947","DOI":"10.1109\/TPAMI.2018.2856256","volume":"41","author":"H Zhang","year":"2019","unstructured":"Zhang H, Xu T, Li H, Zhang S, Wang X, Huang X, Metaxas DN (2019) Stackgan++: realistic image synthesis with stacked generative adversarial networks. IEEE Trans Pattern Anal Mach Intell 41:1947\u20131962","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"9021_CR20","doi-asserted-by":"crossref","unstructured":"Xia W, Yang Y, Xue J, Wu B (2021) Tedigan: text-guided diverse face image generation and manipulation. In: 2021 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 2256\u20132265","DOI":"10.1109\/CVPR46437.2021.00229"},{"key":"9021_CR21","doi-asserted-by":"crossref","unstructured":"Crowson K, Biderman SR, Kornis D, Stander D, Hallahan E, Castricato L, Raff E (2022) Vqgan-clip: open domain image generation and editing with natural language guidance. ArXiv arXiv:2204.08583","DOI":"10.1007\/978-3-031-19836-6_6"},{"key":"9021_CR22","doi-asserted-by":"crossref","unstructured":"Cheng J, Wu F, Tian Y, Wang L, Tao D (2020) Rifegan: rich feature generation for text-to-image synthesis from prior knowledge. In: IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 10908\u201310917","DOI":"10.1109\/CVPR42600.2020.01092"},{"key":"9021_CR23","unstructured":"Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. ArXiv arXiv:2006.11239"},{"key":"9021_CR24","first-page":"47","volume":"23","author":"J Ho","year":"2022","unstructured":"Ho J, Saharia C, Chan W, Fleet DJ, Norouzi M, Salimans T (2022) Cascaded diffusion models for high fidelity image generation. J Mach Learn Res 23:47\u201314733","journal-title":"J Mach Learn Res"},{"key":"9021_CR25","unstructured":"Nichol A, Dhariwal P, Ramesh A, Shyam P, Mishkin P, McGrew B, Sutskever I, Chen M (2022) Glide: towards photorealistic image generation and editing with text-guided diffusion models. In: ICML"},{"key":"9021_CR26","doi-asserted-by":"crossref","unstructured":"Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B (2022) High-resolution image synthesis with latent diffusion models. In: 2022 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 10674\u201310685","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"9021_CR27","unstructured":"Song J, Meng C, Ermon S (2021) Denoising diffusion implicit models. ArXiv arXiv:2010.02502"},{"key":"9021_CR28","unstructured":"Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. ArXiv arXiv:2105.05233"},{"key":"9021_CR29","unstructured":"Yang L, Zhang Z, Hong S, Xu R, Zhao Y, Shao Y, Zhang W, Yang M-H, Cui B (2022) Diffusion models: a comprehensive survey of methods and applications. ArXiv arXiv:2209.00796"},{"key":"9021_CR30","unstructured":"Cao HK, Tan C, Gao Z, Chen G, Heng P-A, Li SZ (2022) A survey on generative diffusion model. ArXiv arXiv:2209.02646"},{"key":"9021_CR31","unstructured":"Mittal G, Agrawal S, Agarwal A, Mehta S, Marwah T (2019) Interactive image generation using scene graphs. arXiv preprint arXiv:1905.03743"},{"key":"9021_CR32","unstructured":"Zhu G, Zhang L, Jiang Y, Dang Y, Hou H, Shen P, Feng M, Zhao X, Miao Q, Shah SAA (2022) Bennamoun: scene graph generation: a comprehensive survey. ArXiv arXiv:2201.00443"},{"key":"9021_CR33","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/TPAMI.2021.3137605","volume":"45","author":"X Chang","year":"2021","unstructured":"Chang X, Ren P, Xu P, Li Z, Chen X, Hauptmann AG (2021) A comprehensive survey of scene graphs: generation and application. IEEE Trans Pattern Anal Mach Intell 45:1\u201326","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"9021_CR34","doi-asserted-by":"crossref","unstructured":"Johnson J, Krishna R, Stark M, Li L-J, Shamma DA, Bernstein MS, Fei-Fei L (2015) Image retrieval using scene graphs. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR), pp 3668\u20133678","DOI":"10.1109\/CVPR.2015.7298990"},{"key":"9021_CR35","doi-asserted-by":"crossref","unstructured":"Schuster S, Krishna R, Chang AX, Fei-Fei L, Manning CD (2015) Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In: VL@EMNLP","DOI":"10.18653\/v1\/W15-2812"},{"key":"9021_CR36","doi-asserted-by":"publisher","first-page":"137","DOI":"10.1007\/s10462-020-09854-1","volume":"54","author":"SA Taghanaki","year":"2020","unstructured":"Taghanaki SA, Abhishek K, Cohen JP, Cohen-Adad J, Hamarneh G (2020) Deep semantic segmentation of natural and medical images: a review. Artif Intell Rev 54:137\u2013178","journal-title":"Artif Intell Rev"},{"key":"9021_CR37","doi-asserted-by":"crossref","unstructured":"Jaritz M, Vu T-H, de Charette R, Wirbel \u00c9, P\u00e9rez P (2020) xmuda: cross-modal unsupervised domain adaptation for 3d semantic segmentation. In: 2020 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), 12602\u201312611","DOI":"10.1109\/CVPR42600.2020.01262"},{"key":"9021_CR38","doi-asserted-by":"crossref","unstructured":"Li L, Gan Z, Cheng Y, Liu J (2019) Relation-aware graph attention network for visual question answering. In: 2019 IEEE\/CVF international conference on computer vision (ICCV), pp 10312\u201310321","DOI":"10.1109\/ICCV.2019.01041"},{"key":"9021_CR39","doi-asserted-by":"crossref","unstructured":"Gao L, Wang B, Wang W (2018) Image captioning with scene-graph based semantic concepts. In: Proceedings of the 2018 10th international conference on machine learning and computing","DOI":"10.1145\/3195106.3195114"},{"key":"9021_CR40","doi-asserted-by":"crossref","unstructured":"Yang X, Tang K, Zhang H Cai J (2019) Auto-encoding scene graphs for image captioning. In: 2019 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 10677\u201310686","DOI":"10.1109\/CVPR.2019.01094"},{"key":"9021_CR41","doi-asserted-by":"crossref","unstructured":"Zhong Y, Wang L, Chen J, Yu D, Li Y (2020) Comprehensive image captioning via scene graph decomposition. ArXiv arXiv:2007.11731","DOI":"10.1007\/978-3-030-58568-6_13"},{"key":"9021_CR42","doi-asserted-by":"crossref","unstructured":"Gu J, Joty SR, Cai J, Zhao H, Yang X, Wang G (2019) Unpaired image captioning via scene graph alignments. In: 2019 IEEE\/CVF international conference on computer vision (ICCV), 10322\u201310331","DOI":"10.1109\/ICCV.2019.01042"},{"key":"9021_CR43","unstructured":"Li Y, Ma T, Bai Y, Duan N, Wei S, Wang X (2019) Pastegan: a semi-parametric method to generate image from scene graph. Adv Neural Inf Process Syst 32"},{"key":"9021_CR44","doi-asserted-by":"crossref","unstructured":"Zhao B, Meng L, Yin W, Sigal L (2019) Image generation from layout. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 8584\u20138593","DOI":"10.1109\/CVPR.2019.00878"},{"key":"9021_CR45","doi-asserted-by":"crossref","unstructured":"Li Y, Yang X, Xu C (2022) Dynamic scene graph generation via anticipatory pre-training. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 13874\u201313883","DOI":"10.1109\/CVPR52688.2022.01350"},{"key":"9021_CR46","doi-asserted-by":"crossref","unstructured":"Hamilton WL (2020) Graph representation learning. Synthesis lectures on artificial intelligence and machine learning","DOI":"10.1007\/978-3-031-01588-5"},{"key":"9021_CR47","doi-asserted-by":"crossref","unstructured":"Grover A, Leskovec J (2016) node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining","DOI":"10.1145\/2939672.2939754"},{"key":"9021_CR48","unstructured":"Mikolov T, Chen K, Corrado GS, Dean J (2013) Efficient estimation of word representations in vector space. In: ICLR"},{"key":"9021_CR49","doi-asserted-by":"crossref","unstructured":"Chen F, Wang YC, Wang B, Kuo C-CJ (2020) Graph representation learning: a survey. APSIPA Trans Signal Inf Process 9","DOI":"10.1017\/ATSIP.2020.13"},{"key":"9021_CR50","unstructured":"Hamilton WL, Ying R, Leskovec J (2017) Representation learning on graphs: methods and applications. ArXiv arXiv:1709.05584"},{"key":"9021_CR51","doi-asserted-by":"crossref","unstructured":"Chen J, Ye G, Zhao Y, Liu S, Deng L, Chen X, Zhou R, Zheng K (2022) Efficient join order selection learning with graph-based representation. In: Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pp 97\u2013107","DOI":"10.1145\/3534678.3539303"},{"key":"9021_CR52","unstructured":"Park J, Song J, Yang E (2021) Graphens: Neighbor-aware ego network synthesis for class-imbalanced node classification. In: International conference on learning representations"},{"key":"9021_CR53","doi-asserted-by":"publisher","DOI":"10.1016\/j.media.2021.102272","volume":"75","author":"M Ghorbani","year":"2022","unstructured":"Ghorbani M, Kazi A, Baghshah MS, Rabiee HR, Navab N (2022) Ra-gcn: graph convolutional network for disease prediction problems with imbalanced data. Med Image Anal 75:102272","journal-title":"Med Image Anal"},{"key":"9021_CR54","doi-asserted-by":"crossref","unstructured":"Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. ArXiv arXiv:1505.04597","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"9021_CR55","unstructured":"Shelhamer E, Long J, Darrell T (2015) Fully convolutional networks for semantic segmentation. 2015 IEEE conference on computer vision and pattern recognition (CVPR), pp 3431\u20133440"},{"key":"9021_CR56","doi-asserted-by":"crossref","unstructured":"Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J (2018) Unet++: a nested u-net architecture for medical image segmentation. Deep learning in medical image analysis and multimodal learning for clinical decision support : 4th international workshop, DLMIA 2018, and 8th International workshop, ML-CDS 2018, held in conjunction with MICCAI 2018, Granada, Spain, S... 11045, 3\u201311","DOI":"10.1007\/978-3-030-00889-5_1"},{"key":"9021_CR57","doi-asserted-by":"publisher","first-page":"1856","DOI":"10.1109\/TMI.2019.2959609","volume":"39","author":"Z Zhou","year":"2020","unstructured":"Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J (2020) Unet++: redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans Med Imaging 39:1856\u20131867","journal-title":"IEEE Trans Med Imaging"},{"key":"9021_CR58","doi-asserted-by":"crossref","unstructured":"Huang H, Lin L, Tong R, Hu H, Zhang Q, Iwamoto Y, Han X, Chen Y-W, Wu J (2020) Unet 3+: A full-scale connected unet for medical image segmentation. ICASSP 2020 - 2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 1055\u20131059","DOI":"10.1109\/ICASSP40776.2020.9053405"},{"key":"9021_CR59","doi-asserted-by":"publisher","first-page":"749","DOI":"10.1109\/LGRS.2018.2802944","volume":"15","author":"Z Zhang","year":"2018","unstructured":"Zhang Z, Liu Q, Wang Y (2018) Road extraction by deep residual u-net. IEEE Geosci Remote Sens Lett 15:749\u2013753","journal-title":"IEEE Geosci Remote Sens Lett"},{"issue":"6","key":"9021_CR60","doi-asserted-by":"publisher","first-page":"1275","DOI":"10.21037\/qims-19-1090","volume":"10","author":"S Cai","year":"2020","unstructured":"Cai S, Tian Y, Lui H, Zeng H, Wu Y, Chen G (2020) Dense-unet: a novel multiphoton in vivo cellular image segmentation model based on a convolutional neural network. Quant Imaging Med Surg 10(6):1275\u20131285","journal-title":"Quant Imaging Med Surg"},{"key":"9021_CR61","first-page":"74","volume":"121","author":"N Ibtehaz","year":"2020","unstructured":"Ibtehaz N, Rahman MS (2020) Multiresunet: rethinking the u-net architecture for multimodal biomedical image segmentation. Neural Netw Off J Int Neural Netw Soc 121:74\u201387","journal-title":"Neural Netw Off J Int Neural Netw Soc"},{"key":"9021_CR62","doi-asserted-by":"crossref","unstructured":"Alom MZ, Hasan M, Yakopcic C, Taha TM, Asari VK (2018) Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation. ArXiv arXiv:1802.06955","DOI":"10.1109\/NAECON.2018.8556686"},{"key":"9021_CR63","unstructured":"Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M (2021) Swin-unet: Unet-like pure transformer for medical image segmentation. ArXiv arXiv:2105.05537"},{"key":"9021_CR64","unstructured":"Radford A, Narasimhan K (2018) Improving language understanding by generative pre-training"},{"key":"9021_CR65","unstructured":"Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I (2019) Language models are unsupervised multitask learners"},{"key":"9021_CR66","unstructured":"Brock A, Donahue J, Simonyan K (2019) Large scale gan training for high fidelity natural image synthesis. ArXiv arXiv:1809.11096"},{"key":"9021_CR67","unstructured":"Devlin J, Chang M-W, Lee K, Toutanova K (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: NAACL"},{"key":"9021_CR68","doi-asserted-by":"crossref","unstructured":"Luo C, Zhan J, Wang L, Yang Q (2018) Cosine normalization: Using cosine similarity instead of dot product in neural networks. ArXiv arXiv:1702.05870","DOI":"10.1007\/978-3-030-01418-6_38"},{"key":"9021_CR69","doi-asserted-by":"crossref","unstructured":"Cho K, van Merrienboer B, \u00c7aglar G\u00fcl\u00e7ehre Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using rnn encoder\u2013decoder for statistical machine translation. In: EMNLP","DOI":"10.3115\/v1\/D14-1179"},{"key":"9021_CR70","unstructured":"Ho J (2022) Classifier-free diffusion guidance. ArXiv arXiv:2207.12598"},{"key":"9021_CR71","unstructured":"Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: NIPS"},{"key":"9021_CR72","doi-asserted-by":"publisher","first-page":"330","DOI":"10.1016\/j.neucom.2021.03.059","volume":"449","author":"Z Qi","year":"2021","unstructured":"Qi Z, Sun J, Qian J, Xu J, Zhan S (2021) Pccm-gan: photographic text-to-image generation with pyramid contrastive consistency model. Neurocomputing 449:330\u2013341","journal-title":"Neurocomputing"},{"key":"9021_CR73","doi-asserted-by":"crossref","unstructured":"Zhang H, Koh JY, Baldridge J, Lee H, Yang Y (2021) Cross-modal contrastive learning for text-to-image generation. 2021 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 833\u2013842","DOI":"10.1109\/CVPR46437.2021.00089"},{"key":"9021_CR74","unstructured":"Ding M, Yang Z, Hong W, Zheng W, Zhou C, Yin D, Lin J, Zou X, Shao Z, Yang H, Tang J (2021) Cogview: Mastering text-to-image generation via transformers. In: NeurIPS"},{"key":"9021_CR75","doi-asserted-by":"crossref","unstructured":"Zhou Y, Zhang R, Chen C, Li C, Tensmeyer C, Yu T, Gu J, Xu J, Sun T (2022) Towards language-free training for text-to-image generation. 2022 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 17886\u201317896","DOI":"10.1109\/CVPR52688.2022.01738"},{"key":"9021_CR76","unstructured":"Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M, Sutskever I (2021) Zero-shot text-to-image generation. ArXiv arXiv:2102.12092"},{"key":"9021_CR77","doi-asserted-by":"crossref","unstructured":"Gafni O, Polyak A, Ashual O, Sheynin S, Parikh D, Taigman Y (2022) Make-a-scene: scene-based text-to-image generation with human priors. ArXiv arXiv:2203.13131","DOI":"10.1007\/978-3-031-19784-0_6"},{"key":"9021_CR78","unstructured":"Barratt ST, Sharma R (2018) A note on the inception score. ArXiv arXiv:1801.01973"},{"key":"9021_CR79","doi-asserted-by":"crossref","unstructured":"Tao M, Tang H, Wu F, Jing X-Y, Bao B-K, Xu C (2022) Df-gan: A simple and effective baseline for text-to-image synthesis. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp. 16515\u201316525","DOI":"10.1109\/CVPR52688.2022.01602"},{"key":"9021_CR80","doi-asserted-by":"crossref","unstructured":"Gu S, Chen D, Bao J, Wen F, Zhang B, Chen D, Yuan L, Guo B (2022) Vector quantized diffusion model for text-to-image synthesis. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp. 10696\u201310706","DOI":"10.1109\/CVPR52688.2022.01043"}],"container-title":["Neural Computing and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-023-09021-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00521-023-09021-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-023-09021-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,18]],"date-time":"2024-09-18T15:06:54Z","timestamp":1726672014000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00521-023-09021-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,6]]},"references-count":80,"journal-issue":{"issue":"28","published-print":{"date-parts":[[2024,10]]}},"alternative-id":["9021"],"URL":"https:\/\/doi.org\/10.1007\/s00521-023-09021-x","relation":{},"ISSN":["0941-0643","1433-3058"],"issn-type":[{"value":"0941-0643","type":"print"},{"value":"1433-3058","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,10,6]]},"assertion":[{"value":"7 December 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 September 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"6 October 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}