{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,27]],"date-time":"2026-04-27T23:12:10Z","timestamp":1777331530051,"version":"3.51.4"},"reference-count":48,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2025,1,8]],"date-time":"2025-01-08T00:00:00Z","timestamp":1736294400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>Text-to-image models have demonstrated remarkable progress in generating visual content from textual descriptions. However, the presence of linguistic ambiguity in the text prompts poses a potential challenge to these models, possibly leading to undesired or inaccurate outputs. This work conducts a preliminary study and provides insights into how text-to-image diffusion models resolve linguistic ambiguity through a series of experiments. We investigate a set of prompts that exhibit different types of linguistic ambiguities with different models and the images they generate, focusing on how the models\u2019 interpretations of linguistic ambiguity compare to those of humans. In addition, we present a curated dataset of ambiguous prompts and their corresponding images known as the Visual Linguistic Ambiguity Benchmark (V-LAB) dataset. Furthermore, we report a number of limitations and failure modes caused by linguistic ambiguity in text-to-image models and propose prompt engineering guidelines to minimize the impact of ambiguity. The findings of this exploratory study contribute to the ongoing improvement of text-to-image models and provide valuable insights for future advancements in the field.<\/jats:p>","DOI":"10.3390\/computers14010019","type":"journal-article","created":{"date-parts":[[2025,1,8]],"date-time":"2025-01-08T07:40:13Z","timestamp":1736322013000},"page":"19","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Visualizing Ambiguity: Analyzing Linguistic Ambiguity Resolution in Text-to-Image Models"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1933-3254","authenticated-orcid":false,"given":"Wala","family":"Elsharif","sequence":"first","affiliation":[{"name":"College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6082-7873","authenticated-orcid":false,"given":"Mahmood","family":"Alzubaidi","sequence":"additional","affiliation":[{"name":"College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar"}]},{"given":"James","family":"She","sequence":"additional","affiliation":[{"name":"Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong 999077, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2752-3525","authenticated-orcid":false,"given":"Marco","family":"Agus","sequence":"additional","affiliation":[{"name":"College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar"}]}],"member":"1968","published-online":{"date-parts":[[2025,1,8]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Brack, M., Friedrich, F., Kornmeier, K., Tsaban, L., Schramowski, P., Kersting, K., and Passos, A. (2024, January 16\u201324). Ledits++: Limitless image editing using text-to-image models. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.00846"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Ruiz, N., Li, Y., Jampani, V., Wei, W., Hou, T., Pritch, Y., Wadhwa, N., Rubinstein, M., and Aberman, K. (2024, January 16\u201324). Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.00624"},{"key":"ref_3","first-page":"6840","article-title":"Denoising diffusion probabilistic models","volume":"33","author":"Ho","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_4","unstructured":"Hao, Y., Chi, Z., Dong, L., and Wei, F. (2024). Probabilistic modeling of semantic ambiguity for scene graph generation. Adv. Neural Inf. Process. Syst., 36."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"20212657","DOI":"10.1098\/rspb.2021.2657","article-title":"Long-range sequential dependencies precede complex syntactic production in language acquisition","volume":"289","author":"Sainburg","year":"2022","journal-title":"Proc. R. Soc. B"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Zhang, L., Rao, A., and Agrawala, M. (2023, January 2\u20136). Adding conditional control to text-to-image diffusion models. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00355"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"24412","DOI":"10.1109\/ACCESS.2024.3365043","article-title":"Text-to-Image Synthesis With Generative Models: Methods, Datasets, Performance Metrics, Challenges, and Future Direction","volume":"12","author":"Alhabeeb","year":"2024","journal-title":"IEEE Access"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"139","DOI":"10.1145\/3422622","article-title":"Generative adversarial networks","volume":"63","author":"Goodfellow","year":"2020","journal-title":"Commun. ACM"},{"key":"ref_9","unstructured":"Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H. (2016, January 19\u201324). Generative adversarial text to image synthesis. Proceedings of the International Conference on Machine Learning. PMLR, New York, NY, USA."},{"key":"ref_10","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_11","unstructured":"Brock, A., Donahue, J., and Simonyan, K. (2019, January 6\u20139). Large Scale GAN Training for High Fidelity Natural Image Synthesis. Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA."},{"key":"ref_12","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the ICML, Online."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Crowson, K., Biderman, S., Kornis, D., Stander, D., Hallahan, E., Castricato, L., and Raff, E. (2022). Vqgan-clip: Open domain image generation and editing with natural language guidance. arXiv.","DOI":"10.1007\/978-3-031-19836-6_6"},{"key":"ref_14","unstructured":"Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. (2015, January 6\u201311). Deep unsupervised learning using nonequilibrium thermodynamics. Proceedings of the International Conference on Machine Learning. PMLR, Lille, France."},{"key":"ref_15","unstructured":"Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. (2021, January 18\u201324). Zero-Shot Text-to-Image Generation. Proceedings of the 38th International Conference on Machine Learning, Virtual, Proceedings of Machine Learning Research."},{"key":"ref_16","unstructured":"(2024, July 02). Midjourney AI. Available online: https:\/\/www.midjourneyfree.ai\/."},{"key":"ref_17","unstructured":"Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022, January 18\u201324). High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"ref_19","first-page":"36479","article-title":"Photorealistic text-to-image diffusion models with deep language understanding","volume":"35","author":"Saharia","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_20","unstructured":"Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., and Millican, K. (2023). Gemini: A family of highly capable multimodal models. arXiv."},{"key":"ref_21","unstructured":"Anthropic (2024). Introducing the Next Generation of Claude, Anthropic."},{"key":"ref_22","unstructured":"Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., and Anadkat, S. (2023). Gpt-4 technical report. arXiv."},{"key":"ref_23","unstructured":"Feng, W., Zhu, W., Fu, T.J., Jampani, V., Akula, A., He, X., Basu, S., Wang, X.E., and Wang, W.Y. (2024). Layoutgpt: Compositional visual planning and generation with large language models. Adv. Neural Inf. Process. Syst., 36."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Li, B., Lin, Z., Pathak, D., Li, J., Fei, Y., Wu, K., Xia, X., Zhang, P., Neubig, G., and Ramanan, D. (2024, January 16\u201322). Evaluating and Improving Compositional Text-to-Visual Generation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPRW63382.2024.00538"},{"key":"ref_25","unstructured":"Du, Y., Durkan, C., Strudel, R., Tenenbaum, J.B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W.S. (2023, January 23\u201329). Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. Proceedings of the International Conference on Machine Learning. PMLR, Honolulu, HI, USA."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Wu, X., Sun, K., Zhu, F., Zhao, R., and Li, H. (2023, January 2\u20136). Human preference score: Better aligning text-to-image models with human preference. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00200"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Liu, N., Du, Y., Li, S., Tenenbaum, J.B., and Torralba, A. (2023, January 2\u20136). Unsupervised compositional concepts discovery with text-to-image generative models. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00199"},{"key":"ref_28","unstructured":"Feng, W., He, X., Fu, T.J., Jampani, V., Akula, A.R., Narayana, P., Basu, S., Wang, X.E., and Wang, W.Y. (2023, January 1\u20135). Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis. Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda."},{"key":"ref_29","first-page":"78723","article-title":"T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation","volume":"36","author":"Huang","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_30","unstructured":"Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., and Zou, J. (2023, January 1\u20135). When and why Vision-Language Models behave like Bags-of-Words, and what to do about it?. Proceedings of the International Conference on Learning Representations, Kigali, Rwanda."},{"key":"ref_31","unstructured":"Rassin, R., Hirsch, E., Glickman, D., Ravfogel, S., Goldberg, Y., and Chechik, G. (2024). Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment. Adv. Neural Inf. Process. Syst., 36."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Mehrabi, N., Goyal, P., Verma, A., Dhamala, J., Kumar, V., Hu, Q., Chang, K.W., Zemel, R., Galstyan, A., and Gupta, R. (2023, January 9\u201314). Resolving ambiguities in text-to-image generative models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada.","DOI":"10.18653\/v1\/2023.acl-long.804"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Yang, G., Zhang, J., Zhang, Y., Wu, B., and Yang, Y. (2021, January 20\u201325). Probabilistic modeling of semantic ambiguity for scene graph generation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01234"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Rassin, R., Ravfogel, S., and Goldberg, Y. (2022, January 8). DALLE-2 is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models. Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Abu Dhabi, United Arab Emirates.","DOI":"10.18653\/v1\/2022.blackboxnlp-1.28"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Guo, Y., Shao, H., Liu, C., Xu, K., and Yuan, X. (2024). PrompTHis: Visualizing the Process and Influence of Prompt Editing during Text-to-Image Creation. IEEE Trans. Vis. Comput. Graph., 1\u201312.","DOI":"10.1109\/TVCG.2024.3408255"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"e50638","DOI":"10.2196\/50638","article-title":"Prompt engineering as an important emerging skill for medical professionals: Tutorial","volume":"25","year":"2023","journal-title":"J. Med Internet Res."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"2629","DOI":"10.1007\/s10439-023-03272-4","article-title":"Prompt engineering with ChatGPT: A guide for academic writers","volume":"51","author":"Giray","year":"2023","journal-title":"Ann. Biomed. Eng."},{"key":"ref_38","unstructured":"Spurlock, K.D., Acun, C., Saka, E., and Nasraoui, O. (2024). ChatGPT for Conversational Recommendation: Refining Recommendations by Reprompting with Feedback. arXiv."},{"key":"ref_39","first-page":"295","article-title":"Promptmagician: Interactive prompt engineering for text-to-image creation","volume":"30","author":"Feng","year":"2023","journal-title":"IEEE Trans. Vis. Comput. Graph."},{"key":"ref_40","unstructured":"Liu, V., and Chilton, L.B. (May, January 29). Design Guidelines for Prompt Engineering Text-to-Image Generative Models. Proceedings of the CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"2337","DOI":"10.1007\/s11263-022-01653-1","article-title":"Learning to prompt for vision-language models","volume":"130","author":"Zhou","year":"2022","journal-title":"Int. J. Comput. Vis."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Mo, W., Zhang, T., Bai, Y., Su, B., Wen, J.R., and Yang, Q. (2024, January 16\u201322). Dynamic Prompt Optimizing for Text-to-Image Generation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.02514"},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"692","DOI":"10.1006\/jmla.1993.1035","article-title":"The interaction of lexical and syntactic ambiguity","volume":"32","author":"MacDonald","year":"1993","journal-title":"J. Mem. Lang."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1111\/stul.12221","article-title":"Ambiguity in Linguistics 1","volume":"78","author":"Fortuny","year":"2024","journal-title":"Stud. Linguist."},{"key":"ref_45","first-page":"139","article-title":"Disambiguating ambiguity: A comparative analysis of lexical decision-making in native and non-native English speakers","volume":"13","author":"Kreishan","year":"2024","journal-title":"Int. J. Engl. Lang. Lit. Stud."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Liu, A., Wu, Z., Michael, J., Suhr, A., West, P., Koller, A., Swayamdipta, S., Smith, N.A., and Choi, Y. (2023, January 6\u201310). We\u2019re Afraid Language Models Aren\u2019t Modeling Ambiguity. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore.","DOI":"10.18653\/v1\/2023.emnlp-main.51"},{"key":"ref_47","first-page":"325","article-title":"Language, linguistics and cognition","volume":"14","author":"Baggio","year":"2012","journal-title":"Handb. Philos. Sci."},{"key":"ref_48","first-page":"8","article-title":"Improving image generation with better captions","volume":"2","author":"Betker","year":"2023","journal-title":"Comput. Sci."}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/1\/19\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,8]],"date-time":"2025-10-08T10:24:59Z","timestamp":1759919099000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/1\/19"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,8]]},"references-count":48,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,1]]}},"alternative-id":["computers14010019"],"URL":"https:\/\/doi.org\/10.3390\/computers14010019","relation":{},"ISSN":["2073-431X"],"issn-type":[{"value":"2073-431X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,1,8]]}}}