{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T20:39:49Z","timestamp":1776112789112,"version":"3.50.1"},"reference-count":47,"publisher":"Association for Computing Machinery (ACM)","issue":"ISSTA","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Softw. Eng."],"published-print":{"date-parts":[[2025,6,22]]},"abstract":"<jats:p>Converting user interfaces into code (UI2Code) is a crucial but time-consuming and labor-intensive step in website development. Automating UI2Code is essential to streamline this task and improve development efficiency. Deep learning-based methods exist for the task; however, they rely heavily on large amounts of labeled training data and struggle to generalize to real-world, unseen web page designs. The advent of Multimodal Large Language Models (MLLMs) offers potential for alleviating the issue, but they struggle to comprehend the complex layouts in UIs and to generate accurate code that preserves the layout. To address these issues, we propose LayoutCoder, a novel MLLM-based framework that generates UI code from real-world webpage images and includes three key modules: (1) Element Relation Construction, which captures the UI layout by identifying and grouping components with similar structures; (2) UI Layout Parsing, which generates UI layout trees to guide the subsequent code generation process; and (3) Layout-Guided Code Fusion, which produces accurate code that preserves the layout. For evaluation, in addition to the popular Design2Code dataset, we build a new benchmark dataset named Snap2Code, which comprises 350 real-world websites divided into seen and unseen parts to mitigate data leakage. Extensive evaluation shows the superior performance of LayoutCoder over state-of-the-art approaches. 
Compared with the best-performing baseline, LayoutCoder improves 10.14% in the BLEU score and 3.95% in the CLIP score on average across all datasets.<\/jats:p>","DOI":"10.1145\/3728925","type":"journal-article","created":{"date-parts":[[2025,6,22]],"date-time":"2025-06-22T10:52:56Z","timestamp":1750589576000},"page":"1123-1145","source":"Crossref","is-referenced-by-count":5,"title":["MLLM-Based UI2Code Automation Guided by UI Layout Information"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-9090-1832","authenticated-orcid":false,"given":"Fan","family":"Wu","sequence":"first","affiliation":[{"name":"Harbin Institute of Technology, Shenzhen, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4774-2434","authenticated-orcid":false,"given":"Cuiyun","family":"Gao","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Shenzhen, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6323-1402","authenticated-orcid":false,"given":"Shuqing","family":"Li","sequence":"additional","affiliation":[{"name":"Chinese University of Hong Kong, Hong Kong, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2115-9921","authenticated-orcid":false,"given":"Xin-Cheng","family":"Wen","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Shenzhen, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1012-5301","authenticated-orcid":false,"given":"Qing","family":"Liao","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Shenzhen, Shenzhen, China"}]}],"member":"320","published-online":{"date-parts":[[2025,6,22]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS \u201922)","author":"Alayrac Jean-Baptiste","year":"2024","unstructured":"Jean-Baptiste Alayrac, Jeff Donahue, and Pauline Luc. 2024. Flamingo: a visual language model for few-shot learning. 
In Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS \u201922). Curran Associates Inc., Red Hook, NY, USA. Article 1723, 21 pages. isbn:9781713871088"},{"key":"e_1_2_1_2_1","unstructured":"Anthropic. 2024. Claude 3.5. https:\/\/www.anthropic.com Accessed: 2024-10-30"},{"key":"e_1_2_1_3_1","volume-title":"2019 Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT). 1\u20134.","author":"A\u015f\u0131ro\u011flu Batuhan","year":"2019","unstructured":"Batuhan A\u015f\u0131ro\u011flu, B\u00fc\u015fta R\u00fcmeysa Mete, Eyy\u00fcp Y\u0131ld\u0131z, Ya\u011f\u0131z Nal\u00e7akan, Alper Sezen, Mustafa Da\u011ftekin, and Tolga Ensari. 2019. Automatic HTML code generation from mock-up images using machine learning techniques. In 2019 Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT). 1\u20134."},{"key":"e_1_2_1_4_1","unstructured":"Microsoft Azure. 2018. Turn your whiteboard sketches to working code in seconds with sketch2code. https:\/\/azure.microsoft.com\/en-us\/blog\/turn-your-whiteboard-sketches-to-working-code-in-seconds-with-sketch2code\/ Accessed: 2024-10-30"},{"key":"e_1_2_1_5_1","unstructured":"Jinze Bai Shuai Bai Shusheng Yang Shijie Wang Sinan Tan Peng Wang Junyang Lin Chang Zhou and Jingren Zhou. 2023. Qwen-VL: A Versatile Vision-Language Model for Understanding Localization Text Reading and Beyond. arxiv:2308.12966. arxiv:2308.12966"},{"key":"e_1_2_1_6_1","volume-title":"Aldo von Wangenheim, Jean C. R. Hauck, and Edson C. Vargas J\u00fanior.","author":"Baul\u00e9 Daniel","year":"2021","unstructured":"Daniel Baul\u00e9, Christiane Gresse von Wangenheim, Aldo von Wangenheim, Jean C. R. Hauck, and Edson C. Vargas J\u00fanior. 2021. Automatic code generation from sketches of mobile applications in end-user development using Deep Learning. arxiv:2103.05704. 
arxiv:2103.05704"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3220134.3220135"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3180155.3180240"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00530-021-00804-7"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-023-15108-3"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3126594.3126651"},{"key":"e_1_2_1_12_1","volume-title":"A density-based algorithm for discovering clusters in large spatial databases with noise. KDD\u201996","author":"Ester Martin","unstructured":"Martin Ester, Hans-Peter Kriegel, J\u00f6rg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD\u201996. AAAI Press, 226\u2013231."},{"key":"e_1_2_1_13_1","volume-title":"Ronan Le Bras, and Yejin Choi","author":"Hessel Jack","year":"2022","unstructured":"Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. 2022. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. arxiv:2104.08718. arxiv:2104.08718"},{"key":"e_1_2_1_14_1","unstructured":"Vanita Jain Piyush Agrawal Subham Banga Rishabh Kapoor and Shashwat Gulyani. 2019. Sketch2Code: Transformation of Sketches to UI in Real-time Using Deep Neural Network. arxiv:1910.08930. arxiv:1910.08930"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01765"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i1.19994"},{"key":"e_1_2_1_17_1","doi-asserted-by":"crossref","unstructured":"Alexander Kirillov Eric Mintun Nikhila Ravi Hanzi Mao Chloe Rolland Laura Gustafson Tete Xiao Spencer Whitehead Alexander C. Berg Wan-Yen Lo Piotr Doll\u00e1r and Ross Girshick. 2023. Segment Anything. arxiv:2304.02643. 
arxiv:2304.02643","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"e_1_2_1_18_1","unstructured":"Hugo Lauren\u00e7on L\u00e9o Tronchon and Victor Sanh. 2024. Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset. arxiv:2403.09029. arxiv:2403.09029"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the 40th International Conference on Machine Learning (ICML\u201923)","author":"Lee Kenton","year":"2023","unstructured":"Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, and Fangyu Liu. 2023. Pix2Struct: screenshot parsing as pretraining for visual language understanding. In Proceedings of the 40th International Conference on Machine Learning (ICML\u201923). JMLR.org, Article 780, 20 pages."},{"key":"e_1_2_1_20_1","unstructured":"Junnan Li Dongxu Li Silvio Savarese and Steven Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arxiv:2301.12597. arxiv:2301.12597"},{"key":"e_1_2_1_21_1","unstructured":"Jianan Li Jimei Yang Aaron Hertzmann Jianming Zhang and Tingfa Xu. 2019. LayoutGAN: Generating Graphic Layouts with Wireframe Discriminators. arxiv:1901.06767. arxiv:1901.06767"},{"key":"e_1_2_1_22_1","volume-title":"Visual Instruction Tuning. ArXiv, abs\/2304.08485","author":"Liu Haotian","year":"2023","unstructured":"Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. ArXiv, abs\/2304.08485 (2023), https:\/\/api.semanticscholar.org\/CorpusID:258179774"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2018.2844788"},{"key":"e_1_2_1_24_1","volume-title":"2015 30th IEEE\/ACM International Conference on Automated Software Engineering (ASE), 248\u2013259","author":"Nguyen Tuan Anh","year":"2015","unstructured":"Tuan Anh Nguyen and Christoph Csallner. 2015. Reverse Engineering Mobile Application User Interfaces with REMAUI (T). 
2015 30th IEEE\/ACM International Conference on Automated Software Engineering (ASE), 248\u2013259. https:\/\/api.semanticscholar.org\/CorpusID:7499368"},{"key":"e_1_2_1_25_1","unstructured":"OpenAI Josh Achiam Steven Adler Sandhini Agarwal and Lama Ahmad. 2024. GPT-4 Technical Report. arxiv:2303.08774. arxiv:2303.08774"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_2_1_27_1","volume-title":"International conference on machine learning. 8748\u20138763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, and Jack Clark. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. 8748\u20138763."},{"key":"e_1_2_1_28_1","unstructured":"Alex Robinson. 2019. Sketch2code: Generating a website from a paper mockup. arxiv:1905.13750. arxiv:1905.13750"},{"key":"e_1_2_1_29_1","unstructured":"Andy Rutledge. 2009. Gestalt Principles - 3: Proximity Uniform Connectedness and Good Continuation. https:\/\/andyrutledge.com\/gestalt-principles-3.html Accessed: 2024-10-30"},{"key":"e_1_2_1_30_1","unstructured":"Chenglei Si Yanzhe Zhang Zhengyuan Yang Ruibo Liu and Diyi Yang. 2024. Design2Code: How Far Are We From Automating Front-End Engineering? arXiv preprint arXiv:2403.03163."},{"key":"e_1_2_1_31_1","unstructured":"Digital Silk. 2024. How Many Websites Are There In 2024? https:\/\/www.digitalsilk.com\/digital-trends\/how-many-websites-are-there\/ Accessed: 2024-10-31"},{"key":"e_1_2_1_32_1","volume-title":"Lyu","author":"Wan Yuxuan","year":"2024","unstructured":"Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, and Michael R. Lyu. 2024. Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach. arxiv:2406.16386. 
arxiv:2406.16386"},{"key":"e_1_2_1_33_1","volume-title":"Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.","author":"Wang Weihan","year":"2023","unstructured":"Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, and Xixuan Song. 2023. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079."},{"key":"e_1_2_1_34_1","unstructured":"Fan Wu. 2025. MLLM-Based UI2Code Automation Guided by UI Layout Information. https:\/\/github.com\/ay7u1009\/LayoutCoder\/ Accessed: 2025-04-05"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3472749.3474763"},{"key":"e_1_2_1_36_1","doi-asserted-by":"crossref","unstructured":"Shuhong Xiao Yunnong Chen Jiazhi Li Liuqing Chen Lingyun Sun and Tingting Zhou. 2024. Prototype2Code: End-to-end Front-end Code Generation from UI Design Prototypes. arxiv:2405.04975. arxiv:2405.04975","DOI":"10.1115\/DETC2024-143139"},{"key":"e_1_2_1_37_1","doi-asserted-by":"crossref","unstructured":"Shuhong Xiao Yunnong Chen Yaxuan Song Liuqing Chen Lingyun Sun Yankun Zhen and Yanfang Chang. 2024. UI Semantic Group Detection: Grouping UI Elements with Similar Semantics in Mobile Graphical User Interface. arxiv:2403.04984. arxiv:2403.04984","DOI":"10.1016\/j.displa.2024.102679"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3368089.3417940"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3540250.3549138"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01246-5_41"},{"key":"e_1_2_1_41_1","volume-title":"DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arxiv:2203.03605. arxiv:2203.03605","author":"Zhang Hao","year":"2022","unstructured":"Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. 2022. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. 
arxiv:2203.03605. arxiv:2203.03605"},{"key":"e_1_2_1_42_1","volume-title":"2023 IEEE\/CVF International Conference on Computer Vision (ICCV), 7192\u20137202","author":"Zhang Junyi","unstructured":"Junyi Zhang, Jiaqi Guo, Shizhao Sun, Jian-Guang Lou, and D. Zhang. 2023. LayoutDiffusion: Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models. 2023 IEEE\/CVF International Conference on Computer Vision (ICCV), 7192\u20137202. https:\/\/api.semanticscholar.org\/CorpusID:257636725"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1155\/2022\/4415479"},{"key":"e_1_2_1_44_1","first-page":"1","article-title":"Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels.. In CHI, Yoshifumi Kitamura, Aaron Quigley, Katherine Isbister, Takeo Igarashi, Pernille Bj\u00f8rn, and Steven Mark Drucker (Eds.)","volume":"275","author":"Zhang Xiaoyi","year":"2021","unstructured":"Xiaoyi Zhang, Lilian de Greef, and Amanda Swearngin. 2021. Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels.. In CHI, Yoshifumi Kitamura, Aaron Quigley, Katherine Isbister, Takeo Igarashi, Pernille Bj\u00f8rn, and Steven Mark Drucker (Eds.). ACM, 275:1\u2013275:15. isbn:978-1-4503-8096-6 http:\/\/dblp.uni-trier.de\/db\/conf\/chi\/chi2021.html##ZhangGSWMYSNWFE21","journal-title":"ACM"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3289600.3290610"},{"key":"e_1_2_1_46_1","volume-title":"PubLayNet: Largest Dataset Ever for Document Layout Analysis. 2019 International Conference on Document Analysis and Recognition (ICDAR), 1015\u20131022","author":"Zhong Xu","year":"2019","unstructured":"Xu Zhong, Jianbin Tang, and Antonio Jimeno-Yepes. 2019. PubLayNet: Largest Dataset Ever for Document Layout Analysis. 2019 International Conference on Document Analysis and Recognition (ICDAR), 1015\u20131022. 
https:\/\/api.semanticscholar.org\/CorpusID:201124789"},{"key":"e_1_2_1_47_1","unstructured":"Ting Zhou Yanjie Zhao Xinyi Hou Xiaoyu Sun Kai Chen and Haoyu Wang. 2024. Bridging Design and Development with Automated Declarative UI Code Generation. arxiv:2409.11667. arxiv:2409.11667"}],"container-title":["Proceedings of the ACM on Software Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3728925","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,16]],"date-time":"2025-07-16T16:53:26Z","timestamp":1752684806000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3728925"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,22]]},"references-count":47,"journal-issue":{"issue":"ISSTA","published-print":{"date-parts":[[2025,6,22]]}},"alternative-id":["10.1145\/3728925"],"URL":"https:\/\/doi.org\/10.1145\/3728925","relation":{},"ISSN":["2994-970X"],"issn-type":[{"value":"2994-970X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,22]]}}}