{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,12]],"date-time":"2026-06-12T02:31:15Z","timestamp":1781231475010,"version":"3.54.1"},"reference-count":54,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2026,3,16]],"date-time":"2026-03-16T00:00:00Z","timestamp":1773619200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Covision Lab Scarl"},{"name":"Schaeffler Automotive Buehl GmbH"},{"name":"Open Access Publishing Fund of the Free University of Bozen-Bolzano"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["J. Imaging"],"abstract":"<jats:p>Synthetic dataset generation in Computer Vision, particularly for industrial applications, is still underexplored. Industrial defect segmentation, for instance, requires highly accurate labels, yet acquiring such data is costly and time-consuming. To address this challenge, we propose a novel diffusion-based pipeline for generating high-fidelity industrial datasets with minimal supervision. Our approach conditions the diffusion model on enriched bounding-box representations to produce precise segmentation masks, ensuring realistic and accurately localized defect synthesis. Compared to existing layout-conditioned generative methods, our approach improves defect consistency and spatial accuracy. We introduce two quantitative metrics to evaluate the effectiveness of our method and assess its impact on a downstream segmentation task trained on real and synthetic data. Our results demonstrate that diffusion-based synthesis can bridge the gap between artificial and real-world industrial data, fostering more reliable and cost-efficient segmentation models.<\/jats:p>","DOI":"10.3390\/jimaging12030132","type":"journal-article","created":{"date-parts":[[2026,3,16]],"date-time":"2026-03-16T13:32:07Z","timestamp":1773667927000},"page":"132","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Bounding Box-Guided Diffusion for Synthesizing Industrial Images and Segmentation Maps"],"prefix":"10.3390","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-9693-5755","authenticated-orcid":false,"given":"Emanuele","family":"Caruso","sequence":"first","affiliation":[{"name":"Department of Engineering, Free University of Bozen-Bolzano, 39100 Bozen-Bolzano, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5548-2620","authenticated-orcid":false,"given":"Francesco","family":"Pelosin","sequence":"additional","affiliation":[{"name":"Covision Lab Scarl, 39042 Brixen-Bressanone, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3095-3294","authenticated-orcid":false,"given":"Alessandro","family":"Simoni","sequence":"additional","affiliation":[{"name":"Covision Lab Scarl, 39042 Brixen-Bressanone, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4793-4276","authenticated-orcid":false,"given":"Oswald","family":"Lanz","sequence":"additional","affiliation":[{"name":"Department of Engineering, Free University of Bozen-Bolzano, 39100 Bozen-Bolzano, Italy"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2026,3,16]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Chan, X., Wang, X., Yu, D., Mi, H., and Yu, D. (2024). Scaling Synthetic Data Creation with 1,000,000,000 Personas. arXiv.","DOI":"10.14218\/JCTH.2023.00464"},{"key":"ref_2","unstructured":"Rogers, A., Boyd-Graber, J.L., and Okazaki, N. (2023). Self-Instruct: Aligning Language Models with Self-Generated Instructions. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Association for Computational Linguistics."},{"key":"ref_3","unstructured":"Chen, Z., Deng, Y., Yuan, H., Ji, K., and Gu, Q. (2024, January 21\u201327). Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria."},{"key":"ref_4","unstructured":"Gjerde, M.K., Slez\u00e1k, F., Haurum, J.B., and Moeslund, T.B. (2024, January 18). From NeRF to 3DGS: A Leap in Stereo Dataset Quality?. Proceedings of the Synthetic Data for Computer Vision Workshop (CVPRW), Seattle, WA, USA."},{"key":"ref_5","unstructured":"Geng, S., Krishna, R., and Koh, P.W. (2024, January 18). Training with real instead of synthetic generated images still performs better. Proceedings of the Synthetic Data for Computer Vision Workshop (CVPRW), Seattle, WA, USA."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Li, B., Lin, Z., Pathak, D., Li, J.E., Xia, X., Neubig, G., Zhang, P., and Ramanan, D. (2024, January 18). GenAI-bench: A holistic benchmark for compositional text-to-visual generation. Proceedings of the Synthetic Data for Computer Vision Workshop (CVPRW), Seattle, WA, USA.","DOI":"10.1109\/CVPRW63382.2024.00538"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Yu, P., and Wu, Y.N. (2024, January 18). Object-Conditioned Energy-Based Model for Attention Map Alignment in Text-to-Image Diffusion Models. Proceedings of the Synthetic Data for Computer Vision Workshop (CVPRW), Seattle, WA, USA.","DOI":"10.1007\/978-3-031-72946-1_4"},{"key":"ref_8","unstructured":"Chiruzzo, L., Ritter, A., and Wang, L. (2025). DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, NM, USA, 29 April\u20134 May 2025, Association for Computational Linguistics."},{"key":"ref_9","unstructured":"Vo, D.T., Duc, P.A., Thao, N.N., and Ninh, H. (2024, January 18). An approach to synthesize thermal infrared ship images. Proceedings of the Synthetic Data for Computer Vision Workshop (CVPRW), Seattle, WA, USA."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Sasiaowapak, T., Boonsang, S., Chuwongin, S., Tongloy, T., and Lalitrojwong, P. (2023, January 3\u20135). Generative AI for Industrial Applications: Synthetic Dataset. Proceedings of the International Conference on Information Technology and Electrical Engineering (ICITEE), Changde, China.","DOI":"10.1109\/ICITEE59582.2023.10317774"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Zheng, G., Zhou, X., Li, X., Qi, Z., Shan, Y., and Li, X. (2023, January 17\u201324). Layoutdiffusion: Controllable diffusion model for layout-to-image generation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.02154"},{"key":"ref_12","unstructured":"(2026, March 09). Unreal Engine. Available online: https:\/\/www.unrealengine.com."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Ros, G., Sellart, L., Materzynska, J., Vazquez, D., and Lopez, A.M. (2016, January 27\u201330). The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.352"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Gaidon, A., Wang, Q., Cabon, Y., and Vig, E. (2016, January 27\u201330). Virtual worlds as proxy for multi-object tracking analysis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.470"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Raistrick, A., Lipson, L., Ma, Z., Mei, L., Wang, M., Zuo, Y., Kayan, K., Wen, H., Han, B., and Wang, Y. (2023, January 17\u201324). Infinite Photorealistic Worlds Using Procedural Generation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01215"},{"key":"ref_16","first-page":"4396","article-title":"Domain Generalization: A Survey","volume":"45","author":"Zhou","year":"2023","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI)"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Sun, C., Han, J., Deng, W., Wang, X., Qin, Z., and Gould, S. (2025). 3D-GPT: Procedural 3D modeling with large language models. 2025 International Conference on 3D Vision (3DV), IEEE.","DOI":"10.1109\/3DV66043.2025.00119"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"139","DOI":"10.1145\/3422622","article-title":"Generative adversarial networks","volume":"63","author":"Goodfellow","year":"2020","journal-title":"Commun. ACM"},{"key":"ref_19","unstructured":"Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems (NIPS), Neural Information Processing Systems Foundation."},{"key":"ref_20","unstructured":"Guo, P., Zhao, C., Yang, D., Xu, Z., Nath, V., Tang, Y., Simon, B., Belue, M., Harmon, S., and Turkbey, B. (March, January 28). Maisi: Medical ai for synthetic imaging. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Skandarani, Y., Jodoin, P., and Lalande, A. (2023). GANs for Medical Image Synthesis: An Empirical Study. J. Imaging, 9.","DOI":"10.3390\/jimaging9030069"},{"key":"ref_22","unstructured":"Zhao, C., Svoboda, D., Wolterink, J.M., and Escobar, M. (2022). Can Segmentation Models Be Trained with Fully Synthetically Generated Data?. Simulation and Synthesis in Medical Imaging (MICCAI Workshops), Springer."},{"key":"ref_23","unstructured":"Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (2023). Dataset Diffusion: Diffusion-based Synthetic Data Generation for Pixel-Level Semantic Segmentation. Advances in Neural Information Processing Systems (NIPS), Neural Information Processing Systems Foundation."},{"key":"ref_24","unstructured":"Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (2023). Scenario Diffusion: Controllable Driving Scenario Generation with Diffusion. Advances in Neural Information Processing Systems (NIPS), Neural Information Processing Systems Foundation."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Klemp, M., R\u00f6sch, K., Wagner, R., Quehl, J., and Lauer, M. (2023, January 17\u201324). LDFA: Latent Diffusion Face Anonymization for Self-driving Applications. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada.","DOI":"10.1109\/CVPRW59228.2023.00322"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"11569","DOI":"10.1109\/LRA.2022.3193225","article-title":"Semi-Perspective Decoupled Heatmaps for 3D Robot Pose Estimation from Depth Maps","volume":"7","author":"Simoni","year":"2022","journal-title":"IEEE Robot. Autom. Lett. (RAL)"},{"key":"ref_27","unstructured":"NVIDIA (2025). Cosmos World Foundation Model Platform for Physical AI. arXiv."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Zhao, B., Meng, L., Yin, W., and Sigal, L. (2019, January 15\u201320). Image Generation From Layout. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00878"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Karras, T., Laine, S., and Aila, T. (2019, January 15\u201320). A Style-Based Generator Architecture for Generative Adversarial Networks. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00453"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Sun, W., and Wu, T. (November, January 27). Image Synthesis From Reconfigurable Layout and Style. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.01063"},{"key":"ref_31","first-page":"5070","article-title":"Learning Layout and Style Reconfigurable GANs for Controllable Image Synthesis","volume":"44","author":"Sun","year":"2022","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Wang, B., Wu, T., Zhu, M., and Du, P. (2022, January 18\u201324). Interactive Image Synthesis with Panoptic Layout Generation. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00763"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., and Lee, Y.J. (2023, January 17\u201324). GLIGEN: Open-Set Grounded Text-to-Image Generation. Proceedings of the 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.02156"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Zhang, L., Rao, A., and Agrawala, M. (2023, January 2\u20136). Adding Conditional Control to Text-to-Image Diffusion Models. Proceedings of the 2023 IEEE\/CVF International Conference on Computer Vision (ICCV), Paris, France.","DOI":"10.1109\/ICCV51070.2023.00355"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Wang, X., Darrell, T., Rambhatla, S.S., Girdhar, R., and Misra, I. (2024, January 17\u201321). InstanceDiffusion: Instance-Level Control for Image Generation. Proceedings of the 2024 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.00596"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Wang, R., Hou, X., Schmedding, S., and Huber, M.F. (March, January 28). STAY Diffusion: Styled Layout Diffusion Model for Diverse Layout-to-Image Generation. Proceedings of the 2025 IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA.","DOI":"10.1109\/WACV61041.2025.00379"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Toker, A., Eisenberger, M., Cremers, D., and Leal-Taix\u00e9, L. (2024, January 17\u201321). SatSynth: Augmenting Image-Mask Pairs Through Diffusion Models for Aerial Semantic Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.02615"},{"key":"ref_38","unstructured":"Chen, T., Zhang, R., and Hinton, G.E. (2023, January 1\u20135). Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning. Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention\u2014MICCAI 2015, Springer. International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI).","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"ref_40","unstructured":"Loshchilov, I., and Hutter, F. (2019, January 6\u20139). Decoupled Weight Decay Regularization. Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"363","DOI":"10.2478\/aut-2019-0035","article-title":"A Public Fabric Database for Defect Detection Methods and Results","volume":"19","author":"Miralles","year":"2019","journal-title":"AUTEX Res. J."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Huang, Y., Qiu, C., Guo, Y., Wang, X., and Yuan, K. (2018, January 20\u201324). Surface Defect Saliency of Magnetic Tile. Proceedings of the 2018 IEEE 14th International Conference on Automation Science and Engineering (CASE), Munich, Germany.","DOI":"10.1109\/COASE.2018.8560423"},{"key":"ref_44","unstructured":"\u00d6zgenel, C.F. (2026, March 10). Concrete Crack Images for Classification. Mendeley Data, V2, 2019. Available online: https:\/\/data.mendeley.com\/datasets\/5y9wdsg2zt\/2."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"759","DOI":"10.1007\/s10845-019-01476-x","article-title":"Segmentation-based deep-learning approach for surface-defect detection","volume":"31","author":"Tabernik","year":"2020","journal-title":"J. Intell. Manuf."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"103459","DOI":"10.1016\/j.compind.2021.103459","article-title":"Mixed supervision for surface-defect detection: From weakly to fully supervised learning","volume":"129","author":"Tabernik","year":"2021","journal-title":"Comput. Ind."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"581","DOI":"10.12688\/f1000research.52903.2","article-title":"A large-scale image dataset of wood surface defects for automated vision-based quality control processes","volume":"10","author":"Kodytek","year":"2022","journal-title":"F1000Research"},{"key":"ref_48","unstructured":"Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems (NIPS), Neural Information Processing Systems Foundation."},{"key":"ref_49","unstructured":"Bi\u0144kowski, M., Sutherland, D.J., Arbel, M., and Gretton, A. (2023, January 12\u201314). Demystifying MMD GANs. Proceedings of the International Conference on Learning Representations (ICLR), Munich, Germany."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Zhang, R., Isola, P., Efros, A.A., Shechtman, E., and Wang, O. (2018, January 18\u201323). The unreasonable effectiveness of deep features as a perceptual metric. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00068"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016, January 27\u201330). Rethinking the inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.308"},{"key":"ref_52","unstructured":"Pereira, F., Burges, C., Bottou, L., and Weinberger, K. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems (NIPS), Neural Information Processing Systems Foundation."},{"key":"ref_53","unstructured":"Simonyan, K., and Zisserman, A. (2015, January 7\u20139). Very Deep Convolutional Networks for Large-Scale Image Recognition. Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA."},{"key":"ref_54","unstructured":"Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., and Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50\u00d7 fewer parameters and <0.5 MB model size. arXiv."}],"container-title":["Journal of Imaging"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2313-433X\/12\/3\/132\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,16]],"date-time":"2026-03-16T14:11:59Z","timestamp":1773670319000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2313-433X\/12\/3\/132"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,16]]},"references-count":54,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2026,3]]}},"alternative-id":["jimaging12030132"],"URL":"https:\/\/doi.org\/10.3390\/jimaging12030132","relation":{},"ISSN":["2313-433X"],"issn-type":[{"value":"2313-433X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,16]]}}}