{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,29]],"date-time":"2026-03-29T02:00:18Z","timestamp":1774749618375,"version":"3.50.1"},"reference-count":43,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2024,3,1]],"date-time":"2024-03-01T00:00:00Z","timestamp":1709251200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["12301581"],"award-info":[{"award-number":["12301581"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["JDJQ20220805"],"award-info":[{"award-number":["JDJQ20220805"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"the outstanding Youth Program of Beijing University of Civil Engineering and Architecture","award":["12301581"],"award-info":[{"award-number":["12301581"]}]},{"name":"the outstanding Youth Program of Beijing University of Civil Engineering and Architecture","award":["JDJQ20220805"],"award-info":[{"award-number":["JDJQ20220805"]}]},{"name":"BUCEA Post Graduate Innovation Project","award":["12301581"],"award-info":[{"award-number":["12301581"]}]},{"name":"BUCEA Post Graduate Innovation Project","award":["JDJQ20220805"],"award-info":[{"award-number":["JDJQ20220805"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>Image captioning is important for improving the intelligence of construction projects and assisting managers in mastering construction site activities. However, there are few image-captioning models for construction scenes at present, and the existing methods do not perform well in complex construction scenes. According to the characteristics of construction scenes, we label a text description dataset based on the MOCS dataset and propose a style-enhanced Transformer for image captioning in construction scenes, simply called SETCAP. Specifically, we extract the grid features using the Swin Transformer. Then, to enhance the style information, we not only use the grid features as the initial detail semantic features but also extract style information by style encoder. In addition, in the decoder, we integrate the style information into the text features. The interaction between the image semantic information and the text features is carried out to generate content-appropriate sentences word by word. Finally, we add the sentence style loss into the total loss function to make the style of generated sentences closer to the training set. The experimental results show that the proposed method achieves encouraging results on both the MSCOCO and the MOCS datasets. 
In particular, SETCAP outperforms state-of-the-art methods by 4.2% CIDEr on the MOCS dataset and by 3.9% CIDEr on the MSCOCO dataset.<\/jats:p>","DOI":"10.3390\/e26030224","type":"journal-article","created":{"date-parts":[[2024,3,1]],"date-time":"2024-03-01T06:07:53Z","timestamp":1709273273000},"page":"224","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["Style-Enhanced Transformer for Image Captioning in Construction Scenes"],"prefix":"10.3390","volume":"26","author":[{"given":"Kani","family":"Song","sequence":"first","affiliation":[{"name":"School of Science, Beijing University of Civil Engineering and Architecture, Beijing 100044, China"}]},{"given":"Linlin","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Science, Beijing University of Civil Engineering and Architecture, Beijing 100044, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6693-0161","authenticated-orcid":false,"given":"Hengyou","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Science, Beijing University of Civil Engineering and Architecture, Beijing 100044, China"}]}],"member":"1968","published-online":{"date-parts":[[2024,3,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"2891","DOI":"10.1109\/TPAMI.2012.162","article-title":"BabyTalk: Understanding and generating simple image descriptions","volume":"35","author":"Kulkarni","year":"2013","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015, June 7\u201312). Show and tell: A neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"ref_3","unstructured":"Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015, July 7\u20139). Show, attend and tell: Neural image caption generation with visual attention. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"102178","DOI":"10.1016\/j.ipm.2019.102178","article-title":"Image caption generation with dual attention mechanism","volume":"57","author":"Liu","year":"2020","journal-title":"Inf. Process. Manag."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Zhang, X., Sun, X., Luo, Y., Ji, J., Zhou, Y., Wu, Y., Huang, F., and Ji, R. (2021, June 20\u201325). RSTNet: Captioning with adaptive attention on visual and non-visual words. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01521"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Feng, Y., Ma, L., Liu, W., and Luo, J. (2019, June 15\u201320). Unsupervised image captioning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00425"},{"key":"ref_7","unstructured":"Laina, I., Rupprecht, C., and Navab, N. (2019, October 27\u2013November 2). Towards unsupervised image captioning with shared multimodal embeddings. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Guo, L., Liu, J., Yao, P., Li, J., and Lu, H. (2019, June 15\u201320).
MSCap: Multi-style image captioning with unpaired stylized text. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00433"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Deng, C., Ding, N., Tan, M., and Wu, Q. (2020, August 23\u201328). Length-controllable image captioning. Proceedings of the Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK. Proceedings, Part XIII 16.","DOI":"10.1007\/978-3-030-58601-0_42"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., and Zitnick, C.L. (2014, September 6\u201312). Microsoft COCO: Common objects in context. Proceedings of the Computer Vision\u2013ECCV 2014: 13th European Conference, Zurich, Switzerland. Proceedings, Part V 13.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"390","DOI":"10.1016\/j.autcon.2017.06.014","article-title":"Image-based construction hazard avoidance system using augmented reality in wearable device","volume":"83","author":"Kim","year":"2017","journal-title":"Autom. Constr."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"139","DOI":"10.1016\/j.aei.2018.05.003","article-title":"Automated detection of workers and heavy equipment on construction sites: A convolutional neural network approach","volume":"37","author":"Fang","year":"2018","journal-title":"Adv. Eng. Inform."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"58","DOI":"10.1016\/j.autcon.2018.01.003","article-title":"Transfer learning and deep convolutional neural networks for safety guardrail detection in 2D images","volume":"89","author":"Kolar","year":"2018","journal-title":"Autom. Constr."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"04018066","DOI":"10.1061\/(ASCE)CP.1943-5487.0000813","article-title":"Vision-based framework for intelligent monitoring of hardhat wearing on construction sites","volume":"33","author":"Mneymneh","year":"2019","journal-title":"J. Comput. Civ. Eng."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"103116","DOI":"10.1016\/j.autcon.2020.103116","article-title":"Context-based information generation for managing UAV-acquired data using image captioning","volume":"112","author":"Bang","year":"2020","journal-title":"Autom. Constr."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"103334","DOI":"10.1016\/j.autcon.2020.103334","article-title":"Manifesting construction activity scenes via image captioning","volume":"119","author":"Liu","year":"2020","journal-title":"Autom. Constr."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Han, S.H., and Choi, H.J. (2020, February 19\u201322). Domain-Specific Image Caption Generator with Semantic Ontology. Proceedings of the 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), Busan, Republic of Korea.","DOI":"10.1109\/BigComp48618.2020.00-12"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (2018, June 18\u201323). Bottom-up and top-down attention for image captioning and visual question answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Pan, Y., Yao, T., Li, Y., and Mei, T.
(2020, June 13\u201319). X-linear attention networks for image captioning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01098"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Kim, D.J., Oh, T.H., Choi, J., and Kweon, I.S. (2023). Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data. arXiv.","DOI":"10.2139\/ssrn.4583222"},{"key":"ref_21","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, December 4\u20139). Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA."},{"key":"ref_22","unstructured":"Huang, L., Wang, W., Chen, J., and Wei, X.Y. (2019, October 27\u2013November 2). Attention on attention for image captioning. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, October 11\u201317). Swin Transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"103482","DOI":"10.1016\/j.autcon.2020.103482","article-title":"Dataset and benchmark for detecting moving objects in construction sites","volume":"122","author":"An","year":"2021","journal-title":"Autom. Constr."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Cornia, M., Stefanini, M., Baraldi, L., and Cucchiara, R. (2020, June 13\u201319). Meshed-memory transformer for image captioning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01059"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"7005","DOI":"10.1109\/TCSVT.2022.3178844","article-title":"Vision-Enhanced and Consensus-Aware Transformer for Image Captioning","volume":"32","author":"Cao","year":"2022","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"109420","DOI":"10.1016\/j.patcog.2023.109420","article-title":"Towards local visual modeling for image captioning","volume":"138","author":"Ma","year":"2023","journal-title":"Pattern Recognit."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Wang, N., Xie, J., Wu, J., Jia, M., and Li, L. (2023, February 7\u201314). Controllable image captioning via prompting. Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA.","DOI":"10.1609\/aaai.v37i2.25360"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Vo, D.M., Luong, Q.A., Sugimoto, A., and Nakayama, H. (2023, June 17\u201324). A-CAP: Anticipation Captioning with Commonsense Knowledge. Proceedings of the 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01042"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Yang, Z., Liu, Q., and Liu, G. (2020). Better Understanding: Stylized Image Captioning with Style Attention and Adversarial Training.
Symmetry, 12.","DOI":"10.3390\/sym12121978"},{"key":"ref_31","first-page":"236","article-title":"An image caption method of construction scene based on attention mechanism and encoding-decoding architecture","volume":"56","author":"Nong","year":"2022","journal-title":"J. ZheJiang Univ. Eng. Sci."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Jiang, H., Misra, I., Rohrbach, M., Learned-Miller, E., and Chen, X. (2020, June 13\u201319). In defense of grid features for visual question answering. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01028"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Ji, J., Luo, Y., Sun, X., Chen, F., Luo, G., Wu, Y., Gao, Y., and Ji, R. (2021, February 2\u20139). Improving image captioning by leveraging intra- and inter-layer global representation in transformer network. Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada.","DOI":"10.1609\/aaai.v35i2.16258"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., and Goel, V. (2017, July 21\u201326). Self-critical sequence training for image captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.131"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Karpathy, A., and Li, F.F. (2015, June 7\u201312). Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Min, J., McCoy, R.T., Das, D., Pitler, E., and Linzen, T. (2020). Syntactic data augmentation increases robustness to inference heuristics. arXiv.","DOI":"10.18653\/v1\/2020.acl-main.212"},{"key":"ref_37","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002, July 6\u201312). BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_39","unstructured":"Banerjee, S., and Lavie, A. (2005, June 29). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization, Ann Arbor, MI, USA."},{"key":"ref_40","unstructured":"Lin, C.Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out, Association for Computational Linguistics."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Lawrence Zitnick, C., and Parikh, D. (2015, June 7\u201312). CIDEr: Consensus-based image description evaluation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Anderson, P., Fernando, B., Johnson, M., and Gould, S. (2016, October 11\u201314). SPICE: Semantic propositional image caption evaluation.
Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Fang, Z., Wang, J., Hu, X., Liang, L., Gan, Z., Wang, L., Yang, Y., and Liu, Z. (2022, June 18\u201324). Injecting semantic concepts into end-to-end image captioning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01748"}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/26\/3\/224\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T14:07:55Z","timestamp":1760105275000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/26\/3\/224"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,1]]},"references-count":43,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2024,3]]}},"alternative-id":["e26030224"],"URL":"https:\/\/doi.org\/10.3390\/e26030224","relation":{},"ISSN":["1099-4300"],"issn-type":[{"value":"1099-4300","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,1]]}}}
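
The abstract above describes SETCAP's training setup: Swin Transformer grid features, a style encoder, style-conditioned text features in the decoder, and a sentence style loss added to the total loss. Since the record contains no implementation details, the following is a minimal PyTorch-style sketch of that objective only; every module shape, the mean-pooled style vector, the L2 form of the style loss, and the weight `lam` are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch of a style-enhanced captioning objective: word-level cross-entropy
# plus a "sentence style loss" term, as outlined in the abstract. Shapes and
# the style-loss form are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleEnhancedCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, style_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Stand-in "style encoder"; grid features are assumed to arrive
        # pre-extracted (e.g., from a Swin backbone) as [B, N, d_model].
        self.style_encoder = nn.Sequential(
            nn.Linear(d_model, style_dim), nn.ReLU(),
            nn.Linear(style_dim, style_dim))
        # Inject the style vector into the text features before decoding.
        self.fuse = nn.Linear(d_model + style_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, grid_feats, tokens):
        style = self.style_encoder(grid_feats.mean(dim=1))        # [B, style_dim]
        txt = self.embed(tokens)                                  # [B, T, d_model]
        style_tiled = style.unsqueeze(1).expand(-1, txt.size(1), -1)
        txt = self.fuse(torch.cat([txt, style_tiled], dim=-1))    # style-conditioned text
        T = txt.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(txt, grid_feats, tgt_mask=causal)   # cross-attend to image
        return self.head(hidden), style

def total_loss(logits, targets, pred_style, ref_style, lam=0.1):
    # Cross-entropy word loss plus an (assumed) L2 sentence style loss pulling
    # the predicted style vector toward a reference style from the training set.
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return ce + lam * F.mse_loss(pred_style, ref_style)

# Smoke test with random tensors (49 = 7x7 grid of pooled backbone features).
model = StyleEnhancedCaptioner()
logits, style = model(torch.randn(2, 49, 256), torch.randint(0, 1000, (2, 12)))
loss = total_loss(logits, torch.randint(0, 1000, (2, 12)), style, torch.randn(2, 64))
```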
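Separately, the record itself is a standard Crossref REST API "work" message, so it can be re-fetched and unpacked programmatically. A minimal sketch follows; the endpoint and JSON layout match this document, while the `mailto` contact address is a hypothetical placeholder for Crossref's polite pool.

```python
# Fetch the work record shown above from the Crossref REST API and pull out
# a few fields. Only the contact address is a placeholder.
import requests

resp = requests.get(
    "https://api.crossref.org/works/10.3390/e26030224",
    params={"mailto": "you@example.com"},  # hypothetical contact address
    timeout=30,
)
resp.raise_for_status()
work = resp.json()["message"]             # the "message" object shown above

print(work["title"][0])                    # article title
print(work["container-title"][0], work["volume"], work["page"])
print("references:", work["references-count"])
# Conference references carry "unstructured" strings; journal ones usually
# have a DOI and an "article-title".
for ref in work.get("reference", [])[:3]:
    print(ref.get("DOI") or ref.get("unstructured", "")[:60])
```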