{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T19:14:12Z","timestamp":1760037252785,"version":"build-2065373602"},"reference-count":61,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2025,8,18]],"date-time":"2025-08-18T00:00:00Z","timestamp":1755475200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Jiangxi Provincial Department of Science and Technology Natural Science Foundation","award":["20242BAB20223"],"award-info":[{"award-number":["20242BAB20223"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["IJGI"],"abstract":"<jats:p>Understanding urban visual perception is crucial for modeling how individuals cognitively and emotionally interact with the built environment. However, traditional survey-based approaches are limited in scalability and often fail to generalize across diverse urban contexts. In this study, we introduce the UP-CBM, a transparent framework that leverages visual foundation models (VFMs) and concept-based reasoning to address these challenges. The UP-CBM automatically constructs a task-specific vocabulary of perceptual concepts using GPT-4o and processes urban scene images through a multi-scale visual prompting pipeline. This pipeline generates CLIP-based similarity maps that facilitate the learning of an interpretable bottleneck layer, effectively linking visual features with human perceptual judgments. Our framework not only achieves higher predictive accuracy but also offers enhanced interpretability, enabling transparent reasoning about urban perception. Experiments on two benchmark datasets\u2014Place Pulse 2.0 (achieving improvements of +0.041 in comparison accuracy and +0.029 in R2) and VRVWPR (+0.018 in classification accuracy)\u2014demonstrate the effectiveness and generalizability of our approach. 
These results underscore the potential of integrating VFMs with structured concept-driven pipelines for more explainable urban visual analytics.<\/jats:p>","DOI":"10.3390\/ijgi14080315","type":"journal-article","created":{"date-parts":[[2025,8,18]],"date-time":"2025-08-18T15:34:53Z","timestamp":1755531293000},"page":"315","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Towards Transparent Urban Perception: A Concept-Driven Framework with Visual Foundation Models"],"prefix":"10.3390","volume":"14","author":[{"given":"Yixin","family":"Yu","sequence":"first","affiliation":[{"name":"Faculty of Humanities and Arts, Macau University of Science and Technology, Macao SAR, China"}]},{"given":"Zepeng","family":"Yu","sequence":"additional","affiliation":[{"name":"Architecture and Design College, Nanchang University, No. 999 Xuefu Avenue, Nanchang 330031, China"}]},{"given":"Xuhua","family":"Shi","sequence":"additional","affiliation":[{"name":"Faculty of Humanities and Arts, Macau University of Science and Technology, Macao SAR, China"}]},{"given":"Ran","family":"Wan","sequence":"additional","affiliation":[{"name":"Architecture and Design College, Nanchang University, No. 999 Xuefu Avenue, Nanchang 330031, China"}]},{"given":"Bowen","family":"Wang","sequence":"additional","affiliation":[{"name":"D3 Center, Osaka University, 2-8 Yamadaoka, Suita, Osaka 565-0871, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6330-6723","authenticated-orcid":false,"given":"Jiaxin","family":"Zhang","sequence":"additional","affiliation":[{"name":"Architecture and Design College, Nanchang University, No. 
999 Xuefu Avenue, Nanchang 330031, China"},{"name":"Environmental Design and Information Technology Laboratory, Division of Sustainable Energy and Environmental Engineering, Graduate School of Engineering, Osaka University, 2-1 Yamadaoka, Suita, Osaka 565-0871, Japan"}]}],"member":"1968","published-online":{"date-parts":[[2025,8,18]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"104037","DOI":"10.1016\/j.cities.2022.104037","article-title":"Subjective and objective measures of streetscape perceptions: Relationships with property value in Shanghai","volume":"132","author":"Qiu","year":"2023","journal-title":"Cities"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"215","DOI":"10.1016\/j.isprsjprs.2022.06.011","article-title":"Measuring residents\u2019 perceptions of city streets to inform better street planning through deep learning and space syntax","volume":"190","author":"Wang","year":"2022","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1681","DOI":"10.1126\/science.1161405","article-title":"The spreading of disorder","volume":"322","author":"Keizer","year":"2008","journal-title":"Science"},{"key":"ref_4","first-page":"29","article-title":"Broken windows","volume":"249","author":"Kelling","year":"1982","journal-title":"Atl. Mon."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Zhang, J., Yu, Z., Li, Y., and Wang, X. (2023). Uncovering Bias in Objective Mapping and Subjective Perception of Urban Building Functionality: A Machine Learning Approach to Urban Spatial Perception. 
Land, 12.","DOI":"10.20944\/preprints202306.0092.v1"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"805","DOI":"10.1177\/1536867X20976313","article-title":"Extracting Chinese geographic data from Baidu map API","volume":"20","author":"Xue","year":"2020","journal-title":"Stata J."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"102463","DOI":"10.1016\/j.aei.2024.102463","article-title":"Improving facade parsing with vision transformers and line integration","volume":"60","author":"Wang","year":"2024","journal-title":"Adv. Eng. Inform."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"e2220417120","DOI":"10.1073\/pnas.2220417120","article-title":"Urban visual intelligence: Uncovering hidden city profiles with street view images","volume":"120","author":"Fan","year":"2023","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1080\/01944369008975742","article-title":"The evaluative image of the city","volume":"56","author":"Nasar","year":"1990","journal-title":"J. Am. Plan. Assoc."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"1776","DOI":"10.1080\/13467581.2023.2270047","article-title":"Towards a Fairer Green city: Measuring unfairness in daily accessible greenery in Chengdu\u2019s central city","volume":"23","author":"Zhang","year":"2024","journal-title":"J. Asian Archit. Build. Eng."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"2363","DOI":"10.1080\/13658816.2019.1643024","article-title":"A human-machine adversarial scoring framework for urban perception assessment using street-view images","volume":"33","author":"Yao","year":"2019","journal-title":"Int. J. Geogr. Inf. Sci."},{"key":"ref_12","unstructured":"Salesses, M.P. (2012). Place Pulse: Measuring the Collaborative Image of the City. 
[Master\u2019s Thesis, Massachusetts Institute of Technology]."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Salesses, P., Schechtner, K., and Hidalgo, C.A. (2013). The collaborative image of the city: Mapping the inequality of urban perception. PLoS ONE, 8.","DOI":"10.1371\/journal.pone.0068400"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"2245","DOI":"10.1109\/TPAMI.2024.3506283","article-title":"Foundation Models Defining a New Era in Vision: A Survey and Outlook","volume":"47","author":"Awais","year":"2025","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_15","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (March, January 26). Learning transferable visual models from natural language supervision. Proceedings of the International Conference on Machine Learning, PMLR, Shenzhen, China."},{"key":"ref_16","unstructured":"Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., and El-Nouby, A. (2023). Dinov2: Learning robust visual features without supervision. arXiv."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Shtedritski, A., Rupprecht, C., and Vedaldi, A. (2023, January 2\u20133). What does clip know about a red circle? Visual prompt engineering for vlms. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.01101"},{"key":"ref_18","unstructured":"Benou, N., Chen, L., and Gao, X. (2025). SALF-CBM: Spatially-Aware and Label-Free Concept Bottleneck Models. ICLR."},{"key":"ref_19","unstructured":"Koh, P.W., Nguyen, T., Tang, Y.S., Mussmann, S., Pierson, E., Kim, B., and Liang, P. (2020, January 13\u201318). Concept Bottleneck Models. 
Proceedings of the 37th International Conference on Machine Learning, PMLR, Online."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Wang, B., Li, L., Nakashima, Y., and Nagahara, H. (2023, January 18\u201322). Learning Bottleneck Concepts in Image Classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01055"},{"key":"ref_21","unstructured":"Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., and Anadkat, S. (2023). GPT-4 technical report. arXiv."},{"key":"ref_22","first-page":"4117","article-title":"Learning Temporal User Features for Repost Prediction with Large Language Models","volume":"82","author":"Sun","year":"2025","journal-title":"Comput. Mater. Contin."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"7028","DOI":"10.1109\/ACCESS.2024.3350641","article-title":"Leveraging diffusion modeling for remote sensing change detection in built-up urban areas","volume":"12","author":"Wan","year":"2024","journal-title":"IEEE Access"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"104140","DOI":"10.1016\/j.scs.2022.104140","article-title":"Measuring visual walkability perception using panoramic street view images, virtual reality, and deep learning","volume":"86","author":"Li","year":"2022","journal-title":"Sustain. Cities Soc."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Liu, J., Li, L., Xiang, T., Wang, B., and Qian, Y. (2023). Tcra-llm: Token compression retrieval augmented large language model for inference cost reduction. arXiv.","DOI":"10.18653\/v1\/2023.findings-emnlp.655"},{"key":"ref_26","unstructured":"Brown, T.B. (2020). Language models are few-shot learners. 
arXiv."},{"key":"ref_27","first-page":"74999","article-title":"Direct: Diagnostic reasoning for clinical notes via large language models","volume":"37","author":"Wang","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Han, Y., Liu, J., Luo, A., Wang, Y., and Bao, S. (2025). Fine-Tuning LLM-Assisted Chinese Disaster Geospatial Intelligence Extraction and Case Studies. ISPRS Int. J. Geo-Inf., 14.","DOI":"10.3390\/ijgi14020079"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"de Moraes Vestena, K., Phillipi Camboim, S., Brovelli, M.A., and Rodrigues dos Santos, D. (2024). Investigating the Performance of Open-Vocabulary Classification Algorithms for Pathway and Surface Material Detection in Urban Environments. ISPRS Int. J. Geo-Inf., 13.","DOI":"10.20944\/preprints202409.1321.v1"},{"key":"ref_30","unstructured":"Cheng, Y., Li, L., Xu, Y., Li, X., Yang, Z., Wang, W., and Yang, Y. (2023). Segment and track anything. arXiv."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Andriiashen, V., van Liere, R., van Leeuwen, T., and Batenburg, K.J. (2021). Unsupervised foreign object detection based on dual-energy absorptiometry in the food industry. J. Imaging, 7.","DOI":"10.3390\/jimaging7070104"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Xu, S., Zhang, J., and Li, Y. (2024). Knowledge-driven and diffusion model-based methods for generating historical building facades: A case study of traditional Minnan residences in China. Information, 15.","DOI":"10.3390\/info15060344"},{"key":"ref_33","first-page":"3965","article-title":"A latency-efficient integration of channel attention for ConvNets","volume":"82","author":"Park","year":"2025","journal-title":"Comput. Mater. Contin."},{"key":"ref_34","first-page":"3399","article-title":"YOLO-LFD: A Lightweight and Fast Model for Forest Fire Detection","volume":"82","author":"Wang","year":"2025","journal-title":"Comput. 
Mater. Contin."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Zhang, K., and Liu, D. (2023). Customized segment anything model for medical image segmentation. arXiv.","DOI":"10.2139\/ssrn.4495221"},{"key":"ref_36","unstructured":"Shen, Q., Yang, X., and Wang, X. (2023). Anything-3d: Towards single-view anything reconstruction in the wild. arXiv."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"He, K., Chen, X., Xie, S., Li, Y., Doll\u00e1r, P., and Girshick, R. (2022, January 19\u201323). Masked autoencoders are scalable vision learners. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01553"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Zhang, J., Wang, B., Li, L., Nakashima, Y., and Nagahara, H. (2024, January 3\u20137). Instruct me more! random prompting for visual in-context learning. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV57701.2024.00258"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Yan, Y., Wen, H., Zhong, S., Chen, W., Chen, H., Wen, Q., Zimmermann, R., and Liang, Y. (2024, January 13\u201317). Urbanclip: Learning text-enhanced urban region profiling with contrastive language-image pretraining from the web. Proceedings of the ACM Web Conference 2024, Singapore.","DOI":"10.1145\/3589334.3645378"},{"key":"ref_40","unstructured":"Hao, X., Chen, W., Yan, Y., Zhong, S., Wang, K., Wen, Q., and Liang, Y. (March, January 27). UrbanVLP: Multi-granularity vision-language pretraining for urban socioeconomic indicator prediction. Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA."},{"key":"ref_41","unstructured":"Yang, J., Ding, R., Brown, E., Qi, X., and Xie, S. (October, January 29). V-irl: Grounding virtual intelligence in real life. 
Proceedings of the European Conference on Computer Vision, Milan, Italy."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Zhang, J., Fukuda, T., and Yabuki, N. (2021). Development of a city-scale approach for fa\u00e7ade color measurement with building functional classification using deep learning and street view images. ISPRS Int. J. Geo-Inf., 10.","DOI":"10.3390\/ijgi10080551"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Naik, N., Philipoom, J., Raskar, R., and Hidalgo, C. (2014, January 23\u201328). Streetscore-predicting the perceived safety of one million streetscapes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA.","DOI":"10.1109\/CVPRW.2014.121"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"128","DOI":"10.1257\/aer.p20161030","article-title":"Cities are physical too: Using computer vision to measure the quality and impact of urban appearance","volume":"106","author":"Naik","year":"2016","journal-title":"Am. Econ. Rev."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"103","DOI":"10.1186\/1479-5868-10-103","article-title":"Developing and testing a street audit tool using Google Street View to measure environmental supportiveness for physical activity","volume":"10","author":"Griew","year":"2013","journal-title":"Int. J. Behav. Nutr. Phys. Act."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Halpern, D. (2014). Mental Health and the Built Environment: More than Bricks and Mortar?, Routledge.","DOI":"10.4324\/9781315041131"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"10956","DOI":"10.1007\/s10489-022-04072-4","article-title":"Match them up: Visually explainable few-shot image classification","volume":"53","author":"Wang","year":"2023","journal-title":"Appl. Intell."},{"key":"ref_48","unstructured":"Ghorbani, A., Wexler, J., Zou, J., and Kim, B. (2019, January 8\u201314). Towards Automatic Concept-based Explanations. 
Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada."},{"key":"ref_49","unstructured":"Ge, S., Zhang, L., and Liu, Q. (2021, January 6\u201314). Robust Concept-based Interpretability with Variational Concept Embedding. Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Online."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Laugel, T., Lesot, M.J., Marsala, C., Renard, X., and Detyniecki, M. (2019, January 10\u201316). The Dangers of Post-hoc Interpretability: Unjustified Counterfactual Explanations. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Macao, China.","DOI":"10.24963\/ijcai.2019\/388"},{"key":"ref_51","unstructured":"Loshchilov, I., and Hutter, F. (2017). Decoupled weight decay regularization. arXiv."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Zhou, Z., Zhao, Y., Zuo, H., and Chen, W. (2024, January 7\u201310). Ranking Enhanced Supervised Contrastive Learning for Regression. Proceedings of the Advances in Knowledge Discovery and Data Mining, Taipei, Taiwan.","DOI":"10.1007\/978-981-97-2253-2_2"},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"1255","DOI":"10.1109\/TMI.2021.3137854","article-title":"Adaptive contrast for image regression in computer-aided disease assessment","volume":"41","author":"Dai","year":"2021","journal-title":"IEEE Trans. Med Imaging"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Dai, W., Li, X., and Cheng, K.T. (2023, January 7\u201314). Semi-supervised deep regression with uncertainty consistency and variational model ensembling via bayesian neural networks. 
Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA.","DOI":"10.1609\/aaai.v37i6.25890"},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"106122","DOI":"10.1016\/j.cities.2025.106122","article-title":"Urban safety perception assessments via integrating multimodal large language models with street view images","volume":"165","author":"Zhang","year":"2025","journal-title":"Cities"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_57","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2021, January 4\u20138). An image is worth 16 \u00d7 16 words: Transformers for image recognition at scale. Proceedings of the International Conference on Learning Representations, Vienna, Austria."},{"key":"ref_58","first-page":"7786","article-title":"Towards robust interpretability with self-explaining neural networks","volume":"31","author":"Jaakkola","year":"2018","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_59","first-page":"8930","article-title":"This looks like that: Deep learning for interpretable image recognition","volume":"32","author":"Chen","year":"2019","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_60","unstructured":"Yuksekgonul, M., Wang, M., and Zou, J. (2022). Post-hoc concept bottleneck models. arXiv."},{"key":"ref_61","unstructured":"Oikarinen, T., Das, S., Nguyen, L.M., and Weng, T.W. (2023). Label-free concept bottleneck models. 
arXiv."}],"container-title":["ISPRS International Journal of Geo-Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2220-9964\/14\/8\/315\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:30:17Z","timestamp":1760034617000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2220-9964\/14\/8\/315"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,18]]},"references-count":61,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2025,8]]}},"alternative-id":["ijgi14080315"],"URL":"https:\/\/doi.org\/10.3390\/ijgi14080315","relation":{},"ISSN":["2220-9964"],"issn-type":[{"type":"electronic","value":"2220-9964"}],"subject":[],"published":{"date-parts":[[2025,8,18]]}}}