{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,16]],"date-time":"2025-10-16T00:31:22Z","timestamp":1760574682776,"version":"build-2065373602"},"publisher-location":"Cham","reference-count":40,"publisher":"Springer Nature Switzerland","isbn-type":[{"value":"9783032083234","type":"print"},{"value":"9783032083241","type":"electronic"}],"license":[{"start":{"date-parts":[[2025,10,16]],"date-time":"2025-10-16T00:00:00Z","timestamp":1760572800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,10,16]],"date-time":"2025-10-16T00:00:00Z","timestamp":1760572800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2026]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>In this work, we interpret the representations of multi-object scenes in vision encoders through the lens of structured representations. Structured representations allow modeling of individual objects distinctly and their flexible use based on the task context for both scene-level and object-specific tasks. These capabilities play a central role in human reasoning and generalization, allowing us to abstract away irrelevant details and focus on relevant information in a compact and usable form. We define structured representations as those that adhere to two specific properties: binding specific object information into discrete representation units and segregating object representations into separate sets of tokens to minimize cross-object entanglement. Based on these properties, we evaluated and compared image encoders pre-trained on classification (ViT), large vision-language models (CLIP, BLIP, FLAVA), and self-supervised methods (DINO, DINOv2). We examine the token representations by creating object-decoding tasks that measure the ability of specific tokens to capture individual objects in multi-object scenes from the COCO dataset. This analysis provides insights into how object-wise representations are distributed across tokens and layers within these vision encoders. Our findings highlight significant differences in the representation of objects depending on their relevance to the pre-training objective, with this effect particularly pronounced in the CLS token (often used for downstream tasks). Meanwhile, networks and layers that exhibit more structured representations retain better information about individual objects. To guide practical applications, we propose formal measures to quantify the two properties of structured representations, aiding in selecting and adapting vision encoders for downstream tasks. Overall, we aim to advance the understanding of object-wise structured representations in vision encoders, thus enhancing their transparency and interpretability. 
By clarifying how these models bind and segregate object-level information, we enable better-informed decisions for optimal downstream task adaptation, ultimately aligning their behaviour more closely with human reasoning.<\/jats:p>","DOI":"10.1007\/978-3-032-08324-1_16","type":"book-chapter","created":{"date-parts":[[2025,10,15]],"date-time":"2025-10-15T08:49:53Z","timestamp":1760518193000},"page":"359-382","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Interpreting the\u00a0Structure of\u00a0Multi-object Representations in\u00a0Vision Encoders"],"prefix":"10.1007","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7089-659X","authenticated-orcid":false,"given":"Tarun","family":"Khajuria","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0009-0000-1170-7464","authenticated-orcid":false,"given":"Braian Olmiro","family":"Dias","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5414-6089","authenticated-orcid":false,"given":"Marharyta","family":"Domnich","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3927-452X","authenticated-orcid":false,"given":"Jaan","family":"Aru","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,10,16]]},"reference":[{"key":"16_CR1","doi-asserted-by":"crossref","unstructured":"Aflalo, E., et al.: VL-interpret: an interactive visualization tool for interpreting vision-language transformers. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 21406\u201321415 (2022)","DOI":"10.1109\/CVPR52688.2022.02072"},{"key":"16_CR2","unstructured":"Alain, G., Bengio, Y.: Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644 (2016)"},{"key":"16_CR3","doi-asserted-by":"crossref","unstructured":"Antol, S., et al.: VQA: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425\u20132433 (2015)","DOI":"10.1109\/ICCV.2015.279"},{"key":"16_CR4","doi-asserted-by":"crossref","unstructured":"Ayzenberg, L., Giryes, R., Greenspan, H.: DINOv2 based self supervised learning for few shot medical image segmentation. In: 2024 IEEE International Symposium on Biomedical Imaging (ISBI), pp.\u00a01\u20135. IEEE (2024)","DOI":"10.1109\/ISBI56570.2024.10635439"},{"key":"16_CR5","unstructured":"Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. arxiv 2012. arXiv preprint arXiv:1206.5538 (2012)"},{"key":"16_CR6","unstructured":"Bronstein, M.M., Bruna, J., Cohen, T., Veli\u010dkovi\u0107, P.: Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478 (2021)"},{"key":"16_CR7","series-title":"Lecture Notes in Computer Science","doi-asserted-by":"publisher","first-page":"565","DOI":"10.1007\/978-3-030-58539-6_34","volume-title":"Computer Vision \u2013 ECCV 2020","author":"J Cao","year":"2020","unstructured":"Cao, J., Gan, Z., Cheng, Yu., Yu, L., Chen, Y.-C., Liu, J.: Behind the scene: revealing the secrets of pre-trained vision-and-language models. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12351, pp. 565\u2013580. Springer, Cham (2020). https:\/\/doi.org\/10.1007\/978-3-030-58539-6_34"},{"key":"16_CR8","doi-asserted-by":"crossref","unstructured":"Caron, M., et al.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 9650\u20139660 (2021)","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"16_CR9","unstructured":"Cordonnier, J.B., Loukas, A., Jaggi, M.: On the relationship between self-attention and convolutional layers. arXiv preprint arXiv:1911.03584 (2019)"},{"issue":"9","key":"16_CR10","doi-asserted-by":"publisher","first-page":"1342","DOI":"10.1038\/s41591-018-0107-6","volume":"24","author":"J Fauw","year":"2018","unstructured":"Fauw, J., et al.: Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24(9), 1342\u20131350 (2018)","journal-title":"Nat. Med."},{"key":"16_CR11","doi-asserted-by":"crossref","unstructured":"Fodor, J.A.: Concepts: Where Cognitive Science Went Wrong. Oxford University Press (1998)","DOI":"10.1093\/0198236360.001.0001"},{"key":"16_CR12","unstructured":"Greff, K., Van\u00a0Steenkiste, S., Schmidhuber, J.: On the binding problem in artificial neural networks. arXiv preprint arXiv:2012.05208 (2020)"},{"key":"16_CR13","doi-asserted-by":"crossref","unstructured":"Hewitt, J., Liang, P.: Designing and interpreting probes with control tasks. arXiv preprint arXiv:1909.03368 (2019)","DOI":"10.18653\/v1\/D19-1275"},{"key":"16_CR14","doi-asserted-by":"crossref","unstructured":"Johnson, J., Hariharan, B., Van Der\u00a0Maaten, L., Fei-Fei, L., Lawrence\u00a0Zitnick, C., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901\u20132910 (2017)","DOI":"10.1109\/CVPR.2017.215"},{"key":"16_CR15","unstructured":"Koh, P.W., et al.: Concept bottleneck models. In: International Conference on Machine Learning, pp. 5338\u20135348. PMLR (2020)"},{"key":"16_CR16","doi-asserted-by":"publisher","first-page":"32","DOI":"10.1007\/s11263-016-0981-7","volume":"123","author":"R Krishna","year":"2017","unstructured":"Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123, 32\u201373 (2017)","journal-title":"Int. J. Comput. Vis."},{"key":"16_CR17","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41586-023-06668-3","volume":"623","author":"BM Lake","year":"2023","unstructured":"Lake, B.M., Baroni, M.: Human-like systematic generalization through a meta-learning neural network. Nature 623, 1\u20137 (2023)","journal-title":"Nature"},{"key":"16_CR18","unstructured":"Lepori, M.A., Serre, T., Pavlick, E.: Break it down: evidence for structural compositionality in neural networks. arXiv preprint arXiv:2301.10884 (2023)"},{"key":"16_CR19","unstructured":"Lewis, M., Yu, Q., Merullo, J., Pavlick, E.: Does clip bind concepts? Probing compositionality in large image models. arXiv preprint arXiv:2212.10537 (2022)"},{"key":"16_CR20","doi-asserted-by":"crossref","unstructured":"Li, F., et al.: Mask DINO: towards a unified transformer-based framework for object detection and segmentation. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 3041\u20133050 (2023)","DOI":"10.1109\/CVPR52729.2023.00297"},{"key":"16_CR21","unstructured":"Li, J., Li, D., Xiong, C., Hoi, S.: Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888\u201312900. 
PMLR (2022)"},{"key":"16_CR22","series-title":"Lecture Notes in Computer Science","doi-asserted-by":"publisher","first-page":"740","DOI":"10.1007\/978-3-319-10602-1_48","volume-title":"Computer Vision \u2013 ECCV 2014","author":"T-Y Lin","year":"2014","unstructured":"Lin, T.-Y., et al.: Microsoft coco: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740\u2013755. Springer, Cham (2014). https:\/\/doi.org\/10.1007\/978-3-319-10602-1_48"},{"issue":"3","key":"16_CR23","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1145\/3236386.3241340","volume":"16","author":"ZC Lipton","year":"2018","unstructured":"Lipton, Z.C.: The mythos of model interpretability: in machine learning, the concept of interpretability is both important and slippery. Queue 16(3), 31\u201357 (2018)","journal-title":"Queue"},{"key":"16_CR24","first-page":"11525","volume":"33","author":"F Locatello","year":"2020","unstructured":"Locatello, F., et al.: Object-centric learning with slot attention. Adv. Neural. Inf. Process. Syst. 33, 11525\u201311538 (2020)","journal-title":"Adv. Neural. Inf. Process. Syst."},{"key":"16_CR25","doi-asserted-by":"publisher","first-page":"1193","DOI":"10.1162\/tacl_a_00514","volume":"10","author":"C Lovering","year":"2022","unstructured":"Lovering, C., Pavlick, E.: Unit testing for concepts in neural networks. Trans. Assoc. Comput. Linguist. 10, 1193\u20131208 (2022)","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"16_CR26","unstructured":"Oord, A.v.d., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation learning. arXiv preprint arXiv:1711.00937 (2017)"},{"key":"16_CR27","unstructured":"Oquab, M., et\u00a0al.: DINOv2: learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)"},{"issue":"2251","key":"16_CR28","doi-asserted-by":"publisher","first-page":"20220041","DOI":"10.1098\/rsta.2022.0041","volume":"381","author":"E Pavlick","year":"2023","unstructured":"Pavlick, E.: Symbols and grounding in large language models. Phil. Trans. R. Soc. A 381(2251), 20220041 (2023)","journal-title":"Phil. Trans. R. Soc. A"},{"key":"16_CR29","unstructured":"Pedregosa, F., et\u00a0al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825\u20132830 (2011)"},{"key":"16_CR30","unstructured":"Radford, A., et\u00a0al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748\u20138763. PMLR (2021)"},{"issue":"1","key":"16_CR31","doi-asserted-by":"publisher","first-page":"253","DOI":"10.1146\/annurev-neuro-092920-120559","volume":"44","author":"A Radulescu","year":"2021","unstructured":"Radulescu, A., Shin, Y.S., Niv, Y.: Human representation learning. Annu. Rev. Neurosci. 44(1), 253\u2013273 (2021)","journal-title":"Annu. Rev. Neurosci."},{"key":"16_CR32","first-page":"12116","volume":"34","author":"M Raghu","year":"2021","unstructured":"Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., Dosovitskiy, A.: Do vision transformers see like convolutional neural networks? Adv. Neural. Inf. Process. Syst. 34, 12116\u201312128 (2021)","journal-title":"Adv. Neural. Inf. Process. Syst."},{"issue":"5","key":"16_CR33","doi-asserted-by":"publisher","first-page":"206","DOI":"10.1038\/s42256-019-0048-x","volume":"1","author":"C Rudin","year":"2019","unstructured":"Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 
1(5), 206\u2013215 (2019)","journal-title":"Nat. Mach. Intell."},{"key":"16_CR34","unstructured":"Shridhar, M., Manuelli, L., Fox, D.: Cliport: what and where pathways for robotic 569 manipulation. In: Conference on Robot Learning, pp. 894\u2013906 (2021)"},{"key":"16_CR35","doi-asserted-by":"crossref","unstructured":"Singh, A., et al.: FLAVA: a foundational language and vision alignment model. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 15638\u201315650 (2022)","DOI":"10.1109\/CVPR52688.2022.01519"},{"key":"16_CR36","unstructured":"Tr\u00e4uble, F., et al.: Discrete key-value bottleneck. In: International Conference on Machine Learning, pp. 34431\u201334455. PMLR (2023)"},{"key":"16_CR37","doi-asserted-by":"crossref","unstructured":"Vobecky, A., Hurych, D., Sim\u00e9oni, O., Gidaris, S., Bursuc, A., P\u00e9rez, P., Sivic, J.: Unsupervised semantic segmentation of urban scenes via cross-modal distillation. Int. J. Comput. Vis. 133, 1\u201323 (2025)","DOI":"10.1007\/s11263-024-02320-3"},{"key":"16_CR38","unstructured":"Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., Tenenbaum, J.: Neural-symbolic VQA: disentangling reasoning from vision and language understanding. In: Advances in Neural Information Processing Systems, vol. 31 (2018)"},{"key":"16_CR39","unstructured":"Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., Zou, J.: When and why vision-language models behave like bags-of-words, and what to do about it? In: The Eleventh International Conference on Learning Representations (2022)"},{"key":"16_CR40","unstructured":"Yun, T., Bhalla, U., Pavlick, E., Sun, C.: Do vision-language pretrained models learn composable primitive concepts? arXiv preprint arXiv:2203.17271 (2022)"}],"container-title":["Communications in Computer and Information Science","Explainable Artificial Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-032-08324-1_16","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,15]],"date-time":"2025-10-15T08:50:03Z","timestamp":1760518203000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-032-08324-1_16"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,16]]},"ISBN":["9783032083234","9783032083241"],"references-count":40,"URL":"https:\/\/doi.org\/10.1007\/978-3-032-08324-1_16","relation":{},"ISSN":["1865-0929","1865-0937"],"issn-type":[{"value":"1865-0929","type":"print"},{"value":"1865-0937","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,16]]},"assertion":[{"value":"16 October 2025","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"The authors declare that they have no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Disclosure of Interests"}},{"value":"xAI","order":1,"name":"conference_acronym","label":"Conference Acronym","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"World Conference on Explainable Artificial Intelligence","order":2,"name":"conference_name","label":"Conference Name","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Istanbul","order":3,"name":"conference_city","label":"Conference City","group":{"name":"ConferenceInfo","label":"Conference 
Information"}},{"value":"T\u00fcrkiye","order":4,"name":"conference_country","label":"Conference Country","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"2025","order":5,"name":"conference_year","label":"Conference Year","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"9 July 2025","order":7,"name":"conference_start_date","label":"Conference Start Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"11 July 2025","order":8,"name":"conference_end_date","label":"Conference End Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"3","order":9,"name":"conference_number","label":"Conference Number","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"xai2025","order":10,"name":"conference_id","label":"Conference ID","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"https:\/\/xaiworldconference.com\/2025\/","order":11,"name":"conference_url","label":"Conference URL","group":{"name":"ConferenceInfo","label":"Conference Information"}}]}}