{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,2]],"date-time":"2025-08-02T16:43:11Z","timestamp":1754152991170,"version":"3.41.2"},"reference-count":47,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2025,7,22]],"date-time":"2025-07-22T00:00:00Z","timestamp":1753142400000},"content-version":"vor","delay-in-days":202,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,7,18]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Amidst the rapid advancement of artificial intelligence, research on large vision-language models (LVLMs) has emerged as a pivotal area. However, understanding their internal mechanisms remains challenging due to the limitations of existing interpretability methods, especially regarding faithfulness and plausibility. To address this, we first construct a human response interpretability dataset that evaluates the plausibility of model explanations by comparing the attention regions between the model and humans when answering the same questions. We then propose a patchwise cooperative game-based interpretability method for LVLMs, which employs Shapley values to quantify the impact of individual image patches on generation likelihood and enhances computational efficiency through a single input approximation approach. Experimental results demonstrate our method\u2019s faithfulness, plausibility, and robustness. Our method provides researchers with deeper insights into model behavior, allowing for an examination of the specific image regions each layer relies on during response generation, ultimately enhancing model reliability. 
Our code is available at https:\/\/github.com\/ZY123-GOOD\/Patchwise_Cooperative.<\/jats:p>","DOI":"10.1162\/tacl_a_00756","type":"journal-article","created":{"date-parts":[[2025,7,22]],"date-time":"2025-07-22T13:50:37Z","timestamp":1753192237000},"page":"744-759","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":0,"title":["Patchwise Cooperative Game-based Interpretability Method for Large Vision-language Models"],"prefix":"10.1162","volume":"13","author":[{"given":"Yao","family":"Zhu","sequence":"first","affiliation":[{"name":"Tsinghua University, China. ee_zhuy@zju.edu.cn"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yunjian","family":"Zhang","sequence":"additional","affiliation":[{"name":"Tsinghua University, China. sdtczyj@gmail.com"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zizhe","family":"Wang","sequence":"additional","affiliation":[{"name":"Tsinghua University, China. wangzz@act.buaa.edu.cn"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiu","family":"Yan","sequence":"additional","affiliation":[{"name":"Meituan Group, China. yanx18@tsinghua.org.cn"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Peng","family":"Sun","sequence":"additional","affiliation":[{"name":"Central University of Finance and Economics, China. 2023212399@email.cufe.edu.cn"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiangyang","family":"Ji","sequence":"additional","affiliation":[{"name":"Tsinghua University, China. 
xyji@tsinghua.edu.cn"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"281","published-online":{"date-parts":[[2025,7,18]]},"reference":[{"key":"2025072209503381000_bib1","doi-asserted-by":"publisher","first-page":"21406","DOI":"10.1109\/CVPR52688.2022.02072","article-title":"Vl-interpret: An interactive visualization tool for\n                        interpreting vision-language transformers","volume-title":"Proceedings of the IEEE\/CVF Conference on computer vision and\n                        pattern recognition","author":"Aflalo","year":"2022"},{"key":"2025072209503381000_bib2","first-page":"104","article-title":"Text or image? What is more important in\n                        cross-domain generalization capabilities of hate meme detection\n                        models?","volume-title":"Findings of the Association for\n                        Computational Linguistics: EACL 2024","author":"Aggarwal","year":"2024"},{"key":"2025072209503381000_bib3","first-page":"272","article-title":"Explaining deep neural networks with a\n                        polynomial time algorithm for shapley value approximation","volume-title":"Proceedings of the 36th International Conference on Machine\n                        Learning","author":"Ancona","year":"2019"},{"key":"2025072209503381000_bib4","first-page":"1803","article-title":"How to explain individual classification\n                        decisions","volume":"11","author":"Baehrens","year":"2010","journal-title":"The Journal of Machine Learning\n                        Research"},{"key":"2025072209503381000_bib5","first-page":"1","article-title":"Qwen-vl: A frontier large vision-language model with\n                        versatile abilities","author":"Bai","year":"2023","journal-title":"arXiv preprint\n                        arXiv:2308.12966"},{"key":"2025072209503381000_bib6","first-page":"1877","article-title":"Language models are few-shot\n                        
learners","volume-title":"Proceedings of the 34th International\n                        Conference on Neural Information Processing Systems","author":"Brown","year":"2020"},{"key":"2025072209503381000_bib7","doi-asserted-by":"publisher","first-page":"1220476","DOI":"10.3389\/frai.2023.1220476","article-title":"Interpreting vision and language generative models with\n                        semantic visual priors","volume":"6","author":"Cafagna","year":"2023","journal-title":"Frontiers in Artificial\n                        Intelligence"},{"key":"2025072209503381000_bib8","doi-asserted-by":"publisher","first-page":"397","DOI":"10.1109\/ICCV48922.2021.00045","article-title":"Generic attention-model explainability for interpreting\n                        bi-modal and encoder-decoder transformers","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer\n                        Vision","author":"Chefer","year":"2021"},{"key":"2025072209503381000_bib9","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s42256-023-00657-x","article-title":"Algorithms to estimate shapley value feature\n                        attributions","author":"Chen","year":"2023","journal-title":"Nature Machine Intelligence"},{"key":"2025072209503381000_bib10","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/978-3-031-72643-9_22","article-title":"Sharegpt4v: Improving large multi-modal models with better\n                        captions","volume-title":"European Conference on Computer\n                        Vision","author":"Chen","year":"2024"},{"key":"2025072209503381000_bib11","first-page":"1","article-title":"Fixing confirmation bias in feature\n                        attribution methods via semantic match","author":"Cin\u00e0","year":"2023","journal-title":"arXiv\n                        preprint arXiv:2307.00897"},{"key":"2025072209503381000_bib12","first-page":"3457","article-title":"Improving kernelshap: Practical shapley value 
estimation\n                        using linear regression","volume-title":"International Conference\n                        on Artificial Intelligence and Statistics","author":"Covert","year":"2021"},{"key":"2025072209503381000_bib13","article-title":"An image is worth 16\u00d716 words:\n                        Transformers for image recognition at scale","volume-title":"International Conference on Learning\n                    Representations","author":"Dosovitskiy","year":"2020"},{"key":"2025072209503381000_bib14","doi-asserted-by":"publisher","first-page":"4015","DOI":"10.1109\/ICCV51070.2023.00371","article-title":"Segment anything","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer\n                        Vision","author":"Kirillov","year":"2023"},{"key":"2025072209503381000_bib15","article-title":"Llava-med: Training a large language-and-vision assistant for\n                        biomedicine in one day","volume":"36","author":"Li","year":"2024","journal-title":"Advances in Neural\n                        Information Processing Systems"},{"key":"2025072209503381000_bib16","doi-asserted-by":"publisher","first-page":"3664","DOI":"10.1145\/3474085.3475337","article-title":"Instance-wise or class-wise? 
A tale of\n                        neighbor shapley for concept-based explanation","volume-title":"Proceedings of the 29th ACM International Conference on\n                        Multimedia","author":"Li","year":"2021"},{"key":"2025072209503381000_bib17","first-page":"9694","article-title":"Align before fuse: Vision and language\n                        representation learning with momentum distillation","volume":"34","author":"Li","year":"2021","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025072209503381000_bib18","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/2023.emnlp-main.20","article-title":"Evaluating object hallucination in large vision-language\n                        models","volume-title":"The 2023 Conference on Empirical Methods\n                        in Natural Language Processing","author":"Li","year":"2023"},{"key":"2025072209503381000_bib19","first-page":"34892","article-title":"Visual instruction tuning","volume-title":"Proceedings of the 37th International Conference on Neural\n                        Information Processing Systems","author":"Liu","year":"2023"},{"key":"2025072209503381000_bib20","article-title":"Visual instruction tuning","volume":"36","author":"Liu","year":"2024","journal-title":"Advances\n                        in Neural Information Processing Systems"},{"key":"2025072209503381000_bib21","article-title":"Vilbert: Pretraining task-agnostic visiolinguistic\n                        representations for vision-and-language tasks","volume":"32","author":"Jiasen","year":"2019","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025072209503381000_bib22","first-page":"1","article-title":"A unified approach to interpreting model\n                        predictions","volume":"30","author":"Lundberg","year":"2017","journal-title":"Advances in Neural Information\n                        Processing 
Systems"},{"key":"2025072209503381000_bib23","doi-asserted-by":"publisher","first-page":"455","DOI":"10.1145\/3514094.3534148","article-title":"Dime: Fine-grained interpretations of\n                        multimodal models via disentangled local explanations","volume-title":"Proceedings of the 2022 AAAI\/ACM Conference on AI, Ethics, and\n                        Society","author":"Lyu","year":"2022"},{"key":"2025072209503381000_bib24","article-title":"Smooth grad-cam++: An enhanced inference\n                        level visualization technique for deep convolutional neural network\n                        models","author":"Omeiza","year":"2019","journal-title":"arXiv preprint\n                    arXiv:1908.01224"},{"key":"2025072209503381000_bib25","doi-asserted-by":"publisher","first-page":"4032","DOI":"10.18653\/v1\/2023.acl-long.223","article-title":"Mm-shap: A performance-agnostic metric for\n                        measuring multimodal contributions in vision and language models &\n                        tasks","volume-title":"Proceedings of the 61st Annual Meeting of\n                        the Association for Computational Linguistics (Volume 1: Long\n                        Papers)","author":"Parcalabescu","year":"2023"},{"key":"2025072209503381000_bib26","first-page":"8748","article-title":"Learning transferable visual models from\n                        natural language supervision","volume-title":"International\n                        Conference on Machine Learning","author":"Radford","year":"2021"},{"key":"2025072209503381000_bib27","doi-asserted-by":"publisher","first-page":"90","DOI":"10.1007\/978-981-19-8746-5_7","article-title":"Investigation of explainability techniques for multimodal\n                        transformers","volume-title":"Australasian Conference on Data\n                        
Mining","author":"Ramesh","year":"2022"},{"key":"2025072209503381000_bib28","doi-asserted-by":"publisher","first-page":"1135","DOI":"10.1145\/2939672.2939778","article-title":"\u201cwhy should i trust you?\u201d\n                        Explaining the predictions of any classifier","volume-title":"Proceedings of the 22nd ACM SIGKDD International Conference on\n                        Knowledge Discovery and Data Mining","author":"Ribeiro","year":"2016"},{"key":"2025072209503381000_bib29","first-page":"18770","article-title":"A consistent and efficient evaluation\n                        strategy for attribution methods","volume-title":"International\n                        Conference on Machine Learning","author":"Rong","year":"2022"},{"key":"2025072209503381000_bib30","doi-asserted-by":"publisher","first-page":"618","DOI":"10.1109\/ICCV.2017.74","article-title":"Grad-cam: Visual explanations from deep\n                        networks via gradient-based localization","volume-title":"Proceedings of the IEEE International Conference on Computer\n                        Vision","author":"Selvaraju","year":"2017"},{"key":"2025072209503381000_bib31","doi-asserted-by":"publisher","first-page":"307","DOI":"10.1515\/9781400881970-018","article-title":"A value for n-person games","volume-title":"Contributions to the Theory of Games II","author":"Shapley","year":"1953"},{"key":"2025072209503381000_bib32","first-page":"1","article-title":"Smoothgrad: Removing noise by adding\n                        noise","author":"Smilkov","year":"2017","journal-title":"arXiv preprint arXiv:1706.03825"},{"issue":"1","key":"2025072209503381000_bib33","doi-asserted-by":"publisher","first-page":"1060","DOI":"10.1137\/15M1048070","article-title":"Shapley effects for global sensitivity\n                        analysis: Theory and computation","volume":"4","author":"Song","year":"2016","journal-title":"SIAM\/ASA Journal\n                        on Uncertainty 
Quantification"},{"key":"2025072209503381000_bib34","first-page":"1","article-title":"Striving for simplicity: The all\n                        convolutional net","volume-title":"ICLR (workshop\n                    track)","author":"Springenberg","year":"2015"},{"key":"2025072209503381000_bib35","first-page":"1","article-title":"Full-gradient representation for neural\n                        network visualization","volume":"32","author":"Srinivas","year":"2019","journal-title":"Advances in Neural\n                        Information Processing Systems"},{"key":"2025072209503381000_bib36","article-title":"Lvlm-intrepret: An interpretability tool for large\n                        vision-language models","author":"Melech Stan","year":"2024","journal-title":"arXiv preprint\n                        arXiv:2404.03118"},{"key":"2025072209503381000_bib37","article-title":"Explain any concept: Segment anything meets concept-based\n                        explanation","volume":"36","author":"Ao","year":"2024","journal-title":"Advances in Neural Information\n                        Processing Systems"},{"issue":"3\u20134","key":"2025072209503381000_bib38","doi-asserted-by":"publisher","first-page":"405","DOI":"10.1016\/S0957-4174(98)00041-4","article-title":"Ranking importance of input parameters of neural\n                        networks","volume":"15","author":"Sung","year":"1998","journal-title":"Expert Systems with Applications"},{"key":"2025072209503381000_bib39","doi-asserted-by":"publisher","first-page":"5100","DOI":"10.18653\/v1\/D19-1514","article-title":"Lxmert: Learning cross-modality encoder\n                        representations from transformers","volume-title":"Proceedings of\n                        the 2019 Conference on Empirical Methods in Natural Language Processing and\n                        the 9th International Joint Conference on Natural Language Processing\n                        
(EMNLP-IJCNLP)","author":"Tan","year":"2019"},{"key":"2025072209503381000_bib40","doi-asserted-by":"publisher","first-page":"6021","DOI":"10.1609\/aaai.v34i04.6064","article-title":"Sanity checks for saliency\n                        metrics","volume-title":"Proceedings of the AAAI Conference on\n                        Artificial Intelligence","author":"Tomsett","year":"2020"},{"key":"2025072209503381000_bib41","first-page":"1","article-title":"Llama: Open and efficient foundation\n                        language models","author":"Touvron","year":"2023","journal-title":"arXiv preprint\n                        arXiv:2302.13971"},{"key":"2025072209503381000_bib42","first-page":"1","article-title":"Ss-cam: Smoothed score-cam for sharper\n                        visual feature localization","author":"Wang","year":"2020","journal-title":"arXiv preprint\n                        arXiv:2006.14255"},{"key":"2025072209503381000_bib43","doi-asserted-by":"publisher","first-page":"24","DOI":"10.1109\/CVPRW50498.2020.00020","article-title":"Score-cam: Score-weighted visual\n                        explanations for convolutional neural networks","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and\n                        Pattern Recognition Workshops","author":"Wang","year":"2020"},{"issue":"1","key":"2025072209503381000_bib44","doi-asserted-by":"publisher","first-page":"56","DOI":"10.1109\/TVCG.2019.2934619","article-title":"The what-if tool: Interactive probing of\n                        machine learning models","volume":"26","author":"Wexler","year":"2019","journal-title":"IEEE Transactions on\n                        Visualization and Computer Graphics"},{"key":"2025072209503381000_bib45","doi-asserted-by":"publisher","first-page":"6261","DOI":"10.1109\/CVPR.2019.00642","article-title":"Interpreting cnns via decision\n                        trees","volume-title":"Proceedings of the IEEE\/CVF Conference on\n                        Computer Vision 
and Pattern Recognition","author":"Zhang","year":"2019"},{"issue":"2","key":"2025072209503381000_bib46","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3639372","article-title":"Explainability for large language models:\n                        A survey","volume":"15","author":"Zhao","year":"2024","journal-title":"ACM Transactions on Intelligent Systems\n                        and Technology"},{"key":"2025072209503381000_bib47","doi-asserted-by":"publisher","first-page":"2921","DOI":"10.1109\/CVPR.2016.319","article-title":"Learning deep features for discriminative\n                        localization","volume-title":"Proceedings of the IEEE Conference\n                        on Computer Vision and Pattern Recognition","author":"Zhou","year":"2016"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00756\/2538249\/tacl_a_00756.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00756\/2538249\/tacl_a_00756.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,22]],"date-time":"2025-07-22T13:50:43Z","timestamp":1753192243000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00756\/131837\/Patchwise-Cooperative-Game-based-Interpretability"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025]]},"references-count":47,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00756","relation":{},"ISSN":["2307-387X"],"issn-type":[{"type":"electronic","value":"2307-387X"}],"subject":[],"published-other":{"date-parts":[[2025]]},"published":{"date-parts":[[2025]]}}}