{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T18:27:04Z","timestamp":1770229624610,"version":"3.49.0"},"reference-count":50,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2024,4,24]],"date-time":"2024-04-24T00:00:00Z","timestamp":1713916800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Nagoya University","award":["JP21H04892"],"award-info":[{"award-number":["JP21H04892"]}]},{"name":"Nagoya University","award":["JP21K12073"],"award-info":[{"award-number":["JP21K12073"]}]},{"name":"Nagoya University","award":["JPMJFS2120"],"award-info":[{"award-number":["JPMJFS2120"]}]},{"name":"Japan Society for the Promotion of Science (JSPS)","award":["JP21H04892"],"award-info":[{"award-number":["JP21H04892"]}]},{"name":"Japan Society for the Promotion of Science (JSPS)","award":["JP21K12073"],"award-info":[{"award-number":["JP21K12073"]}]},{"name":"Japan Society for the Promotion of Science (JSPS)","award":["JPMJFS2120"],"award-info":[{"award-number":["JPMJFS2120"]}]},{"name":"Japan Science and Technology Agency (JST)","award":["JP21H04892"],"award-info":[{"award-number":["JP21H04892"]}]},{"name":"Japan Science and Technology Agency (JST)","award":["JP21K12073"],"award-info":[{"award-number":["JP21K12073"]}]},{"name":"Japan Science and Technology Agency (JST)","award":["JPMJFS2120"],"award-info":[{"award-number":["JPMJFS2120"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Transformer-based models have gained popularity in the field of natural language processing (NLP) and are extensively utilized in computer vision tasks and multi-modal models such as GPT4. This paper presents a novel method to enhance the explainability of transformer-based image classification models. Our method aims to improve trust in classification results and empower users to gain a deeper understanding of the model for downstream tasks by providing visualizations of class-specific maps. We introduce two modules: the \u201cRelationship Weighted Out\u201d and the \u201cCut\u201d modules. The \u201cRelationship Weighted Out\u201d module focuses on extracting class-specific information from intermediate layers, enabling us to highlight relevant features. Additionally, the \u201cCut\u201d module performs fine-grained feature decomposition, taking into account factors such as position, texture, and color. By integrating these modules, we generate dense class-specific visual explainability maps. We validate our method with extensive qualitative and quantitative experiments on the ImageNet dataset. Furthermore, we conduct a large number of experiments on the LRN dataset, which is specifically designed for automatic driving danger alerts, to evaluate the explainability of our method in scenarios with complex backgrounds. The results demonstrate a significant improvement over previous methods. Moreover, we conduct ablation experiments to validate the effectiveness of each module. Through these experiments, we are able to confirm the respective contributions of each module, thus solidifying the overall effectiveness of our proposed approach.<\/jats:p>","DOI":"10.3390\/s24092695","type":"journal-article","created":{"date-parts":[[2024,4,24]],"date-time":"2024-04-24T07:38:51Z","timestamp":1713944331000},"page":"2695","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["R-Cut: Enhancing Explainability in Vision Transformers with Relationship Weighted Out and Cut"],"prefix":"10.3390","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9253-2527","authenticated-orcid":false,"given":"Yingjie","family":"Niu","sequence":"first","affiliation":[{"name":"Graduate School of Informatics, Nagoya University, Nagoya 464-8603, Japan"}]},{"given":"Ming","family":"Ding","sequence":"additional","affiliation":[{"name":"Graduate School of Informatics, Nagoya University, Nagoya 464-8603, Japan"}]},{"given":"Maoning","family":"Ge","sequence":"additional","affiliation":[{"name":"Graduate School of Informatics, Nagoya University, Nagoya 464-8603, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6106-9616","authenticated-orcid":false,"given":"Robin","family":"Karlsson","sequence":"additional","affiliation":[{"name":"Graduate School of Informatics, Nagoya University, Nagoya 464-8603, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8082-301X","authenticated-orcid":false,"given":"Yuxiao","family":"Zhang","sequence":"additional","affiliation":[{"name":"Graduate School of Informatics, Nagoya University, Nagoya 464-8603, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5941-2195","authenticated-orcid":false,"given":"Alexander","family":"Carballo","sequence":"additional","affiliation":[{"name":"Graduate School of Informatics, Nagoya University, Nagoya 464-8603, Japan"},{"name":"Graduate School of Engineering, Gifu University, Gifu 501-1112, Japan"}]},{"given":"Kazuya","family":"Takeda","sequence":"additional","affiliation":[{"name":"Graduate School of Informatics, Nagoya University, Nagoya 464-8603, Japan"},{"name":"Tier IV Inc., Tokyo 140-0001, Japan"}]}],"member":"1968","published-online":{"date-parts":[[2024,4,24]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Samek, W., Montavon, G., Vedaldi, A., Hansen, L.K., and M\u00fcller, K.R. (2019). Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Springer Nature.","DOI":"10.1007\/978-3-030-28954-6"},{"key":"ref_2","unstructured":"Marcinkevics, R., and Vogt, J.E. (2020). Interpretability and Explainability: A Machine Learning Zoo Mini-tour. arXiv."},{"key":"ref_3","first-page":"130","article-title":"Decision tree methods: Applications for classification and prediction","volume":"27","author":"Song","year":"2015","journal-title":"Shanghai Arch. Psychiatry"},{"key":"ref_4","unstructured":"Kleinbaum, D.G., Dietz, K., Gail, M., Klein, M., and Klein, M. (2002). Logistic Regression, Springer."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Weisberg, S. (2005). Applied Linear Regression, John Wiley & Sons.","DOI":"10.1002\/0471704091"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015, January 7\u201312). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_8","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv."},{"key":"ref_9","unstructured":"Tan, M., and Le, Q. (2019, January 9\u201315). Efficientnet: Rethinking model scaling for convolutional neural networks. Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA."},{"key":"ref_10","unstructured":"Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv."},{"key":"ref_11","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_12","unstructured":"Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and J\u00e9gou, H. (2021, January 18\u201324). Training data-efficient image transformers & distillation through attention. Proceedings of the International Conference on Machine Learning, PMLR, Virtual."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Caron, M., Touvron, H., Misra, I., J\u00e9gou, H., Mairal, J., Bojanowski, P., and Joulin, A. (2021, January 11\u201317). Emerging properties in self-supervised vision transformers. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., and Shao, L. (2021, January 11\u201317). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 11\u201317). Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"2300286","DOI":"10.1002\/aisy.202300286","article-title":"High-Resolution Range Profile Sequence Recognition Based on Transformer with Temporal\u2013Spatial Fusion and Label Smoothing","volume":"5","author":"Wang","year":"2023","journal-title":"Adv. Intell. Syst."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Qu, Y., and Kim, J. (2024). Enhancing Query Formulation for Universal Image Segmentation. Sensors, 24.","DOI":"10.3390\/s24061879"},{"key":"ref_18","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning transferable visual models from natural language supervision. Proceedings of the International Conference on Machine Learning, PMLR, Virtual."},{"key":"ref_19","first-page":"9694","article-title":"Align before fuse: Vision and language representation learning with momentum distillation","volume":"34","author":"Li","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_20","unstructured":"Li, J., Li, D., Xiong, C., and Hoi, S. (2022, January 17\u201323). Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., and Hwang, J.N. (2022, January 18\u201324). Grounded language-image pre-training. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01069"},{"key":"ref_22","first-page":"27730","article-title":"Training language models to follow instructions with human feedback","volume":"35","author":"Ouyang","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Pruthi, D., Gupta, M., Dhingra, B., Neubig, G., and Lipton, Z.C. (2019). Learning to Deceive with Attention-Based Explanations. arXiv.","DOI":"10.18653\/v1\/2020.acl-main.432"},{"key":"ref_24","unstructured":"Vig, J. (2019). Visualizing attention in transformer-based language representation models. arXiv."},{"key":"ref_25","unstructured":"Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., and Kaiser, \u0141. (2018). Universal Transformers. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Abnar, S., and Zuidema, W. (2020). Quantifying attention flow in transformers. arXiv.","DOI":"10.18653\/v1\/2020.acl-main.385"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Chefer, H., Gur, S., and Wolf, L. (2021, January 19\u201325). Transformer interpretability beyond attention visualization. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00084"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"ImageNet Large Scale Visual Recognition Challenge","volume":"115","author":"Russakovsky","year":"2015","journal-title":"Int. J. Comput. Vis. (IJCV)"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Niu, Y., Ding, M., Zhang, Y., Ohtani, K., and Takeda, K. (2022, January 5\u20139). Auditory and visual warning information generation of the risk object in driving scenes based on weakly supervised learning. Proceedings of the 2022 IEEE Intelligent Vehicles Symposium (IV), Aachen, Germany.","DOI":"10.1109\/IV51971.2022.9827382"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. (2015, January 7\u201312). Learning Deep Features for Discriminative Localization. Proceedings of the Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2016.319"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017, January 22\u201329). Grad-cam: Visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.74"},{"key":"ref_32","unstructured":"Ramaswamy, H.G. (2020, January 1\u20135). Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Chattopadhay, A., Sarkar, A., Howlader, P., and Balasubramanian, V.N. (2018, January 12\u201315). Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA.","DOI":"10.1109\/WACV.2018.00097"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Muhammad, M.B., and Yeasin, M. (2020, January 19\u201324). Eigen-cam: Class activation map using principal components. Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK.","DOI":"10.1109\/IJCNN48605.2020.9206626"},{"key":"ref_35","unstructured":"Draelos, R.L., and Carin, L. (2020). HiResCAM: Faithful location representation in visual attention for explainable 3d medical image classification. arXiv."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"5875","DOI":"10.1109\/TIP.2021.3089943","article-title":"Layercam: Exploring hierarchical class activation maps for localization","volume":"30","author":"Jiang","year":"2021","journal-title":"IEEE Trans. Image Process."},{"key":"ref_37","unstructured":"Fu, R., Hu, Q., Dong, X., Guo, Y., Gao, Y., and Li, B. (2020). Axiom-based Grad-CAM: Towards Accurate Visualization and Explanation of CNNs. arXiv."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Wang, H., Wang, Z., Du, M., Yang, F., Zhang, Z., Ding, S., Mardziel, P., and Hu, X. (2020, January 14\u201319). Score-CAM: Score-weighted visual explanations for convolutional neural networks. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA.","DOI":"10.1109\/CVPRW50498.2020.00020"},{"key":"ref_39","unstructured":"Chang, C.H., Creager, E., Goldenberg, A., and Duvenaud, D. (2018). Explaining Image Classifiers by Counterfactual Generation. arXiv."},{"key":"ref_40","unstructured":"Dabkowski, P., and Gal, Y. (2017). Real Time Image Saliency for Black Box Classifiers. arXiv."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Fong, R., and Vedaldi, A. (2017, January 22\u201329). Interpretable Explanations of Black Boxes by Meaningful Perturbation. Proceedings of the International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.371"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Ribeiro, M.T., Singh, S., and Guestrin, C. (2016, January 12\u201317). \u201cWhy Should I Trust You?\u201d: Explaining the Predictions of Any Classifier. Proceedings of the North American Chapter of the Association for Computational Linguistics, San Diego, CA, USA.","DOI":"10.18653\/v1\/N16-3020"},{"key":"ref_43","unstructured":"Orhan, A.E. (2017). Skip Connections as Effective Symmetry-Breaking. arXiv."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Bach, S., Binder, A., Montavon, G., Klauschen, F., M\u00fcller, K.R., and Samek, W. (2015). On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS ONE, 10.","DOI":"10.1371\/journal.pone.0130140"},{"key":"ref_45","unstructured":"Hooker, S., Erhan, D., Kindermans, P.J., and Kim, B. (2019). A benchmark for interpretability methods in deep neural networks. Adv. Neural Inf. Process. Syst., 32."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., and Savarese, S. (2019, January 15\u201320). Generalized intersection over union: A metric and a loss for bounding box regression. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00075"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"888","DOI":"10.1109\/34.868688","article-title":"Normalized cuts and image segmentation","volume":"22","author":"Shi","year":"2000","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_48","unstructured":"Wilkinson, J.H., and Moler, C.B. (2003). Encyclopedia of Computer Science, Association for Computing Machinery."},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"88","DOI":"10.1016\/j.akcej.2017.10.007","article-title":"Lower bounds for the energy of graphs","volume":"15","author":"Jahanbani","year":"2018","journal-title":"AKCE Int. J. Graphs Comb."},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"430","DOI":"10.1137\/0611030","article-title":"Partitioning sparse matrices with eigenvectors of graphs","volume":"11","author":"Pothen","year":"1990","journal-title":"SIAM J. Matrix Anal. Appl."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/9\/2695\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T14:33:04Z","timestamp":1760106784000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/9\/2695"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,4,24]]},"references-count":50,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2024,5]]}},"alternative-id":["s24092695"],"URL":"https:\/\/doi.org\/10.3390\/s24092695","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,4,24]]}}}