{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T16:55:40Z","timestamp":1775667340973,"version":"3.50.1"},"reference-count":118,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2025,7,23]],"date-time":"2025-07-23T00:00:00Z","timestamp":1753228800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>The rapid advancement of artificial intelligence (AI) and machine learning has revolutionised how systems process information, make decisions, and adapt to dynamic environments. AI-driven approaches have significantly enhanced efficiency and problem-solving capabilities across various domains, from automated decision-making to knowledge representation and predictive modelling. These developments have led to the emergence of increasingly sophisticated models capable of learning patterns, reasoning over complex data structures, and generalising across tasks. As AI systems become more deeply integrated into networked infrastructures and the Internet of Things (IoT), their ability to process and interpret data in real time is essential for optimising intelligent communication networks, distributed decision-making, and autonomous IoT systems. However, despite these achievements, the internal mechanisms that drive the reasoning and generalisation capabilities of large language models (LLMs) remain largely unexplored. This lack of transparency, compounded by challenges such as hallucinations, adversarial perturbations, and misaligned human expectations, raises concerns about their safe and beneficial deployment. Understanding the underlying principles governing AI models is crucial for their integration into intelligent network systems, automated decision-making processes, and secure digital infrastructures. 
This paper provides a comprehensive analysis of explainability approaches aimed at uncovering the fundamental mechanisms of LLMs. We investigate the strategic components contributing to their generalisation abilities, focusing on methods to quantify acquired knowledge and assess its representation within model parameters. Specifically, we examine mechanistic interpretability, probing techniques, and representation engineering as tools to decipher how knowledge is structured, encoded, and retrieved in AI systems. Furthermore, by adopting a mechanistic perspective, we analyse emergent phenomena within training dynamics, particularly memorisation and generalisation, which also play a crucial role in broader AI-driven systems, including adaptive network intelligence, edge computing, and real-time decision-making architectures. Understanding these principles is crucial for bridging the gap between black-box AI models and practical, explainable AI applications, thereby ensuring trust, robustness, and efficiency in language-based and general AI systems.<\/jats:p>","DOI":"10.3390\/bdcc9080193","type":"journal-article","created":{"date-parts":[[2025,7,23]],"date-time":"2025-07-23T10:49:17Z","timestamp":1753267757000},"page":"193","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Survey on the Role of Mechanistic Interpretability in Generative AI"],"prefix":"10.3390","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8488-4146","authenticated-orcid":false,"given":"Leonardo","family":"Ranaldi","sequence":"first","affiliation":[{"name":"School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK"},{"name":"Human-Centric ART, University of Rome Tor Vergata, Viale del Politecnico, 1, 00133 Rome, Italy"}]}],"member":"1968","published-online":{"date-parts":[[2025,7,23]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Marques, N., Silva, R.R., and Bernardino, 
J. (2024). Using ChatGPT in Software Requirements Engineering: A Comprehensive Review. Future Internet, 16.","DOI":"10.3390\/fi16060180"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Ranaldi, L., and Pucci, G. (2023). Knowing Knowledge: Epistemological Study of Knowledge in Transformers. Appl. Sci., 13.","DOI":"10.3390\/app13020677"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Peng, J., and Zhong, K. (2024). Accelerating and Compressing Transformer-Based PLMs for Enhanced Comprehension of Computer Terminology. Future Internet, 16.","DOI":"10.20944\/preprints202409.2415.v1"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Ranaldi, L., Fallucchi, F., and Zanzotto, F.M. (2022). Dis-Cover AI Minds to Preserve Human Knowledge. Future Internet, 14.","DOI":"10.3390\/fi14010010"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Gifu, D., and Silviu-Vasile, C. (2025). Artificial Intelligence vs. Human: Decoding Text Authenticity with Transformers. Future Internet, 17.","DOI":"10.3390\/fi17010038"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Li, J., and Maiti, A. (2025). Applying Large Language Model Analysis and Backend Web Services in Regulatory Technologies for Continuous Compliance Checks. Future Internet, 17.","DOI":"10.3390\/fi17030100"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Petrillo, L., Martinelli, F., Santone, A., and Mercaldo, F. (2025). Explainable Security Requirements Classification Through Transformer Models. Future Internet, 17.","DOI":"10.3390\/fi17010015"},{"key":"ref_8","unstructured":"Korhonen, A., Traum, D., and M\u00e0rquez, L. (August, January 28). BERT Rediscovers the Classical NLP Pipeline. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_9","unstructured":"Goldberg, Y., Kozareva, Z., and Zhang, Y. (2022, January 7\u201311). Can language models learn from explanations in context?. 
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Aggrawal, S., and Magana, A.J. (2024). Teamwork Conflict Management Training and Conflict Resolution Practice via Large Language Models. Future Internet, 16.","DOI":"10.3390\/fi16050177"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Babaey, V., and Ravindran, A. (2025). GenSQLi: A Generative Artificial Intelligence Framework for Automatically Securing Web Application Firewalls Against Structured Query Language Injection Attacks. Future Internet, 17.","DOI":"10.3390\/fi17010008"},{"key":"ref_12","unstructured":"Savary, A., and Zhang, Y. (2020). Interpretability and Analysis in Neural NLP. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, Association for Computational Linguistics."},{"key":"ref_13","unstructured":"Golechha, S., and Dao, J. (2024). Position Paper: Toward New Frameworks for Studying Model Representations. arXiv."},{"key":"ref_14","unstructured":"Jermyn, A.S., Schiefer, N., and Hubinger, E. (2022). Engineering monosemanticity in toy models. arXiv."},{"key":"ref_15","unstructured":"Bricken,  T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., and Askell, A. (2025, January 01). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread. Available online: https:\/\/transformer-circuits.pub\/2023\/monosemantic-features\/index.html."},{"key":"ref_16","unstructured":"Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., and Chen, C. (2024, April 28). Toy Models of Superposition. Transformer Circuits Thread. 
Available online: https:\/\/transformer-circuits.pub\/2022\/toy_model\/index.html."},{"key":"ref_17","unstructured":"Scherlis, A., Sachan, K., Jermyn, A.S., Benton, J., and Shlegeris, B. (2022). Polysemanticity and capacity in neural networks. arXiv."},{"key":"ref_18","unstructured":"Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troitskii, D., and Bertsimas, D. (2023). Finding Neurons in a Haystack: Case Studies with Sparse Probing. arXiv."},{"key":"ref_19","unstructured":"Lecomte, V., Thaman, K., Schaeffer, R., Bashkansky, N., Chow, T., and Koyejo, S. (2024, January 11). What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes. Proceedings of the ICLR 2024 Workshop on Representational Alignment, Vienna, Austria."},{"key":"ref_20","unstructured":"Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., and Askell, A. (2020). Language models are few-shot learners. arXiv."},{"key":"ref_21","unstructured":"Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., and Chen, A. (2025, January 01). In-Context Learning and Induction Heads. Transformer Circuits Thread. Available online: https:\/\/transformer-circuits.pub\/2022\/in-context-learning-and-induction-heads\/index.html."},{"key":"ref_22","unstructured":"Sakarvadia, M., Khan, A., Ajith, A., Grzenda, D., Hudson, N., Bauer, A., Chard, K., and Foster, I. (2023). Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism. arXiv."},{"key":"ref_23","unstructured":"Edelman, B.L., Edelman, E., Goel, S., Malach, E., and Tsilivis, N. (2024). The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains. arXiv."},{"key":"ref_24","unstructured":"Chan, L., Garriga-Alonso, A., Goldwosky-Dill, N., Greenblatt, R., Nitishinskaya, J., Radhakrishnan, A., Shlegeris, B., and Thomas, N. (2025, January 01). 
Causal Scrubbing, A Method for Rigorously Testing Interpretability Hypotheses. AI Alignment Forum. Available online: https:\/\/www.alignmentforum.org\/posts\/JvZhhzycHu2Yd57RN\/causal-scrubbing-a-method-for-rigorously-testing."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Neo, C., Cohen, S.B., and Barez, F. (2024). Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions. arXiv.","DOI":"10.18653\/v1\/2024.emnlp-main.930"},{"key":"ref_26","unstructured":"Conmy, A., Mavor-Parker, A.N., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. (2023). Towards Automated Circuit Discovery for Mechanistic Interpretability. arXiv."},{"key":"ref_27","unstructured":"Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., and Conerly, T. (2025, January 01). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread. Available online: https:\/\/transformer-circuits.pub\/2021\/framework\/index.html."},{"key":"ref_28","unstructured":"Yu, Z., and Ananiadou, S. (2024). Locating Factual Knowledge in Large Language Models: Exploring the Residual Stream and analysing Subvalues in Vocabulary Space. arXiv."},{"key":"ref_29","unstructured":"Todd, E., Li, M.L., Sharma, A.S., Mueller, A., Wallace, B.C., and Bau, D. (2024). Function Vectors in Large Language Models. arXiv."},{"key":"ref_30","unstructured":"Shai, A.S., Marzen, S.E., Teixeira, L., Oldenziel, A.G., and Riechers, P.M. (2024). Transformers represent belief state geometry in their residual stream. arXiv."},{"key":"ref_31","first-page":"17359","article-title":"Locating and Editing Factual Associations in GPT","volume":"35","author":"Meng","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_32","unstructured":"Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., and Dombrowski, A.K. (2023). Representation Engineering: A Top-Down Approach to AI Transparency. 
arXiv."},{"key":"ref_33","unstructured":"Liu, W., Wang, X., Wu, M., Li, T., Lv, C., Ling, Z., Zhu, J., Zhang, C., Zheng, X., and Huang, X. (2023). Aligning large language models with human preferences through representation engineering. arXiv."},{"key":"ref_34","unstructured":"Lee, A., Bai, X., Pres, I., Wattenberg, M., Kummerfeld, J.K., and Mihalcea, R. (2024). A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. arXiv."},{"key":"ref_35","unstructured":"Hazineh, D.S., Zhang, Z., and Chiu, J. (2023). Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT. arXiv."},{"key":"ref_36","unstructured":"Marks, S., and Tegmark, M. (2023). The geometry of truth: Emergent linear structure in large language model representations of true\/false datasets. arXiv."},{"key":"ref_37","unstructured":"Burns, C., Ye, H., Klein, D., and Steinhardt, J. (2024). Discovering Latent Knowledge in Language Models Without Supervision. arXiv."},{"key":"ref_38","unstructured":"Li, K., Hopkins, A.K., Bau, D., Vi\u00e9gas, F., Pfister, H., and Wattenberg, M. (2022). Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv."},{"key":"ref_39","unstructured":"Gurnee, W., and Tegmark, M. (2023). Language models represent space and time. arXiv."},{"key":"ref_40","unstructured":"Ju, T., Sun, W., Du, W., Yuan, X., Ren, Z., and Liu, G. (2024). How Large Language Models Encode Context Knowledge? A Layer-Wise Probing Study. arXiv."},{"key":"ref_41","unstructured":"Wu, Z., Arora, A., Wang, Z., Geiger, A., Jurafsky, D., Manning, C.D., and Potts, C. (2024). ReFT: Representation Finetuning for Language Models. arXiv."},{"key":"ref_42","unstructured":"Jin, M., Yu, Q., Huang, J., Zeng, Q., Wang, Z., Hua, W., Zhao, H., Mei, K., Meng, Y., and Ding, K. (2024). Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?. 
arXiv."},{"key":"ref_43","unstructured":"Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. (2023). Progress measures for grokking via mechanistic interpretability. arXiv."},{"key":"ref_44","unstructured":"Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. (2022). Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Murty, S., Sharma, P., Andreas, J., and Manning, C.D. (2023). Grokking of Hierarchical Structure in Vanilla Transformers. arXiv.","DOI":"10.18653\/v1\/2023.acl-short.38"},{"key":"ref_46","unstructured":"Huang, Y., Hu, S., Han, X., Liu, Z., and Sun, M. (2024). Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition. arXiv."},{"key":"ref_47","unstructured":"Liu, Z., Michaud, E.J., and Tegmark, M. (2022). Omnigrok: Grokking beyond algorithmic data. arXiv."},{"key":"ref_48","unstructured":"Thilak, V., Littwin,  E., Zhai, S., Saremi, O., Paiss, R., and Susskind, J.M. (2025, January 01). The Slingshot Effect: A Late-Stage Optimization Anomaly in Adaptive Gradient Methods. Transactions on Machine Learning Research. Available online: https:\/\/machinelearning.apple.com\/research\/slingshot-effect."},{"key":"ref_49","unstructured":"Furuta, H., Minegishi, G., Iwasawa, Y., and Matsuo, Y. (2024). Interpreting Grokked Transformers in Complex Modular Arithmetic. arXiv."},{"key":"ref_50","unstructured":"Chen, S., Sheen, H., Wang, T., and Yang, Z. (2024). Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality. arXiv."},{"key":"ref_51","first-page":"34651","article-title":"Towards understanding grokking: An effective theory of representation learning","volume":"35","author":"Liu","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_52","unstructured":"Zhu, X., Fu, Y., Zhou, B., and Lin, Z. (2024). 
Critical data size of language models from a grokking perspective. arXiv."},{"key":"ref_53","unstructured":"Wang, B., Yue, X., Su, Y., and Sun, H. (2024). Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization. arXiv."},{"key":"ref_54","unstructured":"Rajendran, G., Buchholz, S., Aragam, B., Sch\u00f6lkopf, B., and Ravikumar, P. (2024). Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models. arXiv."},{"key":"ref_55","unstructured":"Tamkin, A., Askell, A., Lovitt, L., Durmus, E., Joseph, N., Kravec, S., Nguyen, K., Kaplan, J., and Ganguli, D. (2023). Evaluating and mitigating discrimination in language model decisions. arXiv."},{"key":"ref_56","unstructured":"Doshi, D., Das, A., He, T., and Gromov, A. (2023). To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets. arXiv."},{"key":"ref_57","unstructured":"Kumar, T., Bordelon, B., Gershman, S.J., and Pehlevan, C. (2023). Grokking as the Transition from Lazy to Rich Training Dynamics. arXiv."},{"key":"ref_58","unstructured":"Hase, P., Bansal, M., Kim, B., and Ghandeharioun, A. (2023). Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models. arXiv."},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Stolfo, A., Belinkov, Y., and Sachan, M. (2023, January 6\u201310). A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore.","DOI":"10.18653\/v1\/2023.emnlp-main.435"},{"key":"ref_60","unstructured":"Bouamor, H., Pino, J., and Bali, K. (2023, January 6\u201310). Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models. 
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore."},{"key":"ref_61","unstructured":"Prakash, N., Shaham, T.R., Haklay, T., Belinkov, Y., and Bau, D. (2024). Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking. arXiv."},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Cohen, R., Biran, E., Yoran, O., Globerson, A., and Geva, M. (2023). Evaluating the ripple effects of knowledge editing in language models. arXiv.","DOI":"10.1162\/tacl_a_00644"},{"key":"ref_63","doi-asserted-by":"crossref","unstructured":"Xu, D., Zhang, Z., Zhu, Z., Lin, Z., Liu, Q., Wu, X., Xu, T., Zhao, X., Zheng, Y., and Chen, E. (2024). Editing Factual Knowledge and Explanatory Ability of Medical Large Language Models. arXiv.","DOI":"10.1145\/3627673.3679673"},{"key":"ref_64","unstructured":"Stoehr, N., Gordon, M., Zhang, C., and Lewis, O. (2024). Localizing Paragraph Memorization in Language Models. arXiv."},{"key":"ref_65","unstructured":"Sharma, A.S., Atkinson, D., and Bau, D. (2024). Locating and Editing Factual Associations in Mamba. arXiv."},{"key":"ref_66","unstructured":"Yang, Y., Duan, H., Abbasi, A., Lalor, J.P., and Tam, K.Y. (2023). Bias A-head? analysing Bias in Transformer-Based Language Model Attention Heads. arXiv."},{"key":"ref_67","doi-asserted-by":"crossref","unstructured":"Jin, Z., Cao, P., Yuan, H., Chen, Y., Xu, J., Li, H., Jiang, X., Liu, K., and Zhao, J. (2024). Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models. arXiv.","DOI":"10.18653\/v1\/2024.findings-acl.70"},{"key":"ref_68","unstructured":"Wong, K.F., Knight, K., and Wu, H. (2020, January 4\u20137). A Survey of the State of Explainable AI for Natural Language Processing. 
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Suzhou, China."},{"key":"ref_69","unstructured":"Saeed, W., and Omlin, C. (2021). Explainable AI (XAI): A Systematic Meta-Survey of Current Challenges and Future Opportunities. arXiv."},{"key":"ref_70","unstructured":"Luo, H., and Specia, L. (2024). From Understanding to Utilization: A Survey on Explainability for Large Language Models. arXiv."},{"key":"ref_71","unstructured":"Ferrando, J., Sarti, G., Bisazza, A., and Costa-juss\u00e0, M.R. (2024). A Primer on the Inner Workings of Transformer-based Language Models. arXiv."},{"key":"ref_72","doi-asserted-by":"crossref","unstructured":"Papageorgiou, E., Chronis, C., Varlamis, I., and Himeur, Y. (2024). A Survey on the Use of Large Language Models (LLMs) in Fake News. Future Internet, 16.","DOI":"10.3390\/fi16080298"},{"key":"ref_73","doi-asserted-by":"crossref","unstructured":"Hang, C.N., Yu, P.D., Morabito, R., and Tan, C.W. (2024). Large Language Models Meet Next-Generation Networking Technologies: A Review. Future Internet, 16.","DOI":"10.3390\/fi16100365"},{"key":"ref_74","unstructured":"Krueger, D.S. (2024, April 28). Mechanistic Interpretability as Reverse Engineering (Follow-Up to \u201cCars and Elephants\u201d) \u2014 AI Alignment Forum \u2014 alignmentforum.org. Available online: https:\/\/www.alignmentforum.org\/posts\/kjRGMdRxXb9c5bWq5\/mechanistic-interpretability-as-reverse-engineering-follow."},{"key":"ref_75","doi-asserted-by":"crossref","unstructured":"Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. (2024, April 28). Zoom In: An Introduction to Circuits. Distill. Available online: https:\/\/distill.pub\/2020\/circuits\/zoom-in.","DOI":"10.23915\/distill.00024.001"},{"key":"ref_76","unstructured":"Goldberg, Y., Kozareva, Z., and Zhang, Y. (2022, January 7\u201311). 
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates."},{"key":"ref_77","unstructured":"Hanna, M., Liu, O., and Variengien, A. (2023). How does GPT-2 compute greater-than? Interpreting mathematical abilities in a pre-trained language model. arXiv."},{"key":"ref_78","unstructured":"Bereska, L., and Gavves, E. (2024). Mechanistic Interpretability for AI Safety\u2014A Review. arXiv."},{"key":"ref_79","unstructured":"Friedman, D., Wettig, A., and Chen, D. (2023). Learning Transformer Programs. arXiv."},{"key":"ref_80","unstructured":"Zimmermann, R.S., Klein, T., and Brendel, W. (2024). Scale Alone Does not Improve Mechanistic Interpretability in Vision Models. arXiv."},{"key":"ref_81","doi-asserted-by":"crossref","unstructured":"Ribeiro, M.T., Singh, S., and Guestrin, C. (2016). \u201cWhy Should I Trust You?\u201d: Explaining the Predictions of Any Classifier. arXiv.","DOI":"10.18653\/v1\/N16-3020"},{"key":"ref_82","first-page":"4768","article-title":"A unified approach to interpreting model predictions","volume":"30","author":"Lundberg","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_83","unstructured":"Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv."},{"key":"ref_84","unstructured":"Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L.A., Welbl, J., and Clark, A. (2022). Training Compute-Optimal Large Language Models. arXiv."},{"key":"ref_85","unstructured":"Das, A., and Rad, P. (2020). Opportunities and Challenges in Explainable Artificial Intelligence (XAI): A Survey. 
arXiv."},{"key":"ref_86","doi-asserted-by":"crossref","first-page":"nwad267","DOI":"10.1093\/nsr\/nwad267","article-title":"Large language models and brain-inspired general intelligence","volume":"10","author":"Xu","year":"2023","journal-title":"Natl. Sci. Rev."},{"key":"ref_87","unstructured":"Sharkey, L., Braun, D. (2024, January 23). Interim Research Report Taking Features out of Superposition with Sparse Autoencoders. Available online: https:\/\/www.lesswrong.com\/posts\/z6QQJbtpkEAX3Aojj\/interim-research-report-taking-features-out-of-superposition."},{"key":"ref_88","doi-asserted-by":"crossref","unstructured":"McDougall, C., Conmy, A., Rushing, C., McGrath, T., and Nanda, N. (2023). Copy suppression: Comprehensively understanding an attention head. arXiv.","DOI":"10.18653\/v1\/2024.blackboxnlp-1.22"},{"key":"ref_89","doi-asserted-by":"crossref","unstructured":"Cammarata, N., Goh, G., Carter, S., Schubert, L., Petrov, M., and Olah, C. (2024, April 28). Curve Detectors. Distill. Available online: https:\/\/distill.pub\/2020\/circuits\/curve-detectors.","DOI":"10.23915\/distill.00024.003"},{"key":"ref_90","doi-asserted-by":"crossref","unstructured":"Schubert, L., Voss, C., Cammarata, N., Goh, G., and Olah, C. (2024, April 28). High-Low Frequency Detectors. Distill. Available online: https:\/\/distill.pub\/2020\/circuits\/frequency-edges.","DOI":"10.23915\/distill.00024.005"},{"key":"ref_91","doi-asserted-by":"crossref","unstructured":"Olah, C., Cammarata, N., Voss, C., Schubert, L., and Goh, G. (2024, April 28). Naturally Occurring Equivariance in Neural Networks. Distill. Available online: https:\/\/distill.pub\/2020\/circuits\/equivariance.","DOI":"10.23915\/distill.00024.004"},{"key":"ref_92","doi-asserted-by":"crossref","unstructured":"Nanda, N., Lee, A., and Wattenberg, M. (2023). Emergent linear representations in world models of self-supervised sequence models. 
arXiv.","DOI":"10.18653\/v1\/2023.blackboxnlp-1.2"},{"key":"ref_93","unstructured":"Li, M., Davies, X., and Nadeau, M. (2024). Circuit Breaking: Removing Model behaviours with Targeted Ablation. arXiv."},{"key":"ref_94","first-page":"41451","article-title":"Inference-time intervention: Eliciting truthful answers from a language model","volume":"36","author":"Li","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_95","doi-asserted-by":"crossref","unstructured":"Azaria, A., and Mitchell, T. (2023). The internal state of an llm knows when its lying. arXiv.","DOI":"10.18653\/v1\/2023.findings-emnlp.68"},{"key":"ref_96","doi-asserted-by":"crossref","unstructured":"CH-Wang, S., Van Durme, B., Eisner, J., and Kedzie, C. (2023). Do Androids Know They\u2019re Only Dreaming of Electric Sheep?. arXiv.","DOI":"10.18653\/v1\/2024.findings-acl.260"},{"key":"ref_97","unstructured":"Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., and Metzler, D. (2022). Emergent Abilities of Large Language Models. arXiv."},{"key":"ref_98","unstructured":"Thilak, V., Littwin, E., Zhai, S., Saremi, O., Paiss, R., and Susskind, J.M. (2022, January 2). The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon. Proceedings of the Has it Trained Yet? NeurIPS 2022 Workshop, New Orleans, LA, USA."},{"key":"ref_99","doi-asserted-by":"crossref","unstructured":"Bhaskar, A., Friedman, D., and Chen, D. (2024). The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models. arXiv.","DOI":"10.18653\/v1\/2024.acl-long.774"},{"key":"ref_100","unstructured":"Bushnaq, L., Mendel, J., Heimersheim, S., Braun, D., Goldowsky-Dill, N., H\u00e4nni, K., Wu, C., and Hobbhahn, M. (2024). Using Degeneracy in the Loss Landscape for Mechanistic Interpretability. arXiv."},{"key":"ref_101","unstructured":"Golechha, S. (2024). Progress Measures for Grokking on Real-world Datasets. 
arXiv."},{"key":"ref_102","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_103","unstructured":"Lyu, K., Jin, J., Li, Z., Du, S.S., Lee, J.D., and Hu, W. (2023). Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking. arXiv."},{"key":"ref_104","unstructured":"Mohamadi, M.A., Li, Z., Wu, L., and Sutherland, D. (2023, January 16). Grokking modular arithmetic can be explained by margin maximization. Proceedings of the NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning, New Orleans, LA, USA."},{"key":"ref_105","unstructured":"Merrill, W., Tsilivis, N., and Shukla, A. (2023). A Tale of Two Circuits: Grokking as Competition of Sparse and Dense Subnetworks. arXiv."},{"key":"ref_106","doi-asserted-by":"crossref","first-page":"124003","DOI":"10.1088\/1742-5468\/ac3a74","article-title":"Deep double descent: Where bigger models and more data hurt","volume":"2021","author":"Nakkiran","year":"2021","journal-title":"J. Stat. Mech. Theory Exp."},{"key":"ref_107","unstructured":"Davies, X., Langosco, L., and Krueger, D. (2023). Unifying Grokking and Double Descent. arXiv."},{"key":"ref_108","unstructured":"Chen, W., Song, J., Ren, P., Subramanian, S., Morozov, D., and Mahoney, M.W. (2024). Data-Efficient Operator Learning via Unsupervised Pretraining and In-Context Learning. arXiv."},{"key":"ref_109","unstructured":"Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv."},{"key":"ref_110","doi-asserted-by":"crossref","unstructured":"Wu, M., Liu, W., Wang, X., Li, T., Lv, C., Ling, Z., Zhu, J., Zhang, C., Zheng, X., and Huang, X. (2024). Advancing Parameter Efficiency in Fine-tuning via Representation Editing. arXiv.","DOI":"10.18653\/v1\/2024.acl-long.726"},{"key":"ref_111","unstructured":"Vlachos, A., and Augenstein, I. (2023, January 2\u20136). 
Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia."},{"key":"ref_112","unstructured":"Jain, S., Kirk, R., Lubana, E.S., Dick, R.P., Tanaka, H., Grefenstette, E., Rockt\u00e4schel, T., and Krueger, D.S. (2023). Mechanistically analysing the effects of fine-tuning on procedurally defined tasks. arXiv."},{"key":"ref_113","unstructured":"Turner, A., Thiergart, L., Udell, D., Leech, G., Mini, U., and MacDiarmid, M. (2023). Activation addition: Steering language models without optimization. arXiv."},{"key":"ref_114","unstructured":"Geiger, A., Wu, Z., Potts, C., Icard, T., and Goodman, N. (2024, January 1\u20133). Finding alignments between interpretable causal variables and distributed neural representations. Proceedings of the Causal Learning and Reasoning, PMLR, Los Angeles, CA, USA."},{"key":"ref_115","unstructured":"Cao, N.D., Aziz, W., and Titov, I. (2021). Editing Factual Knowledge in Language Models. arXiv."},{"key":"ref_116","unstructured":"Hernandez, E., Li, B.Z., and Andreas, J. (2023). Inspecting and Editing Knowledge Representations in Language Models. arXiv."},{"key":"ref_117","unstructured":"Zhang, F., and Nanda, N. (2024). Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. arXiv."},{"key":"ref_118","unstructured":"Campbell, J., Ren, R., and Guo, P. (2023). Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching. 
arXiv."}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/8\/193\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:14:30Z","timestamp":1760033670000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/8\/193"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,23]]},"references-count":118,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2025,8]]}},"alternative-id":["bdcc9080193"],"URL":"https:\/\/doi.org\/10.3390\/bdcc9080193","relation":{},"ISSN":["2504-2289"],"issn-type":[{"value":"2504-2289","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,23]]}}}