{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T01:20:59Z","timestamp":1760059259726,"version":"build-2065373602"},"reference-count":30,"publisher":"MDPI AG","issue":"6","license":[{"start":{"date-parts":[[2025,5,31]],"date-time":"2025-05-31T00:00:00Z","timestamp":1748649600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Santa Clara University"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>Large Language models have shown a remarkable ability to \u201cconverse\u201d with humans in a natural language across myriad topics. Despite the proliferation of these models, a deep understanding of how they work under the hood remains elusive. The core of these Generative AI models is composed of layers of neural networks that employ the Transformer architecture. This architecture learns from large amounts of training data and creates new content in response to user input. In this study, we analyze the internals of the Transformer using Information Theory. To quantify the amount of information passing through a layer, we view it as an information transmission channel and compute the capacity of the channel. The highlight of our study is that, using Information-Theoretical tools, we develop techniques to visualize on an Information plane how the Transformer encodes the relationship between words in sentences while these words are projected into a high-dimensional vector space. We use Information Geometry to analyze the high-dimensional vectors in the Transformer layer and infer relationships between words based on the length of the geodesic connecting these vector distributions on a Riemannian manifold. Our tools reveal more information about these relationships than attention scores. 
In this study, we also show how Information-Theoretic analysis can help in troubleshooting learning problems in the Transformer layers.<\/jats:p>","DOI":"10.3390\/e27060589","type":"journal-article","created":{"date-parts":[[2025,6,2]],"date-time":"2025-06-02T11:40:32Z","timestamp":1748864432000},"page":"589","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Information-Theoretical Analysis of a Transformer-Based Generative AI Model"],"prefix":"10.3390","volume":"27","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5820-3714","authenticated-orcid":false,"given":"Manas","family":"Deb","sequence":"first","affiliation":[{"name":"Department of Electrical and Computer Engineering, Santa Clara University, Santa Clara, CA 95053, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3517-9779","authenticated-orcid":false,"given":"Tokunbo","family":"Ogunfunmi","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, Santa Clara University, Santa Clara, CA 95053, USA"}]}],"member":"1968","published-online":{"date-parts":[[2025,5,31]]},"reference":[{"key":"ref_1","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the NIPS 2017: The 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_2","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate, ICLR."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1109\/JSAIT.2020.2991561","article-title":"The Information Bottleneck problem and its applications in machine learning","volume":"1","author":"Goldfeld","year":"2020","journal-title":"IEEE J. Sel. Areas Inf. 
Theory"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Saxe, A.M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B.D., and Cox, D.D. (2018). On the Information Bottleneck of Deep Learning, ICLR.","DOI":"10.1088\/1742-5468\/ab3985"},{"key":"ref_5","unstructured":"Gabrie, M., Manoel, A., Luneau, C., Barbier, J., Macris, N., Krzakala, F., and Zdeborova, L. (2018, January 2\u20138). Entropy and mutual information in models of deep neural networks. Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montreal, QC, Canada."},{"key":"ref_6","unstructured":"Koeman, M., and Heskes, T. (2014, January 3\u20136). Mutual Information Estimation with Random Forests. Proceedings of the 21st International Conference, ICONIP 2014, Kuching, Malaysia."},{"key":"ref_7","unstructured":"Carrara, N., and Ernst, J. (July, January 30). On the estimation of Mutual Information. Proceedings of the MaxEnt 39th workshop on Bayesian and Maximum Entropy Methods in Science and Engineering, Garching, Germany."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"2654","DOI":"10.1109\/TCOMM.2023.3255251","article-title":"Benchmarking Neural Capacity Estimation: Viability and Reliability","volume":"71","author":"Mirkarimi","year":"2023","journal-title":"IEEE Trans. Commun."},{"key":"ref_9","first-page":"205","article-title":"Review Papers: Recent Developments in Nonparametric Density Estimation","volume":"86","author":"Izenman","year":"1991","journal-title":"J. Am. Stat. Assoc."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"066138","DOI":"10.1103\/PhysRevE.69.066138","article-title":"Estimating Mutual Information","volume":"69","author":"Kraskov","year":"2004","journal-title":"Phys. Rev. E"},{"key":"ref_11","unstructured":"Belghazi, M.I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. (2018, January 10\u201315). Mutual Information Neural Estimation. 
Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"183","DOI":"10.1002\/cpa.3160360204","article-title":"Asymptotic evaluation of certain Markov process","volume":"36","author":"Donsker","year":"1983","journal-title":"Commun. Pure Appl. Math."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Cover, T.M., and Thomas, J.A. (2005). Elements of Information Theory, John Wiley and Sons.","DOI":"10.1002\/047174882X"},{"key":"ref_14","first-page":"401","article-title":"On a measure of divergence between two statistical populations defined by their probability distributions","volume":"7","author":"Bhattacharya","year":"1946","journal-title":"Indian J. Stat."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"10259","DOI":"10.1007\/s00521-021-05789-y","article-title":"Leveraging the Bhattacharyya coefficient for uncertainty quantification in deep neural networks","volume":"33","author":"Molle","year":"2021","journal-title":"Neural Comput. Appl."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Villani, C. (2009). The Wasserstein distances. Optimal Transport, Springer.","DOI":"10.1007\/978-3-540-71050-9"},{"key":"ref_17","unstructured":"Helsinki, U. (2025, February 28). Helsinki-NLP\/opus_books. HuggingFace. Available online: https:\/\/huggingface.co\/datasets\/Helsinki-NLP\/opus_books\/viewer\/en-fr."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Deb, M., and Ogunfunmi, T. (2023, January 17\u201320). Information Channels of Deep Neural Networks. Proceedings of the IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP), Rome, Italy.","DOI":"10.1109\/MLSP55844.2023.10285953"},{"key":"ref_19","unstructured":"Ng, A.Y., Jordan, M.I., and Weiss, Y. (2001, January 3\u20138). On spectral clustering: Analysis and an algorithm. 
Proceedings of the 15th International Conference on Neural Information Processing Systems: Natural and Synthetic, Vancouver, BC, Canada."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"6210","DOI":"10.1109\/TIT.2023.3285928","article-title":"A mutual information inequality and some applications","volume":"69","author":"Lau","year":"2022","journal-title":"IEEE Int. Symp. Inf. Theory (ISIT)"},{"key":"ref_21","unstructured":"Janssen, J., Guan, V., and Robeva, E. (April, January 25). Ultra-marginal Feature Importance: Learning from Data with Causal Guarantees. Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS), Valencia, Spain."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Yeung, R.W. (2002). A First Course in Information Theory, Springer Science & Business Media.","DOI":"10.1007\/978-1-4419-8608-5"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"5522","DOI":"10.1109\/TIT.2020.2982642","article-title":"Proving and Disproving Information Inequalities: Theory and Scalable Algorithms","volume":"66","author":"Ho","year":"2020","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Amari, S.-i., and Nagaoka, H. (2007). Methods of Information Geometry, American Mathematical Society.","DOI":"10.1090\/mmono\/191"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"311","DOI":"10.1007\/s41884-024-00143-2","article-title":"On Closed-Form Expressions for the Fisher-Rao Distance","volume":"7","author":"Miyamoto","year":"2024","journal-title":"Inf. Geom."},{"key":"ref_26","unstructured":"Kullback, S. (1997). Information Theory and Statistics, Dover Publications."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Calin, O., and Udriste, C. (2014). 
Geometric Modelling in Probability and Statistics, Springer.","DOI":"10.1007\/978-3-319-07779-6"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"879","DOI":"10.1007\/s10463-016-0562-0","article-title":"The uniqueness of the Fisher metric as information metric","volume":"69","author":"Le","year":"2017","journal-title":"Ann. Inst. Stat. Math."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Tu, L.W. (2017). Differential Geometry, Springer.","DOI":"10.1007\/978-3-319-55084-8"},{"key":"ref_30","unstructured":"Charlot, N. (2025, May 05). Information Geometry. Github. Available online: https:\/\/github.com\/Noeloikeau\/information_geometry."}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/27\/6\/589\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:44:45Z","timestamp":1760031885000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/27\/6\/589"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,31]]},"references-count":30,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2025,6]]}},"alternative-id":["e27060589"],"URL":"https:\/\/doi.org\/10.3390\/e27060589","relation":{},"ISSN":["1099-4300"],"issn-type":[{"type":"electronic","value":"1099-4300"}],"subject":[],"published":{"date-parts":[[2025,5,31]]}}}