{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,18]],"date-time":"2025-10-18T21:01:28Z","timestamp":1760821288792,"version":"build-2065373602"},"reference-count":47,"publisher":"MDPI AG","issue":"10","license":[{"start":{"date-parts":[[2021,10,18]],"date-time":"2021-10-18T00:00:00Z","timestamp":1634515200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research Council","doi-asserted-by":"publisher","award":["EP\/S000356\/1"],"award-info":[{"award-number":["EP\/S000356\/1"]}],"id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>In solving challenging pattern recognition problems, deep neural networks have shown excellent performance by forming powerful mappings between inputs and targets, learning representations (features) and making subsequent predictions. A recent tool to help understand how representations are formed is based on observing the dynamics of learning on an information plane using mutual information, linking the input to the representation (I(X;T)) and the representation to the target (I(T;Y)). In this paper, we use an information theoretical approach to understand how Cascade Learning (CL), a method to train deep neural networks layer-by-layer, learns representations, as CL has shown comparable results while saving computation and memory costs. We observe that performance is not linked to information\u2013compression, which differs from observation on End-to-End (E2E) learning. Additionally, CL can inherit information about targets, and gradually specialise extracted features layer-by-layer. 
We evaluate this effect by proposing an information transition ratio, I(T;Y)\/I(X;T), and show that it can serve as a useful heuristic in setting the depth of a neural network that achieves satisfactory accuracy of classification.<\/jats:p>","DOI":"10.3390\/e23101360","type":"journal-article","created":{"date-parts":[[2021,10,19]],"date-time":"2021-10-19T22:52:47Z","timestamp":1634683967000},"page":"1360","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Information Bottleneck Theory Based Exploration of Cascade Learning"],"prefix":"10.3390","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7948-4081","authenticated-orcid":false,"given":"Xin","family":"Du","sequence":"first","affiliation":[{"name":"School of Electronics and Computer Science, University of Southampton, Southampton SO17 3AS, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6775-127X","authenticated-orcid":false,"given":"Katayoun","family":"Farrahi","sequence":"additional","affiliation":[{"name":"School of Electronics and Computer Science, University of Southampton, Southampton SO17 3AS, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7021-140X","authenticated-orcid":false,"given":"Mahesan","family":"Niranjan","sequence":"additional","affiliation":[{"name":"School of Electronics and Computer Science, University of Southampton, Southampton SO17 3AS, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,10,18]]},"reference":[{"key":"ref_1","unstructured":"Cover, T.M., and Thomas, J.A. (2012). 
Elements of Information Theory, John Wiley & Sons."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"379","DOI":"10.1002\/j.1538-7305.1948.tb01338.x","article-title":"A mathematical theory of communication","volume":"27","author":"Shannon","year":"1948","journal-title":"Bell Syst. Tech. J."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1999","DOI":"10.1016\/S0031-3203(98)00181-2","article-title":"Extracting decision trees from trained neural networks","volume":"32","author":"Krishnan","year":"1999","journal-title":"Pattern Recognit."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"76","DOI":"10.1145\/507338.507355","article-title":"Data mining: Practical machine learning tools and techniques with java implementations","volume":"31","author":"Witten","year":"2002","journal-title":"SIGMOD Rec."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1016\/0031-3203(76)90025-X","article-title":"An application of rate-distortion theory to pattern recognition and classification","volume":"8","author":"Pearl","year":"1976","journal-title":"Pattern Recognit."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"Imagenet large scale visual recognition challenge","volume":"115","author":"Russakovsky","year":"2015","journal-title":"Int. J. Comput. Vis."},{"key":"ref_7","unstructured":"Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., and Graepel, T. (2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1145\/3446776","article-title":"Understanding deep learning requires rethinking generalization","volume":"64","author":"Zhang","year":"2021","journal-title":"Commun. 
ACM"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1265","DOI":"10.1162\/neco.1995.7.6.1265","article-title":"On the practical applicability of VC dimension bounds","volume":"7","author":"Holden","year":"1995","journal-title":"Neural Comput."},{"key":"ref_10","unstructured":"Shwartz-Ziv, R., and Tishby, N. (2017). Opening the black box of deep neural networks via information. arXiv."},{"key":"ref_11","unstructured":"Tishby, N., Pereira, F.C., and Bialek, W. (1999, January 22\u201324). The information bottleneck method. Proceedings of the Annual Allerton Conference on Communications, Control and Computing, Allerton, IL, USA."},{"key":"ref_12","unstructured":"Saxe, A.M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B.D., and Cox, D.D. (May, January 30). On the information bottleneck theory of deep learning. Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada."},{"key":"ref_13","unstructured":"Amjad, R.A., and Geiger, B.C. (2018). How (not) to train your neural network using the information bottleneck principle. arXiv."},{"key":"ref_14","first-page":"1","article-title":"On information plane analyses of neural network classifiers\u2014A review","volume":"1","author":"Geiger","year":"2021","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Geiger, B.C., and Kubin, G. (2020). Information bottleneck: Theory and applications in deep learning. Entropy, 22.","DOI":"10.3390\/e22121408"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TNNLS.2018.2805098","article-title":"Deep cascade learning","volume":"29","author":"Marquez","year":"2018","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_17","unstructured":"Fahlman, S.E., and Lebiere, C. (1990, January 26\u201329). The cascade-correlation learning architecture. 
Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS), Denver, CO, USA."},{"key":"ref_18","unstructured":"Belilovsky, E., Eickenberg, M., and Oyallon, E. (2019, January 9\u201315). Greedy layerwise learning can scale to ImageNet. Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA."},{"key":"ref_19","unstructured":"Trinh, L.Q. (2019). Greedy Layerwise Training of Convolutional Neural Networks. [Ph.D. Thesis, Massachusetts Institute of Technology]."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Du, X., Farrahi, K., and Niranjan, M. (2019, January 9\u201313). Transfer learning across human activities using a cascade neural network architecture. Proceedings of the 23rd International Symposium on Wearable Computers (ISWC), London, UK.","DOI":"10.1145\/3341163.3347730"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2006, January 4\u20137). Greedy layer-wise training of deep networks. Proceedings of the 19th International Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada.","DOI":"10.7551\/mitpress\/7503.003.0024"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"83","DOI":"10.1016\/0893-6080(95)00096-8","article-title":"Training MLPs layer by layer using an objective function for internal representations","volume":"9","author":"Denoeux","year":"1996","journal-title":"Neural Netw."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Nguyen, T.T., and Choi, J. (2019). Markov information bottleneck to improve information flow in stochastic neural networks. 
Entropy, 21.","DOI":"10.3390\/e21100976"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1716","DOI":"10.1109\/IJCNN.1999.832634","article-title":"Training MLPs layer-by-layer with the information potential","volume":"Volume 3","author":"Xu","year":"1999","journal-title":"Proceedings of the International Joint Conference on Neural Networks (IJCNN)"},{"key":"ref_25","unstructured":"Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein, J. (2017, January 4\u20139). SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA."},{"key":"ref_26","unstructured":"Dheeru, D., and Karra Taniskidou, E. (2021, October 14). UCI Machine Learning Repository. Available online: http:\/\/archive.ics.uci.edu\/ml."},{"key":"ref_27","unstructured":"Anguita, D., Ghio, A., Oneto, L., Parra, X., and Reyes-Ortiz, J.L. (2013, January 24\u201326). A public domain dataset for human activity recognition using smartphones. Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges, Belgium."},{"key":"ref_28","unstructured":"Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images. [Master\u2019s Thesis, University of Toronto]."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Deng, J., Wei, D., Socher, R., Li, L.J., Li, K., and Li, F.F. (2009). Imagenet: A large-scale hierarchical image database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Kolchinsky, A., and Tracey, B.D. (2017). Estimating mixture entropy with pairwise distances. Entropy, 19.","DOI":"10.3390\/e19070361"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Noshad, M., Zeng, Y., and Hero, A.O. 
(2019, January 12\u201317). Scalable mutual information estimation using dependence graphs. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8683351"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Silverman, B.W. (2018). Density Estimation for Statistics and Data Analysis, Routledge.","DOI":"10.1201\/9781315140919"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"3766","DOI":"10.1109\/TIT.2005.856954","article-title":"Bayesian bin distribution inference and mutual information","volume":"51","author":"Endres","year":"2005","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"52","DOI":"10.1109\/TCOM.1967.1089532","article-title":"The divergence and Bhattacharyya distance measures in signal selection","volume":"15","author":"Kailath","year":"1967","journal-title":"IEEE Trans. Commun. Technol."},{"key":"ref_35","unstructured":"Belghazi, M.I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. (2018, January 10\u201315). Mutual information neural estimation. Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden."},{"key":"ref_36","unstructured":"Wickstr\u00f8m, K., L\u00f8kse, S., Kampffmeyer, M., Yu, S., Principe, J., and Jenssen, R. (2020). Information plane analysis of deep neural networks via matrix\u2013based Renyi\u2019s entropy and tensor kernels. arXiv."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Yu, S., Alesiani, F., Yu, X., Jenssen, R., and Principe, J. (2021, January 2\u20139). Measuring dependence with matrix-based entropy functional. Proceedings of the AAAI Conference on Artificial Intelligence, Virtual.","DOI":"10.1609\/aaai.v35i12.17288"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Balda, E.R., Behboodi, A., and Mathar, R. (2018, January 17\u201319). 
An information theoretic view on learning of artificial neural networks. Proceedings of the 12th International Conference on Signal Processing and Communication Systems (ICSPCS), Cairns, QLD, Australia.","DOI":"10.1109\/ICSPCS.2018.8631758"},{"key":"ref_39","unstructured":"Balda, E.R., Behboodi, A., and Mathar, R. (2021, October 14). On the Trajectory of Stochastic Gradient Descent in the Information Plane. Available online: https:\/\/openreview.net\/forum?id=SkMON20ctX."},{"key":"ref_40","unstructured":"Chelombiev, I., Houghton, C., and O\u2019Donnell, C. (May, January 30). Adaptive estimators show information compression in deep neural networks. Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada."},{"key":"ref_41","unstructured":"Schiemer, M., and Ye, J. (2021, October 14). Revisiting the Information Plane. Available online: https:\/\/openreview.net\/forum?id=Hyljn1SFwr."},{"key":"ref_42","unstructured":"Wang, Y., Ni, Z., Song, S., Yang, L., and Huang, G. (2021, January 28\u201329). Revisiting locally supervised learning: An alternative to end-to-end training. Proceedings of the International Conference on Learning Representations (ICLR), Lisbon, Portugal."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"2225","DOI":"10.1109\/TPAMI.2019.2909031","article-title":"Learning representations for neural network-based classification using the information bottleneck principle","volume":"42","author":"Amjad","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Cheng, H., Lian, D., Gao, S., and Geng, Y. (2019). Utilizing information bottleneck to evaluate the capability of deep neural networks for image classification. 
Entropy, 21.","DOI":"10.3390\/e21050456"},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"97","DOI":"10.1162\/neco_a_01250","article-title":"On kernel method\u2013based connectionist models and supervised deep learning without backpropagation","volume":"32","author":"Duan","year":"2020","journal-title":"Neural Comput."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Ma, W.D.K., Lewis, J., and Kleijn, W.B. (2020, January 7\u201312). The HSIC bottleneck: Deep learning without back-propagation. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA.","DOI":"10.1609\/aaai.v34i04.5950"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"J\u00f3nsson, H., Cherubini, G., and Eleftheriou, E. (2020). Convergence behavior of DNNs with mutual-information-based regularization. Entropy, 22.","DOI":"10.3390\/e22070727"}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/23\/10\/1360\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T07:17:31Z","timestamp":1760167051000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/23\/10\/1360"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,18]]},"references-count":47,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2021,10]]}},"alternative-id":["e23101360"],"URL":"https:\/\/doi.org\/10.3390\/e23101360","relation":{},"ISSN":["1099-4300"],"issn-type":[{"type":"electronic","value":"1099-4300"}],"subject":[],"published":{"date-parts":[[2021,10,18]]}}}