{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T03:46:14Z","timestamp":1760240774112,"version":"build-2065373602"},"reference-count":36,"publisher":"MDPI AG","issue":"10","license":[{"start":{"date-parts":[[2019,9,23]],"date-time":"2019-09-23T00:00:00Z","timestamp":1569196800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Casey and Family Foundation, the Foundational Questions Institute and the Rothberg Family Fund for Cognitive Science","award":["N.A."],"award-info":[{"award-number":["N.A."]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>The Information Bottleneck (IB) method provides an insightful and principled approach for balancing compression and prediction in representation learning. The IB objective I(X;Z) \u2212 \u03b2I(Y;Z) employs a Lagrange multiplier \u03b2 to tune this trade-off. However, in practice, not only is \u03b2 chosen empirically without theoretical guidance, there is also a lack of theoretical understanding of the relationship between \u03b2, learnability, the intrinsic nature of the dataset and model capacity. In this paper, we show that if \u03b2 is improperly chosen, learning cannot happen: the trivial representation P(Z|X) = P(Z) becomes the global minimum of the IB objective. We show how this can be avoided by identifying a sharp phase transition between the unlearnable and the learnable which arises as \u03b2 is varied. This phase transition defines the concept of IB-Learnability. We prove several sufficient conditions for IB-Learnability, which provide theoretical guidance for choosing a good \u03b2. 
We further show that IB-learnability is determined by the largest confident, typical and imbalanced subset of the examples (the conspicuous subset), and discuss its relation with model capacity. We give practical algorithms to estimate the minimum    \u03b2    for a given dataset. We also empirically demonstrate our theoretical conditions with analyses of synthetic datasets, MNIST and CIFAR10.<\/jats:p>","DOI":"10.3390\/e21100924","type":"journal-article","created":{"date-parts":[[2019,9,23]],"date-time":"2019-09-23T11:02:00Z","timestamp":1569236520000},"page":"924","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":12,"title":["Learnability for the Information Bottleneck"],"prefix":"10.3390","volume":"21","author":[{"given":"Tailin","family":"Wu","sequence":"first","affiliation":[{"name":"Department of Physics, MIT, 77 Massachusetts Ave, Cambridge, MA 02139, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3886-5619","authenticated-orcid":false,"given":"Ian","family":"Fischer","sequence":"additional","affiliation":[{"name":"Google Research, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA"}]},{"given":"Isaac L.","family":"Chuang","sequence":"additional","affiliation":[{"name":"Department of Physics, MIT, 77 Massachusetts Ave, Cambridge, MA 02139, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7670-7190","authenticated-orcid":false,"given":"Max","family":"Tegmark","sequence":"additional","affiliation":[{"name":"Department of Physics, MIT, 77 Massachusetts Ave, Cambridge, MA 02139, USA"}]}],"member":"1968","published-online":{"date-parts":[[2019,9,23]]},"reference":[{"key":"ref_1","unstructured":"Tishby, N., Pereira, F.C., and Bialek, W. (2000). The information bottleneck method. 
arXiv."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"379","DOI":"10.1002\/j.1538-7305.1948.tb01338.x","article-title":"A Mathematical Theory of Communication","volume":"27","author":"Shannon","year":"1948","journal-title":"Bell Syst. Tech. J."},{"key":"ref_3","first-page":"165","article-title":"Information bottleneck for Gaussian variables","volume":"6","author":"Chechik","year":"2005","journal-title":"J. Mach. Learn. Res."},{"key":"ref_4","unstructured":"Rey, M., and Roth, V. (2012). Meta-Gaussian information bottleneck. Advances in Neural Information Processing Systems, NIPS."},{"key":"ref_5","unstructured":"Alemi, A.A., Fischer, I., Dillon, J.V., and Murphy, K. (2016). Deep variational information bottleneck. arXiv."},{"key":"ref_6","unstructured":"Chalk, M., Marre, O., and Tkacik, G. (2016). Relevant sparse codes with variational information bottleneck. Advances in Neural Information Processing Systems, NIPS."},{"key":"ref_7","unstructured":"Fischer, I. (2019, September 20). The Conditional Entropy Bottleneck. Available online: https:\/\/openreview.net\/forum?id=rkVOXhAqY7."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"1611","DOI":"10.1162\/NECO_a_00961","article-title":"The deterministic information bottleneck","volume":"29","author":"Strouse","year":"2017","journal-title":"Neural Comput."},{"key":"ref_9","unstructured":"Kolchinsky, A., Tracey, B.D., and Van Kuyk, S. (2019, January 30). Caveats for information bottleneck in deterministic scenarios. Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA."},{"key":"ref_10","unstructured":"Strouse, D., and Schwab, D.J. (2017). The information bottleneck and geometric clustering. arXiv."},{"key":"ref_11","first-page":"1947","article-title":"Emergence of invariance and disentanglement in deep representations","volume":"19","author":"Achille","year":"2018","journal-title":"J. Mach. Learn. 
Res."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Achille, A., and Soatto, S. (2018). Information dropout: Learning optimal representations through noisy computation. IEEE Trans. Pattern Anal. Mach. Intell.","DOI":"10.1109\/TPAMI.2017.2784440"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"LeCun","year":"1998","journal-title":"Proc. IEEE"},{"key":"ref_14","unstructured":"Krizhevsky, A., and Hinton, G. (2009). Learning Multiple Layers of Features from Tiny Images, University of Toronto. Technical Report."},{"key":"ref_15","unstructured":"Achille, A., Mbeng, G., and Soatto, S. (2018). The Dynamics of Differential Learning I: Information-Dynamics and Task Reachability. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Anantharam, V., Gohari, A., Kamath, S., and Nair, C. (2013). On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover. arXiv.","DOI":"10.1109\/ISIT.2014.6875389"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Polyanskiy, Y., and Wu, Y. (2017). Strong data-processing inequalities for channels and Bayesian networks. Convexity and Concentration, Springer.","DOI":"10.1007\/978-1-4939-7005-6_7"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Kim, H., Gao, W., Kannan, S., Oh, S., and Viswanath, P. (2017). Discovering potential correlations via hypercontractivity. Advances in Neural Information Processing Systems, NIPS.","DOI":"10.3390\/e19110586"},{"key":"ref_19","unstructured":"Lin, H.W., and Tegmark, M. (2016). Criticality in formal languages and statistical physics. 
arXiv."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"520","DOI":"10.1017\/S0305004100013517","article-title":"A connection between correlation and contingency","volume":"Volume 31","author":"Hirschfeld","year":"1935","journal-title":"Mathematical Proceedings of the Cambridge Philosophical Society"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"364","DOI":"10.1002\/zamm.19410210604","article-title":"Das statistische Problem der Korrelation als Variations-und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung","volume":"21","author":"Gebelein","year":"1941","journal-title":"ZAMM-J. Appl. Math. Mech. F\u00fcr Angew. Math. Und Mech."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"343","DOI":"10.1007\/BF00116829","article-title":"Learning from noisy examples","volume":"2","author":"Angluin","year":"1988","journal-title":"Mach. Learn."},{"key":"ref_23","unstructured":"Natarajan, N., Dhillon, I.S., Ravikumar, P.K., and Tewari, A. (2013). Learning with noisy labels. Advances in Neural Information Processing Systems, NIPS."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"447","DOI":"10.1109\/TPAMI.2015.2456899","article-title":"Classification with noisy labels by importance reweighting","volume":"38","author":"Liu","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_25","unstructured":"Xiao, T., Xia, T., Yang, Y., Huang, C., and Wang, X. (2015, January 7\u201312). Learning from massive noisy labeled data for image classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA."},{"key":"ref_26","unstructured":"Northcutt, C.G., Wu, T., and Chuang, I.L. (2017). Learning with confident examples: Rank pruning for robust classification with noisy labels. arXiv."},{"key":"ref_27","unstructured":"Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., and Garnett, R. (2016). Conditional Image Generation with PixelCNN Decoders. 
Advances in Neural Information Processing Systems 29, Curran Associates, Inc."},{"key":"ref_28","unstructured":"Salimans, T., Karpathy, A., Chen, X., and Kingma, D.P. (2017, January 24\u201326). PixelCNN++: A PixelCNN Implementation with Discretized Logistic Mixture Likelihood and Other Modifications. Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"066138","DOI":"10.1103\/PhysRevE.69.066138","article-title":"Estimating mutual information","volume":"69","author":"Kraskov","year":"2004","journal-title":"Phys. Rev. E"},{"key":"ref_30","unstructured":"Gelfand, I.M., and Silverman, R.A. (2000). Calculus of Variations, Courier Corporation."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"1026","DOI":"10.1109\/18.669153","article-title":"The efficiency of investment information","volume":"44","author":"Erkip","year":"1998","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"441","DOI":"10.1007\/BF02024507","article-title":"On measures of dependence","volume":"10","year":"1959","journal-title":"Acta Math. Hung."},{"key":"ref_33","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Zagoruyko, S., and Komodakis, N. (2016). Wide Residual Networks. arXiv.","DOI":"10.5244\/C.30.87"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q.V. (2018). Autoaugment: Learning augmentation policies from data. 
arXiv.","DOI":"10.1109\/CVPR.2019.00020"}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/21\/10\/924\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T13:23:23Z","timestamp":1760189003000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/21\/10\/924"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,9,23]]},"references-count":36,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2019,10]]}},"alternative-id":["e21100924"],"URL":"https:\/\/doi.org\/10.3390\/e21100924","relation":{},"ISSN":["1099-4300"],"issn-type":[{"type":"electronic","value":"1099-4300"}],"subject":[],"published":{"date-parts":[[2019,9,23]]}}}