{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T00:44:46Z","timestamp":1767919486890,"version":"3.49.0"},"reference-count":35,"publisher":"Springer Science and Business Media LLC","issue":"11","license":[{"start":{"date-parts":[[2020,7,24]],"date-time":"2020-07-24T00:00:00Z","timestamp":1595548800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,7,24]],"date-time":"2020-07-24T00:00:00Z","timestamp":1595548800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Max Planck Institute for Informatics"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Knowl Inf Syst"],"published-print":{"date-parts":[[2020,11]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>We consider the task of discovering functional dependencies in data for target attributes of interest. To solve it, we have to answer two questions: How do we quantify the dependency in a model-agnostic and interpretable way as well as reliably against sample size and dimensionality biases? How can we efficiently discover the exact or<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\alpha $$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mml:mi>\u03b1<\/mml:mi><\/mml:math><\/jats:alternatives><\/jats:inline-formula>-approximate top-<jats:italic>k<\/jats:italic>dependencies? We address the first question by adopting information-theoretic notions. Specifically, we consider the mutual information score, for which we propose a reliable estimator that enables robust optimization in high-dimensional data. To address the second question, we then systematically explore the algorithmic implications of using this measure for optimization. We show the problem is NP-hard and justify worst-case exponential-time as well as heuristic search methods. We propose two bounding functions for the estimator, which we use as pruning criteria in branch-and-bound search to efficiently mine dependencies with approximation guarantees. Empirical evaluation shows that the derived estimator has desirable statistical properties, the bounding functions lead to effective exact and greedy search algorithms, and when combined, qualitative experiments show the framework indeed discovers highly informative dependencies.<\/jats:p>","DOI":"10.1007\/s10115-020-01494-9","type":"journal-article","created":{"date-parts":[[2020,7,24]],"date-time":"2020-07-24T08:04:08Z","timestamp":1595577848000},"page":"4223-4253","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Discovering dependencies with reliable mutual information"],"prefix":"10.1007","volume":"62","author":[{"given":"Panagiotis","family":"Mandros","sequence":"first","affiliation":[]},{"given":"Mario","family":"Boley","sequence":"additional","affiliation":[]},{"given":"Jilles","family":"Vreeken","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2020,7,24]]},"reference":[{"key":"1494_CR1","doi-asserted-by":"publisher","first-page":"013031","DOI":"10.1088\/1367-2630\/aa57c2","volume":"19","author":"BR Goldsmith","year":"2017","unstructured":"Goldsmith BR, Boley M, Vreeken J, Scheffler M, Ghiringhelli LM (2017) Uncovering structure-property relationships of materials by subgroup discovery. New J Phys 19:013031","journal-title":"New J Phys"},{"issue":"10","key":"1494_CR2","doi-asserted-by":"publisher","first-page":"105503","DOI":"10.1103\/PhysRevLett.114.105503","volume":"114","author":"LM Ghiringhelli","year":"2015","unstructured":"Ghiringhelli LM, Vybiral J, Levchenko SV, Draxl C, Scheffler M (2015) Big data of materials science: Critical role of the descriptor. Phys Rev Lett 114(10):105503","journal-title":"Phys Rev Lett"},{"key":"1494_CR3","doi-asserted-by":"publisher","first-page":"1518","DOI":"10.1126\/science.1205438","volume":"334","author":"DN Reshef","year":"2011","unstructured":"Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC (2011) Detecting novel associations in large data sets. Science 334:1518\u20131524","journal-title":"Science"},{"issue":"10","key":"1494_CR4","doi-asserted-by":"publisher","first-page":"1082","DOI":"10.14778\/2794367.2794377","volume":"8","author":"T Papenbrock","year":"2015","unstructured":"Papenbrock T, Ehrlich J, Marten J, Neubert T, Rudolph J-P, Sch\u00f6nberg M, Zwiener J, Naumann F (2015) Functional dependency discovery: An experimental evaluation of seven algorithms. Proc VLDB Endowm 8(10):1082\u20131093","journal-title":"Proc VLDB Endowm"},{"key":"1494_CR5","first-page":"1157","volume":"3","author":"I Guyon","year":"2003","unstructured":"Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157\u20131182","journal-title":"J Mach Learn Res"},{"key":"1494_CR6","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1016\/j.ijar.2006.06.008","volume":"45","author":"JM Pe\u00f1a","year":"2007","unstructured":"Pe\u00f1a JM, Nilsson R, Bj\u00f6rkegren J, Tegn\u00e9r J (2007) Towards scalable and data efficient learning of Markov boundaries. Int J Approx Reason 45:211\u2013232","journal-title":"Int J Approx Reason"},{"key":"1494_CR7","first-page":"27","volume":"13","author":"G Brown","year":"2012","unstructured":"Brown G, Pocock A, Zhao M-J, Luj\u00e1n M (2012) Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:27\u201366","journal-title":"J Mach Learn Res"},{"key":"1494_CR8","unstructured":"Tsamardinos I, Aliferis C, Statnikov A, Statnikov E (2003) Algorithms for large scale markov blanket discovery. In: Proceedings of the 16th international FLAIRS conference, St, AAAI Press, pp 376\u2013380"},{"key":"1494_CR9","unstructured":"Cavallo R, Pittarelli M (1987) The theory of probabilistic databases. In: Proceedings of the 13th international conference on very large data bases (VLDB), Brighton, UK, pp 71\u201381"},{"issue":"6","key":"1494_CR10","doi-asserted-by":"publisher","first-page":"483","DOI":"10.1016\/j.is.2003.10.006","volume":"29","author":"C Giannella","year":"2004","unstructured":"Giannella C, Robertson EL (2004) On approximation measures for functional dependencies. Inf Syst 29(6):483\u2013507","journal-title":"Inf Syst"},{"issue":"1","key":"1494_CR11","doi-asserted-by":"publisher","first-page":"116","DOI":"10.1214\/12-STS405","volume":"28","author":"M Reimherr","year":"2013","unstructured":"Reimherr M, Nicolae DL (2013) On quantifying dependence: a framework for developing interpretable measures. Stat Sci 28(1):116\u2013130","journal-title":"Stat Sci"},{"key":"1494_CR12","doi-asserted-by":"crossref","unstructured":"Romano S, Vinh NX, Bailey J, Verspoor K (2016) A framework to adjust dependency measure estimates for chance. In: Proceedings of the SIAM international conference on data mining (SDM), Miami, FL, SIAM","DOI":"10.1137\/1.9781611974348.48"},{"key":"1494_CR13","doi-asserted-by":"publisher","first-page":"163","DOI":"10.1002\/rsa.10019","volume":"19","author":"A Antos","year":"2001","unstructured":"Antos A, Kontoyiannis I (2001) Convergence properties of functional estimates for discrete distributions. Random Struct Algorithm 19:163\u2013193","journal-title":"Random Struct Algorithm"},{"issue":"3","key":"1494_CR14","doi-asserted-by":"publisher","first-page":"285","DOI":"10.1016\/S0167-2789(98)00269-3","volume":"125","author":"MS Roulston","year":"1999","unstructured":"Roulston MS (1999) Estimating the errors on measured entropy and mutual information. Physica D 125(3):285\u2013294","journal-title":"Physica D"},{"key":"1494_CR15","unstructured":"Lancaster H (1969) The chi-squared distribution. Wiley series in probability and mathematical statistics. Probability and mathematical statistics, Wiley"},{"key":"1494_CR16","doi-asserted-by":"crossref","unstructured":"Mandros P, Boley M, Vreeken J (2017) Discovering reliable approximate functional dependencies. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, KDD\u201917, ACM","DOI":"10.1145\/3097983.3098062"},{"key":"1494_CR17","doi-asserted-by":"crossref","unstructured":"Mandros P, Boley M, Vreeken J (2018) Discovering reliable dependencies from data: hardness and improved algorithms. In: 2018 IEEE international conference on data mining (ICDM), pp 317\u2013326, IEEE","DOI":"10.1109\/ICDM.2018.00047"},{"key":"1494_CR18","doi-asserted-by":"crossref","unstructured":"Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary?. In: Proceedings of the 26th international conference on machine learning, ACM, pp 1073\u20131080","DOI":"10.1145\/1553374.1553511"},{"key":"1494_CR19","unstructured":"Romano S, Bailey J, Vinh NX, Verspoor K (2014) Standardized mutual information for clustering comparisons: one step further in adjustment for chance. In: Proceedings of the 31st international conference on machine learning (ICML), Beijing, China, pp 1143\u20131151"},{"key":"1494_CR20","first-page":"2837","volume":"11","author":"NX Vinh","year":"2010","unstructured":"Vinh NX, Epps J, Bailey J (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 11:2837\u20132854","journal-title":"J Mach Learn Res"},{"key":"1494_CR21","doi-asserted-by":"publisher","DOI":"10.1002\/0471200611","volume-title":"Elements of information theory","author":"TM Cover","year":"1991","unstructured":"Cover TM, Thomas JA (1991) Elements of information theory. Wiley-Interscience, New York, NY"},{"key":"1494_CR22","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-24488-9","volume-title":"Combinatorial optimization: theory and algorithms","author":"B Korte","year":"2012","unstructured":"Korte B, Vygen J (2012) Combinatorial optimization: theory and algorithms, 5th edn. Springer Publishing Company, Berlin","edition":"5"},{"key":"1494_CR23","volume-title":"Algorithms and data structures: the basic toolbox","author":"K Mehlhorn","year":"2008","unstructured":"Mehlhorn K, Sanders P (2008) Algorithms and data structures: the basic toolbox. Springer Science & Business Media, Berlin"},{"key":"1494_CR24","doi-asserted-by":"publisher","first-page":"431","DOI":"10.1613\/jair.227","volume":"3","author":"GI Webb","year":"1995","unstructured":"Webb GI (1995) Opus: an efficient admissible algorithm for unordered search. J Artif Intell Res 3:431\u2013465","journal-title":"J Artif Intell Res"},{"issue":"4","key":"1494_CR25","doi-asserted-by":"publisher","first-page":"1133","DOI":"10.1137\/090779346","volume":"40","author":"U Feige","year":"2011","unstructured":"Feige U, Mirrokni VS, Vondrak J (2011) Maximizing non-monotone submodular functions. SIAM J Comput 40(4):1133\u20131153","journal-title":"SIAM J Comput"},{"key":"1494_CR26","unstructured":"Das A, Kempe D (2011) Submodular meets spectral: greedy algorithms for subset selection, sparse approximation and dictionary selection. In: Proceedings of the 28th international conference on international conference on machine learning, ICML\u201911, pp 1057\u20131064"},{"key":"1494_CR27","unstructured":"Bian AA, Buhmann JM, Krause A, Tschiatschek S (2017) Guarantees for greedy maximization of non-submodular functions with applications. In: International conference on machine learning (ICML)"},{"key":"1494_CR28","doi-asserted-by":"crossref","unstructured":"Vinh NX, Chan J, Bailey J (2014) Reconsidering mutual information based feature selection: a statistical significance view. In: Proceedings of the 28th AAAI conference on artificial intelligence","DOI":"10.1609\/aaai.v28i1.8953"},{"issue":"2\u20133","key":"1494_CR29","first-page":"255","volume":"17","author":"J Alcal\u00e0-Fdez","year":"2011","unstructured":"Alcal\u00e0-Fdez J, Fern\u00e0ndez A, Luengo J, Derrac J, Garc\u00eca S (2011) Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Multip Val Log Soft Comput 17(2\u20133):255\u2013287","journal-title":"Multip Val Log Soft Comput"},{"key":"1494_CR30","unstructured":"Matheus CJ, Rendell LA (1989) Constructive induction on decision trees. In: Proceedings of the 11th international joint conference on artificial intelligence (IJCAI), Detroit, MI"},{"issue":"3","key":"1494_CR31","doi-asserted-by":"publisher","first-page":"891","DOI":"10.1103\/PhysRev.182.891","volume":"182","author":"JA Van Vechten","year":"1969","unstructured":"Van Vechten JA (1969) Quantum dielectric theory of electronegativity in covalent systems. i. electronic dielectric constant. Phys Rev 182(3):891","journal-title":"Phys Rev"},{"key":"1494_CR32","unstructured":"Horel T, Singer Y (2016) Maximization of approximately submodular functions. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in neural information processing systems 29, Curran Associates, Inc, pp 3045\u20133053"},{"key":"1494_CR33","unstructured":"Du D-Z, Graham RL, Pardalos PM, Wan P-J, Wu W, Zhao W (2008) Analysis of greedy approximations with nonsubmodular potential functions. In: Proceedings of the nineteenth annual ACM-SIAM symposium on discrete algorithms, SODA \u201908, pp 167\u2013175"},{"key":"1494_CR34","unstructured":"Zhou Y, Spanos CJ (2016) Causal meets submodular: subset selection with directed information. In: Proceedings of the neural information processing systems conference, pp 2649\u20132657"},{"issue":"6","key":"1494_CR35","doi-asserted-by":"publisher","first-page":"1220","DOI":"10.1109\/TIT.2004.828057","volume":"50","author":"M Rao","year":"2004","unstructured":"Rao M, Chen Y, Vemuri BC, Wang F (2004) Cumulative residual entropy: a new measure of information. IEEE Trans Inf Technol 50(6):1220\u20131228","journal-title":"IEEE Trans Inf Technol"}],"container-title":["Knowledge and Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10115-020-01494-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10115-020-01494-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10115-020-01494-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,11,4]],"date-time":"2022-11-04T07:48:08Z","timestamp":1667548088000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10115-020-01494-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,7,24]]},"references-count":35,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2020,11]]}},"alternative-id":["1494"],"URL":"https:\/\/doi.org\/10.1007\/s10115-020-01494-9","relation":{},"ISSN":["0219-1377","0219-3116"],"issn-type":[{"value":"0219-1377","type":"print"},{"value":"0219-3116","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,7,24]]},"assertion":[{"value":"11 March 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 June 2020","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 July 2020","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 July 2020","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}