{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,6]],"date-time":"2025-11-06T01:03:39Z","timestamp":1762391019395,"version":"build-2065373602"},"reference-count":50,"publisher":"MDPI AG","issue":"10","license":[{"start":{"date-parts":[[2021,9,23]],"date-time":"2021-09-23T00:00:00Z","timestamp":1632355200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Understanding how two datasets differ can help us determine whether one dataset under-represents certain sub-populations, and provides insights into how well models will generalize across datasets. Representative points selected by a maximum mean discrepancy (MMD) coreset can provide interpretable summaries of a single dataset, but are not easily compared across datasets. In this paper, we introduce dependent MMD coresets, a data summarization method for collections of datasets that facilitates comparison of distributions. 
We show that dependent MMD coresets are useful for understanding multiple related datasets and understanding model generalization between such datasets.<\/jats:p>","DOI":"10.3390\/info12100392","type":"journal-article","created":{"date-parts":[[2021,9,23]],"date-time":"2021-09-23T09:59:15Z","timestamp":1632391155000},"page":"392","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Understanding Collections of Related Datasets Using Dependent MMD Coresets"],"prefix":"10.3390","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0572-0045","authenticated-orcid":false,"given":"Sinead A.","family":"Williamson","sequence":"first","affiliation":[{"name":"Department of Statistics and Data Science, University of Texas at Austin, Austin, TX 78712, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3562-6540","authenticated-orcid":false,"given":"Jette","family":"Henderson","sequence":"additional","affiliation":[{"name":"CognitiveScale, Austin, TX 78759, USA"}]}],"member":"1968","published-online":{"date-parts":[[2021,9,23]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"12592","DOI":"10.1073\/pnas.1919012117","article-title":"Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis","volume":"117","author":"Larrazabal","year":"2020","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_2","unstructured":"Chen, I.Y., Johansson, F.D., and Sontag, D. (2018, January 3\u20138). Why is my classifier discriminatory?. Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_3","unstructured":"Buolamwini, J., and Gebru, T. (2018, January 23\u201324). Gender shades: Intersectional accuracy disparities in commercial gender classification. 
Proceedings of the 1st Conference on Fairness, Accountability and Transparency, New York, NY, USA."},{"key":"ref_4","unstructured":"Shankar, S., Halpern, Y., Breck, E., Atwood, J., Wilson, J., and Sculley, D. (2017). No classification without representation: Assessing geodiversity issues in open data sets for the developing world. arXiv."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"595","DOI":"10.1080\/13506285.2014.890989","article-title":"Are summary statistics enough? Evidence for the importance of shape in guiding visual search","volume":"22","author":"Alexander","year":"2014","journal-title":"Vis. Cogn."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"14666","DOI":"10.1038\/s41598-018-32991-1","article-title":"The role of scene summary statistics in object recognition","volume":"8","author":"Lauer","year":"2018","journal-title":"Sci. Rep."},{"key":"ref_7","unstructured":"Kaufmann, L., and Rousseeuw, P. (1987). Clustering by means of medoids. Statistical Data Analysis Based on the L1-Norm and Related Methods, Springer."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"2403","DOI":"10.1214\/11-AOAS495","article-title":"Prototype selection for interpretable classification","volume":"5","author":"Bien","year":"2011","journal-title":"Ann. Appl. Stat."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Mak, S., and Joseph, V.R. (2017). Projected support points: A new method for high-dimensional data reduction. arXiv.","DOI":"10.1214\/17-AOS1629"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"2562","DOI":"10.1214\/17-AOS1629","article-title":"Support points","volume":"46","author":"Mak","year":"2018","journal-title":"Ann. Stat."},{"key":"ref_11","unstructured":"Kim, B., Khanna, R., and Koyejo, O.O. (2016, January 5\u201310). Examples are not enough, learn to criticize! Criticism for interpretability. 
Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1023\/A:1007626913721","article-title":"Reduction techniques for instance-based learning algorithms","volume":"38","author":"Wilson","year":"2000","journal-title":"Mach. Learn."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Gurumoorthy, K.S., Dhurandhar, A., Cecchi, G., and Aggarwal, C. (2019, January 8\u201311). Efficient data representation by selecting prototypes with importance weights. Proceedings of the 2019 IEEE International Conference on Data Mining (ICDM), Beijing, China.","DOI":"10.1109\/ICDM.2019.00036"},{"key":"ref_14","unstructured":"Chen, Y., Welling, M., and Smola, A. (2010, January 8\u201311). Super-samples from kernel herding. Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"867","DOI":"10.1007\/s00454-019-00134-6","article-title":"Near-optimal coresets of kernel density estimates","volume":"63","author":"Phillips","year":"2020","journal-title":"Discret. Comput. Geom."},{"key":"ref_16","unstructured":"Karnin, Z., and Liberty, E. (2019, January 25\u201328). Discrepancy, coresets, and sketches in machine learning. Proceedings of the 32nd Conference on Learning Theory, Phoenix, AZ, USA."},{"key":"ref_17","unstructured":"Tai, W.M. (2021). Optimal Coreset for Gaussian Kernel Density Estimation. arXiv."},{"key":"ref_18","first-page":"723","article-title":"A kernel two-sample test","volume":"13","author":"Gretton","year":"2012","journal-title":"J. Mach. Learn. Res."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Pratt, K.B., and Tschapek, G. (2003, January 24\u201327). Visualizing concept drift. 
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA.","DOI":"10.1145\/956750.956849"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Hohman, F., Wongsuphasawat, K., Kery, M.B., and Patel, K. (2020, January 25\u201330). Understanding and visualizing data iteration in machine learning. Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA.","DOI":"10.1145\/3313831.3376177"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"606","DOI":"10.1145\/1008731.1008736","article-title":"Approximating extent measures of points","volume":"51","author":"Agarwal","year":"2004","journal-title":"J. ACM"},{"key":"ref_22","first-page":"18","article-title":"Wasserstein coresets for Lipschitz costs","volume":"1050","author":"Claici","year":"2018","journal-title":"Stat"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"429","DOI":"10.2307\/1428011","article-title":"Integral probability metrics and their generating classes of functions","volume":"29","year":"1997","journal-title":"Adv. Appl. Probab."},{"key":"ref_24","unstructured":"Bach, F., Lacoste-Julien, S., and Obozinski, G. (July, January 26). On the equivalence between herding and conditional gradient algorithms. Proceedings of the 29th International Conference on Machine Learning, Edinburgh, UK."},{"key":"ref_25","unstructured":"Lacoste-Julien, S., Lindsten, F., and Bach, F. (2015, January 9\u201312). Sequential kernel herding: Frank-Wolfe optimization for particle filtering. Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Phillips, J.M. (2013, January 6\u20138). \u03b5-samples for kernels. 
Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA.","DOI":"10.1137\/1.9781611973105.116"},{"key":"ref_27","unstructured":"Lopez-Paz, D., Muandet, K., Sch\u00f6lkopf, B., and Tolstikhin, I. (2015, January 7\u20139). Towards a learning theory of cause-effect inference. Proceedings of the 32nd International Conference on Machine Learning, Lille, France."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Feldman, D. (2020). Introduction to core-sets: An updated survey. arXiv.","DOI":"10.1002\/widm.1335"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"270","DOI":"10.3758\/s13414-013-0605-z","article-title":"Detecting meaning in RSVP at 13 ms per picture","volume":"76","author":"Potter","year":"2014","journal-title":"Atten. Percept. Psychophys."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Zheng, Y., Ou, Y., Lex, A., and Phillips, J.M. (2017, January 1). Visualization of big spatial data using coresets for kernel density estimates. Proceedings of the IEEE Visualization in Data Science (VDS), Phoenix, AZ, USA.","DOI":"10.1109\/VDS.2017.8573446"},{"key":"ref_31","unstructured":"Kim, B., Rudin, C., and Shah, J.A. (2014, January 8\u201313). The Bayesian case model: A generative approach for case-based reasoning and prototype classification. Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"39","DOI":"10.3233\/AIC-1994-7104","article-title":"Case-based reasoning: Foundational issues, methodological variations, and system approaches","volume":"7","author":"Aamodt","year":"1994","journal-title":"AI Commun."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Murdock, J.W., Aha, D.W., and Breslow, L.A. (2003). Assessing elaborated hypotheses: An interpretive case-based reasoning approach. 
Case-Based Reasoning Research and Development, Proceedings of the 5th International Conference on Case-Based Reasoning, Trondheim, Norway, 23\u201326 June 2003, Springer.","DOI":"10.1007\/3-540-45006-8_27"},{"key":"ref_34","first-page":"50","article-title":"Dependent nonparametric processes","volume":"Volume 1","author":"MacEachern","year":"1999","journal-title":"ASA Proceedings of the Section on Bayesian Statistical Science"},{"key":"ref_35","unstructured":"Quintana, F.A., Mueller, P., Jara, A., and MacEachern, S.N. (2020). The dependent Dirichlet process and related models. arXiv."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"205","DOI":"10.1198\/016214504000000205","article-title":"An ANOVA model for dependent random measures","volume":"99","author":"Rosner","year":"2004","journal-title":"J. Am. Stat. Assoc."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Dubey, A., Hefny, A., Williamson, S., and Xing, E.P. (2013, January 2\u20134). A nonparametric mixture model for topic modeling over time. Proceedings of the 13th SIAM International Conference on Data Mining, Austin, TX, USA.","DOI":"10.1137\/1.9781611972832.59"},{"key":"ref_38","unstructured":"Garreau, D., Jitkrittum, W., and Kanagawa, M. (2017). Large sample analysis of the median heuristic. arXiv."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Kiela, D., and Bottou, L. (2014, January 25\u201329). Learning image embeddings using convolutional neural networks for improved multi-modal semantics. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1005"},{"key":"ref_40","unstructured":"Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., and Sutskever, I. (2020, January 13\u201318). Generative pretraining from pixels. Proceedings of the International Conference on Machine Learning, Online."},{"key":"ref_41","unstructured":"Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. 
(2020, January 13\u201318). A simple framework for contrastive learning of visual representations. Proceedings of the 37th International Conference on Machine Learning, Online."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_43","unstructured":"Athalye, A., Engstrom, L., Ilyas, A., and Kwok, K. (2018, January 10\u201315). Synthesizing robust adversarial examples. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Steed, R., and Caliskan, A. (2021, January 3\u201310). Image representations learned with unsupervised pre-training contain human-like biases. Proceedings of the 4th Conference on Fairness, Accountability, and Transparency, Online.","DOI":"10.1145\/3442188.3445932"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Ginosar, S., Rakelly, K., Sachs, S., Yin, B., and Efros, A.A. (2015, January 7\u201313). A century of portraits: A visual historical record of American high school yearbooks. Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile.","DOI":"10.1109\/ICCVW.2015.87"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Marcel, S., and Rodriguez, Y. (2010, January 25\u201329). Torchvision the machine-vision package of torch. Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy.","DOI":"10.1145\/1873951.1874254"},{"key":"ref_47","unstructured":"Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J.W., Wallach, H., Daum\u00e9, H., and Crawford, K. (2018, January 13\u201315). Datasheets for datasets. 
Proceedings of the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning, Stockholm, Sweden."},{"key":"ref_48","unstructured":"Chmielinski, K.S., Newman, S., Taylor, M., Joseph, J., Thomas, K., Yurkofsky, J., and Qiu, Y.C. (2020, January 11). The Dataset Nutrition Label (2nd Gen): Leveraging Context to Mitigate Harms in Artificial Intelligence. Proceedings of the NeurIPS 2020 Workshop on Dataset Curation and Security, Online."},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"550","DOI":"10.1109\/34.291440","article-title":"A database for handwritten text recognition research","volume":"16","author":"Hull","year":"1994","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_50","first-page":"2825","article-title":"Scikit-learn: Machine Learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/12\/10\/392\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T07:03:49Z","timestamp":1760166229000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/12\/10\/392"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,9,23]]},"references-count":50,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2021,10]]}},"alternative-id":["info12100392"],"URL":"https:\/\/doi.org\/10.3390\/info12100392","relation":{},"ISSN":["2078-2489"],"issn-type":[{"type":"electronic","value":"2078-2489"}],"subject":[],"published":{"date-parts":[[2021,9,23]]}}}