{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,14]],"date-time":"2026-04-14T07:44:32Z","timestamp":1776152672705,"version":"3.50.1"},"reference-count":53,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2022,7,30]],"date-time":"2022-07-30T00:00:00Z","timestamp":1659139200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,7,30]],"date-time":"2022-07-30T00:00:00Z","timestamp":1659139200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Sci Data"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In this study, clustering is conceived as an auxiliary tool to identify groups of special interest. This approach was applied to a real dataset concerning an entire Portuguese cohort of higher education Law students. Several anonymized clustering scenarios were compared against the original cluster solution. The clustering techniques were explored as data utility models in the context of data anonymization, using k-anonymity and (\u03b5,\u00a0\u03b4)-differential as privacy models. The purpose was to assess anonymized data utility by standard metrics, by the characteristics of the groups obtained, and the relative risk (a relevant metric in social sciences research). For a matter of self-containment, we present an overview of anonymization and clustering methods. We used a partitional clustering algorithm and analyzed several clustering validity indices to understand to what extent the data structure is preserved, or not, after data anonymization. The results suggest that for low dimensionality\/cardinality datasets the anonymization procedure easily jeopardizes the clustering endeavor. 
In addition, there is evidence that relevant field-of-study estimates obtained from anonymized data are biased.<\/jats:p>","DOI":"10.1038\/s41597-022-01561-6","type":"journal-article","created":{"date-parts":[[2022,7,30]],"date-time":"2022-07-30T09:14:38Z","timestamp":1659172478000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["Utility-driven assessment of anonymized data via clustering"],"prefix":"10.1038","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1317-0629","authenticated-orcid":false,"given":"Maria Eug\u00e9nia","family":"Ferr\u00e3o","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3072-0186","authenticated-orcid":false,"given":"Paula","family":"Prata","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6054-7188","authenticated-orcid":false,"given":"Paulo","family":"Fazendeiro","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,7,30]]},"reference":[{"key":"1561_CR1","unstructured":"European Commission. General Data Protection Regulation, Art. 12\u201323 (2016)."},{"key":"1561_CR2","doi-asserted-by":"publisher","first-page":"89","DOI":"10.2478\/jos-2020-0005","volume":"36","author":"H Goldstein","year":"2020","unstructured":"Goldstein, H. & Shlomo, N. A probabilistic procedure for anonymisation, for assessing the risk of re-identification and for the analysis of perturbed data sets. J. Off. Stat. 36, 89\u2013115 (2020).","journal-title":"J. Off. Stat."},{"key":"1561_CR3","doi-asserted-by":"publisher","first-page":"555","DOI":"10.1111\/rssa.12315","volume":"181","author":"DJ Hand","year":"2018","unstructured":"Hand, D. J. Statistical challenges of administrative and transaction data. J. R. Stat. Soc. Ser. A Stat. Soc. 181, 555\u2013605 (2018).","journal-title":"J. R. Stat. Soc. Ser. A Stat. Soc."},{"key":"1561_CR4","unstructured":"Commission, E. 
General Data Protection Regulation, Art.24. (2016)."},{"key":"1561_CR5","doi-asserted-by":"crossref","unstructured":"Willenborg, L. & de Waal, T. Elements of statistical disclosure control in practice. (Springer-Verlag, 2001).","DOI":"10.1007\/978-1-4613-0121-9"},{"key":"1561_CR6","doi-asserted-by":"publisher","first-page":"1277","DOI":"10.1002\/spe.2812","volume":"50","author":"F Prasser","year":"2020","unstructured":"Prasser, F., Eicher, J., Spengler, H., Bild, R. & Kuhn, K. A. Flexible data anonymization using ARX\u2014Current status and challenges ahead. Softw. Pract. Exp. 50, 1277\u20131304 (2020).","journal-title":"Softw. Pract. Exp."},{"key":"1561_CR7","doi-asserted-by":"publisher","first-page":"1","DOI":"10.3390\/inventions6030045","volume":"6","author":"P Churi","year":"2021","unstructured":"Churi, P., Pawar, A. & Moreno-Guerrero, A. J. A comprehensive survey on data utility and privacy: Taking indian healthcare system as a potential case study. Inventions 6, 1\u201330 (2021).","journal-title":"Inventions"},{"key":"1561_CR8","doi-asserted-by":"publisher","first-page":"557","DOI":"10.1142\/S0218488502001648","volume":"10","author":"L Sweeney","year":"2002","unstructured":"Sweeney, L. k-anonymity: A model for protecting privacy. Int. J. Uncertainty, Fuzziness Knowledge-Based Syst. 10, 557\u2013570 (2002).","journal-title":"Int. J. Uncertainty, Fuzziness Knowledge-Based Syst."},{"key":"1561_CR9","first-page":"1","volume":"4052 LNCS","author":"C Dwork","year":"2006","unstructured":"Dwork, C. Differential Privacy. in. Lecture Notes in Computer Science 4052 LNCS, 1\u201312 (2006).","journal-title":"Lecture Notes in Computer Science"},{"key":"1561_CR10","doi-asserted-by":"publisher","first-page":"2751","DOI":"10.1109\/ACCESS.2016.2577036","volume":"4","author":"S Yu","year":"2016","unstructured":"Yu, S. Big privacy: Challenges and opportunities of privacy study in the age of big data. 
IEEE Access 4, 2751\u20132763 (2016).","journal-title":"IEEE Access"},{"key":"1561_CR11","unstructured":"Sweeney, L., Loewenfeldt, M. V. & Perry, M. Saying it\u2019s anonymous doesn\u2019t make it so: Re-identifications of \u201canonymized\u201d law school data. Technol. Sci. (2018)."},{"key":"1561_CR12","doi-asserted-by":"publisher","first-page":"10562","DOI":"10.1109\/ACCESS.2017.2706947","volume":"5","author":"R Mendes","year":"2017","unstructured":"Mendes, R. & Vilela, J. P. Privacy-Preserving Data Mining: Methods, Metrics, and Applications. IEEE Access 5, 10562\u201310582 (2017).","journal-title":"IEEE Access"},{"key":"1561_CR13","unstructured":"Sousa S, Guetl C, K. R. Privacy in open search: A review of challenges and solutions. in OSSYM 2021: Third Open Search Symposium (OSF: The Open Search Foundation, 2021)."},{"key":"1561_CR14","doi-asserted-by":"crossref","unstructured":"Dwork, C. Differential privacy: A survey of results. in Theory and Applications of Models of Computation 4978 LNCS, 1\u201319 (Springer Berlin Heidelberg, 2008).","DOI":"10.1007\/978-3-540-79228-4_1"},{"key":"1561_CR15","doi-asserted-by":"crossref","unstructured":"Soria-Comas, J., Domingo-Ferrer, J., Sanchez, D. & Martinez, S. t-closeness through microaggregation: Strict privacy with enhanced utility preservation. in IEEE Transactions on Knowledge and Data Engineering 27, 3098\u20133110 (IEEE, 2015).","DOI":"10.1109\/TKDE.2015.2435777"},{"key":"1561_CR16","first-page":"312","volume":"228","author":"F Prasser","year":"2016","unstructured":"Prasser, F., Bild, R. & Kuhn, K. A. A Generic method for assessing the quality of De-Identified health data. Stud. Health Technol. Inform. 228, 312\u2013316 (2016).","journal-title":"Stud. Health Technol. Inform."},{"key":"1561_CR17","doi-asserted-by":"crossref","unstructured":"Baird, C. Risk and needs assessments. 
Encyclopedia of Social Measurement 1007 (2005).","DOI":"10.1016\/B0-12-369398-5\/00075-X"},{"key":"1561_CR18","unstructured":"Vossensteyn, J. J. et al. Dropout and completion in higher education in Europe: main report. European Commission Education and Culture (2015)."},{"key":"1561_CR19","unstructured":"Breslow, N. E. & Day, N. E. Statistical methods in cancer research. Volume 2 - The design and analysis of cohort studies. (IARC Scientific Publications, 1987)."},{"key":"1561_CR20","unstructured":"OECD. Glossary of Statistical Terms. Available at: https:\/\/stats.oecd.org\/glossary\/ (2005)."},{"key":"1561_CR21","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1145\/1217299.1217302","volume":"1","author":"A Machanavajjhala","year":"2007","unstructured":"Machanavajjhala, A., Kifer, D., Gehrke, J. & Venkitasubramaniam, M. L -diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 1, 3 (2007).","journal-title":"ACM Trans. Knowl. Discov. Data"},{"key":"1561_CR22","doi-asserted-by":"publisher","unstructured":"Li, N., Li, T. & Venkatasubramanian, S. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. in 2007 IEEE 23rd International Conference on Data Engineering 106\u2013115 https:\/\/doi.org\/10.1109\/ICDE.2007.367856 (IEEE, 2007).","DOI":"10.1109\/ICDE.2007.367856"},{"key":"1561_CR23","doi-asserted-by":"publisher","first-page":"433","DOI":"10.14301\/llcs.v9i4.478","volume":"9","author":"D Avraam","year":"2018","unstructured":"Avraam, D., Boyd, A., Goldstein, H. & Burton, P. A software package for the application of probabilistic anonymisation to sensitive individual-level data: A proof of principle with an example from the ALSPAC birth cohort study. Longit. Life Course Stud. 9, 433\u2013446 (2018).","journal-title":"Longit. Life Course Stud."},{"key":"1561_CR24","doi-asserted-by":"publisher","unstructured":"Jagannathan, G., Pillaipakkamnatt, K. & Wright, R. N. A practical differentially private random decision tree classifier. ICDM Work. 
2009 - IEEE Int. Conf. Data Min. 114\u2013121, https:\/\/doi.org\/10.1109\/ICDMW.2009.93 (2009).","DOI":"10.1109\/ICDMW.2009.93"},{"key":"1561_CR25","doi-asserted-by":"crossref","unstructured":"Jain, P., Gyanchandani, M. & Khare, N. Differential privacy: its technological prescriptive using big data. J. Big Data 5 (2018).","DOI":"10.1186\/s40537-018-0124-9"},{"key":"1561_CR26","doi-asserted-by":"publisher","unstructured":"Li, N., Qardaji, W. & Su, D. On sampling, anonymization, and differential privacy or, k -anonymization meets differential privacy. In Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security - ASIACCS \u201912 32, https:\/\/doi.org\/10.1145\/2414456.2414474 (ACM Press, 2012).","DOI":"10.1145\/2414456.2414474"},{"key":"1561_CR27","doi-asserted-by":"publisher","first-page":"67","DOI":"10.1515\/popets-2018-0004","volume":"2018","author":"R Bild","year":"2018","unstructured":"Bild, R., Kuhn, K. A. & Prasser, F. SafePub: A Truthful Data Anonymization Algorithm With Strong Privacy Guarantees. Proc. Priv. Enhancing Technol. 2018, 67\u201387 (2018).","journal-title":"Proc. Priv. Enhancing Technol."},{"key":"1561_CR28","first-page":"35","volume":"6","author":"FK Dankar","year":"2013","unstructured":"Dankar, F. K. & El Emam, K. Practicing differential privacy in health care: A review. Trans. Data Priv. 6, 35\u201367 (2013).","journal-title":"Trans. Data Priv."},{"key":"1561_CR29","unstructured":"Kasiviswanathan, S. P. & Smith, A. A. Note on differential privacy: Defining resistance to arbitrary side information. arXiv:0803.3946 (2008)."},{"key":"1561_CR30","doi-asserted-by":"publisher","first-page":"38","DOI":"10.1080\/01621459.1990.10475304","volume":"85","author":"JG Bethlehem","year":"1990","unstructured":"Bethlehem, J. G., Keller, W. J. & Pannekoek, J. Disclosure control of microdata. J. Am. Stat. Assoc. 85, 38\u201345 (1990).","journal-title":"J. Am. Stat. 
Assoc."},{"key":"1561_CR31","unstructured":"Chen, B.-C., Ramakrishnan, R. & LeFevre, K. Privacy skyline: Privacy with multidimensional adversarial knowledge. Proc. 33rd Int. Conf. Very Large Databases (2007)."},{"key":"1561_CR32","doi-asserted-by":"crossref","unstructured":"El Emam, K. Guide to the De-Identification of Personal Health Information. (CRC Press, 2013).","DOI":"10.1201\/b14764"},{"key":"1561_CR33","unstructured":"Kniola, L. Calculating the risk of re-identification of patient-level data using quantitative approach. PhUSE Annu. Conf. 1\u20139 (2016)."},{"key":"1561_CR34","unstructured":"El Emam, K. & Arbuckle, L. Anonymizing Health Data. (O\u00b4REILLY, 2014)."},{"key":"1561_CR35","unstructured":"Kniola, L. Plausible adversaries in re-identification risk assessment. PhUSE Annu. Conf. (2017)."},{"key":"1561_CR36","first-page":"461","volume":"9","author":"DB Rubin","year":"1993","unstructured":"Rubin, D. B. Statistical disclosure limitation. J. Off. Stat. 9, 461\u2013468 (1993).","journal-title":"J. Off. Stat."},{"key":"1561_CR37","unstructured":"Ji, Z., Lipton, Z. C. & Elkan, C. Differential Privacy and Machine Learning: a Survey and Review. 1\u201330 (2014)."},{"key":"1561_CR38","doi-asserted-by":"publisher","first-page":"1","DOI":"10.3390\/s20247030","volume":"20","author":"T Wang","year":"2020","unstructured":"Wang, T., Zhang, X., Feng, J. & Yang, X. A comprehensive survey on local differential privacy toward data statistics and analysis. Sensors (Switzerland) 20, 1\u201348 (2020).","journal-title":"Sensors (Switzerland)"},{"key":"1561_CR39","doi-asserted-by":"publisher","first-page":"158","DOI":"10.1016\/j.future.2018.07.038","volume":"90","author":"C Piao","year":"2019","unstructured":"Piao, C., Shi, Y., Yan, J., Zhang, C. & Liu, L. Privacy-preserving governmental data publishing: A fog-computing-based differential privacy approach. Futur. Gener. Comput. Syst. 90, 158\u2013174 (2019).","journal-title":"Futur. Gener. Comput. 
Syst."},{"key":"1561_CR40","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3168389","volume":"51","author":"I Wagner","year":"2018","unstructured":"Wagner, I. & Eckhoff, D. Technical privacy metrics: A systematic survey. ACM Comput. Surv. 51, 1\u201345 (2018).","journal-title":"ACM Comput. Surv."},{"key":"1561_CR41","doi-asserted-by":"crossref","unstructured":"Yin, X., Zhu, Y. & Hu, J. A Comprehensive Survey of Privacy-preserving Federated Learning: A Taxonomy, Review, and Future Directions. ACM Comput. Surv. 54 (2021).","DOI":"10.1145\/3460427"},{"key":"1561_CR42","doi-asserted-by":"publisher","first-page":"206","DOI":"10.1109\/TKDE.2008.129","volume":"21","author":"A Gionis","year":"2009","unstructured":"Gionis, A. & Tassa, T. k-Anonymization with Minimal Loss of Information. IEEE Trans. Knowl. Data Eng. 21, 206\u2013219 (2009).","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"1561_CR43","unstructured":"Rastogi, V., Suciu, D. & Hong, S. The Boundary Between Privacy and Utility in Data Anonymization. eprint arXiv:cs\/0612103 531\u2013542 (2006)."},{"key":"1561_CR44","doi-asserted-by":"crossref","unstructured":"Fazendeiro, P. & Oliveira, J. V. Fuzzy clustering as a data-driven development environment for information granules. in Handbook of Granular Computing 153\u2013169 (Wiley, 2008).","DOI":"10.1002\/9780470724163.ch7"},{"key":"1561_CR45","first-page":"768","volume":"21","author":"E Forgy","year":"1965","unstructured":"Forgy, E. Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics 21, 768\u2013780 (1965).","journal-title":"Biometrics"},{"key":"1561_CR46","doi-asserted-by":"crossref","unstructured":"Fazendeiro, P. & de Oliveira, J. V. Observer-biased analysis of gene expression profiles. Big Data Anal. Bioinforma. Heal. IGI Glob. 117\u2013137 (2015).","DOI":"10.4018\/978-1-4666-6611-5.ch006"},{"key":"1561_CR47","unstructured":"Tan, P. N., Steinbach, M. & Kumar, K. 
Introduction to Data Mining. (Addison-Wesley, 2005)."},{"key":"1561_CR48","doi-asserted-by":"publisher","first-page":"209","DOI":"10.1002\/sam.10080","volume":"3","author":"L Vendramin","year":"2010","unstructured":"Vendramin, L., Campello, R. J. G. B. & Hruschka, E. R. Relative clustering validity criteria: A comparative overview. Stat. Anal. Data Min. ASA Data Sci. J. 3, 209\u2013235 (2010).","journal-title":"Stat. Anal. Data Min. ASA Data Sci. J."},{"key":"1561_CR49","doi-asserted-by":"publisher","first-page":"957","DOI":"10.1007\/s10639-017-9645-7","volume":"23","author":"S Bharara","year":"2018","unstructured":"Bharara, S., Sabitha, S. & Bansal, A. Application of learning analytics using clustering data mining for students\u2019 disposition analysis. Educ. Inf. Technol. 23, 957\u2013984 (2018).","journal-title":"Educ. Inf. Technol."},{"key":"1561_CR50","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-021-93244-2","volume":"11","author":"B Lyu","year":"2021","unstructured":"Lyu, B., Wu, W. & Hu, Z. A novel bidirectional clustering algorithm based on local density. Sci. Rep. 11, 14214 (2021).","journal-title":"Sci. Rep."},{"key":"1561_CR51","doi-asserted-by":"publisher","first-page":"243","DOI":"10.1016\/j.patcog.2012.07.021","volume":"46","author":"O Arbelaitz","year":"2013","unstructured":"Arbelaitz, O., Gurrutxaga, I., Muguerza, J., P\u00e9rez, J. M. & Perona, I. An extensive comparative study of cluster validity indices. Pattern Recognit. 46, 243\u2013256 (2013).","journal-title":"Pattern Recognit."},{"key":"1561_CR52","doi-asserted-by":"publisher","DOI":"10.17605\/OSF.IO\/9VGEH","author":"ME Ferr\u00e3o","year":"2022","unstructured":"Ferr\u00e3o, ME., Prata, P. & Fazendeiro, P. Anonymized higher education data for \u201cUtility-driven assessment of anonymized data via clustering\u201d. 
Open Science Framework, https:\/\/doi.org\/10.17605\/OSF.IO\/9VGEH (2022)."},{"key":"1561_CR53","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V. & Thirion, B. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825\u20132830 (2011).","journal-title":"J. Mach. Learn. Res."}],"container-title":["Scientific Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s41597-022-01561-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41597-022-01561-6","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41597-022-01561-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,11,24]],"date-time":"2022-11-24T19:56:23Z","timestamp":1669319783000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s41597-022-01561-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,7,30]]},"references-count":53,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2022,12]]}},"alternative-id":["1561"],"URL":"https:\/\/doi.org\/10.1038\/s41597-022-01561-6","relation":{"references":[{"id-type":"doi","id":"10.17605\/OSF.IO\/9VGEH","asserted-by":"subject"}]},"ISSN":["2052-4463"],"issn-type":[{"value":"2052-4463","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,7,30]]},"assertion":[{"value":"1 April 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 July 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 July 2022","order":3,"name":"first_online","label":"First 
Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"456"}}