{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,10]],"date-time":"2026-03-10T09:07:19Z","timestamp":1773133639400,"version":"3.50.1"},"reference-count":35,"publisher":"Springer Science and Business Media LLC","issue":"5","license":[{"start":{"date-parts":[[2021,7,13]],"date-time":"2021-07-13T00:00:00Z","timestamp":1626134400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,7,13]],"date-time":"2021-07-13T00:00:00Z","timestamp":1626134400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100007364","name":"Fondazione CRT","doi-asserted-by":"publisher","award":["2019-0450"],"award-info":[{"award-number":["2019-0450"]}],"id":[{"id":"10.13039\/100007364","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Data Min Knowl Disc"],"published-print":{"date-parts":[[2021,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Most privacy-preserving machine learning methods are designed around continuous or numeric data, but categorical attributes are common in many application scenarios, including clinical and health records, census and survey data. Distance-based methods, in particular, have limited applicability to categorical data, since they do not capture the complexity of the relationships among different values of a categorical attribute. Although distance learning algorithms exist for categorical data, they may disclose private information about individual records if applied to a secret dataset. To address this problem, we introduce a differentially private family of algorithms for learning distances between any pair of values of a categorical attribute according to the way they are co-distributed with the values of other categorical attributes forming the so-called context. We define different variants of our algorithm and we show empirically that our approach consumes little privacy budget while providing accurate distances, making it suitable in distance-based applications, such as clustering and classification.<\/jats:p>","DOI":"10.1007\/s10618-021-00778-0","type":"journal-article","created":{"date-parts":[[2021,7,13]],"date-time":"2021-07-13T07:02:49Z","timestamp":1626159769000},"page":"2050-2088","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Differentially Private Distance Learning in Categorical Data"],"prefix":"10.1007","volume":"35","author":[{"given":"Elena","family":"Battaglia","sequence":"first","affiliation":[]},{"given":"Simone","family":"Celano","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5145-3438","authenticated-orcid":false,"given":"Ruggero G.","family":"Pensa","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,7,13]]},"reference":[{"key":"778_CR1","doi-asserted-by":"crossref","unstructured":"Alamuri M, Raju SB, Negi A (2014) A survey of distance\/similarity measures for categorical data. In: Proceedings of IJCNN 2014. IEEE, pp 1907\u20131914","DOI":"10.1109\/IJCNN.2014.6889941"},{"key":"778_CR2","doi-asserted-by":"crossref","unstructured":"Anandan B, Clifton C (2018) Differentially private feature selection for data mining. In: Proceedings of ACM IWSPA@CODASPY 2018, pp 43\u201353","DOI":"10.1145\/3180445.3180452"},{"key":"778_CR3","doi-asserted-by":"crossref","unstructured":"Aum\u00fcller M, Bourgeat A, Schmurr J (2020) Differentially private sketches for Jaccard similarity estimation. In: Proceedings of SISAP 2020. Springer, Berlin, pp 18\u201332","DOI":"10.1007\/978-3-030-60936-8_2"},{"issue":"1","key":"778_CR4","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1111\/j.2517-6161.1995.tb02031.x","volume":"57","author":"Y Benjamini","year":"1995","unstructured":"Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc Ser B (Methodol) 57(1):289\u2013300","journal-title":"J Roy Stat Soc Ser B (Methodol)"},{"key":"778_CR5","doi-asserted-by":"crossref","unstructured":"Boriah S, Chandola V, Kumar V (2008) Similarity measures for categorical data: a comparative evaluation. In: Proceedings of SIAM SDM 2008, pp 243\u2013254","DOI":"10.1137\/1.9781611972788.22"},{"key":"778_CR6","first-page":"1069","volume":"12","author":"K Chaudhuri","year":"2011","unstructured":"Chaudhuri K, Monteleoni C, Sarwate AD (2011) Differentially private empirical risk minimization. J Mach Learn Res 12:1069\u20131109","journal-title":"J Mach Learn Res"},{"key":"778_CR7","doi-asserted-by":"crossref","DOI":"10.1002\/0471200611","volume-title":"Elements of information theory","author":"TM Cover","year":"2001","unstructured":"Cover TM, Thomas JA (2001) Elements of information theory. Wiley, New York"},{"key":"778_CR8","doi-asserted-by":"publisher","first-page":"282","DOI":"10.1107\/S0567739478000522","volume":"34","author":"GM Crippen","year":"1978","unstructured":"Crippen GM, Havel TF (1978) Stable calculation of coordinates for distance information. Acta Crystallogr A 34:282\u2013284","journal-title":"Acta Crystallogr A"},{"key":"778_CR9","first-page":"1","volume":"7","author":"J Demsar","year":"2006","unstructured":"Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1\u201330","journal-title":"J Mach Learn Res"},{"key":"778_CR10","doi-asserted-by":"crossref","unstructured":"Domingo-Ferrer J, S\u00e1nchez D, Blanco-Justicia A (2021) The limits of differential privacy (and its misuse in data release and machine learning). Commun ACM (To appear)","DOI":"10.1145\/3433638"},{"issue":"3\u20134","key":"778_CR11","first-page":"211","volume":"9","author":"C Dwork","year":"2014","unstructured":"Dwork C, Roth A (2014) The algorithmic foundations of differential privacy. Found Trends Theor Comput Sci 9(3\u20134):211\u2013407","journal-title":"Found Trends Theor Comput Sci"},{"key":"778_CR12","doi-asserted-by":"crossref","unstructured":"Dwork C, Kohli N, Mulligan D (2019) Differential privacy in practice: expose your epsilons! J Priv Confid 9(2)","DOI":"10.29012\/jpc.689"},{"key":"778_CR13","doi-asserted-by":"crossref","unstructured":"Friedman A, Schuster A (2010) Data mining with differential privacy. In: Proceedings of ACM SIGKDD 2010. ACM, pp 493\u2013502","DOI":"10.1145\/1835804.1835868"},{"key":"778_CR14","doi-asserted-by":"crossref","unstructured":"Gao C, Huang C, Lin D, Jin D, Li Y (2020) DPLCF: differentially private local collaborative filtering. In: Proceedings of ACM SIGIR 2020. ACM, pp 961\u2013970","DOI":"10.1145\/3397271.3401053"},{"issue":"5","key":"778_CR15","doi-asserted-by":"publisher","first-page":"1544","DOI":"10.1007\/s10618-017-0532-z","volume":"31","author":"ME Gursoy","year":"2017","unstructured":"Gursoy ME, Inan A, Nergiz ME, Saygin Y (2017) Differentially private nearest neighbor classification. Data Min Knowl Discov 31(5):1544\u20131575","journal-title":"Data Min Knowl Discov"},{"key":"778_CR16","doi-asserted-by":"crossref","unstructured":"Hsu J, Gaboardi M, Haeberlen A, Khanna S, Narayan A, Pierce BC, Roth A (2014) Differential privacy: An economic method for choosing epsilon. In: Proceedings of IEEE CSF 2014. IEEE Computer Society, pp 398\u2013410","DOI":"10.1109\/CSF.2014.35"},{"issue":"1","key":"778_CR17","doi-asserted-by":"publisher","first-page":"193","DOI":"10.1007\/BF01908075","volume":"2","author":"L Hubert","year":"1985","unstructured":"Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193\u2013218","journal-title":"J Classif"},{"key":"778_CR18","doi-asserted-by":"publisher","first-page":"113","DOI":"10.1016\/j.neucom.2016.01.089","volume":"196","author":"D Ienco","year":"2016","unstructured":"Ienco D, Pensa RG (2016) Positive and unlabeled learning in categorical data. Neurocomputing 196:113\u2013124","journal-title":"Neurocomputing"},{"key":"778_CR19","doi-asserted-by":"crossref","unstructured":"Ienco D, Pensa RG, Meo R (2012) From context to distance: learning dissimilarity for categorical data clustering. ACM Trans Knowl Discov Data 6(1):1:1-1:25","DOI":"10.1145\/2133360.2133361"},{"issue":"5","key":"778_CR20","doi-asserted-by":"publisher","first-page":"1017","DOI":"10.1109\/TNNLS.2016.2526063","volume":"28","author":"D Ienco","year":"2017","unstructured":"Ienco D, Pensa RG, Meo R (2017) A semisupervised approach to the detection and characterization of outliers in categorical data. IEEE Trans Neural Netw Learn Syst 28(5):1017\u20131029","journal-title":"IEEE Trans Neural Netw Learn Syst"},{"issue":"1\u20132","key":"778_CR21","doi-asserted-by":"publisher","first-page":"287","DOI":"10.1016\/S0004-3702(98)00046-0","volume":"104","author":"S Kasif","year":"1998","unstructured":"Kasif S, Salzberg S, Waltz DL, Rachlin J, Aha DW (1998) A probabilistic framework for memory-based reasoning. Artif Intell 104(1\u20132):287\u2013311","journal-title":"Artif Intell"},{"key":"778_CR22","doi-asserted-by":"publisher","first-page":"1107","DOI":"10.1016\/j.neucom.2015.10.038","volume":"174","author":"Y Li","year":"2016","unstructured":"Li Y, Yang J, Ji W (2016) Local learning-based feature weighting with privacy preservation. Neurocomputing 174:1107\u20131115","journal-title":"Neurocomputing"},{"key":"778_CR23","doi-asserted-by":"crossref","unstructured":"McSherry F, Talwar K (2007) Mechanism design via differential privacy. In: Proceedings of IEEE FOCS 2007. IEEE Computer Society, pp 94\u2013103","DOI":"10.1109\/FOCS.2007.4389483"},{"issue":"301","key":"778_CR24","doi-asserted-by":"publisher","first-page":"216","DOI":"10.1080\/01621459.1963.10500843","volume":"58","author":"EB Page","year":"1963","unstructured":"Page EB (1963) Ordered hypotheses for multiple treatments: a significance test for linear ranks. J Am Stat Assoc 58(301):216\u2013230","journal-title":"J Am Stat Assoc"},{"issue":"8","key":"778_CR25","doi-asserted-by":"publisher","first-page":"1226","DOI":"10.1109\/TPAMI.2005.159","volume":"27","author":"H Peng","year":"2005","unstructured":"Peng H, Long F, Ding CHQ (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226\u20131238","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"778_CR26","unstructured":"Shi J, Malik J (1997) Normalized cuts and image segmentation. In: Proceedings of IEEE CVPR 1997. IEEE Computer Society, pp 731\u2013737"},{"key":"778_CR27","doi-asserted-by":"crossref","unstructured":"Stanojevic R, Nabeel M, Yu T (2017) Distributed cardinality estimation of set operations with differential privacy. In: Proceedings of IEEE PAC 2017. IEEE, pp 37\u201348","DOI":"10.1109\/PAC.2017.43"},{"key":"778_CR28","doi-asserted-by":"crossref","unstructured":"Su D, Cao J, Li N, Bertino E, Lyu M, Jin H (2017) Differentially private k-means clustering and a hybrid approach to private optimization. ACM Trans Priv Secur 20(4):16:1\u201316:33","DOI":"10.1145\/3133201"},{"key":"778_CR29","unstructured":"Velickovic P, Cucurull G, Casanova A, Romero A, Li\u00f2 P, Bengio Y (2018) Graph attention networks. In: Proceedings of ICLR 2018, OpenReview.net"},{"issue":"12","key":"778_CR30","doi-asserted-by":"publisher","first-page":"3081","DOI":"10.1109\/TIFS.2017.2737966","volume":"12","author":"C Xu","year":"2017","unstructured":"Xu C, Ren J, Zhang Y, Qin Z, Ren K (2017) Dppro: differentially private high-dimensional data release via random projection. IEEE Trans Inf Forensics Secur 12(12):3081\u20133093","journal-title":"IEEE Trans Inf Forensics Secur"},{"key":"778_CR31","doi-asserted-by":"crossref","unstructured":"Yamaguchi Y, Faloutsos C, Kitagawa H (2016) CAMLP: confidence-aware modulated label propagation. In: Proceedings of SIAM SDM 2016. SIAM, pp 513\u2013521","DOI":"10.1137\/1.9781611974348.58"},{"key":"778_CR32","doi-asserted-by":"crossref","unstructured":"Yang J, Li Y (2014) Differentially private feature selection. In: Proceedings of IJCNN 2014. IEEE, pp 4182\u20134189","DOI":"10.1109\/IJCNN.2014.6889613"},{"key":"778_CR33","unstructured":"Yu L, Liu H (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. Proceedings of ICML 2003, pp 856\u2013863"},{"key":"778_CR34","doi-asserted-by":"crossref","unstructured":"Zhang K, Wang Q, Chen Z, Marsic I, Kumar V, Jiang G, Zhang J (2015) From categorical to numerical: multiple transitive distance learning and embedding. In: Proceedings of SIAM SDM 2015. SIAM, pp 46\u201354","DOI":"10.1137\/1.9781611974010.6"},{"key":"778_CR35","doi-asserted-by":"crossref","unstructured":"Zhang Y, Cheung Y (2020) An ordinal data clustering algorithm with automated distance learning. In: Proceedings of AAAI 2020. AAAI Press, pp 6869\u20136876","DOI":"10.1609\/aaai.v34i04.6168"}],"container-title":["Data Mining and Knowledge Discovery"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10618-021-00778-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10618-021-00778-0\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10618-021-00778-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,4]],"date-time":"2024-09-04T00:43:15Z","timestamp":1725410595000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10618-021-00778-0"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,13]]},"references-count":35,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2021,9]]}},"alternative-id":["778"],"URL":"https:\/\/doi.org\/10.1007\/s10618-021-00778-0","relation":{},"ISSN":["1384-5810","1573-756X"],"issn-type":[{"value":"1384-5810","type":"print"},{"value":"1573-756X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,7,13]]},"assertion":[{"value":"20 November 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 June 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 July 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflicts of interest\/Competing interests."}},{"value":"Source code and scripts used in our experiments are available at.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Code availability"}}]}}