{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,12]],"date-time":"2026-02-12T18:07:34Z","timestamp":1770919654934,"version":"3.50.1"},"reference-count":26,"publisher":"SAGE Publications","issue":"2","license":[{"start":{"date-parts":[[2021,10,13]],"date-time":"2021-10-13T00:00:00Z","timestamp":1634083200000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["DS"],"published-print":{"date-parts":[[2021,10,13]]},"abstract":"<jats:p>The General Data Protection Regulation (GDPR) grants all natural persons the right to access their personal data if this is being processed by data controllers. The data controllers are obliged to share the data in an electronic format and often provide the data in a so called Data Download Package (DDP). These DDPs contain all data collected by public and private entities during the course of a citizens\u2019 digital life and form a treasure trove for social scientists. However, the data can be deeply private. To protect the privacy of research participants while using their DDPs for scientific research, we developed a de-identification algorithm that is able to handle typical characteristics of DDPs. These include regularly changing file structures, visual and textual content, differing file formats, differing file structures and private information like usernames. We investigate the performance of the algorithm and illustrate how the algorithm can be tailored towards specific DDP structures.<\/jats:p>","DOI":"10.3233\/ds-210035","type":"journal-article","created":{"date-parts":[[2021,9,7]],"date-time":"2021-09-07T15:09:12Z","timestamp":1631027352000},"page":"101-120","source":"Crossref","is-referenced-by-count":12,"title":["Automatic de-identification of data download packages"],"prefix":"10.1177","volume":"4","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3536-0474","authenticated-orcid":false,"given":"Laura","family":"Boeschoten","sequence":"first","affiliation":[{"name":"Department of Methodology and Statistics, Utrecht University, Utrecht, The Netherlands. E-mail:\u00a0l.boeschoten@uu.nl"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4411-8495","authenticated-orcid":false,"given":"Roos","family":"Voorvaart","sequence":"additional","affiliation":[{"name":"Research and Data Management Services, Utrecht University, Utrecht, The Netherlands. E-mail:\u00a0r.voorvaart@uu.nl"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3229-3015","authenticated-orcid":false,"given":"Ruben","family":"Van Den Goorbergh","sequence":"additional","affiliation":[{"name":"Department of Methodology and Statistics, Utrecht University, Utrecht, The Netherlands. E-mail:\u00a0r.vandengoorbergh@uu.nl"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6326-6680","authenticated-orcid":false,"given":"Casper","family":"Kaandorp","sequence":"additional","affiliation":[{"name":"Research and Data Management Services, Utrecht University, Utrecht, The Netherlands. E-mail:\u00a0c.s.kaandorp@uu.nl"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5301-1713","authenticated-orcid":false,"given":"Martine","family":"De Vos","sequence":"additional","affiliation":[{"name":"Research and Data Management Services, Utrecht University, Utrecht, The Netherlands. E-mail:\u00a0m.g.devos@uu.nl"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"179","reference":[{"key":"10.3233\/DS-210035_ref2","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242598"},{"key":"10.3233\/DS-210035_ref3","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3343038","article-title":"A survey on privacy in social media","volume":"1","author":"Beigi","year":"2020","journal-title":"ACM\/IMS Transactions on Data Science"},{"key":"10.3233\/DS-210035_ref4","doi-asserted-by":"crossref","unstructured":"G.\u00a0Beigi, K.\u00a0Shu, R.\u00a0Guo, S.\u00a0Wang and H.\u00a0Liu, Privacy preserving text representation learning, in: Proceedings of the 30th ACM Conference on Hypertext and Social Media, 2019, pp.\u00a0275\u2013276, https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3342220.3344925.","DOI":"10.1145\/3342220.3344925"},{"issue":"1","key":"10.3233\/DS-210035_ref5","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41598-019-56847-4","article-title":"The effect of social media on well-being differs from adolescent to adolescent","volume":"10","author":"Beyens","year":"2020","journal-title":"Scientific Reports"},{"key":"10.3233\/DS-210035_ref8","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W15-1204"},{"issue":"5","key":"10.3233\/DS-210035_ref10","doi-asserted-by":"publisher","first-page":"627","DOI":"10.1197\/jamia.M2716","article-title":"Protecting privacy using k-anonymity","volume":"15","author":"El Emam","year":"2008","journal-title":"Journal of the American Medical Informatics Association"},{"key":"10.3233\/DS-210035_ref11","doi-asserted-by":"publisher","DOI":"10.34740\/KAGGLE\/DSV\/845275"},{"key":"10.3233\/DS-210035_ref12","unstructured":"G.D.P.\u00a0Regulation, Regulation (EU) 2016\/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95\/46, Official Journal of the European Union (OJ) 59(1\u201388) (2016), 294. https:\/\/eur-lex.europa.eu\/legal-content\/EN\/TXT\/PDF\/?uri=OJ:L:2016:119:FULL&from=EN."},{"key":"10.3233\/DS-210035_ref13","unstructured":"P.M.\u00a0Heider, J.S.\u00a0Obeid and S.M.\u00a0Meystre, A comparative analysis of speed and accuracy for three off-the-shelf de-identification tools, AMIA Summits on Translational Science Proceedings 2020 (2020), 241. PMCID: PMC7233098."},{"key":"10.3233\/DS-210035_ref14","doi-asserted-by":"publisher","DOI":"10.1002\/9781118348239"},{"issue":"6018","key":"10.3233\/DS-210035_ref15","doi-asserted-by":"publisher","first-page":"719","DOI":"10.1126\/science.1197872","article-title":"Ensuring the data-rich future of the social sciences","volume":"331","author":"King","year":"2011","journal-title":"Science"},{"key":"10.3233\/DS-210035_ref17","doi-asserted-by":"publisher","first-page":"82","DOI":"10.1097\/MLR.0b013e3182585355","article-title":"Strategies for de-identification and anonymization of electronic health record data for use in multicenter research studies","volume":"50","author":"Kushida","year":"2012","journal-title":"Medical Care"},{"key":"10.3233\/DS-210035_ref19","doi-asserted-by":"crossref","unstructured":"H.\u00a0Mao, X.\u00a0Shuai and A.\u00a0Kapadia, Loose tweets: An analysis of privacy leaks on Twitter, in: Proceedings of the 10th Annual ACM Workshop on Privacy in the Electronic Society, 2011, pp.\u00a01\u201312, https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2046556.2046558.","DOI":"10.1145\/2046556.2046558"},{"issue":"4","key":"10.3233\/DS-210035_ref21","doi-asserted-by":"publisher","first-page":"727","DOI":"10.1016\/j.tele.2017.08.002","article-title":"DEDUCE: A pattern matching method for automatic de-identification of Dutch medical text","volume":"35","author":"Menger","year":"2018","journal-title":"Telematics and Informatics"},{"key":"10.3233\/DS-210035_ref23","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2008.33"},{"key":"10.3233\/DS-210035_ref24","doi-asserted-by":"publisher","first-page":"575","DOI":"10.1146\/annurev.med.57.121304.131257","article-title":"The health insurance portability and accountability act of 1996 (HIPAA) privacy rule: Implications for clinical research","volume":"57","author":"Nosowsky","year":"2006","journal-title":"Annu. Rev. Med."},{"key":"10.3233\/DS-210035_ref26","first-page":"2825","article-title":"Scikit-learn: Machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"the Journal of Machine Learning Research"},{"key":"10.3233\/DS-210035_ref27","unstructured":"F.\u00a0Prasser, F.\u00a0Kohlmayer, R.\u00a0Lautenschl\u00e4ger and K.A.\u00a0Kuhn, Arx-a comprehensive tool for anonymizing biomedical data, in: AMIA Annual Symposium Proceedings, Vol.\u00a02014, American Medical Informatics Association, 2014, p.\u00a0984, https:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC4419984\/."},{"key":"10.3233\/DS-210035_ref28","first-page":"131","article-title":"De-identification for privacy protection in multimedia content: A survey","volume":"47","author":"Ribaric","year":"2016","journal-title":"Signal Processing: Image Communication"},{"issue":"5","key":"10.3233\/DS-210035_ref30","doi-asserted-by":"publisher","first-page":"557","DOI":"10.1142\/S0218488502001648","article-title":"k-anonymity: A model for protecting privacy","volume":"10","author":"Sweeney","year":"2002","journal-title":"International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems"},{"key":"10.3233\/DS-210035_ref31","doi-asserted-by":"publisher","DOI":"10.18637\/jss.v067.i04"},{"key":"10.3233\/DS-210035_ref32","first-page":"3","article-title":"Comparing rule-based, feature-based and deep neural methods for de-identification of Dutch medical records","volume":"2551","author":"Trienes","year":"2020","journal-title":"CEUR Workshop Proceedings"},{"issue":"5","key":"10.3233\/DS-210035_ref33","doi-asserted-by":"publisher","first-page":"550","DOI":"10.1197\/jamia.M2444","article-title":"Evaluating the state-of-the-art in automatic de-identification","volume":"14","author":"Uzuner","year":"2007","journal-title":"Journal of the American Medical Informatics Association"},{"key":"10.3233\/DS-210035_ref34","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.4088798"},{"issue":"10","key":"10.3233\/DS-210035_ref36","doi-asserted-by":"publisher","first-page":"1499","DOI":"10.1109\/LSP.2016.2603342","article-title":"Joint face detection and alignment using multi-task cascaded convolutional networks","volume":"23","author":"Zhang","year":"2016","journal-title":"IEEE Signal Processing Letters"},{"key":"10.3233\/DS-210035_ref37","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.283"}],"container-title":["Data Science"],"original-title":[],"link":[{"URL":"https:\/\/content.iospress.com\/download?id=10.3233\/DS-210035","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,26]],"date-time":"2024-11-26T13:24:33Z","timestamp":1732627473000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.medra.org\/servlet\/aliasResolver?alias=iospress&doi=10.3233\/DS-210035"}},"subtitle":[],"editor":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5747-9927","authenticated-orcid":false,"given":"Thomas","family":"Maillart","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2021,10,13]]},"references-count":26,"journal-issue":{"issue":"2"},"URL":"https:\/\/doi.org\/10.3233\/ds-210035","relation":{},"ISSN":["2451-8492","2451-8484"],"issn-type":[{"value":"2451-8492","type":"electronic"},{"value":"2451-8484","type":"print"}],"subject":[],"published":{"date-parts":[[2021,10,13]]}}}