{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T18:44:40Z","timestamp":1770749080357,"version":"3.50.0"},"reference-count":11,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,8]]},"abstract":"<jats:p>\n            Surrogate keys are now extensively utilized by database designers to implement keys in SQL tables. They are straightforward, easy to understand, enable efficient access, and are often considered a sufficient guarantee of data integrity despite lacking any real-world semantic meaning. In spite of all their benefits, one might wonder whether surrogate keys can negatively impact data quality. IT developers who rely exclusively on surrogate keys when designing database schemas may be tempted not to encode natural keys, as they are perceived as complex to manage at the application level. In such settings, surrogate keys allow the presence of so-called\n            <jats:italic toggle=\"yes\">artificial unicity<\/jats:italic>\n            , a complex form of redundancy that can be propagated through foreign keys, and other underlying data-quality issues. In the presence of artificial unicity, most data cleaning techniques, especially unsupervised ones, are likely to fail, making data preparation and analytics very challenging.\n          <\/jats:p>\n          <jats:p>\n            For relational databases implemented with surrogate keys but no natural keys, we developed RED2Hunt (RElational Databases REDundancy Hunting), a human-in-the-loop framework for identifying hidden redundancy and, if problems occur, cleaning the database. The framework was implemented on top of PostgreSQL within an eponymous web-based platform to guide the expert through its application. 
In this paper, we present a demonstration of the RED2Hunt tool through three interactive scenarios on a polluted instance of the publicly available\n            <jats:italic toggle=\"yes\">Perfect Pet<\/jats:italic>\n            database. During the demonstration, the visitor can take on one of two roles in the Perfect Pet database: a domain expert or a data scientist. As a domain expert, she will interact with RED2Hunt, for example to elicit natural keys, from simple yet very intuitive visualizations of tables' attributes. As a data scientist, she will explore two simple scenarios\u2014executing SQL queries or applying learning models\u2014on both the initial and cleaned databases to grasp the benefits of the approach.\n          <\/jats:p>","DOI":"10.14778\/3750601.3750651","type":"journal-article","created":{"date-parts":[[2025,9,16]],"date-time":"2025-09-16T13:38:05Z","timestamp":1758029885000},"page":"5279-5282","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Can Surrogate Keys Negatively Impact Data Quality?"],"prefix":"10.14778","volume":"18","author":[{"given":"Mathilde","family":"Marcy","sequence":"first","affiliation":[{"name":"INSA Lyon, CNRS, UCBL, LIRIS, Villeurbanne, France"}]},{"given":"Jean-Marc","family":"Petit","sequence":"additional","affiliation":[{"name":"INSA Lyon, CNRS, UCBL, LIRIS, Villeurbanne, France"}]},{"given":"Vasile-Marian","family":"Scuturici","sequence":"additional","affiliation":[{"name":"INSA Lyon, CNRS, UCBL, LIRIS, Villeurbanne, France"}]},{"given":"Jocelyn","family":"Bonjour","sequence":"additional","affiliation":[{"name":"INSA Lyon, CNRS, CETHIL, Villeurbanne, France"}]},{"given":"Camille","family":"Fertel","sequence":"additional","affiliation":[{"name":"CEMAFROID, Fresnes, France"}]},{"given":"Gerald","family":"Cavalier","sequence":"additional","affiliation":[{"name":"CEMAFROID, Fresnes, 
France"}]}],"member":"320","published-online":{"date-parts":[[2025,9,16]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-31164-2"},{"key":"e_1_2_1_2_1","doi-asserted-by":"crossref","first-page":"850611","DOI":"10.3389\/fdata.2022.850611","article-title":"A Survey of Data Quality Measurement and Monitoring Tools","volume":"5","author":"Ehrlinger Lisa","year":"2022","unstructured":"Lisa Ehrlinger and Wolfram W\u00f6\u00df. 2022. A Survey of Data Quality Measurement and Monitoring Tools. Frontiers in Big Data 5 (2022), 850611.","journal-title":"Frontiers in Big Data"},{"key":"e_1_2_1_3_1","volume-title":"Lubna Alzamil and Arash Termehchy","author":"Ga Young Lee Bakhtiyar Doskenov","year":"2021","unstructured":"Bakhtiyar Doskenov Ga Young Lee, Lubna Alzamil and Arash Termehchy. 2021. A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance. arXiv preprint arXiv:2109.07127 (2021)."},{"key":"e_1_2_1_4_1","volume-title":"Accessed","author":"Gregg Forest","year":"2025","unstructured":"Forest Gregg and Derek Eder. 2025. Dedupe. https:\/\/github.com\/dedupeio\/dedupe. open-source software, GitHub repository. Accessed: June 25, 2025."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1016\/0304-3975(95)00028-U"},{"key":"e_1_2_1_6_1","volume-title":"Moritz Feuerpfeil and Hazar Harmouch","author":"Andrea Nathansen Nele Noack Nina Ihde","year":"2022","unstructured":"Nina Ihde Andrea Nathansen Nele Noack Hendrik Patzlaff Felix Naumann Lukas Budach, Moritz Feuerpfeil and Hazar Harmouch. 2022. The Effects of Data Quality on Machine Learning Performance. arXiv preprint arXiv:2207.14529 (2022)."},{"key":"e_1_2_1_7_1","volume-title":"Accessed","author":"Petit Mathilde Marcy Jean-Marc","year":"2025","unstructured":"Jean-Marc Petit Mathilde Marcy and Vasile-Marian Scuturici. 2025. Perfect Pet: Synthetic Database Generation with Artificial Unicity. https:\/\/github.com\/mathildemarcy\/perfect_pet. 
open-source software, GitHub repository. Accessed: July 14, 2025."},{"key":"e_1_2_1_8_1","volume-title":"Jean-Marc Petit and Gerald Cavalier","author":"Scuturici Jocelyn Bonjour Camille Vasile-Marian","year":"2025","unstructured":"Vasile-Marian Scuturici Jocelyn Bonjour Camille Fertel Mathilde Marcy, Jean-Marc Petit and Gerald Cavalier. 2025. RED2Hunt: an Actionable Framework for Cleaning Operational Databases with Surrogate Keys. arXiv preprint arXiv:2503.20593 (2025)."},{"key":"e_1_2_1_9_1","first-page":"1237","article-title":"Efficient Discovery of Functional Dependencies and Armstrong Relations","volume":"28","author":"Milo Tova","year":"1999","unstructured":"Tova Milo and Dan Suciu. 1999. Efficient Discovery of Functional Dependencies and Armstrong Relations. SIAM J. Comput. 28, 4 (1999), 1237\u20131257.","journal-title":"SIAM J. Comput."},{"key":"e_1_2_1_10_1","volume-title":"Accessed","author":"Redman Thomas C.","year":"2017","unstructured":"Thomas C. Redman. 2017. Seizing Opportunity in Data Quality. https:\/\/sloanreview.mit.edu\/article\/seizing-opportunity-in-data-quality\/. (2017). Accessed: July 15, 2025."},{"key":"e_1_2_1_11_1","unstructured":"World Economic Forum. 2023. Data Unleashed: Empowering Small and Medium Enterprises (SMEs) for Innovation and Success. https:\/\/www3.weforum.org\/docs\/WEF_Data_Unleashed_Empowering_Small_and_Medium_Enterprises_(SMEs)_for_Innovation_and_Success_2023.pdf. 
Accessed: July 15, 2025."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3750601.3750651","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,16]],"date-time":"2025-09-16T13:44:06Z","timestamp":1758030246000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3750601.3750651"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8]]},"references-count":11,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2025,8]]}},"alternative-id":["10.14778\/3750601.3750651"],"URL":"https:\/\/doi.org\/10.14778\/3750601.3750651","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2025,8]]},"assertion":[{"value":"2025-09-16","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}