{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,17]],"date-time":"2026-04-17T05:41:38Z","timestamp":1776404498367,"version":"3.51.2"},"reference-count":15,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2021,7]]},"abstract":"<jats:p>Thanks to the numerous machine learning tools available to us nowadays, it is easier than ever to derive a model from a dataset in the frame of a supervised learning problem. However, when this model behaves poorly compared with an expected performance, the underlying question of the existence of such a model is often underlooked and one might just be tempted to try different parameters or just choose another model architecture. This is why the quality of the learning examples should be considered as early as possible as it acts as a go\/no go signal for the following potentially costly learning process. With ADESIT, we provide a way to evaluate the ability of a dataset to perform well for a given supervised learning problem through statistics and visual exploration. Notably, we base our work on recent studies proposing the use of functional dependencies and specifically counterexample analysis to provide dataset cleanliness statistics but also a theoretical upper bound on the prediction accuracy directly linked to the problem settings (measurement uncertainty, expected generalization...). In brief, ADESIT is intended to be part of an iterative data refinement process right after data selection and right before the machine learning process itself. 
With further analysis for a given problem, the user can characterize, clean and export dynamically selected subsets, allowing to better understand what regions of the data could be refined and where the data precision must be improved by using, for example, new or more precise sensors.<\/jats:p>","DOI":"10.14778\/3476311.3476318","type":"journal-article","created":{"date-parts":[[2021,10,28]],"date-time":"2021-10-28T22:48:43Z","timestamp":1635461323000},"page":"2679-2682","source":"Crossref","is-referenced-by-count":4,"title":["ADESIT"],"prefix":"10.14778","volume":"14","author":[{"given":"Pierre","family":"Faure-Giovagnoli","sequence":"first","affiliation":[{"name":"UCBL, Villeurbanne, France and Compagnie Nationale du Rh\u00f4ne, Lyon, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Marie Le","family":"Guilly","sequence":"additional","affiliation":[{"name":"UCBL, Villeurbanne, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jean-Marc","family":"Petit","sequence":"additional","affiliation":[{"name":"UCBL, Villeurbanne, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Vasile-Marian","family":"Scuturici","sequence":"additional","affiliation":[{"name":"UCBL, Villeurbanne, France"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,10,28]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2015.2472010"},{"key":"e_1_2_1_2_1","volume-title":"End-to-end entity resolution for big data: A survey. arXiv preprint arXiv:1905.06397","author":"Christophides Vassilis","year":"2019","unstructured":"Vassilis Christophides, Vasilis Efthymiou, Themis Palpanas, George Papadakis, and Kostas Stefanidis. 2019. End-to-end entity resolution for big data: A survey. arXiv preprint arXiv:1905.06397 (2019)."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536258.2536262"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.5555\/1757788.1757831"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/1376916.1376940"},{"key":"e_1_2_1_6_1","volume-title":"Introduction to statistical pattern recognition","author":"Fukunaga Keinosuke","unstructured":"Keinosuke Fukunaga. 2013. Introduction to statistical pattern recognition. Elsevier."},{"key":"e_1_2_1_7_1","volume-title":"Computers and intractability","author":"Garey Michael R","unstructured":"Michael R Garey and David S Johnson. 1979. Computers and intractability. Vol. 174. freeman San Francisco."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611976229.1"},{"key":"e_1_2_1_9_1","volume-title":"TANE: An efficient algorithm for discovering functional and approximate dependencies. The computer journal 42, 2","author":"Huhtala Yka","year":"1999","unstructured":"Yka Huhtala, Juha K\u00e4rkk\u00e4inen, Pasi Porkka, and Hannu Toivonen. 1999. TANE: An efficient algorithm for discovering functional and approximate dependencies. The computer journal 42, 2 (1999), 100--111."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.5555\/210562.210505"},{"key":"e_1_2_1_11_1","unstructured":"Lars Kolb, Andreas Thor, and Erhard Rahm. 2010. Parallel Sorted Neighborhood Blocking with MapReduce. arXiv:1010.3053 [cs.DC]"},{"key":"e_1_2_1_12_1","first-page":"132","article-title":"Evaluating Classification Feasibility Using Functional Dependencies","volume":"44","author":"Guilly Marie Le","year":"2020","unstructured":"Marie Le Guilly, Jean-Marc Petit, and Vasile-Marian Scuturici. 2020. Evaluating Classification Feasibility Using Functional Dependencies. Trans. Large Scale Data Knowl. Centered Syst. 44 (2020), 132--159.","journal-title":"Trans. Large Scale Data Knowl. Centered Syst."},{"key":"e_1_2_1_13_1","first-page":"3","article-title":"Data cleaning: Problems and current approaches","volume":"23","author":"Rahm Erhard","year":"2000","unstructured":"Erhard Rahm and Hong Hai Do. 2000. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23, 4 (2000), 3--13.","journal-title":"IEEE Data Eng. 
Bull."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3070647"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.14778\/3342263.3342626"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3476311.3476318","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T11:27:14Z","timestamp":1672226834000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3476311.3476318"}},"subtitle":["visualize the limits of your data in a machine learning process"],"short-title":[],"issued":{"date-parts":[[2021,7]]},"references-count":15,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2021,7]]}},"alternative-id":["10.14778\/3476311.3476318"],"URL":"https:\/\/doi.org\/10.14778\/3476311.3476318","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2021,7]]}}}