{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,8,1]],"date-time":"2026-08-01T07:37:06Z","timestamp":1785569826654,"version":"3.56.0"},"reference-count":50,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2018,8]]},"abstract":"<jats:p>Modern companies and institutions rely on data to guide every single business process and decision. Missing or incorrect information seriously compromises any decision process downstream. Therefore, a crucial, but tedious task for everyone involved in data processing is to verify the quality of their data. We present a system for automating the verification of data quality at scale, which meets the requirements of production use cases. Our system provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables 'unit tests' for data. We efficiently execute the resulting constraint validation workload by translating it to aggregation queries on Apache Spark. Our platform supports the incremental validation of data quality on growing datasets, and leverages machine learning, e.g., for enhancing constraint suggestions, for estimating the 'predictability' of a column, and for detecting anomalies in historic data quality time series. 
We discuss our design decisions, describe the resulting system architecture, and present an experimental evaluation on various datasets.<\/jats:p>","DOI":"10.14778\/3229863.3229867","type":"journal-article","created":{"date-parts":[[2018,9,10]],"date-time":"2018-09-10T12:12:28Z","timestamp":1536581548000},"page":"1781-1794","source":"Crossref","is-referenced-by-count":195,"title":["Automating large-scale data quality verification"],"prefix":"10.14778","volume":"11","author":[{"given":"Sebastian","family":"Schelter","sequence":"first","affiliation":[{"name":"Amazon Research"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Dustin","family":"Lange","sequence":"additional","affiliation":[{"name":"Amazon Research"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Philipp","family":"Schmidt","sequence":"additional","affiliation":[{"name":"Amazon Research"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Meltem","family":"Celikel","sequence":"additional","affiliation":[{"name":"Amazon Research"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Felix","family":"Biessmann","sequence":"additional","affiliation":[{"name":"Amazon Research"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Andreas","family":"Grafberger","sequence":"additional","affiliation":[{"name":"University of Augsburg and Amazon Research"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2018,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-015-0389-y"},{"key":"e_1_2_1_2_1","volume-title":"Machine Learning Systems workshop at ICML","author":"Andrews 
P.","year":"2016"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2742797"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/1541880.1541883"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3097983.3098021"},{"key":"e_1_2_1_6_1","doi-asserted-by":"crossref","unstructured":"P. Bohannon W. Fan F. Geerts X. Jia and A. Kementsietsidis. Conditional functional dependencies for data cleaning. ICDE 746--755 2007.","DOI":"10.1109\/ICDE.2007.367920"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137765.3137775"},{"key":"e_1_2_1_8_1","doi-asserted-by":"crossref","unstructured":"E. Breck S. Cai E. Nielsen M. Salib and D. Sculley. The ml test score: A rubric for ml production readiness and technical debt reduction. Big Data 1123--1132 2017.","DOI":"10.1109\/BigData.2017.8258038"},{"key":"e_1_2_1_9_1","volume-title":"SysML","author":"Breck E.","year":"2018"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2912574"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536258.2536262"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/564691.564719"},{"key":"e_1_2_1_13_1","unstructured":"V. Flunkert D. Salinas and J. Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. CoRR abs\/1704.04110 2017."},{"key":"e_1_2_1_14_1","unstructured":"H. Galhardas D. Florescu D. Shasha E. Simon and C.-A. Saita. Declarative data cleaning: Language model and algorithms. 
VLDB 371--380 2001."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/376284.375670"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2903730"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/MIS.2009.36"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3186728.3164145"},{"key":"e_1_2_1_19_1","volume-title":"United Nations Economic Commission for Europe (UNECE)","author":"Hellerstein J. M.","year":"2008"},{"key":"e_1_2_1_20_1","volume-title":"CIDR","author":"Hellerstein J. M.","year":"2017"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2452376.2452456"},{"key":"e_1_2_1_22_1","volume-title":"OTexts","author":"Hyndman R. J.","year":"2014"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1561\/1900000045"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2899409"},{"key":"e_1_2_1_25_1","unstructured":"S. Krishnan M. J. Franklin K. Goldberg and E. Wu. Boostclean: Automated error detection and repair for machine learning. CoRR abs\/1711.01299 2017."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2935694.2935698"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.5555\/2946645.2946679"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3076246.3076252"},{"key":"e_1_2_1_29_1","doi-asserted-by":"crossref","unstructured":"H. Miao A. Li L. S. Davis and A. Deshpande. Towards unified data and lifecycle management for deep learning. 
ICDE 571--582 2017.","DOI":"10.1109\/ICDE.2017.112"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.5555\/1791545"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2915203"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.5555\/1953048.2078195"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137765.3137789"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3054782"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137631"},{"key":"e_1_2_1_36_1","unstructured":"J. D. Rennie L. Shih J. Teevan and D. R. Karger. Tackling the poor assumptions of naive bayes text classifiers. ICML 616--623 2003."},{"key":"e_1_2_1_37_1","volume-title":"Interpretable ML Symposium at NIPS","author":"Rukat T.","year":"2017"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3186549.3186559"},{"key":"e_1_2_1_39_1","first-page":"1","article-title":"Data quality under a computer science perspective","volume":"2","author":"Scannapieco M.","year":"2002","journal-title":"Archivi & Computer"},{"key":"e_1_2_1_40_1","volume-title":"Automatically Tracking Metadata and Provenance of Machine Learning Experiments. Machine Learning Systems workshop at NIPS","author":"Schelter S.","year":"2017"},{"key":"e_1_2_1_41_1","volume-title":"SysML","author":"Schelter S.","year":"2018"},{"key":"e_1_2_1_42_1","unstructured":"D. Sculley G. Holt D. Golovin E. Davydov T. Phillips D. Ebner V. Chaudhary M. Young J. Crespo and D. Dennison. Hidden Technical Debt in Machine Learning Systems. 
NIPS 2503--2511 2015."},{"key":"e_1_2_1_43_1","doi-asserted-by":"crossref","unstructured":"E. R. Sparks S. Venkataraman T. Kaftan M. J. Franklin and B. Recht. KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics. ICDE 535--546 2017.","DOI":"10.1109\/ICDE.2017.109"},{"key":"e_1_2_1_44_1","doi-asserted-by":"crossref","unstructured":"C. Sun A. Shrivastava S. Singh and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. ICCV 843--852 2017.","DOI":"10.1109\/ICCV.2017.97"},{"key":"e_1_2_1_45_1","volume-title":"Automated Sanity Checking for ML Data Sets. Machine Learning Systems Workshop at NIPS","author":"Terry M.","year":"2017"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3076246.3076248"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/2641190.2641198"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939502.2939516"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.14778\/1952376.1952378"},{"key":"e_1_2_1_50_1","first-page":"95","article-title":"Spark: Cluster computing with working sets","author":"Zaharia M.","year":"2010","journal-title":"HotCloud"}],"container-title":["Proceedings of the VLDB 
Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3229863.3229867","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T10:11:59Z","timestamp":1672222319000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3229863.3229867"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,8]]},"references-count":50,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2018,8]]}},"alternative-id":["10.14778\/3229863.3229867"],"URL":"https:\/\/doi.org\/10.14778\/3229863.3229867","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2018,8]]}}}