{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,19]],"date-time":"2026-01-19T01:44:37Z","timestamp":1768787077866,"version":"3.49.0"},"reference-count":32,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2020,12,17]],"date-time":"2020-12-17T00:00:00Z","timestamp":1608163200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGMOD Rec."],"published-print":{"date-parts":[[2020,12,17]]},"abstract":"<jats:p>Raw data are often messy: they follow different encodings, records are not well structured, values do not adhere to patterns, etc. Such data are in general not fit to be ingested by downstream applications, such as data analytics tools, or even by data management systems. The act of obtaining information from raw data relies on some data preparation process. Data preparation is integral to advanced data analysis and data management, not only for data science but for any data-driven applications. Existing data preparation tools are operational and useful, but there is still room for improvement and optimization. With increasing data volume and its messy nature, the demand for prepared data increases day by day.<\/jats:p>\n          <jats:p>To cater to this demand, companies and researchers are developing techniques and tools for data preparation. To better understand the available data preparation systems, we have conducted a survey to investigate (1) prominent data preparation tools, (2) distinctive tool features, (3) the need for preliminary data processing even for these tools and, (4) features and abilities that are still lacking. 
We conclude with an argument in support of automatic and intelligent data preparation beyond traditional and simplistic techniques.<\/jats:p>","DOI":"10.1145\/3444831.3444835","type":"journal-article","created":{"date-parts":[[2020,12,17]],"date-time":"2020-12-17T23:52:01Z","timestamp":1608249121000},"page":"18-29","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":42,"title":["Data Preparation"],"prefix":"10.1145","volume":"49","author":[{"given":"Mazhar","family":"Hameed","sequence":"first","affiliation":[{"name":"University of Potsdam, Potsdam, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Felix","family":"Naumann","sequence":"additional","affiliation":[{"name":"University of Potsdam, Potsdam, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2020,12,17]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Trifacta end user data preparation. https:\/\/www.trifacta.com\/wp-content\/uploads\/2018\/02\/End-User-Data-Preparation-Market-Study-2018.pdf. 
Accessed: 2019-09-19."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994518"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3027063.3053359"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2465327"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824112"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.330169"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544914"},{"key":"e_1_2_1_8_1","first-page":"305","volume-title":"Proceedings of the International Conference on Extending Database Technology (EDBT)","author":"Ehrlich Jens","year":"2016"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ifacol.2015.05.197"},{"key":"e_1_2_1_10_1","first-page":"473","volume-title":"Proceedings of the International Conference on Extending Database Technology (EDBT)","author":"Furche Tim","year":"2016"},{"key":"e_1_2_1_11_1","volume-title":"The costs of poor data quality. 
Journal of Industrial Engineering and Management (JIEM), 4(2):168--193","author":"Haug Anders","year":"2011"},{"issue":"2","key":"e_1_2_1_12_1","first-page":"23","article-title":"Self-service data preparation: Research to practice","volume":"41","author":"Hellerstein Joseph M","year":"2018","journal-title":"IEEE Data Engineering Bulletin"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3064034"},{"key":"e_1_2_1_14_1","article-title":"Interactive data exploration with smart drill-down (extended version)","author":"Joglekar Manas","year":"2017","journal-title":"IEEE Transactions on Knowledge and Data Engineering (TKDE), (1):1--1"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1177\/1473871611415994"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1978942.1979444"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/543613.543644"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.14778\/3297753.3297757"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/DAS.2014.9"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352116"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2590989.2590995"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824086"},{"key":"e_1_2_1_23_1","volume-title":"Forbes","author":"Press Gil","year":"2016"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.5555\/645927.672045"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/3165161"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.14778\/3007263.3007287"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.14778\/3025111.3025126"},{"issue":"2","key":"e_1_2_1_28_1","first-page":"3","article-title":"The current status and the way forward","volume":"41","author":"Stonebraker Michael","year":"2018","journal-title":"IEEE Data Engineering 
Bulletin"},{"key":"e_1_2_1_29_1","volume-title":"Proceedings of the Conference on Innovative Data Systems Research (CIDR)","author":"Terrizzano Ignacio G","year":"2015"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3193569"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.4135\/9781412986069"},{"key":"e_1_2_1_32_1","doi-asserted-by":"crossref","unstructured":"Shichao Zhang Chengqi Zhang and Qiang Yang. Data preparation for data mining. Applied artificial intelligence 17(5--6):375--381","DOI":"10.1080\/713827180"}],"container-title":["ACM SIGMOD Record"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3444831.3444835","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3444831.3444835","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:28:13Z","timestamp":1750195693000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3444831.3444835"}},"subtitle":["A Survey of Commercial Tools"],"short-title":[],"issued":{"date-parts":[[2020,12,17]]},"references-count":32,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2020,12,17]]}},"alternative-id":["10.1145\/3444831.3444835"],"URL":"https:\/\/doi.org\/10.1145\/3444831.3444835","relation":{},"ISSN":["0163-5808"],"issn-type":[{"value":"0163-5808","type":"print"}],"subject":[],"published":{"date-parts":[[2020,12,17]]},"assertion":[{"value":"2020-12-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}