{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,1,1]],"date-time":"2025-01-01T05:07:44Z","timestamp":1735708064185,"version":"3.32.0"},"reference-count":20,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,8]]},"abstract":"<jats:p>Data preparation is an essential step in every data-related effort, from scientific projects in academia to data-driven decision-making in industry. Typically, data preparation is not an interesting piece of a project --- it transforms raw data into a format that enables further innovative work. Because such scripts are never intended to be interesting, are project-specific, and are written in general-purpose languages, they can be tedious to understand and difficult to verify. As a result, data preparation scripts can easily become a breeding ground for poor engineering and statistical practices. Ideally, data preparation scripts are \"admirably boring\" --- they should serve the project, but otherwise be as simple and as standard as possible. We propose a bottom-up script standardization framework that takes a user's data preparation script and transforms it into a simpler, more standardized version of itself. Our framework takes the user's script not as an unchangeable definition of correctness, but as a sketch of the user's intent. We embedded this framework in a system called LucidScript.<\/jats:p>","DOI":"10.14778\/3685800.3685864","type":"journal-article","created":{"date-parts":[[2024,11,8]],"date-time":"2024-11-08T17:25:21Z","timestamp":1731086721000},"page":"4317-4320","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["LucidScript: Bottom-Up Standardization for Data Preparation"],"prefix":"10.14778","volume":"17","author":[{"given":"Eugenie","family":"Lai","sequence":"first","affiliation":[{"name":"MIT"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yuze","family":"Lou","sequence":"additional","affiliation":[{"name":"University of Michigan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Brit","family":"Youngmann","sequence":"additional","affiliation":[{"name":"Technion"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Michael","family":"Cafarella","sequence":"additional","affiliation":[{"name":"MIT"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,11,8]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"GitHub. https:\/\/github.com\/ey-l\/bottom-up-script-standardization."},{"key":"e_1_2_1_2_1","unstructured":"GitHub Copilot your ai pair programmer. https:\/\/github.com\/features\/copilot."},{"key":"e_1_2_1_3_1","unstructured":"Sourcery automatically improve python code quality. https:\/\/sourcery.ai\/."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3360594"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313602"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-34770-3_10"},{"key":"e_1_2_1_7_1","first-page":"84","volume-title":"Variability in the analysis of a single neuroimaging dataset by many teams","author":"Botvinik-Nezer R.","year":"2020","unstructured":"R. Botvinik-Nezer, F. Holzmeister, C. Camerer, et al. Variability in the analysis of a single neuroimaging dataset by many teams. pages 84--88, 2020."},{"key":"e_1_2_1_8_1","volume-title":"Evaluating large language models trained on code","author":"Chen M.","year":"2021","unstructured":"M. Chen et al. Evaluating large language models trained on code, 2021."},{"key":"e_1_2_1_9_1","volume-title":"Auto-sklearn 2.0: Hands-free automl via meta-learning","author":"Feurer M.","year":"2020","unstructured":"M. Feurer, K. Eggensperger, S. Falkner, M. Lindauer, and F. Hutter. Auto-sklearn 2.0: Hands-free automl via meta-learning. 2020."},{"key":"e_1_2_1_10_1","volume-title":"CIDR","author":"Grafberger S.","year":"2021","unstructured":"S. Grafberger, J. Stoyanovich, and S. Schelter. Lightweight inspection of data preprocessing in native machine learning pipelines. In CIDR, 2021."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403261"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407831"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2254556.2254659"},{"key":"e_1_2_1_14_1","volume-title":"Gpt-4 technical report","author":"AI.","year":"2023","unstructured":"OpenAI. Gpt-4 technical report, 2023."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1502650.1502692"},{"volume-title":"https:\/\/streamlit.io\/","year":"2024","key":"e_1_2_1_16_1","unstructured":"streamlit. Streamlit. https:\/\/streamlit.io\/, 2024. Accessed: April 12, 2024."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3062341.3062365"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389738"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476303"},{"key":"e_1_2_1_20_1","first-page":"783","volume-title":"ICSE '19'","author":"Zhang J.","unstructured":"J. Zhang, X. Wang, H. Zhang, H. Sun, K. Wang, and X. Liu. A novel neural source code representation based on abstract syntax tree. In ICSE '19', pages 783--794."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3685800.3685864","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,12,31]],"date-time":"2024-12-31T05:27:19Z","timestamp":1735622839000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3685800.3685864"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8]]},"references-count":20,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2024,8]]}},"alternative-id":["10.14778\/3685800.3685864"],"URL":"https:\/\/doi.org\/10.14778\/3685800.3685864","relation":{},"ISSN":["2150-8097"],"issn-type":[{"type":"print","value":"2150-8097"}],"subject":[],"published":{"date-parts":[[2024,8]]},"assertion":[{"value":"2024-11-08","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}