{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T09:01:18Z","timestamp":1775638878770,"version":"3.50.1"},"reference-count":59,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,10]]},"abstract":"<jats:p>Query processing over data lakes is a challenging task, often requiring extensive data pre-processing activities such as data cleaning, transformation, and loading. However, the advent of Large Language Models (LLMs) has illuminated a new pathway to address these complexities by offering a unified approach to understanding the diverse datasets submerged in data lakes. In this paper, we introduce QueryArtisan, a novel LLM-powered analytic tool specifically designed for data lakes. QueryArtisan transcends traditional ETL (Extract, Transform, Load) processes by generating just-in-time code for dataset-specific queries. It eliminates the need for an intermediary schema, enabling users to query the data lake directly using natural language. To achieve this, we have developed a suite of heterogeneous operators capable of processing data across various modalities. Additionally, QueryArtisan incorporates a cost model-based query optimization technique, significantly enhancing its code generation capabilities for efficient query resolution. 
Our extensive experimental evaluations, conducted with real-life datasets, demonstrate that QueryArtisan markedly outperforms existing solutions in terms of effectiveness, efficiency and usability.<\/jats:p>","DOI":"10.14778\/3705829.3705832","type":"journal-article","created":{"date-parts":[[2025,2,28]],"date-time":"2025-02-28T23:21:06Z","timestamp":1740784866000},"page":"108-116","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["QueryArtisan: Generating Data Manipulation Codes for Ad-hoc Analysis in Data Lakes"],"prefix":"10.14778","volume":"18","author":[{"given":"Xiu","family":"Tang","sequence":"first","affiliation":[{"name":"Zhejiang University, Hangzhou High-Tech Zone (Binjiang) Blockchain and Data Security Research Institute"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wenhao","family":"Liu","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou High-Tech Zone (Binjiang) Blockchain and Data Security Research Institute"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sai","family":"Wu","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou High-Tech Zone (Binjiang) Blockchain and Data Security Research Institute"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chang","family":"Yao","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou High-Tech Zone (Binjiang) Blockchain and Data Security Research Institute"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Gongsheng","family":"Yuan","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou High-Tech Zone (Binjiang) Blockchain and Data Security Research 
Institute"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shanshan","family":"Ying","sequence":"additional","affiliation":[{"name":"ApeCloud"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Gang","family":"Chen","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou High-Tech Zone (Binjiang) Blockchain and Data Security Research Institute"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,2,28]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"DataXFormer: A robust transformation discovery system","author":"Abedjan Ziawasch","unstructured":"Ziawasch Abedjan, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, and Michael Stonebraker. 2016. DataXFormer: A robust transformation discovery system. In ICDE. IEEE Computer Society, 1134--1145."},{"key":"e_1_2_1_2_1","volume-title":"Unsupervised Matching of Data and Text","author":"Ahmadi Naser","unstructured":"Naser Ahmadi, Hansjorg Sand, and Paolo Papotti. 2022. Unsupervised Matching of Data and Text. In ICDE. IEEE, 1058--1070."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.is.2021.101846"},{"key":"e_1_2_1_4_1","volume-title":"Dataset Discovery in Data Lakes","author":"Bogatu Alex","unstructured":"Alex Bogatu, Alvaro A. A. Fernandes, Norman W. Paton, and Nikolaos Konstantinou. 2020. Dataset Discovery in Data Lakes. In ICDE. IEEE, 709--720."},{"key":"e_1_2_1_5_1","volume-title":"Entity Matching on Unstructured Data: An Active Learning Approach","author":"Brunner Ursin","unstructured":"Ursin Brunner and Kurt Stockinger. 2019. Entity Matching on Unstructured Data: An Active Learning Approach. In SDS. IEEE, 97--102."},{"key":"e_1_2_1_6_1","volume-title":"Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, et al.","author":"Deng Dong","year":"2017","unstructured":"Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, et al. 2017. 
The Data Civilizer System. In CIDR."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3542700.3542709"},{"key":"e_1_2_1_8_1","volume-title":"Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach","author":"Dong Yuyang","unstructured":"Yuyang Dong, Kunihiro Takeoka, Chuan Xiao, and Masafumi Oyamada. 2021. Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach. In ICDE. IEEE, 456--467."},{"key":"e_1_2_1_9_1","volume-title":"Cross Modal Data Discovery over Structured and Unstructured Data Lakes. arXiv preprint arXiv:2306.00932","author":"Eltabakh Mohamed Y","year":"2023","unstructured":"Mohamed Y Eltabakh, Mayuresh Kunjir, Ahmed Elmagarmid, and Mohammad Shahmeer Ahmad. 2023. Cross Modal Data Discovery over Structured and Unstructured Data Lakes. arXiv preprint arXiv:2306.00932 (2023)."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611479.3611533"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.14778\/3587136.3587146"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.139"},{"key":"e_1_2_1_13_1","volume-title":"Aurum: A Data Discovery System","author":"Fernandez Raul Castro","year":"2018","unstructured":"Raul Castro Fernandez, Ziawasch Abedjan, Famien Koko, Gina Yuan, Samuel Madden, and Michael Stonebraker. 2018. Aurum: A Data Discovery System. In ICDE. IEEE Computer Society, 1001--1012."},{"key":"e_1_2_1_14_1","first-page":"1","article-title":"Termite: a system for tunneling through heterogeneous data. In aiDM@SIGMOD","volume":"7","author":"Fernandez Raul Castro","year":"2019","unstructured":"Raul Castro Fernandez and Samuel Madden. 2019. Termite: a system for tunneling through heterogeneous data. In aiDM@SIGMOD. 
ACM, 7:1--7:8.","journal-title":"ACM"},{"key":"e_1_2_1_15_1","volume-title":"Abdulhakim Ali Qahtan, et al","author":"Fernandez Raul Castro","year":"2018","unstructured":"Raul Castro Fernandez, Essam Mansour, Abdulhakim Ali Qahtan, et al. 2018. Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery. In ICDE. IEEE Computer Society, 989--1000."},{"key":"e_1_2_1_16_1","volume-title":"Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment","author":"Fernandez Raul Castro","year":"2019","unstructured":"Raul Castro Fernandez, Jisoo Min, Demitri Nava, and Samuel Madden. 2019. Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment. In ICDE. IEEE, 1190--1201."},{"key":"e_1_2_1_17_1","unstructured":"Dawei Gao Haibin Wang Yaliang Li Xiuyu Sun et al. 2023. Text-to-sql empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363 (2023)."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2899389"},{"key":"e_1_2_1_19_1","volume-title":"Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang.","author":"Halevy Alon Y.","year":"2016","unstructured":"Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. 2016. Goods: Organizing Google's Datasets. In SIGMOD. ACM, 795--806."},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009","author":"Halim Felix","year":"2009","unstructured":"Felix Halim, Panagiotis Karras, and Roland H. C. Yap. 2009. Fast and effective histogram construction. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2--6, 2009, David Wai-Lok Cheung, Il-Yeol Song, Wesley W. Chu, Xiaohua Hu, and Jimmy Lin (Eds.). 
ACM, 1167--1176."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.14778\/3574245.3574274"},{"key":"e_1_2_1_22_1","unstructured":"Jey Han Lau and Timothy Baldwin. 2016. An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. In Rep4NLP@ACL."},{"key":"e_1_2_1_23_1","volume-title":"Hasan Abed Al Kader Hammoud, et al","author":"Li Guohao","year":"2023","unstructured":"Guohao Li, Hasan Abed Al Kader Hammoud, et al. 2023. Camel: Communicative agents for\" mind\" exploration of large scale language model society. arXiv preprint arXiv:2303.17760 (2023)."},{"key":"e_1_2_1_24_1","unstructured":"Jinyang Li Binyuan Hui Ge Qu Binhua Li et al. 2023. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. arXiv preprint arXiv:2305.03111 (2023)."},{"key":"e_1_2_1_25_1","doi-asserted-by":"crossref","unstructured":"Yujia Li David Choi Junyoung Chung Nate Kushman et al. 2022. Competition-level code generation with alphacode. Science 378 6624 (2022) 1092--1097.","DOI":"10.1126\/science.abq1158"},{"key":"e_1_2_1_26_1","volume-title":"Zheng Zhang, Minlie Huang, and Tat-Seng Chua.","author":"Liao Lizi","year":"2021","unstructured":"Lizi Liao, Le Hong Long, Zheng Zhang, Minlie Huang, and Tat-Seng Chua. 2021. MMConv: An Environment for Multimodal Conversational Search across Multiple Domains. In SIGIR. ACM, 675--684."},{"key":"e_1_2_1_27_1","volume-title":"TAPEX: Table Pretraining via Learning a Neural SQL Executor. In ICLR.","author":"Liu Qian","year":"2022","unstructured":"Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, et al. 2022. TAPEX: Table Pretraining via Learning a Neural SQL Executor. In ICLR."},{"key":"e_1_2_1_28_1","unstructured":"Meta. [n.d.]. Meta Llama 3. https:\/\/huggingface.co\/meta-llama\/Meta-Llama-3-70B-Instruct."},{"key":"e_1_2_1_29_1","unstructured":"Yev Meyer Marjan Emadi Dhruv Nathawani et al. 2024. 
Synthetic-Text-To-SQL: A synthetic dataset for training language models to generate SQL queries from natural language prompts."},{"key":"e_1_2_1_30_1","unstructured":"[n. d.]. [n.d.]. Bird leaderboard. https:\/\/bird-bench.github.io\/."},{"key":"e_1_2_1_31_1","unstructured":"[n. d.]. [n.d.]. Ebay website. https:\/\/www.ebay.com."},{"key":"e_1_2_1_32_1","unstructured":"[n. d.]. [n.d.]. Spider leaderboard. https:\/\/yale-lily.github.io\/spider."},{"key":"e_1_2_1_33_1","unstructured":"[n. d.]. [n.d.]. WikiSQL leaderboard. https:\/\/github.com\/salesforce\/WikiSQL."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352116"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.14778\/3192965.3192973"},{"key":"e_1_2_1_36_1","volume-title":"International Conference on Machine Learning. PMLR, 26106--26128","author":"Ni Ansong","year":"2023","unstructured":"Ansong Ni, Srini Iyer, Dragomir Radev, et al. 2023. Lever: Learning to verify language-to-code generation with execution. In International Conference on Machine Learning. PMLR, 26106--26128."},{"key":"e_1_2_1_37_1","doi-asserted-by":"crossref","unstructured":"Natasha F. Noy. 2020. When the Web is your Data Lake: Creating a Search Engine for Datasets on the Web. In SIGMOD. ACM 801.","DOI":"10.1145\/3318464.3393815"},{"key":"e_1_2_1_38_1","unstructured":"OpenAI. [n.d.]. OpenAI API. https:\/\/api.openai.com\/."},{"key":"e_1_2_1_39_1","volume-title":"Jose Luis Beltran, and Ravigopal Vennelakanti","author":"Peng Marnith","year":"2020","unstructured":"Marnith Peng, Jose Luis Beltran, and Ravigopal Vennelakanti. 2020. Entity Matching from Unstructured and Dissimilar Data Collections: Semantic and Content Distribution Approach. In IMMS. ACM, 29--33."},{"key":"e_1_2_1_40_1","volume-title":"Toolformer: Language models can teach themselves to use tools. 
arXiv preprint arXiv:2302.04761","author":"Schick Timo","year":"2023","unstructured":"Timo Schick, Jane Dwivedi-Yu, Roberto Dess\u00ec, Roberta Raileanu, Lomeli, et al. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761 (2023)."},{"key":"e_1_2_1_41_1","unstructured":"Noah Shinn Beck Labash et al. 2023. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366 (2023)."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.14778\/3494124.3494149"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.14778\/3551793.3551841"},{"key":"e_1_2_1_44_1","doi-asserted-by":"crossref","unstructured":"Bailin Wang Richard Shin Xiaodong Liu et al. 2020. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In ACL. ACL 7567--7578.","DOI":"10.18653\/v1\/2020.acl-main.677"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3534678.3539305"},{"key":"e_1_2_1_46_1","unstructured":"Jason Wei Xuezhi Wang Dale Schuurmans et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS."},{"key":"e_1_2_1_47_1","first-page":"24824","article-title":"Chain-of-thought prompting elicits reasoning in large language models","volume":"35","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS 35 (2022), 24824--24837.","journal-title":"NeurIPS"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/1247480.1247494"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.findings-naacl.141"},{"key":"e_1_2_1_50_1","volume-title":"React: Synergizing reasoning and acting in language models. 
arXiv preprint arXiv:2210.03629","author":"Yao Shunyu","year":"2022","unstructured":"Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629 (2022)."},{"key":"e_1_2_1_51_1","volume-title":"Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887","author":"Yu Tao","year":"2018","unstructured":"Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887 (2018)."},{"key":"e_1_2_1_52_1","doi-asserted-by":"crossref","unstructured":"Qin Yuan Ye Yuan Zhenyu Wen et al. 2023. An effective framework for enhancing query answering in a heterogeneous data lake. In SIGIR. 770--780.","DOI":"10.1145\/3539618.3591637"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2012.222"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.14778\/3402707.3402726"},{"key":"e_1_2_1_55_1","volume-title":"Parameter Curation and Data Generation for Benchmarking Multi-model Queries. In VLDB (CEUR Workshop Proceedings)","volume":"2175","author":"Zhang Chao","year":"2018","unstructured":"Chao Zhang. 2018. Parameter Curation and Data Generation for Benchmarking Multi-model Queries. In VLDB (CEUR Workshop Proceedings), Vol. 2175."},{"key":"e_1_2_1_56_1","doi-asserted-by":"crossref","unstructured":"Hanchong Zhang Ruisheng Cao Lu Chen Hongshen Xu and Kai Yu. 2023. ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought. 
In EMNLP.","DOI":"10.18653\/v1\/2023.findings-emnlp.227"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920944"},{"key":"e_1_2_1_58_1","volume-title":"Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103","author":"Zhong Victor","year":"2017","unstructured":"Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103 (2017)."},{"key":"e_1_2_1_59_1","volume-title":"Miller","author":"Zhu Erkang","year":"2019","unstructured":"Erkang Zhu, Dong Deng, Fatemeh Nargesian, and Ren\u00e9e J. Miller. 2019. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In SIGMOD. ACM, 847--864."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3705829.3705832","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,28]],"date-time":"2025-02-28T23:25:38Z","timestamp":1740785138000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3705829.3705832"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10]]},"references-count":59,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,10]]}},"alternative-id":["10.14778\/3705829.3705832"],"URL":"https:\/\/doi.org\/10.14778\/3705829.3705832","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,10]]},"assertion":[{"value":"2025-02-28","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}