{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,17]],"date-time":"2025-09-17T03:18:53Z","timestamp":1758079133012,"version":"3.44.0"},"reference-count":9,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,8]]},"abstract":"<jats:p>Querying and analyzing data in data lakes requires substantial manual intervention, including numerous data preprocessing steps, and often demands complex domain expertise. However, the advent of Large Language Models (LLMs) has introduced a promising solution to these challenges by providing a unified framework for interpreting the heterogeneous datasets within data lakes. In this paper, we demonstrate QueryArtisan, a novel LLM-powered analytical system tailored for data lakes. It enables users to issue complex queries in natural language without the need for domain-specific expertise. The system automatically executes user-submitted queries and performs data processing and analysis based on the query results. QueryArtisan extends beyond traditional ETL (Extract, Transform, Load) processes by generating just-in-time code customized for dataset-specific tasks. A suite of heterogeneous operators is developed to process data across various modalities. In addition, a cost-based query optimization mechanism is integrated to improve the efficiency of the generated code. Furthermore, QueryArtisan can dynamically instantiate multiple agents in response to user-defined analytical requirements to perform further in-depth analysis of the retrieved data.<\/jats:p>","DOI":"10.14778\/3750601.3750647","type":"journal-article","created":{"date-parts":[[2025,9,16]],"date-time":"2025-09-16T13:38:05Z","timestamp":1758029885000},"page":"5263-5266","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["A Demonstration of QueryArtisan: Real-Time Data Lake Analysis via Dynamically Generated Data Manipulation Code"],"prefix":"10.14778","volume":"18","author":[{"given":"Wenhao","family":"Liu","sequence":"first","affiliation":[{"name":"Zhejiang University"}]},{"given":"Xiu","family":"Tang","sequence":"additional","affiliation":[{"name":"Zhejiang University"}]},{"given":"Sai","family":"Wu","sequence":"additional","affiliation":[{"name":"Zhejiang University"}]},{"given":"Chang","family":"Yao","sequence":"additional","affiliation":[{"name":"Zhejiang University"}]},{"given":"Gongsheng","family":"Yuan","sequence":"additional","affiliation":[{"name":"Zhejiang University"}]},{"given":"Gang","family":"Chen","sequence":"additional","affiliation":[{"name":"Zhejiang University"}]}],"member":"320","published-online":{"date-parts":[[2025,9,16]]},"reference":[{"key":"e_1_2_1_1_1","first-page":"1","article-title":"Termite: a system for tunneling through heterogeneous data. In aiDM@SIGMOD","volume":"7","author":"Fernandez Raul Castro","year":"2019","unstructured":"Raul Castro Fernandez and Samuel Madden. 2019. Termite: a system for tunneling through heterogeneous data. In aiDM@SIGMOD. ACM, 7:1\u20137:8.","journal-title":"ACM"},{"key":"e_1_2_1_2_1","volume-title":"Yap","author":"Halim Felix","year":"2009","unstructured":"Felix Halim, Panagiotis Karras, and Roland H. C. Yap. 2009. Fast and effective histogram construction. In CIKM. ACM, 1167\u20131176."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352116"},{"key":"e_1_2_1_4_1","doi-asserted-by":"crossref","unstructured":"Natasha F. Noy. 2020. When the Web is your Data Lake: Creating a Search Engine for Datasets on the Web. In SIGMOD. ACM 801.","DOI":"10.1145\/3318464.3393815"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.14778\/3705829.3705832"},{"key":"e_1_2_1_6_1","volume-title":"Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In EMNLP. 3911\u20133921.","author":"Yu Tao","year":"2018","unstructured":"Tao Yu, Rui Zhang, Kai Yang, and et al. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In EMNLP. 3911\u20133921."},{"key":"e_1_2_1_7_1","doi-asserted-by":"crossref","unstructured":"Qin Yuan Ye Yuan Zhenyu Wen He Wang and Shiyuan Tang. 2023. An effective framework for enhancing query answering in a heterogeneous data lake. In SIGIR. 770\u2013780.","DOI":"10.1145\/3539618.3591637"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920944"},{"key":"e_1_2_1_9_1","volume-title":"Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103","author":"Zhong Victor","year":"2017","unstructured":"Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103 (2017)."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3750601.3750647","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,16]],"date-time":"2025-09-16T13:38:57Z","timestamp":1758029937000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3750601.3750647"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8]]},"references-count":9,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2025,8]]}},"alternative-id":["10.14778\/3750601.3750647"],"URL":"https:\/\/doi.org\/10.14778\/3750601.3750647","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2025,8]]},"assertion":[{"value":"2025-09-16","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}