{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T18:06:54Z","timestamp":1757614014770,"version":"3.44.0"},"reference-count":47,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,7]]},"abstract":"<jats:p>Most recently, researchers have started building large language models (LLMs) powered data systems that allow users to analyze unstructured text documents like working with a database because LLMs are very effective in extracting attributes from documents. In such systems, LLM-based extraction operations constitute the performance bottleneck of query execution due to the high monetary cost and slow LLM inference. Existing systems typically borrow the query optimization principles popular in relational databases to produce query execution plans, which unfortunately are ineffective in minimizing LLM cost. To fill this gap, we propose QUEST, which features a bunch of novel optimization strategies for unstructured document analysis. First, we introduce an index-based strategy to minimize the cost of each extraction operation. With this index, QUEST quickly retrieves the text segments relevant to the target attributes and only feeds them to LLMs. Furthermore, we design an evidence-augmented retrieval strategy to reduce the possibility of missing relevant segments. Moreover, we develop an instance-optimized query execution strategy: because the attribute extraction cost could vary significantly document by document, QUEST produces different plans for different documents. For each document, QUEST produces a plan to minimize the frequency of attribute extraction. The innovations include LLM cost-aware operator ordering strategies and an optimized join execution approach that transforms joins into filters. 
Extensive experiments on 3 real-world datasets demonstrate the superiority of QUEST, achieving 30%-6\u00d7 cost savings while improving the F1 score by 10%-27% compared with state-of-the-art baselines.<\/jats:p>","DOI":"10.14778\/3749646.3749713","type":"journal-article","created":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T17:55:06Z","timestamp":1757008506000},"page":"4560-4573","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["QUEST: Query Optimization in Unstructured Document Analysis"],"prefix":"10.14778","volume":"18","author":[{"given":"Zhaoze","family":"Sun","sequence":"first","affiliation":[{"name":"Beijing Institute of Technology, China"}]},{"given":"Chengliang","family":"Chai","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, China"}]},{"given":"Qiyan","family":"Deng","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, China"}]},{"given":"Kaisen","family":"Jin","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, China"}]},{"given":"Xinyu","family":"Guo","sequence":"additional","affiliation":[{"name":"University of Arizona, United States"}]},{"given":"Han","family":"Han","sequence":"additional","affiliation":[{"name":"University of Arizona, United States"}]},{"given":"Ye","family":"Yuan","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, China"}]},{"given":"Guoren","family":"Wang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, China"}]},{"given":"Lei","family":"Cao","sequence":"additional","affiliation":[{"name":"University of Arizona, United States"}]}],"member":"320","published-online":{"date-parts":[[2025,9,4]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"[n.d.]. https:\/\/anonymous.4open.science\/r\/QUEST\/Full_version.pdf"},{"key":"e_1_2_1_2_1","unstructured":"2019. 
https:\/\/solutionsreview.com\/data-management\/80-percent-of-your-data-will-be-unstructured-in-five-years\/"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.14778\/3626292.3626294"},{"key":"e_1_2_1_4_1","volume-title":"PromptNER: Prompting For Named Entity Recognition. (May","author":"Ashok Dhananjay","year":"2023","unstructured":"Dhananjay Ashok and Zachary C. Lipton. 2023. PromptNER: Prompting For Named Entity Recognition. (May 2023)."},{"key":"e_1_2_1_5_1","volume-title":"Large language models as annotators: Enhancing generalization of nlp models at minimal cost. arXiv preprint arXiv:2306.15766","author":"Bansal Parikshit","year":"2023","unstructured":"Parikshit Bansal and Amit Sharma. 2023. Large language models as annotators: Enhancing generalization of nlp models at minimal cost. arXiv preprint arXiv:2306.15766 (2023)."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3477495.3532682"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/275487.275492"},{"key":"e_1_2_1_8_1","unstructured":"Zui Chen, Zihui Gu, Lei Cao, Ju Fan, Sam Madden, and Nan Tang. [n.d.]. Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes. ([n. d.])."},{"key":"e_1_2_1_9_1","volume-title":"UQE: A Query Engine for Unstructured Databases. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https:\/\/openreview.net\/forum?id=t7SGOv5W5z","author":"Dai Hanjun","year":"2024","unstructured":"Hanjun Dai, Bethany Yixin Wang, Xingchen Wan, Bo Dai, Sherry Yang, Azade Nova, Pengcheng Yin, Phitchaya Mangpo Phothilimthana, Charles Sutton, and Dale Schuurmans. 2024. UQE: A Query Engine for Unstructured Databases. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https:\/\/openreview.net\/forum?id=t7SGOv5W5z"},{"key":"e_1_2_1_10_1","volume-title":"UQE: A Query Engine for Unstructured Databases. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. 
https:\/\/openreview.net\/forum?id=t7SGOv5W5z","author":"Dai Hanjun","year":"2024","unstructured":"Hanjun Dai, Bethany Yixin Wang, Xingchen Wan, Bo Dai, Sherry Yang, Azade Nova, Pengcheng Yin, Phitchaya Mangpo Phothilimthana, Charles Sutton, and Dale Schuurmans. 2024. UQE: A Query Engine for Unstructured Databases. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https:\/\/openreview.net\/forum?id=t7SGOv5W5z"},{"key":"e_1_2_1_11_1","volume-title":"Data imputation using large language model to accelerate recommendation system. arXiv preprint arXiv:2407.10078","author":"Ding Zhicheng","year":"2024","unstructured":"Zhicheng Ding, Jiahao Tian, Zhenkai Wang, Jinman Zhao, and Siyang Li. 2024. Data imputation using large language model to accelerate recommendation system. arXiv preprint arXiv:2407.10078 (2024)."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE60146.2024.00284"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3637528.3671470"},{"key":"e_1_2_1_14_1","volume-title":"LEXA: Towards Automatic Legal Citation Classification. In AI 2010: Advances in Artificial Intelligence (Lecture Notes in Computer Science)","author":"Galgani Filippo","year":"2010","unstructured":"Filippo Galgani and Achim Hoffmann. 2010. LEXA: Towards Automatic Legal Citation Classification. In AI 2010: Advances in Artificial Intelligence (Lecture Notes in Computer Science), Jiuyong Li (Ed.), Vol. 6464. Springer Berlin Heidelberg, 445\u2013454."},{"key":"e_1_2_1_15_1","unstructured":"Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. [n.d.]. Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. ([n. d.])."},{"key":"e_1_2_1_16_1","unstructured":"Yingqi Gao, Yifu Liu, Xiaoxia Li, Xiaorong Shi, Yin Zhu, Yiming Wang, Shiqi Li, Wei Li, Yuntao Hong, Zhiling Luo, et al. 2024. XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL. 
arXiv preprint arXiv:2411.08599 (2024)."},{"key":"e_1_2_1_17_1","volume-title":"Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997","author":"Gao Yunfan","year":"2023","unstructured":"Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023)."},{"key":"e_1_2_1_18_1","volume-title":"Optimized product quantization","author":"Ge Tiezheng","year":"2013","unstructured":"Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. 2013. Optimized product quantization. IEEE transactions on pattern analysis and machine intelligence 36, 4 (2013), 744\u2013755."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2009916.2010020"},{"key":"e_1_2_1_20_1","volume-title":"CLAIM Your Data: Enhancing Imputation Accuracy with Contextual Large Language Models. arXiv preprint arXiv:2405.17712","author":"Hayat Ahatsham","year":"2024","unstructured":"Ahatsham Hayat and Mohammad Rashedul Hasan. 2024. CLAIM Your Data: Enhancing Imputation Accuracy with Contextual Large Language Models. arXiv preprint arXiv:2405.17712 (2024)."},{"key":"e_1_2_1_21_1","volume-title":"Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543","author":"He Pengcheng","year":"2021","unstructured":"Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2021. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543 (2021)."},{"key":"e_1_2_1_22_1","volume-title":"DeBERTa: Decoding-enhanced BERT with Disentangled Attention","author":"He Pengcheng","year":"2020","unstructured":"Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. 
Cornell University - arXiv (Jun 2020)."},{"key":"e_1_2_1_23_1","volume-title":"DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=XPZIaotutsD","author":"He Pengcheng","year":"2021","unstructured":"Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=XPZIaotutsD"},{"key":"e_1_2_1_24_1","volume-title":"Instruct and extract: Instruction tuning for on-demand information extraction. arXiv preprint arXiv:2310.16040","author":"Jiao Yizhu","year":"2023","unstructured":"Yizhu Jiao, Ming Zhong, Sha Li, Ruining Zhao, Siru Ouyang, Heng Ji, and Jiawei Han. 2023. Instruct and extract: Instruction tuning for on-demand information extraction. arXiv preprint arXiv:2310.16040 (2023)."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3654989"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.5555\/3495724.3496517"},{"key":"e_1_2_1_27_1","unstructured":"Huayang Li, Yixuan Su, Deng Cai, Yan Wang, and Lemao Liu. [n.d.]. A Survey on Retrieval-Augmented Text Generation. ([n. d.])."},{"key":"e_1_2_1_28_1","volume-title":"Dongmei Zhang, and Surajit Chaudhuri.","author":"Li Peng","year":"2023","unstructured":"Peng Li, Yeye He, Dror Yashar, Weiwei Cui, Song Ge, Haidong Zhang, Danielle Rifinski Fainman, Dongmei Zhang, and Surajit Chaudhuri. 2023. Table-gpt: Table-tuned gpt for diverse table tasks. arXiv preprint arXiv:2310.09263 (2023)."},{"key":"e_1_2_1_29_1","volume-title":"Towards Accurate and Efficient Document Analytics with Large Language Models. arXiv preprint arXiv:2405.04674","author":"Lin Yiming","year":"2024","unstructured":"Yiming Lin, Madelon Hulsebos, Ruiying Ma, Shreya Shankar, Sepanta Zeigham, Aditya G Parameswaran, and Eugene Wu. 2024. 
Towards Accurate and Efficient Document Analytics with Large Language Models. arXiv preprint arXiv:2405.04674 (2024)."},{"key":"e_1_2_1_30_1","volume-title":"Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing. In Proceedings of the Conference on Innovative Database Research (CIDR)","author":"Liu Chunwei","year":"2025","unstructured":"Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baile Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, Rana Shahout, and Gerardo Vitagliano. 2025. Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing. In Proceedings of the Conference on Innovative Database Research (CIDR) (2025)."},{"key":"e_1_2_1_31_1","volume-title":"A Survey of NL2SQL with Large Language Models: Where are we, and where are we going? arXiv preprint arXiv:2408.05109","author":"Liu Xinyu","year":"2024","unstructured":"Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang, and Yuyu Luo. 2024. A Survey of NL2SQL with Large Language Models: Where are we, and where are we going? arXiv preprint arXiv:2408.05109 (2024)."},{"key":"e_1_2_1_32_1","volume-title":"Magneto: Combining Small and Large Language Models for Schema Matching. arXiv:2412.08194 [cs.DB] https:\/\/arxiv.org\/abs\/2412.08194","author":"Liu Yurong","year":"2024","unstructured":"Yurong Liu, Eduardo Pena, Aecio Santos, Eden Wu, and Juliana Freire. 2024. Magneto: Combining Small and Large Language Models for Schema Matching. arXiv:2412.08194 [cs.DB] https:\/\/arxiv.org\/abs\/2412.08194"},{"key":"e_1_2_1_33_1","volume-title":"Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs","author":"Malkov Yu A","year":"2018","unstructured":"Yu A Malkov and Dmitry A Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. 
IEEE transactions on pattern analysis and machine intelligence 42, 4 (2018), 824\u2013836."},{"key":"e_1_2_1_34_1","unstructured":"Marcel Parciak, Brecht Vandevoort, Frank Neven, Liesbet M. Peeters, and Stijn Vansummeren. 2024. Schema Matching with Large Language Models: an Experimental Study. arXiv:2407.11852 [cs.DB] https:\/\/arxiv.org\/abs\/2407.11852"},{"key":"e_1_2_1_35_1","volume-title":"LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data. arXiv preprint arXiv:2407.11418","author":"Patel Liana","year":"2024","unstructured":"Liana Patel, Siddharth Jha, Carlos Guestrin, and Matei Zaharia. 2024. LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data. arXiv preprint arXiv:2407.11418 (2024)."},{"key":"e_1_2_1_36_1","volume-title":"CodeIE: Large Code Generation Models are Better Few-Shot Information Extractors. (May","author":"Li Peng","year":"2023","unstructured":"Peng Li, Tianxiang Sun, et al. 2023. CodeIE: Large Code Generation Models are Better Few-Shot Information Extractors. (May 2023)."},{"key":"e_1_2_1_37_1","volume-title":"Stable: Table generation framework for encoder-decoder models. arXiv preprint arXiv:2206.04045","author":"Pietruszka Micha\u0142","year":"2022","unstructured":"Micha\u0142 Pietruszka, Micha\u0142 Turski, \u0141ukasz Borchmann, Tomasz Dwojak, Gabriela Pa\u0142ka, Karolina Szyndler, Dawid Jurkiewicz, and \u0141ukasz Garncarek. 2022. Stable: Table generation framework for encoder-decoder models. arXiv preprint arXiv:2206.04045 (2022)."},{"volume-title":"Data structures and algorithms","author":"Preiss Bruno R","key":"e_1_2_1_38_1","unstructured":"Bruno R Preiss. 1999. Data structures and algorithms. John Wiley & Sons, Inc."},{"key":"e_1_2_1_39_1","volume-title":"GoLLIE: Annotation Guidelines improve Zero-Shot Information-Extraction. 
(Oct","author":"Sainz Oscar","year":"2023","unstructured":"Oscar Sainz, Iker Garcia-Ferrero, Rodrigo Agerri, OierLopezde Lacalle, German Rigau, and Eneko Agirre. 2023. GoLLIE: Annotation Guidelines improve Zero-Shot Information-Extraction. (Oct 2023)."},{"key":"e_1_2_1_40_1","volume-title":"RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. (Jan","author":"Sarthi Parth","year":"2024","unstructured":"Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and ChristopherD. Manning. 2024. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. (Jan 2024)."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/3447689.3447706"},{"key":"e_1_2_1_42_1","unstructured":"Matthias Urban and Carsten Binnig. [n.d.]. Towards Multi-Modal DBMSs for Seamless Querying of Texts and Tables. ([n. d.])."},{"key":"e_1_2_1_43_1","volume-title":"CAESURA: Language Models as Multi-Modal Query Planners. arXiv preprint arXiv:2308.03424","author":"Urban Matthias","year":"2023","unstructured":"Matthias Urban and Carsten Binnig. 2023. CAESURA: Language Models as Multi-Modal Query Planners. arXiv preprint arXiv:2308.03424 (2023)."},{"key":"e_1_2_1_44_1","volume-title":"Text Embeddings by Weakly-Supervised Contrastive Pre-training","author":"Wang Liang","year":"2022","unstructured":"Liang Wang, Nan Yang, Xiaolong Huang, Jiao Binxing, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-training. 
Cornell University - arXiv (Dec 2022)."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-long.180"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.896"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.emnlp-main.497"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3749646.3749713","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,5]],"date-time":"2025-09-05T03:34:41Z","timestamp":1757043281000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3749646.3749713"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7]]},"references-count":47,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2025,7]]}},"alternative-id":["10.14778\/3749646.3749713"],"URL":"https:\/\/doi.org\/10.14778\/3749646.3749713","relation":{},"ISSN":["2150-8097"],"issn-type":[{"type":"print","value":"2150-8097"}],"subject":[],"published":{"date-parts":[[2025,7]]},"assertion":[{"value":"2025-09-04","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}