{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,17]],"date-time":"2025-09-17T03:16:37Z","timestamp":1758078997768,"version":"3.44.0"},"reference-count":4,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,8]]},"abstract":"<jats:p>Recent studies have developed LLM-powered data systems that enable database-like analysis of unstructured text documents. While LLMs excel at attribute extraction from documents, their high computational costs and latency make extraction operations the primary performance bottleneck. Existing systems typically adopt traditional relational database query optimization strategies, which prove ineffective in minimizing LLM-related expenses. To fill this gap, we propose DocDB, a prototype system that features a bunch of novel optimization strategies designated to unstructured document analysis. First, we employ a two-level index to reduce LLM extraction costs by selectively retrieving and processing only text segments relevant to target attributes. Second, DocDB employs adaptive execution, generating document-specific plans to minimize LLM extraction frequency based on varying per-document attribute extraction costs. With a real-life scenario, we demonstrate that DocDB allows users to analyze unstructured documents accurately and affordably using SQL-like queries. The corresponding video is available at https:\/\/youtu.be\/8yDIKOBHIOg.<\/jats:p>","DOI":"10.14778\/3750601.3750678","type":"journal-article","created":{"date-parts":[[2025,9,16]],"date-time":"2025-09-16T13:38:05Z","timestamp":1758029885000},"page":"5387-5390","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["DocDB: A Database for Unstructured Document Analysis"],"prefix":"10.14778","volume":"18","author":[{"given":"Zequn","family":"Li","sequence":"first","affiliation":[{"name":"Beijing Institute of Technology, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yuanhao","family":"Zhong","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chengliang","family":"Chai","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhaoze","family":"Sun","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yuhao","family":"Deng","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ye","family":"Yuan","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Guoren","family":"Wang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lei","family":"Cao","sequence":"additional","affiliation":[{"name":"University of Arizona, Tucson, United States"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,9,16]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"[n.d.]. https:\/\/anonymous.4open.science\/r\/QUEST\/Full_version.pdf"},{"key":"e_1_2_1_2_1","volume-title":"et al","author":"Liu Chunwei","year":"2025","unstructured":"Chunwei Liu, Matthew Russo, and Michael J. et al. 2025. Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing. CIDR (2025)."},{"key":"e_1_2_1_3_1","volume-title":"LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data. CoRR","author":"Patel Liana","year":"2024","unstructured":"Liana Patel and Siddharth Jha et al. 2024. LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data. CoRR (2024)."},{"key":"e_1_2_1_4_1","volume-title":"Text-tuple-table: Towards information integration in text-to-table generation via global tuple extraction. EMNLP","author":"Zheye Deng","year":"2024","unstructured":"Deng Zheye and Chan Chunkit et al. 2024. Text-tuple-table: Towards information integration in text-to-table generation via global tuple extraction. EMNLP (2024)."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3750601.3750678","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,16]],"date-time":"2025-09-16T13:42:50Z","timestamp":1758030170000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3750601.3750678"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8]]},"references-count":4,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2025,8]]}},"alternative-id":["10.14778\/3750601.3750678"],"URL":"https:\/\/doi.org\/10.14778\/3750601.3750678","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2025,8]]},"assertion":[{"value":"2025-09-16","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}