{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,25]],"date-time":"2026-03-25T15:20:52Z","timestamp":1774452052797,"version":"3.50.1"},"reference-count":35,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,7]]},"abstract":"<jats:p>To unlock the great potential value of unstructured documents, it is critical to extract structured data (e.g., attributes) from them, which can benefit various applications such as analytical SQL queries and decision-making. Multiple strategies, such as pre-trained language models (PLMs), can be employed for this task. However, these methods often struggle to achieve high-quality results, particularly for attribute extraction that requires intricate reasoning or semantic comprehension. Recently, large language models (LLMs) have proven effective at extracting attributes, but they incur substantial costs from token consumption, making them impractical for large-scale document sets.<\/jats:p>\n          <jats:p>To best trade off quality and cost, we present Doctopus, a system designed for accurate attribute extraction from unstructured documents under a user-specified cost constraint. Overall, Doctopus combines LLMs with non-LLM strategies to achieve a good tradeoff. First, the system employs an index-based approach to efficiently identify and process only relevant text chunks, thereby reducing the LLM cost. It then estimates the quality of multiple strategies for each attribute. Finally, based on the cost and estimated quality, Doctopus dynamically selects the optimal strategies through budget-aware optimization. We have built a comprehensive benchmark comprising 4 document sets with various characteristics, with ground truth manually labeled using 1000 human hours. 
Extensive experiments on the benchmark show that compared with state-of-the-art baselines, Doctopus can improve the quality by 11% given the same cost constraint.<\/jats:p>","DOI":"10.14778\/3749646.3749647","type":"journal-article","created":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T17:55:06Z","timestamp":1757008506000},"page":"3695-3707","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Doctopus: Budget-Aware Structural Table Extraction from Unstructured Documents"],"prefix":"10.14778","volume":"18","author":[{"given":"Chengliang","family":"Chai","sequence":"first","affiliation":[{"name":"Beijing Institute of Technology, China"}]},{"given":"Jiajun","family":"Li","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, China"}]},{"given":"Yuhao","family":"Deng","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, China"}]},{"given":"Yuanhao","family":"Zhong","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, China"}]},{"given":"Ye","family":"Yuan","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, China"}]},{"given":"Guoren","family":"Wang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, China"}]},{"given":"Lei","family":"Cao","sequence":"additional","affiliation":[{"name":"University of Arizona, United States"}]}],"member":"320","published-online":{"date-parts":[[2025,9,4]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"[n. d.]. https:\/\/github.com\/mutong184\/Doctopus\/blob\/main\/Doctopus_tech_report_.pdf"},{"key":"e_1_2_1_2_1","unstructured":"[n. d.]. https:\/\/www.wikiart.org\/"},{"key":"e_1_2_1_3_1","unstructured":"[n. d.]. https:\/\/finance.yahoo.com\/"},{"key":"e_1_2_1_4_1","unstructured":"[n. d.]. https:\/\/en.wikipedia.org\/wiki\/Lists_of_NBA_players"},{"key":"e_1_2_1_5_1","unstructured":"[n. d.]. 
https:\/\/python.langchain.com\/docs\/how_to\/semantic-chunker\/"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.14778\/3626292.3626294"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/2023.EMNLP-INDUSTRY.55"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2305.05176"},{"key":"e_1_2_1_9_1","volume-title":"13th Conference on Innovative Data Systems Research, CIDR 2023","author":"Chen Zui","year":"2023","unstructured":"Zui Chen, Zihui Gu, Lei Cao, Ju Fan, Samuel Madden, and Nan Tang. 2023. Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes. In 13th Conference on Innovative Data Systems Research, CIDR 2023, Amsterdam, The Netherlands, January 8\u201311, 2023. www.cidrdb.org. https:\/\/www.cidrdb.org\/cidr2023\/papers\/p51-chen.pdf"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3637528.3671470"},{"key":"e_1_2_1_11_1","volume-title":"LEXA: Towards Automatic Legal Citation Classification. In AI 2010: Advances in Artificial Intelligence (Lecture Notes in Computer Science","volume":"454","author":"Galgani Filippo","year":"2010","unstructured":"Filippo Galgani and Achim Hoffmann. 2010. LEXA: Towards Automatic Legal Citation Classification. In AI 2010: Advances in Artificial Intelligence (Lecture Notes in Computer Science, Vol. 6464), Jiuyong Li (Ed.). Springer Berlin Heidelberg, 445\u2013454."},{"key":"e_1_2_1_12_1","volume-title":"Optimized product quantization","author":"Ge Tiezheng","year":"2013","unstructured":"Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. 2013. Optimized product quantization. IEEE transactions on pattern analysis and machine intelligence 36, 4 (2013), 744\u2013755."},{"key":"e_1_2_1_13_1","volume-title":"The Eleventh International Conference on Learning Representations, ICLR 2023","author":"He Pengcheng","year":"2023","unstructured":"Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2023. 
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\u20135, 2023. OpenReview.net. https:\/\/openreview.net\/forum?id=sE7-XhLxHA"},{"key":"e_1_2_1_14_1","article-title":"Atlas: Few-shot Learning with Retrieval Augmented Language Models","volume":"24","author":"Izacard Gautier","year":"2023","unstructured":"Gautier Izacard, Patrick S. H. Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2023. Atlas: Few-shot Learning with Retrieval Augmented Language Models. J. Mach. Learn. Res. 24 (2023), 251:1\u2013251:43. https:\/\/jmlr.org\/papers\/v24\/23-0037.html","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/2023.ACL-LONG.792"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/2023.EMNLP-MAIN.620"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3654989"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/2020.EMNLP-MAIN.306"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544825"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.14778\/3421424.3421431"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2405.14696"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/2023.EMNLP-MAIN.322"},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018","author":"Niklaus Christina","year":"2018","unstructured":"Christina Niklaus, Matthias Cetto, Andr\u00e9 Freitas, and Siegfried Handschuh. 2018. A Survey on Open Information Extraction. 
In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20\u201326, 2018, Emily M. Bender, Leon Derczynski, and Pierre Isabelle (Eds.). Association for Computational Linguistics, 3866\u20133878. https:\/\/aclanthology.org\/C18-1326\/"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2407.11418"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.eacl-long.151"},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018","author":"Saha Swarnadeep","year":"2018","unstructured":"Swarnadeep Saha and Mausam. 2018. Open Information Extraction from Conjunctive Sentences. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20\u201326, 2018, Emily M. Bender, Leon Derczynski, and Pierre Isabelle (Eds.). Association for Computational Linguistics, 2288\u20132299. https:\/\/aclanthology.org\/C18-1194\/"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/P17-2050"},{"key":"e_1_2_1_28_1","volume-title":"a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs\/1910.01108","author":"Sanh Victor","year":"2019","unstructured":"Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs\/1910.01108 (2019). arXiv:1910.01108 http:\/\/arxiv.org\/abs\/1910.01108"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.14778\/3447689.3447706"},{"key":"e_1_2_1_30_1","volume-title":"Text Embeddings by Weakly-Supervised Contrastive Pre-training","author":"Wang Liang","year":"2022","unstructured":"Liang Wang, Nan Yang, Xiaolong Huang, Jiao Binxing, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-training. 
arXiv (Dec 2022)."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/2022.ACL-LONG.180"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2401.15884"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/2021.EMNLP-MAIN.764"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2310.03094"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3219839"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3749646.3749647","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,5]],"date-time":"2025-09-05T03:22:47Z","timestamp":1757042567000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3749646.3749647"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7]]},"references-count":35,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2025,7]]}},"alternative-id":["10.14778\/3749646.3749647"],"URL":"https:\/\/doi.org\/10.14778\/3749646.3749647","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2025,7]]},"assertion":[{"value":"2025-09-04","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}