{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,19]],"date-time":"2026-05-19T07:12:39Z","timestamp":1779174759884,"version":"3.51.4"},"reference-count":29,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,8]]},"abstract":"<jats:p>Over the past five decades, the relational database model has proven to be a scaleable and adaptable model for querying a variety of structured data, with use cases in analytics, transactions, graphs, streaming and more. However, most of the world's data is unstructured. Thus, despite their success, the reality is that the vast majority of the world's data has remained beyond the reach of relational systems.<\/jats:p>\n          <jats:p>The rise of deep learning and generative AI offers an opportunity to change this. These models provide a stunning capability to extract semantic understanding from almost any type of document, including text, images, and video, which can extend the reach of databases to all the world's data. In this paper we explore how these new technologies will transform the way we build database management software, creating new systems that can ingest, store, process, and query all data. Building such systems presents many opportunities and challenges. In this paper we focus on three: scalability, correctness, and reliability, and argue that the declarative programming paradigm that has served relational systems so well offers a path forward in the new world of AI data systems as well. 
To illustrate this, we describe several examples of such declarative AI systems we have built in document and video processing, and provide a set of research challenges and opportunities to guide research in this exciting area going forward.<\/jats:p>\n          <jats:p>\n            <jats:italic>And lovely apparitions, -dim at first<\/jats:italic>\n            ,\n          <\/jats:p>\n          <jats:p>\n            <jats:italic>Then radiant, as the mind arising bright<\/jats:italic>\n          <\/jats:p>\n          <jats:p>\n            <jats:italic>From the embrace of beauty (whence the forms<\/jats:italic>\n          <\/jats:p>\n          <jats:p>\n            <jats:italic>Of which these are the phantoms) casts on them<\/jats:italic>\n          <\/jats:p>\n          <jats:p>\n            <jats:italic>The gathered rays which are reality-<\/jats:italic>\n          <\/jats:p>\n          <jats:p>\n            <jats:italic>Shall visit us the progeny immortal<\/jats:italic>\n          <\/jats:p>\n          <jats:p>\n            <jats:italic>Of Painting, Sculpture, and rapt Poesy<\/jats:italic>\n            ,\n          <\/jats:p>\n          <jats:p>\n            <jats:italic>And arts, though unimagined, yet to be<\/jats:italic>\n            ;\n          <\/jats:p>\n          <jats:p>Prometheus Unbound, Percy Bysshe Shelley<\/jats:p>","DOI":"10.14778\/3685800.3685916","type":"journal-article","created":{"date-parts":[[2024,11,8]],"date-time":"2024-11-08T17:25:21Z","timestamp":1731086721000},"page":"4546-4554","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["Databases Unbound: Querying All of the World's Bytes with AI"],"prefix":"10.14778","volume":"17","author":[{"given":"Samuel","family":"Madden","sequence":"first","affiliation":[{"name":"MIT CSAIL"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Michael","family":"Cafarella","sequence":"additional","affiliation":[{"name":"MIT 
CSAIL"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Michael","family":"Franklin","sequence":"additional","affiliation":[{"name":"University of Chicago"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tim","family":"Kraska","sequence":"additional","affiliation":[{"name":"MIT CSAIL"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,11,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00132"},{"key":"e_1_2_1_2_1","volume-title":"Language models enable simple systems for generating structured views of heterogeneous data lakes. arXiv preprint arXiv:2304.09433","author":"Arora Simran","year":"2023","unstructured":"Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, and Christopher R\u00e9. 2023. Language models enable simple systems for generating structured views of heterogeneous data lakes. arXiv preprint arXiv:2304.09433 (2023)."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389692"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517835"},{"key":"e_1_2_1_5_1","volume-title":"Seed: Simple, efficient, and effective data management via large language models. arXiv preprint arXiv:2310.00749","author":"Chen Zui","year":"2023","unstructured":"Zui Chen, Lei Cao, Sam Madden, Ju Fan, Nan Tang, Zihui Gu, Zeyuan Shang, Chunwei Liu, Michael Cafarella, and Tim Kraska. 2023. Seed: Simple, efficient, and effective data management via large language models. 
arXiv preprint arXiv:2310.00749 (2023)."},{"key":"e_1_2_1_6_1","volume-title":"Xingchen Wan, Bo Dai, Sherry Yang, Azade Nova, Pengcheng Yin, Phitchaya Mangpo Phothilimthana, Charles Sutton, and Dale Schuurmans.","author":"Dai Hanjun","year":"2024","unstructured":"Hanjun Dai, Bethany Yixin Wang, Xingchen Wan, Bo Dai, Sherry Yang, Azade Nova, Pengcheng Yin, Phitchaya Mangpo Phothilimthana, Charles Sutton, and Dale Schuurmans. 2024. UQE: A Query Engine for Unstructured Databases. arXiv:2407.09522 [cs.DB] https:\/\/arxiv.org\/abs\/2407.09522"},{"key":"e_1_2_1_7_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs\/1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs\/1810.04805 (2018). arXiv:1810.04805 http:\/\/arxiv.org\/abs\/1810.04805"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.14778\/3587136.3587146"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.14778\/3402755.3402777"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611479.3611527"},{"key":"e_1_2_1_11_1","volume-title":"Girshick","author":"He Kaiming","year":"2017","unstructured":"Kaiming He, Georgia Gkioxari, Piotr Doll\u00e1r, and Ross B. Girshick. 2017. Mask R-CNN. CoRR abs\/1703.06870 (2017). arXiv:1703.06870 http:\/\/arxiv.org\/abs\/1703.06870"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137664"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517897"},{"key":"e_1_2_1_14_1","volume-title":"Dspy: Compiling declarative language model calls into self-improving pipelines. 
arXiv preprint arXiv:2310.03714","author":"Khattab Omar","year":"2023","unstructured":"Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. 2023. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714 (2023)."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2406.14424"},{"key":"e_1_2_1_16_1","first-page":"1","article-title":"Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization","volume":"18","author":"Li Lisha","year":"2018","unstructured":"Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Journal of Machine Learning Research 18 (2018), 1--52.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_1_17_1","unstructured":"Xiang Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics 4582--4597."},{"key":"e_1_2_1_18_1","volume-title":"Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, and Gerardo Vitagliano.","author":"Liu Chunwei","year":"2024","unstructured":"Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baille Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, and Gerardo Vitagliano. 2024. A Declarative System for Optimizing AI Workloads. arXiv:2405.14696 [cs.CL] https:\/\/arxiv.org\/abs\/2405.14696"},{"key":"e_1_2_1_19_1","volume-title":"Miller","author":"Marcus Adam","year":"2011","unstructured":"Adam Marcus, Eugene Wu, Samuel Madden, and Robert C. Miller. 2011. Crowd-sourced Databases: Query Processing with People. 
In Fifth Biennial Conference on Innovative Data Systems Research, CIDR 2011, Asilomar, CA, USA, January 9--12, 2011, Online Proceedings. www.cidrdb.org, 211--214. http:\/\/cidrdb.org\/cidr2011\/Papers\/CIDR11_Paper29.pdf"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.14778\/2367502.2367555"},{"key":"e_1_2_1_21_1","volume-title":"LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data. arXiv:2407.11418 [cs.DB] https:\/\/arxiv.org\/abs\/2407.11418","author":"Patel Liana","year":"2024","unstructured":"Liana Patel, Siddharth Jha, Carlos Guestrin, and Matei Zaharia. 2024. LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data. arXiv:2407.11418 [cs.DB] https:\/\/arxiv.org\/abs\/2407.11418"},{"key":"e_1_2_1_22_1","volume-title":"Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever.","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. CoRR abs\/2103.00020 (2021). arXiv:2103.00020 https:\/\/arxiv.org\/abs\/2103.00020"},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the MLSys: Workshop on Systems for ML at NeurIPS.","author":"Shang Zeyuan","year":"2019","unstructured":"Zeyuan Shang, Emanuel Zgraggen, and Tim Kraska. 2019. Alpine Meadow: A System for Interactive AutoML. In Proceedings of the MLSys: Workshop on Systems for ML at NeurIPS."},{"key":"e_1_2_1_24_1","volume-title":"CAESURA: Language Models as Multi-Modal Query Planners. arXiv preprint arXiv:2308.03424","author":"Urban Matthias","year":"2023","unstructured":"Matthias Urban and Carsten Binnig. 2023. CAESURA: Language Models as Multi-Modal Query Planners. 
arXiv preprint arXiv:2308.03424 (2023)."},{"key":"e_1_2_1_25_1","unstructured":"vaas [n. d.]. https:\/\/vaas.csail.mit.edu\/docs\/introduction.html."},{"key":"e_1_2_1_26_1","volume-title":"CoRR abs\/1706.03762","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. CoRR abs\/1706.03762 (2017). arXiv:1706.03762 http:\/\/arxiv.org\/abs\/1706.03762"},{"key":"e_1_2_1_27_1","volume-title":"Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi.","author":"Zaharia Matei","year":"2024","unstructured":"Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. 2024. The Shift from Models to Compound AI Systems. https:\/\/bair.berkeley.edu\/blog\/2024\/02\/18\/compound-ai-systems\/."},{"key":"e_1_2_1_28_1","volume-title":"Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al.","author":"Zheng Lianmin","year":"2023","unstructured":"Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2023. Efficiently programming large language models using sglang. 
arXiv preprint arXiv:2312.07104 (2023)."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319901"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3685800.3685916","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,12,31]],"date-time":"2024-12-31T05:26:51Z","timestamp":1735622811000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3685800.3685916"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8]]},"references-count":29,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2024,8]]}},"alternative-id":["10.14778\/3685800.3685916"],"URL":"https:\/\/doi.org\/10.14778\/3685800.3685916","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,8]]},"assertion":[{"value":"2024-11-08","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}