{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T09:01:34Z","timestamp":1775638894273,"version":"3.50.1"},"reference-count":68,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2023,10]]},"abstract":"<jats:p>\n            A long-standing goal in the data management community is developing systems that input documents and output queryable tables without user effort. Given the sheer variety of potential documents, state-of-the-art systems make simplifying assumptions and use domain-specific training. In this work, we ask whether we can maintain generality by using the in-context learning abilities of large language models (LLMs). We propose and evaluate Evaporate, a prototype system powered by LLMs. We identify two strategies for implementing this system: prompt the LLM to directly extract values from documents or prompt the LLM to synthesize code that performs the extraction. Our evaluations show a cost-quality tradeoff between these two approaches. Code synthesis is cheap, but far less accurate than directly processing each document with the LLM. To improve quality while maintaining low cost, we propose an extended implementation, Evaporate-Code+, which achieves better quality than direct extraction. Our insight is to generate many candidate functions and ensemble their extractions using weak supervision. Evaporate-Code+ outperforms the state-of-the-art systems using a\n            <jats:italic>sublinear<\/jats:italic>\n            pass over the documents with the LLM. 
This equates to a 110X reduction in the number of documents the LLM needs to process across our 16 real-world evaluation settings.\n          <\/jats:p>","DOI":"10.14778\/3626292.3626294","type":"journal-article","created":{"date-parts":[[2023,12,11]],"date-time":"2023-12-11T23:24:55Z","timestamp":1702337095000},"page":"92-105","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":47,"title":["Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes"],"prefix":"10.14778","volume":"17","author":[{"given":"Simran","family":"Arora","sequence":"first","affiliation":[{"name":"Stanford University"}]},{"given":"Brandon","family":"Yang","sequence":"additional","affiliation":[{"name":"Stanford University"}]},{"given":"Sabri","family":"Eyuboglu","sequence":"additional","affiliation":[{"name":"Stanford University"}]},{"given":"Avanika","family":"Narayan","sequence":"additional","affiliation":[{"name":"Stanford University"}]},{"given":"Andrew","family":"Hojel","sequence":"additional","affiliation":[{"name":"Stanford University"}]},{"given":"Immanuel","family":"Trummer","sequence":"additional","affiliation":[{"name":"Cornell University"}]},{"given":"Christopher","family":"R\u00e9","sequence":"additional","affiliation":[{"name":"Stanford University"}]}],"member":"320","published-online":{"date-parts":[[2023,10]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"April 2023. Wikipedia Statistics. https:\/\/en.wikipedia.org\/wiki\/Special:Statistics  April 2023. Wikipedia Statistics. https:\/\/en.wikipedia.org\/wiki\/Special:Statistics"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.emnlp-main.130"},{"key":"e_1_2_1_3_1","volume-title":"Reasoning over Public and Private Data in Retrieval-Based Systems. 
Transactions of Computational Linguistics (TACL)","author":"Arora Simran","year":"2023","unstructured":"Simran Arora , Patrick Lewis , Angela Fan , Jacob Kahn , and Christopher R\u00e9. 2023. Reasoning over Public and Private Data in Retrieval-Based Systems. Transactions of Computational Linguistics (TACL) ( 2023 ). Simran Arora, Patrick Lewis, Angela Fan, Jacob Kahn, and Christopher R\u00e9. 2023. Reasoning over Public and Private Data in Retrieval-Based Systems. Transactions of Computational Linguistics (TACL) (2023)."},{"key":"e_1_2_1_4_1","volume-title":"International Conference on Learning Representations (ICLR)","author":"Arora Simran","year":"2023","unstructured":"Simran Arora , Avanika Narayan , Mayee F. Chen , Laurel Orr , Neel Guha , Kush Bhatia , Ines Chami , Frederic Sala , and Christopher R\u00e9 . 2023 . Ask Me Anything: A simple strategy for prompting language models . International Conference on Learning Representations (ICLR) (2023). Simran Arora, Avanika Narayan, Mayee F. Chen, Laurel Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher R\u00e9. 2023. Ask Me Anything: A simple strategy for prompting language models. International Conference on Learning Representations (ICLR) (2023)."},{"key":"e_1_2_1_5_1","unstructured":"Simran Arora Brandon Yang Sabri Eyuboglu Avanika Narayan Andrew Hojel Immanuel Trummer and Christopher R\u00e9. 2023. Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. (2023). https:\/\/www.dropbox.com\/scl\/fi\/3gt3ixdbvp986ptyz5j4t\/VLDB_Revision.pdf?rlkey=mxi2kqp7rqx0frm9s7bpttwcq&dl=0  Simran Arora Brandon Yang Sabri Eyuboglu Avanika Narayan Andrew Hojel Immanuel Trummer and Christopher R\u00e9. 2023. Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. (2023). 
https:\/\/www.dropbox.com\/scl\/fi\/3gt3ixdbvp986ptyz5j4t\/VLDB_Revision.pdf?rlkey=mxi2kqp7rqx0frm9s7bpttwcq&dl=0"},{"key":"e_1_2_1_6_1","unstructured":"Amanda Askell Yushi Bai Anna Chen Dawn Drain Deep Ganguli T. J. Henighan Andy Jones and Nicholas Joseph et al. 2021. A General Language Assistant as a Laboratory for Alignment. arXiv:2112.00861v3 (2021).  Amanda Askell Yushi Bai Anna Chen Dawn Drain Deep Ganguli T. J. Henighan Andy Jones and Nicholas Joseph et al. 2021. A General Language Assistant as a Laboratory for Alignment. arXiv:2112.00861v3 (2021)."},{"key":"e_1_2_1_7_1","volume-title":"Open information extraction from the web. IJCAI","author":"Banko Michele","year":"2007","unstructured":"Michele Banko , Michael J. Cafarella , Stephen Soderland , Matthew G Broadhead , and Oren Etzioni . 2007. Open information extraction from the web. IJCAI ( 2007 ). Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew G Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. IJCAI (2007)."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1056\/NEJMsa2206117"},{"key":"e_1_2_1_9_1","unstructured":"Benedikt Boecking Willie Neiswanger Eric Xing and Artur Dubrawski. 2021. Interactive weak supervision: Learning useful heuristics for data labeling.  Benedikt Boecking Willie Neiswanger Eric Xing and Artur Dubrawski. 2021. Interactive weak supervision: Learning useful heuristics for data labeling."},{"key":"e_1_2_1_10_1","volume-title":"E.","author":"Rideout J. R.","year":"2019","unstructured":"Rideout J. R. Dillon M. R. Bokulich N. A. Abnet C. C. Al-Ghalith G. A. Alexander H. Alm E. J. Arumugam M. et al. Bolyen , E. 2019 . Reproducible, interactive, scalable and extensible microbiome data science using qiime 2. In Nature biotechnology. Rideout J. R. Dillon M. R. Bokulich N. A. Abnet C. C. Al-Ghalith G. A. Alexander H. Alm E. J. Arumugam M. et al. Bolyen, E. 2019. 
Reproducible, interactive, scalable and extensible microbiome data science using qiime 2. In Nature biotechnology."},{"key":"e_1_2_1_11_1","unstructured":"Rishi Bommasani Drew A. Hudson E. Adeli Russ Altman Simran Arora S. von Arx Michael S. Bernstein Jeanette Bohg A. Bosselut Emma Brunskill and etal 2021. On the opportunities and risks of foundation models. arXiv:2108.07258 (2021).  Rishi Bommasani Drew A. Hudson E. Adeli Russ Altman Simran Arora S. von Arx Michael S. Bernstein Jeanette Bohg A. Bosselut Emma Brunskill and et al. 2021. On the opportunities and risks of foundation models. arXiv:2108.07258 (2021)."},{"key":"e_1_2_1_12_1","doi-asserted-by":"crossref","unstructured":"S. Brin. 1998. Extracting patterns and relations from the WorldWide Web. In WebDB.  S. Brin. 1998. Extracting patterns and relations from the WorldWide Web. In WebDB.","DOI":"10.1007\/10704656_11"},{"key":"e_1_2_1_13_1","volume-title":"Extraction and integration of partially overlapping web sources. PVLDB","author":"Bronzi Mirko","year":"2013","unstructured":"Mirko Bronzi , Valter Crescenzi , Paolo Merialdo , and Paolo Papotti . 2013. Extraction and integration of partially overlapping web sources. PVLDB ( 2013 ). Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, and Paolo Papotti. 2013. Extraction and integration of partially overlapping web sources. PVLDB (2013)."},{"key":"e_1_2_1_14_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell etal 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901.  Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. 
Advances in neural information processing systems 33 (2020) 1877--1901."},{"key":"e_1_2_1_15_1","volume-title":"Structured Querying of Web Text. In Conference on Innovative Data Systems Research (CIDR).","author":"Cafarella Michael J.","year":"2007","unstructured":"Michael J. Cafarella , Christopher Re , Dan Suciu , Oren Etzioni , and Michele Banko . 2007 . Structured Querying of Web Text. In Conference on Innovative Data Systems Research (CIDR). Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, and Michele Banko. 2007. Structured Querying of Web Text. In Conference on Innovative Data Systems Research (CIDR)."},{"key":"e_1_2_1_16_1","unstructured":"Michael J Cafarella Dan Suciu and Oren Etzioni. 2007. Navigating Extracted Data with Schema Discovery.. In WebDB. 1--6.  Michael J Cafarella Dan Suciu and Oren Etzioni. 2007. Navigating Extracted Data with Schema Discovery.. In WebDB. 1--6."},{"key":"e_1_2_1_17_1","volume-title":"Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes. CIDR","author":"Chen Zui","year":"2023","unstructured":"Zui Chen , Zihui Gu , Lei Cao , Ju Fan , Sam Madden , and Nan Tang . 2023 . Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes. CIDR (2023). Zui Chen, Zihui Gu, Lei Cao, Ju Fan, Sam Madden, and Nan Tang. 2023. Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes. CIDR (2023)."},{"key":"e_1_2_1_18_1","unstructured":"Eric Chu Akanksha Baid Ting Chen AnHai Doan and Jeffrey Naughton. 2007. A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data. In VLDB.  Eric Chu Akanksha Baid Ting Chen AnHai Doan and Jeffrey Naughton. 2007. A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data. In VLDB."},{"key":"e_1_2_1_19_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Clark Kevin","unstructured":"Kevin Clark , Minh-Thang Luong , Quoc V. 
Le , and Christopher D. Manning . 2020. ELECTRA: pre-training text encoders as discriminators rather than generators . In International Conference on Learning Representations (ICLR). Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_1_20_1","volume-title":"Information extraction and integration: An overview. IJCAI","author":"Cohen W.","year":"2004","unstructured":"W. Cohen . 2004. Information extraction and integration: An overview. IJCAI ( 2004 ). W. Cohen. 2004. Information extraction and integration: An overview. IJCAI (2004)."},{"key":"e_1_2_1_21_1","unstructured":"Lei Cui Furu Wei and Ming Zhou. 2022. Neural Open Information Extraction. (2022).  Lei Cui Furu Wei and Ming Zhou. 2022. Neural Open Information Extraction. (2022)."},{"key":"e_1_2_1_22_1","unstructured":"Xiang Deng Prashant Shiralkar Colin Lockard Binxuan Huang and Huan Sun. 2022. DOM-LM: Learning Generalizable Representations for HTML Documents. (2022).  Xiang Deng Prashant Shiralkar Colin Lockard Binxuan Huang and Huan Sun. 2022. DOM-LM: Learning Generalizable Representations for HTML Documents. (2022)."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1409360.1409378"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/988672.988687"},{"key":"e_1_2_1_25_1","doi-asserted-by":"crossref","unstructured":"Oren Etzioni Michael Cafarella Doug Downey Ana-Maria Popescu Tal Shaked Stephen Soderland Daniel S. Weld and Alexander Yates. 2004. Unsupervised named-entity extraction from the Web: An experimental study. In AAAI.  Oren Etzioni Michael Cafarella Doug Downey Ana-Maria Popescu Tal Shaked Stephen Soderland Daniel S. Weld and Alexander Yates. 2004. Unsupervised named-entity extraction from the Web: An experimental study. 
In AAAI.","DOI":"10.1016\/j.artint.2005.03.001"},{"key":"e_1_2_1_26_1","doi-asserted-by":"crossref","unstructured":"J. H. Faghmous and V Kumar. 2014. A big data guide to understanding climate change: The case for theory-guided data science. In Big data.  J. H. Faghmous and V Kumar. 2014. A big data guide to understanding climate change: The case for theory-guided data science. In Big data.","DOI":"10.1089\/big.2014.0026"},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research)","volume":"119","author":"Fu Daniel","year":"2020","unstructured":"Daniel Fu , Mayee Chen , Frederic Sala , Sarah Hooper , Kayvon Fatahalian , and Christopher Re . 2020 . Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods . In Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research) , Vol. 119 . PMLR, 3280--3291. Daniel Fu, Mayee Chen, Frederic Sala, Sarah Hooper, Kayvon Fatahalian, and Christopher Re. 2020. Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods. In Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 119. PMLR, 3280--3291."},{"key":"e_1_2_1_28_1","unstructured":"Leo Gao. 2021. On the Sizes of OpenAI API Models. https:\/\/blog.eleuther.ai\/gpt3-model-sizes\/  Leo Gao. 2021. On the Sizes of OpenAI API Models. https:\/\/blog.eleuther.ai\/gpt3-model-sizes\/"},{"key":"e_1_2_1_29_1","volume-title":"The Pile: An 800GB Dataset of Diverse Text for Language Modeling.","author":"Gao Leo","year":"2021","unstructured":"Leo Gao , Stella Biderman , Sid Black , Laurence Golding , Travis Hoppe , Charles Foster , Jason Phang , Horace He , Anish Thite , Noa Nabeshima , Shawn Presser , and Connor Leahy . 2021 . The Pile: An 800GB Dataset of Diverse Text for Language Modeling. 
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2021. The Pile: An 800GB Dataset of Diverse Text for Language Modeling."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/342009.335409"},{"key":"e_1_2_1_31_1","volume-title":"DL '00: Proceedings of the fifth ACM conference on Digital libraries.","author":"Luis Gravano Eugene Agichtein","year":"2000","unstructured":"Eugene Agichtein Luis Gravano . 2000 . Snowball: Extracting Relations from Large Plain-Text Collections . In DL '00: Proceedings of the fifth ACM conference on Digital libraries. Eugene Agichtein Luis Gravano. 2000. Snowball: Extracting Relations from Large Plain-Text Collections. In DL '00: Proceedings of the fifth ACM conference on Digital libraries."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2009916.2010020"},{"key":"e_1_2_1_33_1","volume-title":"DeBERTa: Decoding-Enhanced BERT With Disentangled Attention. In International Conference on Learning Representations.","author":"He Pengcheng","year":"2021","unstructured":"Pengcheng He , Xiaodong Liu , Jianfeng Gao , and Weizhu Chen . 2021 . DeBERTa: Decoding-Enhanced BERT With Disentangled Attention. In International Conference on Learning Representations. Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-Enhanced BERT With Disentangled Attention. In International Conference on Learning Representations."},{"key":"e_1_2_1_34_1","unstructured":"Nathan Heller. 2017. What the Enron E-mails Say About Us. https:\/\/www.newyorker.com\/magazine\/2017\/07\/24\/what-the-enron-e-mails-say-about-us  Nathan Heller. 2017. What the Enron E-mails Say About Us. https:\/\/www.newyorker.com\/magazine\/2017\/07\/24\/what-the-enron-e-mails-say-about-us"},{"key":"e_1_2_1_35_1","unstructured":"Nick Huss. 2023. How Many Websites Are There in the World?  Nick Huss. 2023. 
How Many Websites Are There in the World?"},{"key":"e_1_2_1_36_1","volume-title":"David Hall, Percy Liang, Christopher Potts, and Matei Zaharia.","author":"Khattab Omar","year":"2022","unstructured":"Omar Khattab , Keshav Santhanam , Xiang Lisa Li , David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022 . Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP. arXiv preprint arXiv:2212.14024 (2022). Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP. arXiv preprint arXiv:2212.14024 (2022)."},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the 1st Conference on Email and Anti-Spam (CEAS).","author":"Klimt B.","unstructured":"B. Klimt and Y. Yang . 2004. Introducing the enron corpus . In Proceedings of the 1st Conference on Email and Anti-Spam (CEAS). B. Klimt and Y. Yang. 2004. Introducing the enron corpus. In Proceedings of the 1st Conference on Email and Anti-Spam (CEAS)."},{"key":"e_1_2_1_38_1","doi-asserted-by":"crossref","unstructured":"Jan Koco\u0144 Igor Cichecki Oliwier Kaszyca Mateusz Kochanek Dominika Szyd\u0142o Joanna Baran Julita Bielaniewicz Marcin Gruza Arkadiusz Janz Kamil Kanclerz etal 2023. ChatGPT: Jack of all trades master of none. arXiv preprint arXiv:2302.10724 (2023).  Jan Koco\u0144 Igor Cichecki Oliwier Kaszyca Mateusz Kochanek Dominika Szyd\u0142o Joanna Baran Julita Bielaniewicz Marcin Gruza Arkadiusz Janz Kamil Kanclerz et al. 2023. ChatGPT: Jack of all trades master of none. 
arXiv preprint arXiv:2302.10724 (2023).","DOI":"10.2139\/ssrn.4372889"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.306"},{"key":"e_1_2_1_40_1","volume-title":"Daniel Fried, Sida Wang, and Tao Yu.","author":"Lai Yuhang","year":"2022","unstructured":"Yuhang Lai , Chengxi Li , Yiming Wang , Tianyi Zhang , Ruiqi Zhong , Luke Zettlemoyer , Scott Wen tau Yih , Daniel Fried, Sida Wang, and Tao Yu. 2022 . DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation. ArXiv abs\/2211.11501 (2022). Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Scott Wen tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2022. DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation. ArXiv abs\/2211.11501 (2022)."},{"key":"e_1_2_1_41_1","volume-title":"Holistic Evaluation of Language Models. ArXiv abs\/2211.09110","author":"Liang Percy","year":"2022","unstructured":"Percy Liang , Rishi Bommasani , Tony Lee , Dimitris Tsipras , Dilara Soylu , and more. 2022. Holistic Evaluation of Language Models. ArXiv abs\/2211.09110 ( 2022 ). Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, and more. 2022. Holistic Evaluation of Language Models. ArXiv abs\/2211.09110 (2022)."},{"key":"e_1_2_1_42_1","unstructured":"Opher Lieber Or Sharir Barak Lenz and Yoav Shoham. 2021. Jurassic-1: Technical details and evaluation. (2021).  Opher Lieber Or Sharir Barak Lenz and Yoav Shoham. 2021. Jurassic-1: Technical details and evaluation. (2021)."},{"key":"e_1_2_1_43_1","volume-title":"Proceedings of NAACL-HLT","author":"Lockard Colin","year":"2019","unstructured":"Colin Lockard , Prashant Shiralkar , and Xin Luna Dong . 2019 . OpenCeres: When Open Information Extraction Meets the Semi-Structured Web . Proceedings of NAACL-HLT (2019). Colin Lockard, Prashant Shiralkar, and Xin Luna Dong. 2019. OpenCeres: When Open Information Extraction Meets the Semi-Structured Web. 
Proceedings of NAACL-HLT (2019)."},{"key":"e_1_2_1_44_1","volume-title":"Xin Luna Dong, and Hannaneh Hajishirzi","author":"Lockard Colin","year":"2020","unstructured":"Colin Lockard , Prashant Shiralkar , Xin Luna Dong, and Hannaneh Hajishirzi . 2020 . ZeroShotCeres: Zero- Shot Relation Extraction from Semi-Structured Web-pages. ACL ( 2020). Colin Lockard, Prashant Shiralkar, Xin Luna Dong, and Hannaneh Hajishirzi. 2020. ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Web-pages. ACL (2020)."},{"key":"e_1_2_1_45_1","volume-title":"Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.","author":"Schmitz Michael","year":"2012","unstructured":"Mausam, Michael Schmitz , Stephen Soderland , Robert Bart , and Oren Etzioni . 2012 . Open Language Learning for Information Extraction . Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Mausam, Michael Schmitz, Stephen Soderland, Robert Bart, and Oren Etzioni. 2012. Open Language Learning for Information Extraction. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.14778\/3574245.3574258"},{"key":"e_1_2_1_47_1","volume-title":"Proceedings of the VLDB Endowment","author":"Nargesian Fatemeh","year":"2019","unstructured":"Fatemeh Nargesian , Erkang Zhu , Rene\u00e9 J. Miller , Ken Q. Pu , and Patricia C. Arocena . 2019. Data Lake Management: Challenges and Opportunities . Proceedings of the VLDB Endowment ( 2019 ). Fatemeh Nargesian, Erkang Zhu, Rene\u00e9 J. Miller, Ken Q. Pu, and Patricia C. Arocena. 2019. Data Lake Management: Challenges and Opportunities. 
Proceedings of the VLDB Endowment (2019)."},{"key":"e_1_2_1_48_1","volume-title":"Proceedings of the 27th International Conference on Computational Linguistics.","author":"Niklaus Christina","year":"2018","unstructured":"Christina Niklaus , Matthias Cetto , Andr\u00e9 Freitas , and Siegfried Handschuh . 2018 . A Survey on Open Information Extraction . In Proceedings of the 27th International Conference on Computational Linguistics. Christina Niklaus, Matthias Cetto, Andr\u00e9 Freitas, and Siegfried Handschuh. 2018. A Survey on Open Information Extraction. In Proceedings of the 27th International Conference on Computational Linguistics."},{"key":"e_1_2_1_49_1","unstructured":"OpenAI. March 2023. OpenAI API. https:\/\/openai.com\/api\/  OpenAI. March 2023. OpenAI API. https:\/\/openai.com\/api\/"},{"key":"e_1_2_1_50_1","unstructured":"Laurel Orr. 2022. Manifest. https:\/\/github.com\/HazyResearch\/manifest.  Laurel Orr. 2022. Manifest. https:\/\/github.com\/HazyResearch\/manifest."},{"key":"e_1_2_1_51_1","unstructured":"F. Chen A. Doan P. DeRose W. Shen and R. Ramakrishnan. 2007. Building structured web community portals: A top-down compositional and incremental approach. VLDB (2007).  F. Chen A. Doan P. DeRose W. Shen and R. Ramakrishnan. 2007. Building structured web community portals: A top-down compositional and incremental approach. VLDB (2007)."},{"key":"e_1_2_1_52_1","volume-title":"100,000+ Questions for Machine Comprehension of Text. arXiv:1606.05250","author":"Rajpurkar Pranav","year":"2016","unstructured":"Pranav Rajpurkar , Jian Zhang , Konstantin Lopyrev , and Percy Liang . 2016. SQuAD : 100,000+ Questions for Machine Comprehension of Text. arXiv:1606.05250 ( 2016 ). Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. 
arXiv:1606.05250 (2016)."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.14778\/3157794.3157797"},{"key":"e_1_2_1_54_1","doi-asserted-by":"crossref","unstructured":"C. Romero and S. Ventura. 2013. Data mining in education. In Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.  C. Romero and S. Ventura. 2013. Data mining in education. In Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.","DOI":"10.1002\/widm.1075"},{"key":"e_1_2_1_55_1","volume-title":"Parameswaran","author":"Shankar Shreya","year":"2022","unstructured":"Shreya Shankar , Rolando Garcia , Joseph M. Hellerstein , and Aditya G . Parameswaran . 2022 . Operationalizing Machine Learning: An Interview Study . arXiv:2209.09125 (2022). Shreya Shankar, Rolando Garcia, Joseph M. Hellerstein, and Aditya G. Parameswaran. 2022. Operationalizing Machine Learning: An Interview Study. arXiv:2209.09125 (2022)."},{"key":"e_1_2_1_56_1","unstructured":"Ying Sheng Lianmin Zheng Binhang Yuan Zhuohan Li Max Ryabinin Daniel Y Fu Zhiqiang Xie Beidi Chen Clark Barrett Joseph E Gonzalez etal 2023. High-throughput Generative Inference of Large Language Models with a Single GPU. arXiv preprint arXiv:2303.06865 (2023).  Ying Sheng Lianmin Zheng Binhang Yuan Zhuohan Li Max Ryabinin Daniel Y Fu Zhiqiang Xie Beidi Chen Clark Barrett Joseph E Gonzalez et al. 2023. High-throughput Generative Inference of Large Language Models with a Single GPU. arXiv preprint arXiv:2303.06865 (2023)."},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.14778\/2809974.2809991"},{"key":"e_1_2_1_58_1","volume-title":"Bach","author":"Smith Ryan","year":"2022","unstructured":"Ryan Smith , Jason A. Fries , Braden Hancock , and Stephen H . Bach . 2022 . Language Models in the Loop : Incorporating Prompting into Weak Supervision . arXiv:2205.02318v1 (2022). Ryan Smith, Jason A. Fries, Braden Hancock, and Stephen H. Bach. 2022. 
Language Models in the Loop: Incorporating Prompting into Weak Supervision. arXiv:2205.02318v1 (2022)."},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.14778\/3551793.3551841"},{"key":"e_1_2_1_60_1","unstructured":"S. Raghavan S. Vaithyanathan T.S. Jayram R. Krishnamurthy and H. Zhu. 2006. Avatar information extraction system. IEEE Data Eng. Bull (2006).  S. Raghavan S. Vaithyanathan T.S. Jayram R. Krishnamurthy and H. Zhu. 2006. Avatar information extraction system. IEEE Data Eng. Bull (2006)."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.14778\/3291264.3291268"},{"key":"e_1_2_1_62_1","volume-title":"Proceedings of the 36th International Conference on Machine Learning (ICML).","author":"Varma Paroma","year":"2019","unstructured":"Paroma Varma , Frederic Sala , Ann He , Alexander Ratner , and Christopher Re . 2019 . Learning Dependency Structures for Weak Supervision Models . Proceedings of the 36th International Conference on Machine Learning (ICML). Paroma Varma, Frederic Sala, Ann He, Alexander Ratner, and Christopher Re. 2019. Learning Dependency Structures for Weak Supervision Models. Proceedings of the 36th International Conference on Machine Learning (ICML)."},{"key":"e_1_2_1_63_1","volume-title":"Ed H. Cho, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou.","author":"Wang Xuezhi","year":"2022","unstructured":"Xuezhi Wang , Jason Wei , Dale Schuurmans , Quoc Le Le , Ed H. Cho, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022 . Self-Consistency Improves Chain of Thought Reasoning in Language Models . arXiv:2203.11171v2. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le Le, Ed H. Cho, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171v2."},{"key":"e_1_2_1_64_1","volume-title":"How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. 
Nature Medicine 27 (04","author":"Wu Eric","year":"2021","unstructured":"Eric Wu , Kevin Wu , Roxana Daneshjou , David Ouyang , Daniel Ho , and James Zou . 2021. How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nature Medicine 27 (04 2021 ), 1--3. Eric Wu, Kevin Wu, Roxana Daneshjou, David Ouyang, Daniel Ho, and James Zou. 2021. How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nature Medicine 27 (04 2021), 1--3."},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/3491102.3517582"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/SPC.2013.6735131"},{"key":"e_1_2_1_67_1","volume-title":"A Survey on Neural Open Information Extraction: Current Status and Future Directions. IJCAI22","author":"Zhou Shaowen","year":"2022","unstructured":"Shaowen Zhou , Bowen Yu , Aixin Sun , Cheng Long , Jingyang Li , Haiyang Yu , Jian Sun , and Yongbin Li. 2022. A Survey on Neural Open Information Extraction: Current Status and Future Directions. IJCAI22 ( 2022 ). Shaowen Zhou, Bowen Yu, Aixin Sun, Cheng Long, Jingyang Li, Haiyang Yu, Jian Sun, and Yongbin Li. 2022. A Survey on Neural Open Information Extraction: Current Status and Future Directions. IJCAI22 (2022)."},{"key":"e_1_2_1_68_1","volume-title":"Nissen","author":"Zuckerman Diana M.","year":"2011","unstructured":"Diana M. Zuckerman , Paul Brown , and Steven E . Nissen . 2011 . Medical Device Recalls and the FDA Approval Process. Archives of Internal Medicine 171, 11 (06 2011), 1006--1011. Diana M. Zuckerman, Paul Brown, and Steven E. Nissen. 2011. Medical Device Recalls and the FDA Approval Process. 
Archives of Internal Medicine 171, 11 (06 2011), 1006--1011."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3626292.3626294","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,1,8]],"date-time":"2024-01-08T23:11:30Z","timestamp":1704755490000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3626292.3626294"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10]]},"references-count":68,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,10]]}},"alternative-id":["10.14778\/3626292.3626294"],"URL":"https:\/\/doi.org\/10.14778\/3626292.3626294","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2023,10]]},"assertion":[{"value":"2023-10-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}