{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,2]],"date-time":"2026-06-02T22:23:12Z","timestamp":1780438992973,"version":"3.54.1"},"reference-count":46,"publisher":"Association for Computing Machinery (ACM)","issue":"8","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,4]]},"abstract":"<jats:p>Discovering tables from poorly maintained data lakes is a significant challenge in data management. Two key tasks are identifying joinable and unionable tables, crucial for data integration, analysis, and machine learning. However, there's a lack of a comprehensive benchmark for evaluating existing methods. To address this, we introduce LakeBench, a large-scale table discovery benchmark. It evaluates effectiveness, efficiency, and scalability of table join &amp; union search methods. With over 16 million real tables, LakeBench is 1,600X larger than existing datasets and 100X larger in storage size. It includes synthesized and real queries with ground truth, totaling more than 10 thousand queries - 10X more than used in any existing evaluation. We spent over 7,500 human hours labeling these queries and constructing diverse query categories for thorough evaluation. Our benchmark thoroughly evaluates state-of-the-art table discovery methods, providing insights into their performance and highlighting research opportunities.<\/jats:p>","DOI":"10.14778\/3659437.3659448","type":"journal-article","created":{"date-parts":[[2024,5,31]],"date-time":"2024-05-31T16:22:27Z","timestamp":1717172547000},"page":"1925-1938","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":19,"title":["LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes"],"prefix":"10.14778","volume":"17","author":[{"given":"Yuhao","family":"Deng","sequence":"first","affiliation":[{"name":"Beijing Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chengliang","family":"Chai","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lei","family":"Cao","sequence":"additional","affiliation":[{"name":"University of Arizona\/MIT"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Qin","family":"Yuan","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Siyuan","family":"Chen","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yanrui","family":"Yu","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zhaoze","family":"Sun","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Junyi","family":"Wang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jiajun","family":"Li","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ziqi","family":"Cao","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Kaisen","family":"Jin","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chi","family":"Zhang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yuqing","family":"Jiang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yuanfang","family":"Zhang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yuping","family":"Wang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ye","family":"Yuan","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Guoren","family":"Wang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Nan","family":"Tang","sequence":"additional","affiliation":[{"name":"HKUST (GZ)"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2024,5,31]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"[n.d.]. OpenData. https:\/\/open.canada.ca\/."},{"key":"e_1_2_1_2_1","unstructured":"[n.d.]. WebTable. https:\/\/webdatacommons.org\/webtables\/."},{"key":"e_1_2_1_3_1","volume-title":"Dataset Discovery in Data Lakes. In 36th IEEE International Conference on Data Engineering, ICDE 2020","author":"Bogatu Alex","year":"2020","unstructured":"Alex Bogatu, Alvaro A. A. Fernandes, Norman W. Paton, and Nikolaos Konstantinou. 2020. Dataset Discovery in Data Lakes. In 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20--24, 2020. IEEE, 709--720."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687750"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389772"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3589302"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.14778\/3523210.3523223"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.14778\/3397230.3397235"},{"key":"e_1_2_1_9_1","volume-title":"Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang.","author":"Deng Dong","year":"2017","unstructured":"Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang. 2017. The Data Civilizer System. In 8th Biennial Conference on Innovative Data Systems Research, CIDR 2017, Chaminade, CA, USA, January 8--11, 2017, Online Proceedings. www.cidrdb.org. http:\/\/cidrdb.org\/cidr2017\/papers\/p44-deng-cidr17.pdf"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.14778\/3648160.3648161"},{"key":"e_1_2_1_11_1","volume-title":"IDE: A System for Iterative Mislabel Detection. In Companion of the 2024 International Conference on Management of Data, SIGMOD\/PODS 2024","author":"Deng Yuhao","year":"2024","unstructured":"Yuhao Deng, Qiyan Deng, Chengliang Chai, Lei Cao, Nan Tang, Ju Fan, Jiayi Wang, Ye Yuan, and Guoren Wang. 2024. IDE: A System for Iterative Mislabel Detection. In Companion of the 2024 International Conference on Management of Data, SIGMOD\/PODS 2024, Santiago, Chile, June 9--15, 2024. ACM."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/N19-1423"},{"key":"e_1_2_1_13_1","volume-title":"Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach. In 37th IEEE International Conference on Data Engineering, ICDE 2021","author":"Dong Yuyang","year":"2021","unstructured":"Yuyang Dong, Kunihiro Takeoka, Chuan Xiao, and Masafumi Oyamada. 2021. Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach. In 37th IEEE International Conference on Data Engineering, ICDE 2021, Chania, Greece, April 19--22, 2021. IEEE, 456--467."},{"key":"e_1_2_1_14_1","volume-title":"DeepJoin: Joinable Table Discovery with Pre-trained Language Models. CoRR abs\/2212.07588","author":"Dong Yuyang","year":"2022","unstructured":"Yuyang Dong, Chuan Xiao, Takuma Nozawa, Masafumi Enomoto, and Masafumi Oyamada. 2022. DeepJoin: Joinable Table Discovery with Pre-trained Language Models. CoRR abs\/2212.07588 (2022)."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.14778\/3529337.3529353"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3555041.3589409"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.14778\/3574245.3574274"},{"key":"e_1_2_1_18_1","volume-title":"Aurum: A Data Discovery System. In 34th IEEE International Conference on Data Engineering, ICDE 2018","author":"Fernandez Raul Castro","year":"2018","unstructured":"Raul Castro Fernandez, Ziawasch Abedjan, Famien Koko, Gina Yuan, Samuel Madden, and Michael Stonebraker. 2018. Aurum: A Data Discovery System. In 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16--19, 2018. IEEE Computer Society, 1001--1012."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2903730"},{"key":"e_1_2_1_20_1","first-page":"5","article-title":"Managing Google's data lake: an overview of the Goods system","volume":"39","author":"Halevy Alon Y.","year":"2016","unstructured":"Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. 2016. Managing Google's data lake: an overview of the Goods system. IEEE Data Eng. Bull. 39, 3 (2016), 5--14. http:\/\/sites.computer.org\/debull\/A16sept\/p5.pdf","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/2021.NAACL-MAIN.270"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2010.57"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/E17-2068"},{"key":"e_1_2_1_24_1","volume-title":"SANTOS: Relationship-based Semantic Table Union Search. CoRR abs\/2209.13589","author":"Khatiwada Aamod","year":"2022","unstructured":"Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Ren\u00e9e J. Miller, and Mirek Riedewald. 2022. SANTOS: Relationship-based Semantic Table Union Search. CoRR abs\/2209.13589 (2022)."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00047"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2882952"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.14778\/3467861.3467872"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE53745.2022.00317"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2889473"},{"key":"e_1_2_1_30_1","first-page":"59","article-title":"Making Open Data Transparent: Data Discovery on Open Data","volume":"41","author":"Miller Ren\u00e9e J.","year":"2018","unstructured":"Ren\u00e9e J. Miller, Fatemeh Nargesian, Erkang Zhu, Christina Christodoulakis, Ken Q. Pu, and Periklis Andritsos. 2018. Making Open Data Transparent: Data Discovery on Open Data. IEEE Data Eng. Bull. 41, 2 (2018), 59--70. http:\/\/sites.computer.org\/debull\/A18june\/p59.pdf","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352116"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.14778\/3192965.3192973"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2308.03883"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.14778\/2336664.2336665"},{"key":"e_1_2_1_35_1","volume-title":"a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs\/1910.01108","author":"Sanh Victor","year":"2019","unstructured":"Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs\/1910.01108 (2019). arXiv:1910.01108 http:\/\/arxiv.org\/abs\/1910.01108"},{"key":"e_1_2_1_36_1","volume-title":"Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012","author":"Sarma Anish Das","year":"2012","unstructured":"Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Y. Halevy, Hongrae Lee, Fei Wu, Reynold Xin, and Cong Yu. 2012. Finding related tables. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20--24, 2012, K. Sel\u00e7uk Candan, Yi Chen, Richard T. Snodgrass, Luis Gravano, and Ariel Fuxman (Eds.). ACM, 817--828."},{"key":"e_1_2_1_37_1","volume-title":"MPNet: Masked and Permuted Pre-training for Language Understanding. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020","author":"Song Kaitao","year":"2020","unstructured":"Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. MPNet: Masked and Permuted Pre-training for Language Understanding. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6--12, 2020, virtual, Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https:\/\/proceedings.neurips.cc\/paper\/2020\/hash\/c3a690be93aa602ee2dc0ccab5b7b67e-Abstract.html"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","unstructured":"Hugo Touvron Louis Martin and Kevin Stone et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. CoRR abs\/2307.09288 (2023). arXiv:2307.09288 10.48550\/ARXIV.2307.09288","DOI":"10.48550\/ARXIV.2307.09288"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.14778\/3561261.3561267"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319899"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/2213836.2213848"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/2020.ACL-MAIN.745"},{"key":"e_1_2_1_43_1","volume-title":"Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13--18","volume":"119","author":"Yoon Jinsung","year":"2020","unstructured":"Jinsung Yoon, Sercan \u00d6mer Arik, and Tomas Pfister. 2020. Data Valuation using Reinforcement Learning. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13--18 July 2020, Virtual Event (Proceedings of Machine Learning Research), Vol. 119. PMLR, 10842--10851. http:\/\/proceedings.mlr.press\/v119\/yoon20a.html"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407793"},{"key":"e_1_2_1_45_1","volume-title":"Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019","author":"Zhu Erkang","year":"2019","unstructured":"Erkang Zhu, Dong Deng, Fatemeh Nargesian, and Ren\u00e9e J. Miller. 2019. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, Peter A. Boncz, Stefan Manegold, Anastasia Ailamaki, Amol Deshpande, and Tim Kraska (Eds.). ACM, 847--864."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994534"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3659437.3659448","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,31]],"date-time":"2024-05-31T16:23:38Z","timestamp":1717172618000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3659437.3659448"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,4]]},"references-count":46,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2024,4]]}},"alternative-id":["10.14778\/3659437.3659448"],"URL":"https:\/\/doi.org\/10.14778\/3659437.3659448","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,4]]},"assertion":[{"value":"2024-05-31","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}