{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,11]],"date-time":"2026-01-11T05:16:09Z","timestamp":1768108569168,"version":"3.49.0"},"reference-count":17,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,8]]},"abstract":"<jats:p>Searching tables from poorly maintained data lakes has long been recognized as a formidable challenge in the realm of data management. There are three pivotal tasks: keyword-based, joinable and unionable table search, which form the backbone of tasks that aim to make sense of diverse datasets, such as machine learning. In this demo, we propose LakeCompass, an end-to-end prototype system that maintains abundant tabular data, supports all above search tasks with high efficacy, and well serves downstream ML modeling. To be specific, LakeCompass manages numerous real tables over which diverse types of indexes are built to support efficient search based on different user requirements. Particularly, LakeCompass could automatically integrate these discovered tables to improve the downstream model performance in an iterative approach. Finally, we provide both Python APIs and Web interface to facilitate flexible user interaction.<\/jats:p>","DOI":"10.14778\/3685800.3685880","type":"journal-article","created":{"date-parts":[[2024,11,8]],"date-time":"2024-11-08T17:25:21Z","timestamp":1731086721000},"page":"4381-4384","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["LakeCompass: An End-to-End System for Data Maintenance, Search and Analysis in Data Lakes"],"prefix":"10.14778","volume":"17","author":[{"given":"Chengliang","family":"Chai","sequence":"first","affiliation":[{"name":"Beijing Institute of Technology"}]},{"given":"Yuhao","family":"Deng","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}]},{"given":"Yutong","family":"Zhan","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}]},{"given":"Ziqi","family":"Cao","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}]},{"given":"Yuanfang","family":"Zhang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}]},{"given":"Lei","family":"Cao","sequence":"additional","affiliation":[{"name":"University of Arizona\/MIT"}]},{"given":"Yuping","family":"Wang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}]},{"given":"Zhiwei","family":"Zhang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}]},{"given":"Ye","family":"Yuan","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}]},{"given":"Guoren","family":"Wang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology"}]},{"given":"Nan","family":"Tang","sequence":"additional","affiliation":[{"name":"HKUST (GZ)"}]}],"member":"320","published-online":{"date-parts":[[2024,11,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476311.3476346"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.14778\/3523210.3523223"},{"key":"e_1_2_1_3_1","volume-title":"Proc. VLDB Endow.","author":"Chepurko Nadiia","year":"2020","unstructured":"Nadiia Chepurko and Ryan Marcus et al. 2020. ARDA: Automatic Relational Data Augmentation for Machine Learning. Proc. VLDB Endow. (2020)."},{"key":"e_1_2_1_4_1","volume-title":"LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes. In VLDB","author":"Deng Yuhao","year":"2024","unstructured":"Yuhao Deng, Chengliang Chai, and Lei Cao et al. 2024. LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes. In VLDB 2024."},{"key":"e_1_2_1_5_1","volume-title":"COCOA: COrrelation COefficient-Aware Data Augmentation. In EDBT","author":"Esmailoghli Mahdi","year":"2021","unstructured":"Mahdi Esmailoghli, Jorge-Arnulfo Quian\u00e9-Ruiz, and Ziawasch Abedjan. 2021. COCOA: COrrelation COefficient-Aware Data Augmentation. In EDBT 2021."},{"key":"e_1_2_1_6_1","volume-title":"Aurum: A Data Discovery System. In ICDE.","author":"Fernandez Raul Castro","year":"2018","unstructured":"Raul Castro Fernandez et al. 2018. Aurum: A Data Discovery System. In ICDE."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352095"},{"key":"e_1_2_1_8_1","doi-asserted-by":"crossref","unstructured":"Grace Fan and Jin Wang et al. 2023. Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning. PVLDB (2023).","DOI":"10.14778\/3587136.3587146"},{"key":"e_1_2_1_9_1","volume-title":"Metam: Goal-Oriented Data Discovery. In ICDE","author":"Galhotra Sainyam","year":"2023","unstructured":"Sainyam Galhotra, Yue Gong, and Raul Castro Fernandez. 2023. Metam: Goal-Oriented Data Discovery. In ICDE 2023."},{"key":"e_1_2_1_10_1","volume-title":"TABBIE: Pretrained Representations of Tabular Data. In NAACL-HLT","author":"Iida Hiroshi","year":"2021","unstructured":"Hiroshi Iida, Dung Thai, Varun Manjunatha, and Mohit Iyyer. 2021. TABBIE: Pretrained Representations of Tabular Data. In NAACL-HLT 2021."},{"key":"e_1_2_1_11_1","volume-title":"SANTOS: Relationship-based Semantic Table Union Search. CoRR abs\/2209.13589","author":"Khatiwada Aamod","year":"2022","unstructured":"Aamod Khatiwada, Grace Fan, and Roee Shraga et al. 2022. SANTOS: Relationship-based Semantic Table Union Search. CoRR abs\/2209.13589 (2022)."},{"key":"e_1_2_1_12_1","volume-title":"Miller","author":"Khatiwada Aamod","year":"2023","unstructured":"Aamod Khatiwada, Roee Shraga, and Ren\u00e9e J. Miller. 2023. DIALITE: Discover, Align and Integrate Open Data Tables. In SIGMOD\/PODS 2023."},{"key":"e_1_2_1_13_1","volume-title":"Feature Augmentation with Reinforcement Learning. In ICDE","author":"Liu Jiabin","year":"2022","unstructured":"Jiabin Liu, Chengliang Chai, and Yuyu Luo et al. 2022. Feature Augmentation with Reinforcement Learning. In ICDE 2022."},{"key":"e_1_2_1_14_1","first-page":"59","article-title":"Making Open Data Transparent: Data Discovery on Open Data","volume":"41","author":"Miller Ren\u00e9e J.","year":"2018","unstructured":"Ren\u00e9e J. Miller and Fatemeh Nargesian et al. 2018. Making Open Data Transparent: Data Discovery on Open Data. IEEE Data Eng. Bull. 41, 2 (2018), 59--70.","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_1_15_1","volume-title":"SIGMOD","author":"Yakout Mohamed","year":"2012","unstructured":"Mohamed Yakout and Kris Ganjam et al. 2012. InfoGather: entity augmentation and attribute discovery by holistic matching with web tables. In SIGMOD 2012."},{"key":"e_1_2_1_16_1","volume-title":"JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In SIGMOD","author":"Zhu Erkang","year":"2019","unstructured":"Erkang Zhu and Dong Deng et al. 2019. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In SIGMOD 2019."},{"key":"e_1_2_1_17_1","first-page":"12","volume-title":"Proc. VLDB Endow. 9","author":"Zhu Erkang","year":"2016","unstructured":"Erkang Zhu, Fatemeh Nargesian, Ken Q. Pu, and Ren\u00e9e J. Miller. 2016. LSH Ensemble: Internet-Scale Domain Search. Proc. VLDB Endow. 9, 12 (2016)."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3685800.3685880","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,12,31]],"date-time":"2024-12-31T05:24:08Z","timestamp":1735622648000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3685800.3685880"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8]]},"references-count":17,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2024,8]]}},"alternative-id":["10.14778\/3685800.3685880"],"URL":"https:\/\/doi.org\/10.14778\/3685800.3685880","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,8]]},"assertion":[{"value":"2024-11-08","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}