{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,17]],"date-time":"2026-06-17T16:32:09Z","timestamp":1781713929894,"version":"3.54.5"},"reference-count":61,"publisher":"Association for Computing Machinery (ACM)","issue":"7","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2023,3]]},"abstract":"<jats:p>Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We utilize the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that allows exploring a variety of design choices to compute the unionability score between two tables accordingly. Empirical results on real table benchmarks show that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8 in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing).<\/jats:p>","DOI":"10.14778\/3587136.3587146","type":"journal-article","created":{"date-parts":[[2023,5,8]],"date-time":"2023-05-08T23:11:35Z","timestamp":1683587495000},"page":"1726-1739","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":66,"title":["Semantics-Aware Dataset Discovery from Data Lakes with Contextualized Column-Based Representation Learning"],"prefix":"10.14778","volume":"16","author":[{"given":"Grace","family":"Fan","sequence":"first","affiliation":[{"name":"Northeastern University, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jin","family":"Wang","sequence":"additional","affiliation":[{"name":"Megagon Labs, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yuliang","family":"Li","sequence":"additional","affiliation":[{"name":"Megagon Labs, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Dan","family":"Zhang","sequence":"additional","affiliation":[{"name":"Megagon Labs, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ren\u00e9e J.","family":"Miller","sequence":"additional","affiliation":[{"name":"Northeastern University, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2023,5,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536336.2536343"},{"key":"e_1_2_1_2_1","doi-asserted-by":"crossref","unstructured":"Alex Bogatu Alvaro A. A. Fernandes Norman W. Paton and Nikolaos Konstantinou. 2020. Dataset Discovery in Data Lakes. In ICDE. 709--720. Alex Bogatu Alvaro A. A. Fernandes Norman W. Paton and Nikolaos Konstantinou. 2020. Dataset Discovery in Data Lakes. In ICDE. 709--720.","DOI":"10.1109\/ICDE48307.2020.00067"},{"key":"e_1_2_1_3_1","volume-title":"Noy","author":"Brickley Dan","year":"2019","unstructured":"Dan Brickley , Matthew Burgess , and Natasha F . Noy . 2019 . Google Dataset Search: Building a search engine for datasets in an open Web ecosystem. In WWW. 1365--1375. Dan Brickley, Matthew Burgess, and Natasha F. Noy. 2019. Google Dataset Search: Building a search engine for datasets in an open Web ecosystem. In WWW. 1365--1375."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687750"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453916"},{"key":"e_1_2_1_6_1","doi-asserted-by":"crossref","unstructured":"Riccardo Cappuzzo Paolo Papotti and Saravanan Thirumuruganathan. 2020. Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. In SIGMOD David Maier Rachel Pottinger AnHai Doan Wang-Chiew Tan Abdussalam Alawini and Hung Q. Ngo (Eds.). 1335--1349. Riccardo Cappuzzo Paolo Papotti and Saravanan Thirumuruganathan. 2020. Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. In SIGMOD David Maier Rachel Pottinger AnHai Doan Wang-Chiew Tan Abdussalam Alawini and Hung Q. Ngo (Eds.). 1335--1349.","DOI":"10.1145\/3318464.3389742"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476311.3476346"},{"key":"e_1_2_1_8_1","doi-asserted-by":"crossref","unstructured":"Moses Charikar. 2002. Similarity estimation techniques from rounding algorithms. In STOC. 380--388. Moses Charikar. 2002. Similarity estimation techniques from rounding algorithms. In STOC. 380--388.","DOI":"10.1145\/509907.509965"},{"key":"e_1_2_1_9_1","doi-asserted-by":"crossref","unstructured":"Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In KDD. ACM 785--794. Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In KDD. ACM 785--794.","DOI":"10.1145\/2939672.2939785"},{"key":"e_1_2_1_10_1","first-page":"1597","article-title":"A Simple Framework for Contrastive Learning of Visual Representations","volume":"119","author":"Chen Ting","year":"2020","unstructured":"Ting Chen , Simon Kornblith , Mohammad Norouzi , and Geoffrey E. Hinton . 2020 . A Simple Framework for Contrastive Learning of Visual Representations . In ICML , Vol. 119. 1597 -- 1607 . Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. 2020. A Simple Framework for Contrastive Learning of Visual Representations. In ICML, Vol. 119. 1597--1607.","journal-title":"ICML"},{"key":"e_1_2_1_11_1","unstructured":"Tianji Cong James Gale Jason Frantz H. V. Jagadish and \u00c7agatay Demiralp. 2023. WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses. In CIDR. Tianji Cong James Gale Jason Frantz H. V. Jagadish and \u00c7agatay Demiralp. 2023. WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses. In CIDR."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/3430915.3430921"},{"key":"e_1_2_1_13_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT. 4171--4186.","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT. 4171--4186. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT. 4171--4186."},{"key":"e_1_2_1_14_1","doi-asserted-by":"crossref","unstructured":"Yuyang Dong Kunihiro Takeoka Chuan Xiao and Masafumi Oyamada. 2021. Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach. In ICDE. 456--467. Yuyang Dong Kunihiro Takeoka Chuan Xiao and Masafumi Oyamada. 2021. Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach. In ICDE. 456--467.","DOI":"10.1109\/ICDE51399.2021.00046"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2210.01922"},{"key":"e_1_2_1_16_1","volume-title":"CLAMS: Bringing Quality to Data Lakes. In SIGMOD. 2089--2092.","author":"Farid Mina H.","year":"2016","unstructured":"Mina H. Farid , Alexandra Roatis , Ihab F. Ilyas , Hella-Franziska Hoffmann , and Xu Chu . 2016 . CLAMS: Bringing Quality to Data Lakes. In SIGMOD. 2089--2092. Mina H. Farid, Alexandra Roatis, Ihab F. Ilyas, Hella-Franziska Hoffmann, and Xu Chu. 2016. CLAMS: Bringing Quality to Data Lakes. In SIGMOD. 2089--2092."},{"key":"e_1_2_1_17_1","volume-title":"Aurum: A Data Discovery System. In ICDE. 1001--1012.","author":"Fernandez Raul Castro","year":"2018","unstructured":"Raul Castro Fernandez , Ziawasch Abedjan , Famien Koko , Gina Yuan , Samuel Madden , and Michael Stonebraker . 2018 . Aurum: A Data Discovery System. In ICDE. 1001--1012. Raul Castro Fernandez, Ziawasch Abedjan, Famien Koko, Gina Yuan, Samuel Madden, and Michael Stonebraker. 2018. Aurum: A Data Discovery System. In ICDE. 1001--1012."},{"key":"e_1_2_1_18_1","volume-title":"Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang.","author":"Fernandez Raul Castro","year":"2018","unstructured":"Raul Castro Fernandez , Essam Mansour , Abdulhakim Ali Qahtan , Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2018 . Seeping Semantics : Linking Datasets Using Word Embeddings for Data Discovery. In ICDE. 989--1000. Raul Castro Fernandez, Essam Mansour, Abdulhakim Ali Qahtan, Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2018. Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery. In ICDE. 989--1000."},{"key":"e_1_2_1_19_1","doi-asserted-by":"crossref","unstructured":"Sainyam Galhotra and Udayan Khurana. 2020. Semantic Search over Structured Data. In CIKM. 3381--3384. Sainyam Galhotra and Udayan Khurana. 2020. Semantic Search over Structured Data. In CIKM. 3381--3384.","DOI":"10.1145\/3340531.3417426"},{"key":"e_1_2_1_20_1","unstructured":"Aristides Gionis Piotr Indyk and Rajeev Motwani. 1999. Similarity Search in High Dimensions via Hashing. In VLDB. Morgan Kaufmann 518--529. Aristides Gionis Piotr Indyk and Rajeev Motwani. 1999. Similarity Search in High Dimensions via Hashing. In VLDB. Morgan Kaufmann 518--529."},{"key":"e_1_2_1_21_1","doi-asserted-by":"crossref","unstructured":"Hazar Harmouch Thorsten Papenbrock and Felix Naumann. 2021. Relational Header Discovery using Similarity Search in a Table Corpus. In ICDE. 444--455. Hazar Harmouch Thorsten Papenbrock and Felix Naumann. 2021. Relational Header Discovery using Similarity Search in a Table Corpus. In ICDE. 444--455.","DOI":"10.1109\/ICDE51399.2021.00045"},{"key":"e_1_2_1_22_1","volume-title":"Michiel A. Bakker, Emanuel Zgraggen, Arvind Satyanarayan, Tim Kraska, \u00c7agatay Demiralp, and C\u00e9sar A. Hidalgo.","author":"Hulsebos Madelon","year":"2019","unstructured":"Madelon Hulsebos , Kevin Zeng Hu , Michiel A. Bakker, Emanuel Zgraggen, Arvind Satyanarayan, Tim Kraska, \u00c7agatay Demiralp, and C\u00e9sar A. Hidalgo. 2019 . Sherlock : A Deep Learning Approach to Semantic Data Type Detection. In KDD. 1500--1508. Madelon Hulsebos, Kevin Zeng Hu, Michiel A. Bakker, Emanuel Zgraggen, Arvind Satyanarayan, Tim Kraska, \u00c7agatay Demiralp, and C\u00e9sar A. Hidalgo. 2019. Sherlock: A Deep Learning Approach to Semantic Data Type Detection. In KDD. 1500--1508."},{"key":"e_1_2_1_23_1","volume-title":"TABBIE: Pretrained Representations of Tabular Data. In NAACL-HLT. 3446--3456.","author":"Iida Hiroshi","year":"2021","unstructured":"Hiroshi Iida , Dung Thai , Varun Manjunatha , and Mohit Iyyer . 2021 . TABBIE: Pretrained Representations of Tabular Data. In NAACL-HLT. 3446--3456. Hiroshi Iida, Dung Thai, Varun Manjunatha, and Mohit Iyyer. 2021. TABBIE: Pretrained Representations of Tabular Data. In NAACL-HLT. 3446--3456."},{"key":"e_1_2_1_24_1","volume-title":"SANTOS: Relationship-based Semantic Table Union Search. In SIGMOD.","author":"Khatiwada Aamod","year":"2023","unstructured":"Aamod Khatiwada , Grace Fan , Roee Shraga , Zixuan Chen , Wolfgang Gatterbauer , Ren\u00e9e J. Miller , and Mirek Riedewald . 2023 . SANTOS: Relationship-based Semantic Table Union Search. In SIGMOD. Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Ren\u00e9e J. Miller, and Mirek Riedewald. 2023. SANTOS: Relationship-based Semantic Table Union Search. In SIGMOD."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.14778\/3574245.3574274"},{"key":"e_1_2_1_26_1","volume-title":"Valentine: Evaluating Matching Techniques for Dataset Discovery. In ICDE. 468--479.","author":"Koutras Christos","year":"2021","unstructured":"Christos Koutras , George Siachamis , Andra Ionescu , Kyriakos Psarakis , Jerry Brons , Marios Fragkoulis , Christoph Lofi , Angela Bonifati , and Asterios Katsifodimos . 2021 . Valentine: Evaluating Matching Techniques for Dataset Discovery. In ICDE. 468--479. Christos Koutras, George Siachamis, Andra Ionescu, Kyriakos Psarakis, Jerry Brons, Marios Fragkoulis, Christoph Lofi, Angela Bonifati, and Asterios Katsifodimos. 2021. Valentine: Evaluating Matching Techniques for Dataset Discovery. In ICDE. 468--479."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137657"},{"key":"e_1_2_1_28_1","doi-asserted-by":"crossref","unstructured":"Oliver Lehmberg Dominique Ritze Robert Meusel and Christian Bizer. 2016. A Large Public Corpus of Web Tables containing Time and Context Metadata. In WWW (Companion Volume). ACM 75--76. Oliver Lehmberg Dominique Ritze Robert Meusel and Christian Bizer. 2016. A Large Public Corpus of Web Tables containing Time and Context Metadata. In WWW (Companion Volume). ACM 75--76.","DOI":"10.1145\/2872518.2889386"},{"key":"e_1_2_1_29_1","volume-title":"Wolfgang Gatterbauer, Ren\u00e9e J. Miller, and Mirek Riedewald.","author":"Leventidis Aristotelis","year":"2021","unstructured":"Aristotelis Leventidis , Laura Di Rocco , Wolfgang Gatterbauer, Ren\u00e9e J. Miller, and Mirek Riedewald. 2021 . DomainNet: Homograph Detection for Data Lake Disambiguation. In EDBT. 13--24. Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Ren\u00e9e J. Miller, and Mirek Riedewald. 2021. DomainNet: Homograph Detection for Data Lake Disambiguation. In EDBT. 13--24."},{"key":"e_1_2_1_30_1","unstructured":"Chen Li Jiaheng Lu and Yiming Lu. 2008. Efficient Merging and Filtering Algorithms for Approximate String Searches. In ICDE. 257--266. Chen Li Jiaheng Lu and Yiming Lu. 2008. Efficient Merging and Filtering Algorithms for Approximate String Searches. In ICDE. 257--266."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.14778\/3421424.3421431"},{"key":"e_1_2_1_32_1","article-title":"Deep Entity Matching","volume":"13","author":"Li Yuliang","year":"2021","unstructured":"Yuliang Li , Jinfeng Li , Yoshihiko Suhara , Jin Wang , Wataru Hirota , and Wang-Chiew Tan . 2021 . Deep Entity Matching : Challenges and Opportunities. ACM J. Data Inf. Qual. 13 , 1 (2021), 1:1--1:17. Yuliang Li, Jinfeng Li, Yoshihiko Suhara, Jin Wang, Wataru Hirota, and Wang-Chiew Tan. 2021. Deep Entity Matching: Challenges and Opportunities. ACM J. Data Inf. Qual. 13, 1 (2021), 1:1--1:17.","journal-title":"Challenges and Opportunities. ACM J. Data Inf. Qual."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1921005"},{"key":"e_1_2_1_34_1","unstructured":"Xiao Ling Alon Y. Halevy Fei Wu and Cong Yu. 2013. Synthesizing Union Tables from the Web. In IJCAI. 2677--2683. Xiao Ling Alon Y. Halevy Fei Wu and Cong Yu. 2013. Synthesizing Union Tables from the Web. In IJCAI. 2677--2683."},{"key":"e_1_2_1_35_1","volume-title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs\/1907.11692","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu , Myle Ott , Naman Goyal , Jingfei Du , Mandar Joshi , Danqi Chen , Omer Levy , Mike Lewis , Luke Zettlemoyer , and Veselin Stoyanov . 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs\/1907.11692 ( 2019 ). Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs\/1907.11692 (2019)."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2889473"},{"key":"e_1_2_1_37_1","volume-title":"Introduction to information retrieval","author":"Manning Christopher D.","unstructured":"Christopher D. Manning , Prabhakar Raghavan , and Hinrich Sch\u00fctze . 2008. Introduction to information retrieval . Cambridge University Press . Christopher D. Manning, Prabhakar Raghavan, and Hinrich Sch\u00fctze. 2008. Introduction to information retrieval. Cambridge University Press."},{"key":"e_1_2_1_38_1","volume-title":"ISWC","volume":"1690","author":"Mazumdar Suvodeep","year":"2016","unstructured":"Suvodeep Mazumdar and Ziqi Zhang . 2016 . Visualizing Semantic Table Annotations with TableMiner+ . In ISWC , Vol. 1690 . Suvodeep Mazumdar and Ziqi Zhang. 2016. Visualizing Semantic Table Annotations with TableMiner+. In ISWC, Vol. 1690."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.14778\/3229863.3240491"},{"key":"e_1_2_1_40_1","first-page":"59","article-title":"Making Open Data Transparent: Data Discovery on Open Data","volume":"41","author":"Miller Ren\u00e9e J.","year":"2018","unstructured":"Ren\u00e9e J. Miller , Fatemeh Nargesian , Erkang Zhu , Christina Christodoulakis , Ken Q. Pu , and Periklis Andritsos . 2018 . Making Open Data Transparent: Data Discovery on Open Data . IEEE Data Eng. Bull. 41 , 2 (2018), 59 -- 70 . Ren\u00e9e J. Miller, Fatemeh Nargesian, Erkang Zhu, Christina Christodoulakis, Ken Q. Pu, and Periklis Andritsos. 2018. Making Open Data Transparent: Data Discovery on Open Data. IEEE Data Eng. Bull. 41, 2 (2018), 59--70.","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352116"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.14778\/3192965.3192973"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.14778\/3384345.3384346"},{"key":"e_1_2_1_44_1","volume-title":"Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks","author":"Reimers Nils","unstructured":"Nils Reimers and Iryna Gurevych . 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks . In EMNLP. Association for Computational Linguistics , 3980--3990. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP. Association for Computational Linguistics, 3980--3990."},{"key":"e_1_2_1_45_1","doi-asserted-by":"crossref","unstructured":"A\u00e9cio S. R. Santos Aline Bessa Christopher Musco and Juliana Freire. 2022. A Sketch-based Index for Correlated Dataset Search. In ICDE. 2928--2941. A\u00e9cio S. R. Santos Aline Bessa Christopher Musco and Juliana Freire. 2022. A Sketch-based Index for Correlated Dataset Search. In ICDE. 2928--2941.","DOI":"10.1109\/ICDE53745.2022.00264"},{"key":"e_1_2_1_46_1","doi-asserted-by":"crossref","unstructured":"Anish Das Sarma Lujun Fang Nitin Gupta Alon Y. Halevy Hongrae Lee Fei Wu Reynold Xin and Cong Yu. 2012. Finding related tables. In SIGMOD. 817--828. Anish Das Sarma Lujun Fang Nitin Gupta Alon Y. Halevy Hongrae Lee Fei Wu Reynold Xin and Cong Yu. 2012. Finding related tables. In SIGMOD. 817--828.","DOI":"10.1145\/2213836.2213962"},{"key":"e_1_2_1_47_1","doi-asserted-by":"crossref","unstructured":"Yoshihiko Suhara Jinfeng Li Yuliang Li Dan Zhang \u00c7agatay Demiralp Chen Chen and Wang-Chiew Tan. 2022. Annotating Columns with Pre-trained Language Models. In SIGMOD. 1493--1503. Yoshihiko Suhara Jinfeng Li Yuliang Li Dan Zhang \u00c7agatay Demiralp Chen Chen and Wang-Chiew Tan. 2022. Annotating Columns with Pre-trained Language Models. In SIGMOD. 1493--1503.","DOI":"10.1145\/3514221.3517906"},{"key":"e_1_2_1_48_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is All you Need. In NeurIPS. 5998--6008. Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is All you Need. In NeurIPS. 5998--6008."},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.14778\/2002938.2002939"},{"key":"e_1_2_1_50_1","volume-title":"Xin Luna Dong, and Meng Jiang","author":"Wang Daheng","year":"2021","unstructured":"Daheng Wang , Prashant Shiralkar , Colin Lockard , Binxuan Huang , Xin Luna Dong, and Meng Jiang . 2021 . TCN : Table Convolutional Network for Web Table Interpretation. In WWW. 4020--4032. Daheng Wang, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Xin Luna Dong, and Meng Jiang. 2021. TCN: Table Convolutional Network for Web Table Interpretation. In WWW. 4020--4032."},{"key":"e_1_2_1_51_1","doi-asserted-by":"crossref","unstructured":"Jin Wang Chunbin Lin and Carlo Zaniolo. 2019. MF-Join: Efficient Fuzzy String Similarity Join with Multi-level Filtering. In ICDE. 386--397. Jin Wang Chunbin Lin and Carlo Zaniolo. 2019. MF-Join: Efficient Fuzzy String Similarity Join with Multi-level Filtering. In ICDE. 386--397.","DOI":"10.1109\/ICDE.2019.00042"},{"key":"e_1_2_1_52_1","volume-title":"Transformers: State-of-the-Art Natural Language Processing. In EMNLP. 38--45.","author":"Wolf Thomas","year":"2020","unstructured":"Thomas Wolf , Lysandre Debut , Victor Sanh , and 2020 . Transformers: State-of-the-Art Natural Language Processing. In EMNLP. 38--45. Thomas Wolf, Lysandre Debut, Victor Sanh, and et al. 2020. Transformers: State-of-the-Art Natural Language Processing. In EMNLP. 38--45."},{"key":"e_1_2_1_53_1","doi-asserted-by":"crossref","unstructured":"Jiacheng Wu Yong Zhang Jin Wang Chunbin Lin Yingjia Fu and Chunxiao Xing. 2019. Scalable Metric Similarity Join Using MapReduce. In ICDE. 1662--1665. Jiacheng Wu Yong Zhang Jin Wang Chunbin Lin Yingjia Fu and Chunxiao Xing. 2019. Scalable Metric Similarity Join Using MapReduce. In ICDE. 1662--1665.","DOI":"10.1109\/ICDE.2019.00167"},{"key":"e_1_2_1_54_1","doi-asserted-by":"crossref","unstructured":"Mohamed Yakout Kris Ganjam Kaushik Chakrabarti and Surajit Chaudhuri. 2012. InfoGather: entity augmentation and attribute discovery by holistic matching with web tables. In SIGMOD. ACM 97--108. Mohamed Yakout Kris Ganjam Kaushik Chakrabarti and Surajit Chaudhuri. 2012. InfoGather: entity augmentation and attribute discovery by holistic matching with web tables. In SIGMOD. ACM 97--108.","DOI":"10.1145\/2213836.2213848"},{"key":"e_1_2_1_55_1","unstructured":"Pengcheng Yin Graham Neubig Wen-tau Yih and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In ACL. 8413--8426. Pengcheng Yin Graham Neubig Wen-tau Yih and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In ACL. 8413--8426."},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407793"},{"key":"e_1_2_1_57_1","volume-title":"Ives","author":"Zhang Yi","year":"2020","unstructured":"Yi Zhang and Zachary G . Ives . 2020 . Finding Related Tables in Data Lakes for Interactive Data Science. In SIGMOD. 1951--1966. Yi Zhang and Zachary G. Ives. 2020. Finding Related Tables in Data Lakes for Interactive Data Science. In SIGMOD. 1951--1966."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.3233\/SW-160242"},{"key":"e_1_2_1_59_1","volume-title":"Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation. In SIGMOD. 1504--1517.","author":"Zhao Zixuan","year":"2022","unstructured":"Zixuan Zhao and Raul Castro Fernandez . 2022 . Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation. In SIGMOD. 1504--1517. Zixuan Zhao and Raul Castro Fernandez. 2022. Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation. In SIGMOD. 1504--1517."},{"key":"e_1_2_1_60_1","volume-title":"Miller","author":"Zhu Erkang","year":"2019","unstructured":"Erkang Zhu , Dong Deng , Fatemeh Nargesian , and Ren\u00e9e J . Miller . 2019 . JOSIE : Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In SIGMOD. 847--864. Erkang Zhu, Dong Deng, Fatemeh Nargesian, and Ren\u00e9e J. Miller. 2019. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In SIGMOD. 847--864."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994534"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3587136.3587146","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,20]],"date-time":"2024-10-20T00:52:06Z","timestamp":1729385526000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3587136.3587146"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3]]},"references-count":61,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2023,3]]}},"alternative-id":["10.14778\/3587136.3587146"],"URL":"https:\/\/doi.org\/10.14778\/3587136.3587146","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2023,3]]},"assertion":[{"value":"2023-05-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}