{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T08:56:01Z","timestamp":1775638561181,"version":"3.50.1"},"reference-count":80,"publisher":"Association for Computing Machinery (ACM)","issue":"4","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2022,12]]},"abstract":"<jats:p>We have made tremendous strides in providing tools for data scientists to discover new tables useful for their analyses. But despite these advances, the proper integration of discovered tables has been under-explored. An interesting semantics for integration, called Full Disjunction, was proposed in the 1980's, but there has been little progress in using it for data science to integrate tables culled from data lakes. We provide ALITE, the first proposal for scalable integration of tables that may have been discovered using join, union or related table search. We empirically show that ALITE can outperform previous algorithms for computing the Full Disjunction. ALITE relaxes previous assumptions that tables share common attribute names (which completely determine the join columns), are complete (without null values), and have acyclic join patterns. 
To evaluate ALITE, we develop and share three new benchmarks for integration that use real data lake tables.<\/jats:p>","DOI":"10.14778\/3574245.3574274","type":"journal-article","created":{"date-parts":[[2023,2,21]],"date-time":"2023-02-21T23:14:12Z","timestamp":1677021252000},"page":"932-945","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":38,"title":["Integrating Data Lake Tables"],"prefix":"10.14778","volume":"16","author":[{"given":"Aamod","family":"Khatiwada","sequence":"first","affiliation":[{"name":"Northeastern University, Boston, Massachusetts, USA"}]},{"given":"Roee","family":"Shraga","sequence":"additional","affiliation":[{"name":"Northeastern University, Boston, Massachusetts, USA"}]},{"given":"Wolfgang","family":"Gatterbauer","sequence":"additional","affiliation":[{"name":"Northeastern University, Boston, Massachusetts, USA"}]},{"given":"Ren\u00e9e J.","family":"Miller","sequence":"additional","affiliation":[{"name":"Northeastern University, Boston, Massachusetts, USA"}]}],"member":"320","published-online":{"date-parts":[[2023,2,21]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proofs from the Book","author":"Aigner Martin","year":"1999","unstructured":"Martin Aigner and G\u00fcnter M Ziegler . 1999. Proofs from the Book . Berlin . Germany 1 ( 1999 ). Martin Aigner and G\u00fcnter M Ziegler. 1999. Proofs from the Book. Berlin. Germany 1 (1999)."},{"key":"e_1_2_1_2_1","unstructured":"ALITE. 2022. https:\/\/github.com\/northeastern-datalab\/alite  ALITE. 2022. https:\/\/github.com\/northeastern-datalab\/alite"},{"key":"e_1_2_1_3_1","doi-asserted-by":"crossref","first-page":"484","DOI":"10.3844\/jcssp.2015.484.489","article-title":"Integrating correlation clustering and agglomerative hierarchical clustering for holistic schema matching","volume":"11","author":"Alshaikhdeeb Basel","year":"2015","unstructured":"Basel Alshaikhdeeb and Kamsuriah Ahmad . 2015 . 
Integrating correlation clustering and agglomerative hierarchical clustering for holistic schema matching . Journal of Computer Science 11 , 3 (2015), 484 . Basel Alshaikhdeeb and Kamsuriah Ahmad. 2015. Integrating correlation clustering and agglomerative hierarchical clustering for holistic schema matching. Journal of Computer Science 11, 3 (2015), 484.","journal-title":"Journal of Computer Science"},{"key":"e_1_2_1_4_1","unstructured":"Hugging Face BERT base model (uncased). 2022. https:\/\/huggingface.co\/bert-base-uncased  Hugging Face BERT base model (uncased). 2022. https:\/\/huggingface.co\/bert-base-uncased"},{"key":"e_1_2_1_5_1","article-title":"EBK-means: A clustering technique based on elbow method and k-means in WSN","volume":"105","author":"Bholowalia Purnima","year":"2014","unstructured":"Purnima Bholowalia and Arvind Kumar . 2014 . EBK-means: A clustering technique based on elbow method and k-means in WSN . International Journal of Computer Applications 105 , 9 (2014). Purnima Bholowalia and Arvind Kumar. 2014. EBK-means: A clustering technique based on elbow method and k-means in WSN. International Journal of Computer Applications 105, 9 (2014).","journal-title":"International Journal of Computer Applications"},{"key":"e_1_2_1_6_1","first-page":"18","article-title":"Eliminating NULLs with Subsumption and Complementation","volume":"34","author":"Bleiholder Jens","year":"2011","unstructured":"Jens Bleiholder , Melanie Herschel , and Felix Naumann . 2011 . Eliminating NULLs with Subsumption and Complementation . IEEE Data Eng. Bull. 34 , 3 (2011), 18 -- 25 . http:\/\/sites.computer.org\/debull\/A11sept\/DataFusion1.pdf Jens Bleiholder, Melanie Herschel, and Felix Naumann. 2011. Eliminating NULLs with Subsumption and Complementation. IEEE Data Eng. Bull. 34, 3 (2011), 18--25. http:\/\/sites.computer.org\/debull\/A11sept\/DataFusion1.pdf","journal-title":"IEEE Data Eng. 
Bull."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1456650.1456651"},{"key":"e_1_2_1_8_1","volume-title":"EDBT 2010, 13th International Conference on Extending Database Technology, Proceedings (ACM International Conference Proceeding Series)","volume":"426","author":"Bleiholder Jens","year":"2010","unstructured":"Jens Bleiholder , Sascha Szott , Melanie Herschel , Frank Kaufer , and Felix Naumann . 2010 . Subsumption and complementation as data fusion operators . In EDBT 2010, 13th International Conference on Extending Database Technology, Proceedings (ACM International Conference Proceeding Series) , Vol. 426 . ACM, 513--524. 10.1145\/1739041.1739103 Jens Bleiholder, Sascha Szott, Melanie Herschel, Frank Kaufer, and Felix Naumann. 2010. Subsumption and complementation as data fusion operators. In EDBT 2010, 13th International Conference on Extending Database Technology, Proceedings (ACM International Conference Proceeding Series), Vol. 426. ACM, 513--524. 10.1145\/1739041.1739103"},{"key":"e_1_2_1_9_1","volume-title":"Workshops Proceedings of the 26th International Conference on Data Engineering, ICDE 2010","author":"Bleiholder Jens","year":"2010","unstructured":"Jens Bleiholder , Sascha Szott , Melanie Herschel , and Felix Naumann . 2010 . Complement union for data integration . In Workshops Proceedings of the 26th International Conference on Data Engineering, ICDE 2010 , March 1-6, 2010. IEEE Computer Society, 183--186. 10.1109\/ICDEW. 2010.5452760 Jens Bleiholder, Sascha Szott, Melanie Herschel, and Felix Naumann. 2010. Complement union for data integration. In Workshops Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010. IEEE Computer Society, 183--186. 10.1109\/ICDEW.2010.5452760"},{"key":"e_1_2_1_10_1","volume-title":"Dataset Discovery in Data Lakes. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). 
709--720","author":"Bogatu Alex","year":"2020","unstructured":"Alex Bogatu , Alvaro A. A. Fernandes , Norman W. Paton , and Nikolaos Konstantinou . 2020 . Dataset Discovery in Data Lakes. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). 709--720 . 10.1109\/ICDE48307.2020.00067 Alex Bogatu, Alvaro A. A. Fernandes, Norman W. Paton, and Nikolaos Konstantinou. 2020. Dataset Discovery in Data Lakes. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). 709--720. 10.1109\/ICDE48307.2020.00067"},{"key":"e_1_2_1_11_1","volume-title":"The World Wide Web Conference (WWW '19)","author":"Brickley Dan","year":"2019","unstructured":"Dan Brickley , Matthew Burgess , and Natasha Noy . 2019 . Google Dataset Search: Building a Search Engine for Datasets in an Open Web Ecosystem . In The World Wide Web Conference (WWW '19) . ACM, 1365--1375. 10.1145\/3308558.3313685 Dan Brickley, Matthew Burgess, and Natasha Noy. 2019. Google Dataset Search: Building a Search Engine for Datasets in an Open Web Ecosystem. In The World Wide Web Conference (WWW '19). ACM, 1365--1375. 10.1145\/3308558.3313685"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687750"},{"key":"e_1_2_1_13_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1080\/03610927408827101","article-title":"A dendrite method for cluster analysis","volume":"3","author":"Cali\u0144ski Tadeusz","year":"1974","unstructured":"Tadeusz Cali\u0144ski and Jerzy Harabasz . 1974 . A dendrite method for cluster analysis . Communications in Statistics-theory and Methods 3 , 1 (1974), 1 -- 27 . Tadeusz Cali\u0144ski and Jerzy Harabasz. 1974. A dendrite method for cluster analysis. Communications in Statistics-theory and Methods 3, 1 (1974), 1--27.","journal-title":"Communications in Statistics-theory and Methods"},{"key":"e_1_2_1_14_1","volume-title":"Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. 
In SIGMOD Conference 2020","author":"Cappuzzo Riccardo","year":"2020","unstructured":"Riccardo Cappuzzo , Paolo Papotti , and Saravanan Thirumuruganathan . 2020 . Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. In SIGMOD Conference 2020 . ACM, 1335--1349. 10.1145\/3318464.3389742 Riccardo Cappuzzo, Paolo Papotti, and Saravanan Thirumuruganathan. 2020. Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. In SIGMOD Conference 2020. ACM, 1335--1349. 10.1145\/3318464.3389742"},{"key":"e_1_2_1_15_1","first-page":"10","article-title":"BigGorilla: An Open-Source Ecosystem for Data Preparation and Integration","volume":"41","author":"Chen Chen","year":"2018","unstructured":"Chen Chen , Behzad Golshan , Alon Y Halevy , Wang-Chiew Tan , and AnHai Doan . 2018 . BigGorilla: An Open-Source Ecosystem for Data Preparation and Integration . IEEE Data Eng. Bull. 41 , 2 (2018), 10 -- 22 . Chen Chen, Behzad Golshan, Alon Y Halevy, Wang-Chiew Tan, and AnHai Doan. 2018. BigGorilla: An Open-Source Ecosystem for Data Preparation and Integration. IEEE Data Eng. Bull. 41, 2 (2018), 10--22.","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/320107.320109"},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB '06)","author":"Cohen Sara","year":"2006","unstructured":"Sara Cohen , Itzhak Fadida , Yaron Kanza , Benny Kimelfeld , and Yehoshua Sagiv . 2006 . Full Disjunctions: Polynomial-Delay Iterators in Action . In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB '06) . VLDB Endowment, 739--750. Sara Cohen, Itzhak Fadida, Yaron Kanza, Benny Kimelfeld, and Yehoshua Sagiv. 2006. Full Disjunctions: Polynomial-Delay Iterators in Action. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB '06). 
VLDB Endowment, 739--750."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcss.2006.10.015"},{"key":"e_1_2_1_19_1","volume-title":"Finding Related Tables. In SIGMOD Conference","author":"Sarma Anish Das","year":"2012","unstructured":"Anish Das Sarma , Lujun Fang , Nitin Gupta , Alon Halevy , Hongrae Lee , Fei Wu , Reynold Xin , and Cong Yu . 2012 . Finding Related Tables. In SIGMOD Conference 2012. ACM, 817--828. 10.1145\/2213836.2213962 Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, and Cong Yu. 2012. Finding Related Tables. In SIGMOD Conference 2012. ACM, 817--828. 10.1145\/2213836.2213962"},{"key":"e_1_2_1_20_1","unstructured":"Canada Open Data. 2020. https:\/\/open.canada.ca\/en\/open-data  Canada Open Data. 2020. https:\/\/open.canada.ca\/en\/open-data"},{"key":"e_1_2_1_21_1","unstructured":"UK Open Data. 2020. https:\/\/data.gov.uk\/  UK Open Data. 2020. https:\/\/data.gov.uk\/"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.1979.4766909"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/3430915.3442430"},{"key":"e_1_2_1_24_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv abs\/1810.04805","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv abs\/1810.04805 (2019). Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv abs\/1810.04805 (2019)."},{"key":"e_1_2_1_25_1","first-page":"50060","volume-title":"VLDB'02: Proceedings of the 28th International Conference on Very Large Databases. 610--621","author":"Do Hong-Hai","year":"2002","unstructured":"Hong-Hai Do and Erhard Rahm . 2002 . 
COMA---a system for flexible combination of schema matching approaches . In VLDB'02: Proceedings of the 28th International Conference on Very Large Databases. 610--621 . 10.1016\/B978-155860869-6\/ 50060 - 50063 Hong-Hai Do and Erhard Rahm. 2002. COMA---a system for flexible combination of schema matching approaches. In VLDB'02: Proceedings of the 28th International Conference on Very Large Databases. 610--621. 10.1016\/B978-155860869-6\/50060-3"},{"key":"e_1_2_1_26_1","volume-title":"Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach. In 37th IEEE International Conference on Data Engineering, ICDE 2021","author":"Dong Yuyang","year":"2021","unstructured":"Yuyang Dong , Kunihiro Takeoka , Chuan Xiao , and Masafumi Oyamada . 2021 . Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach. In 37th IEEE International Conference on Data Engineering, ICDE 2021 . IEEE, 456--467. 10.1109\/ICDE51399.2021.00046 Yuyang Dong, Kunihiro Takeoka, Chuan Xiao, and Masafumi Oyamada. 2021. Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach. In 37th IEEE International Conference on Data Engineering, ICDE 2021. IEEE, 456--467. 10.1109\/ICDE51399.2021.00046"},{"key":"e_1_2_1_27_1","unstructured":"Hugging Face. 2022. https:\/\/huggingface.co  Hugging Face. 2022. https:\/\/huggingface.co"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2210.01922"},{"key":"e_1_2_1_29_1","volume-title":"SIGMOD Conference 2016","author":"Farid Mina H.","year":"2016","unstructured":"Mina H. Farid , Alexandra Roatis , Ihab F. Ilyas , Hella-Franziska Hoffmann , and Xu Chu . 2016 . CLAMS: Bringing Quality to Data Lakes . In SIGMOD Conference 2016 . ACM, 2089--2092. 10.1145\/2882903.2899391 Mina H. Farid, Alexandra Roatis, Ihab F. Ilyas, Hella-Franziska Hoffmann, and Xu Chu. 2016. CLAMS: Bringing Quality to Data Lakes. In SIGMOD Conference 2016. 
ACM, 2089--2092. 10.1145\/2882903.2899391"},{"key":"e_1_2_1_30_1","unstructured":"fastText. 2022. https:\/\/fasttext.cc\/docs\/en\/english-vectors.html  fastText. 2022. https:\/\/fasttext.cc\/docs\/en\/english-vectors.html"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2019.2962124"},{"key":"e_1_2_1_32_1","volume-title":"SIGMOD Conference 1994","author":"Galindo-Legaria C\u00e9sar A.","year":"1994","unstructured":"C\u00e9sar A. Galindo-Legaria . 1994 . Outerjoins as Disjunctions . In SIGMOD Conference 1994 . ACM, 348--358. 10.1145\/191839.191908 C\u00e9sar A. Galindo-Legaria. 1994. Outerjoins as Disjunctions. In SIGMOD Conference 1994. ACM, 348--358. 10.1145\/191839.191908"},{"key":"e_1_2_1_33_1","unstructured":"Gensim. 2022. https:\/\/radimrehurek.com\/gensim  Gensim. 2022. https:\/\/radimrehurek.com\/gensim"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1016308404627"},{"key":"e_1_2_1_35_1","volume-title":"Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 429--438","author":"He Bin","year":"2005","unstructured":"Bin He and Kevin Chen-Chuan Chang . 2005 . Making holistic schema matching robust: an ensemble approach . In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 429--438 . 10.1145\/1081870.1081920 Bin He and Kevin Chen-Chuan Chang. 2005. Making holistic schema matching robust: an ensemble approach. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 429--438. 10.1145\/1081870.1081920"},{"key":"e_1_2_1_36_1","volume-title":"Government's open data","author":"The","year":"2020","unstructured":"The home of the U.S. Government's open data . 2020 . https:\/\/data.gov\/ The home of the U.S. Government's open data. 2020. https:\/\/data.gov\/"},{"key":"e_1_2_1_37_1","unstructured":"IMDB. 2022. https:\/\/datasets.imdbws.com\/  IMDB. 2022. 
https:\/\/datasets.imdbws.com\/"},{"key":"e_1_2_1_38_1","volume-title":"Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759","author":"Joulin Armand","year":"2016","unstructured":"Armand Joulin , Edouard Grave , Piotr Bojanowski , and Tomas Mikolov . 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 ( 2016 ). Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016)."},{"key":"e_1_2_1_39_1","volume-title":"On Schema Matching with Opaque Column Names and Data Values. In SIGMOD Conference","author":"Kang Jaewoo","year":"2003","unstructured":"Jaewoo Kang and Jeffrey F. Naughton . 2003 . On Schema Matching with Opaque Column Names and Data Values. In SIGMOD Conference 2003 . ACM, 205--216. 10.1145\/872757.872783 Jaewoo Kang and Jeffrey F. Naughton. 2003. On Schema Matching with Opaque Column Names and Data Values. In SIGMOD Conference 2003. ACM, 205--216. 10.1145\/872757.872783"},{"key":"e_1_2_1_40_1","volume-title":"Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '03)","author":"Kanza Yaron","year":"2003","unstructured":"Yaron Kanza and Yehoshua Sagiv . 2003 . Computing Full Disjunctions . In Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '03) . ACM, 78--89. 10.1145\/773153.773162 Yaron Kanza and Yehoshua Sagiv. 2003. Computing Full Disjunctions. In Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '03). ACM, 78--89. 10.1145\/773153.773162"},{"key":"e_1_2_1_41_1","volume-title":"SANTOS: Relationship-based Semantic Table Union Search. 
In SIGMOD Conference","author":"Khatiwada Aamod","year":"2023","unstructured":"Aamod Khatiwada , Grace Fan , Roee Shraga , Zixuan Chen , Wolfgang Gatterbauer , Ren\u00e9e J Miller , and Mirek Riedewald . 2023 . SANTOS: Relationship-based Semantic Table Union Search. In SIGMOD Conference 2023. ACM. Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Ren\u00e9e J Miller, and Mirek Riedewald. 2023. SANTOS: Relationship-based Semantic Table Union Search. In SIGMOD Conference 2023. ACM."},{"key":"e_1_2_1_42_1","volume-title":"Miller","author":"Khatiwada Aamod","year":"2022","unstructured":"Aamod Khatiwada , Gatterbauer Wolfgang , Roee Shraga , and Ren\u00e9e J . Miller . 2022 . Technical Report on Integrating Data Lake Tables . https:\/\/github.com\/northeastern-datalab\/alite\/blob\/main\/alite-technical-report.pdf Aamod Khatiwada, Gatterbauer Wolfgang, Roee Shraga, and Ren\u00e9e J. Miller. 2022. Technical Report on Integrating Data Lake Tables. https:\/\/github.com\/northeastern-datalab\/alite\/blob\/main\/alite-technical-report.pdf"},{"key":"e_1_2_1_43_1","first-page":"90","article-title":"Review on determining number of Cluster in K-Means Clustering","volume":"1","author":"Kodinariya Trupti M","year":"2013","unstructured":"Trupti M Kodinariya and Prashant R Makwana . 2013 . Review on determining number of Cluster in K-Means Clustering . International Journal 1 , 6 (2013), 90 -- 95 . Trupti M Kodinariya and Prashant R Makwana. 2013. Review on determining number of Cluster in K-Means Clustering. International Journal 1, 6 (2013), 90--95.","journal-title":"International Journal"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920904"},{"key":"e_1_2_1_45_1","volume-title":"Valentine: Evaluating Matching Techniques for Dataset Discovery. 
In 37th IEEE International Conference on Data Engineering, ICDE 2021","author":"Koutras Christos","year":"2021","unstructured":"Christos Koutras , George Siachamis , Andra Ionescu , Kyriakos Psarakis , Jerry Brons , Marios Fragkoulis , Christoph Lofi , Angela Bonifati , and Asterios Katsifodimos . 2021 . Valentine: Evaluating Matching Techniques for Dataset Discovery. In 37th IEEE International Conference on Data Engineering, ICDE 2021 . IEEE, 468--479. 10.1109\/ICDE51399.2021.00047 Christos Koutras, George Siachamis, Andra Ionescu, Kyriakos Psarakis, Jerry Brons, Marios Fragkoulis, Christoph Lofi, Angela Bonifati, and Asterios Katsifodimos. 2021. Valentine: Evaluating Matching Techniques for Dataset Discovery. In 37th IEEE International Conference on Data Engineering, ICDE 2021. IEEE, 468--479. 10.1109\/ICDE51399.2021.00047"},{"key":"e_1_2_1_46_1","doi-asserted-by":"crossref","first-page":"14","DOI":"10.1145\/984378.984379","article-title":"Generalized joins","volume":"8","author":"Lacroix Michel","year":"1976","unstructured":"Michel Lacroix and Alain Pirotte . 1976 . Generalized joins . ACM Sigmod Record 8 , 3 (1976), 14 -- 15 . Michel Lacroix and Alain Pirotte. 1976. Generalized joins. ACM Sigmod Record 8, 3 (1976), 14--15.","journal-title":"ACM Sigmod Record"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btm563"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137657"},{"key":"e_1_2_1_49_1","volume-title":"DomainNet: Homograph Detection for Data Lake Disambiguation. In EDBT 2021","author":"Leventidis Aristotelis","year":"2021","unstructured":"Aristotelis Leventidis , Laura Di Rocco , Wolfgang Gatterbauer , Ren\u00e9e J. Miller , and Mirek Riedewald . 2021 . DomainNet: Homograph Detection for Data Lake Disambiguation. In EDBT 2021 . OpenProceedings.org, 13--24. 10.5441\/002\/edbt. 2021.03 Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Ren\u00e9e J. Miller, and Mirek Riedewald. 
2021. DomainNet: Homograph Detection for Data Lake Disambiguation. In EDBT 2021. OpenProceedings.org, 13--24. 10.5441\/002\/edbt.2021.03"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1921005"},{"key":"e_1_2_1_51_1","volume-title":"Generic schema matching with cupid. In vldb","author":"Madhavan Jayant","unstructured":"Jayant Madhavan , Philip A Bernstein , and Erhard Rahm . 2001. Generic schema matching with cupid. In vldb , Vol. 1 . Citeseer , 49--58. Jayant Madhavan, Philip A Bernstein, and Erhard Rahm. 2001. Generic schema matching with cupid. In vldb, Vol. 1. Citeseer, 49--58."},{"key":"e_1_2_1_52_1","volume-title":"The theory of relational databases","author":"Maier David","unstructured":"David Maier . 1983. The theory of relational databases . Vol. 11 . Computer science press Rockville . David Maier. 1983. The theory of relational databases. Vol. 11. Computer science press Rockville."},{"key":"e_1_2_1_53_1","volume-title":"Proceedings of the 18th International Conference on Data Engineering, 2002","author":"Melnik Sergey","year":"2002","unstructured":"Sergey Melnik , Hector Garcia-Molina , and Erhard Rahm . 2002 . Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application to Schema Matching . In Proceedings of the 18th International Conference on Data Engineering, 2002 . IEEE Computer Society, 117--128. 10.1109\/ICDE. 2002.994702 Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. 2002. Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application to Schema Matching. In Proceedings of the 18th International Conference on Data Engineering, 2002. IEEE Computer Society, 117--128. 
10.1109\/ICDE.2002.994702"},{"key":"e_1_2_1_54_1","volume-title":"Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018","author":"Mikolov Tom\u00e1s","year":"2018","unstructured":"Tom\u00e1s Mikolov , Edouard Grave , Piotr Bojanowski , Christian Puhrsch , and Armand Joulin . 2018 . Advances in Pre-Training Distributed Word Representations . In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018 . European Language Resources Association (ELRA). http:\/\/www.lrec-conf.org\/proceedings\/lrec 2018\/summaries\/721.html Tom\u00e1s Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018. European Language Resources Association (ELRA). http:\/\/www.lrec-conf.org\/proceedings\/lrec2018\/summaries\/721.html"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.14778\/3229863.3240491"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352116"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.14778\/3192965.3192973"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476311.3476364"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.bdr.2019.07.002"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.5555\/1953048.2078195"},{"key":"e_1_2_1_61_1","volume-title":"4th International Conference, ADVIS 2006, Proceedings (Lecture Notes in Computer Science)","volume":"4243","author":"Pei Jin","year":"2006","unstructured":"Jin Pei , Jun Hong , and David A. Bell . 2006. A Novel Clustering-Based Approach to Schema Matching. In Advances in Information Systems , 4th International Conference, ADVIS 2006, Proceedings (Lecture Notes in Computer Science) , Vol. 4243 . Springer, 60--69. 
10.1007\/1 1890 393_7 Jin Pei, Jun Hong, and David A. Bell. 2006. A Novel Clustering-Based Approach to Schema Matching. In Advances in Information Systems, 4th International Conference, ADVIS 2006, Proceedings (Lecture Notes in Computer Science), Vol. 4243. Springer, 60--69. 10.1007\/11890393_7"},{"key":"e_1_2_1_62_1","unstructured":"py_entitymatching. 2016. https:\/\/github.com\/anhaidgroup\/py_entitymatching  py_entitymatching. 2016. https:\/\/github.com\/anhaidgroup\/py_entitymatching"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1007\/s007780100057"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-63962-8_12-1"},{"key":"e_1_2_1_65_1","volume-title":"Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS '96)","author":"Rajaraman Anand","unstructured":"Anand Rajaraman and Jeffrey D. Ullman . 1996. Integrating Information by Outerjoins and Full Disjunctions (Extended Abstract) . In Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS '96) . ACM, 238--248. 10.1145\/237661.237717 Anand Rajaraman and Jeffrey D. Ullman. 1996. Integrating Information by Outerjoins and Full Disjunctions (Extended Abstract). In Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS '96). ACM, 238--248. 10.1145\/237661.237717"},{"key":"e_1_2_1_66_1","volume-title":"Database management systems (3. ed.)","author":"Ramakrishnan Raghu","unstructured":"Raghu Ramakrishnan and Johannes Gehrke . 2003. Database management systems (3. ed.) . McGraw-Hill . Raghu Ramakrishnan and Johannes Gehrke. 2003. Database management systems (3. ed.). McGraw-Hill."},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1016\/0377-0427(87)90125-7"},{"key":"e_1_2_1_68_1","volume-title":"Sixth International Conference on Computer Vision (IEEE Cat. 
No.98CH36271)","author":"Rubner Y.","year":"1998","unstructured":"Y. Rubner , C. Tomasi , and L.J. Guibas . 1998. A metric for distributions with applications to image databases . In Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271) . 59--66. 10.1109\/ICCV. 1998 .710701 Y. Rubner, C. Tomasi, and L.J. Guibas. 1998. A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271). 59--66. 10.1109\/ICCV.1998.710701"},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.14778\/3397230.3397237"},{"key":"e_1_2_1_70_1","volume-title":"10th International Conference on Extending Database Technology, Proceedings (Lecture Notes in Computer Science)","volume":"3896","author":"Su Weifeng","unstructured":"Weifeng Su , Jiying Wang , and Frederick H. Lochovsky . 2006. Holistic Schema Matching for Web Query Interfaces. In Advances in Database Technology - EDBT 2006 , 10th International Conference on Extending Database Technology, Proceedings (Lecture Notes in Computer Science) , Vol. 3896 . Springer, 77--94. 10.1007\/11687238_8 Weifeng Su, Jiying Wang, and Frederick H. Lochovsky. 2006. Holistic Schema Matching for Web Query Interfaces. In Advances in Database Technology - EDBT 2006, 10th International Conference on Extending Database Technology, Proceedings (Lecture Notes in Computer Science), Vol. 3896. Springer, 77--94. 10.1007\/11687238_8"},{"key":"e_1_2_1_71_1","volume-title":"Annotating Columns with Pre-trained Language Models. In SIGMOD Conference","author":"Suhara Yoshihiko","year":"2022","unstructured":"Yoshihiko Suhara , Jinfeng Li , Yuliang Li , Dan Zhang , \u00c7agatay Demiralp , Chen Chen , and Wang-Chiew Tan . 2022 . Annotating Columns with Pre-trained Language Models. In SIGMOD Conference 2022. ACM, 1493--1503. 10.1145\/3514221.3517906 Yoshihiko Suhara, Jinfeng Li, Yuliang Li, Dan Zhang, \u00c7agatay Demiralp, Chen Chen, and Wang-Chiew Tan. 2022. 
Annotating Columns with Pre-trained Language Models. In SIGMOD Conference 2022. ACM, 1493--1503. 10.1145\/3514221.3517906"},{"key":"e_1_2_1_72_1","unstructured":"TURL. 2020. https:\/\/github.com\/sunlab-osu\/TURL  TURL. 2020. https:\/\/github.com\/sunlab-osu\/TURL"},{"key":"e_1_2_1_73_1","unstructured":"Valentine. 2021. https:\/\/github.com\/delftdata\/valentine  Valentine. 2021. https:\/\/github.com\/delftdata\/valentine"},{"key":"e_1_2_1_74_1","volume-title":"7th International Conference","author":"Yannakakis Mihalis","year":"1981","unstructured":"Mihalis Yannakakis . 1981 . Algorithms for Acyclic Database Schemes. In Very Large Data Bases , 7th International Conference , 1981. IEEE Computer Society, 82--94. Mihalis Yannakakis. 1981. Algorithms for Acyclic Database Schemes. In Very Large Data Bases, 7th International Conference, 1981. IEEE Computer Society, 82--94."},{"key":"e_1_2_1_75_1","volume-title":"12th International Conference on Database Systems for Advanced Applications, DASFAA 2007 (Lecture Notes in Computer Science)","volume":"4443","author":"Zhan Jiang","year":"2007","unstructured":"Jiang Zhan and Shan Wang . 2007 . ITREKS: Keyword Search over Relational Database by Indexing Tuple Relationship. In Advances in Databases: Concepts, Systems and Applications , 12th International Conference on Database Systems for Advanced Applications, DASFAA 2007 (Lecture Notes in Computer Science) , Vol. 4443 . Springer, 67--78. 10.1007\/978-3-540-71703-4_8 Jiang Zhan and Shan Wang. 2007. ITREKS: Keyword Search over Relational Database by Indexing Tuple Relationship. In Advances in Databases: Concepts, Systems and Applications, 12th International Conference on Database Systems for Advanced Applications, DASFAA 2007 (Lecture Notes in Computer Science), Vol. 4443. Springer, 67--78. 
10.1007\/978-3-540-71703-4_8"},{"key":"e_1_2_1_76_1","volume-title":"SIGMOD Conference 2011","author":"Zhang Meihui","year":"2011","unstructured":"Meihui Zhang , Marios Hadjieleftheriou , Beng Chin Ooi , Cecilia M. Procopiuc , and Divesh Srivastava . 2011 . Automatic discovery of attributes in relational databases . In SIGMOD Conference 2011 . ACM, 109--120. 10.1145\/1989323.1989336 Meihui Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M. Procopiuc, and Divesh Srivastava. 2011. Automatic discovery of attributes in relational databases. In SIGMOD Conference 2011. ACM, 109--120. 10.1145\/1989323.1989336"},{"key":"e_1_2_1_77_1","volume-title":"Finding Related Tables in Data Lakes for Interactive Data Science. In SIGMOD Conference 2020","author":"Zhang Yi","year":"2020","unstructured":"Yi Zhang and Zachary G. Ives . 2020 . Finding Related Tables in Data Lakes for Interactive Data Science. In SIGMOD Conference 2020 . ACM, 1951 --1966. 10.1145\/3318464.3389726 Yi Zhang and Zachary G. Ives. 2020. Finding Related Tables in Data Lakes for Interactive Data Science. In SIGMOD Conference 2020. ACM, 1951--1966. 10.1145\/3318464.3389726"},{"key":"e_1_2_1_78_1","volume-title":"JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In SIGMOD Conference","author":"Zhu Erkang","year":"2019","unstructured":"Erkang Zhu , Dong Deng , Fatemeh Nargesian , and Ren\u00e9e J. Miller . 2019 . JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In SIGMOD Conference 2019 . ACM, 847--864. 10.1145\/3299869.3300065 Erkang Zhu, Dong Deng, Fatemeh Nargesian, and Ren\u00e9e J. Miller. 2019. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In SIGMOD Conference 2019. ACM, 847--864. 
10.1145\/3299869.3300065"},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.14778\/3115404.3115409"},{"key":"e_1_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137765.3137788"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3574245.3574274","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,21]],"date-time":"2023-02-21T23:21:15Z","timestamp":1677021675000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3574245.3574274"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12]]},"references-count":80,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2022,12]]}},"alternative-id":["10.14778\/3574245.3574274"],"URL":"https:\/\/doi.org\/10.14778\/3574245.3574274","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2022,12]]},"assertion":[{"value":"2023-02-21","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}