{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,14]],"date-time":"2026-03-14T09:55:06Z","timestamp":1773482106080,"version":"3.50.1"},"reference-count":64,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,3,12]],"date-time":"2024-03-12T00:00:00Z","timestamp":1710201600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2024,3,12]]},"abstract":"<jats:p>Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. Unfortunately, efficiently quantifying the overlap between two tables is not trivial. In particular, detecting their largest overlap, i.e., their largest common subtable, is a computationally challenging problem. As the information overlap may not occur in contiguous portions of the tables, only the ability to permute columns and rows can reveal it.<\/jats:p>\n          <jats:p>The detection of the largest overlap can help us in relevant tasks such as the discovery of multiple coexisting versions of the same table, which can present differences in the completeness and correctness of the conveyed information. Automatically detecting these highly similar, matching tables would allow us to guarantee their consistency through data cleaning or change propagation, but also to eliminate redundancy to free up storage space or to save additional work for the editors.<\/jats:p>\n          <jats:p>We present the first formal definition of this problem, and with it Sloth, our solution to efficiently detect the largest overlap between two tables. We experimentally demonstrate on real-world datasets its efficacy in solving this task, analyzing its performance and showing its impact on multiple use cases.<\/jats:p>","DOI":"10.1145\/3639303","type":"journal-article","created":{"date-parts":[[2024,3,26]],"date-time":"2024-03-26T18:51:32Z","timestamp":1711479092000},"page":"1-26","source":"Crossref","is-referenced-by-count":3,"title":["Determining the Largest Overlap between Tables"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4856-0838","authenticated-orcid":false,"given":"Luca","family":"Zecchini","sequence":"first","affiliation":[{"name":"University of Modena and Reggio Emilia, Modena, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-9517-7707","authenticated-orcid":false,"given":"Tobias","family":"Bleifu\u00df","sequence":"additional","affiliation":[{"name":"Hasso Plattner Institute, University of Potsdam, Potsdam, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3466-509X","authenticated-orcid":false,"given":"Giovanni","family":"Simonini","sequence":"additional","affiliation":[{"name":"University of Modena and Reggio Emilia, Modena, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8087-6587","authenticated-orcid":false,"given":"Sonia","family":"Bergamaschi","sequence":"additional","affiliation":[{"name":"University of Modena and Reggio Emilia, Modena, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4483-1389","authenticated-orcid":false,"given":"Felix","family":"Naumann","sequence":"additional","affiliation":[{"name":"Hasso Plattner Institute, University of Potsdam, Potsdam, Germany"}]}],"member":"320","published-online":{"date-parts":[[2024,3,26]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","unstructured":"Ziawasch Abedjan Lukasz Golab Felix Naumann and Thorsten Papenbrock. 2018. Data Profiling. https:\/\/doi.org\/10.1007\/978--3-031-01865--7","DOI":"10.1007\/978--3-031-01865--7"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/978--3--642--41181--6_18"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415560"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00544"},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.48786\/edbt.2023.36"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/978--3--319--25007--6_25"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","unstructured":"Garrett Birkhoff. 1940. Lattice Theory. https:\/\/doi.org\/10.1090\/coll\/025","DOI":"10.1090\/coll\/025"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.websem.2009.07.002"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.14778\/3282495.3282496"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00115"},{"key":"e_1_2_2_11_1","volume-title":"Proceedings of the Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA Data @ VLDB). 20--26","author":"Bleifu\u00df Tobias","year":"2021","unstructured":"Tobias Bleifu\u00df, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. 2021b. The Secret Life of Wikipedia Tables. In Proceedings of the Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA Data @ VLDB). 20--26. https:\/\/ceur-ws.org\/Vol-2929\/paper4.pdf"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/362686.362692"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE48307.2020.00067"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2015.7113426"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687750"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453916"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389742"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2018.00094"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2213836.2213962"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-45876-X_30"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10844-007-0048-x"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2003.1250899"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.14778\/3115404.3115413"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3183748"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3331184.3331333"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.14778\/3430915.3430921"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00046"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3357384.3357916"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/BDC.2015.30"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.14778\/3529337.3529353"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2014.6816716"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.2307\/2312726"},{"key":"e_1_2_2_33_1","volume-title":"Johnson","author":"Garey Michael R.","year":"1979","unstructured":"Michael R. Garey and David S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness."},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2500887"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588710"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3310205"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1111\/j.1469--8137.1912.tb05611.x"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588929"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.14778\/3574245.3574274"},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.18420\/BTW2023--18"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137657"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2872518.2889386"},{"key":"e_1_2_2_43_1","doi-asserted-by":"crossref","unstructured":"Jure Leskovec Anand Rajaraman and Jeffrey David Ullman. 2020. Mining of Massive Datasets. http:\/\/www.mmds.org","DOI":"10.1017\/9781108684163"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.14778\/2535568.2448943"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1921005"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0306--4379(01)00027--8"},{"key":"e_1_2_2_47_1","unstructured":"Bruce T. Lowerre. 1976. The HARPY Speech Recognition System. Ph.D. Dissertation. Carnegie Mellon University."},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380605"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352116"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.14778\/3192965.3192973"},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.14778\/2752939.2752946"},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1007\/978--3-030--49461--2_34"},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-78646-7_75"},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/2797115.2797118"},{"key":"e_1_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-28730-6_11"},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2307.04217"},{"key":"e_1_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1101\/gr.092452.109"},{"key":"e_1_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3105959"},{"key":"e_1_2_2_59_1","volume-title":"Proceedings of the VLDB Workshops. https:\/\/ceur-ws.org\/Vol-3462\/TADA3.pdf","author":"Vogel Liane","year":"2023","unstructured":"Liane Vogel and Carsten Binnig. 2023. WikiDBs: A Corpus Of Relational Databases From Wikidata. In Proceedings of the VLDB Workshops. https:\/\/ceur-ws.org\/Vol-3462\/TADA3.pdf"},{"key":"e_1_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.745"},{"key":"e_1_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-71703-4_8"},{"key":"e_1_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389726"},{"key":"e_1_2_2_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3300065"},{"key":"e_1_2_2_64_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994534"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3639303","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3639303","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T15:12:59Z","timestamp":1755789179000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3639303"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,12]]},"references-count":64,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,3,12]]}},"alternative-id":["10.1145\/3639303"],"URL":"https:\/\/doi.org\/10.1145\/3639303","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,12]]}}}