{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T19:50:39Z","timestamp":1774986639453,"version":"3.50.1"},"reference-count":57,"publisher":"Association for Computing Machinery (ACM)","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2025,6,17]]},"abstract":"<jats:p>Discovering duplicate or high-overlapping tables in table collections is a crucial task for eliminating redundant information, detecting inconsistencies in the evolution of a table across its multiple versions produced over time, and identifying related tables. Candidate duplicate or related tables to support this task can be identified via the estimation of the largest table overlap. Unfortunately, current solutions for finding it present serious scalability issues for heavy workloads: Sloth, the state-of-the-art framework for its estimation, requires more than three days of machine time for computing 100k table overlaps.<\/jats:p>\n                  <jats:p>\n                    In this paper, we introduce ARMADILLO, an approach based on graph neural networks that learns table embeddings whose cosine similarity approximates the\n                    <jats:italic toggle=\"yes\">overlap ratio<\/jats:italic>\n                    between tables, i.e., the ratio between the area of their largest table overlap and the area of the smaller table in the pair. We also introduce two new annotated datasets based on GitTables and a Wikipedia table corpus containing 1.32 million table pairs overall labeled with their overlap. 
Evaluating the performance of ARMADILLO on these datasets, we observed that it is able to calculate overlaps between pairs of tables several times faster than the state-of-the-art method while maintaining a good quality in approximating the exact result.\n                  <\/jats:p>","DOI":"10.1145\/3725365","type":"journal-article","created":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T21:23:29Z","timestamp":1750281809000},"page":"1-25","source":"Crossref","is-referenced-by-count":1,"title":["Table Overlap Estimation through Graph Embeddings"],"prefix":"10.1145","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-0300-6686","authenticated-orcid":false,"given":"Francesco","family":"Pugnaloni","sequence":"first","affiliation":[{"name":"Hasso Plattner Institute, University of Potsdam, Potsdam, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4856-0838","authenticated-orcid":false,"given":"Luca","family":"Zecchini","sequence":"additional","affiliation":[{"name":"University of Modena and Reggio Emilia, Modena, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8119-895X","authenticated-orcid":false,"given":"Matteo","family":"Paganelli","sequence":"additional","affiliation":[{"name":"University of Modena and Reggio Emilia, Modena, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7922-5998","authenticated-orcid":false,"given":"Matteo","family":"Lissandrini","sequence":"additional","affiliation":[{"name":"University of Verona, Verona, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4483-1389","authenticated-orcid":false,"given":"Felix","family":"Naumann","sequence":"additional","affiliation":[{"name":"Hasso Plattner Institute, University of Potsdam, Potsdam, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3466-509X","authenticated-orcid":false,"given":"Giovanni","family":"Simonini","sequence":"additional","affiliation":[{"name":"University of Modena and Reggio Emilia, Modena, 
Italy"}]}],"member":"320","published-online":{"date-parts":[[2025,6,18]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00544"},{"key":"e_1_2_2_2_1","first-page":"425","article-title":"TabEL: Entity Linking in Web Tables. In ISWC (1), (Lecture Notes in Computer Science, Vol. 9366)","author":"Bhagavatula Chandra Sekhar","year":"2015","unstructured":"Chandra Sekhar Bhagavatula, Thanapon Noraset, and Doug Downey. 2015. TabEL: Entity Linking in Web Tables. In ISWC (1), (Lecture Notes in Computer Science, Vol. 9366). Springer, 425-441.","journal-title":"Springer"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.14778\/3282495.3282496"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00115"},{"key":"e_1_2_2_5_1","volume-title":"Proceedings of the Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA Data @ VLDB), (CEUR Workshop Proceedings, 2929)","author":"Bleifu\u00df Tobias","year":"2021","unstructured":"Tobias Bleifu\u00df, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. 2021b. The Secret Life of Wikipedia Tables. In Proceedings of the Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA Data @ VLDB), (CEUR Workshop Proceedings, 2929). 20-26. https:\/\/ceur-ws.org\/Vol-2929\/paper4.pdf"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE48307.2020.00067"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00051"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389742"},{"key":"e_1_2_2_9_1","volume-title":"Proceedings of the International Conference on Information and Knowledge Management (CIKM). ACM, 2945-2949","author":"Chen Zhiyu","unstructured":"Zhiyu Chen, Mohamed Trabelsi, Jeff Heflin, Dawei Yin, and Brian D. Davison. 2021. MGNETS: Multi-Graph Neural Networks for Table Search. 
In Proceedings of the International Conference on Information and Knowledge Management (CIKM). ACM, 2945-2949."},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3542700.3542709"},{"key":"e_1_2_2_11_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1)","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1),. Association for Computational Linguistics, 4171-4186."},{"key":"e_1_2_2_12_1","first-page":"2458","article-title":"DeepJoin","volume":"16","author":"Dong Yuyang","year":"2023","unstructured":"Yuyang Dong, Chuan Xiao, Takuma Nozawa, Masafumi Enomoto, and Masafumi Oyamada. 2023. DeepJoin: Joinable Table Discovery with Pre-trained Language Models. PVLDB, Vol. 16, 10 (2023), 2458-2470.","journal-title":"Joinable Table Discovery with Pre-trained Language Models. PVLDB"},{"key":"e_1_2_2_13_1","volume-title":"Cohen","author":"Eisenschlos Julian Martin","year":"2021","unstructured":"Julian Martin Eisenschlos, Maharshi Gor, Thomas M\u00fcller, and William W. Cohen. 2021. MATE: Multi-view Attention for Table Transformer Efficiency. In EMNLP (1),. Association for Computational Linguistics, 7606-7619."},{"key":"e_1_2_2_14_1","first-page":"1684","article-title":"MATE","volume":"15","author":"Esmailoghli Mahdi","year":"2022","unstructured":"Mahdi Esmailoghli, Jorge-Arnulfo Quian\u00e9-Ruiz, and Ziawasch Abedjan. 2022. MATE: Multi-Attribute Table Extraction. PVLDB, Vol. 15, 8 (2022), 1684-1696.","journal-title":"Multi-Attribute Table Extraction. PVLDB"},{"key":"e_1_2_2_15_1","volume-title":"Finding Support for Tabular LLM Outputs. In VLDB Workshops. VLDB.org.","author":"Fan Grace","unstructured":"Grace Fan, Roee Shraga, and Ren\u00e9e J. Miller. 2024a. Finding Support for Tabular LLM Outputs. In VLDB Workshops. 
VLDB.org."},{"key":"e_1_2_2_16_1","volume-title":"Proceedings of the International Conference on Data Engineering (ICDE). IEEE, 3532-3545","author":"Fan Grace","unstructured":"Grace Fan, Roee Shraga, and Ren\u00e9e J. Miller. 2024b. Gen-T: Table Reclamation in Data Lakes. In Proceedings of the International Conference on Data Engineering (ICDE). IEEE, 3532-3545."},{"key":"e_1_2_2_17_1","volume-title":"Table Discovery in Data Lakes: State-of-the-art and Future Directions. In SIGMOD Conference Companion. ACM, 69-75","author":"Fan Grace","unstructured":"Grace Fan, Jin Wang, Yuliang Li, and Ren\u00e9e J. Miller. 2023a. Table Discovery in Data Lakes: State-of-the-art and Future Directions. In SIGMOD Conference Companion. ACM, 69-75."},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.14778\/3587136.3587146"},{"key":"e_1_2_2_19_1","first-page":"241","article-title":"Completeness and Ambiguity of Schema Cover. In O\u2122 Conferences, (Lecture Notes in Computer Science, Vol. 8185)","author":"Gal Avigdor","year":"2013","unstructured":"Avigdor Gal, Michael Katz, Tomer Sagi, Matthias Weidlich, Karl Aberer, Nguyen Quoc Viet Hung, Zolt\u00e1n Mikl\u00f3s, Eliezer Levy, and Victor Shafran. 2013. Completeness and Ambiguity of Schema Cover. In O\u2122 Conferences, (Lecture Notes in Computer Science, Vol. 8185). Springer, 241-258.","journal-title":"Springer"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2021.3134806"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939754"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3340531.3412056"},{"key":"e_1_2_2_23_1","volume-title":"Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017","author":"Hamilton William L.","year":"2017","unstructured":"William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017. 
Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4--9, 2017, Long Beach, CA, USA. 1024-1034. https:\/\/proceedings.neurips.cc\/paper\/2017\/hash\/5dd9db5e033da9c6fb5ba83c7a7ebea9-Abstract.html"},{"key":"e_1_2_2_24_1","volume-title":"Thomas M\u00fcller, Francesco Piccinno, and Julian Martin Eisenschlos.","author":"Herzig Jonathan","year":"2020","unstructured":"Jonathan Herzig, Pawel Krzysztof Nowak, Thomas M\u00fcller, Francesco Piccinno, and Julian Martin Eisenschlos. 2020. TaPas: Weakly Supervised Table Parsing via Pre-training. In ACL,. Association for Computational Linguistics, 4320-4333."},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588710"},{"key":"e_1_2_2_26_1","volume-title":"TABBIE: Pretrained Representations of Tabular Data. In NAACL-HLT","author":"Iida Hiroshi","year":"2021","unstructured":"Hiroshi Iida, Dung Thai, Varun Manjunatha, and Mohit Iyyer. 2021. TABBIE: Pretrained Representations of Tabular Data. In NAACL-HLT,. Association for Computational Linguistics, 3446-3456."},{"key":"e_1_2_2_27_1","volume-title":"New Phytologist, (Cargse, France.)","unstructured":"Jaccard. 1912. The distribution of the flora of the alpine zone. In New Phytologist, (Cargse, France.), Vol. 11. 37-50."},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/582415.582418"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588689"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.14778\/3574245.3574274"},{"key":"e_1_2_2_31_1","volume-title":"Kipf and Max Welling","author":"Thomas","year":"2017","unstructured":"Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR (Poster). 
OpenReview.net."},{"key":"e_1_2_2_32_1","first-page":"367","volume-title":"Proceedings of the Conference Datenbanksysteme in Business, Technologie und Web Technik (BTW), (LNI","volume":"331","author":"Koch Maximilian","year":"2023","unstructured":"Maximilian Koch, Mahdi Esmailoghli, S\u00f6ren Auer, and Ziawasch Abedjan. 2023. Duplicate Table Discovery with Xash. In Proceedings of the Conference Datenbanksysteme in Business, Technologie und Web Technik (BTW), (LNI, Vol. P-331, ). Gesellschaft f\u00fcr Informatik e.V., 367-390."},{"key":"e_1_2_2_33_1","doi-asserted-by":"crossref","unstructured":"Jure Leskovec Anand Rajaraman and Jeffrey David Ullman. 2020. Mining of Massive Datasets . http:\/\/www.mmds.org","DOI":"10.1017\/9781108684163"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6330"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.14778\/2535568.2448943"},{"key":"e_1_2_2_36_1","volume-title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, Vol. abs\/1907.11692 (2019)."},{"key":"e_1_2_2_37_1","volume-title":"Knoblock","author":"Michelson Matthew","year":"2006","unstructured":"Matthew Michelson and Craig A. Knoblock. 2006. Learning Blocking Schemes for Record Linkage. In AAAI,. AAAI Press, 440-445."},{"key":"e_1_2_2_38_1","unstructured":"Tom\u00e1s Mikolov Kai Chen Greg Corrado and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. 
In ICLR (Workshop Poster)."},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352116"},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.14778\/3192965.3192973"},{"key":"e_1_2_2_41_1","volume-title":"A Survey on Oversmoothing in Graph Neural Networks. CoRR","author":"Rusch T. Konstantin","year":"2023","unstructured":"T. Konstantin Rusch, Michael M. Bronstein, and Siddhartha Mishra. 2023. A Survey on Oversmoothing in Graph Neural Networks. CoRR, Vol. abs\/2303.10993 (2023)."},{"key":"e_1_2_2_42_1","volume-title":"Proceedings of the International Conference on Data Engineering (ICDE). IEEE Computer Society, 285-296","author":"Saha Barna","unstructured":"Barna Saha, Ioana Stanoi, and Kenneth L. Clarkson. 2010. Schema covering: a step towards enabling reuse in information integration. In Proceedings of the International Conference on Data Engineering (ICDE). IEEE Computer Society, 285-296."},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2213836.2213962"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517906"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.14778\/3494124.3494149"},{"key":"e_1_2_2_46_1","first-page":"442","article-title":"StruBERT: Structure-aware BERT for Table Search and Matching. In WWW","author":"Trabelsi Mohamed","year":"2022","unstructured":"Mohamed Trabelsi, Zhiyu Chen, Shuo Zhang, Brian D. Davison, and Jeff Heflin. 2022. StruBERT: Structure-aware BERT for Table Search and Matching. In WWW,. ACM, 442-451.","journal-title":"ACM"},{"key":"e_1_2_2_47_1","unstructured":"Petar Velickovic Guillem Cucurull Arantxa Casanova Adriana Romero Pietro Li\u00f2 and Yoshua Bengio. 2018. Graph Attention Networks. In ICLR (Poster). OpenReview.net."},{"key":"e_1_2_2_48_1","volume-title":"Proceedings of the International Conference on Information Retrieval (SIGIR). 
ACM, 1472-1482","author":"Wang Fei","unstructured":"Fei Wang, Kexuan Sun, Muhao Chen, Jay Pujara, and Pedro A. Szekely. 2021. Retrieving Complex Tables with Multi-Granular Graph Representation Learning. In Proceedings of the International Conference on Information Retrieval (SIGIR). ACM, 1472-1482."},{"key":"e_1_2_2_49_1","volume-title":"Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR","author":"Wu Yonghui","year":"2016","unstructured":"Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR, Vol. abs\/1609.08144 (2016)."},{"key":"e_1_2_2_50_1","volume-title":"7th International Conference on Learning Representations, ICLR 2019","author":"Xu Keyulu","year":"2019","unstructured":"Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019a. How Powerful are Graph Neural Networks?. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6--9, 2019. OpenReview.net. https:\/\/openreview.net\/forum?id=ryGs6iA5Km"},{"key":"e_1_2_2_51_1","unstructured":"Keyulu Xu Weihua Hu Jure Leskovec and Stefanie Jegelka. 2019b. How Powerful are Graph Neural Networks?. In ICLR . OpenReview.net."},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.745"},{"key":"e_1_2_2_53_1","first-page":"11960","article-title":"Graph Transformer Networks","author":"Yun Seongjun","year":"2019","unstructured":"Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J. Kim. 
2019. Graph Transformer Networks. In NeurIPS. 11960-11970.","journal-title":"NeurIPS."},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3639303"},{"key":"e_1_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/3331184.3331333"},{"key":"e_1_2_2_56_1","volume-title":"Proceedings of the International Conference on Management of Data (SIGMOD). ACM","author":"Zhang Yi","unstructured":"Yi Zhang and Zachary G. Ives. 2020. Finding Related Tables in Data Lakes for Interactive Data Science. In Proceedings of the International Conference on Management of Data (SIGMOD). ACM, 1951-1966."},{"key":"e_1_2_2_57_1","volume-title":"Proceedings of the International Conference on Management of Data (SIGMOD). ACM, 847-864","author":"Zhu Erkang","unstructured":"Erkang Zhu, Dong Deng, Fatemeh Nargesian, and Ren\u00e9e J. Miller. 2019. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In Proceedings of the International Conference on Management of Data (SIGMOD). ACM, 847-864."}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3725365","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T18:59:04Z","timestamp":1774983544000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3725365"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,17]]},"references-count":57,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,6,17]]}},"alternative-id":["10.1145\/3725365"],"URL":"https:\/\/doi.org\/10.1145\/3725365","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,17]]}}}