{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,19]],"date-time":"2026-03-19T16:49:49Z","timestamp":1773938989919,"version":"3.50.1"},"reference-count":50,"publisher":"Association for Computing Machinery (ACM)","issue":"4","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,12]]},"abstract":"<jats:p>Inclusion dependencies (INDs) are widely used in data management tasks. The discovery techniques of INDs have thus received a lot of attention, for discovering INDs valid in data. However, real-world data quality issues may lead to partial violations of INDs. This paper makes the first effort to provide a comprehensive study on the discovery of approximate INDs (AINDs), aiming to identify INDs with error rates below a given threshold. This paper introduces a new definition of AIND based on deletion semantics, in addition to the existing definition based on insertion semantics. A discovery method is developed that can be configured to identify AINDs based on either of these semantics. The method combines partitioning techniques to handle tables that cannot all fit into memory simultaneously, with novel approaches to quantify AIND violations based on partitioned tables. To improve efficiency, the method employs a novel three-layer filtering structure and techniques that can potentially prune invalid candidate AINDs and identify valid AINDs without necessarily processing all tuples. We conduct an extensive experimental evaluation and verify the following: the proposed method significantly outperforms existing methods for AIND discovery based on insertion semantics, the AIND discoveries with insertion and deletion semantics can provide complementary results, and our discovery method can effectively deal with dirty dataset containing various types of errors.<\/jats:p>","DOI":"10.14778\/3717755.3717777","type":"journal-article","created":{"date-parts":[[2025,5,20]],"date-time":"2025-05-20T15:51:49Z","timestamp":1747756309000},"page":"1210-1222","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Discovering Approximate Inclusion Dependencies"],"prefix":"10.14778","volume":"18","author":[{"given":"Qingdong","family":"Su","sequence":"first","affiliation":[{"name":"School of Computer Science, Fudan University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhikang","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Computer Science, Fudan University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zijing","family":"Tan","sequence":"additional","affiliation":[{"name":"School of Computer Science, Fudan University, China and Shanghai Key Laboratory of Data Science"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shuai","family":"Ma","sequence":"additional","affiliation":[{"name":"SKLSDE Lab, Beihang University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,5,20]]},"reference":[{"key":"e_1_2_1_1_1","first-page":"1747","article-title":"Data Profiling","volume":"2017","author":"Abedjan Ziawasch","year":"2017","unstructured":"Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2017. Data Profiling: A Tutorial. In SIGMOD 2017. 1747\u20131751.","journal-title":"A Tutorial. In SIGMOD"},{"key":"e_1_2_1_2_1","volume-title":"Data Profiling","author":"Abedjan Ziawasch","unstructured":"Ziawasch Abedjan, Lukasz Golab, Felix Naumann, and Thorsten Papenbrock. 2018. Data Profiling. Morgan & Claypool Publishers."},{"key":"e_1_2_1_3_1","volume-title":"Foundations of Databases","author":"Abiteboul Serge","unstructured":"Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases. Addison-Wesley."},{"key":"e_1_2_1_4_1","doi-asserted-by":"crossref","unstructured":"Jana Bauckmann Ziawasch Abedjan Ulf Leser Heiko M\u00fcller and Felix Naumann. 2012. Discovering conditional inclusion dependencies. In CIKM. 2094\u20132098.","DOI":"10.1145\/2396761.2398580"},{"key":"e_1_2_1_6_1","doi-asserted-by":"crossref","unstructured":"Jana Bauckmann Ulf Leser Felix Naumann and Veronique Tietz. 2007. Efficiently Detecting Inclusion Dependencies. In ICDE. 1448\u20131450.","DOI":"10.1109\/ICDE.2007.369032"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.tcs.2021.11.020"},{"key":"e_1_2_1_8_1","unstructured":"Leon Bornemann Tobias Bleifu\u00df Dmitri V. Kalashnikov Fatemeh Nargesian Felix Naumann and Divesh Srivastava. 2024. Efficient Discovery of Temporal Inclusion Dependencies in Wikipedia Tables. In EDBT. 399\u2013411."},{"key":"e_1_2_1_9_1","unstructured":"Loreto Bravo Wenfei Fan and Shuai Ma. 2007. Extending Dependencies with Conditions. In VLDB. 243\u2013254."},{"key":"e_1_2_1_10_1","doi-asserted-by":"crossref","unstructured":"Yuyang Dong Kunihiro Takeoka Chuan Xiao and Masafumi Oyamada. 2021. Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach. In ICDE. 456\u2013467.","DOI":"10.1109\/ICDE51399.2021.00046"},{"key":"e_1_2_1_11_1","doi-asserted-by":"crossref","unstructured":"Falco D\u00fcrsch Axel Stebner Fabian Windheuser Maxi Fischer Tim Friedrich Nils Strelow Tobias Bleifu\u00df Hazar Harmouch Lan Jiang Thorsten Papenbrock and Felix Naumann. 2019. Inclusion Dependency Discovery: An Experimental Evaluation of Thirteen Algorithms. In CIKM. 219\u2013228.","DOI":"10.1145\/3357384.3357916"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/3529337.3529353"},{"key":"e_1_2_1_13_1","volume-title":"Miller","author":"Fan Grace","year":"2023","unstructured":"Grace Fan, Jin Wang, Yuliang Li, and Ren\u00e9e J. Miller. 2023. Table Discovery in Data Lakes: State-of-the-art and Future Directions. In SIGMOD. 69\u201375."},{"key":"e_1_2_1_14_1","volume-title":"Foundations of Data Quality Management","author":"Fan Wenfei","unstructured":"Wenfei Fan and Floris Geerts. 2012. Foundations of Data Quality Management. Morgan & Claypool Publishers."},{"key":"e_1_2_1_15_1","volume-title":"Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. Discrete mathematics & theoretical computer science","author":"Flajolet Philippe","year":"2007","unstructured":"Philippe Flajolet, \u00c9ric Fusy, Olivier Gandouet, and Fr\u00e9d\u00e9ric Meunier. 2007. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. Discrete mathematics & theoretical computer science (2007)."},{"key":"e_1_2_1_16_1","doi-asserted-by":"crossref","unstructured":"Jarek Gryz. 1998. Query Folding with Inclusion Dependencies. In ICDE. 126\u2013133.","DOI":"10.1109\/ICDE.1998.655768"},{"key":"e_1_2_1_17_1","doi-asserted-by":"crossref","unstructured":"Laura M. Haas Mauricio A. Hern\u00e1ndez Howard Ho Lucian Popa and Mary Roth. 2005. Clio grows up: from research prototype to industrial tool. In SIGMOD. 805\u2013810.","DOI":"10.1145\/1066157.1066252"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536336.2536345"},{"key":"e_1_2_1_19_1","volume-title":"Ilyas and Xu Chu","author":"Ihab","year":"2019","unstructured":"Ihab F. Ilyas and Xu Chu. 2019. Data Cleaning. ACM."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2021.3130227"},{"key":"e_1_2_1_21_1","doi-asserted-by":"crossref","unstructured":"Yifeng Jin Zijing Tan Weijun Zeng and Shuai Ma. 2021. Approximate Order Dependency Discovery. In ICDE. 25\u201336.","DOI":"10.1109\/ICDE51399.2021.00010"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588929"},{"key":"e_1_2_1_23_1","unstructured":"Reza Karegar Parke Godfrey Lukasz Golab Mehdi Kargar Divesh Srivastava and Jaroslaw Szlichta. 2021. Efficient Discovery of Approximate Order Dependencies. In EDBT. 427\u2013432."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.14778\/3574245.3574274"},{"key":"e_1_2_1_25_1","doi-asserted-by":"crossref","unstructured":"A. Koeller and E.A. Rundensteiner. 2003. Discovery of high-dimensional inclusion dependencies. In ICDE. 683\u2013685.","DOI":"10.1109\/ICDE.2003.1260834"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-021-00676-3"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.14778\/3192965.3192968"},{"key":"e_1_2_1_28_1","unstructured":"Sebastian Kruse Thorsten Papenbrock Christian Dullweber Moritz Finke Manuel Hegner Martin Zabel Christian Z\u00f6llner and Felix Naumann. 2017. Fast Approximate Discovery of Inclusion Dependencies. In BTW. 207\u2013226."},{"key":"e_1_2_1_29_1","volume-title":"BTW","author":"Kruse Sebastian","unstructured":"Sebastian Kruse, Thorsten Papenbrock, and Felix Naumann. 2015. Scaling Out the Discovery of Inclusion Dependencies. In BTW, Vol. P-241. 445\u2013454."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/69.842267"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.14778\/3401960.3401966"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.tcs.2013.11.002"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10844-007-0048-x"},{"key":"e_1_2_1_34_1","doi-asserted-by":"crossref","unstructured":"Fabien De Marchi and Jean-Marc Petit. 2003. Zigzag: a new algorithm for mining large inclusion dependencies in database. In ICDM. 27\u201334.","DOI":"10.1109\/ICDM.2003.1250899"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/373626.373713"},{"key":"e_1_2_1_36_1","doi-asserted-by":"crossref","unstructured":"Shaabani Nuhad and Christoph Meinel. 2016. Detecting Maximum Inclusion Dependencies without Candidate Generation. In Database and Expert Systems Applications. 118\u2013133.","DOI":"10.1007\/978-3-319-44406-2_10"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824086"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.14778\/2752939.2752946"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.14778\/3368289.3368293"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-023-00788-y"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/3342263.3342638"},{"key":"e_1_2_1_42_1","doi-asserted-by":"crossref","unstructured":"Nuhad Shaabani and Christoph Meinel. 2015. Scalable Inclusion Dependency Discovery. In DASFAA. 425\u2013440.","DOI":"10.1007\/978-3-319-18120-2_25"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3085504.3085506"},{"key":"e_1_2_1_44_1","doi-asserted-by":"crossref","unstructured":"Nuhad Shaabani and Christoph Meinel. 2018. Improving the Efficiency of Inclusion Dependency Detection. In CIKM. 207\u2013216.","DOI":"10.1145\/3269206.3271724"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10619-018-7233-5"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-018-0510-0"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3105959"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/2362394.2362395"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.14778\/3565816.3565828"},{"key":"e_1_2_1_50_1","volume-title":"Ives","author":"Zhang Yi","year":"2020","unstructured":"Yi Zhang and Zachary G. Ives. 2020. Finding Related Tables in Data Lakes for Interactive Data Science. In SIGMOD. 1951\u20131966."},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994534"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3717755.3717777","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,20]],"date-time":"2025-05-20T16:17:34Z","timestamp":1747757854000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3717755.3717777"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12]]},"references-count":50,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,12]]}},"alternative-id":["10.14778\/3717755.3717777"],"URL":"https:\/\/doi.org\/10.14778\/3717755.3717777","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,12]]},"assertion":[{"value":"2025-05-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}