{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,17]],"date-time":"2025-10-17T14:07:24Z","timestamp":1760710044435,"version":"3.37.3"},"reference-count":29,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2019,11,9]],"date-time":"2019-11-09T00:00:00Z","timestamp":1573257600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2019,11,9]],"date-time":"2019-11-09T00:00:00Z","timestamp":1573257600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Data Sci. Eng."],"published-print":{"date-parts":[[2019,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n<jats:p>This paper revisits set containment join (SCJ) problem, which uses the subset relationship (i.e., <jats:inline-formula><jats:alternatives><jats:tex-math>$$\\subseteq$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mml:mo>\u2286<\/mml:mo><\/mml:math><\/jats:alternatives><\/jats:inline-formula>) as condition to join set-valued attributes of two relations and has many fundamental applications in commercial and scientific fields. Existing in-memory algorithms for SCJ are either signature-based or prefix-tree-based. The former incurs high CPU cost because of the enumeration of signatures, while the latter incurs high space cost because of the storage of prefix trees. This paper proposes a new adaptive parameter-free in-memory algorithm, named as <jats:bold>fre<\/jats:bold>quency-ha<jats:bold>sh<\/jats:bold><jats:bold>join<\/jats:bold> or <jats:inline-formula><jats:alternatives><jats:tex-math>$${\\mathsf {FreshJoin}}$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mml:mi>FreshJoin<\/mml:mi><\/mml:math><\/jats:alternatives><\/jats:inline-formula> in short, to evaluate SCJ efficiently. <jats:inline-formula><jats:alternatives><jats:tex-math>$${\\mathsf {FreshJoin}}$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mml:mi>FreshJoin<\/mml:mi><\/mml:math><\/jats:alternatives><\/jats:inline-formula> builds a flat index on-the-fly to record three kinds of signatures (i.e., two least frequent elements and a hash signature whose length is determined adaptively by the frequencies of elements in the universe set). The index consists of two sparse inverted indices and two arrays which record hash signatures of all sets in each relation. The index is well organized such that <jats:inline-formula><jats:alternatives><jats:tex-math>$${\\mathsf {FreshJoin}}$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mml:mi>FreshJoin<\/mml:mi><\/mml:math><\/jats:alternatives><\/jats:inline-formula> can avoid enumerating hash signatures. The rationality of this design is explained. And, the time and space cost of the proposed algorithm, which provide a rule to choose <jats:inline-formula><jats:alternatives><jats:tex-math>$${\\mathsf {FreshJoin}}$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mml:mi>FreshJoin<\/mml:mi><\/mml:math><\/jats:alternatives><\/jats:inline-formula> from existing algorithms, are analyzed. Experiments on 16 real-life datasets show that <jats:inline-formula><jats:alternatives><jats:tex-math>$${\\mathsf {FreshJoin}}$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mml:mi>FreshJoin<\/mml:mi><\/mml:math><\/jats:alternatives><\/jats:inline-formula> usually reduces more than 50% of space cost while remains as competitive as the state-of-the-art algorithms in running time.<\/jats:p>","DOI":"10.1007\/s41019-019-00107-y","type":"journal-article","created":{"date-parts":[[2019,11,9]],"date-time":"2019-11-09T02:03:03Z","timestamp":1573264983000},"page":"293-308","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":9,"title":["FreshJoin: An Efficient and Adaptive Algorithm for Set Containment Join"],"prefix":"10.1007","volume":"4","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3302-3917","authenticated-orcid":false,"given":"Jizhou","family":"Luo","sequence":"first","affiliation":[]},{"given":"Wei","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Shengfei","family":"Shi","sequence":"additional","affiliation":[]},{"given":"Hong","family":"Gao","sequence":"additional","affiliation":[]},{"given":"Jianzhong","family":"Li","sequence":"additional","affiliation":[]},{"given":"Wei","family":"Wu","sequence":"additional","affiliation":[]},{"given":"Shouxu","family":"Jiang","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2019,11,9]]},"reference":[{"key":"107_CR1","doi-asserted-by":"crossref","unstructured":"Deng D, Yang C, Shang S, Zhu F, Liu L, Shao L (2019) LCJoin: set containment join via list crosscutting. In: Proceedings of ICDE, pp 362\u2013373","DOI":"10.1109\/ICDE.2019.00040"},{"key":"107_CR2","doi-asserted-by":"crossref","unstructured":"Yang J, Zhang W, Yang S, Zhang Y, Lin X (2017) TT-Join: efficient set containment join. In: Proceedings of ICDE, pp 509\u2013520","DOI":"10.1109\/ICDE.2017.107"},{"key":"107_CR3","doi-asserted-by":"crossref","unstructured":"Kunkel A, Rheinl\u00e4nder A, Schiefer C, Helmer S, Bouros P, Leser U (2016) Piejoin: towards parallel set containment joins. In: Baumann P, Manolescu-Goujot I, Trani L (eds) SSDBM, pp 11\u201322","DOI":"10.1145\/2949689.2949694"},{"key":"107_CR4","doi-asserted-by":"crossref","unstructured":"Luo Y, Fletcher G, Hidders J, De Bra P (2015) Efficient and scalable trie-based algorithms for computing set containment relations. In: Gehrke J, Lehner W, Shim K et al (eds) ICDE, pp 303\u2013314","DOI":"10.1109\/ICDE.2015.7113293"},{"key":"107_CR5","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.is.2014.10.004","volume":"49","author":"P Bouros","year":"2015","unstructured":"Bouros P, Mamoulis N, Ge S, Terrovitis M (2015) Set containment join revisited. Knowl Inf Syst 49:1\u201328","journal-title":"Knowl Inf Syst"},{"key":"107_CR6","doi-asserted-by":"publisher","first-page":"761","DOI":"10.1007\/11408079_69","volume-title":"Database Systems for Advanced Applications","author":"Ravindranath Jampani","year":"2005","unstructured":"Jampani R, Pudi V (2005) Using prefix-trees for efficiently computing set joins. In: Zhou L, Ooi B, Meng X (eds) DASFAA, LNCS, vol 3453, pp 761\u2013772"},{"key":"107_CR7","doi-asserted-by":"crossref","unstructured":"Mamoulis N (2003) Efficient processing of joins on set-valued attributes. In: Halevy A, Ives Z, Doan A (eds) SIGMOD , pp 157\u2013168","DOI":"10.1145\/872757.872778"},{"issue":"1","key":"107_CR8","doi-asserted-by":"publisher","first-page":"56","DOI":"10.1145\/762471.762474","volume":"28","author":"S Melnik","year":"2003","unstructured":"Melnik S, Molina H (2003) Adaptive algorithms for set containment joins. ACM Trans Database Syst 28(1):56\u201399","journal-title":"ACM Trans Database Syst"},{"key":"107_CR9","doi-asserted-by":"publisher","first-page":"427","DOI":"10.1007\/3-540-45876-X_28","volume-title":"Advances in Database Technology \u2014 EDBT 2002","author":"Sergey Melnik","year":"2002","unstructured":"Melnik S, Garcia-Molina H (2002) Divide-and-conquer algorithm for computing set containment joins. In: Jensen C, Jerrery K, Pokorny J, et al (eds) EDBT, LNCS, vol 2287, pp 427\u2013444"},{"key":"107_CR10","unstructured":"Ramasamy K, Patel J, Naughton J, Kaushik R (2000) Set containment joins: the good, the bad and the ugly. In: Abbadi A, Brodie M, Chakravarthy S, et al (eds) VLDB, pp 386\u2013395"},{"key":"107_CR11","unstructured":"Helmer S, Moerkotte G (1997) Evaluation of main memory join algorithms for joins with set comparison predicates. In: Jarke M, Carey J, Dittrich R, et al (eds) VLDB, pp 386\u2013395"},{"issue":"12","key":"107_CR12","doi-asserted-by":"publisher","first-page":"1185","DOI":"10.14778\/2994509.2994534","volume":"9","author":"E Zhu","year":"2016","unstructured":"Zhu E, Nargesian F, Pu K, Miller R (2016) Lsh ensemble: internet scale domain search. Proc VLDB Endow 9(12):1185\u20131196","journal-title":"Proc VLDB Endow"},{"issue":"9","key":"107_CR13","doi-asserted-by":"publisher","first-page":"636","DOI":"10.14778\/2947618.2947620","volume":"9","author":"W Mann","year":"2016","unstructured":"Mann W, Augsten N, Bouros P (2016) An empirical evaluation of set similarity join techniques. Proc VLDB Endow 9(9):636\u2013647","journal-title":"Proc VLDB Endow"},{"key":"107_CR14","doi-asserted-by":"crossref","unstructured":"Luo J, Zhang W, Shi S, Gao H, Li J, Zhang T, Zhou Z (2019) FreshJoin: An Efficient and Adaptive Algorithm for Set Containment Join. In: Shao J et al (eds) Proceedings of APWeb-WAIM 2019: web and big data. LNCS, vol 11642. Springer, pp 175\u2013190","DOI":"10.1007\/978-3-030-26075-0_14"},{"key":"107_CR15","doi-asserted-by":"crossref","unstructured":"Hmedeh Z, Kourdounakis H, Christophides V, Mouza C, Scholl M, Travers N (2012) Subscription indexes for web syndication systems. In: Rundensteiner E Markl, V, Manolescu I, et al (eds) EDBT, pp 312\u2013323","DOI":"10.1145\/2247596.2247634"},{"key":"107_CR16","doi-asserted-by":"crossref","unstructured":"Terrovitis M, Bouros P, Vassiliadis P, Sellis T, Mamoulis T (2011) Efficient answering of set containment queries for skewed item distributions. In: Ailamaki A, Amer-Yahia S, Patel J et al (eds) EDBT, pp 225\u2013236","DOI":"10.1145\/1951365.1951394"},{"key":"107_CR17","doi-asserted-by":"crossref","unstructured":"Agrawal P, Arasu A, Kaushik R (2010) On indexing error-tolerant set containment. In: Elmagarmid A, Agrawal D (eds) SIGMOD, pp 927\u2013938","DOI":"10.1145\/1807167.1807267"},{"key":"107_CR18","doi-asserted-by":"crossref","unstructured":"Li C, Lu J, Lu Y (2008) Efficient merging and filtering algorithms for approximate string searches. In: Alonso G, Blakeley J, Chen A (eds) ICDE, pp 257\u2013266","DOI":"10.1109\/ICDE.2008.4497434"},{"key":"107_CR19","doi-asserted-by":"crossref","unstructured":"Terrovitis M, Passas S, Vassiliadis P, Sellis T (2006) A combination of trie-trees and inverted files for the indexing of set-valued attributes. In: Yu P, Tsotras V, Fox E, Liu B (eds) CIKM, pp 728\u2013737","DOI":"10.1145\/1183614.1183718"},{"issue":"2","key":"107_CR20","doi-asserted-by":"publisher","first-page":"332","DOI":"10.1145\/176567.176573","volume":"19","author":"T Yan","year":"1994","unstructured":"Yan T, Garc\u00eda-Molina Z (1994) Index structures for selective dissemination of information under the boolean model. ACM Trans Database Syst 19(2):332\u2013364","journal-title":"ACM Trans Database Syst"},{"issue":"4","key":"107_CR21","doi-asserted-by":"publisher","first-page":"360","DOI":"10.14778\/2856318.2856330","volume":"9","author":"D Deng","year":"2015","unstructured":"Deng D, Li G, Wen H, Feng J (2015) An efficient partition based method for exact set similarity joins. Proc VLDB Endow 9(4):360\u2013371","journal-title":"Proc VLDB Endow"},{"key":"107_CR22","unstructured":"Wang J, Li G, Feng J (2012) Can we beat the prefix filtering? an adaptive framework for similarity join and search. In: Candan K, Chen Y, Snodgrass R et al (eds) SIGMOD, pp 85\u201396"},{"key":"107_CR23","unstructured":"Xiao C, Wang W, Lin X, Shang H (2008) Top-k set similarity joins. In: Alonso G, Blakeley J, Chen A (eds) ICDE, pp 916\u2013927"},{"key":"107_CR24","doi-asserted-by":"crossref","unstructured":"Xiao C, Wang W, Lin X, Yu J (2008) Efficient similarity joins for near duplicate detection. In: Huai J, Chen R, Hon H et al (eds) WWW, pp 131\u2013140","DOI":"10.1145\/1367497.1367516"},{"key":"107_CR25","doi-asserted-by":"crossref","unstructured":"Bayardo R, Ma Y, Stikant R (2007) Scaling up all pairs similarity search. In: Williamson C, Zurko M, Patel-Schneider P et al (eds) WWW, pp 131\u2013140","DOI":"10.1145\/1242572.1242591"},{"key":"107_CR26","doi-asserted-by":"crossref","unstructured":"Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: Liu L, Reuter A, Whang K et al (eds) ICDE, pp 5\u201316","DOI":"10.1109\/ICDE.2006.9"},{"key":"107_CR27","unstructured":"Arasu A, Ganti V, Kaushik R (2006) Efficient exact set-similarity joins. In: Dayal U, Whang K, Lomet D (eds) VLDB, pp 918\u2013929"},{"key":"107_CR28","unstructured":"Luo J, Zhou X, Zhang Y, Shen H, Li J (2007) Selectivity estimation by batch-query based histogram and parametric method. In: Australia database conference, pp 93\u2013102"},{"issue":"10","key":"107_CR29","doi-asserted-by":"publisher","first-page":"1499","DOI":"10.1631\/FITEE.1601347","volume":"18","author":"J Luo","year":"2017","unstructured":"Luo J, Shi S, Wang H, Li J (2017) FrepJoin: an efficient partition-based algorithm for edit similarity join. Front Inf Technol Electron Eng 18(10):1499\u20131510","journal-title":"Front Inf Technol Electron Eng"}],"container-title":["Data Science and Engineering"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s41019-019-00107-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1007\/s41019-019-00107-y\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s41019-019-00107-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,11,8]],"date-time":"2020-11-08T00:50:00Z","timestamp":1604796600000},"score":1,"resource":{"primary":{"URL":"http:\/\/link.springer.com\/10.1007\/s41019-019-00107-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,11,9]]},"references-count":29,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2019,12]]}},"alternative-id":["107"],"URL":"https:\/\/doi.org\/10.1007\/s41019-019-00107-y","relation":{},"ISSN":["2364-1185","2364-1541"],"issn-type":[{"type":"print","value":"2364-1185"},{"type":"electronic","value":"2364-1541"}],"subject":[],"published":{"date-parts":[[2019,11,9]]},"assertion":[{"value":"14 August 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 October 2019","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 October 2019","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 November 2019","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}