{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,5,17]],"date-time":"2024-05-17T17:38:50Z","timestamp":1715967530290},"reference-count":47,"publisher":"Association for Computing Machinery (ACM)","issue":"9","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2019,5]]},"abstract":"<jats:p>\n            Data analysts spend more than 80% of time on data cleaning and integration in the whole process of data analytics due to data errors and inconsistencies. Similarity-based query processing is an important way to tolerate the errors and inconsistencies. However, similarity-based query processing is rather costly and traditional database cannot afford such expensive requirement. In this paper, we develop a distributed in-memory similarity-based query processing system called Dima. Dima supports four core similarity operations, i.e., similarity selection, similarity join, top-\n            <jats:italic>k<\/jats:italic>\n            selection and top-\n            <jats:italic>k<\/jats:italic>\n            join. Dima extends SQL for users to easily invoke these similarity-based operations in their data analysis tasks. To avoid expensive data transmission in a distributed environment, we propose\n            <jats:italic>balance-aware signatures<\/jats:italic>\n            where two records are similar if they share common signatures, and we can adaptively select the signatures to balance the workload. Dima builds signature-based global indexes and local indexes to support similarity operations. Since Spark is one of the widely adopted distributed in-memory computing systems, we have seamlessly integrated Dima into Spark and developed effective query optimization techniques in Spark. To the best of our knowledge, this is the first full-fledged distributed in-memory system that can support complex similarity-based query processing on large-scale datasets. We have conducted extensive experiments on four real-world datasets. Experimental results show that Dima outperforms state-of-the-art studies by 1--3 orders of magnitude and has good scalability.\n          <\/jats:p>","DOI":"10.14778\/3329772.3329774","type":"journal-article","created":{"date-parts":[[2019,6,24]],"date-time":"2019-06-24T13:43:16Z","timestamp":1561383796000},"page":"961-974","source":"Crossref","is-referenced-by-count":8,"title":["Balance-aware distributed string similarity-based query processing system"],"prefix":"10.14778","volume":"12","author":[{"given":"Ji","family":"Sun","sequence":"first","affiliation":[{"name":"Tsinghua University"}]},{"given":"Zeyuan","family":"Shang","sequence":"additional","affiliation":[{"name":"Tsinghua University and MIT"}]},{"given":"Guoliang","family":"Li","sequence":"additional","affiliation":[{"name":"Tsinghua University"}]},{"given":"Dong","family":"Deng","sequence":"additional","affiliation":[{"name":"Tsinghua University and MIT"}]},{"given":"Zhifeng","family":"Bao","sequence":"additional","affiliation":[{"name":"RMIT University"}]}],"member":"320","published-online":{"date-parts":[[2019,5]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2012.66"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-019-0088-6"},{"key":"e_1_2_1_3_1","first-page":"918","volume-title":"VLDB","author":"Arasu A.","year":"2006","unstructured":"A. Arasu , V. Ganti , and R. Kaushik . Efficient exact set-similarity joins . In VLDB , pages 918 -- 929 , 2006 . A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242591"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/276698.276781"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/872757.872796"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3035960"},{"key":"e_1_2_1_8_1","first-page":"137","volume-title":"OSDI","author":"Dean J.","year":"2004","unstructured":"J. Dean and S. Ghemawat . Mapreduce: Simplified data processing on large clusters . In OSDI , pages 137 -- 150 , 2004 . J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137--150, 2004."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536258.2536271"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2593675"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544886"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2014.6816663"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.14778\/2856318.2856330"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3183748"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-011-0252-8"},{"key":"e_1_2_1_16_1","first-page":"491","volume-title":"VLDB","author":"Gravano L.","year":"2001","unstructured":"L. Gravano , P. G. Ipeirotis , H. V. Jagadish , N. Koudas , S. Muthukrishnan , and D. Srivastava . Approximate string joins in a database (almost) for free . In VLDB , pages 491 -- 500 , 2001 . L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491--500, 2001."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1559845.1559891"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/276698.276876"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732296.2732299"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2012.87"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994535"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2008.4497434"},{"key":"e_1_2_1_23_1","first-page":"303","volume-title":"VLDB","author":"Li C.","year":"2007","unstructured":"C. Li , B. Wang , and X. Yang . Vgram: Improving performance of approximate queries on string collections using variable-length grams . In VLDB , pages 303 -- 314 , 2007 . C. Li, B. Wang, and X. Yang. Vgram: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, pages 303--314, 2007."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2503009"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2487259.2487261"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.14778\/2078331.2078340"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2723733"},{"key":"e_1_2_1_28_1","first-page":"193","volume-title":"EDBT","author":"Li H.","year":"2018","unstructured":"H. Li , P. Konda , P. S. G. C., A. Doan , B. Snyder , Y. Park , G. Krishnan , R. Deep , and V. Raghavendra . Matchcatcher: A debugger for blocking in entity matching . In EDBT , pages 193 -- 204 . OpenProceedings.org , 2018 . H. Li, P. Konda, P. S. G. C., A. Doan, B. Snyder, Y. Park, G. Krishnan, R. Deep, and V. Raghavendra. Matchcatcher: A debugger for blocking in entity matching. In EDBT, pages 193--204. OpenProceedings.org, 2018."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-018-0074-4"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.14778\/2212351.2212353"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007568.1007652"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.14778\/2140436.2140440"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2016.2601325"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137765.3137810"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-017-0043-3"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807222"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2015.7113311"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2011.5767865"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/2213836.2213847"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/2535628"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453957"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2009.111"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/1367497.1367516"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11704-015-5900-5"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-016-0449-y"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989323.1989428"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807266"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3329772.3329774","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T10:35:35Z","timestamp":1672223735000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3329772.3329774"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,5]]},"references-count":47,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2019,5]]}},"alternative-id":["10.14778\/3329772.3329774"],"URL":"https:\/\/doi.org\/10.14778\/3329772.3329774","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2019,5]]}}}