{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,8]],"date-time":"2026-02-08T10:15:20Z","timestamp":1770545720457,"version":"3.49.0"},"reference-count":25,"publisher":"Association for Computing Machinery (ACM)","issue":"13","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2014,8]]},"abstract":"<jats:p>Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets (e.g., search logs, click streams, and web graph data). For cost and performance reasons, processing is typically done on large clusters of thousands of commodity machines by using high level scripting languages. In the recent past, there has been significant progress in adapting well-known techniques from traditional relational DBMSs to this new scenario. However, important challenges remain open. In this paper we study the very common join operation, discuss some unique challenges in the large-scale distributed scenario, and explain how to efficiently and robustly process joins in a distributed way. Specifically, we introduce novel execution strategies that leverage opportunities not available in centralized scenarios, and others that robustly handle data skew. 
We report experimental validations of our approaches on Scope production clusters, which power the Applications and Services Group at Microsoft.<\/jats:p>","DOI":"10.14778\/2733004.2733020","type":"journal-article","created":{"date-parts":[[2015,5,12]],"date-time":"2015-05-12T15:37:52Z","timestamp":1431445072000},"page":"1484-1495","source":"Crossref","is-referenced-by-count":42,"title":["Advanced join strategies for large-scale distributed computation"],"prefix":"10.14778","volume":"7","author":[{"given":"Nicolas","family":"Bruno","sequence":"first","affiliation":[{"name":"Microsoft Corp."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"YongChul","family":"Kwon","sequence":"additional","affiliation":[{"name":"Microsoft Corp."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ming-Chuan","family":"Wu","sequence":"additional","affiliation":[{"name":"Microsoft Corp."}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2014,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2011.47"},{"key":"e_1_2_1_2_1","unstructured":"Apache. Hadoop. http:\/\/hadoop.apache.org\/."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807273"},{"key":"e_1_2_1_4_1","first-page":"10","volume-title":"Proceedings of OSDI Conference","author":"Dean J.","year":"2004","unstructured":"J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI Conference, pages 10--10, 2004."},{"key":"e_1_2_1_5_1","first-page":"27","volume-title":"Proc. of the 18th VLDB Conf.","author":"DeWitt D.","year":"1992","unstructured":"D. DeWitt, J. Naughton, D. Schneider, and S. S. Seshadri. Practical skew handling in parallel joins. 
In Proc. of the 18th VLDB Conf., pages 27--40, 1992."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687553.1687568"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/152610.152611"},{"key":"e_1_2_1_8_1","unstructured":"He Yongqiang. handle skewed keys for a join in a separate job. https:\/\/issues.apache.org\/jira\/browse\/HIVE-964."},{"key":"e_1_2_1_9_1","first-page":"525","volume-title":"Proc. of the 17th VLDB Conf.","author":"Hua K.","year":"1991","unstructured":"K. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In Proc. of the 17th VLDB Conf., pages 525--535, 1991."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/69.476502"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1272996.1273005"},{"key":"e_1_2_1_12_1","first-page":"210","volume-title":"Proc. of the 16th VLDB Conf.","author":"Kitsuregawa M.","year":"1990","unstructured":"M. Kitsuregawa and Y. Ogawa. Bucket spreading parallel hash: a new, robust, parallel hash join method for data skew in the super database computer (sdc). In Proc. of the 16th VLDB Conf., pages 210--221, 1990."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/564691.564711"},{"key":"e_1_2_1_14_1","unstructured":"Namit Jain. 
Skewed Join Optimization. https:\/\/issues.apache.org\/jira\/browse\/HIVE-3086."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376726"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2507157.2507195"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/170035.170062"},{"key":"e_1_2_1_18_1","unstructured":"Sriranjan Manjunath. support for skewed outer join. https:\/\/issues.apache.org\/jira\/browse\/PIG-1035."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2010.5447738"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/319057.319072"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.5555\/645476.654473"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687553.1687565"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376720"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-012-0280-z"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2010.5447802"}],"container-title":["Proceedings of the VLDB 
Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/2733004.2733020","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T09:41:11Z","timestamp":1672220471000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/2733004.2733020"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,8]]},"references-count":25,"journal-issue":{"issue":"13","published-print":{"date-parts":[[2014,8]]}},"alternative-id":["10.14778\/2733004.2733020"],"URL":"https:\/\/doi.org\/10.14778\/2733004.2733020","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2014,8]]}}}