{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,19]],"date-time":"2026-03-19T12:55:19Z","timestamp":1773924919458,"version":"3.50.1"},"reference-count":51,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2016,11,2]],"date-time":"2016-11-02T00:00:00Z","timestamp":1478044800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Database Syst."],"published-print":{"date-parts":[[2016,12,23]]},"abstract":"<jats:p>\n            The Hadoop Distributed File System (HDFS) has become an important data repository in the enterprise as the center for all business analytics, from SQL queries and machine learning to reporting. At the same time, enterprise data warehouses (EDWs) continue to support critical business analytics. This has created the need for a new generation of a special federation between Hadoop-like big data platforms and EDWs, which we call the\n            <jats:italic>hybrid warehouse<\/jats:italic>\n            . There are many applications that require correlating data stored in HDFS with EDW data, such as the analysis that associates click logs stored in HDFS with the sales data stored in the database. All existing solutions reach out to HDFS and read the data into the EDW to perform the joins, assuming that the Hadoop side does not have efficient SQL support.\n          <\/jats:p>\n          <jats:p>\n            In this article, we show that it is actually better to do most data processing on the HDFS side, provided that we can leverage a sophisticated execution engine for joins on the Hadoop side. We identify the best hybrid warehouse architecture by studying various algorithms to join database and HDFS tables. We utilize Bloom filters to minimize the data movement and exploit the massive parallelism in both systems to the fullest extent possible. We describe a new\n            <jats:italic>zigzag join<\/jats:italic>\n            algorithm and show that it is a robust join algorithm for hybrid warehouses that performs well in almost all cases. We further develop a sophisticated cost model for the various join algorithms and show that it can facilitate query optimization in the hybrid warehouse to correctly choose the right algorithm under different predicate and join selectivities.\n          <\/jats:p>","DOI":"10.1145\/2972950","type":"journal-article","created":{"date-parts":[[2016,11,4]],"date-time":"2016-11-04T12:49:04Z","timestamp":1478263744000},"page":"1-38","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":14,"title":["Building a Hybrid Warehouse"],"prefix":"10.1145","volume":"41","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6835-8434","authenticated-orcid":false,"given":"Yuanyuan","family":"Tian","sequence":"first","affiliation":[{"name":"IBM Research -- Almaden, San Jose, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fatma","family":"\u00d6zcan","sequence":"additional","affiliation":[{"name":"IBM Research -- Almaden, San Jose, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tao","family":"Zou","sequence":"additional","affiliation":[{"name":"Google, Mountain View, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Romulo","family":"Goncalves","sequence":"additional","affiliation":[{"name":"The Netherlands eScience Center, Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hamid","family":"Pirahesh","sequence":"additional","affiliation":[{"name":"IBM Research -- Almaden, San Jose, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2016,11,2]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/233269.233327"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2011.47"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-014-0357-y"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.14778\/2733085.2733096"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2742797"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989323.1989447"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1147\/sj.342.0292"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807273"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/362686.362692"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.14778\/1454159.1454166"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1327452.1327492"},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the 1985 International Conference on Very Large Data Bases (VLDB'85)","author":"David","unstructured":"David J. DeWitt and Robert H. Gerber. 1985. Multiprocessor hash-based join algorithms . In Proceedings of the 1985 International Conference on Very Large Data Bases (VLDB'85) . 151--164. David J. DeWitt and Robert H. Gerber. 1985. Multiprocessor hash-based join algorithms. In Proceedings of the 1985 International Conference on Very Large Data Bases (VLDB'85). 151--164."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2463709"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/602259.602261"},{"key":"e_1_2_1_15_1","volume-title":"Proceedings of the 1992 International Conference on Very Large Data Bases (VLDB'92)","author":"DeWitt David J.","unstructured":"David J. DeWitt , Jeffrey F. Naughton , Donovan A. Schneider , and S. Seshadri . 1992. Practical skew handling in parallel joins . In Proceedings of the 1992 International Conference on Very Large Data Bases (VLDB'92) . 27--40. David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider, and S. Seshadri. 1992. Practical skew handling in parallel joins. In Proceedings of the 1992 International Conference on Very Large Data Bases (VLDB'92). 27--40."},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the 2003 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'03)","author":"Fagin Ronald","unstructured":"Ronald Fagin , Ravi Kumar , and D. Sivakumar . 2003. Comparing top K lists . In Proceedings of the 2003 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'03) . 28--36. Ronald Fagin, Ravi Kumar, and D. Sivakumar. 2003. Comparing top K lists. In Proceedings of the 2003 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'03). 28--36."},{"key":"e_1_2_1_17_1","volume-title":"Dynamic Access: The SQL-H feature for the latest Teradata database leverages data in Hadoop.","author":"Frazier Doug","year":"2013","unstructured":"Doug Frazier . 2013 . Dynamic Access: The SQL-H feature for the latest Teradata database leverages data in Hadoop. Retrieved from http:\/\/www.teradatamagazine.com\/v13n02\/Tech2Tech\/Dynamic-Access. Doug Frazier. 2013. Dynamic Access: The SQL-H feature for the latest Teradata database leverages data in Hadoop. Retrieved from http:\/\/www.teradatamagazine.com\/v13n02\/Tech2Tech\/Dynamic-Access."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/130283.130291"},{"key":"e_1_2_1_19_1","unstructured":"Scott C. Gray Fatma Ozcan Hebert Pereyra Bert van der Linden and Adriana Zubiri. 2015. SQL-on-Hadoop without compromise: How Big SQL 3.0 from IBM represents an important leap forward for speed portability and robust functionality in SQL-on-Hadoop solutions. Retrieved from http:\/\/public.dhe.ibm.com\/common\/ssi\/ecm\/sw\/en\/sww14019usen\/SWW14019USEN.PDF.  Scott C. Gray Fatma Ozcan Hebert Pereyra Bert van der Linden and Adriana Zubiri. 2015. SQL-on-Hadoop without compromise: How Big SQL 3.0 from IBM represents an important leap forward for speed portability and robust functionality in SQL-on-Hadoop solutions. Retrieved from http:\/\/public.dhe.ibm.com\/common\/ssi\/ecm\/sw\/en\/sww14019usen\/SWW14019USEN.PDF."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/564691.564751"},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the 2015 Conference on Innovative Data Systems Research (CIDR'15)","author":"Kornacker Marcel","year":"2015","unstructured":"Marcel Kornacker , Alexander Behm , Victor Bittorf , Taras Bobrovytsky , Casey Ching , Alan Choi , Justin Erickson , Martin Grund , Daniel Hecht , Matthew Jacobs , Ishaan Joshi , Lenni Kuff , Dileep Kumar , Alex Leblang , Nong Li , Ippokratis Pandis , Henry Robinson , David Rorke , Silvius Rus , John Russell , Dimitris Tsirogiannis , Skye Wanderman-Milne , and Michael Yoder . 2015 . Impala: A modern, open-source SQL engine for hadoop . In Proceedings of the 2015 Conference on Innovative Data Systems Research (CIDR'15) . Marcel Kornacker, Alexander Behm, Victor Bittorf, Taras Bobrovytsky, Casey Ching, Alan Choi, Justin Erickson, Martin Grund, Daniel Hecht, Matthew Jacobs, Ishaan Joshi, Lenni Kuff, Dileep Kumar, Alex Leblang, Nong Li, Ippokratis Pandis, Henry Robinson, David Rorke, Silvius Rus, John Russell, Dimitris Tsirogiannis, Skye Wanderman-Milne, and Michael Yoder. 2015. Impala: A modern, open-source SQL engine for hadoop. In Proceedings of the 2015 Conference on Innovative Data Systems Research (CIDR'15)."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/371578.371598"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2401603.2401626"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/221270.221360"},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of the 1986 International Conference on Very Large Data Bases (VLDB'86)","author":"Lothar","unstructured":"Lothar F. Mackert and Guy M. Lohman. 1986. R* optimizer validation and performance evaluation for distributed queries . In Proceedings of the 1986 International Conference on Very Large Data Bases (VLDB'86) . 149--159. Lothar F. Mackert and Guy M. Lohman. 1986. R* optimizer validation and performance evaluation for distributed queries. In Proceedings of the 1986 International Conference on Very Large Data Bases (VLDB'86). 149--159."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/62061.62063"},{"key":"e_1_2_1_27_1","unstructured":"Dan McClary. 2014. Oracle Big Data SQL: One Fast Query All Your Data. Retrieved from https:\/\/blogs.oracle.com\/datawarehousing\/entry\/oracle_big_data_sql_one.  Dan McClary. 2014. Oracle Big Data SQL: One Fast Query All Your Data. Retrieved from https:\/\/blogs.oracle.com\/datawarehousing\/entry\/oracle_big_data_sql_one."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/AINA.2007.80"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/32.52778"},{"key":"e_1_2_1_30_1","unstructured":"Oracle. 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database. Retrieved from http:\/\/www.oracle.com\/technetwork\/bdc\/hadoop-loader\/connectors-hdfs-wp-1674035.pdf.  Oracle. 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database. Retrieved from http:\/\/www.oracle.com\/technetwork\/bdc\/hadoop-loader\/connectors-hdfs-wp-1674035.pdf."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989323.1989446"},{"key":"e_1_2_1_32_1","volume-title":"Principles of Distributed Database Systems","author":"Tamer \u00d6zsu M.","unstructured":"M. Tamer \u00d6zsu and Patrick Valduriez . 2011. Principles of Distributed Database Systems ( 3 rd ed.). Springer . M. Tamer \u00d6zsu and Patrick Valduriez. 2011. Principles of Distributed Database Systems (3rd ed.). Springer.","edition":"3"},{"key":"e_1_2_1_33_1","volume-title":"Proceedings of the 1995 International Conference on Deductive and Object-Oriented Databases (DOOD'95)","author":"Papakonstantinou Yannis","unstructured":"Yannis Papakonstantinou , Ashish Gupta , Hector Garcia-Molina , and Jeffrey D. Ullman . 1995. A query translation scheme for rapid implementation of wrappers . In Proceedings of the 1995 International Conference on Deductive and Object-Oriented Databases (DOOD'95) . 161--186. Yannis Papakonstantinou, Ashish Gupta, Hector Garcia-Molina, and Jeffrey D. Ullman. 1995. A query translation scheme for rapid implementation of wrappers. In Proceedings of the 1995 International Conference on Deductive and Object-Oriented Databases (DOOD'95). 161--186."},{"key":"e_1_2_1_34_1","volume-title":"Pivotal HD: HAWQ - A True SQL Engine For Hadoop.","year":"2015","unstructured":"Pivotal. 2015 . Pivotal HD: HAWQ - A True SQL Engine For Hadoop. Retrieved from http:\/\/www.gopivotal.com\/sites\/default\/files\/Hawq_WP_042313_FINAL.pdf. Pivotal. 2015. Pivotal HD: HAWQ - A True SQL Engine For Hadoop. Retrieved from http:\/\/www.gopivotal.com\/sites\/default\/files\/Hawq_WP_042313_FINAL.pdf."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2610521"},{"key":"e_1_2_1_36_1","volume-title":"Proceedings of the 1996 International Conference on Very Large Data Bases (VLDB'96)","author":"Poosala Viswanath","unstructured":"Viswanath Poosala and Yannis E. Ioannidis . 1996. Estimation of query-result distribution and its application in parallel-join load balancing . In Proceedings of the 1996 International Conference on Very Large Data Bases (VLDB'96) . 448--459. Viswanath Poosala and Yannis E. Ioannidis. 1996. Estimation of query-result distribution and its application in parallel-join load balancing. In Proceedings of the 1996 International Conference on Very Large Data Bases (VLDB'96). 448--459."},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the 1999 International Conference on Very Large Data Bases (VLDB'99)","author":"Roth Mary Tork","unstructured":"Mary Tork Roth , Fatma \u00d6zcan , and Laura M. Haas . 1999. Cost models DO matter: Providing cost information for diverse data sources in a federated system . In Proceedings of the 1999 International Conference on Very Large Data Bases (VLDB'99) . 599--610. Mary Tork Roth, Fatma \u00d6zcan, and Laura M. Haas. 1999. Cost models DO matter: Providing cost information for diverse data sources in a federated system. In Proceedings of the 1999 International Conference on Very Large Data Bases (VLDB'99). 599--610."},{"key":"e_1_2_1_38_1","volume-title":"Pegasus: A heterogeneous information management system","author":"Shan Ming-Chien","year":"1995","unstructured":"Ming-Chien Shan , Rafi Ahmed , Jim Davis , Weimin Du , and William Kent . 1995 . Pegasus: A heterogeneous information management system . In Modern Database Systems, Won Kim (Ed.). ACM Press\/Addison-Wesley Publishing , 664--682. Ming-Chien Shan, Rafi Ahmed, Jim Davis, Weimin Du, and William Kent. 1995. Pegasus: A heterogeneous information management system. In Modern Database Systems, Won Kim (Ed.). ACM Press\/Addison-Wesley Publishing, 664--682."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/96602.96604"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544909"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2595637"},{"key":"e_1_2_1_42_1","volume-title":"Proceedings of the 1994 International Conference on Extending Database Technology (EDBT'94)","author":"Swami Arun","unstructured":"Arun Swami and K. Bernhard Schiefer . 1994. On the estimation of join result sizes . In Proceedings of the 1994 International Conference on Extending Database Technology (EDBT'94) . 287--300. Arun Swami and K. Bernhard Schiefer. 1994. On the estimation of join result sizes. In Proceedings of the 1994 International Conference on Extending Database Technology (EDBT'94). 287--300."},{"key":"e_1_2_1_43_1","unstructured":"Teradata. 2013. Teradata Connector for Hadoop. Retrieved from http:\/\/developer.teradata.com\/connectivity\/articles\/teradata-connector-for-hadoop-now-available.  Teradata. 2013. Teradata Connector for Hadoop. Retrieved from http:\/\/developer.teradata.com\/connectivity\/articles\/teradata-connector-for-hadoop-now-available."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687553.1687609"},{"key":"e_1_2_1_45_1","volume-title":"Proceedings of the 2015 International Conference on Extending Database Technology (EDBT'15)","author":"Tian Yuanyuan","year":"2015","unstructured":"Yuanyuan Tian , Tao Zou , Fatma Ozcan , Romulo Goncalves , and Hamid Pirahesh . 2015 . Joins for hybrid warehouses: Exploiting massive parallelism in hadoop and enterprise data warehouses . In Proceedings of the 2015 International Conference on Extending Database Technology (EDBT'15) . 373--384. Yuanyuan Tian, Tao Zou, Fatma Ozcan, Romulo Goncalves, and Hamid Pirahesh. 2015. Joins for hybrid warehouses: Exploiting massive parallelism in hadoop and enterprise data warehouses. In Proceedings of the 2015 International Conference on Extending Database Technology (EDBT'15). 373--384."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/69.729736"},{"key":"e_1_2_1_47_1","volume-title":"Presto: Interacting with petabytes of data at Facebook.","author":"Traverso Martin","year":"2013","unstructured":"Martin Traverso . 2013 . Presto: Interacting with petabytes of data at Facebook. Retrieved from https:\/\/www.facebook.com\/notes\/facebook-engineering\/presto-interacting-with-petabytes-of-data-at-facebook\/10151786197628920. Martin Traverso. 2013. Presto: Interacting with petabytes of data at Facebook. Retrieved from https:\/\/www.facebook.com\/notes\/facebook-engineering\/presto-interacting-with-petabytes-of-data-at-facebook\/10151786197628920."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376720"},{"key":"e_1_2_1_49_1","volume-title":"Proceedings of the 2012 USENIX Conference on Networked Systems Design and Implementation (NSDI'12)","author":"Zaharia Matei","year":"2012","unstructured":"Matei Zaharia , Mosharaf Chowdhury , Tathagata Das , Ankur Dave , Justin Ma , Murphy McCauley , Michael J. Franklin , Scott Shenker , and Ion Stoica . 2012 . Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing . In Proceedings of the 2012 USENIX Conference on Networked Systems Design and Implementation (NSDI'12) . 15--28. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 2012 USENIX Conference on Networked Systems Design and Implementation (NSDI'12). 15--28."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-35600-1_13"},{"key":"e_1_2_1_51_1","unstructured":"Wei Zheng. 2015. Hybrid Hybrid Grace Hash Join v1.0. Retrieved from https:\/\/cwiki.apache.org\/confluence\/display\/Hive\/Hybrid+Hybrid+Grace+Hash+Join +v1.0#HybridHybridGraceHashJoin v1.0-BloomFilter.  Wei Zheng. 2015. Hybrid Hybrid Grace Hash Join v1.0. Retrieved from https:\/\/cwiki.apache.org\/confluence\/display\/Hive\/Hybrid+Hybrid+Grace+Hash+Join +v1.0#HybridHybridGraceHashJoin v1.0-BloomFilter."}],"container-title":["ACM Transactions on Database Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2972950","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2972950","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T03:50:16Z","timestamp":1750218616000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2972950"}},"subtitle":["Efficient Joins between Data Stored in HDFS and Enterprise Warehouse"],"short-title":[],"issued":{"date-parts":[[2016,11,2]]},"references-count":51,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2016,12,23]]}},"alternative-id":["10.1145\/2972950"],"URL":"https:\/\/doi.org\/10.1145\/2972950","relation":{},"ISSN":["0362-5915","1557-4644"],"issn-type":[{"value":"0362-5915","type":"print"},{"value":"1557-4644","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,11,2]]},"assertion":[{"value":"2015-08-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-11-02","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}