{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,26]],"date-time":"2026-03-26T13:38:30Z","timestamp":1774532310440,"version":"3.50.1"},"reference-count":19,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2009,8]]},"abstract":"<jats:p>\n            Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The\n            <jats:italic>Map-Reduce<\/jats:italic>\n            scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such as\n            <jats:italic>join<\/jats:italic>\n            by hand. These practices waste time, introduce bugs, harm readability, and impede optimizations.\n          <\/jats:p>\n          <jats:p>\n            <jats:italic>Pig<\/jats:italic>\n            is a high-level dataflow system that aims at a sweet spot between SQL and Map-Reduce. Pig offers SQL-style high-level data manipulation constructs, which can be assembled in an explicit dataflow and interleaved with custom Map- and Reduce-style functions or executables. Pig programs are compiled into sequences of Map-Reduce jobs, and executed in the\n            <jats:italic>Hadoop<\/jats:italic>\n            Map-Reduce environment. 
Both Pig and Hadoop are open-source projects administered by the Apache Software Foundation.\n          <\/jats:p>\n          <jats:p>This paper describes the challenges we faced in developing Pig, and reports performance comparisons between Pig execution and raw Map-Reduce execution.<\/jats:p>","DOI":"10.14778\/1687553.1687568","type":"journal-article","created":{"date-parts":[[2014,6,24]],"date-time":"2014-06-24T12:17:57Z","timestamp":1403612277000},"page":"1414-1425","source":"Crossref","is-referenced-by-count":233,"title":["Building a high-level dataflow system on top of Map-Reduce"],"prefix":"10.14778","volume":"2","author":[{"given":"Alan F.","family":"Gates","sequence":"first","affiliation":[{"name":"Yahoo!, Inc."}]},{"given":"Olga","family":"Natkovich","sequence":"additional","affiliation":[{"name":"Yahoo!, Inc."}]},{"given":"Shubham","family":"Chopra","sequence":"additional","affiliation":[{"name":"Yahoo!, Inc."}]},{"given":"Pradeep","family":"Kamath","sequence":"additional","affiliation":[{"name":"Yahoo!, Inc."}]},{"given":"Shravan M.","family":"Narayanamurthy","sequence":"additional","affiliation":[{"name":"Yahoo!, Inc."}]},{"given":"Christopher","family":"Olston","sequence":"additional","affiliation":[{"name":"Yahoo!, Inc."}]},{"given":"Benjamin","family":"Reed","sequence":"additional","affiliation":[{"name":"Yahoo!, Inc."}]},{"given":"Santhosh","family":"Srinivasan","sequence":"additional","affiliation":[{"name":"Yahoo!, Inc."}]},{"given":"Utkarsh","family":"Srivastava","sequence":"additional","affiliation":[{"name":"Yahoo!, Inc."}]}],"member":"320","published-online":{"date-parts":[[2009,8]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Hadoop: Open-source implementation of MapReduce. http:\/\/hadoop.apache.org."},{"key":"e_1_2_1_2_1","unstructured":"Pig Mix Benchmark. http:\/\/wiki.apache.org\/pig\/PigMix."},
{"key":"e_1_2_1_3_1","unstructured":"The Hive Project. http:\/\/hadoop.apache.org\/hive\/."},{"key":"e_1_2_1_4_1","unstructured":"The Pig Project. http:\/\/hadoop.apache.org\/pig."},{"key":"e_1_2_1_5_1","unstructured":"K. Beyer V. Ercegovac and E. Shekita. Jaql: A JSON query language. http:\/\/www.jaql.org\/."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.14778\/1454159.1454166"},{"key":"e_1_2_1_7_1","unstructured":"Cloudera. http:\/\/www.cloudera.com."},{"key":"e_1_2_1_8_1","volume-title":"Proc. OSDI","author":"Dean J.","year":"2004","unstructured":"J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. OSDI, 2004."},{"key":"e_1_2_1_9_1","volume-title":"Proc. VLDB","author":"DeWitt D. J.","year":"1992","unstructured":"D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In Proc. VLDB, 1992."},
{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/509252.509292"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/69.273032"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1009726021843"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/963770.963774"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1559845.1559873"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376726"},{"issue":"4","key":"e_1_2_1_16_1","first-page":"227","volume":"13","author":"Pike R.","year":"2005","unstructured":"R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with sawzall. Scientific Programming Journal, 13(4):227--298, 2005.","journal-title":"Interpreting the data: Parallel analysis with sawzall. Scientific Programming Journal"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2006.40"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/604264.604282"},{"key":"e_1_2_1_20_1","volume-title":"Proc. OSDI","author":"Yu Y.","year":"2008","unstructured":"Y. Yu, M. Isard, D. Fetterly, M. Badiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proc. OSDI, 2008."}],
"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/1687553.1687568","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T10:57:55Z","timestamp":1672225075000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/1687553.1687568"}},"subtitle":["the Pig experience"],"short-title":[],"issued":{"date-parts":[[2009,8]]},"references-count":19,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2009,8]]}},"alternative-id":["10.14778\/1687553.1687568"],"URL":"https:\/\/doi.org\/10.14778\/1687553.1687568","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2009,8]]}}}