Apache Pig is a scripting language for exploring large datasets that allows users to express data flows in Pig Latin scripts. Pig Latin scripts describe multi-step transformations that Pig executes by translating into MapReduce jobs. This allows users to focus on the logic of their data analysis without needing to write MapReduce programs directly.
Apache Pig raises the level of abstraction for processing large datasets.
Pig is a scripting language for exploring large datasets.

Why Pig? To overcome the long development cycle of MapReduce jobs:
1. Writing the mappers and reducers
2. Compiling and packaging the code
3. Submitting the job(s) and retrieving the results

Pig Components
Pig is made up of two pieces:
• The language used to express data flows, called Pig Latin.
• The execution environment to run Pig Latin programs, which is either:
  1. local execution in a single JVM, or
  2. distributed execution on a Hadoop cluster.

A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output.
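As a minimal sketch of such a data flow (the input path and field names here are invented for illustration):

-- Load, transform, and write out a dataset in three steps
logs = LOAD 'input/logs.txt' AS (user:chararray, bytes:int);
big_transfers = FILTER logs BY bytes > 1024;
STORE big_transfers INTO 'output/big_transfers';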
The operations describe a data flow, which the Pig
execution environment translates into an executable representation and then runs.
Pig turns the transformations into a series of MapReduce jobs, without the programmer being aware of them.

Installing and Running Pig
Pig runs as a client-side application, either locally or on a Hadoop cluster. Pig launches jobs and interacts with HDFS (or other Hadoop filesystems) from your workstation.
Download a release from http://pig.apache.org/releases.html, and unpack the tarball in a suitable directory on your workstation:

% tar xzf pig-x.y.z.tar.gz

Pig Execution Types
Local mode:

% pig -x local
grunt>

MapReduce mode, which can run on Hadoop MapReduce, Apache Tez, or Apache Spark. To use MapReduce mode, you first need to check that the version of Pig you downloaded is compatible with the version of Hadoop you are using. To connect to a Hadoop cluster, set the cluster properties in pig.properties in Pig's conf directory:

fs.defaultFS=hdfs://localhost/
mapreduce.framework.name=yarn
yarn.resourcemanager.address=localhost:8032

Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and MapReduce mode:
Script -- A script file containing Pig commands (the -e option lets you specify a command as a string).
Grunt -- An interactive shell for running Pig commands.
Embedded -- You can run Pig programs from Java using the PigServer class, much like you can use JDBC to run SQL programs from Java.

Pig example script:

-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    quality IN (0, 1, 4, 5, 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);
DUMP max_temp;

The same program can be entered statement by statement in Grunt:

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:chararray, temperature:int, quality:int);

For simplicity, the program assumes that the input is tab-delimited text, with each line having just year, temperature, and quality fields.

grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> quality IN (0, 1, 4, 5, 9);
grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,78,1),(1949,111,1)})
(1950,{(1950,-11,1),(1950,22,1),(1950,0,1)})
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray, filtered_records: {year: chararray, temperature: int, quality: int}}
grunt> max_temp = FOREACH grouped_records GENERATE group,
>> MAX(filtered_records.temperature);

FOREACH processes every row to generate a derived set of rows, using a GENERATE clause to define the fields in each derived row.

grunt> DUMP max_temp;
(1949,111)
(1950,22)

Pig vs Databases
Pig Latin is a data flow programming language, whereas SQL is a declarative programming language.
A Pig Latin program is a step-by-step set of operations on an input relation, in which each step is a single transformation.
SQL statements are a set of constraints that, taken together, define the output. Programming in Pig Latin is like working at the level of an RDBMS query planner, which figures out how to turn a declarative statement into a system of steps.
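To make the contrast concrete, here is a sketch (table and field names invented for illustration) showing the same aggregation expressed declaratively in SQL, as a comment, and as explicit Pig Latin steps:

-- SQL (declarative, one statement):
--   SELECT year, MAX(temperature) FROM records GROUP BY year;
-- Pig Latin (data flow, one transformation per step):
records = LOAD 'input/records' AS (year:chararray, temperature:int);
grouped = GROUP records BY year;
maxima = FOREACH grouped GENERATE group, MAX(records.temperature);
DUMP maxima;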
RDBMSs store data in tables, with tightly predefined
schemas. Pig is more relaxed about the data that it processes: you can define a schema at runtime, but it’s optional.
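For example, here is a sketch (the projection is invented for illustration) of loading the sample dataset with no schema at all; the fields then have no names and are referenced by position:

records = LOAD 'input/ncdc/micro-tab/sample.txt';  -- no AS clause, no schema
projected = FOREACH records GENERATE $0, $2;       -- first and third fields
DUMP projected;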
Pig's nested data structures make Pig Latin more customizable than most SQL dialects. Pig does not support random reads, nor does it support random writes to update small portions of data; in Pig, all writes are bulk streaming writes, just like with MapReduce.
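As an illustration of such a bulk write (the output path is invented, and max_temp is the relation from the earlier example), STORE writes an entire relation in one streaming pass:

STORE max_temp INTO 'output/max_temp' USING PigStorage(',');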
Pig is able to work with Hive tables using HCatalog.
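As a sketch (assuming Hive and HCatalog are installed, and that a Hive table named sample_table exists):

% pig -useHCatalog
grunt> hive_rows = LOAD 'sample_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
grunt> DESCRIBE hive_rows;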
Pig Latin
A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation or a command. For example, a GROUP operation is a type of statement:

grouped_records = GROUP records BY year;

Statements that have to be terminated with a semicolon can be split across multiple lines for readability:

records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);

Pig Latin has two forms of comments. Double hyphens are used for single-line comments; everything from the first hyphen to the end of the line is ignored by the Pig Latin interpreter:

-- My program
DUMP A; -- What's in A?

C-style comments are more flexible, since they delimit the beginning and end of the comment block with /* and */ markers. They can span lines or be embedded in a single line:

/*
 * Description of my program spanning
 * multiple lines.
 */
A = LOAD 'input/pig/join/A';
B = LOAD 'input/pig/join/B';
C = JOIN A BY $0, /* ignored */ B BY $1;
DUMP C;

Pig Latin has a list of keywords that have a special meaning in the language and cannot be used as identifiers. These include the operators (LOAD, ILLUSTRATE), commands (cat, ls), expressions (matches, FLATTEN), and functions (DIFF, MAX).

Pig Latin has mixed rules on case sensitivity. Operators and commands are not case sensitive (to make interactive use more forgiving); however, aliases and function names are case sensitive.

When a Pig Latin program is executed, each statement is parsed in turn. If there are syntax errors or other (semantic) problems, such as undefined aliases, the interpreter will halt and display an error message. The interpreter builds a logical plan for every relational operation, which forms the core of a Pig Latin program. The logical plan for the statement is added to the logical plan for the program so far, and then the interpreter moves on to the next statement. No data processing takes place while the logical plan of the program is being constructed. Consider the max_temp.pig script again:

-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    quality IN (0, 1, 4, 5, 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);
DUMP max_temp;

Pig validates the GROUP and FOREACH...GENERATE statements, and adds them to the logical plan without executing them.
The trigger for Pig to start execution is the DUMP statement. At that point, the logical plan is compiled into a physical plan and executed. The physical plan that Pig prepares is a series of MapReduce jobs, which Pig runs in the local JVM in local mode, or on a Hadoop cluster in MapReduce mode.

The STORE statement should be used when the size of the output is more than a few lines, as it writes to a file rather than to the console.

Diagnostic operators (DESCRIBE, EXPLAIN, and ILLUSTRATE) are provided to allow the user to interact with the logical plan for debugging purposes:

Operator (Shortcut)   Description
DESCRIBE (\de)        Prints a relation's schema
EXPLAIN (\e)          Prints the logical and physical plans
ILLUSTRATE (\i)       Shows a sample execution of the logical plan, using a generated subset of the input

Pig Latin also provides three statements, REGISTER, DEFINE, and IMPORT, that make it possible to incorporate macros and user-defined functions into Pig scripts:

Statement   Description
REGISTER    Registers a JAR file with the Pig runtime
DEFINE      Creates an alias for a macro, UDF, streaming script, or command specification
IMPORT      Imports macros defined in a separate file into a script

Pig provides commands to interact with Hadoop filesystems and MapReduce, as well as a few utility commands.

Table 16-4. Pig Latin commands

Category            Command         Description
Hadoop filesystem   cat             Prints the contents of one or more files
                    cd              Changes the current directory
                    copyFromLocal   Copies a local file or directory to a Hadoop filesystem
                    copyToLocal     Copies a file or directory on a Hadoop filesystem to the local filesystem
                    cp              Copies a file or directory to another directory
                    fs              Accesses Hadoop's filesystem shell
                    ls              Lists files
                    mkdir           Creates a new directory
                    mv              Moves a file or directory to another directory
                    pwd             Prints the path of the current working directory
                    rm              Deletes a file or directory
                    rmf             Forcibly deletes a file or directory (does not fail if the file or directory does not exist)
Hadoop MapReduce    kill            Kills a MapReduce job
Utilities           clear           Clears the screen in Grunt
                    exec            Runs a script in a new Grunt shell in batch mode
                    help            Shows the available commands and options
                    history         Prints the query statements run in the current Grunt session
                    quit (\q)       Exits the interpreter
                    run             Runs a script within the existing Grunt shell
                    set             Sets Pig options and MapReduce job properties
                    sh              Runs a shell command from within Grunt

A relation in Pig may have an associated schema, which gives the fields in the relation names and types.
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: int, temperature: int, quality: int}

Functions in Pig come in four types:

Eval function -- A function that takes one or more expressions and returns another expression. An example of a built-in eval function is MAX, which returns the maximum value of the entries in a bag.
Filter function -- A special type of eval function that returns a logical Boolean result. As the name suggests, filter functions are used in the FILTER operator to remove unwanted rows.
Load function -- A function that specifies how to load data into a relation from external storage.
Store function -- A function that specifies how to save the contents of a relation to external storage. Often, load and store functions are implemented by the same type.

Macros
Macros provide a way to package reusable pieces of Pig Latin code from within Pig Latin itself. For example, we can extract the part of our Pig Latin program that performs grouping on a relation and then finds the maximum value in each group by defining a macro as follows:

DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
    A = GROUP $X by $group_key;
    $Y = FOREACH A GENERATE group, MAX($X.$max_field);
};

records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    quality IN (0, 1, 4, 5, 9);
max_temp = max_by_group(filtered_records, year, temperature);
DUMP max_temp;

Macros can be defined in separate files from Pig scripts, in which case they need to be imported into any script that uses them. An import statement looks like this:

IMPORT './ch16-pig/src/main/pig/max_temp.macro';

User-Defined Functions
Pig's designers realized that the ability to plug in custom code is crucial for all but the most trivial data processing jobs.
For this reason, they made it easy to define and use
user-defined functions.
You can write UDFs in Java, Python, JavaScript, Ruby, or Groovy; the non-Java languages are run using the Java Scripting API.

A FilterFunc UDF to remove records with unsatisfactory temperature quality readings:

package com.hadoopbook.pig;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;

public class IsGoodQuality extends FilterFunc {

    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        // A filter function returns true for each tuple that should be kept
        if (tuple == null || tuple.size() == 0) {
            return false;
        }
        try {
            Object object = tuple.get(0);
            if (object == null) {
                return false;
            }
            int i = (Integer) object;
            // Keep readings whose quality code is one of the "good" values
            return i == 0 || i == 1 || i == 4 || i == 5 || i == 9;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}
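A sketch of registering and using the UDF from Grunt (the JAR name pig-examples.jar is an assumption for illustration):

grunt> REGISTER pig-examples.jar;
grunt> DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
grunt> filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);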