Pig Slides

Apache Pig is a scripting language for exploring large datasets that allows users to express data flows in Pig Latin scripts. Pig Latin scripts describe multi-step transformations that Pig executes by translating them into MapReduce jobs. This allows users to focus on the logic of their data analysis without needing to write MapReduce programs directly.

 Apache Pig raises the level of abstraction for processing large datasets.
 Pig is a scripting language for exploring large datasets.
Why Pig?
 To overcome the long development cycle of MapReduce jobs:
1. writing the mappers and reducers,
2. compiling and packaging the code,
3. submitting the job(s), and retrieving the results.
Pig Components
 Pig is made up of two pieces:
• The language used to express data flows, called Pig Latin.
• The execution environment to run Pig Latin programs:
1. local execution in a single JVM
2. distributed execution on a Hadoop cluster
 A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output.

 The operations describe a data flow, which the Pig execution environment translates into an executable representation and then runs.
 Pig turns the transformations into a series of MapReduce jobs, though as a programmer you are mostly unaware of this.
Installing and Running Pig
 Pig runs as a client-side application either locally or
on a Hadoop cluster.
 Pig launches jobs and interacts with HDFS
(or other Hadoop filesystems) from your workstation.

 Download a stable release from http://pig.apache.org/releases.html, and unpack the tarball in a suitable directory on your workstation:
% tar xzf pig-x.y.z.tar.gz
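 After unpacking, it is convenient to put Pig's binaries on your command-line path, for example (the install location here is illustrative):
% export PIG_HOME=~/sw/pig-x.y.z
% export PATH=$PATH:$PIG_HOME/bin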
Pig Execution Types
 Local mode
% pig -x local
grunt>
 MapReduce mode (distributed execution), with a choice of execution engines:
• Hadoop MapReduce
• Apache Tez
• Apache Spark
 To use MapReduce mode, you first need to check that
the version of Pig you downloaded is compatible with
the version of Hadoop you are using.
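 The execution type is selected with the -x (or -exectype) option; Tez and Spark support depends on the Pig version you are running:
% pig -x mapreduce
% pig -x local
% pig -x tez
% pig -x spark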
Connect to Hadoop cluster
 Set the cluster's addresses in pig.properties in Pig's conf directory:
fs.defaultFS=hdfs://localhost/
mapreduce.framework.name=yarn
yarn.resourcemanager.address=localhost:8032
Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and MapReduce mode:
Script -- A script file containing Pig commands; the -e option lets you specify a command as a string.
Grunt -- An interactive shell for running Pig commands.
Embedded -- You can run Pig programs from Java using the PigServer class, much like you can use JDBC to run SQL programs from Java.
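 For example, you can run a script file, execute a single command passed as a string with -e, or start the interactive Grunt shell (the script and path names here are hypothetical):
% pig max_temp.pig
% pig -e 'fs -ls input'
% pig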
Pig example script
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    quality IN (0, 1, 4, 5, 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);
DUMP max_temp;
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>>     AS (year:chararray, temperature:int, quality:int);
 For simplicity, the program assumes that the input is tab-delimited text, with each line having just year, temperature, and quality fields.
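 A sample input consistent with this session would look like the following (tab-separated; reconstructed from the DUMP output below, not necessarily the exact file):
1950    0       1
1950    22      1
1950    -11     1
1949    111     1
1949    78      1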
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>>     quality IN (0, 1, 4, 5, 9);
grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,78,1),(1949,111,1)})
(1950,{(1950,-11,1),(1950,22,1),(1950,0,1)})
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray,filtered_records: {year: chararray,temperature: int,quality: int}}
grunt> max_temp = FOREACH grouped_records GENERATE group,
>>     MAX(filtered_records.temperature);
 FOREACH processes every row to generate a derived set of rows, using a GENERATE clause to define the fields in each derived row.
grunt> DUMP max_temp;
(1949,111)
(1950,22)
Pig vs Databases
 Pig Latin is a data flow programming language, whereas SQL is a declarative programming language.
 A Pig Latin program is a step-by-step set of operations on an input relation, in which each step is a single transformation.
 SQL statements are a set of constraints that, taken together, define the output.
 Writing Pig Latin is like working at the level of an RDBMS query planner, which figures out how to turn a declarative statement into a system of steps.
 RDBMSs store data in tables, with tightly predefined schemas. Pig is more relaxed about the data that it processes: you can define a schema at runtime, but it's optional.
 Pig's nested data structures make Pig Latin more customizable than most SQL dialects.
 Pig does not support random reads, nor does it support random writes to update small portions of data.
 In Pig, all writes are bulk streaming writes, just like with MapReduce.
 Pig is able to work with Hive tables using HCatalog.
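 For example, loading a Hive table through HCatalog looks roughly like this (a sketch; the table name is hypothetical, and the loader's package name differs across HCatalog releases):
records = LOAD 'default.sample_table' USING org.apache.hive.hcatalog.pig.HCatLoader();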


Pig Latin
 A Pig Latin program consists of a collection of statements.
 A statement can be thought of as an operation or a command.
 For example, a GROUP operation is a type of statement:
grouped_records = GROUP records BY year;
 Statements that have to be terminated with a semicolon can be split across multiple lines for readability:
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
 Pig Latin has two forms of comments. Double hyphens are used for single-line comments.
 Everything from the first hyphen to the end of the line is ignored by the Pig Latin interpreter:
-- My program
DUMP A; -- What's in A?
 C-style comments are more flexible, since they delimit the beginning and end of the comment block with /* and */ markers. They can span lines or be embedded in a single line:
/*
 * Description of my program spanning
 * multiple lines.
 */
A = LOAD 'input/pig/join/A';
B = LOAD 'input/pig/join/B';
C = JOIN A BY $0, /* ignored */ B BY $1;
DUMP C;
 Pig Latin has a list of keywords that have a special meaning in the language and cannot be used as identifiers.
 These include the operators (LOAD, ILLUSTRATE), commands (cat, ls), expressions (matches, FLATTEN), and functions (DIFF, MAX).
 Pig Latin has mixed rules on case sensitivity.
 Operators and commands are not case sensitive (to make interactive use more forgiving);
 however, aliases and function names are case sensitive.
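 For example (a minimal sketch, with a hypothetical input path): the operators below can be written in any case, but A and a name two distinct relations, and the built-in function must be written MAX, not max:
A = LOAD 'input/data' AS (year:chararray, temp:int);
a = GROUP A BY year;
B = foreach a generate group, MAX(A.temp);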
 When a Pig Latin program is executed, each statement is parsed in turn.
 If there are syntax errors or other (semantic) problems, such as undefined aliases, the interpreter will halt and display an error message.
 The interpreter builds a logical plan for every relational operation, which forms the core of a Pig Latin program.
 The logical plan for the statement is added to the logical plan for the program so far, and then the interpreter moves on to the next statement.
 No data processing takes place while the logical plan of the program is being constructed.
 The trigger for Pig to start execution is the DUMP statement.
 At that point, the logical plan is compiled into a physical plan and executed.
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    quality IN (0, 1, 4, 5, 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);
DUMP max_temp;
 Pig validates the GROUP and FOREACH...GENERATE statements, and adds them to the logical plan without executing them.
 The trigger for Pig to start execution is the DUMP statement. At that point, the logical plan is compiled into a physical plan and executed.
 The physical plan that Pig prepares is a series of MapReduce jobs, which Pig runs in the local JVM (local mode) or on a Hadoop cluster (MapReduce mode).
 The STORE statement should be used when the size of
the output is more than a few lines, as it writes to a file
rather than to the console.
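 For example (the output path here is hypothetical):
STORE max_temp INTO 'output/max_temp';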
 Diagnostic operators (DESCRIBE, EXPLAIN, and ILLUSTRATE) are provided to allow the user to interact with the logical plan, for debugging purposes:
Operator (Shortcut)  Description
DESCRIBE (\de)       Prints a relation's schema
EXPLAIN (\e)         Prints the logical and physical plans
ILLUSTRATE (\i)      Shows a sample execution of the logical plan, using a generated subset of the input
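 For example, applied to the max_temp relation built earlier (output omitted):
grunt> DESCRIBE max_temp;
grunt> EXPLAIN max_temp;
grunt> ILLUSTRATE max_temp;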
Pig Latin also provides three statements (REGISTER, DEFINE, and IMPORT) that make it possible to incorporate macros and user-defined functions into Pig scripts:
Statement  Description
REGISTER   Registers a JAR file with the Pig runtime
DEFINE     Creates an alias for a macro, UDF, streaming script, or command specification
IMPORT     Imports macros defined in a separate file into a script
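 For example (a sketch; the JAR name is hypothetical, and IsGoodQuality is the UDF shown at the end of these slides):
REGISTER pig-examples.jar;
DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
IMPORT './ch16-pig/src/main/pig/max_temp.macro';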
 Pig provides commands to interact with Hadoop filesystems and MapReduce, as well as a few utility commands.
 Table 16-4. Pig Latin commands
Category           Command        Description
Hadoop filesystem  cat            Prints the contents of one or more files
                   cd             Changes the current directory
                   copyFromLocal  Copies a local file or directory to a Hadoop filesystem
                   copyToLocal    Copies a file or directory on a Hadoop filesystem to the local filesystem
                   cp             Copies a file or directory to another directory
                   fs             Accesses Hadoop's filesystem shell
                   ls             Lists files
                   mkdir          Creates a new directory
                   mv             Moves a file or directory to another directory
                   pwd            Prints the path of the current working directory
                   rm             Deletes a file or directory
                   rmf            Forcibly deletes a file or directory (does not fail if the file or directory does not exist)
Hadoop MapReduce   kill           Kills a MapReduce job
Utilities          clear          Clears the screen in Grunt
                   exec           Runs a script in a new Grunt shell in batch mode
                   help           Shows the available commands and options
                   history        Prints the query statements run in the current Grunt session
                   quit (\q)      Exits the interpreter
                   run            Runs a script within the existing Grunt shell
                   set            Sets Pig options and MapReduce job properties
                   sh             Runs a shell command from within Grunt
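 For example, from the Grunt shell: fs gives access to the full Hadoop filesystem shell, ls lists files, and sh runs a local shell command (a sketch):
grunt> fs -ls /
grunt> ls
grunt> sh date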
 A relation in Pig may have an associated schema, which gives the fields in the relation names and types:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: int,temperature: int,quality: int}
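 Schemas are optional: if the AS clause is omitted, the fields are unnamed and untyped and can be referenced only by position. A sketch (the exact wording of the "unknown" message may vary by Pig version):
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt';
grunt> DESCRIBE records;
Schema for records unknown.
grunt> projected = FOREACH records GENERATE $0, $1;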
 Functions in Pig come in four types:
Eval function
 A function that takes one or more expressions and returns another expression.
 An example of a built-in eval function is MAX, which returns the maximum value of the entries in a bag.
Filter function
 A special type of eval function that returns a logical Boolean result. As the name suggests, filter functions are used in the FILTER operator to remove unwanted rows.
Load function
 A function that specifies how to load data into a relation from external storage.
Store function
 A function that specifies how to save the contents of a relation to external storage.
 Often, load and store functions are implemented by the same type.
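 For example, the built-in PigStorage is both a load and a store function for delimited text (the paths and delimiter here are illustrative):
records = LOAD 'input/data.csv' USING PigStorage(',')
    AS (year:chararray, temperature:int, quality:int);
STORE records INTO 'output/data-tab' USING PigStorage();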
Macros
 Macros provide a way to package reusable pieces of Pig Latin code from within Pig Latin itself. For example, we can extract the part of our Pig Latin program that performs grouping on a relation and then finds the maximum value in each group by defining a macro as follows:
DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
    A = GROUP $X BY $group_key;
    $Y = FOREACH A GENERATE group, MAX($X.$max_field);
};
 The macro is used like this:
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9);
max_temp = max_by_group(filtered_records, year, temperature);
DUMP max_temp;
 Macros can be defined in separate files from Pig scripts, in which case they need to be imported into any script that uses them.
 An import statement looks like this:
IMPORT './ch16-pig/src/main/pig/max_temp.macro';
User-Defined Functions
 Pig's designers realized that the ability to plug in custom code is crucial for all but the most trivial data processing jobs.
 For this reason, they made it easy to define and use user-defined functions.
 You can write UDFs in Java, Python, JavaScript, Ruby, or Groovy; the non-Java languages are run using the Java Scripting API.
 A FilterFunc UDF to remove records with unsatisfactory temperature quality readings (a minimal exec() implementation is sketched here; it accepts the same quality codes used in the FILTER examples):
package com.hadoopbook.pig;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class IsGoodQuality extends FilterFunc {
  @Override
  public Boolean exec(Tuple tuple) throws IOException {
    if (tuple == null || tuple.size() == 0) {
      return false; // nothing to test
    }
    try {
      Object object = tuple.get(0);
      if (object == null) {
        return false; // treat a missing quality value as bad
      }
      int i = (Integer) object;
      // the "good" quality codes from the earlier FILTER examples
      return i == 0 || i == 1 || i == 4 || i == 5 || i == 9;
    } catch (ExecException e) {
      throw new IOException(e);
    }
  }
}
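 Once the class is compiled and packaged into a JAR (the name below is an assumption), it can be registered and invoked from Grunt:
grunt> REGISTER pig-examples.jar;
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>>     com.hadoopbook.pig.IsGoodQuality(quality);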
