Introduction To Pig: SESSION 2016-2017
MODULE 3 (L1)
Presented By
Dept of Computer Engineering & Applications
GLA University India
Agenda
Learning Objectives Learning Outcomes
Introduction to Pig
What is Pig?
Key Features of Pig
Running Pig
Execution Modes of Pig
Relational Operators
Eval Function
Piggy Bank
When to use Pig?
When NOT to use Pig?
Pig versus Hive
Apache Pig
Introduction
Why do we need it?
Use Cases
Philosophy
Pig Latin - A Data Flow Language
What is Pig?
Apache Pig is a platform for data analysis.
It is an alternative to MapReduce programming.
Pig was developed as a research project at Yahoo.
Features of Pig
It provides an engine for executing data flows (how your data
should flow). Pig processes data in parallel on the Hadoop
cluster.
It provides a language called “Pig Latin” to express data
flows.
Pig Latin contains operators for many of the traditional data
operations such as join, filter, sort, etc.
It allows users to develop their own functions (User Defined
Functions) for reading, processing, and writing data.
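The join, filter, and sort operators mentioned above can be sketched in a few lines of Pig Latin. This is a minimal sketch, not a definitive script: the student file and its schema are the ones used later in these slides, while the department file, its schema, and the 8.0 threshold are hypothetical and chosen only for illustration.

```pig
-- student.tsv appears elsewhere in the slides; department.tsv is a hypothetical input
student = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);
dept = LOAD '/pigdemo/department.tsv' AS (rollno:int, deptname:chararray);

-- filter: keep students with a GPA above 8.0 (illustrative threshold)
toppers = FILTER student BY gpa > 8.0;

-- join: match each remaining student to a department on rollno
joined = JOIN toppers BY rollno, dept BY rollno;

-- project the fields of interest, then sort by GPA, highest first
result = FOREACH joined GENERATE toppers::name AS name,
                                 dept::deptname AS deptname,
                                 toppers::gpa AS gpa;
sorted = ORDER result BY gpa DESC;

DUMP sorted;
```

Each statement names a new relation built from an earlier one, which is what makes the script read as a data flow.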
The Anatomy of Pig
The main components of Pig are as follows:
Data flow language (Pig Latin).
Ease of programming. Pig Latin scripts are easy to write, understand, and
maintain (reportedly about 1/20th the lines of code and 1/16th the
development time of equivalent Java MapReduce programs).
Optimization opportunities. The user can focus on semantics
rather than efficiency.
Extensibility. Users can create their own functions to do
special-purpose processing.
Interactive shell where you can type Pig Latin statements
(Grunt).
Pig interpreter and execution engine.
Pig on Hadoop
Pig runs on Hadoop.
Pig uses both the Hadoop Distributed File System (HDFS) and
MapReduce programming.
By default, Pig reads input files from HDFS. Pig stores the
intermediate data (data produced by MapReduce jobs) and the
output in HDFS.
However, Pig can also read input from and place output to
other sources.
Pig Philosophy
Pigs Eat Anything
Pigs Live Anywhere
Pigs are Domestic Animals
Pigs Fly
Use Case for Pig
Pig is widely used for ETL (Extract, Transform and
Load)
Pig can extract data from different sources such as
ERP, Accounting, Flat Files, etc.
Pig uses operators to perform transformation on the
data and subsequently loads it into the data
warehouse.
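The extract, transform, and load steps described above might look like the following minimal sketch. The input file, its schema, and the output directory are all hypothetical; a real pipeline would read from whatever export the source system produces.

```pig
-- Extract: read a comma-separated flat file (hypothetical path and schema)
raw = LOAD '/pigdemo/sales.csv' USING PigStorage(',')
      AS (orderid:int, region:chararray, amount:double);

-- Transform: drop incomplete rows and normalise the region name
valid = FILTER raw BY orderid IS NOT NULL AND amount IS NOT NULL;
clean = FOREACH valid GENERATE orderid, UPPER(TRIM(region)) AS region, amount;

-- Load: write tab-separated output where a warehouse load job can pick it up
STORE clean INTO '/pigdemo/output/sales_clean' USING PigStorage('\t');
```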
Pig Latin Overview:
Statements
Pig Latin Statements are generally ordered as follows:
1. LOAD statement that reads data from the file system.
2. Series of statements to perform transformations.
3. DUMP or STORE to display/store result.
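The ordering above can be sketched as follows, reusing the student file and schema that appear later in these slides. The relation names B and C, the 8.0 threshold, and the output path are illustrative assumptions.

```pig
-- 1. LOAD: read the input from the file system into relation A
A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

-- 2. Transformations: keep rows above an illustrative threshold, then project two fields
B = FILTER A BY gpa > 8.0;
C = FOREACH B GENERATE name, gpa;

-- 3. DUMP prints the result on screen; STORE writes it to the file system
DUMP C;
-- STORE C INTO '/pigdemo/output/toppers';
```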
Here, A is a relation and NOT a variable.
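Running Pig
You can run Pig in two ways:
1. Interactive Mode
Type Pig Latin statements at the Grunt shell prompt.
2. Batch Mode
Create a Pig script to run Pig in batch mode.
Write Pig Latin statements in a file and save it with a “.pig” extension.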
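A batch script is simply a file of Pig Latin statements. The following is a minimal sketch, assuming the student file used elsewhere in these slides; the file name student_avg.pig and the relation names are hypothetical.

```pig
-- student_avg.pig (hypothetical file name)
-- Run from the command line instead of typing the statements into Grunt:
--   pig student_avg.pig

A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

-- Collect all rows into a single group and compute the average GPA
G = GROUP A ALL;
avg_gpa = FOREACH G GENERATE AVG(A.gpa);

DUMP avg_gpa;
```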
Execution Modes of Pig
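You can execute Pig in two modes:
1. Local Mode
You need to have your files in the local file system.
pig -x local filename
2. MapReduce Mode
You need to have your files in HDFS. This is the default mode.
pig -x mapreduce filename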
(John,12) (Jack,13)
(Joseph,5) (Smith,8)
(James, 7) (Scott,12)
John [city#Bangalore]
Jack [city#Pune]
James [city#Chennai]
PIGGY BANK
register '/root/pigdemos/piggybank-0.12.0.jar';
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
upper = foreach A generate
org.apache.pig.piggybank.evaluation.string.UPPER(name);
DUMP upper;
USER-DEFINED FUNCTIONS (UDF)
register '/root/pigdemos/myudfs.jar';
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
upper = foreach A generate myudfs.UPPER(name);
DUMP upper;
PARAMETER SUBSTITUTION
DUMP wordcount;
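A script ending in the DUMP wordcount statement above might look like the following minimal sketch. It assumes a word-count job whose input path is supplied as a parameter named input, written as `$input` in the script and filled in with the `-param` option of the pig command. The script name, the parameter name, and the sample path are hypothetical.

```pig
-- wordcount.pig (hypothetical file name)
-- The input parameter is substituted before the script runs, for example:
--   pig -param input=/pigdemo/lines.txt wordcount.pig

-- TextLoader reads each line of the file as a single chararray field
lines = LOAD '$input' USING TextLoader() AS (line:chararray);

-- Split each line into words, one word per row
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words and count the rows in each group
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;

DUMP wordcount;
```

Because the path is not hard-coded, the same script can be reused on different inputs by changing only the command line.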