Introduction To Pig: SESSION 2016-2017

SESSION 2016-2017
B.TECH (CSE) YEAR: IV SEMESTER: VIII

INTRODUCTION TO PIG
MODULE 3 (L1)
Presented By
Dept of Computer Engineering & Applications
GLA University India
Agenda
Learning Objectives
1. To study the key features and anatomy of Pig.
2. To study the execution modes of Pig.
3. To study the various relational operators in Pig.
Learning Outcomes
a) To understand clearly when to use and when NOT to use Pig.
b) To be able to differentiate between Pig and Hive.
Agenda
 What is Pig?
 Key Features of Pig
 The Anatomy of Pig

 Pig on Hadoop
 Pig Philosophy
 Pig Latin Overview
 Pig Latin Statements
 Pig Latin: Identifiers
 Pig Latin: Comments
 Data Types in Pig

 Simple Data Types
 Complex Data Types

Agenda
 Running Pig
 Execution Modes of Pig
 Relational Operators
 Eval Function
 Piggy Bank
 When to use Pig?
 When NOT to use Pig?
 Pig versus Hive
Apache Pig
 Introduction
 Why do we need it?
 Use Cases
 Philosophy
 Pig Latin - A Data Flow Language
What is Pig?
 Apache Pig is a platform for data analysis.
 It is an alternative to MapReduce programming.
 Pig was developed as a research project at Yahoo.
Features of Pig
 It provides an engine for executing data flows (how your data
should flow). Pig processes data in parallel on the Hadoop
cluster.
 It provides a language called “Pig Latin” to express data
flows.
 Pig Latin contains operators for many of the traditional data
operations such as join, filter, sort, etc.
 It allows users to develop their own functions (User Defined
Functions) for reading, processing, and writing data.
The Anatomy of Pig
 The main components of Pig are as follows:
 Data flow language (Pig Latin), which offers:
   Ease of programming. Scripts are easy to write, understand, and maintain (about 1/20th the lines of code and 1/16th the development time compared with writing MapReduce code directly).
   Optimization opportunities. The user can focus on semantics rather than efficiency.
   Extensibility. Users can create their own functions to do special-purpose processing.
 Interactive shell (Grunt), where you can type Pig Latin statements.
 Pig interpreter and execution engine.
Pig on Hadoop
 Pig runs on Hadoop.
 Pig uses both Hadoop Distributed File System and
MapReduce Programming.
 By default, Pig reads input files from HDFS. Pig stores the
intermediate data (data produced by MapReduce jobs) and the
output in HDFS.
 However, Pig can also read input from and place output to
other sources.
Pig Philosophy
 Pigs Eat Anything
 Pigs Live Anywhere
 Pigs Are Domestic Animals
 Pigs Fly
Use Case for Pig
 Pig is widely used for ETL (Extract, Transform and
Load)
 Pig can extract data from different sources such as
ERP, Accounting, Flat Files, etc.
 Pig uses operators to perform transformation on the
data and subsequently loads it into the data
warehouse.
Pig Latin Overview: Statements
 Pig Latin statements are generally ordered as follows:
1. A LOAD statement that reads data from the file system.
2. A series of statements that perform transformations.
3. A DUMP or STORE statement to display or store the result.
Here A is a relation and NOT a variable.
A = load 'student' as (rollno, name, gpa);
A = filter A by gpa > 4.0;
A = foreach A generate UPPER(name);
STORE A INTO 'myreport';
Pig Latin Overview: Comments
 Pig Latin supports two types of comments:
 Single-line comments that begin with "--" (two hyphens).
 Multiline comments that begin with "/*" and end with "*/".
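As a small illustration of the two comment styles, the sketch below annotates a short script. The file path and schema are borrowed from the student examples used elsewhere in these slides and are assumed here only for illustration.

```pig
/* Load the student file.
   The path and schema are assumed for illustration. */
A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

B = FILTER A BY gpa > 4.0;  -- keep only students with a GPA above 4.0

DUMP B;  -- display the filtered relation on the console
```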
Pig Latin Overview: Identifiers
 Valid Identifiers
 Y
 A1
 A1_2014
 Sample
Pig Latin Overview: Operators
Arithmetic   Comparison   Null          Boolean
+            ==           IS NULL       AND
-            !=           IS NOT NULL   OR
*            <                          NOT
/            >
%            <=
             >=
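The following sketch shows how the four operator groups might be combined in one script. The student file and schema are assumed from the other examples in this deck, and the output field names (gpa_times_ten, odd_flag) are hypothetical.

```pig
A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

-- Null, comparison, and Boolean operators in a single FILTER condition
B = FILTER A BY (gpa IS NOT NULL) AND (gpa >= 3.0) AND NOT (name == 'John');

-- Arithmetic operators inside FOREACH ... GENERATE
C = FOREACH B GENERATE rollno, name,
        gpa * 10.0 AS gpa_times_ten,  -- multiplication
        rollno % 2 AS odd_flag;       -- modulo: 1 for odd roll numbers, 0 for even

DUMP C;
```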
Data Types in PIG
 Simple Data Types
Name Description
int Whole numbers
long Large whole numbers
float Decimals
double Very precise decimals
chararray Text strings
bytearray Raw bytes
datetime Datetime
boolean true or false
 Complex Data Types
Name Description
Tuple An ordered set of fields.
Example: (2,3)
Bag A collection of tuples.
Example: {(2,3),(7,5)}
Map A set of key/value pairs.
Example: [open#Apache]
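To show how simple and complex types can appear together in one schema, here is a minimal sketch. The file name studentprofile.tsv and all field names are hypothetical; the input is assumed to use Pig's text notation for tuples "( )", bags "{ }", and maps "[ ]".

```pig
-- Hypothetical input file; each row holds a name, a tuple, a bag, and a map
S = LOAD '/pigdemo/studentprofile.tsv' AS (
        name:chararray,
        marks:tuple(theory:int, practical:int),
        courses:bag{c:tuple(title:chararray)},
        details:map[chararray]
    );

-- Dot notation reaches into a tuple; the # operator looks up a map key
T = FOREACH S GENERATE name, marks.theory, details#'city';

DUMP T;
```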
Running Pig
 Pig can run in two ways:
1. Interactive Mode
 Invoke the Grunt shell by typing pig at the command prompt.
 Type Pig Latin statements at the grunt> prompt:
A = load '/pigdemo/student.tsv' as (rollno, name, gpa);
DUMP A;
2. Batch Mode
 Create a Pig script to run Pig in batch mode.
 Write Pig Latin statements in a file and save it with the ".pig" extension.
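A batch script is simply the same statements saved in a file. The sketch below assumes a hypothetical file name, studentreport.pig, and a hypothetical output directory; the input path follows the other examples in this deck.

```pig
-- studentreport.pig (hypothetical file name)
-- Run it from the command line with:  pig studentreport.pig

A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

B = FILTER A BY gpa > 4.0;

-- Assumed output directory; STORE fails if the directory already exists
STORE B INTO '/pigdemo/output/toppers';
```

A script is usually ended with STORE rather than DUMP, so that the result is kept in the file system instead of being printed to the console.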
Execution Modes of Pig
 You can execute Pig in two modes:
1. Local Mode
 Your files must be in the local file system.
 pig -x local filename
2. MapReduce Mode (Default Mode)
 You need access to a Hadoop cluster and HDFS to read/write files.
 pig filename
Relational Operators
 FILTER
 FOREACH
 GROUP
 DISTINCT
 LIMIT
 ORDER BY
 JOIN
 UNION
 SPLIT
 SAMPLE
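ORDER BY, LIMIT, SAMPLE, and UNION do not get their own worked example in these slides, so here is a brief sketch of each. The student file and schema are assumed from the other examples; student2.tsv is a hypothetical second file with the same layout.

```pig
A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

-- ORDER BY: sort students by GPA, highest first
B = ORDER A BY gpa DESC;

-- LIMIT: keep only the first three tuples of the sorted relation
C = LIMIT B 3;

-- SAMPLE: pick a random sample of roughly 10% of the tuples
D = SAMPLE A 0.1;

-- UNION: merge the contents of two relations (hypothetical second file)
E = LOAD '/pigdemo/student2.tsv' AS (rollno:int, name:chararray, gpa:float);
F = UNION A, E;

DUMP C;
```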
FILTER-BY
Find the tuples of those students whose GPA is greater than 4.0.
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);

B = filter A by gpa > 4.0;
DUMP B;
FOREACH
 Display the name of all students in
uppercase.

B = foreach A generate UPPER (name);

DUMP B;
GROUP-BY
 Group tuples of students based on their GPA.

B = GROUP A BY gpa;

DUMP B;

DISTINCT
 To remove duplicate tuples of students.

B = DISTINCT A;

DUMP B;
JOIN-BY
To join two relations namely, “student” and “department”
based on the values contained in the “rollno” column.

B = load '/pigdemo/department.tsv' as (rollno:int, deptno:int,deptname:chararray);

C = JOIN A BY rollno, B BY rollno;

DUMP C;
SPLIT
 To partition a relation based on the GPAs acquired
by the students.
 If the GPA is equal to 4.0, place the tuple into relation X.
 If the GPA is less than 4.0, place the tuple into relation Y.

SPLIT A INTO X IF gpa == 4.0, Y IF gpa < 4.0;

DUMP X;
Eval Functions
 AVG
 MAX
 COUNT
AVG
 To calculate the average marks for each student.

A = load '/pigdemo/student.csv' USING PigStorage(',') as (studname:chararray, marks:int);

B = GROUP A BY studname;

C = FOREACH B GENERATE group, AVG(A.marks);

DUMP C;
MAX
 To calculate the maximum marks for each
student.
A = load '/pigdemo/student.csv' USING PigStorage(',') as (studname:chararray, marks:int);

B = GROUP A BY studname;

C = FOREACH B GENERATE group, MAX(A.marks);

DUMP C;
COUNT
 To count the number of elements in a bag
A = load '/pigdemo/student.csv' USING PigStorage(',') as (studname:chararray, marks:int);

B = GROUP A BY studname;

C = FOREACH B GENERATE group, COUNT(A);

DUMP C;
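Eval functions such as COUNT operate on a bag, so counting a whole relation first requires putting all of its tuples into one bag. A minimal sketch using GROUP ... ALL follows; the alias total_rows is hypothetical.

```pig
A = LOAD '/pigdemo/student.csv' USING PigStorage(',') AS (studname:chararray, marks:int);

-- GROUP ... ALL collects every tuple of A into a single bag
B = GROUP A ALL;

-- COUNT skips tuples whose first field is null; COUNT_STAR would include them
C = FOREACH B GENERATE COUNT(A) AS total_rows;

DUMP C;
```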
COMPLEX DATA TYPES: TUPLE
 TUPLE is an ordered collection of fields.
(John,12) (Jack,13)
(Joseph,5) (Smith,8)
(James, 7) (Scott,12)
A = LOAD '/root/pigdemos/studentdata.tsv' AS (t1:tuple(t1a:chararray, t1b:int), t2:tuple(t2a:chararray, t2b:int));

B = FOREACH A GENERATE t1.t1a, t1.t1b, t2.$0, t2.$1;

DUMP B;
COMPLEX DATA TYPES: MAP
MAP represents a key/value pair.
John [city#Bangalore]
Jack [city#Pune]
James [city#Chennai]
A = load '/root/pigdemos/studentcity.tsv' USING PigStorage() as (studname:chararray, m:map[chararray]);

B = foreach A generate m#'city' as CityName:chararray;

DUMP B;
PIGGY BANK
 Pig users can use Piggy Bank functions in their Pig Latin scripts, and they can also share their own functions in Piggy Bank.
 Objective: To use the Piggy Bank string UPPER function.
register '/root/pigdemos/piggybank-0.12.0.jar';

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);

upper = foreach A generate org.apache.pig.piggybank.evaluation.string.UPPER(name);
DUMP upper;
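Typing the fully qualified class name on every call quickly becomes tedious. One common approach is to give the function a short alias with DEFINE, as in the sketch below. The alias PB_UPPER is hypothetical, and the student file and schema are assumed from the earlier examples.

```pig
REGISTER '/root/pigdemos/piggybank-0.12.0.jar';

-- Give the Piggy Bank function a short alias (hypothetical name)
DEFINE PB_UPPER org.apache.pig.piggybank.evaluation.string.UPPER();

A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

B = FOREACH A GENERATE PB_UPPER(name);

DUMP B;
```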
USER-DEFINED FUNCTIONS (UDF)
 Pig allows you to create your own functions for complex analysis.
 Write a Java class and package it into a ".jar" file so that the function can be used in your script.
register '/root/pigdemos/myudfs.jar';

A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);

upper = foreach A generate myudfs.UPPER(name);
DUMP upper;
PARAMETER SUBSTITUTION
 Pig allows you to pass parameters at runtime.

 To execute the script, type the following command at the command prompt:
pig -param student=/pigdemo/student.tsv parameterdemo.pig
 Contents of parameterdemo.pig:
A = load '$student' as (rollno:int, name:chararray, gpa:float);

DUMP A;
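A script can also supply a fallback value for a parameter, so that it still runs when -param is omitted. The sketch below assumes the same parameter name and path as the example above and uses Pig's %default preprocessor statement.

```pig
-- parameterdemo.pig with a fallback value for the parameter
-- A value passed with -param on the command line overrides the default
%default student '/pigdemo/student.tsv'

A = LOAD '$student' AS (rollno:int, name:chararray, gpa:float);

DUMP A;
```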
DIAGNOSTIC OPERATOR
 DESCRIBE: It returns the schema of a relation.
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);

DESCRIBE A;
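DESCRIBE is most useful after an operator that changes the shape of the data, such as GROUP. The sketch below also shows EXPLAIN and ILLUSTRATE, two further Pig diagnostic operators that these slides do not cover. The student file and schema are assumed as above.

```pig
A = LOAD '/pigdemo/student.tsv' AS (rollno:int, name:chararray, gpa:float);

B = GROUP A BY gpa;

-- Schema of the grouped relation: the group key plus a bag named A
DESCRIBE B;

-- Print the logical, physical, and MapReduce execution plans for B
EXPLAIN B;

-- Walk a small sample of the data through each step of the script
ILLUSTRATE B;
```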
WORD COUNT EXAMPLE IN PIG
lines = LOAD '/root/pigdemo/lines.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
 TOKENIZE splits each line into a bag of words, with one single-field tuple per word.
 FLATTEN takes the bag of tuples returned by TOKENIZE and produces a separate record for each one; the single field in each record is named word.
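A natural next step is to list only the most frequent words. The sketch below extends the word count with ORDER BY and LIMIT; the aliases total, sorted, and top10 are hypothetical, as is the cutoff of ten.

```pig
lines = LOAD '/root/pigdemo/lines.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;

-- Name the output fields so that they can be referenced when sorting
wordcount = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;

-- Sort by frequency, highest first, and keep the ten most frequent words
sorted = ORDER wordcount BY total DESC;
top10 = LIMIT sorted 10;

DUMP top10;
```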
When to use Pig?
 Pig can be used in the following situations:
1. When data loads are time sensitive.
2. When processing various data sources.
3. When analytical insights are required through sampling.
When NOT to use Pig?
 Pig should not be used in the following situations:
1. When data is completely unstructured such as
video, text, and audio.
2. When there is a time constraint because Pig is
slower than MapReduce jobs.
PIG at YAHOO
 Yahoo uses Pig for two things:
1. In pipelines, to fetch log data from its web servers and cleanse it by removing the company's internal views and clicks.
2. In research, where scripts are used to test a theory. Pig makes it possible to integrate Perl or Python scripts, which can then be executed on a huge dataset.
Pig Vs. Hive
Features              Pig                             Hive
Used By               Programmers and researchers     Analysts
Used For              Programming                     Reporting
Language              Procedural data flow language   SQL-like
Suitable For          Semi-structured data            Structured data
Schema / Types        Implicit                        Explicit
UDF Support           Yes                             Yes
Join / Order / Sort   Yes                             Yes
DFS Direct Access     Yes (explicit)                  Yes (implicit)
Web Interface         No                              Yes
Partitions            No                              Yes
Shell                 Yes                             Yes
Fill in the blanks
1. Pig is a ___________ language.
2. In Pig, ___________ is used to specify data flow.
3. Pig provides an ___________ to execute data flow.
4. ___________, ___________ are execution modes of Pig.
5. Pig is used in ___________ process.

Answers
1. Scripting
2. Pig Latin
3. Pig Engine
4. Local Mode, MapReduce Mode
5. ETL
