Apache Pig

Apache Pig is a platform for analyzing large data sets. It uses a scripting language called Pig Latin to express data analysis programs. Pig Latin programs are compiled into sequences of MapReduce jobs, which can process very large datasets in parallel. Companies such as Yahoo, Twitter, and LinkedIn use Pig to process large volumes of data more quickly than with traditional approaches.

http://pig.apache.org/

Prof. EZZAHOUT Abderrahmane

Master IPS 2021


Apache Pig is a platform for analyzing
large data sets
Pig consists of a high-level language for
expressing data analysis programs
Pig is coupled with infrastructure for
evaluating these programs.
The salient property of Pig programs is that
their structure is amenable to substantial
parallelization, which in turn enables them
to handle very large data sets.
Pig offers a scripting language called "Pig
Latin"
Its instructions describe data-processing steps:
each command transforms the data flow that
passes through it.
Pig Latin also makes it possible to build
much more varied, non-linear data-processing
pipelines.
Pig translates Pig Latin programs into
MapReduce jobs and integrates the results
into the flow
Comments are placed between /*...*/ or run from -- to
the end of the line.
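A minimal sketch of both comment styles (the file name is hypothetical):
/* load the raw data
   (a multi-line comment) */
A = LOAD 'data.txt';  -- a single-line comment runs to the end of the line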
A Pig Latin program is a series of instructions.
Each must be terminated with a semicolon ( ; ).
As in SQL, there is no notion of variables, nor
functions / procedures.
The result of each Pig statement is a collection of
tuples, called a relation.
We can see it as a database table.
Each Pig instruction takes an input relation and
produces a new output relation.
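As a minimal sketch (file name and fields are hypothetical), each statement below consumes the relation produced by the one before it:
A = LOAD 'data.txt' AS (x:int);  -- relation A from the input file
B = FILTER A BY x > 0;           -- takes A, produces relation B
C = ORDER B BY x DESC;           -- takes B, produces relation C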
Faster development
• Fewer lines of code (Writing map reduce like writing SQL
queries)
• Re-use code (Pig library, Piggybank)
One test: find the top 5 words with the highest frequency
• 10 lines of Pig Latin vs. 200 lines in Java
• 15 minutes in Pig Latin vs. 4 hours in Java
[Bar charts: lines of code and development time in minutes, Pig Latin vs. Java]
70% of production jobs at Yahoo (tens of
thousands per day)
Twitter, LinkedIn, Ebay, AOL,…
Used to
• Process web logs
• Build user behavior models
• Process images
• Build maps of the web
• Do research on large data sets
Requirements
Mandatory
Unix and Windows users need the following:
Hadoop 2.X -
http://hadoop.apache.org/common/releases.html
(You can run Pig with different versions of Hadoop by
setting HADOOP_HOME to point to the directory
where you have installed Hadoop. If you do not set
HADOOP_HOME, by default Pig will run with the
embedded version, currently Hadoop 2.7.3.)
Java 1.7 -
http://java.sun.com/javase/downloads/index.jsp
(set JAVA_HOME to the root of your Java installation)
Optional
Python 2.7 - https://www.python.org (when using
Streaming Python UDFs)
Two Main Components
High-level language (Pig Latin)
• Set of commands
Execution environment
• Two execution modes
• Local: reads/writes to local file system
• MapReduce: connects to a Hadoop
cluster and reads/writes to HDFS
Lines = LOAD 'input/hadoop.log' AS (line: chararray);
-- extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
Filtered = FILTER Words BY word MATCHES '\\w+';
-- create a group for each word
Groups = GROUP Filtered BY word;
-- count the entries in each group
Counts = FOREACH Groups GENERATE group, COUNT(Filtered);
-- order the records by count
Results = ORDER Counts BY $1 DESC;
Top5 = LIMIT Results 5;
STORE Top5 INTO '/output/top5words';
Support for
• Grouping
• Joins
• Filtering
• Aggregation
Extensibility
• Support for User Defined Functions (UDFs)
Leverages the same massive parallelism
as native MapReduce
Pig Latin statements are the basic constructs you
use to process data using Pig. A Pig Latin
statement is an operator that takes a relation as
input and produces another relation as output.
(This definition applies to all Pig Latin operators
except LOAD and STORE which read data from
and write data to the file system.) Pig Latin
statements may
include expressions and schemas. Pig Latin
statements can span multiple lines and must end
with a semi-colon ( ; ). By default, Pig Latin
statements are processed using multi-query
execution.
Pig Latin statements are generally
organized as follows:
A LOAD statement to read data from the
file system.
A series of "transformation" statements to
process the data.
A DUMP statement to view results or a
STORE statement to save the results.
Note that a DUMP or STORE statement
is required to generate output.
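A minimal skeleton of this organization (the input path is hypothetical):
Lines = LOAD '/input/input.txt' AS (line:chararray);  -- read from the file system
Unique = DISTINCT Lines;                              -- a transformation
DUMP Unique;        -- view results (or: STORE Unique INTO '/output/unique';)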
Scalar Types:
• int, long, float, double, boolean, null, chararray, bytearray;
Complex Types: fields, tuples, bags, relations;
• A Field is a piece of data
• A Tuple is an ordered set of fields
• A Bag is a collection of tuples
• A Relation is a bag

Samples:
• Tuple: like a row in a database
(0002576169, Tome, 20, 4.0)
• Bag: like a table or view in a database
{(0002576169, Tome, 20, 4.0),
(0002576170, Mike, 20, 3.6),
(0002576171, Lucy, 19, 4.0), ...}
● Common design patterns as key words
(joins, distinct, counts)
● Data flow analysis
● A script can map to multiple map-reduce jobs
● Avoids Java-level errors (not everyone
can write Java code)
● Can be run in interactive mode
● Issue commands and get results
Loads data from an HDFS file
var = LOAD 'employees.txt';
var = LOAD 'employees.txt' AS (id, name, salary);
var = LOAD 'employees.txt' USING PigStorage() AS (id, name, salary);
Each LOAD statement defines a new bag
• Each bag can have multiple elements (atoms)
• Each element can be referenced by name or position
($n)
A bag is immutable
A bag can be aliased and referenced later
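A hedged sketch of both reference styles, assuming a comma-delimited employees.txt:
emp = LOAD 'employees.txt' USING PigStorage(',') AS (id, name, salary);
names = FOREACH emp GENERATE name;   -- reference by name
high = FILTER emp BY $2 > 50000;     -- reference by position ($2 = salary)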
STORE
• Writes output to an HDFS file in a specified directory
grunt> STORE processed INTO 'processed_txt';
• Fails if the directory exists
• Writes output files, part-[m|r]-xxxxx, to the directory
• PigStorage can be used to specify a field delimiter
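For example, to store with a comma delimiter (the directory name is hypothetical):
grunt> STORE processed INTO 'processed_csv' USING PigStorage(',');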
DUMP
• Write output to screen
grunt> DUMP processed;
FOREACH
• Applies expressions to every record in a bag
FILTER
• Filters by expression
GROUP
• Collect records with the same key
ORDER BY
• Sorting
DISTINCT
• Removes duplicates
Use the FOREACH …GENERATE operator
to work with rows of data, call functions, etc.
Basic syntax:
alias2 = FOREACH alias1 GENERATE expression;
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = FOREACH alias1 GENERATE col1, col2;
DUMP alias2;
(1,2) (4,2) (8,3) (4,3) (7,2) (8,4)
Use the FILTER operator to restrict tuples or
rows of data
Basic syntax:
alias2 = FILTER alias1 BY expression;
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = FILTER alias1 BY (col1 == 8) OR (NOT
(col2+col3 > col1));
DUMP alias2;
(4,2,1) (8,3,4) (7,2,5) (8,4,3)
Use the GROUP…ALL operator to group data
• Use GROUP when only one relation is involved
• Use COGROUP when multiple relations are involved
Basic syntax:
alias2 = GROUP alias1 ALL;
alias2 = GROUP alias1 BY field_alias;
Example:
DUMP alias1;
(John,18,4.0F) (Mary,19,3.8F) (Bill,20,3.9F)
(Joe,18,3.8F)
alias2 = GROUP alias1 BY col2;
DUMP alias2;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
Use the ORDER…BY operator to sort
a relation based on one or more fields
Basic syntax:
alias = ORDER alias BY field_alias [ASC|DESC];
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = ORDER alias1 BY col3 DESC;
DUMP alias2;
(7,2,5) (8,3,4) (1,2,3) (4,3,3) (8,4,3) (4,2,1)
Use the DISTINCT operator to remove
duplicate tuples in a relation.
Basic syntax:
alias2 = DISTINCT alias1;
Example:
DUMP alias1;
(8,3,4) (1,2,3) (4,3,3) (4,3,3) (1,2,3)
alias2= DISTINCT alias1;
DUMP alias2;
(8,3,4) (1,2,3) (4,3,3)
FLATTEN
• Used to un-nest tuples as well as bags
INNER JOIN
• Used to perform an inner join of two or more relations
based on common field values
OUTER JOIN
• Used to perform left, right or full outer joins
SPLIT
• Used to partition the contents of a relation into two or
more relations
SAMPLE
• Used to select a random data sample with the stated
sample size
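A brief sketch of the last two operators (relation and field names hypothetical):
SPLIT alias1 INTO small IF col1 < 5, big IF col1 >= 5;  -- partition into two relations
sampled = SAMPLE alias1 0.1;  -- keep roughly 10% of the tuples at random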
Use the JOIN operator to perform an inner,
equi-join of two or more relations based
on common field values
The JOIN operator always performs an inner
join
Inner joins ignore null keys
• Filter null keys before the join
JOIN and COGROUP operators perform
similar functions
• JOIN creates a flat set of output records
• COGROUP creates a nested set of output records
DUMP Alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
DUMP Alias2;
(2,4) (8,9) (1,3) (2,7) (2,9) (4,6) (4,9)
Join Alias1 by Col1 to Alias2 by Col1:
Alias3 = JOIN Alias1 BY Col1, Alias2 BY Col1;
DUMP Alias3;
(1,2,3,1,3) (4,2,1,4,6) (4,3,3,4,6) (4,2,1,4,9) (4,3,3,4,9) (8,3,4,8,9) (8,4,3,8,9)
Use the OUTER JOIN operator to perform left,
right, or full outer joins
• Pig Latin syntax closely adheres to the SQL standard
The keyword OUTER is optional
• keywords LEFT, RIGHT and FULL will imply left outer,
right outer and full outer joins respectively
Outer joins will only work provided the relations
which need to produce nulls (in the case of non-
matching keys) have schemas
Outer joins will only work for two-way joins
• To perform a multi-way outer join perform multiple two-
way outer join statements
Left Outer Join
• A = LOAD 'a.txt' AS (n:chararray, a:int);
• B = LOAD 'b.txt' AS (n:chararray, m:chararray);
• C = JOIN A by $0 LEFT OUTER, B BY $0;
Full Outer Join
• A = LOAD 'a.txt' AS (n:chararray, a:int);
• B = LOAD 'b.txt' AS (n:chararray, m:chararray);
• C = JOIN A BY $0 FULL OUTER, B BY $0;
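To outer-join more than two relations, chain two-way joins. A sketch reusing A and B above, with a hypothetical third relation C2:
C2 = LOAD 'c.txt' AS (n:chararray, p:int);
AB = JOIN A BY $0 LEFT OUTER, B BY $0;
ABC = JOIN AB BY $0 LEFT OUTER, C2 BY $0;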
Natively written in Java, packaged as a jar
file
• Other languages include Jython, JavaScript, Ruby,
Groovy, and Python
Register the jar with the REGISTER
statement
Optionally, alias it with the DEFINE
statement
REGISTER /src/myfunc.jar;
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
DEFINE can be used to work with UDFs and
also streaming commands
• Useful when dealing with complex input/output
formats
/* read and write comma-delimited data */
DEFINE Y 'stream.pl' INPUT(stdin USING PigStreaming(','))
OUTPUT(stdout USING PigStreaming(','));
A = STREAM X THROUGH Y;

/* Define UDFs to a more readable format */


DEFINE MAXNUM org.apache.pig.piggybank.evaluation.math.MAX;
A = LOAD 'student_data' AS (name:chararray, gpa1:float, gpa2:double);
B = FOREACH A GENERATE name, MAXNUM(gpa1, gpa2);
DUMP B;
1. Accessing Pig
2. Basic Pig knowledge: (Word Count)
1. Pig Data Types
2. Pig Operations
3. How to run Pig Scripts
3. Advanced Pig features: (Kmeans
Clustering)
1. Embedding Pig within Python
2. User Defined Function
Accessing approaches:
• Batch mode: submit a script directly
• Interactive mode: Grunt, the pig shell
• PigServer Java class, a JDBC-like
interface
Execution mode:
• Local mode: pig -x local
• MapReduce mode: pig -x mapreduce
Loading data
• LOAD loads input data
• Lines = LOAD 'input/access.log' AS (line: chararray);
Projection
• FOREACH … GENERATE … (similar to SELECT)
• takes a set of expressions and applies them to every
record.
Grouping
• GROUP collects together records with the same key
Dump/Store
• DUMP displays results to screen, STORE save results
to file system
Aggregation
• AVG, COUNT, MAX, MIN, SUM
Pig Data Loader
• PigStorage: loads/stores relations using field-delimited text format
students = LOAD 'student.txt' USING PigStorage('\t')
    AS (studentid:int, name:chararray, age:int, gpa:double);
-- sample tuples: (John,18,4.0F) (Mary,19,3.8F) (Bill,20,3.9F)
• TextLoader: loads relations from a plain-text
format
• BinStorage: loads/stores relations from or to
binary files
• PigDump: stores relations by writing the
toString() representation of tuples, one per line
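Hedged usage sketches for the other loaders, reusing the students relation above (file names hypothetical):
raw = LOAD 'logs.txt' USING TextLoader() AS (line:chararray);
STORE students INTO 'students.bin' USING BinStorage();
STORE students INTO 'students.dump' USING PigDump();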
Foreach ... Generate
• The Foreach … Generate statement iterates over
the members of a bag
studentid = FOREACH students GENERATE studentid, name;
• The result of a Foreach is another bag
• Elements are named as in the input bag
Fields are referred to by positional
notation or by name (alias).
students = LOAD 'student.txt' USING PigStorage() AS (name:chararray, age:int,
gpa:float);
DUMP students;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
studentname = FOREACH students GENERATE $0 AS studentname;

                First Field   Second Field   Third Field
Data type       chararray     int            float
Positional      $0            $1             $2
notation
Name (alias)    name          age            gpa
Field value     Tom           19             3.9
Groups the data in one or more relations
• The GROUP and COGROUP operators are identical.
• Both operators work with one or more relations.
• For readability, GROUP is used in statements
involving one relation
• COGROUP is used in statements involving two or
more relations.
B = GROUP A BY age;
-- jointly group the tuples from A and B:
C = COGROUP A BY name, B BY name;
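For instance, assuming A = (name, age) and B = (name, score) with hypothetical values, dumping C shows one tuple per key with a nested bag from each relation:
DUMP C;
(John,{(John,18)},{(John,90)})
(Mary,{(Mary,19)},{})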
DUMP Operator:
• display output results; will always trigger execution
STORE Operator:
• Pig will parse the entire script prior to writing, for
efficiency purposes
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'apple';
C = FILTER A BY $1 == 'apple';
STORE B INTO 'output/b';
STORE C INTO 'output/c';
Relations B and C are both derived from A.
Previously this would create two MapReduce jobs; multi-query
execution lets Pig share the work in a single job.
Compute the number of elements in a
bag
Use the COUNT function to compute the
number of elements in a bag.
COUNT requires a preceding GROUP
ALL statement for global counts and
GROUP BY statement for group counts.
X = FOREACH B GENERATE
COUNT(A);
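For a global count, a minimal sketch (the input name is hypothetical):
A = LOAD 'data' AS (f1:int);
B = GROUP A ALL;                  -- a single group containing every tuple
X = FOREACH B GENERATE COUNT(A);  -- number of elements in bag A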
Sorts a relation based on one or more
fields
In Pig, relations are unordered. If you
order relation A to produce relation X
relations A and X still contain the same
elements.

student = ORDER students BY gpa DESC;


Local mode
• Local host and local file system is used
• Neither Hadoop nor HDFS is required
• Useful for prototyping and debugging
MapReduce mode
• Run on a Hadoop cluster and HDFS
Batch mode - run a script directly
• pig -x local my_pig_script.pig
• pig -x mapreduce my_pig_script.pig
Interactive mode - use the Pig shell to run scripts
• grunt> Lines = LOAD '/input/input.txt' AS (line:chararray);
• grunt> Unique = DISTINCT Lines;
• grunt> DUMP Unique;
1. Get and setup the hands-on VM from:
http://salsahpc.indiana.edu/ScienceCloud/virtualbox_appliance_guide.html
2. cd pigtutorial/pig-hands-on/
3. tar -xf pig-wordcount.tar
4. cd pig-wordcount

5. Batch mode
6. pig -x local wordcount.pig

7. Interactive mode
8. grunt> Lines = LOAD 'input.txt' AS (line: chararray);
9. grunt> Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
10. grunt> Groups = GROUP Words BY word;
11. grunt> counts = FOREACH Groups GENERATE group, COUNT(Words);
12. grunt> DUMP counts;
TOKENIZE returns a new bag for each input;
FLATTEN eliminates bag nesting
A: {line1, line2, line3, ...}
After TOKENIZE: {{line1word1, line1word2, ...}, {line2word1, line2word2, ...}}
After FLATTEN: {line1word1, line1word2, line2word1, ...}
A method of cluster analysis which aims to
partition n observations into k clusters in which
each observation belongs to the cluster with the
nearest mean.
Assignment step: assign each observation to the
cluster with the closest mean.
Update step: calculate the new means to be the
centroid of the observations in the cluster.

Reference: http://en.wikipedia.org/wiki/K-means_clustering
PC = Pig.compile("""register udf.jar
DEFINE find_centroid FindCentroid('$centroids');
students = load 'student.txt' as (name:chararray, age:int,
gpa:double);
centroided = foreach students generate gpa, find_centroid(gpa) as
centroid;
grouped = group centroided by centroid;
result = Foreach grouped Generate group,
AVG(centroided.gpa);
store result into 'output';
""")
while iter_num < MAX_ITERATION:
    PCB = PC.bind({'centroids': initial_centroids})
    results = PCB.runSingle()
    iter = results.result("result").iterator()
    centroids = [None] * v
    distance_move = 0.0
    # get the new centroids of this iteration, and calculate the moving
    # distance relative to the last iteration
    for i in range(v):
        tuple = iter.next()
        centroids[i] = float(str(tuple.get(1)))
        distance_move = distance_move + fabs(last_centroids[i] - centroids[i])
    distance_move = distance_move / v
    if distance_move < tolerance:
        converged = True
        break
    ...
What is a UDF
• Way to do an operation on a field or fields
• Called from within a pig script
• Currently all done in Java
Why use UDF
• You need to do more than grouping or filtering
• Actually filtering is a UDF
• Maybe more comfortable in Java land than in
SQL/Pig Latin
P = Pig.compile("""register udf.jar
DEFINE find_centroid FindCentroid('$centroids');
Pig does not support flow control statements:
if/else, while loops, for loops, etc.
The Pig embedding API can leverage all
language features provided by Python,
including control flow:
• Loop and exit criteria
• Similar to the database embedding API
• Easier parameter passing
JavaScript is available as well
The framework is extensible. Any JVM
implementation of a language could be
integrated
1. Get and setup the hands-on VM from:
http://salsahpc.indiana.edu/ScienceCloud/virtualbox_appliance_guide.html
2. cd pigtutorial/pig-hands-on/
3. tar -xf pig-kmeans.tar
4. cd pig-kmeans
5. export PIG_CLASSPATH=/opt/pig/lib/jython-2.5.0.jar
6. hadoop dfs -copyFromLocal input.txt ./input.txt
7. pig -x mapreduce kmeans.py
8. pig -x local kmeans.py
2012-07-14 14:51:24,636 [main] INFO org.apache.pig.scripting.BoundScript - Query to
run:
register udf.jar
DEFINE find_centroid FindCentroid('0.0:1.0:2.0:3.0');
students = load 'student.txt' as (name:chararray, age:int, gpa:double);
centroided = foreach students generate gpa, find_centroid(gpa) as centroid;
grouped = group centroided by centroid;
result = foreach grouped generate group, AVG(centroided.gpa);
store result into 'output';

Input(s): Successfully read 10000 records (219190 bytes) from:
"hdfs://iw-ubuntu/user/developer/student.txt"

Output(s): Successfully stored 4 records (134 bytes) in:
"hdfs://iw-ubuntu/user/developer/output"

last centroids: [0.371927835052,1.22406743491,2.24162171881,3.40173705722]


Peta = 10^15
Tera = 10^12
Giga = 10^9
Mega = 10^6
1. Search Engine System for Summer
School
2. To give an example of how to use
MapReduce technologies to solve a big
data challenge.
3. Using Hadoop/HDFS/HBase/Pig
4. Indexed 656K web pages (540MB in
size) selected from Clueweb09 data set.
5. Calculate ranking values for 2 million
web sites.
[System architecture diagram]
• Web UI: PHP script on an Apache server, with a Thrift client
• Thrift server connecting to HBase
• HBase tables: 1. inverted index table, 2. page rank table
• Apache Lucene inverted indexing system
• Hive/Pig scripts
• Pig script ranking system on a Hadoop cluster on FutureGrid
• Salsa Portal
-- Read file from HDFS; the input format is text, tab delimited.
-- Define the run-time schema.
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query);
clean1 = FILTER raw BY id > 20 AND id < 100;
clean2 = FOREACH clean1 GENERATE user, time,
    org.apache.pig.tutorial.sanitze(query) AS query;
user_groups = GROUP clean2 BY (user, query);
user_query_counts = FOREACH user_groups
    GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time);
-- Store the output in a file: text, comma delimited.
STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(',');
Keywords
• Load, Filter, Foreach Generate, Group By,
Store, Join, Distinct, Order By, …
Aggregations
• Count, Avg, Sum, Max, Min
Schema
• Defined at query-time, not when files are loaded
UDFs
Packages for common input/output
formats
-- The script can take arguments; data are ctrl-A ('\u0001') delimited.
-- Define the types of the columns.
A = LOAD '$widerow' USING PigStorage('\u0001') AS (name:chararray, c0:int, c1:int, c2:int);
B = GROUP A BY name PARALLEL 10;  -- request 10 reduce tasks
C = FOREACH B GENERATE group, SUM(A.c0) AS c0, SUM(A.c1) AS c1, AVG(A.c2) AS c2;
D = FILTER C BY c0 > 100 AND c1 > 100 AND c2 > 100;
STORE D INTO '$out';
