Apache Pig
70% of production jobs at Yahoo! (tens of thousands per day)
Also used at Twitter, LinkedIn, eBay, AOL, …
Used to
• Process web logs
• Build user behavior models
• Process images
• Build maps of the web
• Do research on large data sets
Requirements
Mandatory
Unix and Windows users need the following:
Hadoop 2.X - http://hadoop.apache.org/common/releases.html
(You can run Pig with different versions of Hadoop by setting HADOOP_HOME to point to the directory where you have installed Hadoop. If you do not set HADOOP_HOME, by default Pig will run with the embedded version, currently Hadoop 2.7.3.)
Java 1.7 - http://java.sun.com/javase/downloads/index.jsp
(Set JAVA_HOME to the root of your Java installation.)
Optional
Python 2.7 - https://www.python.org (when using
Streaming Python UDFs)
Two Main Components
• High-level language (Pig Latin): a set of commands
• An execution engine that compiles Pig Latin scripts into MapReduce jobs
Samples:
• Tuple: a row in a database
  (0002576169, Tom, 20, 4.0)
• Bag: a table or view in a database
  {(0002576169, Tom, 20, 4.0),
   (0002576170, Mike, 20, 3.6),
   (0002576171, Lucy, 19, 4.0),
   …}
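A minimal sketch of how such a bag arises in practice (the file name 'students.txt' and the field types are assumptions for illustration):
students = LOAD 'students.txt' AS (id:chararray, name:chararray, age:int, gpa:double);
DUMP students;
-- prints the bag of tuples, one per input row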
● Common design patterns as keywords (joins, distinct, counts)
● Data flow analysis
● A script can map to multiple map-reduce jobs
● Avoids Java-level errors (not everyone can write Java code)
● Can be run in interactive mode
● Issue commands and get results, as in the sketch below
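A minimal interactive (Grunt shell) session might look like this (file name, schema, and threshold are illustrative assumptions):
grunt> emps = LOAD 'employees.txt' AS (id, name, salary);
grunt> highpaid = FILTER emps BY salary > 50000;
grunt> DUMP highpaid;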
LOAD
• Loads data from an HDFS file
var = LOAD 'employees.txt';
var = LOAD 'employees.txt' AS (id, name, salary);
var = LOAD 'employees.txt' USING PigStorage() AS (id, name, salary);
Each LOAD statement defines a new bag
• Each tuple in the bag can have multiple elements (atoms)
• Each element can be referenced by name or by position ($n)
A bag is immutable
A bag can be aliased and referenced later
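For example, a sketch of referencing elements by name and by position (assuming the schema-carrying LOAD above):
-- 'name' is $1 and 'salary' is $2; both styles can be mixed
B = FOREACH var GENERATE $0, name, salary;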
STORE
• Writes output to an HDFS file in a specified directory
grunt> STORE processed INTO 'processed_txt';
Fails if directory exists
Writes output files, part-[m|r]-xxxxx, to the directory
• PigStorage can be used to specify a field
delimiter
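For instance (the output directory name is illustrative):
grunt> STORE processed INTO 'processed_csv' USING PigStorage(',');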
DUMP
• Writes output to the screen
grunt> DUMP processed;
FOREACH
• Applies expressions to every record in a bag
FILTER
• Filters by expression
GROUP
• Collect records with the same key
ORDER BY
• Sorting
DISTINCT
• Removes duplicates
Use the FOREACH … GENERATE operator to work with rows of data, call functions, etc.
Basic syntax:
alias2 = FOREACH alias1 GENERATE expression;
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = FOREACH alias1 GENERATE col1, col2;
DUMP alias2;
(1,2) (4,2) (8,3) (4,3) (7,2) (8,4)
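Expressions can also compute derived fields; a sketch on the same input:
alias3 = FOREACH alias1 GENERATE col1 + col2 AS total;
DUMP alias3;
(3) (6) (11) (7) (9) (12)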
Use the FILTER operator to restrict tuples or
rows of data
Basic syntax:
alias2 = FILTER alias1 BY expression;
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = FILTER alias1 BY (col1 == 8) OR (NOT
(col2+col3 > col1));
DUMP alias2;
(4,2,1) (8,3,4) (7,2,5) (8,4,3)
Use the GROUP operator to group data
• Use GROUP when only one relation is involved
• Use COGROUP when multiple relations are involved
Basic syntax:
alias2 = GROUP alias1 BY field_alias;
alias2 = GROUP alias1 ALL;
Example:
DUMP alias1;
(John,18,4.0F) (Mary,19,3.8F) (Bill,20,3.9F)
(Joe,18,3.8F)
alias2 = GROUP alias1 BY col2;
DUMP alias2;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
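GROUP … ALL, by contrast, collects every record into a single group whose key is the literal 'all'; a sketch on the same data (ordering inside the bag is illustrative):
alias3 = GROUP alias1 ALL;
DUMP alias3;
(all,{(John,18,4.0F),(Mary,19,3.8F),(Bill,20,3.9F),(Joe,18,3.8F)})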
Use the ORDER … BY operator to sort a relation based on one or more fields
Basic syntax:
alias = ORDER alias BY field_alias [ASC|DESC];
Example:
DUMP alias1;
(1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
alias2 = ORDER alias1 BY col3 DESC;
DUMP alias2;
(7,2,5) (8,3,4) (1,2,3) (4,3,3) (8,4,3) (4,2,1)
Use the DISTINCT operator to remove
duplicate tuples in a relation.
Basic syntax:
alias2 = DISTINCT alias1;
Example:
DUMP alias1;
(8,3,4) (1,2,3) (4,3,3) (4,3,3) (1,2,3)
alias2 = DISTINCT alias1;
DUMP alias2;
(8,3,4) (1,2,3) (4,3,3)
FLATTEN
• Used to un-nest tuples as well as bags
INNER JOIN
• Used to perform an inner join of two or more relations
based on common field values
OUTER JOIN
• Used to perform left, right or full outer joins
SPLIT
• Used to partition the contents of a relation into two or
more relations
SAMPLE
• Used to select a random data sample with the stated
sample size
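Minimal sketches of SPLIT and SAMPLE, reusing the relation and column names from the earlier examples:
-- partition alias1 on a predicate
SPLIT alias1 INTO small IF col1 < 5, large IF col1 >= 5;
-- keep roughly 10% of the tuples, chosen at random
sampled = SAMPLE alias1 0.1;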
Use the JOIN operator to perform an inner equi-join of two or more relations based on common field values
The JOIN operator always performs an inner join
Inner joins ignore null keys
• Filter null keys before the join
JOIN and COGROUP operators perform
similar functions
• JOIN creates a flat set of output records
• COGROUP creates a nested set of output records
DUMP Alias1;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
DUMP Alias2;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
Join Alias1 by Col1 to Alias2 by Col1:
Alias3 = JOIN Alias1 BY Col1, Alias2 BY Col1;
DUMP Alias3;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
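For the COGROUP contrast (and the null-key filtering idiom mentioned above), a sketch on the same relations:
-- inner joins silently drop null keys; filter them explicitly if desired
Clean1 = FILTER Alias1 BY Col1 IS NOT NULL;
-- COGROUP keeps one nested bag per input relation instead of flattening
Alias4 = COGROUP Alias1 BY Col1, Alias2 BY Col1;
-- e.g. key 4 yields (4,{(4,2,1),(4,3,3)},{(4,6),(4,9)})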
Use the OUTER JOIN operator to perform left,
right, or full outer joins
• Pig Latin syntax closely adheres to the SQL standard
The keyword OUTER is optional
• The keywords LEFT, RIGHT, and FULL imply left outer, right outer, and full outer joins, respectively
Outer joins will only work when the relations that need to produce nulls (in the case of non-matching keys) have schemas
Outer joins will only work for two-way joins
• To perform a multi-way outer join, chain multiple two-way outer join statements (see the sketch after the examples below)
Left Outer Join
• A = LOAD 'a.txt' AS (n:chararray, a:int);
• B = LOAD 'b.txt' AS (n:chararray, m:chararray);
• C = JOIN A by $0 LEFT OUTER, B BY $0;
Full Outer Join
• A = LOAD 'a.txt' AS (n:chararray, a:int);
• B = LOAD 'b.txt' AS (n:chararray, m:chararray);
• C = JOIN A BY $0 FULL OUTER, B BY $0;
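Multi-Way Outer Join (sketch; the third relation D and file 'c.txt' are hypothetical)
• D = LOAD 'c.txt' AS (n:chararray, p:int);
• AB = JOIN A BY $0 LEFT OUTER, B BY $0;
• ABD = JOIN AB BY $0 LEFT OUTER, D BY $0;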
UDFs are natively written in Java and packaged as a jar file
• Other supported languages include Jython, JavaScript, Ruby, Groovy, and Python
Register the jar with the REGISTER
statement
Optionally, alias it with the DEFINE
statement
REGISTER /src/myfunc.jar;
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
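Aliasing the UDF with DEFINE, a sketch using the same hypothetical jar and class:
REGISTER /src/myfunc.jar;
DEFINE MyEval myfunc.MyEvalFunc();
A = LOAD 'students';
B = FOREACH A GENERATE MyEval($0);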
DEFINE can be used to work with UDFs and
also streaming commands
• Useful when dealing with complex input/output
formats
/* read and write comma-delimited data */
DEFINE Y `stream.pl` INPUT(stdin USING PigStreaming(','))
OUTPUT(stdout USING PigStreaming(','));
A = STREAM X THROUGH Y;
Batch mode
pig -x local wordcount.pig
Interactive mode
grunt> Lines = LOAD 'input.txt' AS (line: chararray);
grunt> Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> Groups = GROUP Words BY word;
grunt> counts = FOREACH Groups GENERATE group, COUNT(Words);
grunt> DUMP counts;
TOKENIZE returns a new bag for each input; FLATTEN eliminates bag nesting
A: {line1, line2, line3, …}
After TOKENIZE: {{line1word1, line1word2, …}, {line2word1, line2word2, …}}
After FLATTEN: {line1word1, line1word2, line2word1, …}
K-means is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
Assignment step: assign each observation to the cluster with the closest mean
Update step: recompute each cluster mean from the observations assigned to it (the embedded Pig script below computes this with AVG)
Reference: http://en.wikipedia.org/wiki/K-means_clustering
PC = Pig.compile("""register udf.jar
DEFINE find_centroid FindCentroid('$centroids');
students = load 'student.txt' as (name:chararray, age:int, gpa:double);
centroided = foreach students generate gpa, find_centroid(gpa) as centroid;
grouped = group centroided by centroid;
result = foreach grouped generate group, AVG(centroided.gpa);
store result into 'output';
""")
from math import fabs  # needed for the distance computation below

while iter_num < MAX_ITERATION:
    PCB = PC.bind({'centroids': initial_centroids})
    results = PCB.runSingle()
    iter = results.result("result").iterator()
    centroids = [None] * v
    distance_move = 0.0
    # get the new centroids of this iteration and the moving distance from the last iteration
    for i in range(v):
        tuple = iter.next()
        centroids[i] = float(str(tuple.get(1)))
        distance_move = distance_move + fabs(last_centroids[i] - centroids[i])
    distance_move = distance_move / v
    if distance_move < tolerance:
        converged = True
        break
    ……
What is a UDF?
• A way to do an operation on a field or fields
• Called from within a Pig script
• Currently all done in Java
Why use a UDF?
• You need to do more than grouping or filtering
• Actually, filtering is itself a UDF
• You may be more comfortable in Java land than in SQL/Pig Latin
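For instance, a hypothetical FilterFunc invoked from a script (jar, class, and relation names are assumptions):
REGISTER myudfs.jar;                  -- assumption: jar containing the UDF
DEFINE IsAdult myudfs.IsAdult();      -- hypothetical FilterFunc
adults = FILTER students BY IsAdult(age);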
Pig does not support flow-control statements: if/else, while loop, for loop, etc.
The Pig embedding API can leverage all language features provided by Python, including control flow:
• Loops and exit criteria
• Similar to the database embedding API
• Easier parameter passing
JavaScript is available as well
The framework is extensible. Any JVM
implementation of a language could be
integrated
1. Get and set up the hands-on VM from:
http://salsahpc.indiana.edu/ScienceCloud/virtualbox_appliance_guide.html
2. cd pigtutorial/pig-hands-on/
3. tar -xf pig-kmeans.tar
4. cd pig-kmeans
5. export PIG_CLASSPATH=/opt/pig/lib/jython-2.5.0.jar
6. hadoop dfs -copyFromLocal input.txt ./input.txt
7. pig -x mapreduce kmeans.py
8. pig -x local kmeans.py
2012-07-14 14:51:24,636 [main] INFO org.apache.pig.scripting.BoundScript - Query to
run:
register udf.jar
DEFINE find_centroid FindCentroid('0.0:1.0:2.0:3.0');
students = load 'student.txt' as (name:chararray, age:int, gpa:double);
centroided = foreach students generate gpa, find_centroid(gpa) as centroid;
grouped = group centroided by centroid;
result = foreach grouped generate group, AVG(centroided.gpa);
store result into 'output';
Tera = 10^12
Giga = 10^9
Mega = 10^6
1. Search engine system for the summer school
2. Gives an example of how to use MapReduce technologies to solve a big data challenge
3. Uses Hadoop/HDFS/HBase/Pig
4. Indexed 656K web pages (540 MB in size) selected from the ClueWeb09 data set
5. Calculates ranking values for 2 million web sites
[Architecture diagram]
• Inverted Indexing System: Apache Lucene
• Web UI: PHP script
• HBase tables: 1. inverted index table, 2. page rank table (accessed via Hive/Pig scripts)
• Ranking System: Pig script
• Hadoop Cluster on FutureGrid
Read a file from HDFS in the given input format (delimited text) and define the run-time schema:
A = load '$widerow' using PigStorage('\u0001') as (name: chararray, c0: int, c1: int, c2: int);