Pig: Building High-Level Dataflows Over Map-Reduce
• Pig Latin
• Example Generation
• Future Work
Data Processing Renaissance
Map-Reduce
Dryad
Map-Reduce
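[Figure: input records are read by map tasks, which emit key-value pairs (k1 v1, k2 v2, k1 v3, k2 v4, k1 v5); the pairs are grouped by key (k1: v1, v3, v5; k2: v2, v4) and passed to reduce tasks, which write the output records.]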
Just a group-by-aggregate?
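The map, group-by-key, reduce flow in the figure can be sketched in a few lines of plain Python. This is only an in-memory illustration, not how Hadoop runs jobs; the `map_reduce` helper and the sample records are assumptions made for the example.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Run map, group the emitted pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical input: one (key, value) pair per record.
records = [("k1", 1), ("k2", 2), ("k1", 3), ("k2", 4), ("k1", 5)]

totals = map_reduce(
    records,
    map_fn=lambda rec: [rec],               # identity map: emit the pair unchanged
    reduce_fn=lambda key, vals: sum(vals),  # aggregate each key's values
)
print(totals)
```

With an identity map and a sum as the reducer, the job is exactly a group-by-aggregate, which is the question the slide raises.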
The Map-Reduce Appeal
Pig Latin Data Model
Pig Latin's data model is fully nested: fields can hold complex, non-atomic
datatypes such as tuples, bags, and maps.
» Atom: Any single value, such as a string or a number.
Example − ‘raja’ or ‘30’
» Tuple: A record formed by an ordered set of fields.
Example − (Raja, 30)
» Bag: An unordered collection of tuples; duplicates are allowed.
Example − {(Raja, 30), (Mohammad, 45)}
» Map: A set of key-value pairs.
Example − [name#Raja, age#30]
» Relation: A bag of tuples. Relations in Pig Latin are unordered: there is
no guarantee that tuples are processed in any particular order.
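As a rough analogy, the same shapes can be written with Python built-ins. The mapping below (list for bag, dict for map) is an assumption chosen for illustration and is not how Pig stores data.

```python
# Atom: a single value.
name = "Raja"
age = 30

# Tuple: an ordered set of fields.
person = (name, age)

# Bag: an unordered collection of tuples that may contain duplicates.
# A list is used here because a Python set would drop the duplicate.
people = [("Raja", 30), ("Mohammad", 45), ("Raja", 30)]

# Map: key-value pairs, like Pig's [name#Raja, age#30].
info = {"name": "Raja", "age": 30}

# Full nesting: a tuple whose second field is a bag.
grouped = ("Raja", [t for t in people if t[0] == "Raja"])
print(grouped)
```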
Reading Data in Pig
Relation = LOAD 'Input file path information' USING load_function AS schema;
Where,
• Relation − The name of the relation into which the file content is
loaded.
• Input file path information − The path of the Hadoop file or directory
where the data is stored.
• load_function − One of the load functions that Apache Pig provides, such
as BinStorage, JsonLoader, PigStorage, or TextLoader. PigStorage is the
most commonly used because it suits structured, delimited text files.
• schema − The schema of the data, given in parentheses as a list of field
names with optional types.
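To make the roles of the delimiter and the schema concrete, here is a small Python sketch of what a delimited-text loader does conceptually. The `load` helper, the sample data, and the field names are assumptions for illustration; they are not Pig APIs.

```python
import io

def load(lines, delimiter, schema):
    """Split each line on the delimiter and cast fields per the schema."""
    relation = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        relation.append(tuple(cast(raw) for (_, cast), raw in zip(schema, fields)))
    return relation

# Hypothetical file content and schema, roughly what
#   LOAD '...' USING PigStorage(',') AS (name:chararray, age:int)
# would describe.
data = io.StringIO("Raja,30\nMohammad,45\n")
schema = [("name", str), ("age", int)]

people = load(data, ",", schema)
print(people)
```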
Operators in Pig
» Describe: Displays the schema of a relation.
Syntax: grunt> DESCRIBE Relation_name;
» Illustrate: Shows the step-by-step execution of a sequence of statements
on a small sample of the data.
Syntax: grunt> ILLUSTRATE Relation_name;
» Explain: Shows the logical, physical, and map-reduce execution plans of a
relation.
Syntax: grunt> EXPLAIN Relation_name;
» Filter: Selects the tuples of a relation that satisfy a condition.
Syntax: grunt> Relation2 = FILTER Relation1 BY (condition);
» Limit: Returns a limited number of tuples from a relation.
Syntax: grunt> Relation2 = LIMIT Relation1 number_of_tuples;
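The effect of Filter and Limit can be sketched in Python on an in-memory list. The relation and the age threshold below are made up for the example.

```python
from itertools import islice

# Hypothetical relation of (name, age) tuples.
people = [("Raja", 30), ("Mohammad", 45), ("Amy", 17), ("Fred", 52)]

# FILTER people BY (age >= 18): keep tuples that satisfy the condition.
adults = [t for t in people if t[1] >= 18]

# LIMIT adults 2: keep at most two tuples. Relations are unordered, so Pig
# does not promise which two; this sketch simply takes the first two.
first_two = list(islice(adults, 2))

print(adults)
print(first_two)
```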
Operators in Pig
» Distinct: Removes duplicate tuples from a relation.
Syntax: grunt> Relation2 = DISTINCT Relation1;
» Foreach: Generates the specified transformations of the column data for every tuple.
Syntax: grunt> Relation2 = FOREACH Relation1 GENERATE (required data);
» Group: Groups the data in one or more relations by a key.
Syntax: grunt> Group_data = GROUP Relation_name BY key_column;
Grouping by multiple columns:
Syntax: grunt> Group_data = GROUP Relation_name BY (column1, column2, column3, ...);
» Group All: Puts every tuple of the relation into a single group, for example to
compute an aggregate over the whole relation.
Syntax: grunt> group_all_data = GROUP Relation_name ALL;
» Cogroup: Works like Group, but is better suited to multiple relations, whereas Group
is better suited to a single relation.
Syntax: grunt> cogroup_data = COGROUP Relation1 BY column1, Relation2 BY column2;
» Join: Combines data from two or more relations. First we declare the keys, which
are fields from each relation. Tuples whose keys match are combined and appear in
the output; unmatched tuples are dropped.
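A small Python sketch may help separate Distinct, Group, Group All, and Cogroup. The `group` and `cogroup` helpers and the two sample relations are assumptions for illustration; dictionaries stand in for Pig's (key, bag) tuples.

```python
from collections import defaultdict

def group(relation, key):
    """GROUP relation BY key: one bag of tuples per distinct key."""
    bags = defaultdict(list)
    for t in relation:
        bags[key(t)].append(t)
    return dict(bags)

def cogroup(rel1, key1, rel2, key2):
    """COGROUP: one pair of bags per key seen in either relation."""
    g1, g2 = group(rel1, key1), group(rel2, key2)
    return {k: (g1.get(k, []), g2.get(k, [])) for k in g1.keys() | g2.keys()}

# Hypothetical relations.
visits = [("Amy", "cnn.com"), ("Amy", "frogs.com"), ("Fred", "cnn.com"),
          ("Amy", "cnn.com")]
ages = [("Amy", 30), ("Eve", 41)]

distinct = sorted(set(visits))                   # DISTINCT visits
by_user = group(visits, key=lambda t: t[0])      # GROUP visits BY user
everything = group(visits, key=lambda t: "all")  # GROUP visits ALL
both = cogroup(visits, lambda t: t[0], ages, lambda t: t[0])
print(by_user["Amy"], both["Eve"])
```

Note that Cogroup keeps a key even when one side has no tuples for it (the bag is simply empty), which is what distinguishes it from an inner join.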
Join Operator in Pig
» Self Join: Joins a relation with itself. We load the same data more than once
under different alias names.
Syntax: grunt> Relation3 = JOIN Relation1 BY key, Relation2 BY key;
» Inner Join: Compares two relations (say A and B) and returns a row only
when the keys match in both.
Syntax: grunt> Relation3 = JOIN Relation1 BY column1, Relation2 BY column2;
» Left Outer Join: Returns all the rows from the left relation, even if they
have no match in the right relation.
Syntax: grunt> Relation3 = JOIN Relation1 BY id LEFT OUTER, Relation2 BY column;
» Right Outer Join: Returns all the rows from the right relation, even if
they have no match in the left relation.
Syntax: grunt> Relation3 = JOIN Relation1 BY id RIGHT OUTER, Relation2 BY column;
» Full Outer Join: Returns all the rows from both relations, with nulls filled
in where a row has no match in the other relation.
Syntax: grunt> Relation3 = JOIN Relation1 BY id FULL OUTER, Relation2 BY column;
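The four join flavors differ only in what happens to unmatched rows, which a short Python sketch can show. The `join` helper and the sample data are assumptions; for brevity it merges the key into one field and uses `None` for null, whereas a Pig join keeps the fields of both inputs.

```python
def join(left, right, how="inner"):
    """Join (key, value) tuples on key; None stands in for Pig's null."""
    left_keys = {k for k, _ in left}
    right_keys = {k for k, _ in right}
    out = [(k, lv, rv) for k, lv in left for rk, rv in right if k == rk]
    if how in ("left", "full"):
        out += [(k, lv, None) for k, lv in left if k not in right_keys]
    if how in ("right", "full"):
        out += [(k, None, rv) for k, rv in right if k not in left_keys]
    return out

# Hypothetical relations keyed on id.
customers = [(1, "Amy"), (2, "Fred"), (3, "Eve")]
orders = [(1, "book"), (1, "pen"), (4, "lamp")]

print(join(customers, orders))           # inner
print(join(customers, orders, "left"))
print(join(customers, orders, "right"))
print(join(customers, orders, "full"))
```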
Other Operator in Pig
[Figure: dataflow for the example task. Steps shown: Group by url; Foreach url generate count; Load Url Info; Join on url; Group by category; Foreach category generate top10 urls.]
In Pig Latin
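visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
-- After a group, the key field is named "group"; function names such as COUNT are case-sensitive.
visitCounts = foreach gVisits generate group as url, COUNT(visits);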
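For readers without a Pig installation, the same group-and-count step can be sketched in Python. The sample tuples are invented stand-ins for the contents of '/data/visits'.

```python
from collections import Counter

# Hypothetical (user, url, time) tuples standing in for '/data/visits'.
visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "frogs.com", "8:05"),
    ("Fred", "cnn.com", "9:10"),
]

# group visits by url, then count the tuples in each group.
visit_counts = Counter(url for _user, url, _time in visits)
print(sorted(visit_counts.items()))
```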
The step-by-step method of creating a program in Pig is much cleaner and simpler to
use than the single block method of SQL. It is easier to keep track of what your
variables are, and where you are in the process of analyzing your data.
Jasmine Novak
Engineer, Yahoo!
With the various interleaved clauses in SQL, it is difficult to know what is actually
happening sequentially. With Pig, the data nesting and the temporary tables get
abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful.
David Ciemiewicz
Search Excellence, Yahoo!
Implementation
[Figure: implementation stack. Labels: user, SQL, automatic rewrite + optimize, Pig, Hadoop Map-Reduce.]
Pig is open-source.
http://hadoop.apache.org/pig
[Figure: the example dataflow compiled into Map-Reduce phases. Labels: Join on url, Reduce2, Map3, Group by category, Reduce3, Foreach category generate top10(urls).]
Other operations are pipelined into the map and reduce phases.
Optimizations: Using the Combiner
[Figure: input records are read by map tasks, which emit key-value pairs (k1 v1, k2 v2, k1 v3, k2 v4, k1 v5); the pairs are grouped by key and passed to reduce tasks, which write the output records.]
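The idea behind a combiner is to pre-aggregate each map task's output before it is sent to the reducers. The Python sketch below illustrates this for a count; the data and the `combine` helper are assumptions, and it says nothing about how Pig decides when a combiner applies.

```python
from collections import Counter

# Hypothetical output of two map tasks: (key, 1) pairs for a count.
map_outputs = [
    [("k1", 1), ("k2", 1), ("k1", 1)],
    [("k1", 1), ("k2", 1), ("k1", 1), ("k1", 1)],
]

def combine(pairs):
    """Pre-aggregate one map task's pairs before they are shuffled."""
    partial = Counter()
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

# Each map task ships one partial sum per key instead of one pair per record.
shuffled = [pair for task in map_outputs for pair in combine(task)]

# The reducer sums the partial sums; the result matches the uncombined count.
totals = Counter()
for key, value in shuffled:
    totals[key] += value

print(len(shuffled), dict(totals))
```

Here seven map-output pairs shrink to four before the shuffle. This works because summation can be applied in stages; an aggregate without that property could not be split this way.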
[Figure, two panels. Panel 1 labels: Filter bots, Group by state, Group by demographic, Reduce1. Panel 2 labels: Filter bots, Split, Group by state, Group by demographic, Demultiplex, Reduce1.]
Example Dataflow Program
Goal: find users that tend to visit high-pagerank pages.
• LOAD (user, url)
• LOAD (url, pagerank)
• FOREACH user, canonicalize(url)
• JOIN on url
• GROUP on user
• FOREACH user, AVG(pagerank)
• FILTER avgPR > 0.5
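The dataflow can be traced end to end with a short Python sketch. The tuples are the example values that appear on the later slides; the body of `canonicalize` is a guess at what such a UDF might do (the slides do not define it), and the in-memory lists stand in for the two loaded files.

```python
from collections import defaultdict

def canonicalize(url):
    """Hypothetical stand-in for the UDF: drop scheme and path, add 'www.'."""
    host = url.split("://")[-1].split("/")[0]
    return host if host.startswith("www.") else "www." + host

# Example tuples taken from the slides.
visits = [("Amy", "cnn.com"), ("Amy", "http://www.frogs.com"),
          ("Fred", "www.snails.com/index.html")]
pages = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3), ("www.snails.com", 0.4)]

# FOREACH user, canonicalize(url)
canon = [(user, canonicalize(url)) for user, url in visits]

# JOIN on url
joined = [(user, url, pr) for user, url in canon for purl, pr in pages if url == purl]

# GROUP on user
groups = defaultdict(list)
for user, url, pr in joined:
    groups[user].append(pr)

# FOREACH user, AVG(pagerank)
avg_pr = {user: sum(prs) / len(prs) for user, prs in groups.items()}

# FILTER avgPR > 0.5
result = {user: avg for user, avg in avg_pr.items() if avg > 0.5}
print(result)
```

Only Amy survives the filter: her average pagerank is about 0.6, while Fred's is 0.4.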
Iterative Process
[Figure: the dataflow under development. Steps shown: LOAD (user, url); LOAD (url, pagerank); FOREACH user, canonicalize(url); GROUP on user.]
No Output
How to do test runs?
[Figure: the dataflow annotated with example data at each step.]
Loaded (user, url) tuples:
(Amy, cnn.com)
(Amy, http://www.frogs.com)
(Fred, www.snails.com/index.html)
After FOREACH user, canonicalize(url) and JOIN on url:
(Amy, www.cnn.com, 0.9)
(Amy, www.frogs.com, 0.3)
(Fred, www.snails.com, 0.4)
After GROUP on user:
(Amy, {(Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3)})
0. Consistency
output example = operator applied on input example
[Figure: the dataflow annotated with example data. Input to FOREACH user, canonicalize(url): (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html). Output: (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com).]
Good Examples: Realism
1. Realism
[Figure: the dataflow annotated with example data. LOAD (user, url) yields (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html); FOREACH user, canonicalize(url) yields (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com).]
Good Examples: Completeness
2. Completeness
Demonstrate the salient properties of each operator, e.g., FILTER.
[Figure: the dataflow annotated with example data. FOREACH user, AVG(pagerank) yields (Amy, 0.6), (Fred, 0.4); FILTER avgPR > 0.5 yields (Amy, 0.6).]
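The consistency and completeness criteria can be phrased as simple checks. The Python sketch below is one possible reading, using the FILTER example from the slide; the function names and the specific completeness test for FILTER (at least one tuple kept and one dropped) are assumptions, not the deck's algorithm.

```python
def consistent(operator, input_example, output_example):
    """Consistency: the output example equals the operator applied to the input."""
    return operator(input_example) == output_example

def complete_for_filter(predicate, input_example):
    """Completeness for FILTER: at least one tuple passes and one is dropped."""
    results = [predicate(t) for t in input_example]
    return any(results) and not all(results)

# Example tuples from the slide: (user, avgPR).
avg_pr = [("Amy", 0.6), ("Fred", 0.4)]
keep = lambda t: t[1] > 0.5                      # FILTER avgPR > 0.5
apply_filter = lambda rel: [t for t in rel if keep(t)]

print(consistent(apply_filter, avg_pr, [("Amy", 0.6)]))
print(complete_for_filter(keep, avg_pr))
print(complete_for_filter(keep, [("Amy", 0.6)]))
```

The last call shows why an example set containing only Amy would be a poor illustration: nothing is filtered out, so the reader cannot see what FILTER does.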
Good Examples: Conciseness
3. Conciseness
[Figure: the dataflow annotated with example data. LOAD (user, url) yields (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html); FOREACH user, canonicalize(url) yields (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com).]
Implementation Status
• Columnar-storage layer
• Metadata repository
• Profiling and Performance Optimizations
• Tight integration with a scripting language
–Use loops, conditionals, functions of host language
• Memory Management
• Project Suggestions at:
http://wiki.apache.org/pig/ProposedProjects
Summary
Pig Latin
Sweet spot between map-reduce and SQL