Pig: Building High-Level Dataflows over Map-Reduce
Utkarsh Srivastava
Research & Cloud Computing
Data Processing Renaissance
Internet companies swimming in data
• E.g. TBs/day at Yahoo!
Data analysis is “inner loop” of product innovation
Data analysts are skilled programmers
Data Warehousing …?
Scale: Often not scalable enough
$$$$: Prohibitively expensive at web scale
• Up to $200K/TB
SQL: Little control over execution method
• Query optimization is hard
  – Parallel environment
  – Little or no statistics
  – Lots of UDFs
New Systems For Data Analysis
• Map-Reduce / Apache Hadoop
• Dryad
• …
Map-Reduce
[Diagram: input records → map tasks emit (key, value) pairs → values grouped by key → reduce tasks → output records]
Just a group-by-aggregate?
The Map-Reduce Appeal
Scale: Scalable due to simpler design
• Only parallelizable operations
• No transactions
$: Runs on cheap commodity hardware
SQL: Procedural control, a processing "pipe"
Disadvantages
1. Extremely rigid data flow (Map → Reduce)
• Other flows (joins, unions, splits, chains) constantly hacked in
2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
Pros And Cons
Need a high-level, general data flow language
Enter Pig Latin
Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin
• Compilation into Map-Reduce
• Example Generation
• Future Work
Example Data Analysis Task
Find the top 10 most visited pages in each category
Visits:
  User  Url         Time
  Amy   cnn.com     8:00
  Amy   bbc.com     10:00
  Amy   flickr.com  10:05
  Fred  cnn.com     12:00

Url Info:
  Url         Category  PageRank
  cnn.com     News      0.9
  bbc.com     News      0.8
  flickr.com  Photos    0.7
  espn.com    Sports    0.9
Data Flow
Load Visits → Group by url → Foreach url generate count
Load Url Info
(both) → Join on url → Group by category → Foreach category generate top10(urls)
In Pig Latin
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
Step-by-step Procedural Control
Target users are entrenched procedural programmers
"The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data."
  - Jasmine Novak, Engineer, Yahoo!
"With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it's more powerful."
  - David Ciemiewicz, Search Excellence, Yahoo!
• Automatic query optimization is hard
• Pig Latin does not preclude optimization
Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
(Operates directly over files)
Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
(Schemas are optional; can be assigned dynamically)
User-Code as a First-Class Citizen
User-defined functions (UDFs) can be used in every construct
• Load, Store
• Group, Filter, Foreach

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
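As a sketch of how user code plugs in (the jar and class names below are hypothetical, not from the talk): a Java UDF is registered, given an alias, and then invoked inside foreach and filter.

register myudfs.jar;                                    -- hypothetical jar containing the UDFs
define canonicalize myudfs.Canonicalize();              -- eval function that normalizes a url
visits    = load '/data/visits' as (user, url, time);
canonical = foreach visits generate user, canonicalize(url) as url, time;
recent    = filter canonical by myudfs.IsRecent(time);  -- filter function invoked by class name
store recent into '/data/recentVisits';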
Nested Data Model
• Pig Latin has a fully-nestable data model with:
– Atomic values, tuples, bags (lists), and maps
  e.g., 'yahoo' → { finance, email, news }
• More natural to programmers than flat tuples
• Avoids expensive joins
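A small sketch of what the nested model looks like in a script (path and field names are made up): the schema mixes atoms, a bag of tuples, and a map, and foreach can reach inside all of them.

queries = load '/data/queries'
          as (user:chararray,
              terms:bag{t:tuple(word:chararray)},   -- bag of single-word tuples
              props:map[]);                         -- untyped map
expanded = foreach queries generate user, flatten(terms), props#'locale';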
Nested Data Model
Decouples grouping as an independent operation
Input (User, Url, Time):
  Amy   cnn.com  8:00
  Amy   bbc.com  10:00
  Amy   bbc.com  10:05
  Fred  cnn.com  12:00

group by url → (group, Visits):
  cnn.com  { (Amy, cnn.com, 8:00), (Fred, cnn.com, 12:00) }
  bbc.com  { (Amy, bbc.com, 10:00), (Amy, bbc.com, 10:05) }

• Common case: aggregation on these nested sets (see the sketch below)
• Power users: sophisticated UDFs, e.g., sequence analysis
• Efficient implementation (see paper)

"I frankly like pig much better than SQL in some respects (group + optional flatten works better for me, I love nested data structures)."
  - Ted Dunning, Chief Scientist, Veoh
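A sketch of that common case and of the "group + optional flatten" idiom from the quote (paths are hypothetical): grouping leaves the visits nested in a bag, which can either be aggregated or flattened back out.

visits = load '/data/visits' as (user, url, time);
byUrl  = group visits by url;                             -- (group, {bag of visit tuples})
counts = foreach byUrl generate group, COUNT(visits);     -- aggregate the nested set
flat   = foreach byUrl generate group, flatten(visits);   -- or flatten it back out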
CoGroup
results (query, url, rank):          revenue (query, adSlot, amount):
  Lakers  nba.com   1                  Lakers  top   50
  Lakers  espn.com  2                  Lakers  side  20
  Kings   nhl.com   1                  Kings   top   30
  Kings   nba.com   2                  Kings   side  10

cogroup results by query, revenue by query → (group, results, revenue):
  Lakers  { (Lakers, nba.com, 1), (Lakers, espn.com, 2) }   { (Lakers, top, 50), (Lakers, side, 20) }
  Kings   { (Kings, nhl.com, 1), (Kings, nba.com, 2) }      { (Kings, top, 30), (Kings, side, 10) }
Cross-product of the 2 bags would give natural join
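A sketch of the same example as a script (paths assumed): cogroup produces the nested (group, results, revenue) relation shown above, and flattening both bags yields the equi-join.

results = load '/data/results' as (query, url, rank);
revenue = load '/data/revenue' as (query, adSlot, amount);
grouped = cogroup results by query, revenue by query;
joined  = foreach grouped generate flatten(results), flatten(revenue);  -- cross-product per group = join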
Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin
• Compilation into Map-Reduce
• Example Generation
• Future Work
Implementation
Users write Pig directly, or SQL that is automatically rewritten and optimized into Pig;
Pig compiles to jobs on a Hadoop Map-Reduce cluster.

• Pig is open-source: http://hadoop.apache.org/pig
• ~50% of Hadoop jobs at Yahoo! are Pig
• 1000s of jobs per day
Compilation into Map-Reduce
Every group or join operation forms a map-reduce boundary;
other operations are pipelined into the map and reduce phases.

  Map1:    Load Visits
  Reduce1: Group by url
  Map2:    Foreach url generate count; Load Url Info
  Reduce2: Join on url
  Map3:    (map side of the group)
  Reduce3: Group by category; Foreach category generate top10(urls)
Optimizations: Using the Combiner
[Map-Reduce diagram, as before: map tasks emit (key, value) pairs, reduce tasks aggregate values per key]
Can pre-process data on the map-side to reduce data shipped
• Algebraic Aggregation Functions
• Distinct processing
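For example (paths hypothetical), COUNT is algebraic, so Pig can evaluate it partially on the map side through the combiner before shipping data to the reducers.

visits = load '/data/visits' as (user, url, time);
byUrl  = group visits by url;
counts = foreach byUrl generate group, COUNT(visits);  -- partial counts computed in the combiner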
Optimizations: Skew Join
• Default join method is symmetric hash join; the cross-product of the two bags for each key is carried out on one reducer
  [cogrouped (group, results, revenue) table, as on the CoGroup slide]
• Problem if too many values with same key
• Skew join samples data to find frequent values
• Further splits them among reducers
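A sketch of requesting this behavior (input paths are placeholders): the 'skewed' join strategy samples the data and spreads heavy keys over several reducers.

results = load '/data/results' as (query, url, rank);
revenue = load '/data/revenue' as (query, adSlot, amount);
j = join results by query, revenue by query using 'skewed';  -- frequent keys split across reducers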
Optimizations: Fragment-Replicate Join
• Symmetric-hash join repartitions both inputs
• If size(data set 1) >> size(data set 2)
– Just replicate data set 2 to all partitions of data set 1
• Translates to map-only job
– Open data set 2 as “side file”
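A sketch, with placeholder inputs: listing the small relation last and asking for a 'replicated' join makes Pig load it into memory on every map task, so no reduce phase is needed.

big   = load '/data/visits'  as (user, url, time);
small = load '/data/urlInfo' as (url, category, pRank);   -- small enough to fit in memory
j = join big by url, small by url using 'replicated';      -- map-only join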
Optimizations: Merge Join
• Exploits data sets that are already sorted on the join key
• Again, a map-only job
– Open other data set as “side file”
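A sketch, assuming both inputs are already sorted on url: the 'merge' strategy performs a map-side sort-merge join.

a = load '/data/sortedVisits'  as (url, user, time);       -- pre-sorted on url
b = load '/data/sortedUrlInfo' as (url, category, pRank);  -- pre-sorted on url
j = join a by url, b by url using 'merge';                  -- map-only sort-merge join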
Optimizations: Multiple Data Flows
Two branches share a load/filter prefix and are compiled into a single map-reduce job:
  Map1:    Load Users → Filter bots → Group by state | Group by demographic
  Reduce1: Apply udfs → Store into 'bystate' | Apply udfs → Store into 'bydemo'

Internally, the branches are multiplexed with a Split on the map side and a Demultiplex on the reduce side (a script sketch follows):
  Map1:    Load Users → Filter bots → Split → Group by state | Group by demographic
  Reduce1: Demultiplex → Apply udfs → Store into 'bystate' / 'bydemo'
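A sketch of such a script (field names, and COUNT standing in for "apply udfs", are assumptions): because both stores share the load/filter prefix, Pig runs them in one map-reduce job.

users = load '/data/users' as (name, state, demographic, isBot:int);
real  = filter users by isBot == 0;            -- shared prefix
byState = group real by state;
byDemo  = group real by demographic;
stateCounts = foreach byState generate group, COUNT(real);
demoCounts  = foreach byDemo  generate group, COUNT(real);
store stateCounts into '/data/bystate';
store demoCounts  into '/data/bydemo';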
Other Optimizations
• Carry data as byte arrays as far as possible
• Use a binary comparator for sorting
• Stream data through external executables (see the sketch below)
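A sketch of the streaming hook (the cleanup.py script is hypothetical): records are piped through an external executable with the stream operator.

define clean `cleanup.py` ship('cleanup.py');                -- ship the script to the cluster
visits  = load '/data/visits' as (user, url, time);
cleaned = stream visits through clean as (user, url, time);  -- pipe tuples through the executable
store cleaned into '/data/cleanedVisits';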
Performance
Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin
• Compilation into Map-Reduce
• Example Generation
• Future Work
Example Dataflow Program
Goal: find users that tend to visit high-pagerank pages.

  LOAD (user, url)
  FOREACH user, canonicalize(url)
  LOAD (url, pagerank)
  JOIN on url
  GROUP on user
  FOREACH user, AVG(pagerank)
  FILTER avgPR > 0.5
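The same dataflow written out as a Pig Latin script (the paths and the canonicalize UDF are assumptions, not given in the talk):

visits = load '/data/visits' as (user, url);
canon  = foreach visits generate user, canonicalize(url) as url;  -- assumes a canonicalize UDF is defined
pages  = load '/data/pages' as (url, pagerank);
joined = join canon by url, pages by url;
byUser = group joined by user;
avgs   = foreach byUser generate group as user, AVG(joined.pagerank) as avgPR;
good   = filter avgs by avgPR > 0.5;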
Iterative Process
The same dataflow produces no output ☹. Where is the problem?
  • Joining on the right attribute?
  • Bug in UDF canonicalize?
  • Everything being filtered out?
How to do test runs?
• Run with real data
– Too inefficient (TBs of data)
• Create smaller data sets (e.g., by sampling)
– Empty results due to joins [Chaudhuri et al. 99] and selective filters
• Biased sampling for joins
– Indexes not always present
Examples to Illustrate Program
  LOAD (user, url):
    (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
  FOREACH user, canonicalize(url):
    (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
  LOAD (url, pagerank):
    (www.cnn.com, 0.9), (www.frogs.com, 0.3), (www.snails.com, 0.4)
  JOIN on url:
    (Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3), (Fred, www.snails.com, 0.4)
  GROUP on user:
    (Amy, { (Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3) })
    (Fred, { (Fred, www.snails.com, 0.4) })
  FOREACH user, AVG(pagerank):
    (Amy, 0.6), (Fred, 0.4)
  FILTER avgPR > 0.5:
    (Amy, 0.6)
Value Addition From Examples
• Examples can be used for
– Debugging
– Understanding a program written by someone else
– Learning a new operator or the language
Good Examples: Consistency
0. Consistency: output example = operator applied on input example
   e.g., applying FOREACH user, canonicalize(url) to the LOAD examples
   (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
   must yield exactly (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
Good Examples: Realism
1. Realism: example tuples should be drawn from the real input data where possible,
   e.g., (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
   are actual records from the (user, url) input
Good Examples: Completeness
2. Completeness: demonstrate the salient properties of each operator,
   e.g., for FILTER avgPR > 0.5 the examples (Amy, 0.6), (Fred, 0.4) → (Amy, 0.6)
   show both a tuple that passes and one that is filtered out
Good Examples: Conciseness
3. Conciseness: keep the example sets as small as possible,
   e.g., three input tuples (Amy, cnn.com), (Amy, http://www.frogs.com),
   (Fred, www.snails.com/index.html) suffice to illustrate the whole dataflow
Implementation Status
• Available as the ILLUSTRATE command in the open-source release of Pig
• Available as Eclipse Plugin (PigPen)
• See SIGMOD09 paper for algorithm and experiments
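Usage is a one-liner on any alias (the script below is a made-up example): ILLUSTRATE walks the dataflow that produces the alias and prints example tuples at every step.

visits = load '/data/visits' as (user, url, time);
byUrl  = group visits by url;
counts = foreach byUrl generate group, COUNT(visits);
illustrate counts;   -- shows generated example data for the load, group, and foreach steps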
Related Work
• Sawzall
– Data processing language on top of map-reduce
– Rigid structure of filtering followed by aggregation
• Hive
– SQL-like language on top of Map-Reduce
• DryadLINQ
– SQL-like language on top of Dryad
• Nested data models
– Object-oriented databases
Future / In-Progress Tasks
• Columnar-storage layer
• Metadata repository
• Profiling and Performance Optimizations
• Tight integration with a scripting language
– Use loops, conditionals, and functions of the host language
• Memory Management
• Project suggestions at http://wiki.apache.org/pig/ProposedProjects
Credits
Summary
• Big demand for parallel data processing
– Emerging tools that do not look like SQL DBMS
– Programmers like dataflow pipes over static files
• Hence the excitement about Map-Reduce
• But Map-Reduce is too low-level and rigid
Pig Latin: a sweet spot between map-reduce and SQL