
Pig: Building High-Level Dataflows over Map-Reduce


Outline

• Map-Reduce and the need for Pig Latin

• Pig Latin

• Compilation into Map-Reduce

• Example Generation

• Future Work
Data Processing Renaissance

Internet companies swimming in data


• E.g. TBs/day at Yahoo!

Data analysis is “inner loop” of product innovation


Data analysts are skilled programmers
Data Warehousing …?

Scale
• Often not scalable enough
• Prohibitively expensive at web scale

$$$$
• Up to $200K/TB

SQL
• Little control over execution method
• Query optimization is hard
  • Parallel environment
  • Little or no statistics
  • Lots of UDFs
New Systems For Data Analysis

Map-Reduce

Apache Hadoop ...

Dryad
Map-Reduce

[Diagram: map tasks read input records and emit (key, value) pairs; pairs with the
same key are routed to the same reduce task, which produces the output records]
Just a group-by-aggregate?
The Map-Reduce Appeal

Scale
• Scalable due to simpler design
  • Only parallelizable operations
  • No transactions

$
• Runs on cheap commodity hardware

SQL
• Procedural control: a processing “pipe”


Disadvantages

1. Extremely rigid data flow: a single Map → Reduce pipeline

   Other flows constantly hacked in: joins, unions, splits, and chains of jobs
2. Common operations must be coded by hand


• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
Pros And Cons

Need a high-level, general data flow language


Enter Pig Latin


Pig?
» A high-level extensible programming language designed to analyze bulk
data sets and to reduce the complexities of coding MapReduce programs.
» Yahoo developed Pig to analyze huge unstructured data sets and minimize
the writing time of Mapper and Reducer functions.
» Apache Pig?
– An abstraction over MapReduce that is used to handle structured, semi-
structured, and unstructured data.
– It is a high-level data flow tool developed to execute queries on large datasets
that are stored in HDFS.
– Pig Latin is the high-level scripting language used by Apache Pig to write data
analysis programs.
– For reading, writing, and processing data, it provides multiple operators that
developers can use easily. Pig scripts are internally converted into Map and
Reduce tasks that execute on data stored in HDFS. A component of Apache Pig
called the Pig Engine is responsible for converting the scripts into
MapReduce jobs.
Why do we Need Apache Pig?
» Programmers often struggle with MapReduce tasks because working with Hadoop
directly requires strong Java skills. Pig lowers this barrier for such
programmers.
» Here are some reasons that make Pig a must-use platform:
• While using Pig Latin, programmers can skip typing the complex codes
in Java and perform MapReduce tasks with ease.
• Pig uses a multi-query approach to reduce the amount of code.
For example, instead of typing 200 lines of code (LoC) in Java,
programmers can use Apache Pig to write just 10 LoC.
• Pig Latin is easy to understand as it is a SQL-like language.
• Pig is also known as an operator-rich language because it offers
multiple built-in operators like joins, filters, ordering, etc.
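
To make the brevity claim concrete, here is a minimal word-count sketch in Pig
Latin (the input path, field name, and output path are hypothetical):

    lines  = LOAD '/data/input.txt' AS (line:chararray);      -- hypothetical input file
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS total;
    STORE counts INTO '/data/wordcount';                      -- hypothetical output path

The equivalent hand-written Java MapReduce job would need a mapper class, a
reducer class, and driver code for the same result.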
Features of Apache Pig
» Extensibility: Pig is an extensible language; building on its existing
operators, users can write their own functions to read, write, and
process data.
» Optimization Opportunities: The tasks encoded in Apache Pig allow the
system to optimize their execution automatically, so the users can focus
only on the semantics of the language rather than efficiency.
» UDFs: UDFs stand for user-defined functions; Pig provides the facility to
create them in other programming languages like Java and embed them in
Pig Scripts.
» Ease of programming: It is difficult for non-programmers to write complex
Java programs for map-reduce, but with Pig Latin they can easily
perform queries.
» Rich operator set: Pig Latin has multiple operator support like join, sort,
filter, etc.
Difference between Apache Pig
and Map Reduce
• Apache Pig is a user-friendly high-level data-flow language, while
MapReduce is just a low-level paradigm for data processing.
• Apache Pig scripts do not need a separate compilation step, whereas MapReduce
programs must be compiled and packaged before they can run.
• Join task in Pig can be performed much more smoothly and efficiently than
MapReduce.
• The multi-query functionality of Apache Pig lets developers write very few lines
of code and makes the operation more efficient; MapReduce doesn’t support this
feature. Compared to Pig, MapReduce typically requires about 20 times more
lines of code to perform the same operation.
• Basic knowledge of SQL is enough for working with Pig, but a deep
understanding of Java concepts is required to work with MapReduce.
Components of Apache Pig
» Parser: Parser is responsible for various types of checks on the
script, like type checks, syntax checks, and other miscellaneous
checks.
» Optimizer: The optimizer carries out logical optimizations, such as
transforming, splitting, merging, and reordering operators, with the
aim of reducing the amount of data flowing through the pipeline.
» Compiler: The compiler converts the optimized logical plan into a series of
MapReduce jobs; it is responsible for turning the Pig script into
MapReduce jobs.
» Execution Engine: The MapReduce jobs are submitted to Hadoop for
execution.
Pig: Execution Modes
» Local Mode: Local mode is designed for development, testing, and debugging purposes. In this
mode, Pig runs on a single machine, using local file systems rather than the Hadoop Distributed File
System (HDFS). Local mode is suitable for small to moderately sized datasets and is ideal for
quickly prototyping Pig Latin scripts and debugging.
» Advantages of Local Mode:
1. Simplicity: Running in local mode is straightforward and doesn't require a Hadoop cluster
setup, making it accessible to beginners.
2. Rapid Development: Developers can quickly iterate and test Pig scripts on smaller datasets,
which accelerates development and debugging.
3. Ease of Debugging: Debugging in local mode is more straightforward as errors and issues can
be more easily traced due to the absence of complex cluster environments.
» Limitations of Local Mode:
1. Not Scalable: Local mode is not suitable for processing large or distributed datasets. It lacks
the scalability and fault tolerance of MapReduce mode.
2. Limited Data Size: It can handle only smaller datasets that fit in the local machine's memory
and storage.
3. Incompatibility: Code tested in local mode may not work as expected when deployed to a
production Hadoop cluster in MapReduce mode due to differences in execution environments.
Pig: Execution Modes
» MapReduce Mode: MapReduce mode is the default and most commonly used execution mode for
Pig. In this mode, Pig scripts are translated into a series of MapReduce jobs that run on a Hadoop
cluster. MapReduce mode is suitable for processing large datasets and offers the scalability and fault
tolerance of the Hadoop ecosystem.
» Advantages of MapReduce Mode:
1. Scalability: MapReduce mode can process massive datasets distributed across a Hadoop cluster,
providing horizontal scalability.
2. Fault Tolerance: It benefits from Hadoop's fault tolerance mechanisms, ensuring that tasks are
rerun if they fail.
3. Optimized for Production: Pig scripts developed in MapReduce mode are more likely to work
seamlessly in a production Hadoop environment.
4. Supports HDFS: MapReduce mode can work with data stored in HDFS, making it ideal for big
data processing.
» Limitations of MapReduce Mode:
1. Complexity: Setting up and maintaining a Hadoop cluster for MapReduce mode can be complex
and resource-intensive.
2. Slower Development: Deploying and testing Pig scripts in MapReduce mode can be slower
compared to local mode, making development cycles longer.
3. Resource Intensive: Running in MapReduce mode requires access to a Hadoop cluster, which may
not be readily available for development or testing.
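
As a quick sketch of how the mode is chosen at the command line (the script name
here is hypothetical; -x selects the execution mode):

    pig -x local myscript.pig        # run against the local file system; no cluster needed
    pig -x mapreduce myscript.pig    # default mode: compile to MapReduce jobs on a Hadoop cluster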
Architecture and Components of Apache Pig

First, we submit Pig scripts, written in Pig Latin using the built-in
operators, to the Apache Pig execution environment. The scripts then undergo
various transformations in multiple stages to generate the desired output.

To communicate with Pig, a very useful tool called the Grunt Shell is used. The
Grunt Shell is an interactive shell that lets the user work with both HDFS and
the local file system. For example, we can connect with a remote client such as
PuTTY, start the environment (e.g., Cloudera), and type pig to enter the Grunt
Shell. In the Grunt Shell we can write Pig Latin statements and query
structured or unstructured data.
Outline

• Map-Reduce and the need for Pig Latin

• Pig Latin

• Compilation into Map-Reduce

• Example Generation

• Future Work
Pig Latin Data Model
The data model of Pig Latin is fully nested, and it allows complex non-atomic
datatypes such as map and tuple.
» Atom: Any single value
Example − ‘raja’ or ‘30’
» Tuple: A record that is formed by an
ordered set of fields
Example − (Raja, 30)
» Bag: An unordered collection of tuples (duplicates allowed)
Example − {(Raja, 30), (Mohammad, 45)}
» Map: A map (or data map) is a set of key-value pairs.
Example − [name#Raja, age#30]
» Relation: A relation is a bag of tuples. The relations in Pig Latin are
unordered (there is no guarantee that tuples are processed in any
particular order).
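
As a small sketch of how nesting shows up in practice, grouping a relation
produces tuples whose second field is a bag (the relation name, path, and schema
here are hypothetical):

    people = LOAD '/data/people' USING PigStorage(',') AS (name:chararray, age:int);
    byAge  = GROUP people BY age;
    -- each tuple of byAge has the form (age, {bag of matching people tuples}), e.g.
    -- (30, {(Raja,30), (Rahim,30)})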
Reading Data in Pig
Relation = LOAD 'Input file path information' USING load_function AS schema
Where,
• Relation − We have to provide the relation name where we want to
load the file content.
• Input file path information − We have to provide the path of the
Hadoop directory where the file is stored.
• load_function − Apache Pig provides a variety of load functions like
BinStorage, JsonLoader, PigStorage, TextLoader. Here, we need to
choose a function from this set. PigStorage is the commonly used
function as it is suited for loading structured text files.
• Schema − The schema of the data being loaded, defined in parentheses.
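
A minimal sketch of the LOAD statement, assuming a hypothetical comma-delimited
student file in HDFS; PigStorage(',') tells Pig the fields are comma-separated
(with no USING clause, PigStorage with a tab delimiter is assumed):

    student = LOAD '/user/hadoop/student_data.txt'
              USING PigStorage(',')
              AS (id:int, name:chararray, city:chararray);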
Operators in Pig
» Describe Operator: Display the schema of the relation
Syntax: grunt> DESCRIBE Relation_name;
» Illustrate Operator: get the step-by-step execution of a sequence of
statements in Pig command
Syntax: grunt> illustrate Relation_name;
» Explain: review the logical, physical, and map-reduce execution plans of a
relation
Syntax: grunt> explain Relation_name;
» Filter: Select the required tuples from a relation depending upon a
condition
Syntax: grunt> Relation2 = FILTER Relation1 BY (condition);
» Limit: To get a limited number of tuples from a relation, Pig supports the
limit operator.
Syntax: grunt> Output = LIMIT Relation_name required_number_of_tuples;
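
A short sketch combining these operators on the hypothetical student relation
loaded earlier:

    grunt> delhi     = FILTER student BY city == 'Delhi';   -- keep only matching tuples
    grunt> first_ten = LIMIT delhi 10;                      -- at most 10 tuples
    grunt> DESCRIBE first_ten;                              -- print the schema
    grunt> ILLUSTRATE first_ten;                            -- step-by-step sample run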
Operators in Pig
» Distinct: To remove duplicate tuples from a relation, the distinct operator is used.
Syntax:- grunt> Relation2 = DISTINCT Relation1;
» Foreach: Generate specified data transformations based on the column data
Syntax: grunt> Relation2 = FOREACH Relation1 GENERATE (required data);
» Group: group the data in one or more relations
Syntax: grunt> Group_data = GROUP Relation_name BY key_column;
Grouping by multiple columns:
Syntax: grunt> Group_data = GROUP Relation_name BY (column1, column2,column3,..)
» Group All: To group a relation by all the columns, the Group All operator is used.
Syntax: grunt> group_all_data = GROUP Relation_name All;
» Cogroup: Cogroup is more suited for multiple relations, whereas the group is more
suitable for single relations
Syntax: grunt> cogroup_data = COGROUP Relation1 by column1, Relation2 by column2;
» Join: Combine data from two or more relations. First, we declare the join
keys (one or more fields from each relation). If the keys match, the two
tuples are considered matched and appear in the output; otherwise, the
unmatched records are dropped (see the sketch below).
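
A small sketch of GROUP and FOREACH … GENERATE on the same hypothetical student
relation, counting students per city:

    grunt> by_city    = GROUP student BY city;
    grunt> city_count = FOREACH by_city GENERATE group AS city, COUNT(student) AS total;
    grunt> DUMP city_count;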
Join Operator in Pig
» Self Join: Join a table with itself. We load the same data multiple times with
different alias names
Syntax: grunt> Relation3 = JOIN Relation1 BY key, Relation2 BY key;
» Inner Join: An inner join compares both the tables(say A and B) and returns
rows when there is a match.
Syntax: grunt> Output = JOIN Relation1 BY column1, Relation2 BY column2;
» Left Outer Join: Returns all the rows from the left relation, even if there is
no match in the right relation
Syntax: grunt> Relation3 = JOIN Relation1 BY id LEFT OUTER, Relation2 BY column;
» Right Outer Join: Returns all the rows from the right relation, even if there is
no match in the left relation
Syntax: grunt> Relation3 = JOIN Relation1 BY id RIGHT OUTER, Relation2 BY column;
» Full Outer Join: Returns all rows from both relations, whether or not there is
a match.
Syntax: grunt> Relation3 = JOIN Relation1 BY id FULL OUTER, Relation2 BY column;
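
A minimal sketch of these join variants, assuming two hypothetical relations,
customers and orders, joined on cid:

    customers = LOAD '/data/customers' AS (cid:int, name:chararray);
    orders    = LOAD '/data/orders'    AS (oid:int, cid:int, amount:double);

    inner_j = JOIN customers BY cid, orders BY cid;             -- only matching rows
    left_j  = JOIN customers BY cid LEFT OUTER, orders BY cid;  -- keep unmatched customers
    full_j  = JOIN customers BY cid FULL OUTER, orders BY cid;  -- keep unmatched rows from both sides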
Other Operators in Pig

» Cross: Calculate the cross-product of two or more relations.
Syntax: grunt> Relation3 = CROSS Relation1, Relation2;
» Union: merge the content of two relations
Syntax: grunt> Relation3 = UNION Relation1, Relation2;
» Split: divide a relation into two or more relations
Syntax: grunt> SPLIT Relation1 INTO Relation2 IF
(condition1), Relation3 IF (condition2);
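
A brief sketch of these operators, reusing the hypothetical customers and orders
relations from the join example:

    SPLIT orders INTO big IF amount >= 100.0, small IF amount < 100.0;
    combined = UNION big, small;          -- merge the two relations back together
    crossed  = CROSS customers, orders;   -- every customer paired with every order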
Example Data Analysis Task
Find the top 10 most visited pages in each category

Visits
User  Url         Time
Amy   cnn.com     8:00
Amy   bbc.com     10:00
Amy   flickr.com  10:05
Fred  cnn.com     12:00

Url Info
Url         Category  PageRank
cnn.com     News      0.9
bbc.com     News      0.8
flickr.com  Photos    0.7
espn.com    Sports    0.9

Data Flow

  Load Visits
  Group by url
  Foreach url, generate count
                                  Load Url Info
  Join on url
  Group by category
  Foreach category, generate top10 urls
In Pig Latin
visits = load ‘/data/visits’ as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load ‘/data/urlInfo’ as (url, category, pRank);


visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;


topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into ‘/data/topUrls’;


Step-by-step Procedural Control
Target users are entrenched procedural programmers

The step-by-step method of creating a program in Pig is much cleaner and simpler to
use than the single block method of SQL. It is easier to keep track of what your
variables are, and where you are in the process of analyzing your data.

Jasmine Novak
Engineer, Yahoo!

With the various interleaved clauses in SQL, it is difficult to know what is actually
happening sequentially. With Pig, the data nesting and the temporary tables get
abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful.

David Ciemiewicz
Search Excellence, Yahoo!

• Automatic query optimization is hard


• Pig Latin does not preclude optimization
Quick Start and Interoperability

• Operates directly over files

visits = load ‘/data/visits’ as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load ‘/data/urlInfo’ as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into ‘/data/topUrls’;


Quick Start and Interoperability

• Schemas optional; can be assigned dynamically

visits = load ‘/data/visits’ as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load ‘/data/urlInfo’ as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into ‘/data/topUrls’;
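
When no schema is given, fields can be referred to positionally with $0, $1, …;
a minimal sketch (paths hypothetical):

    raw   = load '/data/visits';                    -- no schema declared
    pairs = foreach raw generate $0 as user, $1 as url;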
User-Code as a First-Class Citizen

• User-defined functions (UDFs) can be used in every construct
  • Load, Store
  • Group, Filter, Foreach

visits = load ‘/data/visits’ as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load ‘/data/urlInfo’ as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into ‘/data/topUrls’;
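
A hedged sketch of how a UDF plugs in: REGISTER makes a jar available, and the
function is then invoked like a built-in (the jar and class names here are
hypothetical):

    register myudfs.jar;                     -- hypothetical jar containing the UDF
    visits = load '/data/visits' as (user, url, time);
    canon  = foreach visits generate user, myudfs.Canonicalize(url);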


Nested Data Model
• Pig Latin has a fully-nestable data model with:
–Atomic values, tuples, bags (lists), and maps

Example: ( yahoo , { (finance), (email), (news) } )
• More natural to programmers than flat tuples


• Avoids expensive joins
Nested Data Model
Decouples grouping as an independent operation
User  Url      Time                     group     Visits
Amy   cnn.com  8:00                     cnn.com   {(Amy, cnn.com, 8:00),
Amy   bbc.com  10:00    group by url               (Fred, cnn.com, 12:00)}
Amy   bbc.com  10:05    ───────────►    bbc.com   {(Amy, bbc.com, 10:00),
Fred  cnn.com  12:00                               (Amy, bbc.com, 10:05)}
• Common case: aggregation on these nested sets
• Power users: sophisticated UDFs, e.g., sequence analysis
• Efficient Implementation
“I frankly like pig much better than SQL in some respects (group + optional
flatten works better for me, I love nested data structures).”

Ted Dunning
Chief Scientist, Veoh
CoGroup
results                            revenue
query   url       rank             query   adSlot  amount
Lakers  nba.com   1                Lakers  top     50
Lakers  espn.com  2                Lakers  side    20
Kings   nhl.com   1                Kings   top     30
Kings   nba.com   2                Kings   side    10

CoGroup on query:

group    results                             revenue
Lakers   {(Lakers, nba.com, 1),              {(Lakers, top, 50),
          (Lakers, espn.com, 2)}              (Lakers, side, 20)}
Kings    {(Kings, nhl.com, 1),               {(Kings, top, 30),
          (Kings, nba.com, 2)}                (Kings, side, 10)}

Cross-product of the 2 bags would give natural join
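
In Pig Latin this operation can be sketched as follows (relation and field names
follow the tables on this slide):

    grouped = cogroup results by query, revenue by query;
    -- each output tuple: (query, {bag of results tuples}, {bag of revenue tuples})
    joined  = foreach grouped generate flatten(results), flatten(revenue);
    -- flattening both bags takes their per-group cross-product, i.e. the join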


Outline

• Map-Reduce and the need for Pig Latin

• Pig Latin

• Compilation into Map-Reduce

• Example Generation

• Future Work
Implementation

  user ──► SQL ──(automatic rewrite + optimize)──► Pig ──► Map-Reduce ──► Hadoop cluster
  (or enter directly at the Pig or Map-Reduce layer)

• Pig is open-source: http://hadoop.apache.org/pig
• ~50% of Hadoop jobs at Yahoo! are Pig
• 1000s of jobs per day
Compilation into Map-Reduce

Every group or join operation forms a map-reduce boundary;
other operations are pipelined into the map and reduce phases.

  Map1:    Load Visits, Group by url (map side)
  Reduce1: Foreach url, generate count
  Map2:    Load Url Info, Join on url (map side)
  Reduce2: complete the Join on url
  Map3:    Group by category (map side)
  Reduce3: Foreach category, generate top10(urls)
Optimizations: Using the Combiner

[Map-Reduce diagram as before: map tasks emit (key, value) pairs that are routed
by key to reduce tasks]

Can pre-process data on the map-side to reduce data shipped


• Algebraic Aggregation Functions
• Distinct processing
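
For example, COUNT and SUM are algebraic, so for an aggregation like the sketch
below (a corrected fragment of the earlier example), Pig can compute partial
counts in the combiner on the map side before shipping data to reducers:

    gVisits     = group visits by url;
    visitCounts = foreach gVisits generate group as url, COUNT(visits);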
Optimizations: Skew Join
• Default join method is symmetric hash join: the per-key
cross product is carried out on a single reducer

  [Example: the cogrouped ‘Lakers’ and ‘Kings’ bags from the CoGroup slide; all
  tuples sharing a join key end up at a single reducer]

• Problem if too many values with same key


• Skew join samples data to find frequent values
• Further splits them among reducers
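
In Pig Latin the skew join is requested with a USING clause; a minimal sketch on
the relations from the CoGroup example:

    joined = join results by query, revenue by query using 'skewed';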
Optimizations: Fragment-Replicate Join

• Symmetric-hash join repartitions both inputs

• If size(data set 1) >> size(data set 2)


– Just replicate data set 2 to all partitions of data set 1

• Translates to map-only job


– Open data set 2 as “side file”
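
Pig exposes this as a ‘replicated’ join; a minimal sketch, assuming big_rel is
large and small_rel fits in memory (the smaller, replicated relation is listed
second):

    joined = join big_rel by key, small_rel by key using 'replicated';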
Optimizations: Merge Join

• Exploit data sets that are already sorted.

• Again, a map-only job


– Open other data set as “side file”
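
A minimal sketch, assuming both hypothetical inputs are already sorted on the
join key:

    joined = join sorted_a by key, sorted_b by key using 'merge';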
Optimizations: Multiple Data Flows

[Diagram 1: Load Users → Filter bots, then two branches: Group by state → Apply
udfs → Store into ‘bystate’, and Group by demographic → Apply udfs → Store into
‘bydemo’]

[Diagram 2: the same script compiled into a single Map1/Reduce1 job: the map adds
a Split after “Filter bots”, and the reduce adds a Demultiplex, so both group-bys
share one map-reduce job]


Other Optimizations

• Carry data as byte arrays as far as possible

• Using binary comparator for sorting

• “Streaming” data through external executables
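
As a sketch of the streaming point, Pig’s STREAM operator pipes tuples through an
external executable (the script name here is hypothetical):

    filtered = stream visits through `my_filter.pl`;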


Outline

• Map-Reduce and the need for Pig Latin

• Pig Latin

• Compilation into Map-Reduce

• Example Generation

• Future Work
Example Dataflow Program

Goal: find users that tend to visit high-pagerank pages

  LOAD (user, url)                      LOAD (url, pagerank)
  FOREACH user, canonicalize(url)
                        JOIN on url
                        GROUP on user
                        FOREACH user, AVG(pagerank)
                        FILTER avgPR > 0.5
Iterative Process

The same dataflow (LOAD → FOREACH canonicalize(url) → JOIN on url → GROUP on user
→ FOREACH AVG(pagerank) → FILTER avgPR > 0.5) produces no output. Possible causes:

• Joining on the right attribute?
• Bug in UDF canonicalize?
• Everything being filtered out?
How to do test runs?

• Run with real data


– Too inefficient (TBs of data)

• Create smaller data sets (e.g., by sampling)


– Empty results due to joins [Chaudhuri et al. 99] and
selective filters

• Biased sampling for joins


– Indexes not always present
Examples to Illustrate Program

LOAD (user, url):
  (Amy, cnn.com)
  (Amy, http://www.frogs.com)
  (Fred, www.snails.com/index.html)

LOAD (url, pagerank):
  (www.cnn.com, 0.9)
  (www.frogs.com, 0.3)
  (www.snails.com, 0.4)

FOREACH user, canonicalize(url):
  (Amy, www.cnn.com)
  (Amy, www.frogs.com)
  (Fred, www.snails.com)

JOIN on url:
  (Amy, www.cnn.com, 0.9)
  (Amy, www.frogs.com, 0.3)
  (Fred, www.snails.com, 0.4)

GROUP on user:
  (Amy, {(Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3)})
  (Fred, {(Fred, www.snails.com, 0.4)})

FOREACH user, AVG(pagerank):
  (Amy, 0.6)
  (Fred, 0.4)

FILTER avgPR > 0.5:
  (Amy, 0.6)
Value Addition From Examples

• Examples can be used for


– Debugging
– Understanding a program written by someone else
– Learning a new operator, or language
Good Examples: Consistency

0. Consistency

   output example = operator applied on input example

(Illustrated on the same dataflow, with the example input tuples shown on the
previous slide.)
Good Examples: Realism

1. Realism

(Illustrated on the same dataflow and example input tuples.)
Good Examples: Completeness

2. Completeness

   Demonstrate the salient properties of each operator, e.g., FILTER:
   input (Amy, 0.6), (Fred, 0.4) → output (Amy, 0.6)
Good Examples: Conciseness

3. Conciseness

(Illustrated on the same dataflow, using as few example tuples as possible.)
Implementation Status

• Available as the ILLUSTRATE command in the open-source release of Pig

• Available as Eclipse Plugin (PigPen)

• See SIGMOD09 paper for algorithm and experiments


Future / In-Progress Tasks

• Columnar-storage layer
• Metadata repository
• Profiling and Performance Optimizations
• Tight integration with a scripting language
–Use loops, conditionals, functions of host language
• Memory Management
• Project Suggestions at:
  http://wiki.apache.org/pig/ProposedProjects
Summary

• Big demand for parallel data processing


– Emerging tools that do not look like SQL DBMS
– Programmers like dataflow pipes over static files

• Hence the excitement about Map-Reduce

• But, Map-Reduce is too low-level and rigid

Pig Latin
Sweet spot between map-reduce and SQL
References
» https://www.analyticsvidhya.com/blog/2022/06/getting-started-with-apache-pig/
» https://github.com/srafay/Hadoop-hands-on
» https://github.com/newTendermint/awesome-bigdata
» https://pig.apache.org/docs/latest/basic.html
» https://www.oreilly.com/library/view/programming-pig-2nd/9781491937082/ch04.html
» https://data-flair.training/blogs/pig-reading-data-storing-data/
