Unit 5


PIG

Features of Pig Component

Pig is used to process large datasets.


It is a scripting language.
Installation and configuration settings
It has its own shell, i.e., the Grunt shell
Advantages:
1. No need for Java knowledge
2. A simple way to form complex queries over complex structures
3. The dataset can be a Hive table or an HDFS text file
4. Queries are written as scripting statements
5. Supported operations: GROUP, COGROUP, JOIN, CROSS, DISTINCT, ORDER, etc.
6. Allows the inclusion of Linux and HDFS commands through the sh and fs commands, respectively (see the sample commands after this list)
Slide 2
7. Several more added features, covered on the following slides.
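For example, at the Grunt shell an HDFS listing and a Linux command can be run without leaving Pig (the /pig directory is simply the path used in the examples later in this unit):

grunt> fs -ls /pig
grunt> sh date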
Data Types in Pig

1. Atoms or fields (int, long, float, double, chararray, bytearray, with a type-casting feature)
2. Tuple or record: an ordered set of fields, similar to a row in an RDBMS, written with ( )
3. Bag: a collection of tuples and fields, i.e., an unordered set of tuples that can also hold unstructured data, written with { }
4. Map: key#value pairs, like dictionaries in Python, e.g., [Name#Raja, Age#20], written with [ ] (a schema sketch using all four types follows below)
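A minimal sketch that uses all four types in one load schema (the file name, delimiter, and field names here are assumed purely for illustration):

grunt> students = LOAD '/pig/students.txt' USING PigStorage('|')
                  AS (name:chararray,
                      marks:bag{t:tuple(subject:chararray, score:int)},
                      details:map[]);
grunt> DESCRIBE students;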

Slide 3
Added Features in Pig

1. Pig relation, field, and function names are case sensitive (the Pig Latin keywords themselves are not). The programmer can easily load a dataset and filter, group, and aggregate it, even when the dataset is large.
2. ILLUSTRATE is the feature for viewing intermediate results (e.g., of GROUP BY or ORDER BY steps) on a small sample of the data.
3. Dry run is the feature for viewing the complete code after parameter substitution (whether with values or user-defined function statements), without executing it.
4. EXPLAIN is the feature for viewing the logical plan (to check the logic with sample data before execution) and the physical plan (the DAG of MapReduce jobs that will be generated). Example commands for items 2–4 follow after this list.
5. Reflection identifies the schema (the variable/field names) of a relation that is already stored.
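A short sketch of the diagnostic commands behind items 2–4 (max_temp is the relation built in the temperature-analysis example later in this unit; the script name is assumed):

grunt> ILLUSTRATE max_temp;
grunt> EXPLAIN max_temp;
$ pig -dryrun max_temp.pig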

Slide 4
Added Features in Pig (Contd.)

6. Build is used to create a jar file (at the grunt> shell, register the jar file and use the class-loader feature to utilize it). A sketch of items 6–8 follows after this list.
7. DEFINE is the feature for creating new macros, i.e., shortcuts for a set of statements that can be used anywhere in the program.
8. UDFs – a user-defined filter function is a subclass of the FilterFunc class, which is itself a subclass of the EvalFunc abstract class that declares the exec method. A user-defined function extends FilterFunc and overrides the exec method.
9. Multiquery handler – Pig can create its own logical plan and check the plan with sample data (before loading the entire dataset).
10. Logical and physical planner – to check the plans, use the debug and dry-run modes of execution. (For example, A = LOAD ..., B = LOAD ..., C = UNION A, B, SPLIT C INTO D ..., E ..., DUMP D, DUMP E will be executed as a single job, since relations D and E refer to the same dataset C, split into different sets of records based on the specified conditions. This is decided by the Pig logical planner itself.)
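A sketch of items 6–8 in Pig Latin (the jar name, UDF class name, and macro are assumed for illustration; records and quality refer to the temperature-analysis relation later in this unit):

REGISTER myudfs.jar;                               -- item 6: use a user-built jar
DEFINE isGood com.example.pig.IsGoodQuality();     -- item 8: alias for a filter UDF in that jar
good_records = FILTER records BY isGood(quality);

DEFINE row_count(rel) RETURNS c {                  -- item 7: a reusable macro
    g = GROUP $rel ALL;
    $c = FOREACH g GENERATE COUNT($rel);
};
n = row_count(good_records);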
Slide 5
Example Pig Commands

HDFS: /pig directory


A.txt → 0,1,21,3,4
B.txt → 0,5,21,7,8

$ pig
grunt>

grunt> A = LOAD '/pig/A.txt';


grunt> DUMP A;
(0,1,21,3,4)

grunt> B = LOAD '/pig/B.txt';


grunt> DUMP B;
(0,5,21,7,8)

Slide 6
Example Pig Commands

grunt> C = LOAD '/pig/A1.txt' USING PigStorage(',') AS (a1:int, a2:int, a3:int, a4:int, a5:int);
grunt> DESCRIBE C;
C: {a1: int, a2: int, a3: int, a4: int, a5: int}

grunt> D = UNION A, B;
grunt> DUMP D;
(0,1,21,3,4)
(0,5,21,7,8)

grunt>

Slide 7
Example Pig Commands

Slide 8
Example Pig Commands

grunt> ILLUSTRATE max_temp;


-------------------------------------------------------------------------------
| records          | year:chararray   | temperature:int   | quality:int      |
-------------------------------------------------------------------------------
|                  | 1949             | 78                | 1                |
|                  | 1949             | 111               | 1                |
|                  | 1949             | 9999              | 1                |
-------------------------------------------------------------------------------
| filtered_records | year:chararray   | temperature:int   | quality:int      |
-------------------------------------------------------------------------------
|                  | 1949             | 78                | 1                |
|                  | 1949             | 111               | 1                |
-------------------------------------------------------------------------------

Slide 9
Example Pig Commands

Assume C now contains the tuples (0,1,2), (1,3,4), (0,5,2), (1,7,8), with positional fields $0, $1, $2.

grunt> SPLIT C INTO D IF $0 == 0, E IF $0 == 1;
grunt> DUMP D;        // Hadoop's AppManager component starts the YARN execution flow (an MR job)
(0,1,2)
(0,5,2)
grunt> DUMP E;
(1,3,4)
(1,7,8)

grunt> F = FILTER C BY $1 > 3;
grunt> DUMP F;
(0,5,2)
(1,7,8)

grunt> G = GROUP C BY $2;
grunt> DUMP G;
(2,{(0,1,2),(0,5,2)})
(4,{(1,3,4)})
(8,{(1,7,8)})

Slide 10
Pig program for temperature data analysis

records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int);

filtered_records = FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9);

grouped_records = GROUP filtered_records BY year;

max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);

DUMP max_temp;
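Based on the sampled 1949 rows shown in the ILLUSTRATE output earlier (temperatures 78 and 111 after filtering), the dump would include a line such as:

(1949,111)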

Slide 11
Multiquery Execution

Because DUMP is a diagnostic tool, it will always trigger execution. However, the STORE command is
different.

In interactive mode, STORE acts like DUMP and will always trigger execution (this includes the run
command), but in batch mode it will not (this includes the exec command). The reason for this is
efficiency. In batch mode, Pig will parse the whole script to see whether there are any optimizations that
could be made to limit the amount of data to be written to or read from disk.
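For instance (script name assumed), the difference shows up in how the two Grunt commands handle a script containing STORE statements:

grunt> run max_temp_store.pig
grunt> exec max_temp_store.pig

With run, each STORE triggers execution immediately, as in a normal interactive session; with exec, the script is parsed as a whole in batch mode, so Pig can combine or defer the STOREs as described above.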

Slide 12
Functions

Functions in Pig come in four types:

Eval function

A function that takes one or more expressions and returns another expression. An example of a built-in
eval function is MAX, which returns the maximum value of the entries in a bag. Some eval functions are
aggregate functions, which means they operate on a bag of data to produce a scalar value; MAX is an
example of an aggregate function. Furthermore, many aggregate functions are algebraic, which means
that the result of the function may be calculated incrementally. In MapReduce terms, algebraic functions
make use of the combiner and are much more efficient to calculate (see “Combiner Functions” on page
34). MAX is an algebraic function, whereas a function to calculate the median of a collection of values is
an example of a function that is not algebraic.
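As a short sketch (relation and field names follow the temperature-analysis example above), eval functions such as MAX and AVG are applied per group like this:

grouped = GROUP filtered_records BY year;
yearly_stats = FOREACH grouped GENERATE group, MAX(filtered_records.temperature), AVG(filtered_records.temperature);

Both are built-in aggregate functions here; MAX, as noted above, is also algebraic, so Pig can evaluate it partly in the combiner.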

Slide 13
Filter function

A special type of eval function that returns a logical Boolean result. As the name suggests, filter functions
are used in the FILTER operator to remove unwanted rows. They can also be used in other relational
operators that take Boolean conditions, and in general, in expressions using Boolean or conditional
expressions.

An example of a built-in filter function is IsEmpty, which tests whether a bag or a map contains any items.
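A minimal sketch (reusing relations A and B from the earlier LOAD examples, cogrouped on their first field): IsEmpty keeps only the keys that have matching tuples on both sides.

grunt> H = COGROUP A BY $0, B BY $0;
grunt> matched = FILTER H BY NOT IsEmpty(A) AND NOT IsEmpty(B);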

Slide 14
Consider the following simple example:
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';

Relations B and C are both derived from A, so to save reading A twice, Pig can run this script as a single
MapReduce job by reading A once and writing two output files from the job, one for each of B and C.

This feature is called multiquery execution.

In previous versions of Pig that did not have multiquery execution, each STORE statement in a script run
in batch mode triggered execution, resulting in a job for each STORE statement.

It is possible to restore the old behavior by disabling multiquery execution with the -M or -no_multiquery
option to pig.
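For example (script name assumed), either form disables the optimization:

$ pig -M myscript.pig
$ pig -no_multiquery myscript.pig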

Slide 15

Slide 16
