Unit 5
1. Atom or field (int, long, float, double, chararray, bytearray; supports type casting)
2. Tuple or record (an ordered set of fields, similar to a row in an RDBMS) – ( )
3. Bag (an unordered collection of tuples, like a set in Python, except that duplicates are allowed; the tuples may have differing schemas) – { }
4. Map – key-value pairs (like dictionaries in Python, e.g. [Name#Raja, Age#20]) – [ ]
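The four types above can be seen together in a short grunt session. A minimal sketch (the file name, its contents, and the field names are assumed for illustration, not from the slides):

```pig
-- students.txt contains comma-separated lines such as:  Raja,20,CSE
grunt> A = LOAD 'students.txt' USING PigStorage(',')
           AS (name:chararray, age:int, dept:chararray);  -- each line becomes a tuple of atoms
grunt> B = GROUP A BY dept;      -- each group value is a bag of tuples: {(Raja,20,CSE), ...}
grunt> C = FOREACH A GENERATE TOMAP('Name', name, 'Age', (chararray)age);
                                 -- builds a map per tuple: [Name#Raja,Age#20]
```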
Slide 3
Added Features in Pig
Slide 4
Added Features in Pig (Contd.)
6. Build – package your classes into a jar file (at the grunt> shell, REGISTER the jar so that Pig's class loader can find and use it).
7. DEFINE – creates a macro, i.e., a short-cut for a set of statements, which can then be reused anywhere in the script.
8. UDFs – a filter UDF is a subclass of FilterFunc, which is itself a subclass of the abstract class EvalFunc; EvalFunc declares the exec method.
A user-defined filter function extends FilterFunc and overrides the exec method.
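The grunt-side steps for using such a UDF might look like the following sketch (the jar name, package, class name IsGoodQuality, and the relation/field names are all assumed for illustration):

```pig
grunt> REGISTER myudfs.jar;                               -- make the jar visible to Pig's class loader
grunt> DEFINE isGood com.example.pig.IsGoodQuality();     -- short alias for the filter UDF
grunt> good_records = FILTER records BY isGood(quality);  -- exec() is invoked once per tuple
```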
9. Multiquery handling – Pig builds a single logical plan for the whole script and can check the plan against sample data (before loading the entire dataset).
10. Logical and Physical Planner – to inspect the plans, use the Debug and Dryrun modes of execution.
(For example, a script A = LOAD ..., B = LOAD ..., C = UNION A, B, SPLIT C INTO D ..., E ..., DUMP D, DUMP E will be executed as a single job, since the relations D and E refer to the same dataset C and differ only in which records they keep, based on the specified conditions. This is decided by Pig's logical planner itself.)
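Written out in full, the example in parentheses might look like this sketch (the input paths and schemas are assumed):

```pig
A = LOAD 'input/a.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
B = LOAD 'input/b.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
C = UNION A, B;                            -- one combined dataset
SPLIT C INTO D IF f1 == 0, E IF f1 == 1;   -- two relations over the same data
DUMP D;                                    -- the planner folds both DUMPs
DUMP E;                                    -- into a single job over C
```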
Slide 5
Example Pig Commands
$ pig
grunt>
Slide 6
Example Pig Commands
grunt> C = LOAD '/pig/A1.txt' USING PigStorage(',') AS (a1:int, a2:int, a3:int, a4:int, a5:int);
grunt> DESCRIBE C;
C: {a1: int, a2: int, a3: int, a4: int, a5: int}
grunt>
Slide 7
Example Pig Commands
Slide 8
Example Pig Commands
Slide 9
Example Pig Commands
Assume relation C contains the tuples below (fields $0, $1, $2):
(0, 1, 2)
(1, 3, 4)
(0, 5, 2)
(1, 7, 8)
grunt> SPLIT C INTO D IF $0 == 0, E IF $0 == 1;
grunt> DUMP D;    -- Hadoop's application manager starts the YARN execution flow (MR job)
(0, 1, 2)
(0, 5, 2)
grunt> DUMP E;
(1, 3, 4)
(1, 7, 8)
Slide 10
Pig program for temperature data analysis
DUMP max_temp;
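The slide's script itself is not reproduced here; a typical Pig program for this analysis, following the well-known NCDC weather-data example, might look like the sketch below (the input path, schema, and quality codes are assumed):

```pig
records = LOAD 'input/ncdc/sample.txt'
          AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records
                   BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9);  -- drop bad readings
grouped_records = GROUP filtered_records BY year;     -- one bag of readings per year
max_temp = FOREACH grouped_records
           GENERATE group, MAX(filtered_records.temperature);  -- MAX is an algebraic eval function
DUMP max_temp;
```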
Slide 11
Multiquery Execution
Because DUMP is a diagnostic tool, it will always trigger execution. However, the STORE command is
different.
In interactive mode, STORE acts like DUMP and will always trigger execution (this includes the run
command), but in batch mode it will not (this includes the exec command). The reason for this is
efficiency. In batch mode, Pig will parse the whole script to see whether there are any optimizations that
could be made to limit the amount of data to be written to or read from disk.
Slide 12
Functions
Functions in Pig come in four types:
Eval function
A function that takes one or more expressions and returns another expression. An example of a built-in
eval function is MAX, which returns the maximum value of the entries in a bag. Some eval functions are
aggregate functions, which means they operate on a bag of data to produce a scalar value; MAX is an
example of an aggregate function. Furthermore, many aggregate functions are algebraic, which means
that the result of the function may be calculated incrementally. In MapReduce terms, algebraic functions
make use of the combiner and are much more efficient to calculate (see “Combiner Functions” on page
34). MAX is an algebraic function, whereas a function to calculate the median of a collection of values is
an example of a function that is not algebraic.
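The algebraic distinction can be sketched as follows (relation and field names assumed): MAX can be partially computed on each map task and merged, whereas a median needs the whole bag in one place.

```pig
vals = LOAD 'input/vals.txt' AS (v:int);
g = GROUP vals ALL;                       -- a single group containing every tuple
result = FOREACH g GENERATE MAX(vals.v);  -- algebraic: Pig can use the MapReduce combiner
-- A median function, by contrast, must see the entire bag on one reducer,
-- so no combiner can be used and the computation is less efficient.
```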
Slide 13
Filter function
A special type of eval function that returns a logical Boolean result. As the name suggests, filter functions
are used in the FILTER operator to remove unwanted rows. They can also be used in other relational operators that take Boolean conditions and, in general, anywhere a Boolean or conditional expression is allowed.
An example of a built-in filter function is IsEmpty, which tests whether a bag or a map contains any items.
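A sketch of IsEmpty in use (the file names and schemas are assumed): a common pattern is keeping only the groups that matched on both sides of a COGROUP.

```pig
A = LOAD 'input/a.txt' AS (id:int, name:chararray);
B = LOAD 'input/b.txt' AS (id:int, score:int);
G = COGROUP A BY id, B BY id;          -- each group holds a bag from A and a bag from B
matched = FILTER G BY NOT IsEmpty(A) AND NOT IsEmpty(B);  -- drop ids missing on either side
DUMP matched;
```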
Slide 14
Consider the following simple example:
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';
Relations B and C are both derived from A, so to save reading A twice, Pig can run this script as a single
MapReduce job by reading A once and writing two output files from the job, one for each of B and C.
In previous versions of Pig that did not have multiquery execution, each STORE statement in a script run
in batch mode triggered execution, resulting in a job for each STORE statement.
It is possible to restore the old behavior by disabling multiquery execution with the -M or -no_multiquery
option to pig.
Slide 15
Pig program for temperature data analysis
Slide 16