Apache Pig Handy Notes Lab
Apache Pig is a platform for analyzing large data sets by representing them as data
flows.
It is designed to provide an abstraction over MapReduce, reducing the complexity of
writing MapReduce programs.
We can perform data manipulation operations very easily in Hadoop using Apache
Pig.
Pig Latin is a data flow language. This means it allows users to describe how data from
one or more inputs should be read, processed, and then stored to one or more outputs in
parallel. These data flows can be simple linear flows, or complex workflows that include
points where multiple inputs are joined and where data is split into multiple streams to be
processed by different operators.
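As a minimal sketch of such a data flow (the file path and field names here are hypothetical), a linear Pig Latin script might read a log file, filter it, group it, and store an aggregate:
grunt> logs = LOAD '/data/access_log.txt' USING PigStorage('\t') AS (user:chararray, url:chararray, bytes:long);
grunt> big = FILTER logs BY bytes > 1024;    -- keep only the larger records
grunt> by_user = GROUP big BY user;
grunt> counts = FOREACH by_user GENERATE group AS user, COUNT(big) AS hits;
grunt> STORE counts INTO '/output/hits_per_user' USING PigStorage('\t');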
Parser
After passing through the Grunt shell or Pig Server, Pig scripts are handed to the Parser. The
Parser checks the syntax of the script and does type checking. Its output is a DAG (directed
acyclic graph) representing the Pig Latin statements and logical operators: the logical operators
are represented as nodes and the data flows as edges.
Optimizer
The DAG is then submitted to the optimizer. The optimizer performs optimization activities
such as splitting, merging, transforming, and reordering operators, which gives Apache Pig its
automatic optimization feature. The optimizer's basic aim is to reduce the amount of data in the
pipeline at any point in time while processing the extracted data, and for that it applies rules
such as:
PushUpFilter: If a filter contains multiple conditions and can be split, Pig splits the
conditions and pushes up each condition separately. Applying these conditions earlier
helps reduce the number of records remaining in the pipeline.
PushDownForEachFlatten: Flatten, which produces a cross product between a complex
type such as a tuple or a bag and the other fields in the record, is applied as late as
possible in the plan. This keeps the number of records in the pipeline low.
ColumnPruner: Omitting columns that are never used or no longer needed, reducing the
size of the record. This can be applied after each operator, so that fields can be pruned as
aggressively as possible.
MapKeyPruner: Omitting map keys that are never used, reducing the size of the record.
LimitOptimizer: If the limit operator is applied immediately after a load or sort operator,
Pig converts the load or sort operator into a limit-sensitive implementation, which does
not require processing the whole data set. Applying the limit earlier reduces the number
of records.
This is just a flavor of the optimization process. Beyond these rules, the optimizer also
optimizes Join, Order By, and Group By operations.
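As an illustration of what PushUpFilter achieves, consider this hypothetical script (the relation and field names are assumptions) in which the filter is written after the join; because the condition touches only the users input, the optimizer can push it above the join so that fewer records reach the join:
grunt> users = LOAD '/data/users.txt' USING PigStorage(',') AS (uid:int, city:chararray);
grunt> orders = LOAD '/data/orders.txt' USING PigStorage(',') AS (uid:int, amount:double);
grunt> joined = JOIN users BY uid, orders BY uid;
grunt> hyd = FILTER joined BY users::city == 'Hyderabad';
grunt> DUMP hyd;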
Compiler
After the optimization process, the compiler compiles the optimized code into a series of
MapReduce jobs. The compiler is responsible for converting Pig jobs into MapReduce jobs
automatically.
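You can inspect what the compiler produces for a given relation with the EXPLAIN operator, which prints the logical, physical, and MapReduce plans (using, for example, the counts relation from the sketch above):
grunt> EXPLAIN counts;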
Execution engine
Finally, these MapReduce jobs are submitted to the execution engine for execution. The
MapReduce jobs are then run and produce the required result. The result can be displayed on
the screen using the DUMP statement or stored in HDFS using the STORE statement.
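For example, continuing with the counts relation from the earlier sketch, the two statements look like this:
grunt> DUMP counts;                               -- print the result on the screen
grunt> STORE counts INTO '/output/hits_per_user'; -- write the result to HDFS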
The value of all these types can also be null. The semantics for null are similar to those used in
SQL. The concept of null in Pig means that the value is unknown. Nulls can show up in the data
in cases where values are unreadable or unrecognizable — for example, if you were to use a
wrong data type in the LOAD statement.
Null could be used as a placeholder until data is added or as a value for a field that is optional.
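As a small sketch of how such nulls appear (the file and field names are assumptions), a value that cannot be cast to the declared type is loaded as null and can then be filtered out; for instance, a row such as "ravi,absent" cannot be cast to int, so its score field becomes null:
grunt> marks = LOAD '/data/marks.txt' USING PigStorage(',') AS (name:chararray, score:int);
grunt> valid = FILTER marks BY score IS NOT NULL;
grunt> DUMP valid;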
Pig Latin has a simple syntax with powerful semantics you’ll use to carry out two primary
operations: access and transform data.
In a Hadoop context, accessing data means allowing developers to load, store, and stream
data, whereas transforming data means taking advantage of Pig’s ability to group, join,
combine, split, filter, and sort data. The table gives an overview of the operators associated
with each operation.
Operation       Operator    Description
Data Access     LOAD        Read and write data to the file system. The LOAD operator specifies the schema.
You can view the schema of a loaded relation (here a relation named A) using the DESCRIBE operator:
grunt> DESCRIBE A;
quit Command
You can quit from the Grunt shell using this command.
Usage
Quit from the Grunt shell as shown below.
grunt> quit
COGROUP
It joins two or more relations and then performs a GROUP operation on the joined result.
CROSS
This is used to compute the cross product (cartesian product) of two or more relations.
FOREACH
This will iterate through the tuples of a relation, generating a data transformation.
JOIN
This is used to join two or more relations.
LIMIT
This will limit the number of output tuples.
SPLIT
This will split the relation into two or more relations.
UNION
It will merge the contents of two relations.
ORDER
This is used to sort a relation based on one or more fields.
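A brief sketch combining several of these operators is shown below (the file paths and field names are assumptions, not part of the lab data):
grunt> emp = LOAD '/data/emp.txt' USING PigStorage(',') AS (eid:int, ename:chararray, did:int);
grunt> dept = LOAD '/data/dept.txt' USING PigStorage(',') AS (did:int, dname:chararray);
grunt> joined = JOIN emp BY did, dept BY did;
grunt> named = FOREACH joined GENERATE emp::ename AS ename, dept::dname AS dname;
grunt> sorted = ORDER named BY ename ASC;
grunt> top5 = LIMIT sorted 5;
grunt> DUMP top5;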
Storing data in PIG
You can store the loaded data in the file system using the store operator.
Syntax
STORE Relation_name INTO 'required_directory_path' [USING function];
Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation student using the LOAD operator as shown below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
       AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Now, let us store the relation in the HDFS directory “/pig_Output/” as shown below.
grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output' USING PigStorage(',');
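You can then verify the stored output from the Grunt shell itself using the fs commands; the exact part-file name depends on the job, and part-m-00000 is just a typical example:
grunt> fs -ls hdfs://localhost:9000/pig_Output/
grunt> fs -cat hdfs://localhost:9000/pig_Output/part-m-00000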