Apache Pig: Pig Is the Abstraction over MapReduce
Apache Pig began as a research project at Yahoo to make it easier to create and run MapReduce
jobs on different datasets. Pig is an abstraction over MapReduce: a tool for analyzing
structured and semi-structured data that is used along with the Hadoop framework. Pig provides
a high-level language called Pig Latin. With Pig we can perform most of the data manipulation
that we would otherwise do with MapReduce programming. Pig is favored by developers and
analysts who don’t want to write low-level Java code for a job. Pig Latin offers SQL-like
operators, but in the background it is still MapReduce that actually runs. At present, Pig's
infrastructure layer consists of a compiler that produces sequences of MapReduce programs, for
which large-scale parallel implementations already exist.
Programmers can run Pig operations line by line in Pig's Grunt shell, or they can write a whole
Pig script and run it in one go. Either way, the Pig Latin code is eventually converted into
map and reduce tasks. Pig has a component called the Pig Engine, which accepts Pig Latin
scripts or commands as input and converts them into MapReduce jobs.
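As a rough illustration of the two ways of working, here is a minimal Pig Latin sketch. The file name students.csv and its fields (name, age, gpa) are assumptions made up for this example, not part of the original material.

```pig
-- Hypothetical input: students.csv with lines like  name,age,gpa
students = LOAD 'students.csv' USING PigStorage(',')
           AS (name:chararray, age:int, gpa:double);

-- Keep only the rows we are interested in
adults = FILTER students BY age >= 18;

-- Project two of the three fields
names = FOREACH adults GENERATE name, gpa;

-- Nothing runs until an output statement such as DUMP or STORE is reached
DUMP names;
```

The same statements can be typed one at a time at the grunt> prompt, or saved to a file (say students.pig, again a made-up name) and run with: pig students.pig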
A MapReduce job that a programmer would otherwise write in Java can shrink to roughly 1/10th of
the Java code when it is written in Pig Latin.
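To give a feel for the difference in size, here is the classic word count written as a Pig Latin sketch. The input path input.txt is an assumption for illustration. A hand-written Java version of the same job usually needs a mapper class, a reducer class and a driver, so it tends to be considerably longer.

```pig
-- Hypothetical input: a plain text file, one line of text per record
lines = LOAD 'input.txt' USING TextLoader() AS (line:chararray);

-- TOKENIZE turns each line into a bag of words; FLATTEN emits one word per row
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Bring identical words together
grouped = GROUP words BY word;

-- Count the tuples in each group
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;

DUMP counts;
```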
The figure below shows the workflow of Apache Pig.
Fig: Workflow of Apache Pig
To start Pig in MapReduce mode, type the following command in the terminal.
pig
This opens Pig's Grunt shell, which runs in MapReduce mode by default.
Pig supports a rich set of operations that can do almost everything hand-coded MapReduce
programs can. Of course, hand-written Java code will still give more flexibility and control
over the MapReduce jobs being run.
Fig: Dissection of a Relation in Pig
The above diagram shows the dissection of a relation.
The data that we load in the Grunt shell using Pig Latin is called a relation.
Each row is a tuple, which is made up of fields, and a collection of tuples is called a bag.
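The terms above can be seen in a small Pig Latin sketch. The file students.csv and its fields are assumptions made up for this example.

```pig
-- 'students' is a relation; each row loaded from the file is a tuple
-- with three fields: name, age and gpa
students = LOAD 'students.csv' USING PigStorage(',')
           AS (name:chararray, age:int, gpa:double);

-- Print the schema of the relation
DESCRIBE students;

-- Grouping produces tuples with two fields:
--   group    : the age value shared by the group
--   students : a bag holding all the tuples that have that age
by_age = GROUP students BY age;
DESCRIBE by_age;
```

DESCRIBE only prints the schema, so it is a cheap way to check how tuples and bags are nested before running a job.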
Pig comes with many built-in functions, such as PigStorage, ABS, MAX and MIN, along with
relational operators such as GROUP and ORDER BY.
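A minimal sketch that puts these together is shown below. The file readings.csv, its fields (city, temp) and the reference value 20.0 are assumptions chosen only for illustration.

```pig
-- PigStorage(',') is the load function; it splits each line on commas
readings = LOAD 'readings.csv' USING PigStorage(',')
           AS (city:chararray, temp:double);

-- ABS works on a single value: how far each reading is from 20 degrees
deviations = FOREACH readings GENERATE city, temp, ABS(temp - 20.0) AS deviation;

-- GROUP collects the tuples for each city into a bag
by_city = GROUP deviations BY city;

-- MIN and MAX work on a bag of values
stats = FOREACH by_city GENERATE
            group AS city,
            MIN(deviations.temp) AS coldest,
            MAX(deviations.temp) AS hottest,
            MAX(deviations.deviation) AS worst_deviation;

-- ORDER sorts the result, hottest city first
ranked = ORDER stats BY hottest DESC;

DUMP ranked;
```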
P.S.: For the list of Pig commands, see the Pig_all_commands.txt file in the zip folder.