Apache Pig
What is Pig?
• Apache Pig is an abstraction over MapReduce.
• It is a tool/platform used to analyze large data sets by
representing them as data flows.
• Pig is generally used with Hadoop; we can perform all the data
manipulation operations in Hadoop using Apache Pig.
• To write data analysis programs, Pig provides a high-level language
known as Pig Latin.
• This language provides various operators using which programmers
can develop their own functions for reading, writing, and
processing data.
Pig Architecture & Components
• To analyze data using Apache Pig, programmers write
scripts in the Pig Latin language.
• All these scripts are internally converted to Map and Reduce
tasks.
• Apache Pig has a component known as Pig Engine that accepts
the Pig Latin scripts as input and converts those scripts into
MapReduce jobs.
Features of Pig
• Rich set of operators: It provides many operators to perform operations like join,
sort, filter, etc.
• Ease of programming: Pig Latin is similar to SQL and it is easy to write a Pig script
if you are good at SQL.
• Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured
as well as unstructured. It stores the results in HDFS.
Apache Pig Vs Hive
• Both Apache Pig and Hive are used to create MapReduce jobs, and in some cases
Hive operates on HDFS in a way similar to Apache Pig.
Pig Latin – Data Model
Pig Execution Modes
• You can run Apache Pig in two modes.
• Local Mode
– In this mode, Pig runs on a single local host and reads and writes files on the
local file system. Hadoop and HDFS are not needed. This mode is
generally used for testing.
• MapReduce Mode
– MapReduce mode is where we load or process the data that exists in the
Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we
execute the Pig Latin statements to process the data, a MapReduce job is
invoked in the back-end to perform a particular operation on the data that
exists in the HDFS.
Invoking the Grunt Shell
• Local Mode
• $ pig -x local
• MapReduce mode
• $ pig -x mapreduce (or) pig
Execution Mechanisms
• Interactive Mode (Grunt shell) – You can run Apache Pig in interactive
mode using the Grunt shell. In this shell, you can enter the Pig Latin
statements and get the output (using Dump operator).
• Batch Mode (Script) – You can run Apache Pig in Batch mode by writing the
Pig Latin script in a single file with .pig extension.
• Embedded Mode (UDF) – Apache Pig lets us define our own functions
(User Defined Functions) in programming languages such as Java and
use them in our scripts.
• Interactive Mode:
grunt> customers = LOAD '/home/cloudera/customers.txt' USING
PigStorage(',');
grunt> dump customers;
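For comparison, the same two statements could be run in batch mode by saving them in a script file. This is only a sketch: the file name sample_script.pig is a hypothetical choice, and the invocation shown in the comment assumes local mode.

```pig
-- sample_script.pig (hypothetical file name)
-- Run from the command line with:  pig -x local sample_script.pig

-- Load the comma-separated file; no schema is declared, as in the Grunt example.
customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',');

-- Print the contents of the relation to the console.
DUMP customers;
```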
• Let us now split the relation into two: one listing the students younger than
23, and the other listing the students aged between 23 and 25.
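A split like this can be sketched with the SPLIT operator. The student relation itself is not shown here, so the file name, the schema, and the relation names below are assumptions for illustration only, and "between 23 and 25" is read as inclusive.

```pig
-- Hypothetical input file and schema; the original student relation is not shown.
students = LOAD 'student_details.txt' USING PigStorage(',')
           AS (id:int, name:chararray, age:int, city:chararray);

-- SPLIT sends each tuple to every output relation whose condition it satisfies.
-- Tuples that match no condition (here, age above 25) are dropped.
SPLIT students INTO
    younger IF age < 23,
    older   IF (age >= 23 AND age <= 25);

DUMP younger;
DUMP older;
```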
• Example
• In this example the number of characters in the first field is computed.
• emp.txt
001,Robin,22,newyork
002,Stacy,25,Bhuwaneshwar
003,Kelly,22,Chennai
• grunt> emp_data = LOAD '/home/Cloudera/emp.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
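The statement that counts characters is not shown, so the following is only a sketch of one way to do it with the built-in SIZE function. It is applied to the name field as an assumption, because id is loaded as an int; the alias name_length is also just illustrative.

```pig
-- SIZE applied to a chararray returns the number of characters in it.
name_length = FOREACH emp_data GENERATE name, SIZE(name);

DUMP name_length;
```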
• date.txt
001,1989/09/26 09:00:00
002,1980/06/20 10:22:00
003,1990/12/19 03:11:44
• grunt> date_data = LOAD '/home/cloudera/date.txt' USING
PigStorage(',') as (id:int,date:chararray);
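With the dates loaded as chararray values, the built-in date functions can be applied once the strings are parsed. The sketch below assumes the format pattern 'yyyy/MM/dd HH:mm:ss' matches the sample rows; the aliases parsed, parts, and dt are illustrative names.

```pig
-- ToDate parses a chararray into a datetime value using the given pattern.
parsed = FOREACH date_data GENERATE id, ToDate(date, 'yyyy/MM/dd HH:mm:ss') AS dt;

-- GetYear, GetMonth and GetDay extract individual components from a datetime.
parts = FOREACH parsed GENERATE id, GetYear(dt), GetMonth(dt), GetDay(dt);

DUMP parts;
```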
UDFs
User Defined Functions
• Apache Pig provides extensive support
for User Defined Functions (UDFs).
• Using UDFs, we can define our own functions and use
them in Pig Latin scripts.
• UDF support is provided in six programming languages:
Java, Jython, Python, JavaScript, Ruby, and Groovy.
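As a rough sketch of how a compiled Java UDF is typically used from Pig Latin, the jar is registered and the function is given a short alias. The jar path, the class name com.example.pig.ToUpper, and the alias to_upper are all hypothetical; emp_data is assumed to be the relation loaded in the earlier example.

```pig
-- Hypothetical jar path: make the classes in the jar available to the script.
REGISTER '/path/to/my_udfs.jar';

-- Hypothetical class name: bind a short alias to the UDF class.
DEFINE to_upper com.example.pig.ToUpper();

-- Call the UDF like any built-in function.
upper_names = FOREACH emp_data GENERATE to_upper(name);

DUMP upper_names;
```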
Creating UDFs
• Open Eclipse and create a new project.
• Convert the newly created project into a Maven project.
• Copy the pom.xml. This file contains the Maven dependencies
for Apache Pig and Hadoop-core jar files.
Java code
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;