
PIG

 Apache Pig is a platform for data analysis.


 Apache Pig is a high-level programming language and scripting platform
that runs on Hadoop clusters, designed for analyzing large data sets.
 Pig is extensible, self-optimizing, and easily programmed.
 Programmers can use Pig to write data transformations without knowing
Java.
 Pig uses both structured and unstructured data as input to perform analytics
and uses HDFS to store the results.
 It is an alternative to MapReduce programming.

Why Pig?
 In the MapReduce framework, programs must be translated into a
sequence of Map and Reduce stages. To eliminate the complexities associated
with MapReduce, an abstraction called Pig was built on top of Hadoop.
 Developers who are not comfortable with Java often struggle while working
with Hadoop, especially when executing tasks related to the MapReduce
framework. Apache Pig is a good solution for such programmers.
 Pig Latin simplifies the work of programmers by eliminating the need to
write complex code in Java for performing MapReduce tasks.
 The multi-query approach of Apache Pig reduces the length of code
drastically and minimizes development time.

 Pig Latin is similar to SQL; if you are already familiar with SQL, it is
very easy to learn.

History of Apache Pig


 In 2006, Apache Pig was developed as a research project at Yahoo,
to simplify creating and executing MapReduce jobs on large datasets.
 In 2007, Apache Pig was open sourced via Apache incubator.
 In 2008, the first release of Apache Pig came out.
 In 2010, Apache Pig graduated as an Apache top-level project.

Pig Philosophy

Pig's design follows four stated tenets:
 Pigs eat anything: Pig operates on any kind of data, whether relational, nested, or unstructured.
 Pigs live anywhere: Pig is a language for parallel data processing, not tied exclusively to one framework, though Hadoop is its primary implementation.
 Pigs are domestic animals: Pig is designed to be easily controlled and modified by its users, for example through user-defined functions.
 Pigs fly: Pig is designed to process data quickly.
Apache Pig Architecture

Apache Pig consists of the following components.

Parser: All Pig scripts initially go through the parser component. It
conducts various checks on the script, including syntax checks, type checking,
and other miscellaneous checks.
The parser produces as output a DAG (directed acyclic graph) that
represents the Pig Latin script: the logical operators of the statements are the
nodes, and the data flows between them are the edges.

Optimizer: The directed acyclic graph is passed to the logical optimizer,
which performs logical optimizations such as projection and pushdown.

Compiler: The compiler transforms the optimized logical plan
into a sequence of MapReduce jobs.

Execution Engine: This component submits the MapReduce jobs to Hadoop
in sorted order. Finally, the MapReduce jobs are executed on
Apache Hadoop to produce the desired results.

Pig Components / Anatomy of Pig

1. PIG LATIN SCRIPT: contains the syntax, data model, and operators. A
language used to express data flows.
2. EXECUTION ENGINE: an engine on top of the Hadoop execution
environment; it takes scripts written in Pig Latin as input and converts
them into MapReduce jobs.

Pig Latin
 Pig Latin is a Hadoop extension that simplifies Hadoop programming by
providing a high-level data processing language. To analyze large
volumes of data, programmers write scripts in Pig Latin. These
scripts are then transformed internally into Map and Reduce tasks. It is a
highly flexible language and supports users in developing custom functions
for reading, writing, and processing data.
 It lets developers focus more on data analysis by minimizing the
time spent writing MapReduce programs.

Pig Latin – Statements


 While processing data using Pig Latin, statements are the basic constructs.
 These statements work with relations. They include expressions and
schemas.
 Every statement ends with a semicolon (;).
 We will perform various operations using operators provided by Pig Latin,
through statements.
 Except for LOAD and STORE, all other Pig Latin statements take a
relation as input and produce another relation as output.
 Pig Latin Statements are generally ordered as follows:

o LOAD statement that reads data from the file system.


o Series of statements to perform transformations
o DUMP or STORE to display/store result.
Example 1:
A = LOAD 'students' AS (rollno, name, gpa);
A = FILTER A BY gpa > 4.0;
A = FOREACH A GENERATE UPPER(name);
STORE A INTO 'myreport';
A is a relation.
 As soon as you enter a LOAD statement in the Grunt shell, its semantic
checking is carried out. To see the contents of the relation, you need to
use the DUMP operator. Only after the dump operation is performed will the
MapReduce job for loading the data from the file system be carried
out.

Comments in Pig Script


While writing a script in a file, we can include comments in it as shown
below.

Multi-line comments
We will begin the multi-line comments with '/*', end them with '*/'.
/* These are the multi-line comments
In the pig script */
Single-line comments
We will begin the single-line comments with '--'.
--we can write single line comments like this.

Pig Latin – Arithmetic Operators


The following table describes the arithmetic operators of Pig Latin.
Suppose a = 10 and b = 20.

+ (Addition): adds values on either side of the operator. Example: a + b gives 30.
− (Subtraction): subtracts the right-hand operand from the left-hand operand. Example: a − b gives −10.
* (Multiplication): multiplies values on either side of the operator. Example: a * b gives 200.
/ (Division): divides the left-hand operand by the right-hand operand. Example: b / a gives 2.
% (Modulus): divides the left-hand operand by the right-hand operand and returns the remainder. Example: b % a gives 0.
?: (Bincond): evaluates a Boolean expression. It has three operands: variable x = (expression) ? value1 if true : value2 if false. Example: b = (a == 1) ? 20 : 30; if a == 1 the value of b is 20, if a != 1 the value of b is 30.
CASE WHEN THEN ELSE END (Case): the case operator is equivalent to the nested bincond operator. Example: CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END
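
As a quick sketch of these operators in use (the relation nums and the file nums.txt are hypothetical, assumed to hold two comma-separated integer fields):
grunt> nums = LOAD 'nums.txt' USING PigStorage(',') as (a:int, b:int);
grunt> calc = FOREACH nums GENERATE a + b, a * b, b % a, (a == 1 ? 20 : 30);
grunt> Dump calc;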

Pig Latin – Comparison Operators


The following table describes the comparison operators of Pig Latin.

== (Equal): checks if the values of two operands are equal; if yes, the condition becomes true. Example: (a == b) is not true.
!= (Not equal): checks if the values of two operands are not equal; if the values are not equal, the condition becomes true. Example: (a != b) is true.
> (Greater than): checks if the value of the left operand is greater than the value of the right operand; if yes, the condition becomes true. Example: (a > b) is not true.
< (Less than): checks if the value of the left operand is less than the value of the right operand; if yes, the condition becomes true. Example: (a < b) is true.
>= (Greater than or equal to): checks if the value of the left operand is greater than or equal to the value of the right operand; if yes, the condition becomes true. Example: (a >= b) is not true.
<= (Less than or equal to): checks if the value of the left operand is less than or equal to the value of the right operand; if yes, the condition becomes true. Example: (a <= b) is true.
matches (Pattern matching): checks whether the string on the left-hand side matches the constant (a regular expression) on the right-hand side. Example: f1 matches '.*tutorial.*'
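
A brief sketch of these operators inside a FILTER statement (assuming the student_details relation loaded later in these notes, with fields age and city):
grunt> seniors = FILTER student_details BY age >= 23;
grunt> chennai = FILTER student_details BY city matches 'Chen.*';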

Pig Latin – Type Construction Operators


The following table describes the type construction operators of Pig Latin.

() (Tuple constructor): used to construct a tuple. Example: (Raju, 30)
{} (Bag constructor): used to construct a bag. Example: {(Raju, 30), (Mohammad, 45)}
[] (Map constructor): used to construct a map. Example: [name#Raja, age#30]
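
The literal forms above are used for constants; to build these types from existing fields, Pig also provides the built-in functions TOTUPLE, TOBAG, and TOMAP. A brief sketch (assuming the student relation loaded later in these notes):
grunt> constructed = FOREACH student GENERATE TOTUPLE(firstname, city), TOBAG(firstname, lastname), TOMAP('city', city);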

Pig Latin – Relational Operations


The following table describes the relational operators of Pig Latin.

Operator Description

Loading and Storing

LOAD To Load the data from the file system (local/HDFS) into a relation.

STORE To save a relation to the file system (local/HDFS).

Filtering

FILTER To remove unwanted rows from a relation.

DISTINCT To remove duplicate rows from a relation.

FOREACH, GENERATE To generate data transformations based on columns of data.

STREAM To transform a relation using an external program.

Grouping and Joining

JOIN To join two or more relations.

COGROUP To group the data in two or more relations.

GROUP To group the data in a single relation.

CROSS To create the cross product of two or more relations.

Sorting

ORDER To arrange a relation in sorted order based on one or more fields
(ascending or descending).

LIMIT To get a limited number of tuples from a relation.

Combining and Splitting

UNION To combine two or more relations into a single relation.

SPLIT To split a single relation into two or more relations.

Diagnostic Operators

DUMP To print the contents of a relation on the console.

DESCRIBE To describe the schema of a relation.

EXPLAIN To view the logical, physical, or MapReduce execution plans to
compute a relation.

ILLUSTRATE To view the step-by-step execution of a series of statements.

Pig Latin – Data types

S.No.  Data Type   Description & Example

1      int         Represents a signed 32-bit integer. Example: 8
2      long        Represents a signed 64-bit integer. Example: 5L
3      float       Represents a signed 32-bit floating point. Example: 5.5F
4      double      Represents a 64-bit floating point. Example: 10.5
5      chararray   Represents a character array (string) in Unicode UTF-8 format. Example: ‘tutorials point’
6      bytearray   Represents a byte array (blob).
7      boolean     Represents a Boolean value. Example: true/false
8      datetime    Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
9      biginteger  Represents a Java BigInteger. Example: 60708090709
10     bigdecimal  Represents a Java BigDecimal. Example: 185.98376256272893883

Complex Types

11     tuple       A tuple is an ordered set of fields. Example: (raja, 30)
12     bag         A bag is a collection of tuples. Example: {(raju,30),(Mohammad,45)}
13     map         A map is a set of key-value pairs. Example: [‘name’#’Raju’, ‘age’#30]
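
A small sketch of a LOAD schema that exercises several of these types (the file employees.txt and its fields are hypothetical):
grunt> emp = LOAD 'employees.txt' USING PigStorage(',')
as (id:int, name:chararray, salary:double, active:boolean);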

Apache Pig Run Modes

Local Mode
o Here Pig language makes use of a local file system and runs in a single JVM.
The local mode is ideal for analyzing small data sets.
o Here, files are installed and run using localhost.
o The local mode works on a local file system; the input and output data
are stored in the local file system.

The command for local mode grunt shell:

$ pig -x local

MapReduce Mode
o The MapReduce mode is also known as Hadoop Mode.
o It is the default mode.
o All the queries written using Pig Latin are converted into MapReduce jobs
and these jobs are run on a Hadoop cluster.
o It can be executed against semi-distributed or fully distributed Hadoop
installation.
o Here, the input and output data are present on HDFS.

The command for Map reduce mode:


$ pig

Ways to execute Pig Program
These are the following ways of executing a Pig program on local and
MapReduce mode:
o Interactive Mode / Grunt mode: In this mode, Pig is executed in the
Grunt shell. To invoke the Grunt shell, run the pig command.
Once the Grunt shell starts, we can enter Pig Latin statements and
commands interactively at the command line.
o Batch Mode / script mode: In this mode, we run a script file having
a .pig extension. These files contain Pig Latin statements.
o Embedded Mode: In this mode, we can define our own functions,
called UDFs (User Defined Functions), using programming languages
like Java and Python.

Invoking the Grunt Shell

You can invoke the Grunt shell in a desired mode (local/MapReduce) using
the −x option as shown below.

Local mode MapReduce mode

$ ./pig -x local $ ./pig -x mapreduce

Either of these commands gives you the Grunt shell prompt as shown
below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’.

After invoking the Grunt shell, you can execute a Pig script by directly
entering the Pig Latin statements in it.
grunt> customers = LOAD 'customers.txt' USING PigStorage(',');

Executing Apache Pig in Batch Mode


You can write an entire Pig Latin script in a file and execute it using the -x
option. Let us suppose we have a Pig script in a file
named sample_script.pig as shown below.

Sample_script.pig
student= LOAD 'hdfs://localhost:9000/pig_data/student.txt'
USING PigStorage(',')
as (id:int, name:chararray, city:chararray);
Dump student;

Now, you can execute the script in the above file as shown below.

Local mode MapReduce mode

$ pig -x local Sample_script.pig $ pig -x mapreduce Sample_script.pig

Pig Latin Data Model


 The data model of Pig Latin is fully nested.
 A Relation is the outermost structure of the Pig Latin data model. And it is
a bag where −
 A bag is a collection of tuples.
 A tuple is an ordered set of fields.
 A field is a piece of data.
 It allows complex non-atomic datatypes such as map and tuple.

As part of its data model, Pig supports the following basic types.

Atom

It is a simple atomic value like int, long, double, or string.
Any single value in Pig Latin, irrespective of its data type, is known as
an Atom.
It is stored as string and can be used as string and number.
int, long, float, double, chararray, and bytearray are the atomic values of Pig.
A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’

Tuple
It is a sequence of fields that can be of any data type.
A record that is formed by an ordered set of fields is known as a tuple, the
fields can be of any type.
A tuple is similar to a row in a table of RDBMS.
Example − (Raja, 30)

Bag
It is a collection of tuples of potentially varying structures and can contain
duplicates.
A bag is an unordered set of tuples. In other words, a collection of tuples
(non-unique) is known as a bag.
Each tuple can have any number of fields (flexible schema).
A bag is represented by ‘{ }’. It is similar to a table in RDBMS, but unlike a
table in RDBMS, it is not necessary that every tuple contain the same number of
fields or that the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as inner bag.
Example − {Raja, 30, {9848022338, [email protected]}}

Map
It is an associative array. A map (or data map) is a set of key-value pairs.
The key needs to be of type chararray and should be unique.
The value might be of any type. It is represented by ‘[ ]’
Example − [name#Raja, age#30]

Relation
A relation is a bag of tuples.
The relations in Pig Latin are unordered (there is no guarantee that tuples are
processed in any particular order).

Pig on Hadoop

 Pig runs on Hadoop
 Pig uses both HDFS and MapReduce programming.
 By default, Pig reads input files from HDFS, Pig stores the intermediate data
(data produced by Map Reduce jobs) and the output in HDFS.
 However, Pig can also read input from and place output to other sources.
 In general, Apache Pig works on top of Hadoop. It is an analytical tool that
analyzes large datasets that exist in the Hadoop File System.
 To analyze data using Apache Pig, we have to initially load the data into
Apache Pig.

Student ID First Name Last Name Phone City

001 Rajiv Reddy 9848022337 Hyderabad

002 siddarth Battacharya 9848022338 Kolkata

003 Rajesh Khanna 9848022339 Delhi

004 Preethi Agarwal 9848022330 Pune

005 Trupthi Mohanthy 9848022336 Bhuwaneshwar

006 Archana Mishra 9848022335 Chennai

Apache Pig - Load & Store Functions

The Load and Store functions in Apache Pig are used to determine how
data goes into and comes out of Pig. These functions are used with the LOAD
and STORE operators.
Given below is the list of load and store functions available in Pig.

S.No.  Function & Description

1      PigStorage(): to load and store structured files.
2      TextLoader(): to load unstructured data into Pig.
3      BinStorage(): to load and store data into Pig using a machine-readable format.
4      Handling compression: in Pig Latin, we can load and store compressed data.
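
For instance, a sketch of loading unstructured text with TextLoader(), which yields one line per tuple (the file log.txt is hypothetical):
grunt> lines = LOAD 'hdfs://localhost:9000/pig_data/log.txt' USING TextLoader() as (line:chararray);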

The Load Operator


You can load data into Apache Pig from the file system (HDFS/local)
using the LOAD operator of Pig Latin.
The load statement consists of two parts divided by the “=” operator. On the
left-hand side, we mention the name of the relation where we want to store
the data, and on the right-hand side, we define how and from where we load the data.
Syntax:
Relation_name = LOAD 'Input file path' USING function as schema;
Where,
 relation_name : We have to mention the relation in which we want to store
the data.

 Input file path: We have to mention the HDFS directory where the file is
stored. (In MapReduce mode)
 Function: We have to choose a function from the set of load functions
provided by Apache Pig (BinStorage, JsonLoader, PigStorage,
TextLoader).
 Schema: We have to define the schema of the data. We can define the
required schema as follows −(column1 : data type, column2 : data type,
column3 : data type);
Note − We can also load the data without specifying the schema. In that case, the fields
are addressed positionally as $0, $1, $2, and so on.
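A short sketch of positional references, reusing the student data file from the example below:
grunt> raw = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',');
grunt> names = FOREACH raw GENERATE $1, $4;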
Example
As an example, let us load the data in student_data.txt in Pig under the
schema named Student using the LOAD command.

Start the Pig Grunt Shell


First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce
mode as shown below.
$ pig -x mapreduce
It will start the Pig Grunt shell as shown below.
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType :
MAPREDUCE
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the
ExecType
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Apache Pig version
0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35

2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Logging error
messages to: /home/Hadoop/pig_1443683018078.log
2015-10-01 12:33:38,242 [main] INFO org.apache.pig.impl.util.Utils - Default
bootup file /home/Hadoop/.pigbootup not found
2015-10-01 12:33:39,630 [main]
INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to hadoop file system at: hdfs://localhost:9000

grunt>

Execute the Load Statement


Now load the data from the file student_data.txt into Pig by executing the
following Pig Latin statement in the Grunt shell.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as (id:int,firstname:chararray,lastname:chararray,phone:chararray,
city:chararray);
or
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
as(id,firstname,lastname,phone,city);

The second form can be used when the input file is tab-separated for each
column (the default delimiter); in that case, there is also no need to specify the
complete schema (data types) of the relation.

Following is the description of the above statement.

Relation name: we have stored the data in the relation student.

Input file path: we are reading data from the file student_data.txt, which is
in the /pig_data/ directory of HDFS.

Storage function: we have used the PigStorage() function. It loads and stores
data as structured text files. It takes as a parameter the delimiter by which
each entity of a tuple is separated; by default, it takes ‘\t’ as the parameter.

Schema: we have stored the data using the following schema.

column    id   firstname  lastname   phone      city
datatype  int  chararray  chararray  chararray  chararray

Note − The load statement will simply load the data into the specified relation in
Pig. To verify the execution of the Load statement, you have to use the Diagnostic
Operators.
The PigStorage() function loads and stores data as structured text files. It
takes a delimiter using which each entity of a tuple is separated as a parameter. By
default, it takes ‘\t’ as a parameter.

Syntax
grunt> PigStorage(field_delimiter)

Example

Let us suppose we have a file named student_data.txt in the HDFS
directory named /pig_data/ with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

We can load the data using the PigStorage function as shown below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
In the above example, we used the comma (‘,’) delimiter, since the values
of each record are separated by commas.

The Store Operator

You can store the loaded data in the file system using the STORE operator.
Syntax
STORE Relation_name INTO ' required_directory_path ' [USING
function];

Example
Assume we have the file student_data.txt in HDFS with the content
shown earlier.
And we have read it into a relation student using the LOAD operator as shown
below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as(id:int,firstname:chararray,lastname:chararray,phone:chararray,
city:chararray);

Now, let us store the relation in the HDFS directory “/pig_Output/” as shown
below.
grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING
PigStorage(',');

Output
After executing the store statement, you will get the following output.
A directory is created with the specified name and the data will be stored in
it.
The load statement will simply load the data into the specified relation in Apache
Pig. To verify the execution of the Load statement, you have to use the Diagnostic
Operators.

Pig Latin provides four different types of diagnostic operators −


 Dump operator
 Describe operator
 Explain operator
 Illustrate operator

Dump Operator
The Dump operator is used to run the Pig Latin statements and display the
results on the screen.

It is generally used for debugging purposes.
Syntax
grunt> Dump Relation_Name

Describe Operator
The describe operator is used to view the schema of a relation.
Syntax
grunt> Describe Relation_name

Explain Operator
The explain operator is used to display the logical, physical, and
MapReduce execution plans of a relation.
Syntax
grunt> explain Relation_name;

Illustrate Operator
The illustrate operator gives you the step-by-step execution of a sequence of
statements.
Syntax
grunt> illustrate Relation_name;
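
For instance, once the student relation from the earlier LOAD example exists, the four diagnostic operators can be tried directly:
grunt> Dump student;
grunt> Describe student;
grunt> explain student;
grunt> illustrate student;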

Group Operator

The GROUP operator is used to group the data in one or more relations. It
collects the data having the same key.
Syntax
grunt> group_data = GROUP Relation_name BY age;
Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Apache Pig with the relation
name student_details as shown below.
grunt> student_details= LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,
city:chararray);

Now, let us group the records/tuples in the relation by age as shown below.
grunt> group_data = GROUP student_details by age;

We can verify the relation group_data using the DUMP operator as shown
below.
grunt> Dump group_data;

Output
Then you will get output displaying the contents of the relation
named group_data as shown below.
Here you can observe that the resulting schema has two columns −
 One is age, by which we have grouped the relation.
 The other is a bag, which contains the group of tuples, student records with
the respective age.

(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})
You can see the schema of the table after grouping the data using
the describe command as shown below.
grunt> Describe group_data;
group_data: {group: int,student_details: {(id: int,firstname: chararray,
lastname: chararray,age: int,phone: chararray,city: chararray)}}

Try: grunt> illustrate group_data;

Try: grunt> explain group_data;

Grouping by Multiple Columns


Let us group the relation by age and city as shown below.
grunt> group_multiple = GROUP student_details by (age, city);
You can verify the content of the relation named group_multiple using the Dump
operator, as sketched below.
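
A sketch of the expected output: each distinct (age, city) pair in student_details.txt forms its own group (the exact ordering may vary).
grunt> Dump group_multiple;
((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
((24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})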

Joins in Pig

The JOIN operator is used to combine records from two or more
relations. While performing a join operation, we declare one (or a group of)
field(s) from each relation as keys. When these keys match, the two particular
tuples are matched; otherwise the records are dropped.
Joins can be of the following types −
 Self-join
 Inner-join
 Outer-join − left join, right join, and full join

Assume that we have two files namely customers.txt and orders.txt in


the /pig_data/ directory of HDFS as shown below.

customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00

4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

And we have loaded these two files into Pig with the
relations customers and orders as shown below.
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING
PigStorage(',')
as(id:int,name:chararray,age:int,address:chararray,salary:int);

grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING


PigStorage(',')
as(oid:int,date:chararray,customer_id:int,amount:int);

Let us now perform various Join operations on these two relations.

Self - join
Self-join is used to join a table with itself as if the table were two relations,
temporarily renaming at least one relation.

Generally, in Apache Pig, to perform self-join, we will load the same data
multiple times, under different aliases (names). Therefore let us load the contents
of the file customers.txt as two tables as shown below.
grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt'
USING PigStorage(',')
as(id:int,name:chararray,age:int,address:chararray,salary:int);

grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt'


USING PigStorage(',')
as(id:int,name:chararray,age:int,address:chararray,salary:int);

Syntax
Given below is the syntax of performing self-join operation using
the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY key,
Relation2_name BY key ;
Example
Let us perform self-join operation on the relation customers, by joining the
two relations customers1 and customers2 as shown below.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

Verify the relation customers3 using the DUMP operator as shown below.
grunt>Dump customers3;

Output

It will produce the following output, displaying the contents of the
relation customers3.
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

Inner Join
Inner Join is used quite frequently; it is also referred to as equijoin. An
inner join returns rows when there is a match in both tables.
It creates a new relation by combining column values of two relations (say A
and B) based upon the join-predicate. The query compares each row of A with
each row of B to find all pairs of rows which satisfy the join-predicate. When the
join-predicate is satisfied, the column values for each matched pair of rows of A
and B are combined into a result row.

Syntax
Here is the syntax of performing inner join operation using
the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY
columnname;
Example
Let us perform inner join operation on the two relations customers and
orders as shown below.

grunt> customer_orders = JOIN customers BY id, orders BY customer_id;

Verify the relation customer_orders using the DUMP operator as shown below.
grunt> Dump customer_orders;

Output
You will get the following output, showing the contents of the relation
named customer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Note:
Outer Join: Unlike inner join, outer join returns all the rows from at least one of
the relations.
An outer join operation is carried out in three ways −
 Left outer join
 Right outer join
 Full outer join

Left Outer Join


The left outer Join operation returns all rows from the left table, even if
there are no matches in the right relation.
Syntax
Given below is the syntax of performing left outer join operation using
the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER,
Relation2_name BY customer_id;
Example
Let us perform left outer join operation on the two relations customers and
orders as shown below.
grunt>outer_left= JOIN customers BY id LEFT OUTER, orders BY customer_id;
Verification
Verify the relation outer_left using the DUMP operator as shown below.
grunt>Dump outer_left;
Output
It will produce the following output, displaying the contents of the
relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Right Outer Join


The right outer join operation returns all rows from the right table, even if
there are no matches in the left table.
Syntax
Given below is the syntax of performing right outer join operation using
the JOIN operator.

grunt> outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;
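
Since every order in orders.txt has a matching customer, the right outer join on this data set returns the same rows as the inner join (ordering may vary):
grunt> Dump outer_right;
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)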

Full Outer Join


The full outer join operation returns rows when there is a match in
either of the relations.
Syntax
Given below is the syntax of performing full outer join using
the JOIN operator.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
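
On this data set, the full outer join result matches the left outer join output shown earlier: every order has a matching customer, so only the customer side contributes unmatched (null-padded) rows. Verify with:
grunt> Dump outer_full;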

Using Multiple Keys


We can perform JOIN operation using multiple keys.
Syntax
Here is how you can perform a JOIN operation on two tables using multiple
keys.
grunt> Relation3_name = JOIN Relation1_name BY (key1, key2),
Relation2_name BY (key1, key2);
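
For example, a sketch with two hypothetical relations, employee and employee_contact, both keyed by (id, jobid):
grunt> emp_data = JOIN employee BY (id, jobid), employee_contact BY (id, jobid);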

Cross Operator
The CROSS operator computes the cross-product of two or more relations.
Syntax
Given below is the syntax of the CROSS operator.
grunt> Relation3_name = CROSS Relation1_name, Relation2_name;
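
For example, using the customers and orders relations loaded earlier:
grunt> cross_data = CROSS customers, orders;
grunt> Dump cross_data;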

Foreach Operator

The FOREACH operator is used to generate specified data transformations
based on the column data.
Syntax
Given below is the syntax of FOREACH operator.
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required
data);

grunt> foreach_data = FOREACH student_details GENERATE id, age, city;

grunt> foreach_data = FOREACH student_details GENERATE age > 25;

Order By Operator
The ORDER BY operator is used to display the contents of a relation in a
sorted order based on one or more fields.
Syntax
Given below is the syntax of the ORDER BY operator.
grunt> Relation_name2 = ORDER Relation_name1 BY field_name (ASC|DESC);

grunt> order_by_data = ORDER student_details BY age DESC;

Limit Operator
The LIMIT operator is used to get a limited number of tuples from a
relation.
Syntax
Given below is the syntax of the LIMIT operator.
grunt> Result = LIMIT Relation_name required number of tuples;
Example
grunt> student_details= LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,
city:chararray);

grunt> limit_data = LIMIT student_details 4;

Verify the relation limit_data using the DUMP operator as shown below.
grunt> Dump limit_data;

Apache Pig - User Defined Functions


In addition to the built-in functions, Apache Pig provides extensive support
for User Defined Functions (UDFs). Using these UDFs, we can define our own
functions and use them. UDF support is provided in six programming
languages, namely, Java, Jython, Python, JavaScript, Ruby and Groovy.
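
A minimal sketch of using a Java UDF from Pig Latin (the jar sample_udf.jar and class com.example.SampleEval are hypothetical names):
grunt> REGISTER sample_udf.jar;
grunt> DEFINE sample_eval com.example.SampleEval();
grunt> result = FOREACH student_details GENERATE sample_eval(firstname);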

Executing Pig Script in Batch mode
While executing Apache Pig statements in batch mode, follow the steps
given below.
Step 1
Write all the required Pig Latin statements in a single file. We can write all
the Pig Latin statements and commands in a single file and save it as .pig file.
Step 2
Execute the Apache Pig script. You can execute the Pig script from the shell
(Linux) as shown below.

Local mode MapReduce mode

$ pig -x local Sample_script.pig $ pig -x mapreduce Sample_script.pig

You can execute it from the Grunt shell as well, using the exec (or run) command as
shown below.
grunt> exec sample_script.pig

Executing a Pig Script from HDFS


We can also execute a Pig script that resides in the HDFS. Suppose
there is a Pig script with the name Sample_script.pig in the HDFS directory
named /pig_data/. We can execute it as shown below.
$ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig

Features of Pig
 It provides an engine for executing data flows (how your data should flow).
 Pig processes data in parallel on the Hadoop cluster
 It provides a language called “Pig Latin” to express data flows.
 Rich set of operators − It provides many operators to perform operations
like join, sort, filter, etc.
 Ease of programming − Pig Latin is similar to SQL, and if you are good at
SQL it is easy to write a Pig script. Pig scripts are also easier to understand
and maintain (roughly 1/20th the lines of code and 1/16th the development
time of the equivalent MapReduce program).
 Optimization opportunities − The tasks in Apache Pig optimize their
execution automatically, so the programmers need to focus only on
semantics of the language.
 Extensibility − Using the existing operators, users can develop their own
functions to read, process, and write data.
 UDF’s − Pig provides the facility to create User-defined Functions in other
programming languages such as Java and invoke or embed them in Pig
Scripts.
 Handles all kinds of data − Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.

Advantages of Apache Pig


o Less code − Pig needs far fewer lines of code to perform an operation.
o Reusability − Pig code is flexible enough to be reused.
o Nested data types − Pig provides useful nested data types such as tuple,
bag, and map.

Applications of Apache Pig
Apache Pig is generally used by data scientists for performing tasks involving
ad-hoc processing and quick prototyping. Apache Pig is used −
 To process huge data sources such as web logs.
 To perform data processing for search platforms.
 To process time sensitive data loads.

Apache Pig Vs Hive
 Both Apache Pig and Hive are used to create MapReduce jobs. And in some
cases, Hive operates on HDFS in a similar way Apache Pig does. In the
following table, we have listed a few significant points that set Apache Pig
apart from Hive.

Apache Pig:  uses a language called Pig Latin. It was originally created at Yahoo.
Hive:        uses a language called HiveQL. It was originally created at Facebook.

Apache Pig:  Pig Latin is a data flow language.
Hive:        HiveQL is a query processing language.

Apache Pig:  Pig Latin is a procedural language and it fits in the pipeline paradigm.
Hive:        HiveQL is a declarative language.

Apache Pig:  can handle structured, unstructured, and semi-structured data.
Hive:        is mostly for structured data.

Differences between Apache MapReduce and PIG

Apache MapReduce:  It is a low-level data processing tool.
Apache Pig:        It is a high-level data flow tool.

Apache MapReduce:  It is required to develop complex programs using Java or Python.
Apache Pig:        It is not required to develop complex programs.

Apache MapReduce:  It is difficult to perform data operations in MapReduce.
Apache Pig:        It provides built-in operators to perform data operations like
                   union, sorting, and ordering.

Apache MapReduce:  It doesn't allow nested data types.
Apache Pig:        It provides nested data types like tuple, bag, and map.

Apache Pig Vs SQL


Listed below are the major differences between Apache Pig and SQL.

Pig:  Pig Latin is a procedural language.
SQL:  SQL is a declarative language.

Pig:  In Apache Pig, schema is optional. We can store data without designing
      a schema (values are addressed as $0, $1, etc.).
SQL:  Schema is mandatory in SQL.

Pig:  The data model in Apache Pig is nested relational.
SQL:  The data model used in SQL is flat relational.

Pig:  Apache Pig provides limited opportunity for query optimization.
SQL:  There is more opportunity for query optimization in SQL.

In addition to the above differences, Apache Pig Latin −


 Allows splits in the pipeline.
 Allows developers to store data anywhere in the pipeline.
 Declares execution plans.
 Provides operators to perform ETL (Extract, Transform, and Load)
functions.
