
Apache Pig

Apache Pig is a high-level programming language especially designed for analyzing
large data sets.
In the MapReduce framework, programs are required to be translated into a
sequence of Map and Reduce stages. To eliminate the complexities associated with
MapReduce, an abstraction called Pig was built on top of Hadoop.
 Apache Pig allows developers to write data analysis programs using Pig
Latin. This is a highly flexible language and supports users in developing
custom functions for reading, writing, and processing data.
 It enables developers to focus more on data analysis by minimizing the time
taken for writing MapReduce programs.
 In order to analyze large volumes of data, programmers write scripts using
the Pig Latin language. These scripts are then transformed internally into Map and
Reduce tasks.
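As an illustration of this idea, a minimal Pig Latin script is sketched below; the file name and its columns are assumptions, but a task like this would otherwise require a full hand-written MapReduce program.

grunt> logs = LOAD 'web_logs.txt' USING PigStorage(',') as (url:chararray, hits:int);
grunt> grouped = GROUP logs BY url;
grunt> totals = FOREACH grouped GENERATE group AS url, SUM(logs.hits) AS total_hits;
grunt> STORE totals INTO 'url_totals';

Pig internally compiles these four statements into a sequence of Map and Reduce stages and runs them on the cluster.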
Why do we need Apache Pig?
Developers who are not good at Java struggle a lot while working with Hadoop,
especially when executing tasks related to the MapReduce framework.
Apache Pig is the best solution for all such programmers.
 Pig Latin simplifies the work of programmers by eliminating the need to write
complex code in Java for performing MapReduce tasks.
 The multi-query approach of Apache Pig reduces the length of code drastically
and minimizes development time.
 Pig Latin is quite similar to SQL, so if you are familiar with SQL it
becomes very easy for you to learn.
Apache Pig – History
 In 2006, Apache Pig was developed as a research project at Yahoo,
especially to create and execute MapReduce jobs on every dataset.
 In 2007, Apache Pig was open sourced via Apache incubator.
 In 2008, the first release of Apache Pig came out.
 In 2010, Apache Pig graduated as an Apache top-level project.
Pig Philosophy
Apache Pig Architecture

Components of Pig Architecture
As shown in the above diagram, Apache Pig consists of various components.
Parser: All Pig scripts initially go through the parser component. It conducts
various checks, which include syntax checking of the script, type checking, and other
miscellaneous checks. The parser produces as output a DAG (directed
acyclic graph) which represents the Pig Latin statements and logical operators. In
the DAG, the logical operators of the script are represented as nodes and the data
flows are represented as edges.
Optimizer: The directed acyclic graph is passed to the logical optimizer, which
performs logical optimizations such as projection pushdown.
Compiler : The compiler component transforms the optimized logical plan into a
sequence of MapReduce jobs.
Execution Engine: This component submits the MapReduce jobs in sorted order
to Hadoop. Finally, these MapReduce jobs are executed on Apache Hadoop to
produce desired results.

Pig Components

1. Pig Latin - a language used to express data flows.

2. Pig Engine - an engine on top of the Hadoop 2 execution environment; it takes the
scripts written in Pig Latin as input and converts them into MapReduce
jobs.

Apache Pig Run Modes


In Hadoop, Pig can be executed in two different modes:
Local Mode
o Here Pig makes use of the local file system and runs in a single JVM.
The local mode is ideal for analyzing small data sets.
o Here, files are read from and written to the local machine (localhost).
o The local mode works on the local file system. The input and output data are
stored in the local file system.
The command for the local mode grunt shell:
$ pig -x local

MapReduce Mode
o The MapReduce mode is also known as Hadoop Mode.
o It is the default mode.
o All the queries written using Pig Latin are converted into MapReduce jobs and
these jobs are run on a Hadoop cluster.
o It can be executed against semi-distributed or fully distributed Hadoop
installation.
o Here, the input and output data are present on HDFS.

The command for MapReduce mode:
$ pig

When to Use Each Mode


 Use Local Mode for small datasets, unit tests, debugging, or quick checks
where distributed processing is unnecessary.
 Use MapReduce Mode for production, large datasets, or whenever you want
to leverage Hadoop’s distributed computing capabilities.

Ways to execute Pig Program


The following are the ways of executing a Pig program in local and MapReduce
mode:
o Interactive Mode - In this mode, Pig is executed in the Grunt shell. To
invoke the Grunt shell, run the pig command. Once the Grunt shell starts, we
can enter Pig Latin statements and commands interactively at the command
line.
o Batch Mode - In this mode, we can run a script file having a .pig extension.
These files contain Pig Latin commands.
o Embedded Mode - In this mode, we can define our own functions, known
as UDFs (User Defined Functions). Here, we use
programming languages like Java and Python.
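Once such a UDF has been compiled into a jar, it can be registered and invoked from Pig Latin. A hypothetical sketch follows; the jar name, class name, and input file are assumptions, not part of these notes.

grunt> REGISTER myudfs.jar;
grunt> DEFINE UPPER com.example.pig.UpperCase();
grunt> student = LOAD 'student_data.txt' USING PigStorage(',') as (id:int, name:chararray);
grunt> upper_names = FOREACH student GENERATE id, UPPER(name);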

Pig Latin Data Model


The data model of Pig Latin is fully nested and it allows complex non-atomic
datatypes such as map and tuple. Given below is the diagrammatic representation
of Pig Latin's data model.

Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It
is stored as a string and can be used as a string or a number. int, long, float, double,
chararray, and bytearray are the atomic values of Pig. A piece of data or a simple
atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple, the fields can
be of any type. A tuple is similar to a row in a table of RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non-
unique) is known as a bag. Each tuple can have any number of fields (flexible
schema). A bag is represented by ‘{ }’. It is similar to a table in RDBMS, but unlike a
table in RDBMS, it is not necessary that every tuple contain the same number of
fields or that the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as inner bag.
Example − {Raja, 30, {9848022338, [email protected]}}
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type
chararray and should be unique. The value can be of any type. A map is
represented by ‘[ ]’.
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no
guarantee that tuples are processed in any particular order).
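These types can be combined in a single LOAD schema. The sketch below is illustrative only; the file, delimiter, and field names are assumptions.

grunt> students = LOAD 'students.txt' USING PigStorage('|') as
         (name:chararray, age:int,
          courses:bag{t:tuple(course:chararray)},
          contact:map[chararray]);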

Apache Pig comes with the following features :


Rich set of operators − It provides many operators to perform operations
like join, sort, filter, etc.
Ease of programming − Pig Latin is similar to SQL and it is easy to write a
Pig script if you are good at SQL.
Optimization opportunities − The tasks in Apache Pig optimize their
execution automatically, so the programmers need to focus only on
semantics of the language.
Extensibility − Using the existing operators, users can develop their own
functions to read, process, and write data.
UDFs − Pig provides the facility to create User Defined Functions in other
programming languages such as Java, and to invoke or embed them in Pig scripts.
Handles all kinds of data − Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.
Advantages of Apache Pig
o Less code - Pig requires fewer lines of code to perform any operation.
o Reusability - Pig code is flexible enough to be reused.
o Nested data types - The Pig provides a useful concept of nested data types
like tuple, bag, and map.
Applications of Apache Pig
Apache Pig is generally used by data scientists for performing tasks involving ad-hoc
processing and quick prototyping. Apache Pig is used −
 To process huge data sources such as web logs.
 To perform data processing for search platforms.
 To process time-sensitive data loads.

Differences between Apache MapReduce and PIG


Apache MapReduce                            Apache Pig

It is a low-level data processing tool.     It is a high-level data flow tool.

It is required to develop complex           It is not required to develop complex
programs using Java or Python.              programs.

It is difficult to perform data             It provides built-in operators to perform
operations in MapReduce.                    data operations like union, sorting and
                                            ordering.

It doesn't allow nested data types.         It provides nested data types like tuple,
                                            bag, and map.

Apache Pig Vs SQL


Listed below are the major differences between Apache Pig and SQL.
Pig                                                    SQL

Pig Latin is a procedural language.                    SQL is a declarative language.

In Apache Pig, schema is optional. We can store        Schema is mandatory in SQL.
data without designing a schema (columns are
addressed as $0, $1, etc.).

The data model in Apache Pig is nested relational.     The data model used in SQL is
                                                       flat relational.

Apache Pig provides limited opportunity for query      There is more opportunity for
optimization.                                          query optimization in SQL.

In addition to the above differences, Apache Pig Latin −
 Allows splits in the pipeline.
 Allows developers to store data anywhere in the pipeline.
 Declares execution plans.
 Provides operators to perform ETL (Extract, Transform, and Load) functions.
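The pipeline split mentioned above can be sketched in Pig Latin; the relation and field names here are hypothetical.

grunt> students = LOAD 'student_details.txt' USING PigStorage(',') as (id:int, name:chararray, age:int);
grunt> SPLIT students INTO minors IF age < 21, adults IF age >= 21;
grunt> STORE minors INTO 'minors_out';
grunt> STORE adults INTO 'adults_out';

A single input relation is routed into two output relations in one pass, and data can be stored at any point in the pipeline.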
Apache Pig Vs Hive
 Both Apache Pig and Hive are used to create MapReduce jobs. And in some
cases, Hive operates on HDFS in a similar way Apache Pig does. In the
following table, we have listed a few significant points that set Apache Pig
apart from Hive.
Apache Pig                                    Hive

Apache Pig uses a language called Pig         Hive uses a language called HiveQL. It
Latin. It was originally created at Yahoo.    was originally created at Facebook.

Pig Latin is a data flow language.            HiveQL is a query processing language.

Pig Latin is a procedural language and it     HiveQL is a declarative language.
fits in the pipeline paradigm.

Apache Pig can handle structured,             Hive is mostly for structured data.
unstructured, and semi-structured data.

Invoking the Grunt Shell


You can invoke the Grunt shell in a desired mode (local/MapReduce) using
the -x option as shown below.

Local mode:
$ ./pig -x local

MapReduce mode:
$ ./pig -x mapreduce (or simply $ pig)

Either of these commands gives you the Grunt shell prompt as shown below.
grunt>

You can exit the Grunt shell using ‘ctrl + d’ or the quit command.
After invoking the Grunt shell, you can execute a Pig script by directly entering the
Pig Latin statements in it.
grunt> customers = LOAD 'customers.txt' USING PigStorage(',');
Executing Apache Pig in Batch Mode
You can write an entire Pig Latin script in a file and pass the file to the pig
command (using the -x option to pick the mode). Let us suppose we have a Pig
script in a file named sample_script.pig as shown below.
sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',') as (id:int,name:chararray,city:chararray);

Dump student;

Now, you can execute the script in the above file as shown below.

Local mode:
$ pig -x local sample_script.pig

MapReduce mode:
$ pig -x mapreduce sample_script.pig

Pig Latin – Data Model


The data model of Pig is fully nested. A Relation is the outermost structure of
the Pig Latin data model. And it is a bag where −
 A bag is a collection of tuples.
 A tuple is an ordered set of fields.
 A field is a piece of data.
Pig Latin – Statements
While processing data using Pig Latin, statements are the basic constructs.
 These statements work with relations. They
include expressions and schemas.
 Every statement ends with a semicolon (;).
 We will perform various operations using operators provided by Pig Latin,
through statements.
 Except for LOAD and STORE, all other Pig Latin
statements take a relation as input and produce another relation as output.
 As soon as you enter a Load statement in the Grunt shell, only its semantic
checking is carried out. To see the contents of the relation, you need to
use the Dump operator. Only after performing the dump operation will the
MapReduce job for loading the data from the file system be carried
out.
Example
Given below is a Pig Latin statement, which loads data to Apache Pig.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',') as
( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Pig Latin – Data types


The table given below describes the Pig Latin data types.

S.N.  Data Type    Description & Example

1     int          Represents a signed 32-bit integer.
                   Example: 8

2     long         Represents a signed 64-bit integer.
                   Example: 5L

3     float        Represents a signed 32-bit floating point.
                   Example: 5.5F

4     double       Represents a 64-bit floating point.
                   Example: 10.5

5     chararray    Represents a character array (string) in Unicode UTF-8 format.
                   Example: ‘tutorials point’

6     bytearray    Represents a byte array (blob).

7     boolean      Represents a Boolean value.
                   Example: true/false

8     datetime     Represents a date-time.
                   Example: 1970-01-01T00:00:00.000+00:00

9     biginteger   Represents a Java BigInteger.
                   Example: 60708090709

10    bigdecimal   Represents a Java BigDecimal.
                   Example: 185.98376256272893883

Complex Types

11    tuple        A tuple is an ordered set of fields.
                   Example: (raja, 30)

12    bag          A bag is a collection of tuples.
                   Example: {(raju,30),(Mohhammad,45)}

13    map          A map is a set of key-value pairs.
                   Example: [‘name’#‘Raju’, ‘age’#30]

Pig Latin – Arithmetic Operators


The following table describes the arithmetic operators of Pig Latin. Suppose a = 10
and b = 20.

Operator  Description                                            Example

+         Addition − Adds values on either side of the           a + b will give 30
          operator.

−         Subtraction − Subtracts the right-hand operand         a − b will give −10
          from the left-hand operand.

*         Multiplication − Multiplies values on either side      a * b will give 200
          of the operator.

/         Division − Divides the left-hand operand by the        b / a will give 2
          right-hand operand.

%         Modulus − Divides the left-hand operand by the         b % a will give 0
          right-hand operand and returns the remainder.

?:        Bincond − Evaluates a Boolean expression. It has       b = (a == 1) ? 20 : 30;
          three operands, as shown below.                        if a == 1, the value of b is 20;
          x = (expression) ? value-if-true : value-if-false      if a != 1, the value of b is 30.

CASE      Case − The case operator is equivalent to a            CASE f2 % 2
WHEN      nested bincond operator.                               WHEN 0 THEN 'even'
THEN                                                             WHEN 1 THEN 'odd'
ELSE                                                             END
END
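A sketch of these operators inside a FOREACH, assuming the student_details relation loaded later in these notes (the output field names here are made up for illustration):

grunt> marks = FOREACH student_details GENERATE id, age + 1 AS next_age,
              (age % 2 == 0 ? 'even' : 'odd') AS parity;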

Pig Latin – Comparison Operators


The following table describes the comparison operators of Pig Latin.

Operator  Description                                                Example

==        Equal − Checks if the values of two operands are equal;    (a == b) is not true.
          if yes, then the condition becomes true.

!=        Not Equal − Checks if the values of two operands are       (a != b) is true.
          equal. If the values are not equal, then the condition
          becomes true.

>         Greater than − Checks if the value of the left operand     (a > b) is not true.
          is greater than the value of the right operand. If yes,
          then the condition becomes true.

<         Less than − Checks if the value of the left operand is     (a < b) is true.
          less than the value of the right operand. If yes, then
          the condition becomes true.

>=        Greater than or equal to − Checks if the value of the      (a >= b) is not true.
          left operand is greater than or equal to the value of
          the right operand. If yes, then the condition becomes
          true.

<=        Less than or equal to − Checks if the value of the left    (a <= b) is true.
          operand is less than or equal to the value of the right
          operand. If yes, then the condition becomes true.

matches   Pattern matching − Checks whether the string on the        f1 matches '.*tutorial.*'
          left-hand side matches the pattern on the right-hand
          side.
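These operators typically appear in FILTER conditions; a short sketch, assuming the student_details relation used later in these notes:

grunt> adults = FILTER student_details BY age >= 22;
grunt> chennai = FILTER student_details BY city matches 'Chen.*';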

Pig Latin – Type Construction Operators


The following table describes the type construction operators of Pig Latin.

Operator  Description                                          Example

()        Tuple constructor operator − This operator is        (Raju, 30)
          used to construct a tuple.

{}        Bag constructor operator − This operator is used     {(Raju, 30),
          to construct a bag.                                  (Mohammad, 45)}

[]        Map constructor operator − This operator is used     [name#Raja, age#30]
          to construct a map.

Pig Latin – Relational Operations


The following table describes the relational operators of Pig Latin.
Operator Description

Loading and Storing

LOAD To load the data from the file system (local/HDFS) into a relation.

STORE To save a relation to the file system (local/HDFS).

Filtering

FILTER To remove unwanted rows from a relation.

DISTINCT To remove duplicate rows from a relation.

FOREACH, GENERATE To generate data transformations based on columns of data.

STREAM To transform a relation using an external program.

Grouping and Joining

JOIN To join two or more relations.

COGROUP To group the data in two or more relations.

GROUP To group the data in a single relation.

CROSS To create the cross product of two or more relations.

Sorting
To arrange a relation in a sorted order based on one or more
ORDER
fields (ascending or descending).

LIMIT To get a limited number of tuples from a relation.

Combining and Splitting

UNION To combine two or more relations into a single relation.

SPLIT To split a single relation into two or more relations.

Diagnostic Operators

DUMP To print the contents of a relation on the console.

DESCRIBE To describe the schema of a relation.

EXPLAIN To view the logical, physical, or MapReduce execution plans to compute a relation.

ILLUSTRATE To view the step-by-step execution of a series of statements.
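The operators above compose into pipelines. A short sketch follows; the file, field names, and threshold are assumptions made for illustration:

grunt> emp = LOAD 'emp.txt' USING PigStorage(',') as (name:chararray, dept:chararray, salary:int);
grunt> high = FILTER emp BY salary > 50000;
grunt> by_dept = GROUP high BY dept;
grunt> counts = FOREACH by_dept GENERATE group AS dept, COUNT(high) AS n;
grunt> sorted = ORDER counts BY n DESC;
grunt> top = LIMIT sorted 5;
grunt> DUMP top;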

In general, Apache Pig works on top of Hadoop. It is an analytical tool that analyzes
large datasets that exist in the Hadoop File System.
To analyze data using Apache Pig, we have to initially load the data into
Apache Pig.
Student ID First Name Last Name Phone City

001 Rajiv Reddy 9848022337 Hyderabad

002 siddarth Battacharya 9848022338 Kolkata

003 Rajesh Khanna 9848022339 Delhi

004 Preethi Agarwal 9848022330 Pune

005 Trupthi Mohanthy 9848022336 Bhuwaneshwar

006 Archana Mishra 9848022335 Chennai

The Load Operator


You can load data into Apache Pig from the file system (HDFS/ Local)
using LOAD operator of Pig Latin.
Syntax
The load statement consists of two parts divided by the “=” operator. On the left-hand
side, we mention the name of the relation where we want to store the data,
and on the right-hand side, we define how the data is to be loaded.
Given below is the syntax of the Load operator.
Relation_name = LOAD 'Input file path' [USING function] [as schema];
Where,
 relation_name − We have to mention the relation in which we want to store
the data.
 Input file path − We have to mention the HDFS directory where the file is
stored. (In MapReduce mode)
 function − We have to choose a function from the set of load functions
provided by Apache Pig (BinStorage, JsonLoader, PigStorage,
TextLoader).
 Schema − We have to define the schema of the data. We can define the
required schema as follows −(column1 : data type, column2 : data type,
column3 : data type);
Note − If we load the data without specifying the schema, the columns
will be addressed as $0, $1, etc.
Example
As an example, let us load the data in student_data.txt in Pig under the schema
named Student using the LOAD command.
Start the Pig Grunt Shell
First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as
shown below.
$ pig -x mapreduce
It will start the Pig Grunt shell as shown below.
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the
ExecType

2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Apache Pig version
0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Logging error
messages to: /home/Hadoop/pig_1443683018078.log
2015-10-01 12:33:38,242 [main] INFO org.apache.pig.impl.util.Utils - Default bootup
file /home/Hadoop/.pigbootup not found
2015-10-01 12:33:39,630 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to hadoop file system at: hdfs://localhost:9000

grunt>
Execute the Load Statement
Now load the data from the file student_data.txt into Pig by executing the following
Pig Latin statement in the Grunt shell.
grunt> student = LOAD 'student_data.txt';
(no schema and no delimiter specified; the default delimiter is the tab character)
or
grunt> student = LOAD 'student_data.txt' as ( id,firstname,lastname,phone,city);
(schema specified, delimiter not specified)
or
grunt> student = LOAD 'student_data.txt' USING PigStorage(',') as ( id:int,
firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
(both schema and delimiter specified)
With the first form, the input file must be tab-separated for each column, and there
is no need to specify the schema (data types) of the relation.

Following is the description of the above statement.

Relation name      We have stored the data in the relation student.

Input file path    We are reading data from the file student_data.txt, which is in
                   the /pig_data/ directory of HDFS.

Storage function   We have used the PigStorage() function. It loads and stores data
                   as structured text files. It takes as a parameter the delimiter by
                   which each entity of a tuple is separated. By default, the
                   delimiter is ‘\t’.

Schema             We have stored the data using the following schema.

                   column     id    firstname   lastname    phone       city
                   datatype   int   chararray   chararray   chararray   chararray

Note − The load statement will simply load the data into the specified relation in Pig.
To verify the execution of the Load statement, you have to use the Diagnostic
Operators.
You can store the loaded data in the file system using the store operator.
Syntax
Given below is the syntax of the Store statement.
STORE Relation_name INTO 'required_directory_path' [USING function];

Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation student using the LOAD operator as shown
below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
Now, let us store the relation in the HDFS directory “/pig_Output/” as shown below.
grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');

Output
After executing the store statement, you will get the following output. A directory is
created with the specified name and the data will be stored in it.
The load statement will simply load the data into the specified relation in Apache
Pig. To verify the execution of the Load statement, you have to use the Diagnostic
Operators.
Pig Latin provides four different types of diagnostic operators −
 Dump operator
 Describe operator
 Explain operator
 Illustrate operator
Dump Operator
The Dump operator is used to run the Pig Latin statements and display the results
on the screen.
It is generally used for debugging Purpose.
Syntax
Given below is the syntax of the Dump operator.
grunt> Dump Relation_Name

Describe Operator
The describe operator is used to view the schema of a relation.
Syntax
The syntax of the describe operator is as follows −
grunt> Describe Relation_name

Explain Operator
The explain operator is used to display the logical, physical, and MapReduce
execution plans of a relation.
Syntax
Given below is the syntax of the explain operator.
grunt> explain Relation_name;

Illustrate Operator
The illustrate operator gives you the step-by-step execution of a sequence of
statements.
Syntax
Given below is the syntax of the illustrate operator.
grunt> illustrate Relation_name;
Group Operator
The GROUP operator is used to group the data in one or more relations. It collects
the data having the same key.
Syntax
Given below is the syntax of the group operator.
grunt> Group_data = GROUP Relation_name BY age;

Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Apache Pig with the relation
name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray,
city:chararray);

Now, let us group the records/tuples in the relation by age as shown below.
grunt> group_data = GROUP student_details by age;

We can verify the relation group_data using the DUMP operator as shown below.
grunt> Dump group_data;

Output
Then you will get output displaying the contents of the relation
named group_data as shown below.
Here you can observe that the resulting schema has two columns −
 One is age, by which we have grouped the relation.
 The other is a bag, which contains the group of tuples, student records with
the respective age.

(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})
You can see the schema of the table after grouping the data using
the describe command as shown below.
grunt> Describe group_data;

group_data: {group: int,student_details: {(id: int,firstname: chararray,
lastname: chararray,age: int,phone: chararray,city: chararray)}}

Try: grunt> illustrate group_data;
Try: grunt> explain group_data;

Grouping by Multiple Columns


Let us group the relation by age and city as shown below.
grunt> group_multiple = GROUP student_details by (age, city);
You can verify the content of the relation named group_multiple using the Dump
operator .
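A FOREACH over this grouped relation can then count the tuples per (age, city) pair; a sketch built on the relations defined above:

grunt> city_counts = FOREACH group_multiple GENERATE
           group.age AS age, group.city AS city, COUNT(student_details) AS n;
grunt> Dump city_counts;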
Joins in Pig
The JOIN operator is used to combine records from two or more relations. While
performing a join operation, we declare one field (or a group of fields) from each
relation as keys. When these key values match, the corresponding tuples are
matched; otherwise the records are dropped.
Joins can be of the following types −
 Self-join
 Inner-join
 Outer-join − left join, right join, and full join
Assume that we have two files namely customers.txt and orders.txt in
the /pig_data/ directory of HDFS as shown below.
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

And we have loaded these two files into Pig with the
relations customers and orders as shown below.
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING
PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING


PigStorage(',')
as (oid:int, date:chararray, customer_id:int, amount:int);
Let us now perform various Join operations on these two relations.
Self - join
Self-join is used to join a table with itself, as if the table were two relations, by
temporarily renaming at least one relation.
Generally, in Apache Pig, to perform self-join, we will load the same data multiple
times, under different aliases (names). Therefore let us load the contents of the
file customers.txt as two tables as shown below.

grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING


PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING


PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
Syntax
Given below is the syntax of performing self-join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
Example
Let us perform self-join operation on the relation customers, by joining the two
relations customers1 and customers2 as shown below.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

Verify the relation customers3 using the DUMP operator as shown below.
grunt> Dump customers3;

Output
It will produce the following output, displaying the contents of the
relation customers3.
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

Inner Join
Inner Join is used quite frequently; it is also referred to as equijoin. An inner join
returns rows when there is a match in both tables.
It creates a new relation by combining column values of two relations (say A and B)
based upon the join-predicate. The query compares each row of A with each row of
B to find all pairs of rows which satisfy the join-predicate. When the join-predicate is
satisfied, the column values for each matched pair of rows of A and B are combined
into a result row.
Syntax
Here is the syntax of performing inner join operation using the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
Example
Let us perform inner join operation on the two relations customers and orders as
shown below.
grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;

Verify the relation coustomer_orders using the DUMP operator as shown below.
grunt> Dump coustomer_orders;

Output
You will get the following output, displaying the contents of the relation
named coustomer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Outer Join: Unlike inner join, outer join returns all the rows from at least one of the
relations.
An outer join operation is carried out in three ways −
 Left outer join
 Right outer join
 Full outer join
Left Outer Join
The left outer Join operation returns all rows from the left table, even if there are no
matches in the right relation.
Syntax
Given below is the syntax of performing left outer join operation using
the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER,
Relation2_name BY customer_id;
Example
Let us perform left outer join operation on the two relations customers and orders as
shown below.
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
Verification
Verify the relation outer_left using the DUMP operator as shown below.
grunt> Dump outer_left;
Output
It will produce the following output, displaying the contents of the relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Right Outer Join


The right outer join operation returns all rows from the right relation, even if there are
no matches in the left relation.
Syntax
Given below is the syntax of performing right outer join operation using
the JOIN operator.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
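
As with the left outer join, the result can be inspected with the DUMP operator. With
the sample customers and orders relations used above, every order row has a matching
customer, so a right outer join here would return the same four rows as the inner join.
grunt> Dump outer_right;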

Full Outer Join


The full outer join operation returns all rows from both relations, filling in null values
wherever a row has no match in the other relation.
Syntax
Given below is the syntax of performing full outer join using the JOIN operator.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
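
The full outer join can be verified the same way. With the sample data above, every
order is matched by a customer, so the full outer join output would be expected to match
the left outer join output: the four matched rows plus the customers with no orders,
padded with nulls.
grunt> Dump outer_full;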

Using Multiple Keys


We can perform JOIN operation using multiple keys.
Syntax
Here is how you can perform a JOIN operation on two tables using multiple keys.
grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name
BY (key1, key2);
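For example, two hypothetical relations employee and employee_contact that share the
composite key (id, jobid) could be joined as follows (the relation and field names here
are illustrative, not taken from the earlier examples):
grunt> emp_data = JOIN employee BY (id, jobid), employee_contact BY (id, jobid);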
Cross Operator
The CROSS operator computes the cross-product of two or more relations. This
section explains how to use the CROSS operator in Pig Latin.
Syntax
Given below is the syntax of the CROSS operator.
grunt> Relation3_name = CROSS Relation1_name, Relation2_name;
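
For example, crossing the customers and orders relations used earlier pairs every
customer with every order. Note that the result grows multiplicatively: with 7 customers
and 4 orders, the cross-product contains 28 tuples, so CROSS should be used with care
on large relations.
grunt> cross_data = CROSS customers, orders;
grunt> Dump cross_data;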

Foreach Operator
The FOREACH operator is used to generate specified data transformations based
on the column data.
Syntax
Given below is the syntax of FOREACH operator.
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);

The first statement below projects the id, age and city columns of student_details; the
second generates, for each tuple, the boolean value of the expression age>25.
grunt> foreach_data = FOREACH student_details GENERATE id,age,city;

grunt> foreach_data = FOREACH student_details GENERATE age>25;
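
FOREACH can also compute derived values. As a sketch, Pig's bincond operator (?:)
can label each tuple based on the age field assumed in the student_details schema used
later in this section:
grunt> labelled = FOREACH student_details GENERATE id,
       (age > 25 ? 'senior' : 'junior') AS category;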

Order By Operator
The ORDER BY operator is used to display the contents of a relation in a sorted
order based on one or more fields.
Syntax
Given below is the syntax of the ORDER BY operator.
grunt> Relation_name2 = ORDER Relation_name1 BY field_name (ASC|DESC);

grunt> order_by_data = ORDER student_details BY age DESC;


Limit Operator
The LIMIT operator is used to get a limited number of tuples from a relation.
Syntax
Given below is the syntax of the LIMIT operator.
grunt> Result = LIMIT Relation_name number_of_tuples;
Example
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray,
city:chararray);

grunt> limit_data = LIMIT student_details 4;

Verify the relation limit_data using the DUMP operator as shown below.
grunt> Dump limit_data;
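
ORDER BY and LIMIT are often combined to answer top-N queries. For instance, the
three oldest students in student_details could be obtained as follows:
grunt> sorted = ORDER student_details BY age DESC;
grunt> top3 = LIMIT sorted 3;
grunt> Dump top3;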

Apache Pig - Load & Store Functions

The Load and Store functions in Apache Pig are used to determine how data goes
into and comes out of Pig. These functions are used with the LOAD and STORE
operators. Given below is the list of load and store functions available in Pig.
S.N. Function & Description

1. PigStorage() − To load and store structured files.

2. TextLoader() − To load unstructured data into Pig.

3. BinStorage() − To load and store data into Pig using a machine-readable format.

4. Handling Compression − In Pig Latin, we can load and store compressed data.
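
As a sketch of how these functions are used (the HDFS paths here are hypothetical),
PigStorage takes the field delimiter as its argument and works with both LOAD and
STORE:
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student.txt'
       USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
grunt> STORE student INTO 'hdfs://localhost:9000/pig_output/student_out'
       USING PigStorage('|');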

Apache Pig - User Defined Functions


In addition to the built-in functions, Apache Pig provides extensive support
for User Defined Functions (UDF’s). Using these UDF’s, we can define our own
functions and use them. The UDF support is provided in six programming languages,
namely, Java, Jython, Python, JavaScript, Ruby and Groovy.
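
Once a UDF has been compiled and packaged, it is registered and invoked from Pig
Latin using the REGISTER and DEFINE statements. The jar name and class below are
placeholders for illustration:
grunt> REGISTER 'myudfs.jar';
grunt> DEFINE MyUpper com.example.MyUpper();
grunt> upper_names = FOREACH student_details GENERATE MyUpper(firstname);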

Comments in Pig Script


While writing a script in a file, we can include comments in it as shown below.
Multi-line comments
We will begin the multi-line comments with '/*', end them with '*/'.
/* These are the multi-line comments
In the pig script */
Single-line comments
We will begin the single-line comments with '--'.
--we can write single line comments like this.

Executing Pig Script in Batch mode


While executing Apache Pig statements in batch mode, follow the steps given below.
Step 1
Write all the required Pig Latin statements in a single file. We can write all the Pig
Latin statements and commands in a single file and save it as .pig file.
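
For example, a file named sample_script.pig might contain the following statements
(the path and schema are illustrative):
/* sample_script.pig: load the student data, sort it by id and display it */
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',')
          AS (id:int, firstname:chararray, city:chararray);
student_order = ORDER student BY id ASC;
Dump student_order;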
Step 2
Execute the Apache Pig script. You can execute the Pig script from the shell (Linux)
as shown below.
Local mode:
$ pig -x local Sample_script.pig

MapReduce mode:
$ pig -x mapreduce Sample_script.pig

You can execute it from the Grunt shell as well using the exec or run command as
shown below. The exec command runs the script in a separate context, whereas run
executes it as if its statements had been typed at the current Grunt prompt, making
its aliases available afterwards.
grunt> exec /sample_script.pig

Executing a Pig Script from HDFS


We can also execute a Pig script that resides in the HDFS. Suppose there is a Pig
script with the name Sample_script.pig in the HDFS directory named /pig_data/.
We can execute it as shown below.
$ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig

When NOT to use PIG


Pig should not be used
1. When your data is completely unstructured, such as videos, text and audio.
2. When there is a tight time constraint, because Pig is slower than hand-optimized
MapReduce jobs.
