Apache Pig
Pig Execution Modes
MapReduce Mode
o The MapReduce mode is also known as Hadoop Mode.
o It is the default mode.
o All the queries written using Pig Latin are converted into MapReduce jobs and
these jobs are run on a Hadoop cluster.
o It can be executed against a semi-distributed or fully distributed Hadoop installation.
o Here, the input and output data are present on HDFS.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of an RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non-
unique) is known as a bag. Each tuple can have any number of fields (flexible
schema). A bag is represented by ‘{ }’. It is similar to a table in an RDBMS, but unlike a table in an RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as an inner bag.
Example − (Raja, 30, {(9848022338, [email protected])})
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type
chararray and should be unique. The value might be of any type. It is represented by
‘[ ]’.
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no
guarantee that tuples are processed in any particular order).
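For example, the result of a LOAD statement is a relation (the file name here is illustrative):
grunt> student = LOAD 'student_data.txt' USING PigStorage(',');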
Apache Pig vs SQL − Apache Pig provides limited opportunity for query optimization; there is more opportunity for query optimization in SQL.
Apache Pig vs Hive − Apache Pig uses a language called Pig Latin, which was originally created at Yahoo; Hive uses a language called HiveQL, which was originally created at Facebook.
Command −
Local mode: $ ./pig -x local
MapReduce mode: $ ./pig -x mapreduce (or simply pig)
Output −
Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’ or the quit command.
After invoking the Grunt shell, you can execute a Pig script by directly entering the
Pig Latin statements in it.
grunt> customers = LOAD 'customers.txt' USING PigStorage(',');
Executing Apache Pig in Batch Mode
You can write an entire Pig Latin script in a file and execute it by passing the file to the pig command with the -x option. Let us suppose we have a Pig script in a file named sample_script.pig as shown below.
sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',') as (id:int,name:chararray,city:chararray);
Dump student;
Now, you can execute the script in the above file as shown below.
Local mode: $ pig -x local sample_script.pig
MapReduce mode: $ pig -x mapreduce sample_script.pig
8. Datetime − Represents a date-time. Example − 1970-01-01T00:00:00.000+00:00
Complex Types
Tuple, Bag, and Map, described above, are the complex data types of Pig.
?: Bincond − Evaluates a Boolean expression and has three operands, as shown below.
variable x = (expression) ? value1 if true : value2 if false.
Example − b = (a == 1) ? 20 : 30;
If a == 1, the value of b is 20; if a != 1, the value of b is 30.
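In a Pig Latin statement, bincond is typically used inside FOREACH. A minimal sketch, assuming a relation A with an int field f1:
grunt> X = FOREACH A GENERATE f1, (f1 == 1 ? 20 : 30);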
CASE WHEN THEN ELSE END
Case − The case operator is equivalent to the nested bincond operator.
Example −
CASE f2 % 2
    WHEN 0 THEN 'even'
    WHEN 1 THEN 'odd'
END
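Like bincond, CASE is used inside FOREACH. A minimal sketch, assuming a relation A with an int field f2:
grunt> X = FOREACH A GENERATE f2, (CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END);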
<= Less than or equal to − Checks if the value of the left operand is less than or equal to the value of the right operand. If yes, then the condition becomes true. Example − (a <= b) is true.
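Comparison operators such as this one typically appear in FILTER conditions. A sketch, assuming a relation A with fields a and b:
grunt> X = FILTER A BY a <= b;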
Filtering
Sorting
ORDER − To arrange a relation in a sorted order based on one or more fields (ascending or descending).
Loading Data
In general, Apache Pig works on top of Hadoop. It is an analytical tool that analyzes
large datasets that exist in the Hadoop File System.
To analyze data using Apache Pig, we have to initially load the data into
Apache Pig.
The file student_data.txt has the columns Student ID, First Name, Last Name, Phone, and City. Start the Grunt shell −
grunt>
Execute the Load Statement
Now load the data from the file student_data.txt into Pig by executing the following
Pig Latin statement in the Grunt shell.
grunt> student = LOAD 'student_data.txt';
(no schema specified; no delimiter specified, so the default delimiter, tab, is used)
or
grunt> student = LOAD 'student_data.txt' as ( id,firstname,lastname,phone,city);
(schema specified, delimiter not specified)
or
grunt> student = LOAD 'student_data.txt' USING PigStorage(',') as ( id:int,
firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
(both schema and delimiter are specified)
With the first two forms, the input file must have its columns separated by tabs, and there is no need to specify the complete schema (data types) of the relation.
Input file − We are reading data from the file student_data.txt, which is in the /pig_data/ directory of HDFS.
Note − The load statement will simply load the data into the specified relation in Pig.
To verify the execution of the Load statement, you have to use the Diagnostic
Operators.
You can store the loaded data in the file system using the store operator.
Syntax
Given below is the syntax of the Store statement.
STORE Relation_name INTO 'required_directory_path' [USING function];
Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation student using the LOAD operator as shown
below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
Now, let us store the relation in the HDFS directory “/pig_Output/” as shown below.
grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
Output
After executing the store statement, you will get the following output. A directory is
created with the specified name and the data will be stored in it.
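You can list and inspect the stored files without leaving the Grunt shell (a sketch; the part-file name varies from job to job):
grunt> fs -ls hdfs://localhost:9000/pig_Output/
grunt> fs -cat hdfs://localhost:9000/pig_Output/part-m-00000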
The load statement will simply load the data into the specified relation in Apache
Pig. To verify the execution of the Load statement, you have to use the Diagnostic
Operators.
Pig Latin provides four different types of diagnostic operators −
Dump operator
Describe operator
Explain operator
Illustrate operator
Dump Operator
The Dump operator is used to run the Pig Latin statements and display the results
on the screen.
It is generally used for debugging purposes.
Syntax
Given below is the syntax of the Dump operator.
grunt> Dump Relation_Name
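For example, with the student relation loaded earlier:
grunt> Dump student;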
Describe Operator
The describe operator is used to view the schema of a relation.
Syntax
The syntax of the describe operator is as follows −
grunt> Describe Relation_name
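For example, for the student relation loaded earlier with an explicit schema, Describe prints something like the following (a sketch; exact spacing may differ):
grunt> Describe student;
student: {id: int,name: chararray,city: chararray}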
Explain Operator
The explain operator is used to display the logical, physical, and MapReduce
execution plans of a relation.
Syntax
Given below is the syntax of the explain operator.
grunt> explain Relation_name;
Illustrate Operator
The illustrate operator gives you the step-by-step execution of a sequence of
statements.
Syntax
Given below is the syntax of the illustrate operator.
grunt> illustrate Relation_name;
Group Operator
The GROUP operator is used to group the data in one or more relations. It collects
the data having the same key.
Syntax
Given below is the syntax of the group operator.
grunt> Group_data = GROUP Relation_name BY column_name;
Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Apache Pig with the relation
name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray,
city:chararray);
Now, let us group the records/tuples in the relation by age as shown below.
grunt> group_data = GROUP student_details by age;
We can verify the relation group_data using the DUMP operator as shown below.
grunt> Dump group_data;
Output
Then you will get output displaying the contents of the relation
named group_data as shown below.
Here you can observe that the resulting schema has two columns −
One is age, by which we have grouped the relation.
The other is a bag, which contains the group of tuples (student records) with the respective age.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})
You can see the schema of the table after grouping the data using
the describe command as shown below.
grunt> Describe group_data;
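For the grouped relation above, the output should look roughly like this (a sketch; exact formatting may vary):
group_data: {group: int,student_details: {(id: int,firstname: chararray,lastname: chararray,age: int,phone: chararray,city: chararray)}}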
Try: grunt> explain group_data;
And we have loaded these two files into Pig with the
relations customers and orders as shown below.
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING
PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
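The orders relation is assumed to be loaded the same way (a sketch; the file name and field names are inferred from the join outputs shown below):
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
as (oid:int, date:chararray, customer_id:int, amount:int);
The relation customers3 verified next is assumed to be a self-join of customers with a second copy of itself, for example:
grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> customers3 = JOIN customers BY id, customers2 BY id;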
Verify the relation customers3 using the DUMP operator as shown below.
grunt> Dump customers3;
Output
It will produce the following output, displaying the contents of the relation customers3.
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)
Inner Join
Inner Join is used quite frequently; it is also referred to as equijoin. An inner join
returns rows when there is a match in both tables.
It creates a new relation by combining column values of two relations (say A and B)
based upon the join-predicate. The query compares each row of A with each row of
B to find all pairs of rows which satisfy the join-predicate. When the join-predicate is
satisfied, the column values for each matched pair of rows of A and B are combined
into a result row.
Syntax
Here is the syntax of performing inner join operation using the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
Example
Let us perform inner join operation on the two relations customers and orders as
shown below.
grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
Verify the relation customer_orders using the DUMP operator as shown below.
grunt> Dump customer_orders;
Output
You will get the following output, showing the contents of the relation named customer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Outer Join − Unlike inner join, outer join returns all the rows from at least one of the relations.
An outer join operation is carried out in three ways −
Left outer join
Right outer join
Full outer join
Left Outer Join
The left outer Join operation returns all rows from the left table, even if there are no
matches in the right relation.
Syntax
Given below is the syntax of performing left outer join operation using
the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER,
Relation2_name BY customer_id;
Example
Let us perform left outer join operation on the two relations customers and orders as
shown below.
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
Verification
Verify the relation outer_left using the DUMP operator as shown below.
grunt> Dump outer_left;
Output
It will produce the following output, displaying the contents of the relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
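The right outer and full outer joins listed above follow the same pattern (a sketch, using the same customers and orders relations):
grunt> outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;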
Foreach Operator
The FOREACH operator is used to generate specified data transformations based
on the column data.
Syntax
Given below is the syntax of FOREACH operator.
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
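For example, to project just the id, firstname, and city columns from the student_details relation loaded earlier:
grunt> foreach_data = FOREACH student_details GENERATE id, firstname, city;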
Order By Operator
The ORDER BY operator is used to display the contents of a relation in a sorted
order based on one or more fields.
Syntax
Given below is the syntax of the ORDER BY operator.
grunt> Relation_name2 = ORDER Relation_name1 BY column_name (ASC|DESC);
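For example, to sort student_details by age in descending order and keep only the first four tuples (the limit_data relation verified below is assumed to be produced by such a LIMIT statement):
grunt> order_by_data = ORDER student_details BY age DESC;
grunt> limit_data = LIMIT order_by_data 4;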
Verify the relation limit_data using the DUMP operator as shown below.
grunt> Dump limit_data;
The Load and Store functions in Apache Pig are used to determine how data goes into and comes out of Pig. These functions are used with the load and store operators. Given below is the list of load and store functions available in Pig.
S.N. Function & Description
1. PigStorage() − To load and store structured files.
2. TextLoader() − To load unstructured data into Pig.
3. BinStorage() − To load and store data into Pig using a machine-readable format.
4. Handling Compression − In Pig Latin, we can load and store compressed data.
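As a quick illustration, TextLoader() reads each line of an unstructured file as a single chararray field (a sketch; the file name is illustrative):
grunt> lines = LOAD 'hdfs://localhost:9000/pig_data/log.txt' USING TextLoader() as (line:chararray);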
You can execute it from the Grunt shell as well using the exec/run command as
shown below.
grunt> exec /sample_script.pig
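The run command is similar, but it executes the script in the current Grunt shell context, so aliases defined in the script remain accessible afterwards:
grunt> run /sample_script.pig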