UNIT- VIII
Introducing Pig
To write data analysis programs, Pig provides a high-level language known as Pig
Latin. This language provides various operators, using which programmers can also
develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using Pig Latin
language. All these scripts are internally converted to Map and Reduce tasks. Apache
Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input
and converts those scripts into MapReduce jobs.
The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a
high-level data processing language that provides a rich set of data types and
operators to perform various operations on the data.
To perform a particular task, programmers using Pig need to write a Pig script in the
Pig Latin language and execute it using one of the execution mechanisms (Grunt
shell, UDFs, embedded). After execution, these scripts go through a series of
transformations applied by the Pig framework to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, thus
making the programmer's job easier.
Pig on Hadoop:
Listed below are the major differences between Apache Pig and MapReduce.
• Apache Pig is a high-level data flow language, whereas MapReduce is a low-level
data processing paradigm.
• Performing a join operation is simple in Apache Pig, whereas it is quite difficult
to accomplish between data sets in MapReduce.
• Any novice programmer with basic SQL knowledge can work conveniently with
Apache Pig, whereas exposure to Java is a must to work with MapReduce.
• Apache Pig's multi-query approach reduces the length of code considerably, and
Pig scripts need no compilation step, whereas every MapReduce job must be
written, compiled, and packaged before it can run.
Listed below are the major differences between Apache Pig and SQL.
• Pig Latin is a procedural language, whereas SQL is a declarative language.
• In Apache Pig, schema is optional; we can store data without designing a schema
(fields are then referenced positionally as $0, $1, and so on). In SQL, schema is
mandatory.
• The data model in Apache Pig is nested relational, whereas the data model used
in SQL is flat relational.
• Apache Pig provides limited opportunity for query optimization, whereas there is
more opportunity for query optimization in SQL.
Pig Features and Philosophy:
Programmers who are not comfortable with Java normally struggle when working with
Hadoop, especially while performing MapReduce tasks. Apache Pig is a boon for all
such programmers.
• Using Pig Latin, programmers can perform MapReduce tasks easily without
having to type complex code in Java.
• Apache Pig uses a multi-query approach, thereby reducing the length of code.
For example, an operation that would require you to type 200 lines of code
(LoC) in Java can be done by typing as few as 10 LoC in Apache Pig.
Ultimately, Apache Pig reduces development time by almost 16 times.
• Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when you
are familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like
joins, filters, ordering, etc. In addition, it also provides nested data types like
tuples, bags, and maps that are missing from MapReduce.
For example, the classic word-count program takes only a few lines of Pig Latin:
A = load './input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
The above program steps generate parallel executable tasks that can be distributed
across multiple machines in a Hadoop cluster to count the number of words in a
text file. The script can be run in either of Pig's two execution modes:
local mode − Pig runs in a single JVM and works with the local file system. This
mode is useful for prototyping on small data sets (it is started with pig -x local).
mapreduce mode − the default mode. Pig compiles the script into MapReduce jobs
that run on a Hadoop cluster and read their input from HDFS.
Loading data is a key part of many businesses. Data comes in from outside of the
database in text, XML, CSV, or some other arbitrary file format. The data then has to
be processed into a different format and loaded into a database for later querying.
Sometimes there are a lot of steps involved, and sometimes the data has to be
translated into an intermediate format, but most of the time it gets into the
database; some failure is to be expected, right?
Loading large volumes of data can become a problem as the volume of data increases:
the more data there is, the longer it takes to load. To get around this problem, people
routinely buy bigger, faster servers with more and faster disks. There comes a point
when you can't add more CPUs or RAM to a server, and increasing the I/O capacity
won't help. Parallelizing ETL processes can be hard even on one machine, let alone
scaling ETL out across several machines.
Pig is built on top of Hadoop, so it is able to scale across multiple processors and
servers, which makes it easy to process massive data sets. Many ETL processes lend
themselves to being decomposed into manageable chunks; Pig is no exception. Pig
builds MapReduce jobs behind the scenes to spread load across many servers. By
taking advantage of the simple building blocks of Hadoop, data professionals are able
to build simple, easily understood scripts to process and analyze massive quantities of
data in a massively parallel environment.
An advantage of being able to scale out across many servers is that doubling
throughput is often as easy as doubling the number of servers working on a problem.
If one server can solve a problem in 12 hours, 24 servers should be able to solve it in
30 minutes.
Pig Latin Data Model:
Pig Latin supports simple (atomic) data types such as int, long, float, double,
chararray, and bytearray, as well as complex types such as tuple, bag, and map.
Values for all of these data types can be NULL; Apache Pig treats null values in a
similar way as SQL does.
The data model of Pig Latin is fully nested, and it allows complex non-atomic data
types such as map and tuple.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is
stored as a string and can be used as a string or a number. int, long, float, double,
chararray, and bytearray are the atomic values of Pig. A piece of data or a simple
atomic value is known as a field.
Parameter Files:
A parameter file contains one line per parameter. Empty lines are allowed. Perl-style
(#) comment lines are also allowed; a comment must take a full line, and # must be
the first character on the line. Each parameter line is of the form:
param_name = param_value. White space around = is allowed but optional.
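For example, a parameter file for the IIS log ingestion script shown later in this unit
might look like this (the values are illustrative):
# parameters for the daily IIS log ingestion job
YEAR = 2015
MONTH = 09
DAY = 08
INPUTFILE = u_ex150908.log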
Keywords:
Keywords in Pig Latin (for example, LOAD, STORE, GROUP, FOREACH, DUMP) are
reserved words and are not case-sensitive. The names of relations and fields, on the
other hand, are case-sensitive.
Identifiers :
Identifiers include the names of relations (aliases), fields, variables, and so on. In Pig,
identifiers start with a letter and can be followed by any number of letters, digits, or
underscores.
Valid identifiers:
A
A123
abc_123_BeX_
Invalid identifiers:
_A123
abc_$
A!B
Syntax
The load statement consists of two parts divided by the "=" operator. On the left-hand
side, we mention the name of the relation in which we want to store the data, and on
the right-hand side, we define where the data lives and how it is to be read. Given
below is the syntax of the Load operator.
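grunt> Relation_name = LOAD 'Input file path' USING function AS schema;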
Where,
• Input file path − We have to mention the HDFS directory where the file is
stored. (In MapReduce mode)
• function − We have to choose a function from the set of load functions provided
by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
• Schema − We have to define the schema of the data. We can define the required
schema as follows −
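(column1 : data type, column2 : data type, column3 : data type);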
Note − If we load the data without specifying the schema, the columns will be
addressed positionally as $0, $1, and so on.
Example
As an example, let us load the data in student_data.txt into Pig under the schema
named student using the LOAD command.
Execute the Load Statement
Now load the data from the file student_data.txt into Pig by executing the following Pig
Latin statement in the Grunt shell.
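-- field names are illustrative; the data types match the description below
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
   USING PigStorage(',')
   AS ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );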
Relation name − We have stored the data in the relation (schema) named student.
Input file path − We are reading data from the file student_data.txt, which is in the
/pig_data/ directory of HDFS.
Data types − The id column is an int; firstname, lastname, phone, and city are
chararray.
Note − The load statement will simply load the data into the specified relation in Pig.
To verify the execution of the Load statement, you have to use the Diagnostic
Operators which are discussed in the next chapters.
DUMP:
Dump Operator
The Dump operator is used to run Pig Latin statements and display the results on
the screen. It is generally used for debugging purposes.
Syntax
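grunt> Dump Relation_Name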
Example
Assume that we have a file student_data.txt in the /pig_data/ directory of HDFS
with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation student using the LOAD operator as shown below.
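-- field names are illustrative
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
   USING PigStorage(',')
   AS ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );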
Now, let us print the contents of the relation using the Dump operator as shown
below.
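grunt> Dump student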
Once you execute the above Pig Latin statement, it will start a MapReduce job to read
data from HDFS. It will produce the following output.
Success!
Job Stats (time in seconds):
JobId job_14459_0004
Maps 1
Reduces 0
MaxMapTime n/a
MinMapTime n/a
AvgMapTime n/a
MedianMapTime n/a
MaxReduceTime 0
MinReduceTime 0
AvgReduceTime 0
MedianReducetime 0
Alias student
Feature MAP_ONLY
Outputs hdfs://localhost:9000/tmp/temp580182027/tmp757878456,
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled : 0
Total records proactively spilled : 0
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
Note − In some portions of this chapter, commands like Load and Store are used.
Refer to the respective chapters for detailed information on them.
Shell Commands
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. Apart from
that, we can also invoke shell and file system commands from it using sh and fs.
sh Command
Using the sh command, we can invoke shell commands from the Grunt shell. Note
that we cannot execute commands that are part of the shell environment (e.g., cd)
this way.
Syntax
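grunt> sh shell_command parameters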
Example
We can invoke the ls command of Linux shell from the Grunt shell using the sh option
as shown below. In this example, it lists out the files in the /pig/bin/ directory.
grunt> sh ls
pig
pig_1444799121955.log
pig.cmd
pig.py
fs Command
Using the fs command, we can invoke any FsShell commands from the Grunt shell.
Syntax
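grunt> fs File_System_command parameters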
Example
We can invoke the ls command of HDFS from the Grunt shell using the fs command.
In the following example, it lists the files in the HDFS root directory.
grunt> fs -ls
Found 3 items
drwxrwxrwx - Hadoop supergroup 0 2015-09-08 14:13 Hbase
drwxr-xr-x - Hadoop supergroup 0 2015-09-09 14:52 seqgen_data
drwxr-xr-x - Hadoop supergroup 0 2015-09-08 11:30 twitter_data
In the same way, we can invoke all the other file system shell commands from the
Grunt shell using the fs command.
FILTER:
The FILTER operator is used to select the required tuples from a relation based on a
condition.
Syntax
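grunt> Relation2_name = FILTER Relation1_name BY (condition);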
Example
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown
below.
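-- field names are illustrative
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
   USING PigStorage(',')
   AS ( id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray );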
Let us now use the Filter operator to get the details of the students who belong to the
city Chennai.
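grunt> filter_data = FILTER student_details BY city == 'Chennai';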
Verification
Verify the relation filter_data using the DUMP operator as shown below.
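grunt> Dump filter_data;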
STORE:
In the previous chapter, we learnt how to load data into Apache Pig. You can store the
loaded data in the file system using the store operator. This chapter explains how to
store data in Apache Pig using the Store operator.
Syntax
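grunt> STORE Relation_name INTO 'required_directory_path' [USING function];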
Example
Assume that we have a file student_data.txt in the /pig_data/ directory of HDFS
with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation student using the LOAD operator as shown below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
   USING PigStorage(',')
   AS ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
        city:chararray );
Now, let us store the relation in the HDFS directory “/pig_Output/” as shown below.
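grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');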
GROUP BY:
The GROUP operator is used to group the data in one or more relations. It collects the
data having the same key.
Syntax
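grunt> Group_data = GROUP Relation_name BY key;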
Example
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Apache Pig with the relation name student_details as
shown below.
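-- field names are illustrative
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
   USING PigStorage(',')
   AS ( id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray );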
Now, let us group the records/tuples in the relation by age as shown below.
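grunt> group_data = GROUP student_details BY age;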
Verification
Verify the relation group_data using the DUMP operator as shown below.
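grunt> Dump group_data;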
ORDER BY:
The ORDER BY operator is used to display the contents of a relation in a sorted order
based on one or more fields.
Syntax
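grunt> Relation_name2 = ORDER Relation_name1 BY field_name ASC|DESC;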
Example
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown
below.
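-- field names are illustrative
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
   USING PigStorage(',')
   AS ( id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray );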
Let us now sort the relation in a descending order based on the age of the student and
store it into another relation named order_by_data using the ORDER BY operator as
shown below.
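grunt> order_by_data = ORDER student_details BY age DESC;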
Verification
Verify the relation order_by_data using the DUMP operator as shown below.
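grunt> Dump order_by_data;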
JOIN:
The JOIN operator is used to combine records from two or more relations. While
performing a join operation, we declare one field (or a group of fields) from each
relation as keys. When these keys match, the two particular tuples are matched;
otherwise the records are dropped. Joins can be of the following types −
• Self-join
• Inner-join
• Outer-join − left join, right join, and full join
This chapter explains with examples how to use the join operator in Pig Latin. Assume
that we have two files namely customers.txt and orders.txt in the /pig_data/ directory
of HDFS as shown below.
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
And we have loaded these two files into Pig with the relations customers and orders as
shown below.
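-- field names are illustrative
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt'
   USING PigStorage(',')
   AS ( id:int, name:chararray, age:int, address:chararray, salary:int );
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt'
   USING PigStorage(',')
   AS ( oid:int, date:chararray, customer_id:int, amount:int );
As an illustration, an inner join of the two relations on the customer id could then be
written as:
grunt> customer_orders = JOIN customers BY id, orders BY customer_id;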
LIMIT:
The LIMIT operator is used to get a limited number of tuples from a relation.
Syntax
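grunt> Result = LIMIT Relation_name required_number_of_tuples;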
Example
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown
below.
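-- field names are illustrative
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
   USING PigStorage(',')
   AS ( id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray );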
Now, let us get the first four tuples of the relation and store them into another
relation named limit_data using the LIMIT operator as shown below.
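grunt> limit_data = LIMIT student_details 4;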
Verification
Verify the relation limit_data using the DUMP operator as shown below.
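grunt> Dump limit_data;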
Output
It will produce the following output, displaying the contents of the relation limit_data
as follows.
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)
A Pig script saved in a file can also be executed from the Grunt shell using the exec
command, as shown below.
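grunt> exec /sample_script.pig
(Here /sample_script.pig is an illustrative path to a script stored in HDFS.)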
Multi-line comments
Multi-line comments begin with '/*' and end with '*/'.
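For example:
/* This comment can span
   several lines and is ignored by Pig */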
While executing Apache Pig statements in batch mode, follow the steps given below.
Step 1
Write all the required Pig Latin statements and commands in a single file and save
it as a .pig file.
Step 2
Execute the Apache Pig script. You can execute the Pig script from the shell (Linux) as
shown below.
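$ pig -x local Sample_script.pig
$ pig -x mapreduce Sample_script.pig
(Sample_script.pig is an illustrative file name; the first command runs the script in
local mode, the second in MapReduce mode.)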
Tuple:
A record formed by an ordered set of fields is known as a tuple; the fields can be of
any type. A tuple is similar to a row in an RDBMS table, for example (Rajiv, 30).
AVG():
The built-in AVG() function computes the average of the numeric values in a
single-column bag, and it is typically used together with the GROUP and FOREACH
operators.
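For instance, reusing the group_data relation built in the GROUP BY section above,
the average age per group could be computed as follows:
grunt> avg_age = FOREACH group_data GENERATE group, AVG(student_details.age);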
Parameters in Pig:
When running Pig in a production environment, you’ll likely have one or more Pig
Latin scripts that run on a recurring basis (daily, weekly, monthly, etc.) that need to
locate their input data based on when or where they are run. For example, you may
have a Pig job that performs daily log ingestion by geographic region. It would be costly
and error prone to manually edit the script to reference the location of the input data
each time log data needs to be ingested. Ideally, you’d like to pass the date and
geographic region to the Pig script as parameters at the time the script is executed.
Fortunately, Pig provides this capability via parameter substitution. There are four
different mechanisms to define parameters that can be referenced in a Pig Latin script:
• the %default preprocessor statement inside the script
• the %declare preprocessor statement inside the script
• the -param command-line option (one name/value pair per option)
• the -param_file command-line option (which points to a file of name/value pairs)
You can use none, one, or any combination of the above options.
Let us look at an example Pig script that could be run to perform IIS log ingestion.
The script loads and filters an IIS log, looking for requests that did not complete with
a status code of 200 or 201.
Note that parameter names in Pig Latin scripts are preceded by a dollar sign, $. For
example, the LOAD statement references six parameters: $WASB_SCHEME,
$ROOT_FOLDER, $YEAR, $MONTH, $DAY and $INPUTFILE.
Note also the script makes use of the %default preprocessor statement to define
default values for the WASB_SCHEME and ROOT_FOLDER parameters:
--
-- Default values for WASB_SCHEME and ROOT_FOLDER
%default WASB_SCHEME 'wasb';
%default ROOT_FOLDER 'logs';
--
-- Note how the LOAD statement below references the WASB_SCHEME and ROOT_FOLDER
-- parameters that were defined with the %default statements directly above.
-- The values for the remaining four parameters (YEAR, MONTH, DAY and INPUTFILE)
-- will be passed from the command line using a parameter file or individual
-- command-line arguments.
--
iislog = LOAD '$WASB_SCHEME:///$ROOT_FOLDER/$YEAR/$MONTH/$DAY/$INPUTFILE'
    USING PigStorage(' ')
AS (date: chararray
, time: chararray
, sourceIP: chararray
, csMethod: chararray
, csUriStem: chararray
, csUriQuery: chararray
, sourcePort: chararray
, csUsername: chararray
, cIP: chararray
, csUserAgent: chararray
, csReferer: chararray
, scStatus: chararray
, scSubStatus: chararray
, scWin32Status: chararray
, timeTaken: chararray);
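The filter described at the start of this section could then be expressed along the
following lines (a sketch; scStatus is the status-code field declared in the schema
above):
failed_requests = FILTER iislog BY (scStatus != '200') AND (scStatus != '201');
Finally, the script can be invoked with its parameters supplied from a parameter file,
individual command-line arguments, or both; for example (script and file names are
illustrative):
$ pig -param_file ingest_params.txt -param INPUTFILE=u_ex150908.log iislog_ingest.pig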