
Unit-4

Hadoop Environment: Setting up a Hadoop Cluster, Cluster specification, Cluster Setup and Installation,
Hadoop Configuration, Security.

Pig: Installing and Running Pig, an Example, Comparison with Databases, Pig Latin, User Defined Functions,
Data Processing Operators.

Setting up a Hadoop Cluster

To build a Hadoop cluster yourself, there are a number of installation options:

Apache tarballs

• The Apache Hadoop project and related projects provide binary (and source) tarballs for each release.
• Installation from binary tarballs gives you the most flexibility but requires the most work, since you need to decide where the installation files, configuration files, and log files are located on the filesystem, set their file permissions correctly, and so on.

Packages

• RPM and Debian packages are available from the Apache Bigtop project, as well as from all the
Hadoop vendors.
• Packages bring a number of advantages over tarballs: they provide a consistent filesystem layout, they are tested together as a stack (so you know that the versions of Hadoop and Hive, say, will work together), and they work well with configuration management tools like Puppet (a tool for managing configuration across many servers at once).

Hadoop cluster management tools

• Cloudera Manager and Apache Ambari are examples of dedicated tools for installing and managing
a Hadoop cluster over its whole lifecycle. They provide a simple web UI, and are the recommended
way to set up a Hadoop cluster for most users and operators.

Cluster specification

A typical choice of machine for running an HDFS data node and a YARN node manager in 2014 would have
had the following specifications:

Processor: Two hex/octo-core 3 GHz CPUs

Memory: 64−512 GB ECC RAM

Storage: 12−24 × 1−4 TB SATA disks

Network: Gigabit Ethernet with link aggregation

Cluster Sizing

How large should your cluster be? There isn’t an exact answer to this question, but the beauty of Hadoop is
that you can start with a small cluster (say, 10 nodes) and grow it as your storage and computational needs
grow. In many ways, a better question is this: how fast does your cluster need to grow? You can get a good feel
for this by considering storage capacity.

For example, if your data grows by 1 TB a day and you have three-way HDFS replication, you need an
additional 3 TB of raw storage per day. Allow some room for intermediate files and log files (around 30%,
say), and this is in the range of one (2014-vintage) machine per week. In practice, you wouldn’t buy a new
machine each week and add it to the cluster. The value of doing a back-of-the-envelope calculation like this is
that it gives you a feel for how big your cluster should be. In this example, a cluster that holds two years’ worth
of data needs 100 machines.
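As a rough sketch of that arithmetic (assuming, for illustration, a machine with 16 × 2 TB of disk):

1 TB/day × 3 (HDFS replication) = 3 TB/day of raw storage
3 TB/day × 1.3 (≈30% headroom for intermediate and log files) ≈ 4 TB/day ≈ 27 TB/week
One machine with 16 × 2 TB disks ≈ 32 TB, i.e. roughly one machine per week
Two years ≈ 104 weeks, i.e. on the order of 100 machines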

Master node scenarios:

Depending on the size of the cluster, there are various configurations for running the master daemons: the
namenode, secondary namenode, resource manager, and history server.

For a small cluster (on the order of 10 nodes), it is usually acceptable to run the namenode and the resource
manager on a single master machine. However, as the cluster gets larger, there are good reasons to separate
them.

• The namenode has high memory requirements, as it holds file and block metadata for the entire
namespace in memory. The secondary namenode, although idle most of the time, has a comparable
memory footprint to the primary when it creates a checkpoint. For file systems with a large number
of files, there may not be enough physical memory on one machine to run both the primary and
secondary namenode.
• Aside from simple resource requirements, the main reason to run masters on separate machines is
for high availability. Both HDFS and YARN support configurations where they can run masters in
active-standby pairs. If the active master fails, then the standby, running on separate hardware, takes
over with little or no interruption to the service

Network Topology

A common Hadoop cluster architecture consists of a two-level network topology, as illustrated in Figure 10-1.
Typically there are 30 to 40 servers per rack (only 3 are shown in the diagram), with a 10 Gb switch for the
rack and an uplink to a core switch or router (at least 10 Gb or better).

Rack awareness:

• To get maximum performance out of Hadoop, it is important to configure Hadoop so that it knows the
topology of your network. If your cluster runs on a single rack, then there is nothing more to do, since
this is the default.
• However, for multirack clusters, you need to map nodes to racks (a configuration sketch follows this list). This allows Hadoop to prefer within-rack transfers (where there is more bandwidth available) to off-rack transfers when placing MapReduce tasks on nodes.
• HDFS will also be able to place replicas more intelligently to trade off performance and resilience.
• Network locations such as nodes and racks are represented in a tree, which reflects the network
“distance” between locations.
• The namenode uses the network location when determining where to place block replicas; the
MapReduce scheduler uses network location to determine where the closest replica is for input to a map
task.
• For the network in Figure 10-1, the rack topology is described by two network locations —say,
/switch1/rack1 and /switch1/rack2. Because there is only one top-level switch in this cluster, the
locations can be simplified to /rack1 and /rack2.
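A minimal sketch of how the rack mapping is usually wired in (the script path and the IP-to-rack assignments below are assumptions for illustration): point Hadoop at a topology script in core-site.xml, and have that script print one rack location per host name or IP address passed to it.

<property>
<name>net.topology.script.file.name</name>
<value>/home/hadoop/hadoop-3.2.2/etc/hadoop/topology.sh</value>
</property>

#!/bin/bash
# topology.sh - print one rack location per argument (hostname or IP address)
for node in "$@"; do
case "$node" in
10.1.1.*) echo "/rack1" ;;
10.1.2.*) echo "/rack2" ;;
*) echo "/default-rack" ;;
esac
done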

Cluster Setup and Installation

Hadoop Prerequisite Installation

Install JDK

ubuntu@ubuntu-VM$sudo apt update

$sudo apt install openjdk-8-jdk -y

$java -version
$ javac -version

Install OpenSSH server/client

$sudo apt install openssh-server -y

$sudo apt install openssh-client -y

Set up a dedicated (non-root) user for Hadoop

$sudo adduser hadoop

$sudo usermod -aG sudo hadoop

$su - hadoop

Passwordless SSH setup for the hadoop user

hadoop@ubuntu-VM$ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

$cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

$chmod 0600 ~/.ssh/authorized_keys

$ssh localhost

Hadoop Installation

Download hadoop from hadoop.apache.org

• Download the binary from https://hadoop.apache.org/releases.html


• Use a browser to download it, or use the following wget command:
$ wget https://dlcdn.apache.org/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz
$ls
hadoop-3.2.2.tar.gz

• Untar the file using the tar command:


$tar xfz hadoop-3.2.2.tar.gz
$ls
hadoop-3.2.2 hadoop-3.2.2.tar.gz

Hadoop installation/configuration

We need to modify the following files to configure Hadoop successfully:

• bashrc
• hadoop-env.sh
• core-site.xml
• hdfs-site.xml
• mapred-site.xml
• yarn-site.xml

bashrc

$cd /home/hadoop/hadoop-3.2.2
$sudo nano ~/.bashrc

Go to end of file and then add the following lines


#Hadoop Related Options
export HADOOP_HOME=/home/hadoop/hadoop-3.2.2
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Then press Ctrl+X, then Y, then Enter to save and exit.

$source ~/.bashrc
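At this point the Hadoop commands should be on the PATH; a quick sanity check (the exact version string depends on your download):

$hadoop version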

hadoop-env.sh

Get path to java home directory

$which javac

Note the output of the above command which will be like /usr/bin/javac

$readlink -f /usr/bin/javac

Note the output of the above command to be used as java home path in the next command.

$sudo nano etc/hadoop/hadoop-env.sh

Find the line containing “#export JAVA_HOME=” and add the following line below it (use the JDK directory reported by readlink, without the trailing /bin/javac):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

core-site.xml

• Make necessary directories


$mkdir tmpdata

$sudo nano etc/hadoop/core-site.xml

#Add the below lines to this file (between "<configuration>" and "</configuration>")


<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadoop-3.2.2/tmpdata</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://127.0.0.1:9000</value>
<description>The name of the default file system</description>
</property>
hdfs-site.xml

• Make necessary directories


$mkdir -p dfsdata/namenode
$mkdir -p dfsdata/datanode

$sudo nano etc/hadoop/hdfs-site.xml

#Add the below lines to this file (between "<configuration>" and "</configuration>")

<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/hadoop-3.2.2/dfsdata/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/hadoop-3.2.2/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

mapred-site.xml

$sudo nano etc/hadoop/mapred-site.xml

#Add the below lines to this file (between "<configuration>" and "</configuration>")


<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

yarn-site.xml

$sudo nano etc/hadoop/yarn-site.xml

#Add the below lines to this file (between "<configuration>" and "</configuration>")


<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>

Format HDFS name node

$hdfs namenode -format

After a successful format, the namenode process shuts down.

Start hadoop cluster

$sbin/start-dfs.sh

Start the YARN resource manager and node managers

$sbin/start-yarn.sh

Check if all daemons are active and running

$jps
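On a single-node setup like this one, the jps listing should include the following daemons in addition to Jps itself (process IDs are omitted here because they vary):

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager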

Access HADOOP UI from Browser

• Namenode interface
http://localhost:9870
• Individual datanodes
http://localhost:9864
• YARN resource manager
http://localhost:8088

Hadoop Security

Hadoop Security is generally defined as a procedure to secure the Hadoop Data Storage unit, by
offering a virtually impenetrable wall of security against any potential cyber threat. Hadoop attains this
high calibre security wall by following the below security protocol.
1. Authentication
2. Authorization
3. Auditing
Authentication:

Authentication is the first stage, where the user’s credentials are verified. The credentials typically include the user’s dedicated username and a secret password.
The entered credentials are checked against the details available in the security database; if valid, the user is authenticated.
Authorization:

Authorization is the second stage, where the system decides whether or not to grant the user permission to access data.
It is based on a predesignated access control list. Confidential information is kept secure, and only authorized personnel can access it.
Auditing:

Auditing is the last stage; it simply keeps track of the operations performed by the authenticated user while logged into the cluster. This is done solely for security purposes.
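In Hadoop, authentication is typically provided by Kerberos. As a minimal sketch only (a full setup also needs principals, keytabs, and per-service settings that are not shown here), the following core-site.xml properties switch a cluster from the default “simple” authentication to Kerberos and turn on service-level authorization:

<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>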

What is Apache Pig?

Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets, representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Pig.

To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language
provides various operators using which programmers can develop their own functions for reading, writing, and
processing data.

To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language. All these
scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig
Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.

Why Do We Need Apache Pig?

Programmers who are not proficient in Java often struggle to work with Hadoop, especially when writing MapReduce tasks. Apache Pig is a boon for all such programmers.

• Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex
codes in Java.
• Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an operation that would require 200 lines of code (LoC) in Java can often be done with as few as 10 LoC in Apache Pig (see the short script after this list). Ultimately, Apache Pig reduces development time by almost 16 times.
• Pig Latin is SQL-like language and it is easy to learn Apache Pig when you are familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc.
In addition, it also provides nested data types like tuples, bags, and maps that are missing from
MapReduce.
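For illustration, here is a minimal sketch of a complete Pig Latin script (the input path and field names are assumptions) that computes the maximum temperature per year; an equivalent hand-written Java MapReduce job would typically need far more code:

records = LOAD 'input/sample.txt' AS (year:chararray, temperature:int);
grouped = GROUP records BY year;
max_temp = FOREACH grouped GENERATE group, MAX(records.temperature);
DUMP max_temp;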

Features of Pig

Apache Pig comes with the following features −

• Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.
• Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if you are good
at SQL.
• Optimization opportunities − The tasks in Apache Pig optimize their execution automatically, so the
programmers need to focus only on semantics of the language.
• Extensibility − Using the existing operators, users can develop their own functions to read, process, and
write data.
• UDF’s − Pig provides the facility to create User-defined Functions in other programming languages
such as Java and invoke or embed them in Pig Scripts.
• Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as well as
unstructured. It stores the results in HDFS.

Apache Pig - Architecture

The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing
language which provides a rich set of data types and operators to perform various operations on the data.

To perform a particular task using Pig, programmers need to write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded).
After execution, these scripts will go through a series of transformations applied by the Pig Framework, to
produce the desired output.

Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus it makes the programmer’s job easy. The major components of the Apache Pig architecture are described below.
Apache Pig Components

There are various components in the Apache Pig framework. Let us take a look at the major components.

Parser

Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type checking, and
other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which represents
the Pig Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented
as edges.

Optimizer

The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.

Compiler

The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine

The MapReduce jobs are then submitted to Hadoop in a sorted order. Finally, these MapReduce jobs are executed on Hadoop, producing the desired results.

Installing and Running Pig

Prerequisites

It is essential that you have Hadoop and Java installed on your system before installing Apache Pig.

Download Apache Pig

Step 1

First of all, download the latest version of Apache Pig from the following website − https://pig.apache.org

Step 2

On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this page, under
the Download section, you will have two links, namely, Pig 0.8 and later and Pig 0.7 and before. Click on
the link Pig 0.8 and later, then you will be redirected to the page having a set of mirrors.
Step 3

Choose and click any one of these mirrors as shown below.


Step 4

These mirrors will take you to the Pig Releases page. This page contains various versions of Apache Pig. Click
the latest version among them.

Step 5

Within these folders, you will have the source and binary files of Apache Pig in various distributions.
Download the tar files of the source and binary distributions of Apache Pig 0.15: pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.

Install Apache Pig


After downloading the Apache Pig software, install it in your Linux environment by following the steps given
below.

Step 1

Create a directory with the name Pig in the same directory where the installation directories of Hadoop,
Java, and other software were installed.

$ mkdir Pig
Step 2

Extract the downloaded tar files as shown below.

$ cd Downloads/
$ tar zxvf pig-0.15.0.tar.gz
Step 3

Move the contents of the extracted pig-0.15.0 directory to the Pig directory created earlier as shown below.

$ mv pig-0.15.0/* /home/Hadoop/Pig/

Configure Apache Pig

After installing Apache Pig, we have to configure it. To configure, we need to edit two files − bashrc and
pig.properties.

.bashrc file

In the .bashrc file, set the following variables −

• PIG_HOME folder to the Apache Pig’s installation folder,


• PATH environment variable to the bin folder, and
• PIG_CLASSPATH environment variable to the etc (configuration) folder of your Hadoop installations
(the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files).
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf

$source ~/.bashrc
pig.properties file

In the conf folder of Pig, we have a file named pig.properties, in which you can set various parameters. The following command lists the properties that Pig supports:

pig -h properties
Verifying the Installation

Verify the installation of Apache Pig by typing the version command. If the installation is successful, you will
get the version of Apache Pig as shown below.

$ pig -version
Apache Pig version 0.15.0 (r1682971)
compiled Jun 01 2015, 11:44:35

Execution Types (or Apache Pig Execution Modes)

You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.

Local Mode

In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.

MapReduce Mode

MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using
Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a MapReduce job
is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.

Running Pig Program (or Apache Pig Execution Mechanisms)

Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.

• Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt shell. In
this shell, you can enter the Pig Latin statements and get the output (using Dump operator).
• Batch Mode (Script) − You can run Apache Pig in batch mode by writing the Pig Latin script in a single file with a .pig extension (see the example after this list).
• Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions
(User Defined Functions) in programming languages such as Java, and using them in our script.
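For example, a script saved in a file (the name id.pig below is a hypothetical example) can be run in batch mode from the command line, in either local or MapReduce mode:

$ pig -x local id.pig
$ pig -x mapreduce id.pig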

Invoking the Grunt Shell

You can invoke the Grunt shell in a desired mode (local/MapReduce) using the -x option as shown below.

Local mode:
$ ./pig -x local

MapReduce mode:
$ ./pig -x mapreduce

In either case, the command drops you into the Grunt shell prompt (grunt>).

Pig Latin Editors

There are Pig Latin syntax highlighters available for a variety of editors, including Eclipse, IntelliJ IDEA, Vim, Emacs, and TextMate.

Comparison with Databases

1. Apache Pig vs MapReduce


Below are the major differences between Apache Pig and MapReduce.

• Apache Pig is a data flow language, whereas MapReduce is a data processing paradigm.
• Pig Latin is a high-level language; MapReduce is low level and rigid.
• Performing a join operation in Apache Pig is pretty simple, whereas it is quite difficult in MapReduce to perform a join between datasets.
• Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig, whereas exposure to Java is a must to work with MapReduce.
• Apache Pig uses a multi-query approach, reducing the length of the code to a great extent; MapReduce requires almost 20 times more lines of code to perform the same task.
• There is no need for compilation in Pig: on execution, every Apache Pig operator is converted internally into a MapReduce job. MapReduce jobs, by contrast, have a long compilation process.

2. Apache Pig vs SQL


Below are the major differences between Apache Pig and SQL.

• Pig Latin is a procedural language, whereas SQL is a declarative language.
• In Apache Pig, a schema is optional: we can store data without designing a schema (fields are then addressed positionally as $0, $1, and so on). In SQL, a schema is mandatory.
• The data model in Apache Pig is nested relational, whereas the data model used in SQL is flat relational.
• Apache Pig provides limited opportunity for query optimization, whereas there is more opportunity for query optimization in SQL.
In addition to above differences, Apache Pig Latin −
➢ Allows splits in the pipeline.
➢ Allows developers to store data anywhere in the pipeline.
➢ Declares execution plans.
➢ Provides operators to perform ETL (Extract, Transform, and Load) functions.
3. Apache Pig vs Hive
Both Apache Pig and Hive are used to create MapReduce jobs, and in some cases Hive operates on HDFS in a similar way to Apache Pig. Below are a few significant points that set Apache Pig apart from Hive.

• Apache Pig uses a language called Pig Latin, originally created at Yahoo. Hive uses a language called HiveQL, originally created at Facebook.
• Pig Latin is a data flow language, whereas HiveQL is a query processing language.
• Pig Latin is a procedural language that fits the pipeline paradigm, whereas HiveQL is a declarative language.
• Apache Pig can handle structured, unstructured, and semi-structured data, whereas Hive is mostly for structured data.

Pig Latin
Pig Latin is the language used to analyze data in Hadoop using Apache Pig.

Pig Latin Data Model

The data model of Pig Latin is fully nested, and it allows complex non-atomic data types such as map and tuple. Its constructs are described below.
Atom

Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.

Example − ‘raja’ or ‘30’

Tuple

A record that is formed by an ordered set of fields is known as a tuple, the fields can be of any
type. A tuple is similar to a row in a table of RDBMS.

Example − (Raja, 30)

Bag

A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known
as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by
‘{}’. It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not necessary that
every tuple contain the same number of fields or that the fields in the same position (column)
have the same type.

Example − {(Raja, 30), (Mohammad, 45)}

A bag can be a field in a relation; in that context, it is known as inner bag.

Example − {Raja, 30, {9848022338, [email protected],}}

Map

A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and
should be unique. The value might be of any type. It is represented by ‘[]’

Example − [name#Raja, age#30]

Relation

A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that
tuples are processed in any particular order).
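As a small sketch of how these types can appear together in one schema (the file and field names are assumptions), a LOAD statement may declare tuple, bag, and map fields explicitly:

student = LOAD 'student_info.txt' AS (name:chararray, age:int, address:tuple(city:chararray, pin:int), phones:bag{t:(phone:chararray)}, props:map[]);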

Relational Operations
The following table describes the relational operators of Pig Latin.

Operator Description

Loading and Storing

LOAD To Load the data from the file system (local/HDFS) into a
relation.

STORE To save a relation to the file system (local/HDFS).

Filtering

FILTER To remove unwanted rows from a relation.

DISTINCT To remove duplicate rows from a relation.

FOREACH, GENERATE To generate data transformations based on columns of data.

STREAM To transform a relation using an external program.

Grouping and Joining

JOIN To join two or more relations.

COGROUP To group the data in two or more relations.

GROUP To group the data in a single relation.

CROSS To create the cross product of two or more relations.

Sorting

ORDER To arrange a relation in a sorted order based on one or more fields (ascending or descending).

LIMIT To get a limited number of tuples from a relation.

Combining and Splitting

UNION To combine two or more relations into a single relation.

SPLIT To split a single relation into two or more relations.

Diagnostic Operators

DUMP To print the contents of a relation on the console.

DESCRIBE To describe the schema of a relation.


EXPLAIN To view the logical, physical, or MapReduce execution plans to
compute a relation.

ILLUSTRATE To view the step-by-step execution of a series of statements.
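As a minimal sketch combining a few of these operators (the relation students and its fields are assumed to exist already), the following groups tuples by city, counts each group, sorts by the count, and keeps the top three:

by_city = GROUP students BY city;
city_count = FOREACH by_city GENERATE group AS city, COUNT(students) AS cnt;
ordered = ORDER city_count BY cnt DESC;
top3 = LIMIT ordered 3;
DUMP top3;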


Pig Latin also provides three statements—REGISTER, DEFINE, and IMPORT—that make it possible to incorporate macros and user-defined functions into Pig scripts.
Functions

Functions in Pig come in four types:

Eval function
A function that takes one or more expressions and returns another expression. An example of a
built-in eval function is MAX, which returns the maximum value of the entries in a bag. Some
eval functions are aggregate functions.
Filter function
A special type of eval function that returns a logical Boolean result. As the name suggests, filter
functions are used in the FILTER operator to remove unwanted rows. They can also be used in
other relational operators that take Boolean conditions, and in general, in expressions using
Boolean or conditional expressions. An example of a built-in filter function is IsEmpty, which
tests whether a bag or a map contains any items.
Load function
A function that specifies how to load data into a relation from external storage (local/HDFS).
Store function
A function that specifies how to save the contents of a relation to external storage. Often, load
and store functions are implemented by the same type. For example, PigStorage, which loads
data from delimited text files, can store data in the same format.
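As a small illustrative sketch (the relations sales and targets and their fields are assumptions), MAX is used here as an eval function and IsEmpty as a filter function:

grouped = COGROUP sales BY region, targets BY region;
no_sales = FILTER grouped BY IsEmpty(sales);
peak = FOREACH grouped GENERATE group, MAX(sales.amount);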
User-Defined Functions

In addition to the built-in functions, Apache Pig provides extensive support for User Defined Functions (UDFs). Using these UDFs, we can define our own functions and use them.
UDF support is provided in six programming languages, namely Java, Jython, Python, JavaScript, Ruby, and Groovy.
Types of UDFs in Java

While writing UDFs in Java, we can create and use the following three types of functions:

Filter Functions − The filter functions are used as conditions in filter statements. These functions accept a Pig value as input and return a Boolean value.

Eval Functions − The eval functions are used in FOREACH-GENERATE statements. These functions accept a Pig value as input and return a Pig result.

Algebraic Functions − The algebraic functions act on inner bags in a FOREACH-GENERATE statement. These functions are used to perform full MapReduce operations on an inner bag.
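As a minimal sketch of an eval UDF written in Java (the class name UpperCase and the JAR name myudfs.jar are assumptions, not part of Pig itself):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF: takes a single chararray field and returns it in upper case.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null; // pass nulls through, as the built-in functions do
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

After compiling the class and packaging it into a JAR, it can be registered and called from a Pig script (the relation and field names below are assumptions):

REGISTER myudfs.jar;
upper_names = FOREACH students GENERATE UpperCase(firstname);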

Data Processing Operators

The Load Operator

You can load data into Apache Pig from the file system (HDFS/ Local) using LOAD operator
of Pig Latin.

Syntax

The load statement consists of two parts divided by the “=” operator. On the left-hand side, we mention the name of the relation in which we want to store the data, and on the right-hand side, we define how we load the data. Given below is the syntax of the Load operator.

Relation_name = LOAD 'Input file path' USING function as schema;

Where,

• relation_name − We have to mention the relation in which we want to store the data.
• Input file path − We have to mention the HDFS directory where the file is stored. (In
MapReduce mode)
• function − We have to choose a function from the set of load functions provided by
Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
• Schema − We have to define the schema of the data. We can define the required schema
as follows −
(column1 : data type, column2 : data type, column3 : data type);

Note − We can also load the data without specifying a schema. In that case, the columns are addressed positionally as $0, $1, $2, and so on.

Example

As an example, let us load the data in student_data.txt in Pig under the schema
named Student using the LOAD command.

Start the Pig Grunt Shell

First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as shown
below.

$ pig -x mapreduce
grunt>
Execute the Load Statement

Now load the data from the file student_data.txt into Pig by executing the following Pig Latin
statement in the Grunt shell.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Following is the description of the above statement.

Relation name − We have stored the data in the relation named student.

Input file path − We are reading data from the file student_data.txt, which is in the /pig_data/ directory of HDFS.

Storage function − We have used the PigStorage() function. It loads and stores data as structured text files. It takes as a parameter the delimiter by which each entity of a tuple is separated; by default, the delimiter is ‘\t’.

Schema − We have stored the data using the following schema:

column : id, firstname, lastname, phone, city
datatype : int, chararray, chararray, chararray, chararray

The Store Operator

You can store the loaded data in the file system using the STORE operator. This section explains how to store data in Apache Pig using the Store operator.

Syntax

Given below is the syntax of the Store statement.

STORE Relation_name INTO ' required_directory_path ' [USING function];

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.
And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let us store the relation in the HDFS directory “/pig_Output/” as shown below.

grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
Output

After executing the STORE statement, a directory is created with the specified name and the data is stored in it.
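To inspect the stored result, the usual HDFS shell commands can be used (a sketch; the exact part-file name depends on the job):

$hdfs dfs -ls hdfs://localhost:9000/pig_Output/
$hdfs dfs -cat hdfs://localhost:9000/pig_Output/part-m-00000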

The FILTER operator

The FILTER operator is used to select the required tuples from a relation based on a condition.

Syntax

Given below is the syntax of the FILTER operator.

grunt> Relation2_name = FILTER Relation1_name BY (condition);

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as
shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Let us now use the Filter operator to get the details of the students who belong to the city
Chennai.

grunt> filter_data = FILTER student_details BY city == 'Chennai';


Verification

Verify the relation filter_data using the DUMP operator as shown below.

grunt> Dump filter_data;


Output

It will produce the following output, displaying the contents of the relation filter_data as
follows.

(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)

The FOREACH operator

The FOREACH operator is used to generate specified data transformations based on the column
data.

Syntax

Given below is the syntax of FOREACH operator.

grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);

Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as
shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Let us now get the id, age, and city values of each student from the relation student_details and
store it into another relation named foreach_data using the foreach operator as shown below.

grunt> foreach_data = FOREACH student_details GENERATE id,age,city;


Verification

Verify the relation foreach_data using the DUMP operator as shown below.

grunt> Dump foreach_data;


Output

It will produce the following output, displaying the contents of the relation foreach_data.

(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhuwaneshwar)
(6,23,Chennai)
(7,24,trivendram)
(8,24,Chennai)
Grouping and Joining Data

The JOIN operator is used to combine records from two or more relations. While performing a
join operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these
keys match, the two particular tuples are matched, else the records are dropped. Joins can be of
the following types −

• Self-join
• Inner-join
• Outer-join − left join, right join, and full join

Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory
of HDFS as shown below.

customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

And we have loaded these two files into Pig with the relations customers and orders as shown
below.
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
as (oid:int, date:chararray, customer_id:int, amount:int);

Let us now perform various Join operations on these two relations.

Self - join

Self-join is used to join a table with itself as if the table were two relations, temporarily
renaming at least one relation.

Generally, in Apache Pig, to perform self-join, we will load the same data multiple times, under
different aliases (names). Therefore let us load the contents of the file customers.txt as two
tables as shown below.

grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
Syntax

Given below is the syntax of performing self-join operation using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;


Example

Let us perform self-join operation on the relation customers, by joining the two
relations customers1 and customers2 as shown below.

grunt> customers3 = JOIN customers1 BY id, customers2 BY id;


Verification

Verify the relation customers3 using the DUMP operator as shown below.

grunt> Dump customers3;


Output

It will produce the following output, displaying the contents of the relation customers3.

(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

Inner Join

Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows
when there is a match in both tables.

It creates a new relation by combining column values of two relations (say A and B) based upon
the join-predicate. The query compares each row of A with each row of B to find all pairs of
rows which satisfy the join-predicate. When the join-predicate is satisfied, the column values for
each matched pair of rows of A and B are combined into a result row.

Syntax

Here is the syntax of performing inner join operation using the JOIN operator.

grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;


Example

Let us perform inner join operation on the two relations customers and orders as shown below.

grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;


Verification

Verify the relation coustomer_orders using the DUMP operator as shown below.

grunt> Dump coustomer_orders;


Output

You will get the following output, showing the contents of the relation named coustomer_orders.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Outer Join

Unlike inner join, outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways −

• Left outer join


• Right outer join
• Full outer join

Left Outer Join

The left outer Join operation returns all rows from the left table, even if there are no matches in
the right relation.

Syntax

Given below is the syntax of performing left outer join operation using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
Example

Let us perform left outer join operation on the two relations customers and orders as shown
below.

grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;


Verification

Verify the relation outer_left using the DUMP operator as shown below.
grunt> Dump outer_left;
Output

It will produce the following output, displaying the contents of the relation outer_left.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Right Outer Join

The right outer join operation returns all rows from the right table, even if there are no matches
in the left table.

Syntax

Given below is the syntax of performing right outer join operation using the JOIN operator.

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;


Example

Let us perform right outer join operation on the two relations customers and orders as shown
below.

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;


Verification

Verify the relation outer_right using the DUMP operator as shown below.

grunt> Dump outer_right;


Output

It will produce the following output, displaying the contents of the relation outer_right.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Full Outer Join

The full outer join operation returns rows when there is a match in one of the relations.

Syntax

Given below is the syntax of performing full outer join using the JOIN operator.

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;


Example

Let us perform full outer join operation on the two relations customers and orders as shown
below.

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;


Verification

Verify the relation outer_full using the DUMP operator as shown below.

grunt> Dump outer_full;


Output

It will produce the following output, displaying the contents of the relation outer_full.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
Using Multiple Keys

We can perform JOIN operation using multiple keys.

Syntax

Here is how you can perform a JOIN operation on two tables using multiple keys.

grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name BY (key1, key2);

Assume that we have two files namely employee.txt and employee_contact.txt in the /pig_data/ directory of HDFS as shown below.

employee.txt

001,Rajiv,Reddy,21,programmer,003
002,siddarth,Battacharya,22,programmer,003
003,Rajesh,Khanna,22,programmer,003
004,Preethi,Agarwal,21,programmer,003
005,Trupthi,Mohanthy,23,programmer,003
006,Archana,Mishra,23,programmer,003
007,Komal,Nayak,24,teamlead,002
008,Bharathi,Nambiayar,24,manager,001

employee_contact.txt

001,9848022337,[email protected],Hyderabad,003
002,9848022338,[email protected],Kolkata,003
003,9848022339,[email protected],Delhi,003
004,9848022330,[email protected],Pune,003
005,9848022336,[email protected],Bhuwaneshwar,003
006,9848022335,[email protected],Chennai,003
007,9848022334,[email protected],trivendram,002
008,9848022333,[email protected],Chennai,001

And we have loaded these two files into Pig with relations employee and employee_contact as
shown below.
grunt> employee = LOAD 'hdfs://localhost:9000/pig_data/employee.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, designation:chararray, jobid:int);

grunt> employee_contact = LOAD 'hdfs://localhost:9000/pig_data/employee_contact.txt' USING PigStorage(',')
as (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);

Now, let us join the contents of these two relations using the JOIN operator as shown below.

grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);


Verification

Verify the relation emp using the DUMP operator as shown below.

grunt> Dump emp;


Output

It will produce the following output, displaying the contents of the relation named emp as shown
below.

(1,Rajiv,Reddy,21,programmer,3,1,9848022337,[email protected],Hyderabad,3)
(2,siddarth,Battacharya,22,programmer,3,2,9848022338,[email protected],Kolkata,3)
(3,Rajesh,Khanna,22,programmer,3,3,9848022339,[email protected],Delhi,3)
(4,Preethi,Agarwal,21,programmer,3,4,9848022330,[email protected],Pune,3)
(5,Trupthi,Mohanthy,23,programmer,3,5,9848022336,[email protected],Bhuwaneshwar,3)
(6,Archana,Mishra,23,programmer,3,6,9848022335,[email protected],Chennai,3)
(7,Komal,Nayak,24,teamlead,2,7,9848022334,[email protected],trivendram,2)
(8,Bharathi,Nambiayar,24,manager,1,8,9848022333,[email protected],Chennai,1)
