Big-Data-Unit 4
UNIT-4
Hadoop Input/output
Outline
4.1 Hadoop Programming
4.5.3 Application
4.5.4 Security
UNIT 4
Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Hadoop is used for storing and processing Big Data in a distributed manner on large clusters of commodity hardware, and it is licensed under the Apache v2 license.
Hadoop was developed based on the papers written by Google on the MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is one of the top-level Apache projects. Hadoop was developed by Doug Cutting and Michael J. Cafarella.
HDFS provides a distributed way to store Big Data. Your data is stored in blocks on Data Nodes, and you specify the size of each block. Suppose you have 512 MB of data and you have configured HDFS to create 128 MB data blocks. HDFS will then divide the data into 4 blocks (512/128 = 4) and store them across different Data Nodes. While storing these data blocks on Data Nodes, the blocks are replicated on different Data Nodes to provide fault tolerance. Hadoop follows horizontal scaling instead of vertical scaling: in horizontal scaling, you can add new nodes to the HDFS cluster on the run as per requirement, instead of increasing the hardware stack in each node. In HDFS you can store all kinds of data, whether it is structured, semi-structured or unstructured, and there is no pre-dumping schema validation. HDFS also follows a write-once-read-many model, so you can write any kind of data once and read it multiple times to find insights.
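As an illustration only (not part of the original material), the block size, replication factor and resulting block count of a file already stored in HDFS can be inspected through Hadoop's Java FileSystem API. The path used below is a placeholder; a minimal sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                 // handle to the configured file system (HDFS)
        FileStatus status = fs.getFileStatus(new Path("/user/input/file.txt")); // example path
        long blockSize = status.getBlockSize();               // e.g. 128 MB = 134217728 bytes
        short replication = status.getReplication();          // e.g. 3 copies of each block
        long blocks = (status.getLen() + blockSize - 1) / blockSize; // 512 MB / 128 MB = 4 blocks
        System.out.println("Blocks: " + blocks + ", block size: " + blockSize
                + ", replication: " + replication);
        fs.close();
    }
}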
The third challenge was about processing the data faster. To solve this, we move the processing unit to the data instead of moving the data to the processing unit. In other words, instead of moving data from different nodes to a single master node for processing, the processing logic is sent to the nodes where the data is stored, so that each node can process a part of the data in parallel. Finally, all of the intermediate output produced by each node is merged together and the final response is sent back to the client.
Features of Hadoop
Reliability
When machines are working as a single unit, if one of the machines fails, another machine will
take over the responsibility and work in a reliable and fault-tolerant fashion. Hadoop
infrastructure has inbuilt fault tolerance features and hence, Hadoop is highly reliable.
Economical
Hadoop uses commodity hardware (like your PC or laptop). For example, in a small Hadoop cluster, all your Data Nodes can have normal configurations like 8-16 GB RAM, 5-10 TB of disk and Xeon processors. Using hardware-based RAID with Oracle for the same purpose would cost at least five times more, so the cost of ownership of a Hadoop-based project is minimized. It is easier to maintain a Hadoop environment and it is economical as well. Also, Hadoop is open-source software, so there is no licensing cost.
Scalability
Hadoop has the inbuilt capability of integrating seamlessly with cloud-based services. So, if you
are installing Hadoop on a cloud, you don’t need to worry about the scalability factor because
you can go ahead and procure more hardware and expand your set up within minutes whenever
required.
Flexibility
Hadoop is very flexible in terms of its ability to deal with all kinds of data. Recall the "Variety" characteristic of Big Data: data can be of any kind, and Hadoop can store and process it all, whether it is structured, semi-structured or unstructured.
Hadoop Core Components
While setting up a Hadoop cluster, you have the option of choosing many services as part of your Hadoop platform, but two services are always mandatory for setting up Hadoop. One is HDFS (storage) and the other is YARN (processing). HDFS stands for Hadoop Distributed File System, which is the scalable storage unit of Hadoop, whereas YARN is used to process the data stored in HDFS in a distributed and parallel fashion.
HDFS
Let us go ahead with HDFS first. The main components of HDFS are the Name Node and
the Data Node. Let us talk about the roles of these two components in detail.
Name Node
It is the master daemon that maintains and manages the Data Nodes (slave nodes)
It records the metadata of all the blocks stored in the cluster, e.g. location of blocks
stored, size of the files, permissions, hierarchy, etc.
It records each and every change that takes place to the file system metadata
If a file is deleted in HDFS, the Name Node will immediately record this in the Edit Log
It regularly receives a Heartbeat and a block report from all the Data Nodes in the
cluster to ensure that the Data Nodes are alive
It keeps a record of all the blocks in the HDFS and Data Node in which they are stored
It has high availability and federation features which I will discuss
in HDFS architecture in detail
Data Node
It is the slave daemon that runs on each slave machine
It stores the actual data blocks and serves read and write requests from clients
It regularly sends a Heartbeat and a block report to the Name Node
So, this was all about HDFS in a nutshell. Now, let us move ahead to the second fundamental unit of Hadoop, i.e. YARN.
YARN
YARN comprises two major components: the Resource Manager and the Node Manager.
Resource Manager
It is a cluster-level (one for each cluster) component and runs on the master machine
It manages resources and schedules applications running on top of YARN
It has two components: Scheduler & Application Manager
The Scheduler is responsible for allocating resources to the various running applications
The Application Manager is responsible for accepting job submissions and negotiating
the first container for executing the application
It keeps a track of the heartbeats from the Node Manager
Node Manager
It is a node-level component (one on each node) and runs on each slave machine
It is responsible for managing containers and monitoring resource utilization in each
container
It also keeps track of node health and log management
It continuously communicates with Resource Manager to remain up-to-date
Hadoop Ecosystem
So far you would have figured out that Hadoop is neither a programming language nor a service,
it is a platform or framework which solves Big Data problems. You can consider it as a suite
which encompasses a number of services for ingesting, storing and analyzing huge data sets
along with tools for configuration management.
As a case study, consider Last.FM, the online music and internet radio service, which records two main kinds of listening events:
Scrobble: when a user plays a track of his or her own choice and sends the information to Last.FM through a client application.
Radio listen: when the user tunes into a Last.FM radio station and streams a song.
Over 40M unique visitors and 500M page views each month
Scrobble stats:
Up to 800 scrobbles per second
More than 40 million scrobbles per day
Over 75 billion scrobbles so far
Radio stats:
Over 10 million streaming hours per month
Over 400 thousand unique stations per day
Each Scrobble and radio listen generates at least one logline
Hadoop at Last.FM
100 Nodes
8 cores per node (dual quad-core)
24GB memory per node
8TB (4 disks of 2TB each)
Hive integration to run optimized SQL queries for analysis
Last.FM started using Hadoop in 2006 because of the growth in users from thousands to millions. With the help of Hadoop, they processed hundreds of daily, weekly, and monthly jobs including website stats and metrics, chart generation (i.e. track statistics), metadata corrections (e.g. misspellings of artists), indexing for search, combining/formatting data for recommendations, data insights, evaluations and reporting. This helped Last.FM grow tremendously and figure out the taste of their users, based on which they started recommending music.
For example:
Earlier we had landline phones, but now we have shifted to smartphones. Similarly, how many of you remember the floppy drives that were extensively used back in the '90s? Floppy drives have been replaced by hard disks because they had very low storage capacity and transfer speed.
Thus, floppy drives are insufficient for handling the amount of data with which we are dealing today. In fact, now we can store terabytes of data on the cloud without being bothered about size constraints.
IoT connects your physical devices to the internet and makes them smarter. Nowadays, we have smart air conditioners, televisions, etc. A smart air conditioner constantly monitors your room temperature along with the outside temperature and accordingly decides what the temperature of the room should be. Now imagine how much data would be generated in a year by smart air conditioners installed in tens of thousands of houses. By this, you can understand how IoT is contributing a major share to Big Data. Now, let us talk about the largest contributor to Big Data, which is none other than social media. Social media is one of the most important factors in the evolution of Big Data, as it provides information about people's behavior.
Fig: Social Media Data Generation Stats
Hadoop Tutorial: Big Data & Hadoop – Restaurant Analogy
Let us take the analogy of a restaurant to understand the problems associated with Big Data and how Hadoop solved them. Bob is a businessman who has opened a small restaurant. Initially, he used to receive two orders per hour, and he had one chef with one food shelf, which was sufficient to handle all the orders.
Now let us compare the restaurant example with the traditional scenario, where data was generated at a steady rate and traditional systems like an RDBMS were capable enough to handle it, just like Bob's chef. Here, you can relate data storage to the restaurant's food shelf and the traditional processing unit to the chef.
After a few months, Bob thought of expanding his business, so he started taking online orders and added a few more cuisines to the restaurant's menu in order to engage a larger audience. Because of this transition, the rate at which orders were received rose to an alarming 10 orders per hour, and it became quite difficult for a single cook to cope with the situation. Aware of the problem in processing the orders, Bob started thinking about a solution.
Similarly, in the Big Data scenario, data started getting generated at an alarming rate because of the introduction of various data growth drivers such as social media, smartphones, etc. The traditional system, just like the cook in Bob's restaurant, was not efficient enough to handle this sudden change, so a different kind of solution was needed to cope with this problem. After a lot of research, Bob came up with a solution: he hired 4 more chefs to tackle the huge rate of incoming orders. Everything was going quite well, but this solution led to one more problem. Since four chefs were sharing the same food shelf, the food shelf itself became the bottleneck of the whole process.
Similarly, to tackle the problem of processing huge data sets, multiple processing units were
installed so as to process the data in parallel (just like Bob hired 4 chefs). But even in this case,
bringing multiple processing units was not an effective solution because the centralized storage
unit became the bottleneck. In other words, the performance of the whole system is driven by the
performance of the central storage unit. Therefore, the moment our central storage goes down,
the whole system gets compromised. Hence, again there was a need to resolve this single point of
failure.
Bob came up with another efficient solution: he divided the chefs into two hierarchies, junior chefs and a head chef, and assigned each junior chef a food shelf. Let us assume that the dish is meat sauce. According to Bob's plan, one junior chef will prepare the meat and the other junior chef will prepare the sauce. They will then transfer both the meat and the sauce to the head chef, who will prepare the meat sauce by combining both ingredients, and the combined dish will be delivered as the final order.
Hadoop functions in a similar fashion to Bob's restaurant. Just as the food shelves are distributed in Bob's restaurant, in Hadoop the data is stored in a distributed fashion, with replication, to provide fault tolerance. For parallel processing, the data is first processed by the slave nodes where it is stored, producing intermediate results, and then those intermediate results are merged by the master node to produce the final result. Now you should have an idea of why Big Data is a problem statement and how Hadoop solves it.
As we just discussed above, there were three major challenges with Big Data:
Storing huge data in a traditional system is not possible. The reason is obvious, the storage will
be limited to one system and the data is increasing at a tremendous rate.
Now we know that storing is a problem, but let me tell you it is just one part of the problem. The
data is not only huge, but it is also present in various formats i.e. unstructured, semi-structured
and structured. So, you need to make sure that you have a system to store different types of
data that is generated from various sources.
Finally, let us focus on the third problem: processing speed.
The time taken to process this huge amount of data is quite high, as the data to be processed is very large. To solve the storage and processing issues, two core components were created in Hadoop: HDFS and YARN. HDFS solves the storage issue, as it stores the data in a distributed fashion and is easily scalable. YARN solves the processing issue by reducing processing time drastically. Moving ahead, let us understand what Hadoop is. Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
MapReduce
MapReduce is a parallel programming model for writing distributed applications devised at
Google for efficient processing of large amounts of data (multi-terabyte data-sets), on large
clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The
MapReduce program runs on Hadoop which is an Apache open-source framework.
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on commodity hardware. It has many
similarities with existing distributed file systems. However, the differences from other
distributed file systems are significant. It is highly fault-tolerant and is designed to be deployed
on low-cost hardware. It provides high throughput access to application data and is suitable for
applications having large datasets.
Apart from the two core components mentioned above, the Hadoop framework also includes the following two modules −
Hadoop Common − Java libraries and utilities required by the other Hadoop modules.
Hadoop YARN − a framework for job scheduling and cluster resource management.
It is quite expensive to build bigger servers with heavy configurations to handle large-scale processing. As an alternative, you can tie together many commodity single-CPU computers as a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide much higher throughput. Moreover, it is cheaper than one high-end server. So the first motivation behind using Hadoop is that it runs across clusters of low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks
that Hadoop performs −
Data is initially divided into directories and files. Files are divided into uniform-sized blocks of 128 MB or 64 MB (preferably 128 MB).
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, in turn utilizing the underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault-tolerance and high availability
(FTHA), rather Hadoop library itself has been designed to detect and handle failures at
the application layer.
Servers can be added or removed from the cluster dynamically and Hadoop continues to
operate without interruption.
Another big advantage of Hadoop is that apart from being open source, it is compatible
on all the platforms since it is Java based.
Starting HDFS
Initially you have to format the configured HDFS file system, open the namenode (HDFS server), and execute the following command.
$ hadoop namenode -format
After formatting the HDFS, start the distributed file system. The following command will start the namenode as well as the datanodes as a cluster.
$ start-dfs.sh
After loading the information in the server, we can find the list of files in a directory, the status of a file, etc. using 'ls'. Given below is the syntax of ls; you can pass a directory or a filename as an argument.
$ $HADOOP_HOME/bin/hadoop fs -ls <args>
Assume we have data in a file called file.txt in the local system which ought to be saved in the HDFS file system. Follow the steps given below to insert the required file into the Hadoop file system.
Step 1
You have to create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input
Step 2
Transfer and store a data file from local systems to the Hadoop file system using the put
command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Step 3
You can verify the file using ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input
Assume we have a file in HDFS called outfile. Given below is a simple demonstration of retrieving the required file from the Hadoop file system.
Step 1
Initially, view the data from HDFS using the cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Step 2
Get the file from HDFS to the local file system using the get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
You can shut down the HDFS by using the following command.
$ stop-dfs.sh
There are many more commands in "$HADOOP_HOME/bin/hadoop fs" than are demonstrated here, although these basic operations will get you started. Running ./bin/hadoop dfs with no additional arguments will list all the commands that can be run with the shell. Furthermore, $HADOOP_HOME/bin/hadoop fs -help commandName will display a short usage summary for the operation in question, if you are stuck.
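The same put and get operations shown above with the shell can also be performed programmatically through Hadoop's Java FileSystem API. The following is a minimal sketch; the paths simply mirror the shell examples above and are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutGet {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Equivalent of: hadoop fs -put /home/file.txt /user/input
        fs.copyFromLocalFile(new Path("/home/file.txt"), new Path("/user/input/file.txt"));
        // Equivalent of: hadoop fs -get /user/output/ /home/hadoop_tp/
        fs.copyToLocalFile(new Path("/user/output"), new Path("/home/hadoop_tp"));
        fs.close();
    }
}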
A table of all the operations is shown below. The following conventions are used for parameters
−
"<Path>" means any file or directory name.
"<Path>..." means one or more file or directory names.
"<File>" means any filename.
"<Src>" and "<dest>" are path names in a directed operation.
"<Locals>" and "<local Dest>" are paths as above, but on the local file system.
All other files and path names refer to the objects inside HDFS.
1 -ls <path>
Lists the contents of the directory specified by path, showing the names, permissions,
owner, size and modification date for each entry.
2 -lsr <path>
Behaves like -ls, but recursively displays entries in all subdirectories of path.
3 -du <path>
Shows disk usage, in bytes, for all the files which match path; filenames are reported with
the full HDFS protocol prefix.
4 -dus <path>
Like -du, but prints a summary of disk usage of all files/directories in the path.
5 -mv <src><dest>
Moves the file or directory indicated by src to dest, within HDFS.
7 -rm <path>
Removes the file or empty directory identified by path.
8 -rmr <path>
Removes the file or directory identified by path. Recursively deletes any child entries (i.e.,
files or subdirectories of path).
10 -copyFromLocal <localSrc> <dest>
Identical to -put
14 -cat <filename>
Displays the contents of filename on stdout.
17 -mkdir <path>
Creates a directory named path in HDFS.
Creates any parent directories in path that are missing (e.g., mkdir -p in Linux).
19 -touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already
exists at path, unless the file is already size 0.
26 -help <cmd-name>
Returns usage information for one of the commands listed above. You must omit the
leading '-' character in cmd.
What is MapReduce?
MapReduce is a processing technique and a program model for distributed computing based on
java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). Secondly, reduce task, which takes the output from a map as
an input and combines those data tuples into a smaller set of tuples. As the sequence of the
name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers is
sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of machines in a cluster
is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
The Algorithm
Generally MapReduce paradigm is based on sending the computer to where the data
resides!
MapReduce program executes in three stages, namely map stage, shuffle stage, and
reduce stage.
o Map stage − The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
o Reduce stage − this stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in
the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
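To make the job-submission step concrete, the sketch below shows how a driver program wires a Mapper and a Reducer into a job and submits it to the cluster. The class and path names are placeholders, not taken from the text; the Mapper and Reducer classes it refers to are sketched in the next section.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // map stage
        job.setReducerClass(WordCountReducer.class);    // reduce stage (shuffle happens in between)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit the job and wait
    }
}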
Inputs and Outputs (Java Perspective)
The MapReduce framework operates on <key, value> pairs, that is, the framework views the
input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the
output of the job, conceivably of different types.
The key and value classes should be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. The input and output types of a MapReduce job are: (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).
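As a concrete, illustrative instance of the <k1, v1> → <k2, v2> → <k3, v3> flow, a word-count Mapper and Reducer can be written with Hadoop's Writable types as follows. The class names are the same placeholders used in the driver sketch above and are not taken from the text.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// <k1, v1> = <byte offset, line of text>, <k2, v2> = <word, 1>
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);     // emit <word, 1>
            }
        }
    }
}

// <k2, list(v2)> = <word, [1, 1, ...]>, <k3, v3> = <word, total count>
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // emit <word, total>
    }
}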
Terminology
Payload − Applications implement the Map and Reduce functions, and form the core of the job.
Mapper − Maps the input key/value pairs to a set of intermediate key/value pairs.
Name Node − Node that manages the Hadoop Distributed File System (HDFS).
Data Node − Node where data is present in advance before any processing takes place.
Master Node − Node where the Job Tracker runs and which accepts job requests from clients.
Slave Node − Node where the Map and Reduce programs run.
Job Tracker − Schedules jobs and tracks the assigned jobs to the Task Tracker.
Task Tracker − Tracks the task and reports status to the Job Tracker.
Job − An execution of a Mapper and Reducer across a dataset.
Task − An execution of a Mapper or a Reducer on a slice of data.
Task Attempt − A particular instance of an attempt to execute a task on a slave node.
Example Scenario
Given below is data regarding the electrical consumption of an organization. It contains the monthly electrical consumption and the annual average for various years.
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
If the above data is given as input, we have to write applications to process it and produce results such as finding the year of maximum usage, the year of minimum usage, and so on. This is a walkover for programmers with a finite number of records: they will simply write the logic to produce the required output and pass the data to the application.
But, think of the data representing the electrical consumption of all the large scale industries of
a particular state, since its formation.
When we write applications to process such bulk data,
They will take a lot of time to execute.
There will be heavy network traffic when we move data from source to network server
and so on.
To overcome these issues, we use the MapReduce framework, which moves the computation to the data and processes the records in parallel across the cluster.
Hadoop Streaming
Hadoop streaming is a utility that comes with the Hadoop distribution. This utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For Hadoop streaming, we consider the word-count problem. Any job in Hadoop must have two phases: mapper and reducer. We have written code for the mapper and the reducer as Python scripts to run them under Hadoop. One can also write the same in Perl or Ruby.
Mapper Phase Code
#!/usr/bin/python
import sys
# Input is taken from standard input, one line at a time
for myline in sys.stdin:
    # Remove whitespace on either side and split the line into words
    for word in myline.strip().split():
        # Write the result to standard output as word<TAB>1
        print('%s\t%s' % (word, 1))
Reducer Phase Code
#!/usr/bin/python
import sys
current_word = None
current_count = 0
word = None
# Input comes from the mapper output on standard input
for line in sys.stdin:
    # Parse the word and count produced by mapper.py
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Count was not a number, so silently ignore this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write result to standard output
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word
# Do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
Save the mapper and reducer code as mapper.py and reducer.py in the Hadoop home directory. Make sure these files have execute permission (chmod +x mapper.py and chmod +x reducer.py). As Python is indentation-sensitive, take care to preserve the indentation when copying the code.
Execution of Word Count Program
$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
-input input_dirs \
-output output_dir \
-mapper <path>/mapper.py \
-reducer <path>/reducer.py
Where "\" is used for line continuation for clear readability.
For example:
./bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -input myinput -output myoutput
-mapper /home/expert/hadoop-1.2.1/mapper.py -reducer /home/expert/hadoop-1.2.1/reducer.py
In the above example, both the mapper and the reducer are python scripts that read the input
from standard input and emit the output to standard output. The utility will create a Map/Reduce
job, submit the job to an appropriate cluster, and monitor the progress of the job until it
completes.
When a script is specified for mappers, each mapper task will launch the script as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the mapper collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized as per one's needs.
When a script is specified for reducers, each reducer task will launch the script as a separate process when the reducer is initialized. As the reducer task runs, it converts its input
key/values pairs into lines and feeds the lines to the standard input (STDIN) of the process. In
the meantime, the reducer collects the line-oriented outputs from the standard output
(STDOUT) of the process, converts each line into a key/value pair, which is collected as the
output of the reducer. By default, the prefix of a line up to the first tab character is the key and
the rest of the line (excluding the tab character) is the value. However, this can be customized as
per specific requirements.
Important Commands
-outputformat (optional) − Output format class; if not specified, TextOutputFormat is used as the default.
-inputreader (optional) − For backwards-compatibility: specifies a record reader class (instead of an input format class).
Advantages of Hadoop
o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended by just adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
o Resilient to failure: HDFS has the property of replicating data over the network, so if one node is down or some other network failure happens, Hadoop takes the other copy of the data and uses it. Normally, data is replicated three times, but the replication factor is configurable.
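Since the replication factor is configurable, it can even be changed per file. A minimal, illustrative sketch using the Java FileSystem API (the path and the factor of 4 are examples, not values from the text):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Raise the replication factor of one file from the default (normally 3) to 4
        boolean ok = fs.setReplication(new Path("/user/input/file.txt"), (short) 4);
        System.out.println("Replication change requested: " + ok);
        fs.close();
    }
}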
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper published by Google.
In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an open-source web crawler software project.
While working on Apache Nutch, they were dealing with big data. Storing that data would have cost a great deal, which became a problem for the project. This problem became one of the important reasons for the emergence of Hadoop.
In 2003, Google introduced a file system known as GFS (Google file system). It is a
proprietary distributed file system developed to provide efficient access to data.
In 2004, Google released a white paper on Map Reduce. This technique simplifies the
data processing on large clusters.
In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as
NDFS (Nutch Distributed File System). This file system also includes Map reduce.
In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
Doug Cutting named his project Hadoop after his son's toy elephant.
In 2007, Yahoo ran two clusters of 1000 machines.
In 2008, Hadoop became the fastest system to sort 1 terabyte of data, on a 900-node cluster, within 209 seconds.
In 2013, Hadoop 2.2 was released.
In 2017, Hadoop 3.0 was released.
Features of HDFS
o Highly Scalable - HDFS is highly scalable as it can scale to hundreds of nodes in a single cluster.
o Replication - Due to some unfavorable condition, the node containing the data may be lost. So, to overcome such problems, HDFS always maintains a copy of the data on a different machine.
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is highly fault-tolerant in that, if any machine fails, the other machine containing a copy of that data automatically becomes active.
o Distributed data storage - This is one of the most important features of HDFS that makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.
Goals of HDFS
o Handling hardware failure - HDFS contains multiple server machines. If any machine fails, the HDFS goal is to recover from it quickly.
o Streaming data access - HDFS applications usually run on the general-purpose file system. These applications require streaming access to their data sets.
o Coherence model - Applications that run on HDFS are required to follow the write-once-read-many approach. So, a file, once created, need not be changed. However, it can be appended and truncated.
What is YARN
YARN (Yet Another Resource Negotiator) takes processing beyond MapReduce and Java, and makes it possible for other applications such as HBase and Spark to work on it. Different YARN applications can co-exist on the same cluster, so MapReduce, HBase and Spark can all run at the same time, bringing great benefits for manageability and cluster utilization.
Components of YARN
The Job Tracker and Task Tracker were used in previous versions of Hadoop; they were responsible for handling resources and checking progress. However, Hadoop 2.0 has the Resource Manager and Node Manager to overcome the shortfalls of the Job Tracker and Task Tracker.
Benefits of YARN
o Scalability: MapReduce 1 hits a scalability bottleneck at 4,000 nodes and 40,000 tasks, but YARN is designed for 10,000 nodes and 100,000 tasks.
o Utilization: The Node Manager manages a pool of resources, rather than a fixed number of designated slots, thus increasing utilization.
o Multitenancy: Different versions of MapReduce can run on YARN, which makes the process of upgrading MapReduce more manageable.
What is HBase
HBase is an open-source, sorted-map data store built on top of Hadoop. It is column-oriented and horizontally scalable. It is based on Google's BigTable. It has a set of tables which keep data in key-value format. HBase is well suited for sparse data sets, which are very common in Big Data use cases. HBase provides APIs enabling development in practically any programming language. It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
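To illustrate the random read/write access mentioned above, here is a rough sketch using the HBase Java client API. It assumes the hbase-client dependency is available and that a table named "users" with a column family "info" already exists; all names are invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();      // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key "user1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
            // Random read of the same row
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}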
Why HBase
Features of HBase
What is HIVE
Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was developed by Facebook. Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries called HQL (Hive Query Language), which are internally converted into MapReduce jobs.
Using Hive, we can skip the requirement of the traditional approach of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDFs).
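As an illustration of how HQL can be submitted from a program, the following is a rough sketch using the Hive JDBC driver. The connection URL, user, table and query are assumptions for the example, and the hive-jdbc dependency is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint (host, port and database are placeholders)
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement()) {
            // An HQL query; Hive translates it into MapReduce jobs internally
            ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM employees");
            while (rs.next()) {
                System.out.println("rows = " + rs.getLong(1));
            }
        }
    }
}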
Features of Hive
Limitations of Hive
What is Sqoop
Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Using Sqoop, data can be moved into HDFS/Hive/HBase from MySQL/PostgreSQL/Oracle/SQL Server/DB2 and vice versa.
Sqoop Working
Step 1: Sqoop sends a request to the relational database to return the metadata information about the table (metadata here is the data about the table in the relational DB).
Step 2: From the received information, Sqoop generates Java classes (this is why you should have Java configured before getting it working; Sqoop internally uses the JDBC API).
Step 3: Sqoop, being written in Java, packages the compiled classes to be able to generate the table structure and, after compiling, creates a JAR file (the Java packaging standard).
What is Spark?
Apache Spark is an open-source cluster computing framework. Its primary purpose is to handle
the real-time generated data. Spark was built on the top of the Hadoop MapReduce. It was
optimized to run in memory whereas alternative approaches like Hadoop's MapReduce writes
data to and from computer hard drives. So, Spark processes the data much quicker than other
alternatives.
Spark was initiated by Matei Zaharia at UC Berkeley's AMPLab in 2009. It was open sourced in 2010 under a BSD license.
In 2013, the project was donated to the Apache Software Foundation. In 2014, Spark became a top-level Apache project.
Fast - It provides high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
Easy to Use - It facilitates writing applications in Java, Scala, Python, R, and SQL. It also provides more than 80 high-level operators.
Generality - It provides a collection of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
Lightweight - It is a light unified analytics engine which is used for large scale data
processing.
Runs Everywhere - It can easily run on Hadoop, Apache Mesos, Kubernetes, standalone,
or in the cloud.
Uses of Spark
Data integration: The data generated by different systems is not consistent enough to combine for analysis. To fetch consistent data from systems we can use processes like Extract, Transform, and Load (ETL). Spark is used to reduce the cost and time required for this ETL process.
Stream processing: It is always difficult to handle real-time generated data such as log files. Spark is capable of operating on streams of data and helps reject potentially fraudulent operations.
Machine learning: Machine learning approaches become more feasible and increasingly accurate due to the growth in the volume of data. As Spark is capable of storing data in memory and can run repeated queries quickly, it makes it easy to work on machine learning algorithms.
Interactive analytics: Spark is able to generate responses rapidly. So, instead of running only pre-defined queries, we can handle the data interactively.
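A minimal sketch of a Spark application in Java, illustrating the in-memory reuse described above; it assumes the spark-sql dependency, a local master, and a placeholder input path:

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class SparkExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("LineCount")
                .master("local[*]")                // run locally for the example
                .getOrCreate();
        // Read a text file (could also be an HDFS path) and cache it in memory
        Dataset<String> lines = spark.read().textFile("input.txt").cache();
        long total = lines.count();                // first action materializes the cache
        long errors = lines.filter((FilterFunction<String>) line -> line.contains("ERROR"))
                .count();                          // second query reuses the cached data
        System.out.println("total = " + total + ", errors = " + errors);
        spark.stop();
    }
}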
Apache Pig is a high-level data flow platform for executing MapReduce programs of Hadoop. The language used for Pig is Pig Latin. Pig scripts are internally converted to MapReduce jobs and executed on data stored in HDFS. Apart from that, Pig can also execute its jobs in Apache Tez or Apache Spark. Pig can handle any type of data, i.e., structured, semi-structured or unstructured, and stores the corresponding results in the Hadoop Distributed File System. Every task which can be achieved using Pig can also be achieved using Java MapReduce.
1) Ease of programming
Writing complex Java programs for MapReduce is quite tough for non-programmers. Pig makes this process easy: in Pig, the queries are converted to MapReduce internally.
2) Optimization opportunities
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
3) Extensibility
A user-defined function is written in which the user can write their logic to execute over the data
set.
4) Flexible
5) In-built operators
MapReduce: It is difficult to perform data operations such as union, sorting and ordering in MapReduce.
Pig: It provides built-in operators to perform data operations like union, sorting and ordering.
MapReduce: It doesn't allow nested data types.
Pig: It provides nested data types like tuples, bags, and maps.
Advantages of Apache Pig
o Less code - Pig requires fewer lines of code to perform any operation.
o Reusability - Pig code is flexible enough to be reused.
o Nested data types - Pig provides the useful concept of nested data types like tuple, bag, and map.
Pre-requisite
o Java Installation - Check whether Java is installed using the following command.
1. $ java -version
o Hadoop Installation - Check whether Hadoop is installed using the following command.
1. $ hadoop version
Steps to install Apache Pig
o Download the Apache Pig tar file.
o Unzip the downloaded tar file.
1. export PIG_HOME=/home/hduser/pig-0.16.0
2. export PATH=$PATH:$PIG_HOME/bin
o Update the environment variable
1. $ source ~/.bashrc
o Let's test the installation on the command prompt type
1. $ pig -h
o Let's start the pig in MapReduce mode.
1. $ pig
Apache Pig Run Modes
Apache Pig executes in two modes: Local Mode and MapReduce Mode.
Local Mode
o It executes in a single JVM and is used for development experimenting and prototyping.
o Here, files are installed and run using local host.
o The local mode works on a local file system. The input and output data stored in the local
file system.
1. $ pig -x local
MapReduce Mode
1. $ pig
Or,
2. $ pig -x mapreduce
Ways to execute Pig Program
These are the following ways of executing a Pig program on local and MapReduce mode: -
o Interactive Mode - In this mode, Pig is executed in the Grunt shell. To invoke the Grunt shell, run the pig command. Once the Grunt shell starts, we can provide Pig Latin statements and commands interactively at the command line.
o Batch Mode - In this mode, we can run a script file having a .pig extension. These files contain Pig Latin commands.
o Embedded Mode - In this mode, we can define our own functions, called UDFs (User Defined Functions). Here, we use programming languages such as Java and embed Pig in them (see the sketch below).
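For the embedded mode just mentioned, a rough sketch of running Pig from Java through the PigServer class follows. The script, paths and the choice of local execution are assumptions for illustration only.

import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // "local" uses the local file system; "mapreduce" would run on the cluster
        PigServer pigServer = new PigServer("local");
        // Register Pig Latin statements one by one
        pigServer.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pigServer.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pigServer.registerQuery("grouped = GROUP words BY word;");
        pigServer.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        // Execute the pipeline and write the result
        pigServer.store("counts", "wordcount_output");
        pigServer.shutdown();
    }
}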
Pig Latin
Pig Latin is the data flow language used by Apache Pig to analyze the data in Hadoop. It is a textual language that abstracts the programming from the Java MapReduce idiom into a notation. Pig Latin statements are used to process the data. Each statement is an operator that accepts a relation as input and generates another relation as output.
Convention    Description
()    The parentheses can enclose one or more items. They can also be used to indicate the tuple data type.
[]    The straight brackets can enclose one or more items. They can also be used to indicate the map data type. Example - [INNER | OUTER]
{}    The curly brackets enclose two or more items. They can also be used to indicate the bag data type.
Pig Data Types
Apache Pig supports many data types. The simple types include int, long, float, double, chararray, and bytearray (a bytearray defines a byte array). The complex types include tuple, bag, and map.
Pig Example
Use case: Using Pig, find the most frequently occurring start letter.
Solution:
Case 1: Load the data into a bag named "lines". Each entire line is stored as an element named line of type chararray.
Case 2: Tokenize the text in the bag lines; this produces one word per row.
Case 3: Retain the first letter of each word, using the substring method to take the first character.
Case 4: Create a group for each unique character, where the grouped bag contains the same character for each occurrence of that character.
Case 6: Arrange the output according to count in descending order.
Case 8: Store the result in HDFS. The result is saved in the output directory.
Pig UDF (User Defined Functions)
To specify custom processing, Pig provides support for user-defined functions (UDFs). Thus, Pig
allows us to create our own functions. Currently, Pig UDFs can be implemented using the
following programming languages: -
o Java
o Python
o Jython
o JavaScript
o Ruby
o Groovy
Among all the languages, Pig provides the most extensive support for Java functions. However,
limited support is provided to languages like Python, Jython, JavaScript, Ruby, and Groovy.
Let's see an example of a simple Pig EVAL function that converts the provided string to uppercase.
TestUpper.java
package com.hadoop;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class TestUpper extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}
Create the jar file and export it into a specific directory. For that, right click on the project - Export - Java - JAR file - Next. Now, provide a specific name for the jar file and save it in a local system directory.
Create a text file in your local machine and insert the list of tuples.
1. $ nano pscript.pig
When it comes to analyzing large sets of data and representing them as data flows, we use Apache Pig. It is nothing but an abstraction over MapReduce. So, in this Hadoop Pig tutorial, we will discuss the whole concept of Hadoop Pig. Apart from its introduction, it also includes its history, need, architecture and features. Moreover, we will see some comparisons, like Pig vs Hive, Apache Pig vs SQL and Hadoop Pig vs MapReduce.
What is Hadoop Pig?
Hadoop Pig is nothing but an abstraction over MapReduce. When it comes to analyzing large sets of data and representing them as data flows, we use Apache Pig. Generally, we use it with Hadoop. By using Pig, we can perform all the data manipulation operations in Hadoop.
In addition, Pig offers a high-level language to write data analysis programs, which we call Pig Latin. One of the major advantages of this language is that it offers several operators. Through them, programmers can develop their own functions for reading, writing, and processing data.
It has the following key properties:
Ease of programming
Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, which makes them easy to write, understand, and maintain.
Optimization opportunities
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing users to focus on semantics rather than efficiency.
Extensibility
In order to do special-purpose processing, users can create their own functions.
Hence, programmers need to write scripts using the Pig Latin language to analyze data using Apache Pig. All these scripts are internally converted to Map and Reduce tasks. This is done by a component we call the Pig Engine.
Hadoop Pig Tutorial – Why Pig?
An operation that requires about 200 lines of code (LoC) in Java can be easily done by typing as few as just 10 LoC in Apache Pig. Hence, Pig reduces the development time by almost 16 times.
If you are familiar with SQL, it is easy to learn Pig, because Pig Latin is an SQL-like language.
It offers many built-in operators to support data operations such as joins, filters, ordering, and many more. Also, it offers nested data types that are missing from MapReduce, such as tuples, bags, and maps.
Hadoop Pig Tutorial – Using Pig
There are several scenarios, where we can use Pig. Such as:
While data loads are time sensitive.
Also, while processing various data sources.
While we require analytical insights through sampling.
Where Not to Use Pig?
Also, there are some Scenarios, where we cannot use. Such as:
While the data is completely unstructured. Such as video, audio, and readable text.
Where time constraints exist. Since Pig is slower than MapReduce jobs.
Also, when more power is required to optimize the codes, we cannot use Pig.
Now, you can see several components in the Hadoop Pig framework. The major components are:
i. Parser
At first, all the Pig Scripts are handled by the Parser. Basically, Parser checks the syntax of the
script, does type checking, and other miscellaneous checks. Afterward, Parser’s output will be a
DAG (directed acyclic graph). That represents the Pig Latin statements as well as logical
operators. Basically, the logical operators of the script are represented as the nodes and the data
flows are represented as edges, in the DAG (the logical plan).
ii. Optimizer
Further, the DAG is passed to the logical optimizer, which carries out the logical optimizations, such as projection and pushdown.
iii. Compiler
It compiles the optimized logical plan into a series of MapReduce jobs.
iv. Execution Engine
At last, MapReduce jobs are submitted to Hadoop in a sorted order. Hence, these MapReduce
jobs are executed finally on Hadoop that produces the desired results.
Pig Features
Now in the Hadoop Pig Tutorial is the time to learn the Features of Pig which makes it what it is.
There are several features of Pig. Such as:
i. Rich set of operators
In order to perform several operations, Pig offers many operators, such as join, sort, filter, and many more.
ii. Ease of programming
If you are good at SQL, it is easy to write a Pig script, because Pig Latin is similar to SQL.
iii. Optimization opportunities
In Apache Pig, all the tasks optimize their execution automatically. As a result, the programmers
need to focus only on the semantics of the language.
iv. Extensibility
Through Pig, it is easy to read, process, and write data. It is possible by using the existing
operators. Also, users can develop their own functions.
v. UDFs
By using Pig, we can create User-defined Functions in other programming languages like Java.
Also, can invoke or embed them in Pig Scripts.
vi. Handles all kinds of data
Pig generally analyzes all kinds of data. Even both structured and unstructured. Moreover, it
stores the results in HDFS.
Prerequisites for learning Pig include:
Basic knowledge of Linux Operating System
Fundamental programming skills
Pig Vs MapReduce
Some major differences between Hadoop Pig and MapReduce are:
Apache Pig
It is a data flow language.
MapReduce
However, it is a data processing paradigm.
Hadoop Pig
Pig is a high-level language.
MapReduce
It is low-level and rigid.
Pig
In Apache Pig, performing a Join operation is pretty simple.
MapReduce
But, in MapReduce, it is quite difficult to perform a Join operation between datasets.
Pig
With a basic knowledge of SQL, any novice programmer can work conveniently with Pig.
MapReduce
But, to work with MapReduce, exposure to Java is essential.
Hadoop Pig
Generally, it uses multi-query approach, thereby reducing the length of the codes to a great
extent.
MapReduce
However, to perform the same task, it needs almost 20 times more lines of code.
Apache Pig
Here, we do not require any compilation. Every Pig operator is converted internally into a
MapReduce job, at the time of execution.
MapReduce
It has a long compilation process.
Hadoop Pig Vs SQL
Here, are the major differences between Apache Pig and SQL.
Pig
It is a procedural language.
SQL
While it is a declarative language.
Pig
Here, the schema is optional; we can store data without designing a schema. In that case, values are referenced positionally as $0, $1, etc.
SQL
In SQL, Schema is mandatory.
Pig
In Pig, data model is nested relational.
SQL
In SQL, data model used is flat relational.
Pig
Here, we have limited opportunity for Query Optimization.
SQL
While here we have more opportunity for query optimization.
Also, Apache Pig Latin:
Offers splits in the pipeline.
Allows developers to store data at any point in the pipeline.
Declares execution plans.
Offers operators to perform ETL (Extract, Transform, and Load) functions.
Apache Pig Vs Hive
Basically, to create MapReduce jobs, we use both Pig and Hive. Also, we can say that, at times, Hive operates on HDFS just as Pig does. So, here we list a few significant points that set Apache Pig apart from Hive.
Hadoop Pig
Pig Latin is a language, Apache Pig uses. Originally, it was created at Yahoo.
Hive
HiveQL is a language, Hive uses. It was originally created at Facebook.
Pig
It is a data flow language.
Hive
Whereas, it is a query processing language.
Pig
Moreover, it is a procedural language which fits in pipeline paradigm.
Hive
It is a declarative language.
Apache Pig
Also, can handle structured, unstructured, and semi-structured data.
Hive
Whereas, it is mostly for structured data.
Apache Pig
This Apache Pig tutorial provides a basic introduction to Apache Pig, a high-level tool over MapReduce. It helps professionals who are working on Hadoop and would like to perform MapReduce operations using a high-level scripting language instead of developing complex code in Java.
Apache Pig was developed as a research project at Yahoo in 2006 in order to create and execute MapReduce jobs on large datasets. In 2007 Apache Pig was open sourced, and later, in 2008, Apache Pig's first release came out.
Pig was created to simplify the burden of writing complex Java code to perform MapReduce jobs. Earlier, Hadoop developers had to write complex Java code in order to perform data analysis. Apache Pig provides a high-level language known as Pig Latin which helps Hadoop developers write data analysis programs. By using the various operators provided by the Pig Latin language, programmers can develop their own functions for reading, writing, and processing data.
In order to perform analysis using Apache Pig, programmers have to write scripts using the Pig Latin language to process data stored in the Hadoop Distributed File System. Internally, all these scripts are converted to Map and Reduce tasks. A component known as Pig Engine is present inside Apache Pig, in which Pig Latin scripts are taken as input and converted into MapReduce jobs.
Programmers who are not so good at Java normally have to struggle a lot while working with Hadoop, especially when they need to perform any MapReduce tasks. Apache Pig comes up as a helpful tool for all such programmers: there is no need to develop complex Java code to perform MapReduce tasks. By simply writing Pig Latin scripts, programmers can now easily perform MapReduce tasks without needing to write complex code in Java.
Apache Pig reduces the length of code by using a multi-query approach. For example, an operation that needs 200 lines of code in Java can be performed just by typing less than 10 lines of code in Apache Pig. Hence, development time is reduced by almost 16 times using Apache Pig. If developers have knowledge of SQL, then it is very easy to learn the Pig Latin language, as it is similar to SQL. Many built-in operators are provided by Apache Pig to support data operations like filters, joins, ordering, etc.
In addition, nested data types like tuples, bags, and maps, which are not present in MapReduce, are also provided by Pig.
Features of Pig
Rich Set of Operators: Pig consists of a rich collection of operators for performing operations such as join, filter, sort, and many more.
Ease of Programming: Pig Latin is similar to SQL and hence it becomes very easy for developers to write a Pig script. If you have knowledge of SQL, then it is very easy to learn the Pig Latin language.
Optimization opportunities: The execution of tasks in Apache Pig gets automatically optimized, so programmers need to focus only on the semantics of the language.
Extensibility: By using the existing operators, users can easily develop their own functions to read, process, and write data.
User Defined Functions (UDFs): With the UDF facility provided by Pig, we can easily create User Defined Functions in a number of programming languages such as Java and invoke or embed them in Pig scripts.
Handling all types of data: Apache Pig can analyze all types of data (both structured and unstructured), and the results are stored in HDFS.
Apache Pig Installation on Ubuntu – A Pig Tutorial
Install Pig
This Pig tutorial briefly explains how to install and configure Apache Pig. Apache Pig is an abstraction over MapReduce; it is basically a tool for easily analyzing large data sets by representing them as data flows.
Apache Pig Installation on Ubuntu
You must have Hadoop and Java JDK installed on your system. Hence, before installing Pig you
should install Hadoop and Java by following the steps given in this installation guide.
ii. Downloading Pig
Apache Pig Installation_Bashrc File
After adding the above parameters, save the file by pressing CTRL+X and then Y on your keyboard.
Step 4:
Update the .bashrc file by executing the below command:
dataflair@ubuntu:~$ source .bashrc
After refreshing the .bashrc file, Pig is successfully installed. In order to check the version of Pig, execute the below command:
dataflair@ubuntu:~$ pig -version
If the below output appears, it means you have successfully configured Pig.
Apache Pig Version
iv. Starting Pig
We can start Pig in one of the two modes mentioned below:
1. Local Mode
2. Cluster Mode
To start Pig in local mode, the '-x local' option is used, whereas executing only the "pig" command without any options starts Pig in cluster mode. When running in local mode, Pig can only access files present on the local file system, whereas when started in cluster mode, Pig can access files present in HDFS.
To start Pig in Local Mode, execute the below command:
dataflair@ubuntu:~$ pig -x local
If you get the below output, it means Pig started successfully in Local mode.
Running Pig in Local Mode
To start Pig in Cluster Mode, execute the below command:
dataflair@ubuntu:~$ pig
If you get the below output, it means Pig started successfully in Cluster mode.
1. Apache Pig Architecture
In order to write a Pig script, we require the Pig Latin language, and we need an execution environment to execute it. So, in this article, "Introduction to Apache Pig Architecture", we will study the complete architecture of Apache Pig, including its components, the Pig Latin data model, and the Pig job execution flow in depth.
What is Apache Pig Architecture?
In Pig, the language we use to analyze data in Hadoop is called Pig Latin. It is a high-level data processing language that offers a rich set of data types and operators to perform several operations on the data.
To perform a particular task, programmers need to write a Pig script using the Pig Latin language and execute it through any of the execution mechanisms (Grunt shell, UDFs, Embedded). After submission, these scripts go through a series of transformations applied by the Pig framework to produce the desired output. Internally, Pig converts these scripts into a series of MapReduce jobs, and therefore it makes the programmer's job easy. Here is the architecture of Apache Pig.
i. Atom
An atom is any single value in Pig Latin, irrespective of its data type; it is stored as a string and can be used as a string or a number. The atomic values of Pig are int, long, float, double, chararray, and bytearray. A field is a piece of data or a simple atomic value in Pig.
For Example − ‘Shoba’ or ‘25’
ii. Tuples
A tuple is a record formed by an ordered set of fields; the fields can be of any type. A tuple is similar to a row in a table of an RDBMS.
For Example − (Shubham, 25)
iii. Bag
A bag is an unordered set of tuples; to be more specific, it is a collection of (possibly non-unique) tuples. Each tuple can have any number of fields (flexible schema). We generally represent a bag by '{}'.
For Example − {(Shubham, 25), (Pulkit, 35)}
When a bag is a field within a relation, it is known as an inner bag.
Example − {Shubham, 25, {9826022258, [email protected],}}
iv. Map
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique, while the value may be of any type. We represent a map by '[]'.
For Example − [name#Shubham, age#25]
v. Relation
A relation is a bag of tuples. In Pig Latin, relations are unordered; there is no guarantee that tuples are processed in any particular order.
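As a small hedged sketch of how these types can appear together in one schema (the file name, delimiter, and field names are assumptions, and the input file is assumed to contain tuple, bag, and map literals in the (), {}, [] notation shown above):
students = LOAD '/pig_data/students.txt' USING PigStorage('|')
           AS (name:chararray,
               address:tuple(city:chararray, pin:int),
               phones:bag{t:(phone:chararray)},
               extra:map[chararray]);
DESCRIBE students;    -- prints the declared schema, including the nested types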
So, this was all in Apache Pig Architecture. Hope you like our explanation.
1. Pig Latin Tutorial
Apache Pig offers a high-level language, Pig Latin, for writing data analysis programs. So, in this Pig Latin tutorial, we will discuss the basics of Pig Latin, such as Pig Latin statements, data types, general operators, and Pig Latin UDFs, in detail. We will also look at examples to understand them well.
Statements in Pig Latin
Statements are the basic constructs while processing data using Pig Latin. Statements work with relations and also include expressions and schemas. Every statement ends with a semicolon (;). Through statements, we perform several operations using the operators offered by Pig Latin.
Except for LOAD and STORE, Pig Latin statements take a relation as input and produce another relation as output while performing all other operations.
Semantic checking is carried out as soon as we enter a LOAD statement in the Grunt shell. However, to see the contents of the relation we need to use the Dump operator, because the MapReduce job that loads the data is carried out only after the Dump operation is performed.
Pig Latin Example –
Here is a Pig Latin statement that loads data into Apache Pig:
grunt> Employee_data = LOAD 'Employee_data.txt' USING PigStorage(',') AS (id:int, name:chararray, contact:chararray, city:chararray);
Chararray
A chararray represents a character array (string) in Unicode UTF-8 format.
For Example: 'Data Flair'
Bytearray
This data type represents a Byte array (blob).
Boolean
“Boolean” represents a Boolean value.
For Example: true/ false.
Note: It is case insensitive.
Datetime
It represents a date-time.
For Example: 1970-01-01T00:00:00.000+00:00
Biginteger
This data type represents a Java BigInteger.
For Example: 60708090709
Bigdecimal
“Bigdecimal” represents a Java BigDecimal
For Example: 185.98376256272893883
i. Complex Types
Tuple
An ordered set of fields is what we call a tuple.
For Example: (Ankit, 32)
Bag
A collection of tuples is what we call a bag.
For Example: {(Ankit, 32), (Neha, 30)}
Map
A set of key-value pairs is what we call a Map.
Example: [ ‘name’#’Ankit’, ‘age’#32]
ii. Null Values
Values for all of the above data types can be NULL. Pig treats null values in the same way as SQL does.
A null can be an unknown value or a non-existent value, and we use it as a placeholder for optional values. Nulls can either be the result of an operation or occur naturally in the data.
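As a brief hedged sketch of how nulls arise in practice (the file name and its contents are assumptions): when a field cannot be cast to its declared type, Pig substitutes a null rather than failing, and the null can then be filtered out.
-- Assume marks.txt contains lines such as "asha,85" and "ravi,NA"; the string 'NA'
-- cannot be cast to int, so the score field of that record becomes null.
marks = LOAD 'marks.txt' USING PigStorage(',') AS (name:chararray, score:int);
valid = FILTER marks BY score IS NOT NULL;   -- keep only the records with a usable score
DUMP valid;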
Pig Latin Arithmetic Operators
Here is the list of arithmetic operators of Pig Latin. Let us assume A = 20 and B = 40.
+
Addition − Adds the values on either side of the operator.
For Example: A + B gives 60.
−
Subtraction − Subtracts the right-hand operand from the left-hand operand.
For Example: A − B gives −20.
*
Multiplication − Multiplies the values on either side of the operator.
For Example: A * B gives 800.
/
Division − Divides the left-hand operand by the right-hand operand.
For Example: B / A gives 2.
%
Modulus − Divides the left-hand operand by the right-hand operand and returns the remainder.
For Example: B % A gives 0.
?:
Bincond − Evaluates a Boolean expression; it has three operands:
x = (expression) ? value1_if_true : value2_if_false.
For Example: b = (a == 1) ? 20 : 40;
CASE WHEN THEN ELSE END
Case − Equivalent to the nested bincond operator.
For Example:
CASE f2 % 2
WHEN 0 THEN 'even'
WHEN 1 THEN 'odd'
END
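As a hedged sketch of these operators in a script (the relation, file, and field names are illustrative), the arithmetic operators and the bincond/CASE expressions are typically used inside FOREACH ... GENERATE:
nums   = LOAD 'numbers.txt' USING PigStorage(',') AS (a:int, b:int);
calc   = FOREACH nums GENERATE a + b AS total, a - b AS difference,
                               b / a AS quotient, b % a AS remainder,
                               ((a == 1) ? 20 : 40) AS bincond_result;
labels = FOREACH nums GENERATE (CASE b % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END) AS parity;
DUMP calc;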
Comparison Operators in Pig Latin
Here is the list of the comparison operators of Pig Latin. Let us assume A = 20 and B = 40.
==
Equal − Checks whether the values of two operands are equal or not; if yes, the condition becomes true.
For Example: (A == B) is not true.
!=
Not Equal − Checks whether the values of two operands are equal or not; if the values are not equal, the condition becomes true.
For Example: (A != B) is true.
>
Greater than − Checks whether the value of the left operand is greater than the value of the right operand; if yes, the condition becomes true.
For Example: (A > B) is not true.
<
Less than − Checks whether the value of the left operand is less than the value of the right operand; if yes, the condition becomes true.
For Example: (A < B) is true.
>=
Greater than or equal to − Checks whether the value of the left operand is greater than or equal to the value of the right operand; if yes, the condition becomes true.
For Example: (A >= B) is not true.
<=
Less than or equal to − Checks whether the value of the left operand is less than or equal to the value of the right operand; if yes, the condition becomes true.
For Example: (A <= B) is true.
matches
Pattern matching − Checks whether the string on the left-hand side matches the constant (regular expression) on the right-hand side.
For Example: f1 matches '.*data flair.*'
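As a short hedged sketch (the relation, file, and field names are assumptions), these comparison operators are most often used inside FILTER conditions:
emp       = LOAD 'Employee.txt' USING PigStorage(',') AS (id:int, name:chararray, salary:int);
well_paid = FILTER emp BY salary >= 30000 AND name matches '.*Sharma.*';   -- combines a comparison with pattern matching
DUMP well_paid;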
Type Construction Operators
We have already seen the Apache Pig introduction and the Pig architecture in detail. This article covers the basics of the Pig Latin operators, such as the comparison, general, and relational operators, as well as the type construction operators, and discusses Pig Latin statements with an example.
Pig Latin Operators and Statements – A Complete Guide
What is Pig Latin?
Pig Latin is the language that analyzes the data in Hadoop using Apache Pig. An interpreter layer transforms Pig Latin statements into MapReduce jobs, and Hadoop then processes these jobs further. Pig Latin is a simple language with SQL-like semantics, and anyone can use it productively. Pig Latin has a rich set of functions for data manipulation, and they are extensible by writing user-defined functions (UDFs) in Java.
Pig Latin Operators
a. Arithmetic Operators
+ Addition − Adds the values on either side of the operator. If a = 10, b = 30, a + b gives 40.
− Subtraction − Subtracts the right-hand operand from the left-hand operand. If a = 40, b = 30, a − b gives 10.
/ Division − Divides the left-hand operand by the right-hand operand. If a = 40, b = 20, a / b results in 2.
% Modulus − Divides the left-hand operand by the right-hand operand and returns the remainder. If a = 40, b = 30, a % b results in 10.
?: Bincond − Evaluates a Boolean expression; it has three operands: x = (expression) ? value1 if true : value2 if false. For example, b = (a == 1) ? 40 : 20; if a = 1 the value is 40, and if a != 1 the value is 20.
CASE WHEN THEN ELSE END
Case − This operator is equivalent to the nested bincond. For example:
CASE f2 % 4
WHEN 0 THEN 'even'
WHEN 1 THEN 'odd'
END
Comparison Operators
== Equal − Checks whether the values of two operands are equal or not; if yes, the condition becomes true. If a = 10, b = 20, then (a == b) is not true.
!= Not Equal − Checks whether the values of two operands are equal or not; if the values are not equal, the condition becomes true. If a = 10, b = 20, then (a != b) is true.
< Less than − Checks whether the value of the left operand is less than that of the right operand; if so, it returns true. If a = 10, b = 20, (a < b) is true.
>= Greater than or equal to − Checks whether the value of the left operand is greater than or equal to that of the right operand; if yes, it returns true. If a = 20, b = 50, (a >= b) is not true.
<= Less than or equal to − Checks whether the value of the left operand is less than or equal to that of the right operand; if yes, the condition becomes true. If a = 20, b = 20, (a <= b) is true.
Pig Latin also provides type construction operators: () to construct a tuple, {} to construct a bag, and [] to construct a map; these correspond to the complex data types described earlier.
Relational Operations
COGROUP − Groups the data into two or more relations.
GROUP − Groups the data in a single relation.
Diagnostic Operators
EXPLAIN − Displays the logical, physical, and MapReduce execution plans used to evaluate a relation.
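As a hedged sketch of these grouping and diagnostic operators (the files and field names are assumptions):
students = LOAD 'students.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
orders   = LOAD 'orders.txt'   USING PigStorage(',') AS (order_id:int, id:int, amount:float);
by_city  = GROUP students BY city;                 -- group within a single relation
combined = COGROUP students BY id, orders BY id;   -- group across two relations
EXPLAIN by_city;                                   -- show the logical, physical, and MapReduce plans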
Statements are the basic constructs for processing data using Pig Latin. Statements work with relations, including expressions and schemas, and every statement terminates with a semicolon (;). We perform different operations using the Pig Latin operators. Pig Latin statements take a relation as input and produce some other relation as output.
Semantic checking starts as soon as we enter a LOAD statement in the Grunt shell. We use the Dump operator to view the contents of a relation; the MapReduce job that loads the data runs only after the Dump operation is performed.
For Example
The following is a Pig Latin statement; it loads the data into Apache Pig.
grunt> sample_data = LOAD 'sample_data.txt' USING PigStorage(',') AS (id:int, name:chararray, contact:chararray, city:chararray);
So, this was all in Pig Latin Tutorial. Hope you like our explanation.
In this article, we cover the Apache Pig architecture, which is built on top of Hadoop. We will see the various components of Apache Pig and the Pig Latin data model. Apache Pig provides a high-level language, and we will also see the two modes in which this component runs.
The language that analyzes data in Hadoop using Pig is called Pig Latin. It is a high-level data processing language that provides a wide range of data types and operators to perform data operations. To perform a task using Pig, programmers write a Pig script in the Pig Latin language and execute it with any of the execution mechanisms (Grunt shell, UDFs, Embedded). These scripts go through a series of transformations, and the Pig framework produces the desired output. Apache Pig converts these scripts into a series of MapReduce jobs, which makes the job easy for developers.
Components of Apache Pig
There are various components in the Apache Pig architecture which make its execution faster, as discussed below:
a. Parser
The Parser handles the Pig scripts and checks the syntax of the script; it also performs type checking and other checks. The output of the parser is a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators.
In the DAG, the operators of the script are represented as nodes and the data flows are represented as edges.
b. Optimizer
The logical optimizer then receives the logical plan (DAG) and carries out logical optimizations such as projection and pushdown.
c. Compiler
The compiler converts the optimized logical plan into a series of MapReduce jobs.
d. Execution Engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order, executed on the Hadoop cluster, and produce the desired results.
Pig Latin has a fully nested data model, which allows complex non-atomic data types such as map and tuple.
a. Field and Atom
An atom is a single value in Pig Latin, of any data type. It is stored as a string and can be used as a string or a number. The atomic values of Pig are int, long, float, double, chararray, and bytearray. Any simple atomic value or piece of data is considered a field.
For Example − 'data flair' or '12'
b. Tuples
A record that contains an ordered set of fields is a tuple; the fields can be of any type. A tuple is the same as a row in a table of an RDBMS.
For Example − (Data flair, 12)
c. Bag
A bag contains an unordered set of tuples; that is, a collection of (possibly non-unique) tuples is a bag. Each tuple may have any number of fields. We represent a bag as '{}'. It is similar to a table in an RDBMS; however, it is not necessary that every tuple contains the same fields, and the fields in the same position (column) may not have the same type.
Example − {(Data flair, 12), (Training, 11)}
A bag that is a field within a relation is known as an inner bag.
Example − {Data flair, 12, {1212121212, [email protected],}}
d. Map
A map (or data map) contains a set of key-value pairs. The key has to be of type chararray and must be unique, while the value can be of any type. We represent it by '[]'.
Example − [name#Dataflair, age#11]
e. Relation
A relation is a bag of tuples. Relations are unordered; there is no guarantee that the tuples are processed in any particular order.
Job Execution Flow
The developer first creates the scripts, which are stored on the local file system. When the developer submits a Pig script, it is handled by the Pig Latin compiler, which splits the task into a series of MR jobs. The Pig compiler retrieves data from HDFS, and after the MR jobs run, the output file is written back to HDFS.
a. Pig Execution Modes
We can run Pig in two execution modes, depending on where the Pig script is going to run and where the data resides: the data can be stored on a single machine or in a distributed environment like a cluster. There are also three different ways to run Pig programs: the non-interactive shell or script mode, in which the user creates a file, loads the code, and executes the script; the Grunt shell or interactive shell, for running Apache Pig commands; and the embedded mode, in which Pig programs are run from Java code (much as JDBC is used to run SQL programs from Java).
b. Pig Local Mode
In this mode, Pig runs in a single JVM and accesses the local file system. This mode is better for dealing with small data sets; parallel mapper execution is not possible, and older versions of Hadoop are not thread-safe.
The user can pass '-x local' to get into Pig's local mode of execution. In this mode, Pig always looks for the local file system path while loading data.
c. Pig Map Reduce Mode
In this mode, the user has a proper Hadoop cluster setup and installation. By default, Apache Pig installs in MR mode. Pig translates the queries into MapReduce jobs and runs them on top of the Hadoop cluster; hence, in this mode MapReduce runs on a distributed cluster.
Statements like LOAD and STORE read data from the HDFS file system and write the output back; these statements are also used to process data.
d. Storing Results
Intermediate data is generated during the processing of MR jobs. Pig stores this data in a non-permanent location on HDFS; a temporary location is created inside HDFS for storing this intermediate data.
We can use DUMP to display the final results on the output screen, and the output results can be persisted using the STORE operator, as in the sketch below.
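A hedged sketch of the two ways of getting results out (the paths and field names are illustrative):
logs    = LOAD '/pig_data/access_log.txt' USING PigStorage(' ') AS (ip:chararray, url:chararray);
grouped = GROUP logs BY ip;
hits    = FOREACH grouped GENERATE group AS ip, COUNT(logs) AS total_hits;
DUMP hits;                                  -- show the final results on the screen
STORE hits INTO '/pig_output/hits_by_ip';   -- persist the final results in HDFS (PigStorage is the default)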
Apache Pig offers features such as a rich set of operators, ease of programming, handling of all kinds of data, extensibility, and many more. So, in this article, "Apache Pig Features", we will discuss these features in detail and try to understand why Pig should be chosen.
iv. Extensibility
Extensibility is one of the most interesting features it has. It means users can develop their own
functions to read, process, and write data, using the existing operators.
v. UDFs
Another notable feature is that Pig offers the facility to create User Defined Functions in other programming languages such as Java, and to invoke or embed them in Pig scripts.
vi. Handles all kinds of data
Handling all kinds of data is one of the reasons Pig is easy to program with: it analyzes all kinds of data, whether structured or unstructured, and stores the results in HDFS.
vii. Join operation
In Apache Pig, performing a Join operation is pretty simple.
x. Optional Schema
The schema is optional in Apache Pig, so we can store data without designing a schema; fields can then be referred to positionally as $0, $1, and so on.
xi. Pipeline
Apache Pig Latin allows splits in the pipeline.
As we all know, we use Apache Pig to analyze large sets of data and to represent them as data flows. Pig has many advantages, but it also has some disadvantages. So, in this article, "Pig Advantages and Disadvantages", we will discuss all the advantages as well as the disadvantages of Apache Pig in detail.
Pig Advantages and Disadvantages | Apache Pig Pros and Cons
a. Advantages of Apache Pig
First, let’s check the benefits of Apache Pig –
i. Less development time
It consumes less development time, which is one of its major advantages, especially considering the complexity, time spent, and maintenance of vanilla MapReduce programs.
ii. Easy to learn
The learning curve of Apache Pig is not steep, which means that anyone who does not know how to write vanilla MapReduce, or SQL for that matter, can pick it up and write MapReduce jobs.
iii. Procedural language
Apache Pig is a procedural language, not declarative, unlike SQL, so we can easily follow the commands. It also offers better expressiveness in the transformation of data at every step.
Moreover, compared to vanilla MapReduce, it reads much more like the English language; it is very concise, and unlike Java it feels more like Python.
iv. Dataflow
Pig Latin is a data flow language: everything revolves around data, even though we sacrifice control structures like for-loops or if-structures. Data transformation is a first-class citizen; we cannot create loops without data, and we always transform and manipulate data.
v. Easy to control execution
Because it is procedural in nature, we can control the execution of every step. It is also straightforward to write our own UDF (User Defined Function) and inject it at one specific point in the pipeline.
vi. UDFs
Pig lets us write our own UDFs and plug them into the pipeline at exactly the step where we need them, as noted above.
vii. Lazy evaluation
As the name suggests, a Pig script is not evaluated until you produce an output file or output a message. This is a benefit of the logical plan: Pig can optimize the program from beginning to end, and the optimizer can produce an efficient plan to execute.
viii. Usage of Hadoop features
Through Pig, we can enjoy everything Hadoop offers, such as parallelization and fault tolerance, along with many relational-database-like features.
ix. Effective for unstructured data
Pig is quite effective for large unstructured and messy datasets; it is one of the best tools for giving structure to large unstructured data.
x. Base pipeline
When we have UDFs that we want to parallelize and use on large amounts of data, we can use Pig as the base pipeline that does the hard work; we just apply our UDF at the step we want.
b. Limitations of Apache Pig
Now, have a look at Apache Pig disadvantages –
1. Errors of Pig
2. Not mature
3. Support
4. Minor one
5. Implicit data schema
6. Delay in execution
i. Errors of Pig
The errors that Pig produces due to UDFs (Python) are not helpful at all. At times, when something goes wrong, it just gives an error such as 'exec error in UDF', even if the problem is only a syntax or type error, let alone a logical one.
ii. Not mature
Pig is still under development, even though it has been around for quite some time.
iii. Support
Generally, Google and Stack Overflow searches do not lead to good solutions for the problems.
iv. Implicit data schema
In Apache Pig, the data schema is not enforced explicitly but implicitly, which is a big disadvantage. Because it does not enforce an explicit schema, a data structure sometimes ends up as bytearray, which is a "raw" data type. Until we explicitly coerce the fields, even strings turn into bytearrays without notice, and this propagates to the following steps of the data processing.
v. Minor one
There is no good IDE or Vim plug-in that offers more functionality than syntax completion for writing Pig scripts.
vi. Delay in execution
The commands are not executed until we either dump or store an intermediate or final result, which lengthens the iteration between debugging and resolving an issue.
So, this was all on Pig Advantages and Disadvantages.
Apache Pig is gaining popularity, and this is reflected in the latest salary trends. Since we use Pig to analyze large sets of data, representing them as data flows, and to perform manipulation operations in Hadoop, many companies are adopting Apache Pig very rapidly. That means careers in Pig and Pig jobs are increasing day by day. We will also see why we should learn Apache Pig.
Apache Pig Grunt Shell Commands
1. Apache Pig Grunt Shell
There are so many shell and utility commands offered by the Apache Pig Grunt Shell. So, in this
article “Introduction to Apache Pig Grunt Shell”, we will discuss all shell and utility commands
in detail.
Example
By using the sh option, we can invoke the ls command of the Linux shell from the Grunt shell. Here, it lists out the files in the /pig/bin/ directory.
grunt> sh ls
pig
pig_1444799121955.log
pig.cmd
pig.py
ii. fs Command
We can invoke any fs shell command from the Grunt shell by using the fs command.
Syntax
The syntax of the fs command is:
grunt> fs <File System command> <parameters>
Example
By using the fs command, we can invoke the ls command of HDFS from the Grunt shell. Here, it lists the files in the HDFS root directory.
grunt> fs -ls
Found 3 items
Similarly, using the fs command we can invoke all the other file system shell commands from the Grunt shell.
4. Utility Commands
The Grunt shell offers a set of utility commands, including clear, help, history, quit, and set. There are also commands to control Pig from the Grunt shell, such as exec, kill, and run. The utility commands offered by the Grunt shell are described below.
i. clear Command
We use the clear command to clear the screen of the Grunt shell.
Syntax
The syntax of the clear command is:
grunt> clear
ii. help Command
The help command gives a list of Pig commands and their properties. A portion of its output looks like this:
Commands:
<pig latin statement>;
fs <fs arguments> – invoke a file system shell command.
exec/run [-param <param_name>=<param_value>] [-param_file <file_name>] <script> – execute a Pig script.
explain [alias] – show the execution plan to compute the alias or for the entire script.
set <key> <value> – provide execution parameters to Pig. Supported keys include:
default_parallel – script-level reduce parallelism; basic input size heuristics are used by default.
debug – set debug on or off; the default is off.
job.name – a single-quoted name for jobs; the default is PigLatin:<script name>.
job.priority – priority for jobs; values: very low, low, normal, high, very high; the default is normal.
stream.skippath – a string that contains the path; this is used by streaming.
any Hadoop property.
help – display this message.
history [-n] – display the list of statements in the cache; -n hides line numbers.
quit – quit the Grunt shell.
iii. history Command
This is a very useful command; it displays the list of statements executed since the Grunt shell was invoked.
Syntax
Suppose we have executed some statements since opening the Grunt shell, among them:
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');
Then, using the history command will list them, for example:
grunt> history
orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');
iv. set Command
We use the set command to assign values to keys that control Pig, as described below.
default_parallel
By passing any whole number as a value to this key, we can set the number of reducers for a map-reduce job.
debug
Also, by passing on/off to this key, we can turn off or turn on the debugging feature in Pig.
job.name
Moreover, by passing a string value to this key we can set the Job name to the required job.
job.priority
By passing one of the following values to this key, we can set the job priority to a job −
1. very low
2. low
3. normal
4. high
5. very high
stream.skippath
By passing the desired path in the form of a string to this key, we can set the path from where the
data is not to be transferred, for streaming.
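As a short hedged sketch of using these keys from the Grunt shell (the values shown are illustrative):
grunt> set default_parallel 10
grunt> set debug on
grunt> set job.name 'my-pig-job'
grunt> set job.priority high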
v. quit Command
We can quit the Grunt shell using this command.
Syntax
To quit the Grunt shell:
grunt> quit
The following commands can be used to control Apache Pig from the Grunt shell.
vi. Exec Command
Using the exec command, we can execute Pig scripts from the Grunt shell.
Syntax
The syntax of the utility command exec is:
grunt> exec [-param param_name = param_value] [-param_file filename] [script]
Example
Let us suppose there is a file named Employee.txt in the /pig_data/ directory of HDFS. Its content is:
Employee.txt
Now, suppose we have a script file named sample_script.pig in the /pig_data/ directory of HDFS. Its content is:
Sample_script.pig
Dump Employee;
Now, let us execute the above script from the Grunt shell using the exec command as shown
below.
grunt> exec /sample_script.pig
vii. kill Command
Using the kill command, we can kill a Pig job from the Grunt shell by passing its job id.
Example
Assume there is a running Pig job having the id Id_0055. We can kill it from the Grunt shell as shown below.
grunt> kill Id_0055
viii. run Command
Using the run command, we can run a Pig script from the Grunt shell.
Example
Let us suppose there is a file named Employee.txt in the /pig_data/ directory of HDFS. Its content is:
Employee.txt
Also, suppose we have a script file named sample_script.pig in the local filesystem. Its content is:
Sample_script.pig
Further, using the run command, let us run the above script from the Grunt shell.
grunt> run /sample_script.pig
Then, using the Dump operator, we can see the output of the script.
grunt> Dump;
It is important to note that there is one difference between exec and run: if we use run, the statements from the script are available in the command history, whereas with exec they are not.
Apache Pig Built in Functions
In this article “Apache Pig Built in Functions”, we will discuss all the Apache Pig Built-in
Functions in detail. It includes eval, load/store, math, bag and tuples functions and many more.
Also, we will see their syntax along with their functions and descriptions to understand them
well.
So, let’s start Pig Built in Functions.
2. Pig Functions
There is a huge set of Apache Pig built-in functions available, such as the eval, load/store, math, string, date and time, and bag and tuple functions. Two main properties differentiate built-in functions from user-defined functions (UDFs):
We do not need to register built-in functions, because Pig knows where they are.
We also do not need to qualify built-in functions when using them, because, again, Pig knows where to find them.
I. AVG ()
AVG Syntax
AVG (expression)
We use AVG() to compute the average of the numerical values within a bag.
AVG Example
In this example, the average GPA for each Employee is computed.
A = LOAD 'Employee.txt' AS (name:chararray, term:chararray, gpa:float);
DUMP A;
(johny,fl,3.9F)
(johny,wt,3.7F)
(johny,sp,4.0F)
(johny,sm,3.8F)
(Mariya,fl,3.8F)
(Mariya,wt,3.9F)
(Mariya,sp,4.0F)
(Mariya,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(johny,{(johny,fl,3.9F),(johny,wt,3.7F),(johny,sp,4.0F),(johny,sm,3.8F)})
(Mariya,{(Mariya,fl,3.8F),(Mariya,wt,3.9F),(Mariya,sp,4.0F),(Mariya,sm,4.0F)})
C = FOREACH B GENERATE A.name, AVG(A.gpa);
DUMP C;
({(johny),(johny),(johny),(johny)},3.850000023841858)
({(Mariya),(Mariya),(Mariya),(Mariya)},3.925000011920929)
ii. BagToString ()
This function is used to concatenate the elements of a bag into a string. We can place a delimiter
between these values (optional) while concatenating.
iii. CONCAT ()
The syntax of CONCAT()
CONCAT (expression, expression)
We use this Pig Function to concatenate two or more expressions of the same type.
Example of CONCAT()
In this example, field f1, an underscore string literal, and fields f2 and f3 are concatenated.
X = LOAD 'data' AS (f1:chararray, f2:chararray, f3:chararray);
DUMP X;
Y = FOREACH X GENERATE CONCAT(f1, '_', f2, f3);
DUMP Y;
iv. COUNT ()
The syntax of COUNT()
COUNT (expression)
We use COUNT() to compute the number of elements (tuples) in a bag.
Example of COUNT()
In this example, we count the tuples in the bag:
X = LOAD 'data' AS (f1:int, f2:int, f3:int);
DUMP X;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
Y = GROUP X BY f1;
DUMP Y;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
A = FOREACH Y GENERATE COUNT(X);
DUMP A;
(1L)
(2L)
(1L)
(2L)
v. COUNT_STAR()
The syntax of COUNT_STAR()
COUNT_STAR (expression)
It is similar to the COUNT() function, and we use it to get the number of elements in a bag; unlike COUNT(), however, COUNT_STAR() includes null values in the count.
Example of COUNT_STAR()
To count all the tuples in a bag:
A = FOREACH Y GENERATE COUNT_STAR(X);
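A small hedged sketch of the difference when nulls are present (the data values are illustrative):
X = LOAD 'data' AS (f1:int, f2:int);
-- Suppose X contains the tuples (1,2), (4,) and (8,3); the second tuple has a null f2.
Y = GROUP X ALL;
C = FOREACH Y GENERATE COUNT(X.f2) AS non_null_count, COUNT_STAR(X.f2) AS all_count;
DUMP C;   -- COUNT ignores the null and returns 2, while COUNT_STAR counts it and returns 3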
ii. CurrentTime()
It returns the date-time object for the current time.
iii. GetDay(datetime)
We use it to get the day of a month from the date-time object.
iv. GetHour(datetime)
GetHour returns the hour of a day from the date-time object.
v. GetMilliSecond(datetime)
It returns the millisecond of a second from the date-time object.
There is an extensive support for User Defined Functions (UDF’s) in Apache Pig. In this article
“Apache Pig UDF”, we will learn the whole concept of Apache Pig UDFs. Moreover, we will
also learn its introduction. In addition, we will discuss types of Pig UDF, way to write as well as
the way to use these UDF’s in detail.
Using Java, we can write UDFs involving all parts of the processing, like data load/store, column transformation, and aggregation. Note that UDFs written in Java work more efficiently than those written in other languages, since Apache Pig itself is written in Java.
In Apache Pig, we also have a Java repository for UDFs named Piggybank. Using Piggybank, we can access Java UDFs written by other users and contribute our own UDFs.
Types of Pig UDF in Java
We can create and use several types of functions while writing a Pig UDF in Java. For instance, to perform full MapReduce operations on an inner bag, we use algebraic functions.
First, create a new project after opening Eclipse (say project1), then convert the newly created project into a Maven project. Further, copy the following content into the pom.xml; this file contains the Maven dependencies for the Apache Pig and Hadoop core jar files.
While writing UDFs, it is necessary to inherit the EvalFunc class and provide an implementation of the exec() function; the code required for the UDF is written within this function. In the above example, the code returns the contents of the given column converted to uppercase.
After compiling the class without errors, right-click on the Sample_Eval.java file; a menu is displayed, from which you select Export.
On clicking Export, you will get the following window; click on JAR file.
Proceed further by clicking the Next> button. You will get another window in which you need to enter the path in the local file system where the jar file should be stored.
Finally, click the Finish button. A jar file, sample_udf.jar, is created in the specified folder; this jar file contains the UDF written in Java.
Using Pig UDF
Now, follow these steps, after writing the UDF and generating the Jar file −
Step 1: Registering the Jar file
After writing a UDF (in Java), we have to register the jar file that contains the UDF using the Register operator. By registering the jar file, users inform Apache Pig of the location of the UDF.
Syntax
So, the syntax of the Register operator is-
REGISTER path;
Example
For Example, let’s register the sample_udf.jar created above.
Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.
$ cd $PIG_HOME/bin
$ ./pig -x local
REGISTER '/$PIG_HOME/sample_udf.jar';
Here we assume that the jar file is available in the path /$PIG_HOME/sample_udf.jar.
Step 2: Defining an Alias
After registering the UDF, we can define an alias for it using the Define operator.
Syntax
The syntax of the Define operator is:
DEFINE alias function;
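As a hedged sketch tying the two steps together (the package name sample_eval, the data file, and its fields are assumptions; the jar and class names follow the sample_udf.jar example above):
REGISTER '/$PIG_HOME/sample_udf.jar';
DEFINE my_upper sample_eval.Sample_Eval();              -- alias for the UDF class; the package name is assumed
emp         = LOAD 'Employee.txt' USING PigStorage(',') AS (id:int, name:chararray);
upper_names = FOREACH emp GENERATE id, my_upper(name);  -- apply the UDF to the name column
DUMP upper_names;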
Apache Pig Reading Data and Storing Data Operators
Objective
For the purpose of reading and storing data, there are two operators available in Apache Pig: the Load operator and the Store operator. So, in this article, "Apache Pig Reading Data and Storing Data Operators", we will cover the whole concept of reading and storing data in Pig with the Load and Store operators, along with their syntax and examples, to understand them well.
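a. Syntax
For reference, the Load operator whose terms are described below has the general form (as given in the Pig documentation):
LOAD 'data' [USING function] [AS schema];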
b. Terms
1. 'data'
It signifies the name of the file or directory, in single quotes. If we specify a directory name, all the files in the directory are loaded. In addition, we can use Hadoop-supported globbing to specify files at the file system or directory level.
2. USING
Keyword. If the USING clause is omitted, the default load function PigStorage is used.
3. Function
The load function. We can use a built-in function; PigStorage is the default load function and does not need to be specified (simply omit the USING clause). However, we can write our own load function if our data is in a format that cannot be processed by the built-in functions.
4. AS
Keyword, used to introduce the schema.
5. Schema
The schema of the data, enclosed in parentheses and introduced with the AS keyword. The schema specifies the type of the data that the loader produces. Depending on the loader, if the data does not conform to the schema, either a null value or an error is generated.
Note that, for performance reasons, the loader may not immediately convert the data to the specified format; however, we can still operate on the data assuming the specified type.
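A minimal hedged sketch pairing the Load operator with the Store operator covered in this section (the paths, delimiter, and fields are illustrative):
emp = LOAD '/pig_data/Employee.txt' USING PigStorage(',')
      AS (id:int, name:chararray, city:chararray);
STORE emp INTO '/pig_output/employee_copy' USING PigStorage('|');   -- write the relation back to HDFS with a '|' delimiter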
Learn Apache Pig Execution: Modes and Mechanism
Apache Pig Execution Modes
There are two modes in which we can run Apache Pig: Local Mode and HDFS (MapReduce) mode. Let us discuss both in detail:
a. Local Mode
In this mode, all the files are installed and run on your local host and local file system, which means we do not need Hadoop or HDFS at all. We generally use this mode for testing purposes. In other words, in this mode Pig runs in a single JVM and accesses the local file system. Local mode is better for dealing with small data sets; parallel mapper execution is not possible, and older versions of Hadoop are not thread-safe.
The user can pass '-x local' to get into Pig's local mode of execution, and Pig then always looks for the local file system path while loading data.
b. MapReduce Mode
MapReduce mode is the mode in which we load or process data that exists in the Hadoop Distributed File System (HDFS) using Apache Pig. In this mode, whenever we execute Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in HDFS. To be more specific, in this mode a user has a proper Hadoop cluster setup and installation. By default, Apache Pig installs in MR mode. Pig translates the queries into MapReduce jobs and runs them on top of the Hadoop cluster; hence, in this mode MapReduce runs on a distributed cluster.
Apache Pig Execution Mechanisms
Apache Pig scripts can be executed in three ways: interactive mode, batch mode, and embedded mode.
a. Interactive Mode (Grunt shell)
We can run Apache Pig in interactive mode by using the Grunt shell. In this shell, we can enter Pig Latin statements and get the output using the Dump operator.
b. Batch Mode (Script)
We can run Apache Pig in batch mode by writing the Pig Latin script in a single file with the .pig extension, as in the sketch below.
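As a brief hedged sketch of batch mode (the file names and paths are assumptions): save a few statements in a .pig file, for example word_group.pig, and run it from the command line with 'pig -x mapreduce word_group.pig'.
-- contents of word_group.pig (an assumed script file): a classic word count
lines  = LOAD '/pig_data/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/pig_output/wordcount';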
c. Embedded Mode (UDF)
Apache Pig provides the facility to define our own functions (User Defined Functions) in programming languages such as Java and to use them in our scripts.
To start Pig in these execution modes from the command line:
1. Local mode
$ ./pig -x local
2. MapReduce mode
$ ./pig -x mapreduce