
Jaipur Engineering College And Research Centre

Subject Name with code: Big Data Analytics (BDA) (7AID4-01)

Presented by: Ms. Ati Garg

Assistant Professor, AI&DS

Branch and SEM - AI&DS/ VII SEM

Department of

Artificial Intelligence and Data Science

UNIT-4

Hadoop Input/output

Outline
4.1 Hadoop Programming

4.1.1 Features of Hadoop and Installation

4.1.2 Hadoop Application and Architecture

4.1.3 Hadoop Techniques and Function

4.2 Pig Data

4.2.1 Pig Data Working and application

4.2.2 Pig Data Architecture

4.3 Pig Latin Application Flow

4.3.1 Pig Latin Application Working

4.3.2 ABCs Pig Latin Application

4.4 Local Modes of Running Pig Scripts

4.5 Distributed Modes of Running Pig Scripts

4.5.1 Pig Scripts working

4.5.2 Function and Features

4.5.3 Application

4.5.4 Security

4.6 Pig Scripts Interface

4.6.1 Requirements of Pig Scripts

4.6.2 Security of Pig Scripts

4.7 Scripting with Pig Latin

UNIT 4

INTRODUCTION TO HADOOP PROGRAMMING


HADOOP

Hadoop is an open-source framework that allows you to store and process big data in a distributed
environment across clusters of computers using simple programming models. It is designed to scale up
from single servers to thousands of machines, each offering local computation and storage.

Hadoop is an open-source software framework used for storing and processing Big Data in a
distributed manner on large clusters of commodity hardware. Hadoop is licensed under the
Apache v2 license.

Hadoop was developed based on the MapReduce paper published by Google and applies concepts of
functional programming. It is written in the Java programming language and is one of the top-level
Apache projects. Hadoop was created by Doug Cutting and Michael J. Cafarella.

The first problem Hadoop addresses is storing a huge amount of data.

HDFS provides a distributed way to store Big Data. Your data is stored in blocks on Data Nodes, and
you specify the size of each block. Suppose you have 512 MB of data and you have configured HDFS to
create 128 MB data blocks. HDFS will divide the data into 4 blocks (512/128 = 4) and store them
across different Data Nodes. While storing these data blocks, the blocks are replicated on different
Data Nodes to provide fault tolerance. Hadoop follows horizontal scaling instead of vertical scaling:
you can add new nodes to the HDFS cluster on the fly as per requirement, instead of increasing the
hardware stack of each node. In HDFS you can store all kinds of data, whether structured,
semi-structured or unstructured, and there is no pre-dumping schema validation. HDFS also follows a
write-once, read-many model: you write any kind of data once and read it multiple times to find insights.
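As a quick check on the arithmetic above, the following minimal Python sketch (the function name is made up; 128 MB blocks and a replication factor of 3 are assumptions of this example) computes how many blocks and block replicas HDFS would create for a file:

import math

def hdfs_block_plan(file_size_mb, block_size_mb=128, replication_factor=3):
    """Return the number of blocks and total block replicas for a file."""
    # The file is split into fixed-size blocks; the last block may be smaller.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Each block is replicated on different Data Nodes for fault tolerance.
    replicas = blocks * replication_factor
    return blocks, replicas

# 512 MB with 128 MB blocks -> 4 blocks, 12 replicas spread across Data Nodes
print(hdfs_block_plan(512))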

The third challenge is processing the data faster. To solve this, we move the processing unit to the
data instead of moving the data to the processing unit. Instead of moving data from different nodes
to a single master node for processing, the processing logic is sent to the nodes where the data is
stored, so that each node can process a part of the data in parallel. Finally, all of the intermediate
output produced by each node is merged together and the final response is sent back to the client.

Features of Hadoop

Reliability
When machines are working as a single unit, if one of the machines fails, another machine will
take over the responsibility and work in a reliable and fault-tolerant fashion. Hadoop
infrastructure has inbuilt fault tolerance features and hence, Hadoop is highly reliable.

Economical
Hadoop uses commodity hardware (like your PC or laptop). For example, in a small Hadoop cluster, all
your Data Nodes can have normal configurations such as 8-16 GB RAM, 5-10 TB of hard disk and Xeon
processors. Had we used hardware-based RAID with Oracle for the same purpose, the cost would be at
least five times higher. So, the total cost of ownership of a Hadoop-based project is minimized. It is
easier to maintain a Hadoop environment and it is economical as well. Also, Hadoop is open-source
software, so there is no licensing cost.

Scalability
Hadoop has the inbuilt capability of integrating seamlessly with cloud-based services. So, if you
are installing Hadoop on a cloud, you don’t need to worry about the scalability factor because
you can go ahead and procure more hardware and expand your set up within minutes whenever
required.

Flexibility
Hadoop is very flexible in terms of its ability to deal with all kinds of data. This is the "Variety"
characteristic of Big Data: data can be of any kind, and Hadoop can store and process it all, whether
it is structured, semi-structured or unstructured.

Hadoop Core Components

While setting up a Hadoop cluster, you have the option of choosing many services as part of your
Hadoop platform, but two services are always mandatory: HDFS (storage) and YARN (processing). HDFS
stands for Hadoop Distributed File System, the scalable storage unit of Hadoop, whereas YARN processes
the data stored in HDFS in a distributed and parallel fashion.

HDFS
Let us go ahead with HDFS first. The main components of HDFS are the Name Node and
the Data Node. Let us talk about the roles of these two components in detail.

Name Node

 It is the master daemon that maintains and manages the Data Nodes (slave nodes)
 It records the metadata of all the blocks stored in the cluster, e.g. location of blocks
stored, size of the files, permissions, hierarchy, etc.
 It records each and every change that takes place to the file system metadata
 If a file is deleted in HDFS, the Name Node will immediately record this in the Edit Log
 It regularly receives a Heartbeat and a block report from all the Data Nodes in the
cluster to ensure that the Data Nodes are alive
 It keeps a record of all the blocks in the HDFS and Data Node in which they are stored
 It has high availability and federation features which I will discuss
in HDFS architecture in detail

Data Node

 It is the slave daemon which runs on each slave machine


 The actual data is stored on Data Nodes
 It is responsible for serving read and write requests from the clients
 It is also responsible for creating blocks, deleting blocks and replicating the same
based on the decisions taken by the Name Node
 It sends heartbeats to the Name Node periodically to report the overall health of HDFS,
by default, this frequency is set to 3 seconds

So, this was HDFS in a nutshell. Now, let us move ahead to the second fundamental unit of Hadoop,
i.e. YARN.
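To make the Name Node / Data Node split concrete, here is a small, purely illustrative Python sketch (a toy model, not the real implementation; all names are hypothetical) of the block-to-location metadata a Name Node keeps, while the Data Nodes hold the actual block contents:

# Toy model of HDFS metadata, for illustration only.
file_to_blocks = {
    # The Name Node maps each file to its list of block IDs...
    "/user/input/file.txt": ["blk_001", "blk_002", "blk_003", "blk_004"],
}

block_locations = {
    # ...and records which Data Nodes hold a replica of each block.
    "blk_001": ["datanode1", "datanode2", "datanode3"],
    "blk_002": ["datanode2", "datanode3", "datanode4"],
    "blk_003": ["datanode1", "datanode3", "datanode4"],
    "blk_004": ["datanode1", "datanode2", "datanode4"],
}

def read_file(path):
    """A client asks the Name Node for block locations, then reads from Data Nodes."""
    for block in file_to_blocks[path]:
        nodes = block_locations[block]
        # Read the block from any live replica (a real client picks the closest one).
        print(f"read {block} from {nodes[0]}")

read_file("/user/input/file.txt")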

YARN
YARN comprises two major components: the Resource Manager and the Node Manager.

Resource Manager

 It is a cluster-level (one for each cluster) component and runs on the master machine
 It manages resources and schedules applications running on top of YARN
 It has two components: Scheduler & Application Manager
 The Scheduler is responsible for allocating resources to the various running applications
 The Application Manager is responsible for accepting job submissions and negotiating
the first container for executing the application
 It keeps a track of the heartbeats from the Node Manager

Node Manager

 It is a node-level component (one on each node) and runs on each slave machine
 It is responsible for managing containers and monitoring resource utilization in each
container
 It also keeps track of node health and log management
 It continuously communicates with Resource Manager to remain up-to-date

Hadoop Ecosystem

So far you would have figured out that Hadoop is neither a programming language nor a service,
it is a platform or framework which solves Big Data problems. You can consider it as a suite
which encompasses a number of services for ingesting, storing and analyzing huge data sets
along with tools for configuration management.

Last.FM is an internet radio and community-driven music discovery service founded in 2002. Users
transmit information to Last.FM servers indicating which songs they are listening to. The received
data is processed and stored so that users can access it in the form of charts. Thus, Last.FM can
make intelligent taste and compatibility decisions for generating recommendations. The data is
obtained from one of the two sources stated below:

 Scrobble: When a user plays a track of his or her own choice and sends the information
to Last.FM through a client application.
 Radio listen: When the user tunes into a Last.FM radio station and streams a song.

 Over 40M unique visitors and 500M page views each month
 Scrobble stats:
 Up to 800 scrobbles per second
 More than 40 million scrobbles per day
 Over 75 billion scrobbles so far
 Radio stats:
 Over 10 million streaming hours per month
 Over 400 thousand unique stations per day
 Each Scrobble and radio listen generates at least one logline

Hadoop at Last.FM

 100 Nodes
 8 cores per node (dual quad-core)
 24GB memory per node
 8TB (4 disks of 2TB each)
 Hive integration to run optimized SQL queries for analysis

Last.FM started using Hadoop in 2006 because its user base grew from thousands to millions. With the
help of Hadoop, they processed hundreds of daily, weekly, and monthly jobs including website stats
and metrics, chart generation (i.e. track statistics), metadata corrections (e.g. misspellings of
artists), indexing for search, combining/formatting data for recommendations, data insights,
evaluations and reporting. This helped Last.FM grow tremendously and figure out the taste of their
users, based on which they started recommending music. The main tools of the Hadoop ecosystem
(HBase, Hive, Sqoop, Spark and Pig) are discussed later in this unit.

What is Big Data?

For example, consider how storage technology has evolved. Earlier we had landline phones, but now we
have shifted to smartphones. Similarly, the floppy drives that were extensively used back in the '90s
have been replaced by hard disks, because floppy drives had very low storage capacity and transfer
speed.

This makes floppy drives insufficient for handling the amount of data we deal with today. In fact,
we can now store terabytes of data in the cloud without being bothered about size constraints.

IoT connects your physical devices to the internet and makes them smarter. Nowadays, we have smart
air conditioners, televisions, etc. A smart air conditioner constantly monitors the room temperature
along with the outside temperature and accordingly decides what the room temperature should be. Now
imagine how much data would be generated in a year by smart air conditioners installed in tens of
thousands of houses. By this, you can understand how IoT contributes a major share of Big Data. Now,
let us talk about the largest contributor to Big Data, which is none other than social media. Social
media is one of the most important factors in the evolution of Big Data, as it provides information
about people's behaviour. The figure referenced below gives an idea of how much data is generated
every minute:

Fig: Hadoop Tutorial – Social Media Data Generation Stats

Hadoop Tutorial: Big Data & Hadoop – Restaurant Analogy

Let us take the analogy of a restaurant to understand the problems associated with Big Data and how
Hadoop solved them. Bob is a businessman who has opened a small restaurant. Initially, he used to
receive two orders per hour, and he had one chef and one food shelf in his restaurant, which was
sufficient to handle all the orders.

Fig: Hadoop Tutorial – Traditional Restaurant Scenario

Now let us compare the restaurant example with the traditional scenario, where data was generated at
a steady rate and a traditional system like an RDBMS was capable of handling it, just like Bob's
chef. Here, you can relate the data storage to the restaurant's food shelf and the traditional
processing unit to the chef, as shown in the figure above.

Fig: Hadoop Tutorial – Traditional Scenario

After a few months, Bob thought of expanding his business, so he started taking online orders and
added a few more cuisines to the restaurant's menu in order to engage a larger audience. Because of
this transition, the rate at which they received orders rose to an alarming 10 orders per hour, and
it became quite difficult for a single cook to cope with the situation. Aware of the problem in
processing the orders, Bob started thinking about a solution.

Fig: Hadoop Tutorial – Distributed Processing Scenario

Similarly, in the Big Data scenario, data started getting generated at an alarming rate because of
the introduction of various data growth drivers such as social media, smartphones, etc. The
traditional system, just like the cook in Bob's restaurant, was not efficient enough to handle this
sudden change. Thus, there was a need for a different kind of solution to cope with this problem.
After a lot of research, Bob came up with a solution: he hired 4 more chefs to tackle the huge rate
of incoming orders. Everything was going quite well, but this solution led to one more problem. Since
four chefs were sharing the same food shelf, the food shelf itself became the bottleneck of the whole
process.

Fig: Hadoop Tutorial – Distributed Processing Scenario Failure

Similarly, to tackle the problem of processing huge data sets, multiple processing units were
installed so as to process the data in parallel (just like Bob hired 4 chefs). But even in this case,
bringing multiple processing units was not an effective solution because the centralized storage
unit became the bottleneck. In other words, the performance of the whole system is driven by the
performance of the central storage unit. Therefore, the moment our central storage goes down,
the whole system gets compromised. Hence, again there was a need to resolve this single point of
failure.

Fig: Hadoop Tutorial – Solution to Restaurant Problem

Bob came up with another, more efficient solution: he divided the chefs into two hierarchies, junior
chefs and a head chef, and assigned each junior chef their own food shelf. Suppose the dish is meat
sauce. According to Bob's plan, one junior chef will prepare the meat and another junior chef will
prepare the sauce. They will then transfer both the meat and the sauce to the head chef, who will
combine the two ingredients to prepare the meat sauce, which is then delivered as the final order.

Fig: Hadoop Tutorial – Hadoop in Restaurant Analogy

Hadoop functions in a similar fashion to Bob's restaurant. Just as the food shelves are distributed
in Bob's restaurant, in Hadoop the data is stored in a distributed fashion, with replication, to
provide fault tolerance. For parallel processing, the data is first processed by the slave nodes
where it is stored, producing intermediate results; those intermediate results are then merged by the
master node to produce the final result. Now you should have an idea of why Big Data is a problem
statement and how Hadoop solves it.

As we just discussed above, there were three major challenges with Big Data:

 The first problem is storing the colossal amount of data

Storing huge data in a traditional system is not possible. The reason is obvious, the storage will
be limited to one system and the data is increasing at a tremendous rate.

 The second problem is storing heterogeneous data

Now we know that storing is a problem, but let me tell you it is just one part of the problem. The
data is not only huge, but it is also present in various formats i.e. unstructured, semi-structured
and structured. So, you need to make sure that you have a system to store different types of
data that is generated from various sources.

Finally, let us focus on the third problem, which is processing speed.

The time taken to process this huge amount of data is quite high, as the data to be processed is very
large. To solve the storage and processing issues, two core components were created in Hadoop: HDFS
and YARN. HDFS solves the storage issue, as it stores the data in a distributed fashion and is easily
scalable, and YARN solves the processing issue by reducing the processing time drastically. Moving
ahead, let us understand what Hadoop is. Hadoop is an Apache open-source framework written in Java
that allows distributed processing of large datasets across clusters of computers using simple
programming models. A Hadoop framework application works in an environment that provides distributed
storage and computation across clusters of computers. Hadoop is designed to scale up from a single
server to thousands of machines, each offering local computation and storage.

Hadoop Architecture

At its core, Hadoop has two major layers namely −

 Processing/Computation layer (MapReduce), and


 Storage layer (Hadoop Distributed File System).

MapReduce
MapReduce is a parallel programming model for writing distributed applications devised at
Google for efficient processing of large amounts of data (multi-terabyte data-sets), on large
clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The
MapReduce program runs on Hadoop which is an Apache open-source framework.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on commodity hardware. It has many
similarities with existing distributed file systems. However, the differences from other
distributed file systems are significant. It is highly fault-tolerant and is designed to be deployed
on low-cost hardware. It provides high throughput access to application data and is suitable for
applications having large datasets.
Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules −
 Hadoop Common − these are Java libraries and utilities required by other Hadoop
modules.
 Hadoop YARN − this is a framework for job scheduling and cluster resource
management.

How Does Hadoop Work?

It is quite expensive to build bigger servers with heavy configurations that handle large-scale
processing. As an alternative, you can tie together many commodity single-CPU computers as a single
functional distributed system; practically, the clustered machines can read the dataset in parallel
and provide much higher throughput. Moreover, it is cheaper than one high-end server. So this is the
first motivational factor behind using Hadoop: it runs across clustered, low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks
that Hadoop performs −
 Data is initially divided into directories and files. Files are divided into uniform sized
blocks of 128M and 64M (preferably 128M).
 These files are then distributed across various cluster nodes for further processing.
 HDFS, being on top of the local file system, supervises the processing.
 Blocks are replicated for handling hardware failure.
 Checking that the code was executed successfully.
 Performing the sort that takes place between the map and reduce stages.
 Sending the sorted data to a certain computer.
 Writing the debugging logs for each job.

Advantages of Hadoop

 The Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines and, in turn,
utilizes the underlying parallelism of the CPU cores.
 Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA);
rather, the Hadoop library itself has been designed to detect and handle failures at the
application layer.
 Servers can be added to or removed from the cluster dynamically and Hadoop continues to
operate without interruption.
 Another big advantage of Hadoop is that, apart from being open source, it is compatible with
all platforms since it is Java-based.

Starting HDFS

Initially you have to format the configured HDFS file system, open the namenode (HDFS server),
and execute the following command.
$ hadoop namenode -format
After formatting the HDFS, start the distributed file system. The following command will start
the namenode as well as the data nodes as a cluster.
$ start-dfs.sh

Listing Files in HDFS

After loading the information into the server, we can find the list of files in a directory, or the
status of a file, using 'ls'. Given below is the syntax of ls; you can pass a directory or a filename
as an argument.
$ $HADOOP_HOME/bin/hadoop fs -ls <args>

Inserting Data into HDFS

Assume we have data in a file called file.txt on the local system which needs to be saved in the
HDFS file system. Follow the steps given below to insert the required file into the Hadoop file
system.
Step 1
You have to create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input
Step 2
Transfer and store a data file from local systems to the Hadoop file system using the put
command.

$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Step 3
You can verify the file using ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input

Retrieving Data from HDFS

Assume we have a file in HDFS called outfile. Given below is a simple demonstration of retrieving
the required file from the Hadoop file system.
Step 1
Initially, view the data from HDFS using the cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Step 2
Get the file from HDFS to the local file system using the get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/

Shutting Down the HDFS

You can shut down the HDFS by using the following command.
$ stop-dfs.sh
There are many more commands in "$HADOOP_HOME/bin/hadoop fs" than are demonstrated here, although
these basic operations will get you started. Running ./bin/hadoop dfs with no additional arguments
will list all the commands that can be run with the shell system. Furthermore,
$HADOOP_HOME/bin/hadoop fs -help commandName will display a short usage summary for the operation
in question, if you are stuck.
A table of all the operations is shown below. The following conventions are used for the parameters:

<path> means any file or directory name.
<path>... means one or more file or directory names.
<file> means any filename.
<src> and <dest> are path names in a directed operation.
<localSrc> and <localDest> are paths as above, but on the local file system.
All other file and path names refer to objects inside HDFS.

Sr.No Command & Description

1 -ls <path>
Lists the contents of the directory specified by path, showing the names, permissions,
owner, size and modification date for each entry.

2 -lsr <path>
Behaves like -ls, but recursively displays entries in all subdirectories of path.

3 -du <path>
Shows disk usage, in bytes, for all the files which match path; filenames are reported with
the full HDFS protocol prefix.

4 -dus <path>
Like -du, but prints a summary of disk usage of all files/directories in the path.

5 -mv <src><dest>
Moves the file or directory indicated by src to dest, within HDFS.

6 -cp <src> <dest>


Copies the file or directory identified by src to dest, within HDFS.

7 -rm <path>
Removes the file or empty directory identified by path.

8 -rmr <path>
Removes the file or directory identified by path. Recursively deletes any child entries (i.e.,
files or subdirectories of path).

9 -put <localSrc> <dest>


Copies the file or directory from the local file system identified by localSrc to dest within
the DFS.

10 -copyFromLocal <localSrc> <dest>
Identical to -put

11 -moveFromLocal <localSrc> <dest>


Copies the file or directory from the local file system identified by localSrc to dest within
HDFS, and then deletes the local copy on success.

12 -get [-crc] <src> <localDest>


Copies the file or directory in HDFS identified by src to the local file system path
identified by localDest.

13 -getmerge <src> <localDest>


Retrieves all files that match the path src in HDFS, and copies them to a single, merged
file in the local file system identified by localDest.

14 -cat <filen-ame>
Displays the contents of filename on stdout.

15 -copyToLocal <src> <localDest>


Identical to -get

16 -moveToLocal <src> <localDest>


Works like -get, but deletes the HDFS copy on success.

17 -mkdir <path>
Creates a directory named path in HDFS.
Creates any parent directories in path that are missing (e.g., mkdir -p in Linux).

18 -setrep [-R] [-w] rep <path>


Sets the target replication factor for files identified by path to rep. (The actual replication
factor will move toward the target over time)

19 -touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already
exists at path, unless the file is already size 0.

20 -test -[ezd] <path>

Returns 1 if path exists (-e), has zero length (-z), or is a directory (-d); returns 0 otherwise.

21 -stat [format] <path>


Prints information about path. Format is a string which accepts file size in blocks (%b),
filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).

22 -tail [-f] <file2name>


Shows the last 1KB of file on stdout.

23 -chmod [-R] mode,mode,... <path>...

Changes the file permissions associated with one or more objects identified by path.... Performs
changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a (all) if
no scope is specified and does not apply an umask.

24 -chown [-R] [owner][:[group]] <path>...


Sets the owning user and/or group for files or directories identified by path.... Sets owner
recursively if -R is specified.

25 -chgrp [-R] group <path>...


Sets the owning group for files or directories identified by path.... Sets group recursively if
-R is specified.

26 -help <cmd-name>
Returns usage information for one of the commands listed above. You must omit the
leading '-' character in cmd.

What is MapReduce?

MapReduce is a processing technique and a programming model for distributed computing based on
Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set
of data and converts it into another set of data, where individual elements are broken down into
tuples (key/value pairs). The reduce task takes the output from a map as input and combines those
data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce
task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers is
sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of machines in a cluster
is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
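To make the model concrete, the following small Python sketch (plain Python, not Hadoop code; all names are illustrative) simulates the map, shuffle/group, and reduce phases for a word count:

from collections import defaultdict

def map_phase(lines):
    """Map: break each input line into (word, 1) key/value pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values belonging to the same key together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the grouped values into a smaller set of (key, total) pairs."""
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data is big", "hadoop handles big data"]
print(reduce_phase(shuffle_phase(map_phase(lines))))
# {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'handles': 1}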

The Algorithm

 Generally, the MapReduce paradigm is based on sending the computation to where the data
resides.
 A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the
reduce stage.
o Map stage − The map or mapper's job is to process the input data. Generally the input
data is in the form of a file or directory and is stored in the Hadoop file system
(HDFS). The input file is passed to the mapper function line by line. The mapper
processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage.
The Reducer's job is to process the data that comes from the mapper. After processing,
it produces a new set of output, which will be stored in HDFS.
 During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
 The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
 Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
 After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.

Inputs and Outputs (Java Perspective)

The MapReduce framework operates on <key, value> pairs, that is, the framework views the
input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the
output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the
Writable interface. Additionally, the key classes have to implement the WritableComparable interface
to facilitate sorting by the framework. The input and output types of a MapReduce job are:
(Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).

Input Output

Map <k1, v1> list (<k2, v2>)

Reduce <k2, list(v2)> list (<k3, v3>)

Terminology

 Payload − Applications implement the Map and the Reduce functions, and form the core of the
job.
 Mapper − Maps the input key/value pairs to a set of intermediate key/value pairs.
 Name Node − Node that manages the Hadoop Distributed File System (HDFS).
 Data Node − Node where the data is present in advance before any processing takes place.
 Master Node − Node where the Job Tracker runs and which accepts job requests from clients.
 Slave Node − Node where the Map and Reduce programs run.
 Job Tracker − Schedules jobs and tracks the assigned jobs with the Task Tracker.
 Task Tracker − Tracks the tasks and reports status to the Job Tracker.
 Job − An execution of a Mapper and a Reducer across a dataset.

 Task − An execution of a Mapper or a Reducer on a slice of data.
 Task Attempt − A particular instance of an attempt to execute a task on a Slave Node.

Example Scenario

Given below is data regarding the electrical consumption of an organization. It contains the
monthly electrical consumption and the annual average for various years.

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg

1979 23 23 2 43 24 25 26 26 26 26 25 26 25

1980 26 27 28 28 28 30 31 31 31 30 30 30 29

1981 31 32 32 32 33 34 35 36 36 34 34 34 34

1984 39 38 39 39 39 41 42 43 40 39 38 38 40

1985 38 39 39 39 39 41 41 41 00 40 39 39 45

If the above data is given as input, we have to write applications to process it and produce results
such as finding the year of maximum usage, the year of minimum usage, and so on. This is a walkover
for programmers with a finite number of records: they will simply write the logic to produce the
required output and pass the data to the application. But think of the data representing the
electrical consumption of all the large-scale industries of a particular state since its formation.
When we write applications to process such bulk data,
 They will take a lot of time to execute.
 There will be heavy network traffic when we move data from source to network server
and so on.

To solve these problems, we have the MapReduce framework.


Input Data
The above data is saved as sample.txt and given as input. The input file looks as shown below.
1979 23 2 43 24 25 26 25 26 25
1980 26 27 28 28 30 31 30 29
1981 31 32 33 34 35 36 34
1984 39 38 39 39 41 42 43 40 39 38 40
1985 38 39 41 00 40 39 45
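The worked example that follows uses the word-count problem, but purely as an illustration of how this consumption data could be processed, here is a minimal Python sketch (not part of the original material; the field layout is assumed from the sample above) that finds the maximum monthly reading for each year:

import sys

def mapper(lines):
    """Emit (year, reading) pairs for every monthly value on a line."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        year, readings = fields[0], fields[1:]
        for value in readings:
            yield year, int(value)

def reducer(pairs):
    """For each year, keep only the maximum reading seen."""
    max_per_year = {}
    for year, value in pairs:
        if year not in max_per_year or value > max_per_year[year]:
            max_per_year[year] = value
    return max_per_year

sample = ["1979 23 2 43 24 25 26 25 26 25",
          "1980 26 27 28 28 30 31 30 29"]
print(reducer(mapper(sample)))   # {'1979': 43, '1980': 31}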
In a real Hadoop job, such mapper and reducer scripts are wired together by the framework. Hadoop
streaming is a utility that comes with the Hadoop distribution. This utility allows you to create and
run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

Example Using Python

For Hadoop streaming, we consider the word-count problem. Any job in Hadoop must have two phases:
mapper and reducer. We have written the mapper and the reducer as Python scripts to run them under
Hadoop. One can also write the same in Perl or Ruby.
Mapper Phase Code

#!/usr/bin/python

import sys

# Input is taken from standard input
for myline in sys.stdin:
    # Remove whitespace on either side
    myline = myline.strip()
    # Break the line into words
    words = myline.split()
    # Iterate over the words list
    for myword in words:
        # Write the results to standard output
        print '%s\t%s' % (myword, 1)

Make sure this file has execution permission (chmod +x /home/expert/hadoop-1.2.1/mapper.py).
Reducer Phase Code

#!/usr/bin/python

from operator import itemgetter
import sys

current_word = ""
current_count = 0
word = ""

# Input is taken from standard input
for myline in sys.stdin:
    # Remove whitespace on either side
    myline = myline.strip()
    # Split the input we got from mapper.py into word and count
    word, count = myline.split('\t', 1)
    # Convert the count variable to an integer
    try:
        count = int(count)
    except ValueError:
        # Count was not a number, so silently ignore this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write the result to standard output
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# Do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
Save the mapper and reducer code in mapper.py and reducer.py in the Hadoop home directory. Make sure
these files have execution permission (chmod +x mapper.py and chmod +x reducer.py). As Python is
indentation sensitive, take care to preserve the indentation shown above.

Execution of Word Count Program

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input input_dirs \
   -output output_dir \
   -mapper <path/mapper.py \
   -reducer <path/reducer.py

Where "\" is used for line continuation for clear readability.

For example,

./bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -input myinput -output myoutput
-mapper /home/expert/hadoop-1.2.1/mapper.py -reducer /home/expert/hadoop-1.2.1/reducer.py

How Streaming Works

In the above example, both the mapper and the reducer are python scripts that read the input
from standard input and emit the output to standard output. The utility will create a Map/Reduce
job, submit the job to an appropriate cluster, and monitor the progress of the job until it
completes.
When a script is specified for mappers, each mapper task launches the script as a separate process
when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds
the lines to the standard input (STDIN) of the process. In the meantime, the mapper collects the
line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a
key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up
to the first tab character is the key and the rest of the line (excluding the tab character) is the
value. If there is no tab character in the line, then the entire line is considered the key and the
value is null. However, this can be customized as per one's needs.

When a script is specified for reducers, each reducer task launches the script as a separate process
when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into
lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the reducer
collects the line-oriented outputs from the standard output (STDOUT) of the process, converts each
line into a key/value pair, and collects it as the output of the reducer. By default, the prefix of a
line up to the first tab character is the key and the rest of the line (excluding the tab character)
is the value. However, this can be customized as per specific requirements.
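The default key/value convention described above amounts to splitting each line at the first tab; a minimal Python sketch of that rule (illustrative only, not Hadoop's internal code) is:

def split_streaming_line(line):
    """Split a streaming output line into (key, value) at the first tab.
    If there is no tab, the whole line is the key and the value is None."""
    line = line.rstrip("\n")
    if "\t" in line:
        key, value = line.split("\t", 1)
        return key, value
    return line, None

print(split_streaming_line("hadoop\t1"))   # ('hadoop', '1')
print(split_streaming_line("no-tab-here")) # ('no-tab-here', None)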

Important Commands

Parameters (marked Required or Optional) and their descriptions:

-input directory/file-name (Required): Input location for the mapper.

-output directory-name (Required): Output location for the reducer.

-mapper executable or script or JavaClassName (Required): Mapper executable.

-reducer executable or script or JavaClassName (Required): Reducer executable.

-file file-name (Optional): Makes the mapper, reducer, or combiner executable available locally on
the compute nodes.

-inputformat JavaClassName (Optional): The class you supply should return key/value pairs of Text
class. If not specified, TextInputFormat is used as the default.

-outputformat JavaClassName (Optional): The class you supply should take key/value pairs of Text
class. If not specified, TextOutputFormat is used as the default.

-partitioner JavaClassName (Optional): Class that determines which reduce a key is sent to.

-combiner streamingCommand or JavaClassName (Optional): Combiner executable for map output.

-cmdenv name=value (Optional): Passes the environment variable to streaming commands.

-inputreader (Optional): For backwards-compatibility: specifies a record reader class (instead of an
input format class).

-verbose (Optional): Verbose output.

-lazyOutput (Optional): Creates output lazily. For example, if the output format is based on
FileOutputFormat, the output file is created only on the first call to output.collect (or
Context.write).

-numReduceTasks (Optional): Specifies the number of reducers.

-mapdebug (Optional): Script to call when a map task fails.

-reducedebug (Optional): Script to call when a reduce task fails.

Advantages of Hadoop

o Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in faster
retrieval. Even the tools that process the data are often on the same servers, thus reducing
the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in
hours.
o Scalable: A Hadoop cluster can be extended by just adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is
really cost-effective compared to a traditional relational database management system.
o Resilient to failure: HDFS has the property of replicating data over the network, so if one
node is down or some other network failure happens, Hadoop uses the other copy of the data.
Normally, data is replicated three times, but the replication factor is configurable.

History of Hadoop

Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System
paper published by Google.

 In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an
open-source web crawler software project.
 While working on Apache Nutch, they were dealing with big data. To store that data they had to
spend a lot of money, which became a drawback of the project. This problem became one of the
important reasons for the emergence of Hadoop.
 In 2003, Google introduced a file system known as GFS (Google File System). It is a
proprietary distributed file system developed to provide efficient access to data.
 In 2004, Google released a white paper on MapReduce. This technique simplifies data
processing on large clusters.
 In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch
Distributed File System). This file system also included MapReduce.
 In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced
a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System).
Hadoop's first version, 0.1.0, was released in this year.
 Doug Cutting named his project Hadoop after his son's toy elephant.
 In 2007, Yahoo was running two clusters of 1000 machines.
 In 2008, Hadoop became the fastest system to sort 1 terabyte of data, on a 900-node cluster,
within 209 seconds.
 In 2013, Hadoop 2.2 was released.
 In 2017, Hadoop 3.0 was released.

Year

2003 Google released the paper, Google File System (GFS).

2004 Google released a white paper on Map Reduce.

2006 o Hadoop introduced.


o Hadoop 0.1.0 released.
o Yahoo deploys 300 machines and within this year reaches 600 machines.

2007 o Yahoo runs 2 clusters of 1000 machines.


o Hadoop includes HBase.

2008 o YARN JIRA opened


o Hadoop becomes the fastest system to sort 1 terabyte of data on a 900-node cluster within 209 seconds.
o Yahoo clusters loaded with 10 terabytes per day.
o Cloudera was founded as a Hadoop distributor.

2009 o Yahoo runs 17 clusters of 24,000 machines.


o Hadoop becomes capable enough to sort a petabyte.
o MapReduce and HDFS become separate subprojects.

2010 o Hadoop added the support for Kerberos.


o Hadoop operates 4,000 nodes with 40 petabytes.
o Apache Hive and Pig released.

2011 o Apache Zookeeper released.


o Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of storage.

2012 Apache Hadoop 1.0 version released.

2013 Apache Hadoop 2.2 version released.

2014 Apache Hadoop 2.6 version released.

2015 Apache Hadoop 2.7 version released.

2017 Apache Hadoop 3.0 version released.

2018 Apache Hadoop 3.1 version released.

Features of HDFS

o Highly Scalable - HDFS is highly scalable as it can scale to hundreds of nodes in a single
cluster.
o Replication - Due to some unfavourable conditions, the node containing the data may be lost.
So, to overcome such problems, HDFS always maintains a copy of the data on a different
machine.
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the
event of failure. HDFS is highly fault-tolerant: if any machine fails, the other machine
containing the copy of that data automatically becomes active.
o Distributed data storage - This is one of the most important features of HDFS and makes
Hadoop very powerful. Here, data is divided into multiple blocks and stored across nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one platform to
another.

Goals of HDFS

o Handling hardware failure - HDFS contains multiple server machines. If any machine fails, the
HDFS goal is to recover from it quickly.
o Streaming data access - HDFS applications typically need streaming access to their data sets,
unlike the applications that run on general-purpose file systems.
o Coherence model - Applications that run on HDFS follow the write-once-read-many approach. So,
a file once created need not be changed; however, it can be appended and truncated.

What is YARN

YARN (Yet Another Resource Negotiator) takes Hadoop beyond batch MapReduce programming in Java and
makes the platform more interactive, letting other applications such as HBase and Spark work on it.
Different YARN applications can co-exist on the same cluster, so MapReduce, HBase and Spark can all
run at the same time, bringing great benefits for manageability and cluster utilization.

Components of YARN

o Client: For submitting MapReduce jobs.
o Resource Manager: For managing the use of resources across the cluster.
o Node Manager: For launching and monitoring the compute containers on machines in the cluster.
o MapReduce Application Master: Coordinates the tasks running the MapReduce job. The
application master and the MapReduce tasks run in containers that are scheduled by the
resource manager and managed by the node managers.

Job Tracker and Task Tracker were used in earlier versions of Hadoop and were responsible for
resource handling and progress tracking. Hadoop 2.0 replaced them with the Resource Manager and Node
Manager to overcome their shortcomings.

Benefits of YARN

o Scalability: MapReduce 1 hits a scalability bottleneck at 4,000 nodes and 40,000 tasks, but
YARN is designed for 10,000 nodes and 100,000 (1 lakh) tasks.
o Utilization: The Node Manager manages a pool of resources, rather than a fixed number of
designated slots, thus increasing utilization.
o Multitenancy: Different versions of MapReduce can run on YARN, which makes the process of
upgrading MapReduce more manageable.

What is HBase

HBase is an open-source, sorted-map data store built on top of Hadoop. It is column-oriented and
horizontally scalable. It is based on Google's Bigtable and has a set of tables which keep data in a
key-value format. HBase is well suited to the sparse data sets that are very common in big data use
cases. HBase provides APIs enabling development in practically any programming language. It is a
part of the Hadoop ecosystem that provides random, real-time read/write access to data in the Hadoop
File System.

Why HBase

o An RDBMS gets exponentially slower as the data becomes large.
o It expects data to be highly structured, i.e. to fit a well-defined schema.
o Any change in schema might require downtime.
o For sparse datasets, there is too much overhead in maintaining NULL values.

Features of HBase

 Horizontally scalable: You can add any number of columns anytime.


 Automatic Failover: Automatic failover is a facility that allows a system administrator to
automatically switch data handling to a standby system in the event of a system failure.
 Integration with the MapReduce framework: All the commands and Java code internally implement
MapReduce to do the task, and HBase is built over the Hadoop Distributed File System.
 Sparse, distributed, persistent, multidimensional sorted map, which is indexed by row key,
column key, and timestamp (see the sketch after this list).
 Often referred to as a key-value store or column-family-oriented database, or as storing
versioned maps of maps.
 Fundamentally, it is a platform for storing and retrieving data with random access.
 It doesn't care about data types (you can store an integer in one row and a string in another
for the same column).
 It doesn't enforce relationships within your data.
 It is designed to run on a cluster of computers, built using commodity hardware.
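To illustrate the "sorted map of maps" idea mentioned above, here is a tiny, purely conceptual Python sketch (not the HBase API; all names are made up) of data indexed by row key, column key, and timestamp:

import time

# Conceptual model only: {row_key: {column_key: {timestamp: value}}}
table = {}

def put(row, column, value):
    """Store a new version of a cell, keyed by row, column and timestamp."""
    table.setdefault(row, {}).setdefault(column, {})[time.time()] = value

def get_latest(row, column):
    """Return the most recent version of a cell, or None for a sparse/missing cell."""
    versions = table.get(row, {}).get(column, {})
    if not versions:
        return None
    return versions[max(versions)]

put("user1", "info:name", "Alice")
put("user1", "info:name", "Alice B.")        # a newer version of the same cell
print(get_latest("user1", "info:name"))      # Alice B.
print(get_latest("user2", "info:email"))     # None (sparse: the cell simply doesn't exist)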

What is HIVE

Hive is a data warehouse system which is used to analyze structured data. It is built on top of
Hadoop and was developed by Facebook. Hive provides the functionality of reading, writing, and
managing large datasets residing in distributed storage. It runs SQL-like queries called HQL (Hive
Query Language), which are internally converted to MapReduce jobs.

Using Hive, we can skip the traditional approach of writing complex MapReduce programs. Hive supports
Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDFs).

Features of Hive

These are the following features of Hive:

o Hive is fast and scalable.


o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or
Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows different storage types such as plain text, RCFile, and HBase.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs), through which the user can plug in custom functionality.

Limitations of Hive

o Hive is not capable of handling real-time data.


o It is not designed for online transaction processing.
o Hive queries have high latency.

Differences between Hive and Pig

o Hive is commonly used by data analysts, whereas Pig is commonly used by programmers.
o Hive follows SQL-like queries, whereas Pig follows a data-flow language.
o Hive can handle structured data, whereas Pig can handle semi-structured data.
o Hive works on the server side of an HDFS cluster, whereas Pig works on the client side.
o Hive is slower than Pig; Pig is comparatively faster than Hive.

What is Sqoop

Sqoop is a command-line interface application for transferring data between relational databases and
Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved
jobs which can be run multiple times to import updates made to a database since the last import.
Using Sqoop, data can be moved into HDFS/Hive/HBase from MySQL/PostgreSQL/Oracle/SQL Server/DB2 and
vice versa.

Sqoop Working

Step 1: Sqoop sends a request to the relational database to return the metadata information about the
table (metadata here is the data about the table in the relational database).

Step 2: From the received information, it generates Java classes (this is why you should have Java
configured before getting it working; Sqoop internally uses the JDBC API).

Step 3: As Sqoop is written in Java, it then packages the compiled classes so that it is able to
generate the table structure; after compiling, it creates a jar file (the Java packaging standard).

What is Spark?

Apache Spark is an open-source cluster computing framework. Its primary purpose is to handle
real-time generated data. Spark was built on top of Hadoop MapReduce and is optimized to run in
memory, whereas alternative approaches like Hadoop's MapReduce write data to and from computer hard
drives. So, Spark processes the data much more quickly than the alternatives.

History of Apache Spark

Spark was initiated by Matei Zaharia at UC Berkeley's AMPLab in 2009. It was open sourced in 2010
under a BSD license.

In 2013, the project was donated to the Apache Software Foundation, and in 2014 Spark became a
top-level Apache project.

Features of Apache Spark

 Fast - It provides high performance for both batch and streaming data, using a state-of-
the-art DAG scheduler, a query optimizer, and a physical execution engine.
 Easy to Use - It facilitates writing applications in Java, Scala, Python, R, and SQL. It also
provides more than 80 high-level operators.
 Generality - It provides a collection of libraries including SQL and Data Frames, MLlib
for machine learning, GraphX, and Spark Streaming.

 Lightweight - It is a light unified analytics engine which is used for large scale data
processing.
 Runs Everywhere - It can easily run on Hadoop, Apache Mesos, Kubernetes, standalone,
or in the cloud.

Uses of Spark

 Data integration: The data generated by different systems is often not consistent enough to
combine for analysis. To fetch consistent data from systems we can use processes like Extract,
Transform, and Load (ETL). Spark is used to reduce the cost and time required for this ETL
process.
 Stream processing: It is always difficult to handle real-time generated data such as log files.
Spark is capable of operating on streams of data and can help flag potentially fraudulent
operations.
 Machine learning: Machine learning approaches become more feasible and increasingly accurate
due to the growth in the volume of data. As Spark is capable of storing data in memory and can
run repeated queries quickly, it makes it easy to work with machine learning algorithms.
 Interactive analytics: Spark is able to respond rapidly, so instead of running only pre-defined
queries, we can handle the data interactively. (A minimal word-count sketch using Spark's
Python API follows this list.)
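As an illustration only (assuming a working PySpark installation; the input path shown is hypothetical), a word count in Spark's Python API can be written as:

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
sc = spark.sparkContext

# Read a text file, split lines into words, and count each word in parallel.
counts = (sc.textFile("hdfs:///user/input/file.txt")   # hypothetical input path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):
    print(word, count)

spark.stop()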

What is Apache Pig

Apache Pig is a high-level data flow platform for executing MapReduce programs on Hadoop. The
language used for Pig is Pig Latin. Pig scripts are internally converted to MapReduce jobs and
executed on data stored in HDFS. Apart from that, Pig can also execute its jobs in Apache Tez or
Apache Spark. Pig can handle any type of data, i.e. structured, semi-structured or unstructured, and
stores the corresponding results in the Hadoop Distributed File System. Every task that can be
achieved using Pig can also be achieved using Java in MapReduce.

Features of Apache Pig

Let's see the various uses of Pig technology.

1) Ease of programming

Writing complex Java programs for MapReduce is quite tough for non-programmers. Pig makes this
process easy: in Pig, the queries are converted to MapReduce internally.

2) Optimization opportunities

The way in which tasks are encoded permits the system to optimize their execution automatically,
allowing the user to focus on semantics rather than efficiency.
3) Extensibility

Users can write user-defined functions containing their own logic to execute over the data set.

4) Flexible

It can easily handle structured as well as unstructured data.

5) In-built operators

It contains various types of operators such as sort, filter and joins.

Differences between Apache MapReduce and PIG

o Apache MapReduce is a low-level data processing tool, whereas Apache Pig is a high-level data
flow tool.
o In MapReduce, it is required to develop complex programs using Java or Python; in Pig, it is
not required to develop complex programs.
o It is difficult to perform data operations in MapReduce, whereas Pig provides built-in
operators to perform data operations like union, sorting and ordering.
o MapReduce doesn't allow nested data types, whereas Pig provides nested data types like tuple,
bag, and map.

Advantages of Apache Pig

o Less code - Pig requires fewer lines of code to perform any operation.
o Reusability - Pig code is flexible enough to be reused.
o Nested data types - Pig provides the useful concept of nested data types like tuple, bag,
and map.

Apache Pig Installation

In this section, we will perform the pig installation.

Pre-requisite

o Java Installation - Check whether Java is installed using the following command.

$ java -version

o Hadoop Installation - Check whether Hadoop is installed using the following command.

$ hadoop version
Steps to install Apache Pig

o Download the Apache Pig tar file.
o Unzip the downloaded tar file.

$ tar -xvf pig-0.16.0.tar.gz

o Open the bashrc file.

$ sudo nano ~/.bashrc

o Now, provide the following PIG_HOME path.

export PIG_HOME=/home/hduser/pig-0.16.0
export PATH=$PATH:$PIG_HOME/bin

o Update the environment variables.

$ source ~/.bashrc

o Let's test the installation; on the command prompt, type:

$ pig -h

o Let's start Pig in MapReduce mode.

$ pig
Apache Pig Run Modes

Apache Pig executes in two modes: Local Mode and MapReduce Mode.

Local Mode

o It executes in a single JVM and is used for development, experimenting and prototyping.
o Here, files are installed and run using the local host.
o The local mode works on the local file system. The input and output data are stored in the
local file system.

The command for the local-mode Grunt shell:

$ pig -x local

MapReduce Mode

o The MapReduce mode is also known as Hadoop Mode.
o It is the default mode.
o In this mode, Pig renders Pig Latin into MapReduce jobs and executes them on the cluster.
o It can be executed against a semi-distributed or fully distributed Hadoop installation.
o Here, the input and output data are present on HDFS.

The command for MapReduce mode:

$ pig

Or,

$ pig -x mapreduce

Ways to execute Pig Program

These are the ways of executing a Pig program in local and MapReduce mode:

o Interactive Mode - In this mode, Pig is executed in the Grunt shell. To invoke the Grunt
shell, run the pig command. Once the Grunt shell starts, we can provide Pig Latin statements
and commands interactively at the command line.
o Batch Mode - In this mode, we can run a script file having a .pig extension. These files
contain Pig Latin commands.
o Embedded Mode - In this mode, we can define our own functions, called UDFs (User Defined
Functions). Here, we use programming languages like Java and Python.
Pig Latin

Pig Latin is the data flow language used by Apache Pig to analyze data in Hadoop. It is a textual
language that abstracts the programming from the Java MapReduce idiom into a higher-level notation.

Pig Latin Statements

The Pig Latin statements are used to process the data. It is an operator that accepts a relation as
an input and generates another relation as an output.

o It can span multiple lines.


o Each statement must end with a semi-colon.
o It may include expression and schemas.
o By default, these statements are processed using multi-query execution

Pig Latin Conventions

Convention Description

() The parentheses can enclose one or more items. They can also be used to indicate the tuple
data type.
Example - (10, xyz, (3,6,9))

[] The straight brackets can enclose one or more items. They can also be used to indicate the
map data type.
Example - [INNER | OUTER]

{} The curly brackets enclose two or more items. They can also be used to indicate the bag
data type.
Example - { block | nested block }

... The horizontal ellipsis points indicate that you can repeat a portion of the code.
Example - cat path [path ...]

Pig Latin Data Types

Simple Data Types

Type          Description

int           It defines a signed 32-bit integer.
              Example - 2

long          It defines a signed 64-bit integer.
              Example - 2L or 2l

float         It defines a 32-bit floating point number.
              Example - 2.5F or 2.5f or 2.5e2f or 2.5E2F

double        It defines a 64-bit floating point number.
              Example - 2.5 or 2.5e2 or 2.5E2

chararray     It defines a character array in Unicode UTF-8 format.
              Example - javatpoint

bytearray     It defines a byte array (blob).

boolean       It defines Boolean values.
              Example - true/false

datetime      It defines values in datetime order.
              Example - 1970-01-01T00:00:00.000+00:00

biginteger    It defines Java BigInteger values.
              Example - 5000000000000

bigdecimal    It defines Java BigDecimal values.
              Example - 52.232344535345

Complex Types

Type     Description

tuple    It defines an ordered set of fields.
         Example - (15,12)

bag      It defines a collection of tuples.
         Example - {(15,12), (12,15)}

map      It defines a set of key-value pairs.
         Example - [open#apache]
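To see how these complex types appear in practice, here is a minimal sketch (the file employee.txt, its layout, and the key 'dept' are assumptions): each record carries a tuple, a bag, and a map, and individual parts are projected with tuple dot-access and map # lookup.

emp  = LOAD 'employee.txt' AS (name:chararray,
                               address:tuple(city:chararray, pin:int),
                               phones:bag{t:(number:chararray)},
                               props:map[]);
info = FOREACH emp GENERATE name, address.city, props#'dept';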

Pig Data Types

Apache Pig supports many data types. A list of Apache Pig Data Types with description and
examples are given below.

Type          Description                  Example

int           Signed 32-bit integer        2

long          Signed 64-bit integer        15L or 15l

float         32-bit floating point        2.5f or 2.5F

double        64-bit floating point        1.5 or 1.5e2 or 1.5E2

chararray     Character array (string)     hello javatpoint

bytearray     Byte array (BLOB)

tuple         Ordered set of fields        (12,43)

bag           Collection of tuples         {(12,43),(54,28)}

map           Set of key-value pairs       [open#apache]

Pig Example

Use case: Using Pig find the most occurred start letter.

Solution:

Step 1: Load the data into a bag named "lines". Each entire line is bound to the field line of type
chararray.

1. grunt> lines = LOAD '/user/Desktop/data.txt' AS (line: chararray);

Step 2: The text in the bag lines needs to be tokenized; this produces one word per row.

1. grunt> tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token: chararray;

Step 3: To retain the first letter of each word, type the command below. It uses the SUBSTRING
function to take the first character of each token.

1. grunt> letters = FOREACH tokens GENERATE SUBSTRING(token,0,1) AS letter: chararray;

Step 4: Group the letters so that each group contains every occurrence of that character.

1. grunt> lettergrp = GROUP letters BY letter;

Step 5: The number of occurrences is counted in each group.

1. grunt> countletter = FOREACH lettergrp GENERATE group, COUNT(letters);

Step 6: Arrange the output according to count in descending order using the command below.

1. grunt> OrderCnt = ORDER countletter BY $1 DESC;

Step 7: Limit the result to one row to give the most frequent start letter.

1. grunt> result = LIMIT OrderCnt 1;

Step 8: Store the result in HDFS. The result is saved in the output directory under the soon folder.

1. grunt> STORE result INTO 'home/soon/output';

Pig UDF (User Defined Functions)

To specify custom processing, Pig provides support for user-defined functions (UDFs). Thus, Pig
allows us to create our own functions. Currently, Pig UDFs can be implemented using the
following programming languages: -

o Java
o Python
o Jython
o JavaScript
o Ruby
o Groovy

Among all the languages, Pig provides the most extensive support for Java functions. However,
limited support is provided to languages like Python, Jython, JavaScript, Ruby, and Groovy.

Example of Pig UDF

In Pig,

o All UDFs must extend "org.apache.pig.EvalFunc"


o All functions must override the "exec" method.

Let's see an example of a simple EVAL Function to convert the provided string to uppercase.

TestUpper.java

1. package com.hadoop;
2. import java.io.IOException;
3. import org.apache.pig.EvalFunc;
4. import org.apache.pig.data.Tuple;
5. public class TestUpper extends EvalFunc<String> {
6.     // exec() is called once per input tuple and returns the uppercased first field
7.     public String exec(Tuple input) throws IOException {
8.         if (input == null || input.size() == 0)
9.             return null;
10.        try {
11.            String str = (String) input.get(0);
12.            return str.toUpperCase();
13.        } catch (Exception e) {
14.            throw new IOException("Caught exception processing input row ", e);
15.        }
16.    }
17. }
 Create the jar file and export it into the specific directory. For that ,right click on project
- Export - Java - JAR file - Next.
 Now, provide a specific name to the jar file and save it in a local system directory.
 Create a text file in your local machine and insert the list of tuples.

1. $ nano pigsample

Upload the text file to HDFS into a specific directory.

1. $ hdfs dfs -put pigsample /pigexample


o Create a pig file in your local machine and write the script.

1. $ nano pscript.pig
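As a sketch of what pscript.pig might contain (the jar name upper_udf.jar and its location are assumptions; the class name follows the TestUpper example above), the jar is registered first and the UDF is then applied inside FOREACH ... GENERATE:

-- pscript.pig : apply the custom TestUpper UDF to the uploaded data
REGISTER /home/hduser/upper_udf.jar;
DEFINE UPPERCASE com.hadoop.TestUpper();
data  = LOAD '/pigexample' USING PigStorage(',') AS (name:chararray, city:chararray);
upper = FOREACH data GENERATE UPPERCASE(name), city;
DUMP upper;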

Now, run the script in the terminal to get the output.

Hadoop Pig Tutorial – Objective

While it comes to analyze large sets of data, as well as to represent them as data flows, we use
Apache Pig. It is nothing but an abstraction over MapReduce. So, in this Hadoop Pig Tutorial,
we will discuss the whole concept of Hadoop Pig. Apart from its Introduction, it also includes
History, need, its Architecture as well as its Features. Moreover, we will see, some Comparisons
like Pig Vs Hive, Apache Pig Vs SQL and Hadoop Pig Vs MapReduce.

Hadoop Pig Tutorial: A Comprehensive Guide to Pig Hadoop

What is Hadoop Pig?
Hadoop Pig is nothing but an abstraction over MapReduce. While it comes to analyze large sets
of data, as well as to represent them as data flows, we use Apache Pig. Generally, we use it
with Hadoop. By using Pig, we can perform all the data manipulation operations in Hadoop.
In addition, Pig offers a high-level language to write data analysis programs which we call as Pig
Latin. One of the major advantages of this language is, it offers several operators. Through them,
programmers can develop their own functions for reading, writing, and processing data.
It has following key properties such as:
 Ease of programming
Basically, when all the complex tasks comprised of multiple interrelated data transformations are
explicitly encoded as data flow sequences, that makes them easy to write, understand, and
maintain.
 Optimization opportunities
The way in which tasks are encoded permits the system to optimize their execution
automatically, allowing users to focus on semantics rather than efficiency.
 Extensibility
In order to do special-purpose processing, users can create their own functions.
Hence, programmers need to write scripts using Pig Latin language to analyze data using Apache
Pig. However, all these scripts are internally converted to Map and Reduce tasks. It is possible
with a component, we call as Pig Engine.

Hadoop Pig Tutorial – History


Apache Pig was developed as a research project, in 2006, at Yahoo. Basically, to create and
execute MapReduce jobs on every dataset it was created. By Apache incubator, Pig was open
sourced, in 2007. Then the first release of Apache Pig came out in 2008. Further, Hadoop Pig
graduated as an Apache top-level project, in 2010.
4. Why Do We Need Apache Pig?
Programmers who are not so good at Java normally used to struggle while working with Hadoop,
especially when performing MapReduce tasks. Thus, we can say Pig is a boon for all such
programmers because:
 Without having to type complex code in Java, programmers can perform MapReduce tasks
easily using Pig Latin.
 It also helps reduce the length of code, since Pig uses a multi-query approach. Let's
understand it with an example: an operation that would require us to type 200 lines of
code (LoC) in Java can be done by typing as few as 10 LoC in Apache Pig.
Hence, Pig reduces the development time by almost 16 times.
 When you are familiar with SQL, it is easy to learn Pig, because Pig Latin is an SQL-like
language.
 It offers many built-in operators to support data operations such as joins, filters,
ordering, and many more. Also, it offers nested data types that are missing from MapReduce,
such as tuples, bags, and maps.
Hadoop Pig Tutorial – Using Pig
There are several scenarios, where we can use Pig. Such as:
 While data loads are time sensitive.
 Also, while processing various data sources.
 While we require analytical insights through sampling.
Where Not to Use Pig?
Also, there are some Scenarios, where we cannot use. Such as:
 While the data is completely unstructured. Such as video, audio, and readable text.
 Where time constraints exist. Since Pig is slower than MapReduce jobs.
 Also, when more power is required to optimize the codes, we cannot use Pig.

Architecture of Hadoop Pig


Here, the image, which shows the architecture of Apache Pig.

Apache Pig Architecture: Hadoop Pig Tutorial

Now, you can see several components in the Hadoop Pig framework. The major components are:
i. Parser
At first, all Pig scripts are handled by the Parser. The Parser checks the syntax of the
script, does type checking, and performs other miscellaneous checks. Afterward, the Parser's output is a
DAG (directed acyclic graph) that represents the Pig Latin statements and logical
operators. In the DAG (the logical plan), the logical operators of the script are represented as the
nodes and the data flows are represented as edges.
ii. Optimizer
Further, the DAG is passed to the logical optimizer, which carries out logical optimizations such
as projection and pushdown.
iii. Compiler
It compiles the optimized logical plan into a series of MapReduce jobs.
iv. Execution Engine
At last, the MapReduce jobs are submitted to Hadoop in a sorted order. These MapReduce
jobs are finally executed on Hadoop and produce the desired results.
Pig Features
Now in the Hadoop Pig Tutorial is the time to learn the Features of Pig which makes it what it is.
There are several features of Pig. Such as:
i. Rich set of operators
In order to perform several operations, Pig offers many operators, such as join, sort, filter, and
many more.
ii. Ease of programming
If you are good at SQL, it is easy to write a Pig script, because Pig Latin is similar to SQL.
iii. Optimization opportunities
In Apache Pig, all the tasks optimize their execution automatically. As a result, programmers
need to focus only on the semantics of the language.
iv. Extensibility
Through Pig, it is easy to read, process, and write data. It is possible by using the existing
operators. Also, users can develop their own functions.
v. UDFs
By using Pig, we can create User-defined Functions in other programming languages like Java.
Also, can invoke or embed them in Pig Scripts.
vi. Handles all kinds of data
Pig generally analyzes all kinds of data. Even both structured and unstructured. Moreover, it
stores the results in HDFS.

Prerequisites for learning Pig include:
 Basic knowledge of the Linux operating system
 Fundamental programming skills

Pig Vs MapReduce
Some major differences between Hadoop Pig and MapReduce are:
 Apache Pig
It is a data flow language.
 MapReduce
However, it is a data processing paradigm.
 Hadoop Pig
Pig is a high-level language.
 MapReduce
Well, it is a low level and rigid.
 Pig
In Apache Pig, performing a Join operation is pretty simple.
 MapReduce
But, in MapReduce, it is quite difficult to perform a Join operation between datasets.
 Pig
With a basic knowledge of SQL, any novice programmer can work conveniently with Pig.
 MapReduce
But, to work with MapReduce, exposure to Java is essential.
 Hadoop Pig
Generally, it uses multi-query approach, thereby reducing the length of the codes to a great
extent.
 MapReduce
However, MapReduce needs almost 20 times more lines of code to perform the same task.
 Apache Pig
Here, we do not require any compilation. Every Pig operator is converted internally into a
MapReduce job, at the time of execution.
 MapReduce
It has a long compilation process.

Hadoop Pig Vs SQL
Here, are the major differences between Apache Pig and SQL.
 Pig
It is a procedural language.
 SQL
While it is a declarative language.
 Pig
Here, the schema is optional. Without designing a schema, we can store data; values are then
referenced positionally as $0, $1, and so on.
 SQL
In SQL, Schema is mandatory.
 Pig
In Pig, data model is nested relational.
 SQL
In SQL, data model used is flat relational.
 Pig
Here, we have limited opportunity for Query Optimization.
 SQL
While here we have more opportunity for query optimization.
Also, Apache Pig Latin −
 Offer splits in the pipeline.
 Provides developers to store data anywhere in the pipeline.
 It also declares execution plans.
 Offers operators to perform ETL (Extract, Transform, and Load) functions.
Any doubt yet in Hadoop Pig Tutorial. Please Comment.

Apache Pig Vs Hive
Basically, to create MapReduce jobs, we use both Pig and Hive. Also, we can say that, at
times, Hive operates on HDFS just as Pig does. So, here we list a few significant points
that set Apache Pig apart from Hive.
 Hadoop Pig
Pig Latin is a language, Apache Pig uses. Originally, it was created at Yahoo.
 Hive
HiveQL is a language, Hive uses. It was originally created at Facebook.
 Pig
It is a data flow language.
 Hive
Whereas, it is a query processing language.
 Pig
Moreover, it is a procedural language which fits in pipeline paradigm.
 Hive
It is a declarative language.
 Apache Pig
Also, can handle structured, unstructured, and semi-structured data.
 Hive
Whereas, it is mostly for structured data.

Pig vs. Hive | Difference between Pig and Hive


Applications of Pig
For performing tasks involving ad-hoc processing and quick prototyping, data scientists
generally use Apache Pig. More of its applications are:
1. In order to process huge data sources like weblogs.
2. Also, to perform data processing for search platforms.
3. Moreover, to process time sensitive data loads.
So, this was all on Hadoop Pig Tutorial. Hope you like our explanation.

Apache Pig

This Apache Pig tutorial provides the basic introduction to Apache Pig – high-level
tool over MapReduce. This tutorial helps professionals who are working on Hadoop and would
like to perform MapReduce operations using a high-level scripting language instead of
developing complex codes in Java.

Introduction to Pig Training Tutorial Data Flair


Apache Pig Introduction
a. History of Apache Pig

As a research project at Yahoo in the year 2006, Apache Pig was developed in order to create
and execute MapReduce jobs on large data-sets. In 2007 Apache Pig was open sourced, later in
2008, Apache Pig’s first release came out.

b. Introduction to Apache Pig

Pig was created to simplify the burden of writing complex Java codes to perform MapReduce
jobs. Earlier Hadoop developers have to write complex java codes in order to perform data
analysis. Apache Pig provides a high-level language known as Pig Latin which helps Hadoop
developers to write data analysis programs. By using various operators provided by Pig Latin
language programmers can develop their own functions for reading, writing, and processing data.
In order to perform analysis using Apache Pig, programmers have to write scripts using Pig Latin
language to process data stored in Hadoop Distributed File System. Internally, all these scripts
are converted to Map and Reduce tasks. A component known as Pig Engine is present inside

Apache Pig, in which Pig Latin scripts are taken as input and converted into MapReduce jobs.

c. Need for Pig

For all those Programmers who are not so good at Java normally, have to struggle a lot for
working with Hadoop, especially when they need to perform any MapReduce tasks. Apache Pig
comes up as a helpful tool for all such programmers. There is no need of developing complex
Java codes to perform MapReduce tasks. By simply writing Pig Latin scripts programmers can
now easily perform MapReduce tasks without having need of writing complex codes in Java.
Apache Pig reduces the length of codes by using multi-query approach. For example, to perform
an operation we need to write 200 lines of code in Java that we can easily perform just by typing
less than 10 lines of code in Apache Pig. Hence, ultimately our almost 16 times development
time gets reduced using Apache Pig. If developers have knowledge of SQL language, then it is
very easy to learn Pig Latin language as it is similar to SQL language. Many built-in operators
are provided by Apache Pig to support data operations like filters, joins, ordering, etc.
In addition, nested data types like tuples, bags, and maps which are not present in MapReduce
are also provided by Pig.

d. Features of Pig

Apache Pig comes with the below unique features:

Apache Pig Features

Rich Set of Operators: Pig provides a rich set of operators to perform operations such as join,
filter, sort, and many more.

Ease of Programming: Pig Latin is similar to SQL and hence it becomes very easy for
developers to write a Pig script. If you have knowledge of SQL language, then it is very easy to
learn Pig Latin language as it is similar to SQL language.

Optimization opportunities: The execution of the task in Apache Pig gets automatically
optimized by the task itself, hence the programmers need to only focus on the semantics of the
language.

Extensibility: By using the existing operators, users can easily develop their own functions to
read, process, and write data.

User Define Functions (UDF’s): With the help of facility provided by Pig of creating UDF’s,
we can easily create User Defined Functions on a number of programming languages such as
Java and invoke or embed them in Pig Scripts.

All types of data handling: Analysis of all types of Data (i.e. both structured as well as
unstructured) is provided by Apache Pig and the results are stored inside HDFS.
Apache Pig Installation on Ubuntu – A Pig Tutorial
Install Pig

This Pig tutorial briefs how to install and configure Apache Pig. Apache Pig is an abstraction
over MapReduce. Pig is basically a tool to easily perform analysis of larger sets of data by
representing them as data flows

Download and Install Pig on Ubuntu Training Tutorial Data Flair

Apache Pig Installation on Ubuntu

I. Pre-Requisite to Install Pig

You must have Hadoop and Java JDK installed on your system. Hence, before installing Pig you
should install Hadoop and Java by following the steps given in this installation guide.
ii. Downloading Pig

You can download Pig file from the below link:


https://archive.cloudera.com/cdh5/cdh/5/
Since hadoop-2.5.0-cdh5.3.2 is already installed on the system, the matching Pig version,
pig-0.12.0-cdh5.3.2, is the one downloaded from there.
iii. Installing Pig

The steps for Apache Pig installation are given below:


Step 1:
Move the downloaded pig-0.12.0-cdh5.3.2.tar file from Downloads folder to the Directory where
you had installed Hadoop.
Step 2:
Untar the pig-0.12.0-cdh5.3.2.tar file by executing the below command on your terminal:
dataflair@ubuntu:~$ tar zxvf pig-0.12.0-cdh5.3.2.tar
Step 3:
Now we need to configure Pig. In order to configure Pig, we need to edit the ".bashrc" file. To edit
this file, execute the below command:
dataflair@ubuntu:~$ nano .bashrc
And in this file we need to add the following:
export PATH=$PATH:/home/dataflair/pig-0.12.0-cdh5.3.2/bin
export PIG_HOME=/home/dataflair/pig-0.12.0-cdh5.3.2
export PIG_CLASSPATH=$HADOOP_HOME/conf

Apache Pig Installation_Bashrc File
After adding the above parameters save this file by using “CTRL+X” and then “Y” on your
keyboard.
Step 4:
Update the .bashrc file by executing the below command:
dataflair@ubuntu:~$ source .bashrc
After refreshing the .bashrc file, Pig is successfully installed. In order to check the version of
your Pig installation, execute the below command:
dataflair@ubuntu:~$ pig -version
If the below output appears, it means you have successfully configured Pig.

Apache Pig Version
iv. Starting Pig

We can start Pig in one of the following two modes mentioned below:

1. Local Mode
2. Cluster Mode
To start Pig in local mode, the '-x local' option is used, whereas executing only the "pig"
command without any options starts Pig in cluster mode. When running in local mode, Pig
can only access files present on the local file system, whereas when started in cluster mode,
Pig can access files present in HDFS.
To start Pig in Local Mode execute the below command:
dataflair@ubuntu:~$ pig -x local
And if you get the below output, that means Pig started successfully in Local mode.

Running_Pig_Local_mode
To start Pig in Cluster Mode execute the below command:
dataflair@ubuntu:~$ pig
And if you get the below output, that means Pig started successfully in Cluster mode.

1. Apache Pig Architecture
In order to write a Pig script, we do require a Pig Latin language. Moreover, we need an
execution environment to execute them. So, in this article “Introduction to Apache Pig
Architecture”, we will study the complete architecture of Apache Pig. It includes its
components, Pig Latin Data Model and Pig Job Execution Flow in depth.

Apache Pig Architecture – Learn Pig Hadoop Working

What is Apache Pig Architecture?

In Pig, there is a language we use to analyze data in Hadoop. That is what we call Pig Latin.
Also, it is a high-level data processing language that offers a rich set of data types and operators
to perform several operations on the data.
Moreover, in order to perform a particular task, programmers need to write a Pig script using the
Pig Latin language and execute them using any of the execution mechanisms (Grunt Shell,
UDFs, Embedded) using Pig. To produce the desired output, these scripts will go through a
series of transformations applied by the Pig Framework, after execution. Further, Pig converts
these scripts into a series of MapReduce jobs internally. Therefore it makes the programmer’s
job easy. Here, is the architecture of Apache Pig.

Architecture of Apache Pig

Apache Pig Components


There are several components in the Apache Pig framework. Let’s study these major components
in detail:
I. Parser
At first, all the Pig Scripts are handled by the Parser. Parser basically checks the syntax of the
script, does type checking, and other miscellaneous checks. Afterwards, Parser’s output will be a
DAG (directed acyclic graph) that represents the Pig Latin statements as well as logical
operators.
The logical operators of the script are represented as the nodes and the data flows are represented
as edges in DAG (the logical plan)
ii. Optimizer
Afterwards, the logical plan (DAG) is passed to the logical optimizer. It carries out the logical
optimizations further such as projection and push down.
iii. Compiler
Then compiler compiles the optimized logical plan into a series of MapReduce jobs.
iv. Execution engine
Eventually, all the MapReduce jobs are submitted to Hadoop in a sorted order. Ultimately, it
produces the desired results while these MapReduce jobs are executed on Hadoop.

Pig Latin Data Model

Apache Pig architecture – Pig Latin Data Model


Pig Latin data model is fully nested. Also, it allows complex non-atomic data types like map and
tuples. Let’s discuss this data model in detail:

I. Atom
Atom is defined as any single value in Pig Latin, irrespective of their data. Basically, we can use
it as string and number and store it as the string. Atomic values of Pig are int, long, float, double,
char array, and byte array. Moreover, a field is a piece of data or a simple atomic value in Pig.
For Example − ‘Shoba’ or ‘25’
ii. Tuple
A tuple is a record that is formed by an ordered set of fields. However, the fields can be of any
type. In addition, a tuple is similar to a row in a table of an RDBMS.
For Example − (Shubham, 25)
iii. Bag
An unordered set of tuples is what we call a Bag. To be more specific, a Bag is a collection of
tuples (non-unique). Moreover, each tuple can have any number of fields (flexible schema).
Generally, we represent a bag by '{}'.
For Example − {(Shubham, 25), (Pulkit, 35)}
In addition, when a bag is a field in a relation, it is known as an inner bag.
Example − {Shubham, 25, {9826022258, [email protected],}}
iv. Map
A set of key-value pairs is what we call a map (or data map). Basically, the key needs to be of
type char array and should be unique. Also, the value might be of any type. And, we represent
it by ‘[]’
For Example − [name#Shubham, age#25]
v. Relation
A bag of tuples is what we call Relation. In Pig Latin, the relations are unordered. Also, there is
no guarantee that tuples are processed in any particular order.
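A short sketch (file name, delimiter, and fields are assumptions) of how these data model elements show up in a script: the LOAD produces a relation of tuples, and GROUP produces, for each key, a tuple whose second field is a bag of the original tuples.

Employees = LOAD 'Employee.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
ByCity    = GROUP Employees BY city;            -- each record: (group, {bag of Employee tuples})
Counts    = FOREACH ByCity GENERATE group, COUNT(Employees);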
So, this was all in Apache Pig Architecture. Hope you like our explanation.

1. Pig Latin Tutorial
Apache Pig offers High-level language like Pig Latin to perform data analysis programs. So, in
this Pig Latin tutorial, we will discuss the basics of Pig Latin. Such as Pig Latin statements, data
types, general operators, and Pig Latin UDF in detail. Also, we will see its examples to
understand it well.

What is Pig Latin?


While we need to analyze data in Hadoop using Apache Pig, we use Pig Latin language.
Basically, first, we need to transform Pig Latin statements into MapReduce jobs using an
interpreter layer. In this way, the Hadoop process these jobs.
However, we can say, Pig Latin is a very simple language with SQL like semantics. It is possible
to use it in a productive manner. It also contains a rich set of functions. Those exhibits data
manipulation. Moreover, by writing user-defined functions (UDF) using Java, we can extend
them easily. That implies they are extensible in nature.
Data Model in Pig Latin
The data model of Pig is fully nested. In addition, the outermost structure of the Pig Latin data
model is a Relation, which is a bag. Here:
 A bag is a collection of tuples.
 A tuple is an ordered set of fields.
 A field is a piece of data.

Statements in Pig Latin
Also, make sure, statements are the basic constructs while processing data using Pig Latin.
 Basically, statements work with relations. Also, includes expressions and schemas.
 Here, every statement ends with a semicolon (;).
 Moreover, through statements, we will perform several operations using operators, those are
offered by Pig Latin.
 Except for LOAD and STORE, Pig Latin statements take a relation as input and produce
another relation as output.
 As soon as we enter a LOAD statement in the Grunt shell, only its semantic checking is
carried out. To see the contents of the relation, we need to use the DUMP operator; the
MapReduce job for loading the data from the file system is carried out only after the
dump operation is performed.
Pig Latin Example –
Here, is a Pig Latin statement. Basically, that loads data to Apache Pig.
grunt> Employee_data = LOAD 'Employee_data.txt' USING PigStorage(',') AS

(id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
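Continuing the example (a sketch based on the statement above), entering the LOAD only triggers semantic checking; DESCRIBE then shows the schema, and it is the DUMP that actually launches the MapReduce job:

grunt> DESCRIBE Employee_data;
grunt> DUMP Employee_data;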

5. Pig Latin Datatypes

Further, is the list of Pig Latin data types. Such as:


 int
“Int” represents a signed 32-bit integer.
For Example: 10
 long
It represents a signed 64-bit integer.
For Example: 10L
 float
This data type represents a signed 32-bit floating point.
For Example: 10.5F
 double
“Double” represents a 64-bit floating point.
For Example: 10.5
 chararray

It represents a character array (string) in Unicode UTF-8 format.
For Example: ‘Data Flair’
 Bytearray
This data type represents a Byte array (blob).
 Boolean
“Boolean” represents a Boolean value.
For Example: true/ false.
Note: It is case insensitive.
 Datetime
It represents a date-time.
For Example: 1970-01-01T00:00:00.000+00:00
 Biginteger
This data type represents a Java BigInteger.
For Example: 60708090709
 Bigdecimal
“Bigdecimal” represents a Java BigDecimal
For Example: 185.98376256272893883
I.Complex Types

 Tuple
An ordered set of fields is what we call a tuple.
For Example: (Ankit, 32)
 Bag
A collection of tuples is what we call a bag.
For Example: {(Ankit, 32), (Neha, 30)}
 Map
A set of key-value pairs is what we call a Map.
Example: [ ‘name’#’Ankit’, ‘age’#32]
ii. Null Values

It is possible that values for all the above data types can be NULL. However, SQL and Pig treat
null values in the same way.
On defining a null Value, It can be an unknown value or a non-existent value. Moreover, we use
it as a placeholder for optional values. Either, these nulls can be the result of an operation or it
can occur naturally.
Pig Latin Arithmetic Operators
Here, is the list of arithmetic operators of Pig Latin. Let’s assume, value of A = 20 and B = 40.
 +
Addition − It simply adds values on either side of the operator.
For Example: 60, it comes to adding A+B.
 −
Subtraction – This operator subtracts right-hand operand from left-hand operand.
For Example: −20, it comes on subtracting A-B
 *
Multiplication − It simply Multiplies values on either side of the operators.
For Example: 800, it comes to multiplying A*B.
 /
Division − This operator divides left-hand operand by right-hand operand
For Example: 2, it comes to dividing, b/a
 %
Modulus − It Divides left-hand operand by right-hand operand and returns the remainder
For Example: 0, it comes to dividing, b % a.
 ?:
Bincond − This operator evaluates a Boolean expression. It has three operands, used as:
variable x = (expression) ? value1 (if true) : value2 (if false).
For Example:
b = (a == 1) ? 20 : 40;

If a = 1 the value of b is 20.

If a != 1 the value of b is 40.

 CASE
WHEN
THEN
ELSE END
Case − It is equivalent to the nested bincond operator.
For Example- CASE f2 % 2
WHEN 0 THEN ‘even’
WHEN 1 THEN ‘odd’
END
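A small sketch showing both operators inside a FOREACH (the file, field names, and the pass mark of 40 are assumptions for illustration):

students = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, marks:int);
graded   = FOREACH students GENERATE name,
               (marks >= 40 ? 'pass' : 'fail') AS result,
               (CASE marks % 2 WHEN 0 THEN 'even' ELSE 'odd' END) AS parity;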
Comparison Operators in Pig Latin

Here, is the list of the comparison operators of Pig Latin. Let’s assume, value of A = 20 and B =
40.
 ==
Equal − this operator checks if the values of two operands are equal or not. So, if yes, then the
condition becomes true.
For Example- (a = b) is not true
 !=
Not Equal − It will check if the values of two operands are equal or not. So, if the values are not
equal, then condition becomes true.
For Example- (a! = b) is true.
 >
Greater than − this operator checks if the value of the left operand is greater than the value of the
right operand. Hence, if yes, then the condition becomes true.
For Example- (a > b) is not true.
 <
Less than − It simply checks if the value of the left operand is less than the value of the right
operand. So, if yes, then the condition becomes true.
For Example- (a < b) is true.
 >=
Greater than or equal to − It will check if the value of the left operand is greater than or equal to
the value of the right operand. Hence, if yes, then the condition becomes true.
For Example- (a >= b) is not true.
 <=
Less than or equal to − this operator checks if the value of the left operand is less than or equal to
the value of the right operand. So, if yes, then the condition becomes true.
For Example- (a <= b) is true.
 matches

Pattern matching − It simply checks whether the string in the left-hand side matches with the
constant in the right-hand side.
For Example- f1 matches ‘.*data flair.*’
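These comparison operators are typically used inside FILTER, as in the sketch below (the file weblogs.txt and its columns are assumptions):

logs    = LOAD 'weblogs.txt' USING PigStorage('\t') AS (url:chararray, status:int, bytes:long);
errors  = FILTER logs BY status >= 400 AND status != 404;
matched = FILTER logs BY url matches '.*dataflair.*';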
Type Construction Operators

Here, is the list of the Type construction operators of Pig Latin.


 ()
Tuples constructor operator − To construct a tuples, we use this operator.
For Example- (Ankit, 32)
 {}
Bag constructor operator − Moreover, to construct a bag, we use this operator.
For Example- {(Ankit, 32), (Neha, 30)}
 []
Map constructor operator − In order to construct a tuple, we use this operator.
For Example- [name#Ankit, age#32]
So, this was all in Pig Latin Tutorial. Hope you like our explanation.

Pig Latin Operators and Statements

In our previous blog, we have seen Apache Pig introduction and pig architecture in detail. Now
this article covers the basics of Pig Latin Operators such as comparison, general and relational
operators. Moreover, we will also cover the type construction operators as well. We will also
discuss the Pig Latin statements in this blog with an example.

Pig Latin Operators and Statements – A Complete Guide
What is Pig Latin?

Pig Latin is the language which analyzes the data in Hadoop using Apache Pig. An interpreter
layer transforms Pig Latin statements into MapReduce jobs. Then Hadoop process these jobs
further. Pig Latin is a simple language with SQL like semantics. Anyone can use it in a
productive manner. Latin has a rich set of functions. These functions exhibit data manipulation.
Furthermore, they are extensible by writing user-defined functions (UDF) using java.
Pig Latin Operators

a. Arithmetic Operators

These pig latin operators are basic mathematical operators.


Operator   Description   Example

+    Addition − It adds the values on either side of the operator.
     Example: if a = 10, b = 30, then a + b gives 40.

−    Subtraction − It subtracts the right-hand operand from the left-hand operand.
     Example: if a = 40, b = 30, then a − b gives 10.

*    Multiplication − It multiplies the values on either side of the operator.
     Example: if a = 40, b = 30, then a * b gives 1200.

/    Division − It divides the left-hand operand by the right-hand operand.
     Example: if a = 40, b = 20, then a / b gives 2.

%    Modulus − It divides the left-hand operand by the right-hand operand and returns the remainder.
     Example: if a = 40, b = 30, then a % b gives 10.

?:   Bincond − It evaluates a Boolean expression. It has three operands:
     variable x = (expression) ? value1 (if true) : value2 (if false).
     Example: b = (a == 1) ? 40 : 20; if a = 1 the value is 40, if a != 1 the value is 20.

CASE WHEN THEN ELSE END
     Case − This operator is equivalent to the nested bincond.
     Example: CASE f2 % 2 WHEN 0 THEN ‘even’ WHEN 1 THEN ‘odd’ END
Comparison Operators

This table contains the comparison operators of Pig Latin.


Operator   Description   Example

==        Equal − This operator checks whether the values of two operands are equal. If yes, the
          condition becomes true.
          Example: if a = 10, b = 20, then (a == b) is not true.

!=        Not Equal − It checks whether the values of two operands are equal. If the values are not
          equal, the condition becomes true.
          Example: if a = 10, b = 20, then (a != b) is true.

>         Greater than − It checks whether the value of the left operand is greater than the value of
          the right operand. If yes, the condition becomes true.
          Example: if a = 10, b = 20, then (a > b) is not true.

<         Less than − It checks whether the value of the left operand is less than the value of the
          right operand. If yes, the condition becomes true.
          Example: if a = 10, b = 20, then (a < b) is true.

>=        Greater than or equal to − It checks whether the value of the left operand is greater than
          or equal to the value of the right operand. If yes, the condition becomes true.
          Example: if a = 20, b = 50, then (a >= b) is not true.

<=        Less than or equal to − It checks whether the value of the left operand is less than or
          equal to the value of the right operand. If yes, the condition becomes true.
          Example: if a = 20, b = 20, then (a <= b) is true.

matches   Pattern matching − It checks whether the string on the left-hand side matches the
          regular-expression constant on the right-hand side.
          Example: f1 matches ‘.*df.*’

Type Construction Operators

The following table describes the type construction operators of Pig Latin.

Operator   Description   Example

()   Tuple constructor operator − This operator constructs a tuple.
     Example: (Dataflair, 20)

{}   Bag constructor operator − This operator constructs a bag.
     Example: {(Dataflair, 10), (training, 25)}

[]   Map constructor operator − This operator constructs a map.
     Example: [name#DF, age#12]

Relational Operators

The following table describes the relational operators of Pig Latin.

Operator   Description

Loading and Storing
LOAD       It loads data from the file system into a relation.
STORE      It stores a relation to the file system (local/HDFS).
Filtering

FILTER     It removes unwanted rows from a relation.

DISTINCT   It removes duplicate rows from a relation.

FOREACH, GENERATE   It transforms the data based on columns of data.

STREAM     It transforms a relation using an external program.

Grouping and Joining

JOIN       It joins two or more relations.

COGROUP    It groups the data in two or more relations.
GROUP      It groups the data in a single relation.

CROSS      It creates the cross product of two or more relations.


Sorting

ORDER It arranges a relation in an order based on one or more fields.

LIMIT We can get a particular number of tuples from a relation.

Combining and Splitting

UNION We can combine two or more relations into one relation.

SPLIT To split a single relation into more relations.

Diagnostic Operators

DUMP It prints the content of a relationship through the console.

DESCRIBE It describes the schema of a relation.

EXPLAIN We can view the logical, physical execution plans to evaluate a relation.

ILLUSTRATE It displays all the execution steps as the series of statements.
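The sketch below (file names, fields, and delimiter are assumptions) chains several of these relational operators in the order they typically appear in a script:

customers = LOAD 'customers.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
orders    = LOAD 'orders.txt'    USING PigStorage(',') AS (oid:int, cust_id:int, amount:double);
joined    = JOIN customers BY id, orders BY cust_id;
by_city   = GROUP joined BY customers::city;
totals    = FOREACH by_city GENERATE group AS city, SUM(joined.orders::amount) AS total;
ordered   = ORDER totals BY total DESC;
top3      = LIMIT ordered 3;
STORE top3 INTO 'top_cities' USING PigStorage(',');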

Pig Latin – Statements

The statements are the basic constructs while processing data using Pig Latin.
 The statements can work with relations including expressions and schemas.
 However, every statement terminates with a semicolon (;).
 We will perform different operations using Pig Latin operators.
 Pig Latin statements input a relation and produce some other relation as output.
 The semantic checking starts as soon as we enter a LOAD statement in the Grunt shell. We use
the Dump operator to view the contents of a relation; the MapReduce job for loading the
data from the file system runs only after the dump operation.

For Example
Following is a Pig Latin statement; it loads the data to Apache Pig.
grunt> Sample_data = LOAD 'sample_data.txt' USING PigStorage(',') AS
( id:int, name:chararray, contact:chararray, city:chararray );
So, this was all in Pig Latin Tutorial. Hope you like our explanation.

Apache Pig Architecture and Execution Modes

In this article, we will cover the Apache Pig Architecture. It is actually developed on top
of Hadoop. Moreover, we will see the various components of Apache Pig and the Pig Latin Data
Model. Apache Pig provides a high-level language. We will also see the two modes in which
this component runs.

Introduction to Apache Pig Architecture


What is Apache Pig Architecture?

The language which analyzes data in Hadoop using Pig called as Pig Latin. Therefore, it is a
high-level data processing language. While it provides a wide range of data types and operators
to perform data operations. To perform a task using Pig, programmers need to write a Pig script
using the Pig Latin language. They execute them with any of the execution mechanisms such as
(Grunt Shell, UDFs, and Embedded). These scripts will also go through a series of
transformations after execution. Moreover, the Pig Framework produces the desired output.
Apache Pig converts these scripts into many MapReduce jobs. Thus, it makes the job easy for
developers.
Components of Apache Pig
There are various components in Apache Pig Architecture which makes its execution faster as
discussed below:

Components of Apache Pig
a. Parser

The Parser handles the Pig scripts and checks the syntax of the script. It includes type checking
and other checks. The output of the parser is a directed acyclic graph (DAG), which
represents the Pig Latin statements and logical operators.
In the DAG, the script operators are represented as the nodes, and the data flows
are represented as edges.
b. Optimizer

The logical optimizer then receives the logical plan (DAG). In fact, it carries out the logical
optimization such as projection and push down.
c. Compiler

The compiler converts the logical plan into a series of MapReduce jobs
d. Execution Engine

In the end, the MapReduce jobs get submitted to Hadoop in a sorted order. Therefore these
MapReduce jobs execute on the Hadoop and produce the desired results.

Pig Latin Data Model

There is a complete nested data model of Pig Latin. Meanwhile, it allows complex non-atomic
data types such as map and tuple.

a. Field and Atom

Atom is a single value in Pig Latin, with any data type. The storage occurs in form of string and
we can also use it as string and number. Various atomic values of Pig are int, long, float, double,
char array, and byte array. Furthermore, any simple atomic value or data is actually considered as
a field.
For Example − ‘data flair’ or ‘12’

b. Tuples

A record which contains an ordered set of fields is a Tuple. The fields can be of any
type. A tuple is the same as a row in a table of an RDBMS.
For Example − (Dataflair, 12)
c. Bag

A bag contains an unordered set of tuples. Therefore, a collection of (non-unique) tuples can
be a bag. Each tuple may have any number of fields. We can represent a bag as '{}'. It is similar
to a table in an RDBMS. However, it is not necessary that every tuple contains the same fields,
and the fields in the same position (column) need not have the same type.
Example − {(Dataflair, 12), (Training, 11)}
A bag can also be a field in a relation; in that case it is known as an inner bag.
Example − {Dataflair, 12, {1212121212, [email protected],}}
d. Map

A map (or data map) contains the set of many key-value pairs. Meanwhile, the key has to be of
type char array and unique. The value can be of any type. We can represent it by ‘[]’.
Example − [name#Dataflair, age#11]
e. Relation

Furthermore, a relation contains the bag of tuples. There may be no serial order of processing in
the relations.
Job Execution Flow
The developer creates the scripts, and then it goes to the local file system as functions. Moreover,
when the developers submit Pig Script, it contacts with Pig Latin Compiler. The compiler then
splits the task and run a series of MR jobs. Meanwhile, Pig Compiler retrieves data from
the HDFS. The output file again goes to the HDFS after running MR jobs.

a. Pig Execution Modes

We can run Pig in two execution modes. These modes depend upon where the Pig script is going
to run. It also depends on where the data is residing. We can thus store data on a single machine
or in a distributed environment like clusters. The three different ways to run Pig programs are:
non-interactive shell or script mode, where the user creates a file, loads the code, and executes the
script; the Grunt shell or interactive shell, for running Apache Pig commands interactively;
and embedded mode, where Pig is invoked from Java programs, much as JDBC is used to run
SQL programs from Java.

b. Pig Local mode

However, in this mode, Pig runs in a single JVM and accesses the local file system. This mode is
better for dealing with small data sets; parallel mapper execution is not possible here.
(Older versions of Hadoop are not thread-safe.)
The user can pass -x local to get into Pig's local mode of execution. Pig then always
looks for local file system paths while loading data.
c. Pig Map Reduce Mode

In this mode, the user should have a proper Hadoop cluster setup and installation. By default,
Apache Pig runs in MR mode. Pig translates the queries into MapReduce jobs and runs them
on top of the Hadoop cluster, so this mode runs on a distributed cluster.
Statements like LOAD and STORE read data from the HDFS file system and show the
output; these statements are also used to process data.
d. Storing Results

Intermediate data is generated during the processing of MR jobs. Pig stores this data in a non-
permanent location on HDFS storage; a temporary location is created inside HDFS for
storing this intermediate data.
We can use DUMP to get the final results on the output screen. The output results are stored
using the STORE operator.

Apache Pig Features


After learning the complete Introduction of Apache Pig, in this article, we will discuss the 12
best features of Apache Pig. There are many features of Apache Pig such as ease of

programming, Handles all kinds of data, Extensibility and many more. So, in this blog, “Apache
Pig Features” we will discuss all these features in detail and try to understand why Pig should be
chosen.

Top 12 Apache Pig Features You Must Know


Top 12 Hadoop Pig Features
There are lot many Apache Pig features. Let’s discuss them one by one:
I. Rich set of operators
One of the major advantages is that, in order to perform several operations, there is a huge set of
operators offered by Apache Pig, such as join, sort, filter, etc.

ii. Ease of programming


Basically, for SQL Programmer, Pig Latin is a boon. It is as similar to SQL. Hence, if you are
good at SQL it is easy to write a Pig script.

iii. Optimization opportunities


Also, it’s a benefit working here because in Apache Pig the tasks optimize their execution
automatically. Hence, as a result, programmers only need to focus on the semantics of the
language.

iv. Extensibility
Extensibility is one of the most interesting features it has. It means users can develop their own
functions to read, process, and write data, using the existing operators.

v. UDF’s
It is also a very amazing feature that it offers the facility to create User-defined Functions in
other programming languages like Java. Meanwhile, invoke or embed them in Pig Scripts.

vi. Handles all kinds of data
Handling all kinds of data is one of the reasons for easy programming. That means it analyzes all
kinds of data. Either structured or unstructured. Also, it stores the results in HDFS.
vii. Join operation
In Apache Pig, performing a Join operation is pretty simple.

viii. Multi-query approach


Apache Pig uses multi-query approach. Basically, this reduces the length of the codes to a great
extent.

ix. No need for compilation


Here, we do not require any compilation. Since every Apache Pig operator is converted
internally into a MapReduce job on execution.

x. Optional Schema
However, the schema is optional, in Apache Pig. Hence, without designing a schema we can
store data. So, values are stored as $01, $02 etc.

xi. Pipeline
Apache Pig Latin allows splits in the pipeline

xii. Data flow language


Apache Pig is data flow language.

Apache Pig Advantages and Disadvantages


1. Apache Pig Pros and Cons

As we all know, we use Apache Pig to analyze large sets of data, as well as to represent them as
data flows. However, Pig attains many more advantages in it. In the same place, there are some
disadvantages also. So, in this article “Pig Advantages and Disadvantages”, we will discuss all
the advantages as well as disadvantages of Apache Pig in detail.

Pig Advantages and Disadvantages | Apache Pig Pros and Cons
a. Advantages of Apache Pig
First, let’s check the benefits of Apache Pig –

1. Less development time


2. Easy to learn
3. Procedural language
4. Dataflow
5. Easy to control execution
6. UDFs
7. Lazy evaluation
8. Usage of Hadoop features
9. Effective for unstructured
10. Base Pipeline
I. Less development time

It consumes less time while development. Hence, we can say, it is one of the major advantages.
Especially considering vanilla MapReduce jobs’ complexity, time-spent, and maintenance of the
programs.
Ii. Easy to learn

Well, the Learning curve of Apache Pig is not steep. That implies anyone who does not know
how to write vanilla MapReduce or SQL for that matter could pick up and can write MapReduce
jobs.
iii. Procedural language

Apache Pig is a procedural language, not declarative, unlike SQL. Hence, we can easily follow
the commands. Also, offers better expressiveness in the transformation of data in every step.

Moreover, while we compare it to vanilla MapReduce, it is much more like the English
language. In addition, it is very concise and unlike Java but more like Python.
iv. Dataflow

It is a data flow language. That means everything here is about data, even though we sacrifice
control structures like for-loops or if-statements. Data transformation is a first-class citizen;
we cannot create loops without data, and we always transform and manipulate data.
v. Easy to control execution
We can control the execution of every step because Pig is procedural in nature. A further benefit
is that it is straightforward: we can write our own UDF (User Defined Function) and inject it at
one specific point in the pipeline.
vi. UDFs

It is possible to write our own UDFs.


vii. Lazy evaluation

As the name suggests, nothing is evaluated until you produce an output file or output a
message. This is a benefit of the logical plan: Pig can optimize the program from beginning
to end, and the optimizer can produce an efficient plan to execute.
viii. Usage of Hadoop features

Through Pig, we can enjoy everything that Hadoop offers. Such as parallelization, fault-tolerance
with many relational database features.
ix. Effective for unstructured

Pig is quite effective for unstructured and messy large datasets. Basically, Pig is one of the best
tools to make the large unstructured data to structure.

x. Base pipeline

Here, we have UDFs which we want to parallelize and utilize for large amounts of data. That
means we can use Pig as a base pipeline where it does the hard work. For that, we just apply our
UDF in the step that we want.
b. Limitations of Apache Pig
Now, have a look at Apache Pig disadvantages –

1. Errors of Pig

2. Not mature
3. Support
4. Minor one
5. Implicit data schema
6. Delay in execution
I. Errors of Pig

The errors that Pig produces due to (Python) UDFs are not helpful at all. At times, when something
goes wrong, it just gives an error such as "exec error in UDF", even if the problem is a syntax or
type error, let alone a logical one.
ii. Not mature

Pig is still in the development, even if it has been around for quite some time.
iii. Support

Generally, Google and Stack Overflow do not lead good solutions for the problems.
iv. Implicit data schema

In Apache Pig, the data schema is not enforced explicitly but implicitly, which is a huge
disadvantage. Because it does not enforce an explicit schema, sometimes a data structure silently
becomes bytearray, which is a "raw" data type. Until we coerce the fields, even strings turn
into bytearray without notice, and this propagates to the other steps of the data processing.
v. Minor one

There is no good IDE or Vim plug-in that offers more functionality than syntax completion for
writing Pig scripts.
vi. Delay in execution

The commands are not executed unless we dump or store an intermediate or final result. This
increases the iteration time between debugging and resolving an issue.
So, this was all on Pig Advantages and Disadvantages.

Apache Pig Careers Scope – Latest Salary Trends 2019

The document “Apache Pig Careers Scope with Latest Salary trends”, shows the popularity
of Apache Pig along with the latest salary trends. Since, to analyze large sets of data
representing them as data flow or to perform manipulation operations in Hadoop, we use Pig.
Many companies are adopting Apache Pig very rapidly. That means Careers in Pig and Pig Jobs

are increasing day by day. So, this article includes all this information. Also, we will see why we
should learn Apache Pig.

Apache Pig Careers Scope – with Latest Salary Trends


Pig Careers Opportunities
Apache Pig is an abstraction over MapReduce which is also a tool/platform to perform large
sets of data representing them as data flows. Moreover, we can perform all the data manipulation
operations in Hadoop using Pig.
Initially, as a research project at Yahoo, Apache Pig was developed, in the year 2006 to create
and execute MapReduce jobs on every dataset. Then via Apache incubator, Apache Pig was open
sourced, in the year 2007. The first release of Apache Pig came out, in the year in 2008. Further,
Apache Pig graduated as an Apache top-level project, in 2010.

Companies Using Apache Pig


Apache Pig has a career in many companies like:
 Salesforce.com
 AXA
 GlaxoSmithKline
 Slingshot Aerospace
 BHP Billiton
Latest – Apache Pig Salary Trends
For manipulation operations in Hadoop, Pig has been a buzzword lately. Given the high demand,
the pay package for professionals with Pig skills is on par with industry standards. The figures
below cover only the U.S., but they give a good idea of Pig salary trends.

Apache Pig Grunt Shell Commands
1. Apache Pig Grunt Shell
There are so many shell and utility commands offered by the Apache Pig Grunt Shell. So, in this
article “Introduction to Apache Pig Grunt Shell”, we will discuss all shell and utility commands
in detail.

Apache Pig Grunt Shell


Introduction to Apache Pig Grunt Shell
We can run your Pig scripts in the shell after invoking the Grunt shell. Moreover, there are
certain useful shell and utility commands offered by the Grunt shell. So, let’s discuss all
commands one by one.

Apache Pig Grunt Shell Commands


In order to write Pig Latin scripts, we use the Grunt shell of Apache Pig. Before that, note that we
can invoke any shell or file system command using sh and fs.
I. sh Command
We can invoke any shell command from the Grunt shell using the sh command. But note that we
cannot execute commands that are part of the shell environment (e.g. cd) using the sh
command.
Syntax
The syntax of the sh command is:
grunt> sh shell_command parameters

Example
By using the sh option, we can invoke the ls command of the Linux shell from the Grunt shell.
Here, it lists out the files in the /pig/bin/ directory.
grunt> sh ls

Pig

pig_1444799121955.log

pig.cmd

pig.py

ii. fs Command
Moreover, we can invoke any FsShell command from the Grunt shell by using the fs command.
Syntax
The syntax of the fs command is:
grunt> fs File_System_command parameters

Example
By using the fs command, we can invoke the ls command of HDFS from the Grunt shell. Here, it
lists the files in the HDFS root directory.
grunt> fs -ls

Found 3 items

drwxrwxrwx   - hadoop supergroup   0 2015-09-08 14:13 hbase

drwxr-xr-x   - hadoop supergroup   0 2015-09-09 14:52 seqgen_data

drwxr-xr-x   - hadoop supergroup   0 2015-09-08 11:30 twitter_data

Similarly, using the fs command we can invoke all the other file system shell commands from
the Grunt shell.
4. Utility Commands
It offers a set of Pig Grunt Shell utility commands. It involves clear, help, history, quiet, and set.
Also, there are some commands to control Pig from the Grunt shell, such as exec, kill, and run.
Here is the description of the utility commands offered by the Grunt shell.

i. clear Command
In order to clear the screen of the Grunt shell, we use the clear command.
Syntax
The syntax of the clear command is:
grunt> clear

ii. help Command


The help command gives you a list of Pig commands and Pig properties.
Syntax
By using the help command, we can get a list of Pig commands:
grunt> help

Commands
<Pig Latin statement>;

File system commands:

Fs <fs arguments>

Equivalent to Hadoop dfs command:


http://hadoop.apache.org/common/docs/current/hdfs_shell.html
Diagnostic Commands
Describe <alias>[::<alias]] - Show the schema for the alias.

Explain [-script <pig script>] [-out <path>] [-brief] [-dot|-xml]

[-param <param_name>=<param_value>]

[-param_file <filename>] [<alias>] -

Show the execution plan to compute the alias or for the entire script.

 -script: Explain the entire script.


 -out: Store the output into a directory rather than print to stdout.
 -brief: Don't expand nested plans (presenting a smaller graph for the overview).
 -dot: Generate the output in .dot format. Default is text format.
 -xml: Generate the output in .xml format. Default is text format.
 -param <param_name>: See parameter substitution for details.
 -param_file <filename>: See parameter substitution for details.
 Alias: Alias to explain.
 Dump <alias>: Compute the alias and write the results to stdout.
Utility Commands
Exec [-param <param_name>=<param_value>] [-param_file <filename>] <script> -

Execute the script with access to the grunt environment, including aliases.

 -param <param_name>: See parameter substitution for details.


 -param_file <filename>: See parameter substitution for details.
 Script: Script to be executed.
Run [-param <param_name>=<param_value>] [-param_file <filename>] <script> -

Execute the script with access to the grunt environment.

 -param <param_name>: See parameter substitution for details.


 -param_file <filename>: See parameter substitution for details.
 Script: Script to be executed.
 Sh <shell command>: Invoke a shell command.
 Kill <job_id>: Kill the Hadoop job specified by the Hadoop job id.
 Set <key> <value>: Provide execution parameters to Pig. Keys and values are case
sensitive.
The following keys are supported:

 default_parallel: Script-level reduce parallelism. Basic input size heuristics are used by
default.
 debug: Set debug on or off. The default is off.
 job.name: A single-quoted name for jobs. Default is PigLatin:<script name>
 job.priority: Priority for jobs. Values: very_low, low, normal, high, very_high.
Default is normal.
 stream.skippath: String that contains the path. This is used by streaming. Any Hadoop
property may also be set.
 Help – Display this message.
 History [-n] – Display the list of statements in the cache.
 -n – Hide line numbers.
 Quit – Quit the grunt shell.

87
iii. history Command
It is a very useful command; it displays a list of the statements executed/used so far since the
Grunt shell was invoked.
Syntax
Since opening the Grunt shell, let’s suppose we have executed three statements:
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');
grunt> Employee = LOAD 'hdfs://localhost:9000/pig_data/Employee.txt' USING PigStorage(',');

Then, using the history command will produce the following output.

grunt> history

customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');
orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');
Employee = LOAD 'hdfs://localhost:9000/pig_data/Employee.txt' USING PigStorage(',');
iv. set Command
Basically, to show/assign values to keys, we use the set command in Pig.
Syntax
There are several keys we can set values for using this command, such as:
 default_parallel
By passing any whole number as a value to this key, we can set the number of reducers for a
map-reduce job.
 debug
Also, by passing on/off to this key, we can turn the debugging feature in Pig on or off.
 job.name
Moreover, by passing a string value to this key, we can set the job name of the required job.
 job.priority
By passing one of the following values to this key, we can set the job priority of a job:
1. very_low
2. low
3. normal
4. high
5. very_high
 stream.skippath
By passing the desired path in the form of a string to this key, we can set the path from where the
data is not to be transferred, for streaming.
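For example, these keys can be set from the Grunt shell as shown below; the values chosen here are only illustrative:
grunt> set default_parallel 10
grunt> set debug on
grunt> set job.name 'Employee processing job'
grunt> set job.priority high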
v. quit Command
We can quit from the Grunt shell using this command.
Syntax
The syntax to quit from the Grunt shell is:
grunt> quit
Now see the following commands. By using them we can control Apache Pig from the Grunt
shell.
vi. exec Command
Using the exec command, we can execute Pig scripts from the Grunt shell.
Syntax
The syntax of the utility command exec is:
grunt> exec [-param param_name = param_value] [-param_file filename] [script]
Example
Let’s suppose there is a file named Employee.txt in the /pig_data/ directory of HDFS. Its content
is:
Employee.txt

001, Mehul, Hyderabad
002, Ankur, Kolkata
003, Shubham, Delhi

Now, suppose we have a script file named sample_script.pig in the /pig_data/ directory of HDFS.
Its content is:
sample_script.pig

Employee = LOAD 'hdfs://localhost:9000/pig_data/Employee.txt' USING PigStorage(',')
   as (id:int, name:chararray, city:chararray);
Dump Employee;

Now, let us execute the above script from the Grunt shell using the exec command as shown
below.
grunt> exec /sample_script.pig
vii. kill Command
By using this command, we can kill a job from the Grunt shell.
Syntax
Given below is the syntax of the kill command.
grunt> kill JobId

Example
Assume there is a running Pig job having the id Id_0055. By using the kill command, we can kill it
from the Grunt shell.
grunt> kill Id_0055
viii. run Command
By using the run command, we can run a Pig script from the Grunt shell.
Syntax
The syntax of the run command is:
grunt> run [-param param_name = param_value] [-param_file file_name] script
Example
So, let’s suppose there is a file named Employee.txt in the /pig_data/ directory of HDFS. Its
content is:
Employee.txt

001, Mehul, Hyderabad
002, Ankur, Kolkata
003, Shubham, Delhi

Afterwards, suppose we have a script file named sample_script.pig in the local filesystem. Its
content is:
sample_script.pig

Employee = LOAD 'hdfs://localhost:9000/pig_data/Employee.txt' USING
   PigStorage(',') as (id:int, name:chararray, city:chararray);
Further, using the run command, let’s run the above script from the Grunt shell.
grunt> run /sample_script.pig

Then, using the Dump operator, we can see the output of the script.
grunt> Dump Employee;

(1, Mehul, Hyderabad)
(2, Ankur, Kolkata)
(3, Shubham, Delhi)

Also, it is very important to note that there is one difference between exec and the run command:
if we use run, the statements from the script are available in the command history, whereas with
exec they are not.
Apache Pig Built in Functions
In this article “Apache Pig Built in Functions”, we will discuss all the Apache Pig Built-in
Functions in detail. It includes eval, load/store, math, bag and tuples functions and many more.
Also, we will see their syntax along with their functions and descriptions to understand them
well.
So, let’s start Pig Built in Functions.
2. Pig Functions
There is a huge set of Apache Pig built-in functions available, such as the eval, load/store, math,
string, date and time, and bag and tuple functions. Basically, there are two main properties which
differentiate built-in functions from user-defined functions (UDFs):
 We do not need to register built-in functions, since Pig knows where they are.
 Also, we do not need to qualify built-in functions while using them, because again Pig
knows where to find them.
i. AVG()
 AVG Syntax
AVG(expression)
We use AVG() to compute the average of the numerical values within a bag.
 AVG Example
In this example, the average GPA for each Employee is computed.
A = LOAD 'Employee.txt' AS (name:chararray, term:chararray, gpa:float);
DUMP A;
(johny,fl,3.9F)
(johny,wt,3.7F)
(johny,sp,4.0F)
(johny,sm,3.8F)
(Mariya,fl,3.8F)
(Mariya,wt,3.9F)
(Mariya,sp,4.0F)
(Mariya,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(johny,{(johny,fl,3.9F),(johny,wt,3.7F),(johny,sp,4.0F),(johny,sm,3.8F)})
(Mariya,{(Mariya,fl,3.8F),(Mariya,wt,3.9F),(Mariya,sp,4.0F),(Mariya,sm,4.0F)})
C = FOREACH B GENERATE A.name, AVG(A.gpa);
DUMP C;
({(johny),(johny),(johny),(johny)},3.850000023841858)
({(Mariya),(Mariya),(Mariya),(Mariya)},3.925000011920929)
ii. BagToString ()
This function is used to concatenate the elements of a bag into a string. We can place a delimiter
between these values (optional) while concatenating.
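As a small sketch of its use, assuming the grouped relation B from the AVG example above and an underscore as the (optional) delimiter:
X = FOREACH B GENERATE group, BagToString(A.name, '_');
DUMP X;
Each resulting tuple would contain the group key and a single concatenated string such as johny_johny_johny_johny.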
iii. CONCAT()
 The syntax of CONCAT()
CONCAT(expression, expression)
We use this Pig function to concatenate two or more expressions of the same type.
 Example of CONCAT()
In this example, the field f1, an underscore string literal, and the fields f2 and f3 are concatenated.
X = LOAD 'data' as (f1:chararray, f2:chararray, f3:chararray);
DUMP X;
A = FOREACH X GENERATE CONCAT(f1, '_', f2, f3);
DUMP A;
iv. COUNT()
 The syntax of COUNT()
COUNT(expression)
We use COUNT() to get the number of elements in a bag, i.e. to count the number of tuples in the bag.
 Example of COUNT()
In this example, we count the tuples in the bag:
X = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP X;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
Y = GROUP X BY f1;
DUMP Y;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
A = FOREACH Y GENERATE COUNT(X);
DUMP A;
(1L)
(2L)
(1L)
(2L)
v. COUNT_STAR()
 The syntax of COUNT_STAR()
COUNT_STAR(expression)
It is similar to the COUNT() function; we use it to get the number of elements in a bag. The
difference is that COUNT_STAR() includes NULL values in the count, while COUNT() ignores them.
 Example of COUNT_STAR()
To count the tuples in a bag.
A = FOREACH Y GENERATE COUNT_STAR(X);
Date and Time Functions

Here is the list of Date and Time Pig built-in functions.
i. ToDate(milliseconds)
According to the given parameters, it returns a date-time object. There are alternative versions of
this function, such as ToDate(isostring), ToDate(userstring, format), and ToDate(userstring,
format, timezone).
ii. CurrentTime()
It returns the date-time object of the current time.
iii. GetDay(datetime)
To get the day of a month from the date-time object, we use it.
iv. GetHour(datetime)
GetHour returns the hour of a day from the date-time object.
v. GetMilliSecond(datetime)
It returns the millisecond of a second from the date-time object.
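As a quick sketch of how these functions can be combined in a script (the file name, field names, and timestamp format below are assumptions for illustration only):
logins = LOAD 'logins.txt' USING PigStorage(',') AS (id:int, ts:chararray);
with_dt = FOREACH logins GENERATE id, ToDate(ts, 'yyyy-MM-dd HH:mm:ss') AS dt;
parts = FOREACH with_dt GENERATE id, GetDay(dt) AS day, GetHour(dt) AS hour, CurrentTime() AS now;
DUMP parts;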
Apache Pig User Defined Functions and Its Types

Apache Pig UDF (Pig User Defined Functions)

There is extensive support for User Defined Functions (UDFs) in Apache Pig. In this article
“Apache Pig UDF”, we will learn the whole concept of Apache Pig UDFs. Moreover, we will
also learn its introduction. In addition, we will discuss the types of Pig UDFs, the way to write
them, as well as the way to use these UDFs, in detail.
What is Apache Pig UDF?
Apache Pig offers extensive support for Pig UDFs, in addition to the built-in functions. Basically,
we can define our own functions and use them, using these UDFs. Moreover, UDF support is
available in six programming languages: Java, Jython, Python, JavaScript, Ruby, and Groovy.
However, complete support is only provided in Java, while in all the remaining languages limited
support is provided. In addition, using Java, we can write UDFs involving all parts of the
processing like data load/store, column transformation, and aggregation. Also, since Apache Pig
has been written in Java, UDFs written in Java work more efficiently as compared to those
written in other languages.
Also, we have a Java repository for UDFs named Piggybank, in Apache Pig. Basically, we can
access Java UDFs written by other users, and contribute our own UDFs, using Piggybank.
Types of Pig UDF in Java
We can create and use several types of functions while writing Pig UDF using Java, such as:
a. Filter Functions
In filter statements, we use the filter functions as conditions. Basically, it accepts a Pig value as
input and returns a Boolean value.
b. Eval Functions
In FOREACH GENERATE statements, we use the Eval functions. Basically, it accepts a Pig
value as input and returns a Pig result.
c. Algebraic Functions
In a FOREACH GENERATE statement, we use the Algebraic functions to act on inner bags.
Basically, to perform full MapReduce operations on an inner bag, we use these functions.

Writing Pig UDF using Java
We have to integrate the jar file pig-0.15.0.jar in order to write a UDF using Java. However, at
first, make sure we have installed Eclipse and Maven in our system, because here we are
discussing how to write a sample UDF using Eclipse.
So, in order to write a UDF function, follow these steps:
 Firstly, create a new project after opening Eclipse (say project1).
 Then convert the newly created project into a Maven project.
 Further, add the Maven dependencies for the Apache Pig and Hadoop-core jar files to the
pom.xml of the project.
However, it is necessary to inherit the EvalFunc class and provide an implementation of the exec()
function while writing UDFs. The code required for the UDF is written within this function. In
the example considered here, the exec() function converts the contents of the given column to
uppercase and returns the result.
 After compiling the class without errors, right-click on the Sample_Eval.java file; this
displays a menu, from which select Export.
 Now, by clicking Export, we will get a window. There, click on the JAR file.
 Also, proceed further by clicking the Next> button. In this way, we will get another window,
through which we need to enter the path in the local file system where we need to store
the jar file.
 Now click the Finish button. We can see that a Jar file, sample_udf.jar, is created in the
specified folder. That jar file contains the UDF written in Java.
Using Pig UDF
Now, follow these steps after writing the UDF and generating the Jar file:
Step 1: Registering the Jar file
Basically, using the Register operator, we have to register the Jar file that contains the UDF, just
after writing the UDF (in Java). By registering the Jar file, users inform Apache Pig of the
location of the UDF.
 Syntax
So, the syntax of the Register operator is:
REGISTER path;

 Example
For example, let’s register the sample_udf.jar created above.
Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.

$ cd $PIG_HOME/bin
$ ./pig -x local

REGISTER '/$PIG_HOME/sample_udf.jar'

Here we assume that the Jar file is available in the path /$PIG_HOME/sample_udf.jar.
Step 2: Defining an Alias
Using the Define operator, we can define an alias for the UDF after registering it.
 Syntax
So, the syntax of the Define operator (in the form relevant to UDFs) is:
DEFINE alias function;
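As a minimal sketch that puts REGISTER, DEFINE and the UDF together (the package and class name sample_eval.Sample_Eval, as well as the data file, are assumptions for illustration; they are not fixed by the steps above):
REGISTER '/$PIG_HOME/sample_udf.jar'
DEFINE sample_eval sample_eval.Sample_Eval();
Employee_data = LOAD 'hdfs://localhost:9000/pig_data/Employee.txt'
   USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
Upper_case = FOREACH Employee_data GENERATE sample_eval(name);
Dump Upper_case;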
Apache Pig Reading Data and Storing Data Operators

Objective

For the purpose of reading and storing data, there are two operators available in Apache Pig: the
Load operator and the Store operator. So, in this article “Apache Pig Reading Data and
Storing Data Operators”, we will cover the whole concept of Pig reading data and storing data
with the Load and Store operators. Also, we will cover their syntax as well as their examples to
understand them well.
What is Apache Pig Reading Data and Storing Data?

As we all know, Apache Pig generally works on top of Hadoop. To define, Pig is an analytical
tool that analyzes large datasets that exist in the Hadoop File System. However, we have to
initially load the data into Apache Pig in order to analyze it. For that, we use the LOAD
operator. Afterwards, using the STORE operator, we can store the loaded data in the file system.
i. LOAD
Basically, the LOAD operator loads data from the file system.
a. Syntax
LOAD 'data' [USING function] [AS schema];

b. Terms
1. 'data'
It signifies the name of the file or directory, in single quotes.
Once we specify a directory name, all the files in the directory are loaded.
In addition, to specify files at the file system or directory levels, we can use Hadoop-supported
globbing.
2. USING
Keyword.
 The default load function PigStorage is used if the USING clause is omitted.
3. Function
The load function.
 Also, we can use a built-in function. Basically, PigStorage is the default load function and
does not need to be specified (simply omit the USING clause).
 However, we can write our own load function if our data is in a format that cannot be
processed by the built-in functions.
4. AS
Keyword.
5. Schema
Generally, a schema is given using the AS keyword and enclosed in parentheses.
 Basically, the schema specifies the type of the data which the loader produces. Depending
on the loader, if the data does not conform to the schema, either a null value or an error is
generated.
Also, it is very important to note that the loader may not immediately convert the data to the
specified format, for performance reasons. Although, we can still operate on the data assuming
the specified type.
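A minimal sketch of LOAD, together with its counterpart STORE (the HDFS paths and field names here are assumed purely for illustration):
Employee = LOAD 'hdfs://localhost:9000/pig_data/Employee.txt'
   USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
STORE Employee INTO 'hdfs://localhost:9000/pig_output/Employee_out' USING PigStorage(',');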
Learn Apache Pig Execution: Modes and Mechanism
Apache Pig Execution Modes
Moreover, there are two modes of Apache Pig execution in which we can run Apache Pig: Local
Mode and MapReduce (HDFS) Mode. Let’s discuss both in detail:
a. Local Mode
Basically, in this mode, all the files are installed and run on your localhost and local file system.
That implies we do not need Hadoop or HDFS at all. We generally use this mode for testing
purposes. In other words, in this mode Pig runs in a single JVM and accesses the local file
system. Local mode is better suited for dealing with small data sets. Parallel mapper execution is
not possible in this mode, since the earlier versions of Hadoop are not thread-safe.
The user can pass -x local to get into the Pig local mode of execution. Hence, Pig always looks
for the local file system path while loading data.
b. MapReduce Mode
Basically, MapReduce mode is the mode in which we load or process data that exists in the
Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute Pig Latin
statements to process the data, a MapReduce job is invoked in the back-end to perform a
particular operation on the data that exists in HDFS. To be more specific, in this mode a user
should have a proper Hadoop cluster set up and installed. By default, Apache Pig runs in
MapReduce mode. In addition, Pig translates the queries into MapReduce jobs and runs them on
top of the Hadoop cluster. Hence, in this mode, MapReduce runs on a distributed cluster.
Apache Pig Execution Mechanisms
There are three ways in which Apache Pig scripts can be executed: interactive mode, batch mode,
and embedded mode.
a. Interactive Mode (Grunt shell)
By using the Grunt shell, we can run Apache Pig in interactive mode. In this shell, we can enter
the Pig Latin statements and get the output using the Dump operator.
b. Batch Mode (Script)
Also, by writing the Pig Latin script in a single file with the .pig extension, we can run Apache
Pig in batch mode, as shown below.
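For example, a script can be executed in batch mode from the command line in either execution mode (assuming a script file named sample_script.pig; the file name is illustrative):
$ pig -x local sample_script.pig
$ pig -x mapreduce sample_script.pig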
c. Embedded Mode (UDF)
Apache Pig offers the provision of defining our own functions (User Defined Functions) in
programming languages such as Java, and using them in our scripts.

Invoking the Grunt Shell


By using the −x option, we can invoke the Grunt shell in the desired mode (local/MapReduce).
Command
1. Local mode
$. /pig –x local

2. MapReduce mode
$. /pig -x MapReduce

These commands give us the Grunt shell prompt.


Grunt>